Displaying XML using CSS: How to handle &nbsp;?

ggeoff

New Member
I'm dealing with a lot of .xml files (millions: an XML-formatted dump of Wikipedia), and they're a lot more unreadable than I imagined. For the time being, I've written a .css file to display them in a readable manner in a browser, and wrote a script to plug a reference to this .css into all the files. (I know there are other solutions, like XSLT, but all the information I found made it seem document-level, which didn't suit. I'm really trying not to expand the size of these files if possible.)

The .css works fine for some of the files, but many contain entities like &nbsp; and I get errors like "XML Parsing Error: undefined entity", with a nice little illustration pointing to &nbsp; or its kin within a quote.

There is an articles.dtd file, which seems like it should connect the dots (keyword -> Unicode) for the browser. It is referenced in each file like:
[code]<!DOCTYPE article SYSTEM "../article.dtd">[/code]
and contains a lot of entries like:
[code]<!ENTITY nbsp " "> <!-- no-break space = non-breaking space, U+00A0 ISOnum -->[/code]
but either I'm entirely misunderstanding what this file is for, or it's not working correctly. In any case, how can I make these documents display? Either by:
  • displaying the entities as plain text (like "&nbsp;")
  • removing the entities altogether (by any means other than just a linear search/removal of them in the actual files)
  • interpreting the entities as Unicode, as they were intended
Naturally, the last option is preferable; ideally, by referencing some sort of external file that maps entities to Unicode (if that's not what the articles.dtd file is for...).

EDIT: I'm not working with a powerful machine here; extracting the .rars took days. Any sort of edit to each file would take a very long time.
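
To make the failure concrete, here is a minimal sketch that reproduces the same kind of error outside the browser, using Python's standard-library parser. The sample documents, the `<p>` element and the `try_parse` helper are made up for illustration and are not taken from the dump. The first string uses &nbsp; with no declaration in sight; the second declares it in an internal DTD subset, which is an assumption about one way the entity could be made known to the parser, not something the dump files currently do.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the entity is used but never declared.
UNDECLARED = "<article><p>a&nbsp;b</p></article>"

# Hypothetical sample: the same content, with nbsp declared in an
# internal DTD subset so the parser can map it to U+00A0.
DECLARED = (
    '<!DOCTYPE article [<!ENTITY nbsp "&#160;">]>'
    "<article><p>a&nbsp;b</p></article>"
)


def try_parse(xml_text):
    """Return the text of the <p> element, or None if parsing fails."""
    try:
        return ET.fromstring(xml_text).findtext("p")
    except ET.ParseError as err:
        # An undeclared entity is reported as a parse error.
        print("parse failed:", err)
        return None


print(repr(try_parse(UNDECLARED)))
print(repr(try_parse(DECLARED)))
```

With this parser at least, the first document is rejected and the second comes back with a real no-break space in the text, which is the behaviour I'd like to get in the browser without touching every file.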
 