Parsing a DTD to reveal hierarchy of elements

08imamm · Jul 19, 2012

My goal is to parse several relatively complex DTDs to reveal the hierarchy of elements. The DTDs differ only by version, but no version made any attempt to remain backwards compatible with its predecessors (that would be too easy!). I therefore intend to visualize the element structure defined by each DTD so that I can design a database model suitable for storing the data uniformly.

Because most solutions I've investigated in Python will only validate against external DTDs, I've decided to start my efforts from the beginning. Python's \[code\]xml.parsers.expat\[/code\] only parses XML files and implements very basic DTD callbacks, so I've decided to check out the original version, which was written in C and claims to fully conform to the XML 1.0 specification. However, I have the following questions about this approach:

[*]Will expat (in C) parse external entity references in a DTD file, follow those references, parse their elements, and add those elements to the hierarchy?
[*]Can expat generalize and handle SGML, or will it fail on a file that is valid SGML but not a valid DTD?

My requirements may lead to the conclusion that expat is inappropriate. If that's the case, I'm considering writing a lexer/parser for XML 1.0 DTDs. Are there any other options I should consider?

The following illustrates my intent more succinctly:

Input DTD Excerpt

\[code\]<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>\[/code\]

Object Created from DTD Excerpt (pseudocode)

\[code\]
class abstract:
    member doc_page_array[]
    member abst_problem
    member abst_solution
    member paragraph_array[]
    member description = "A concise summary of the disclosure."
\[/code\]

One challenging aspect is attributing to each \[code\]<!ELEMENT>\[/code\] declaration the comment that appears above it. Hence, a homegrown parser might be necessary if I cannot use expat to accomplish this.

Another problem is that some parsers have trouble processing DTDs that use Unicode characters above #xFFFF, so that might be another factor that favors creating my own.

If it turns out that the lexer/parser route is better suited to my task, does anyone happen to know of a good way to convert the specification's EBNF expressions into something that can be parsed? I suppose the "best" approach might be to use regular expressions.

Anyway, these are just the thoughts I've had regarding my issue. Any answers to the above questions or suggestions on alternative approaches would be appreciated.
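
For reference, here is a minimal sketch of how far the DTD callbacks in Python's \[code\]xml.parsers.expat\[/code\] might go for the excerpt above. It assumes the declarations can be pasted into the internal subset of a throwaway document. The function names \[code\]child_names\[/code\] and \[code\]element_hierarchy\[/code\] and the placeholder root name are hypothetical, and attaching the most recent comment to the next element declaration is only a heuristic, not something expat does by itself.

```python
import xml.parsers.expat
from xml.parsers.expat import model


def child_names(content_model):
    """Flatten an expat content-model tuple into the element names it references."""
    ctype, _quant, name, children = content_model
    if ctype == model.XML_CTYPE_NAME:
        return [name]
    names = []
    for child in children:  # empty for EMPTY and ANY models
        names.extend(child_names(child))
    return names


def element_hierarchy(dtd_text, root="root"):
    """Map each declared element to its child names and preceding comment.

    `root` is a hypothetical placeholder name, used only to wrap the
    declarations in a document that expat will accept.
    """
    hierarchy = {}
    pending = [None]  # most recent comment not yet attached to an element

    def on_comment(data):
        pending[0] = data.strip()

    def on_element_decl(name, content_model):
        hierarchy[name] = {
            "children": child_names(content_model),
            "description": pending[0],
        }
        pending[0] = None  # a comment describes at most one element

    parser = xml.parsers.expat.ParserCreate()
    parser.CommentHandler = on_comment
    parser.ElementDeclHandler = on_element_decl
    # Embed the declarations as the internal subset of a throwaway document.
    document = "<!DOCTYPE {0} [\n{1}\n]>\n<{0}/>".format(root, dtd_text)
    parser.Parse(document, True)
    return hierarchy


if __name__ == "__main__":
    excerpt = (
        "<!-- A concise summary of the disclosure. -->\n"
        "<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>"
    )
    print(element_hierarchy(excerpt))
```

This sketch discards the quantifiers and the choice/sequence grouping, both of which are available in the tuple passed to \[code\]ElementDeclHandler\[/code\]. It also does not follow external entities, and a DTD that uses parameter-entity references inside its declarations would be rejected when embedded as an internal subset, so it is a starting point rather than a full answer.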
