Most efficient method to extract data from this XML file

moonpeach · Jul 19, 2012

XML File Sample

\[code\]
<GateDocument>
  <GateDocumentFeatures> ... </GateDocumentFeatures>
  <TextWithNodes>
    <Node id="0"/> MESSAGE SET <Node id="19"/>
    <Node id="20"/> 1. 1/1/09 - sample text 1 <Node id="212"/> sample text 2 <Node id="223"/> sample text 3 ... <Node id="160652"/>
  </TextWithNodes>
  <AnnotationSet></AnnotationSet>
  <AnnotationSet Name="SomeName"> ... </AnnotationSet>
</GateDocument>
\[/code\]

Just to start off, this is the first time I'm coding in Python and dealing with XML, so sorry if I miss really obvious things!

My goal is to extract the sample text at specific node ids.

First attempt - I used minidom, which did not give me the right methods for the extraction (http://stackoverflow.com/questions/11122736/extracting-text-from-xml-node-with-minidom), because of this weird format where the node ids sit in self-closing tags.

Second attempt - I took up the suggestion to look at lxml, and I have successfully extracted the text into something like this:

\[code\]
['\n\t\t', '\n\t\tMESSAGE SET\n\t\t', '\n\t\t', '\n\t\t1. 1/1/09 - sample text 1', ..., '\n\t']
\[/code\]

With some cleanup I think I can get the text fine, but I lose the associated node id value. This is the code I used:

\[code\]
from lxml import etree

xmlfile = r'C:\...AnnotationsXML.xml'
xmldoc = etree.parse(xmlfile)
# Select every text node that is a direct child of TextWithNodes
print(xmldoc.xpath("//TextWithNodes/text()"))
\[/code\]

So I guess my questions are:

[*]Is there a way to extract the above without the \n\t\t? I read that it is the whitespace formatting (i.e. tabs), but I am not sure where the \[code\]<Node id="0"/>\[/code\] went.
[*]Is there perhaps a better or more efficient method of extraction for this file?

Thanks!
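One possible direction, offered as a rough sketch rather than a tested answer: the XPath `text()` step returns only text nodes, which would explain why the `Node` elements and their ids are missing from the list. In both lxml and the standard library's `xml.etree.ElementTree`, the text that follows a self-closing element is stored in that element's `tail` attribute, so iterating over the `Node` elements keeps the id and the text together. The example below uses `xml.etree.ElementTree` so it runs without extra installs. The `SAMPLE` string is a made-up miniature of the file above, and `text_by_node_id` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the document in the question (not the real file).
SAMPLE = """<GateDocument>
  <TextWithNodes><Node id="0"/>MESSAGE SET<Node id="19"/>
<Node id="20"/>1. 1/1/09 - sample text 1<Node id="212"/>sample text 2<Node id="223"/></TextWithNodes>
</GateDocument>"""


def text_by_node_id(root):
    """Map each Node id to the whitespace-stripped text that follows that Node."""
    result = {}
    for node in root.iter("Node"):
        # Text after a self-closing element lives in its .tail, not its .text.
        # .tail is None when nothing follows, so fall back to an empty string.
        result[node.get("id")] = (node.tail or "").strip()
    return result


root = ET.fromstring(SAMPLE)
texts = text_by_node_id(root)
print(texts["20"])  # prints: 1. 1/1/09 - sample text 1
```

Calling `strip()` on each tail removes the `\n\t\t` padding, and ids with no text after them (such as `19` here) map to an empty string. For the real file, `ET.parse(path).getroot()` would presumably replace `ET.fromstring(SAMPLE)`. Whether this is the most efficient option for a file with over 160,000 nodes is untested.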
