Parsing huge, badly encoded XML files in Python

Lumo · Jul 18, 2012

I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, they need to be parsed as a stream, because loading them into memory is far too inefficient and often leads to out-of-memory errors.

I have used the libraries miniDOM, ElementTree and cElementTree, and I am currently using lxml.

Right now I have a working, fairly memory-efficient script that uses \[code\]lxml.etree.iterparse\[/code\]. The problem is that some of the XML files I need to parse contain encoding errors (they advertise themselves as UTF-8 but contain differently encoded characters). With \[code\]lxml.etree.parse\[/code\] this can be fixed by using the \[code\]recover=True\[/code\] option of a custom parser, but \[code\]iterparse\[/code\] does not accept a custom parser. (see also: this question)

My current code looks like this:

\[code\]
from lxml import etree

events = ("start", "end")
# xmlfile is the path to (or a file object for) the XML document
context = etree.iterparse(xmlfile, events=events)
event, root_element = next(context)  # <items>
for action, element in context:
    if action == 'end' and element.tag == 'item':
        # <parse>
        root_element.clear()  # free the items that were already handled
\[/code\]

Error when \[code\]iterparse\[/code\] encounters a bad character (in this case, it's a \[code\]^Y\[/code\]):

\[code\]
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25
\[/code\]

I don't even need to decode this data; I can just drop it. However, I don't know of any way to skip the element. I tried \[code\]context.next\[/code\] and \[code\]continue\[/code\] in try/except statements.

Any help would be appreciated!

Update

Some additional info:

This is the line where iterparse fails:

\[code\]
<description><![CDATA[musea de la photographie fonds mercator. Met meer dan 80.000 foto^Ys en 3 miljoen negatieven is het Muse de la...]]></description>
\[/code\]

According to etree, the error occurs at bytes \[code\]0x19 0x73 0x20 0x65\[/code\]. According to hexedit, \[code\]19 73 20 65\[/code\] translates to ASCII \[code\].s e\[/code\]. The \[code\].\[/code\] in this place should be an apostrophe (foto's).

I also found this question, which does not provide a solution.
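
Since the bad data can simply be dropped, one possible workaround is to filter the byte stream before the parser ever sees it. Byte 0x19 is a control character that XML 1.0 does not allow, so removing such bytes up front sidesteps the error. The sketch below is not an lxml feature and not a confirmed fix for these files: \[code\]ControlCharFilter\[/code\] and \[code\]iter_items\[/code\] are hypothetical names, it uses the standard library's \[code\]xml.etree.ElementTree.iterparse\[/code\] so that it runs without lxml, and it assumes the input is UTF-8 or another ASCII-compatible encoding (it would corrupt UTF-16).

```python
import io
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class ControlCharFilter:
    """Hypothetical file-like wrapper that drops illegal control bytes on read."""

    def __init__(self, raw):
        self._raw = raw  # any binary file object with a read() method

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return chunk  # real end of file
            cleaned = _ILLEGAL_XML_BYTES.sub(b"", chunk)
            if cleaned:
                return cleaned
            # The chunk held only illegal bytes: read again so the
            # parser does not mistake an empty result for end of file.


def iter_items(source, tag="item"):
    """Stream elements named `tag` from a binary file object, freeing memory as it goes."""
    context = ET.iterparse(ControlCharFilter(source), events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, element in context:
        if event == "end" and element.tag == tag:
            yield element
            root.clear()  # drop items that have already been handled


if __name__ == "__main__":
    sample = (
        b"<items><item><description><![CDATA[80.000 foto\x19s]]>"
        b"</description></item></items>"
    )
    for item in iter_items(io.BytesIO(sample)):
        print(item.findtext("description"))
```

For a real file, the source would be opened in binary mode, for example \[code\]iter_items(open(xmlfile, "rb"))\[/code\]. The same wrapper can probably be handed to \[code\]lxml.etree.iterparse\[/code\], which also accepts file-like objects. Newer lxml releases additionally document a \[code\]recover\[/code\] keyword on \[code\]iterparse\[/code\] itself, which is worth checking against the installed version before adding a filter layer.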
