ElementTree iterparse strategy

wademadrid

New Member
I have to handle XML documents that are fairly large (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing).

My concern is the following. Imagine you have an XML document like this:

[code]
<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Hommer</name>
      <name>Marge</name>
      <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
      <name>Peter</name>
      <name>Brian</name>
      <name>Meg</name>
    </members>
  </family>
</families>
[/code]

The problem is, of course, to know when I am getting a family name (such as Simpson) and when I am getting the name of one of that family's members (for example Hommer).

What I have been doing so far is to use "switches" that tell me whether I am inside a "members" tag or not. The code looks like this:

[code]
import xml.etree.cElementTree as ET

__author__ = 'moriano'

file_path = "test.xml"
context = ET.iterparse(file_path, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# switch: True while we are between <members> and </members>
on_members_tag = False

for event, elem in context:
    tag = elem.tag
    value = elem.text
    if value:
        value = value.encode('utf-8').strip()

    if event == 'start':
        if tag == "members":
            on_members_tag = True
        elif tag == 'name':
            if on_members_tag:
                print "The member of the family is %s" % value
            else:
                print "The family is %s " % value

    if event == 'end' and tag == 'members':
        on_members_tag = False

    elem.clear()
[/code]

And this works fine, as the output is:

[code]
The family is Simpson
The member of the family is Hommer
The member of the family is Marge
The member of the family is Bart
The family is Griffin
The member of the family is Peter
The member of the family is Brian
The member of the family is Meg
[/code]

My concern is that even with this (simple) example I had to create an extra variable (on_members_tag) to know which tag I was in. The real XML documents I have to handle have many more levels of nested tags, so I would need one such variable per level.
Also note that this is a very reduced example, so you can assume that I may be facing XML with more tags, more inner tags, and the need to extract different tag names, attributes and so on.

So the question is: am I doing something horribly stupid here? I feel like there must be a more elegant solution to this.
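For reference, one possible way to generalize the switch idea is to replace the per-tag booleans with a single stack of the currently open tags, and to read the text on the "end" event, where the element is guaranteed to be complete. The following is only a minimal sketch under some assumptions: it is written for Python 3 with xml.etree.ElementTree, the helper name iter_names and the SAMPLE constant are made up for illustration, and it only calls clear() on each finished family element, which may not be enough memory management for a real 1 GB file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory copy of the example document from the post.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Hommer</name>
      <name>Marge</name>
      <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
      <name>Peter</name>
      <name>Brian</name>
      <name>Meg</name>
    </members>
  </family>
</families>
"""


def iter_names(source):
    """Yield ("family", text) or ("member", text) for each <name> element.

    `source` is a filename or a binary file object, as accepted by iterparse.
    """
    path = []  # stack of the tags that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue

        # "end" event: elem.text is fully populated at this point.
        if path[-3:] == ["family", "members", "name"]:
            yield "member", (elem.text or "").strip()
        elif path[-2:] == ["family", "name"]:
            yield "family", (elem.text or "").strip()

        path.pop()
        if elem.tag == "family":
            # Drop the children of a finished family to limit memory use.
            elem.clear()


if __name__ == "__main__":
    for kind, text in iter_names(io.BytesIO(SAMPLE)):
        if kind == "family":
            print("The family is %s" % text)
        else:
            print("The member of the family is %s" % text)
```

With this layout, deeper nesting would only mean comparing against a longer tail of the path list rather than adding another flag variable.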
 