Python: Parsing through XML with minidom

epherfeip · Mar 29, 2013

I'm parsing a decent-sized XML file with minidom, and I ran into a problem. For some reason I cannot extract the data, even though I have done exactly the same thing on different XML files before.

Here's a snippet of my code (I've tested the rest of the program and it works fine):

EDIT: changed to include a try/except block for testing.

\[code\]
def parseXML():
    file = open(str(options.drugxml), 'r')
    data = file.read()
    file.close()
    dom = parseString(data)
    druglist = dom.getElementsByTagName('drug')

    count = 0  # running entry counter, used only for the debug output below
    with codecs.open(str(options.csvdata), 'w', 'utf-8') as csvout, open('DrugTargetRel.csv', 'w') as dtout:
        for entry in druglist:
            count = count + 1
            try:
                drugtype = entry.attributes['type'].value
                print(count)
            except:
                # debugging aid: show which entry has no 'type' attribute
                print(count)
                print(entry)
            drugidObj = entry.getElementsByTagName('drugbank-id')[0]
            drugid = drugidObj.childNodes[0].nodeValue
            drugnameObj = entry.getElementsByTagName('name')[0]
            drugname = drugnameObj.childNodes[0].nodeValue
            targetlist = entry.getElementsByTagName('target')
            for target in targetlist:
                targetid = target.attributes['partner'].value
                dtout.write((','.join((drugid, targetid))) + '\n')
            csvout.write((','.join((drugid, drugname, drugtype))) + '\n')
\[/code\]

In case you're wondering what the XML file's schema roughly looks like, here's a rough, god-awful sketch of the levels:

\[code\]
<drugs>
  <drug type='something' ...>
    <drugbank-id>
    <name>
    ...
    <targets>
      <target partner='something'>
\[/code\]

The items I typed in here are the ones I need to extract from the XML file and write to CSV files (as the code above shows). The code has worked on different XML files before, so I'm not sure why it's not working on this one. I've gotten a KeyError on 'type', and I've also gotten index errors on the line that extracts drugid, even though EVERY drug has a drugid.
What am I screwing up here?

EDIT: the things I'm extracting are guaranteed to be in each drug.

For anyone who cares, here's the link to the XML file I'm parsing: http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip

EDIT: After implementing a try/except block (see above), here's what I found out. In the schema, there are sections called "drug interactions" that also have a subfield called drug. So like this:

\[code\]
<drugs>
  <drug type='something' ...>
    <drugbank-id>
    <name>
    ...
    <targets>
      <target partner='something'>
    <drug-interactions>
      <drug>
\[/code\]

I think that my line druglist = dom.getElementsByTagName('drug') is unintentionally picking those up as well. I don't know how I could fix this. Any suggestions?
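One possible way around the nested <drug> elements described above is to keep only the <drug> nodes that are direct children of the root element, since getElementsByTagName searches the whole subtree. The following is only a minimal sketch under stated assumptions: the SAMPLE document and its values are made up to mirror the rough schema sketch, and the helper names top_level_drugs and first_child_text are hypothetical. The real file may be nested differently, so treat this as a starting point and not a verified fix.

```python
from xml.dom.minidom import parseString

# Made-up miniature document mirroring the rough schema sketch in the post.
# All values are placeholders, not real data from the file.
SAMPLE = """<drugs>
  <drug type='example-type'>
    <drugbank-id>ID-1</drugbank-id>
    <name>Example drug</name>
    <targets>
      <target partner='t1'/>
    </targets>
    <drug-interactions>
      <drug>ID-2</drug>
    </drug-interactions>
  </drug>
</drugs>"""


def top_level_drugs(dom):
    """Return only the <drug> elements that are direct children of the root."""
    root = dom.documentElement
    return [node for node in root.childNodes
            if node.nodeType == node.ELEMENT_NODE and node.tagName == 'drug']


def first_child_text(element, tag):
    """Return the text of the first direct child named `tag`, or None."""
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag:
            return child.firstChild.nodeValue if child.firstChild else None
    return None


dom = parseString(SAMPLE)

# getElementsByTagName walks the whole subtree, so it also returns the
# <drug> nested inside <drug-interactions>.
all_drugs = dom.getElementsByTagName('drug')
drugs = top_level_drugs(dom)

rows = []   # (drugid, drugname, drugtype), as written to the first CSV
pairs = []  # (drugid, targetid), as written to DrugTargetRel.csv
for entry in drugs:
    drugid = first_child_text(entry, 'drugbank-id')
    drugname = first_child_text(entry, 'name')
    drugtype = entry.getAttribute('type')
    rows.append((drugid, drugname, drugtype))
    for target in entry.getElementsByTagName('target'):
        pairs.append((drugid, target.getAttribute('partner')))

print(len(all_drugs), len(drugs))
print(rows)
print(pairs)
```

Reading drugbank-id and name from direct children only is a defensive choice here, in case elements with the same names also appear deeper inside a drug entry.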
