beautifulsoup findall

paintrick

New Member
I have some XML:
\[code\]
<article><uselesstag></uselesstag><topic>oil, gas</topic><body>body text</body></article>
<article><uselesstag></uselesstag><topic>food</topic><body>body text</body></article>
<article><uselesstag></uselesstag><topic>cars</topic><body>body text</body></article>
\[/code\]
There are many, many useless tags. I want to use BeautifulSoup to collect all of the text in the body tags, together with the associated topic text, to create some new XML. I am new to Python, but I suspect that some form of
\[code\]
import arff
from xml.etree import ElementTree
import re
from StringIO import StringIO
import BeautifulSoup
from BeautifulSoup import BeautifulSoup

totstring = ""
with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        # strip every character outside the listed set
        string = re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+", "", line)
        totstring += string

soup = BeautifulSoup(totstring)
body = soup.find("body")
for anchor in soup.findAll('body'):
    # Stick body and its topics in an associated array?
    pass
# no explicit close needed: the with block closes inF
\[/code\]
will work.

1) How do I do it?
2) Should I add a root node to the XML? Otherwise it's not proper XML, is it?

Thanks very much.

Edit: What I want to end up with is:
\[code\]
<article><topic>oil, gas</topic><body>body text</body></article>
<article><topic>food</topic><body>body text</body></article>
<article><topic>cars</topic><body>body text</body></article>
\[/code\]
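On question 2: a well-formed XML document has exactly one root element, so a bare run of article elements is only a fragment until it is wrapped. As a rough sketch of the kind of filtering described in the post, here is one way to keep only topic and body for each article. Note that it swaps BeautifulSoup for the standard library's ElementTree, so it only applies when the input parses as XML once wrapped; that holds for the small sample in the post but is an assumption for the real .sgm file. The root name "articles" and the helper keep_topic_and_body are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample input taken from the post (closing tag spelled consistently).
raw = (
    "<article><uselesstag></uselesstag><topic>oil, gas</topic>"
    "<body>body text</body></article>"
    "<article><uselesstag></uselesstag><topic>food</topic>"
    "<body>body text</body></article>"
    "<article><uselesstag></uselesstag><topic>cars</topic>"
    "<body>body text</body></article>"
)


def keep_topic_and_body(fragment, root_name="articles"):
    """Build a new tree holding only <topic> and <body> of each <article>.

    root_name is a hypothetical wrapper element, not something from the post.
    """
    # Wrap the fragment so the parser sees a single root element.
    source = ET.fromstring("<{0}>{1}</{0}>".format(root_name, fragment))
    result = ET.Element(root_name)
    for article in source.iter("article"):
        new_article = ET.SubElement(result, "article")
        for tag in ("topic", "body"):
            # ".//" matches the tag at any depth below the article.
            found = article.find(".//" + tag)
            if found is not None:
                ET.SubElement(new_article, tag).text = found.text
    return result


cleaned = keep_topic_and_body(raw)
print(ET.tostring(cleaned, encoding="unicode"))
```

The same per-article loop should carry over to BeautifulSoup: iterate over the article elements and look up the topic and body inside each one, rather than collecting every body with a single findAll call and losing track of which topic it belongs to.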
 