BeautifulSoup get_text does not strip all tags and JavaScript

aliyes · Jul 21, 2012

I am trying to use BeautifulSoup to get text from web pages. Below is a script I've written to do so. It takes two arguments: the first is the input HTML or XML file, the second is the output file.

[code]
import sys
from bs4 import BeautifulSoup

def stripTags(s):
    return BeautifulSoup(s).get_text()

def stripTagsFromFile(inFile, outFile):
    open(outFile, 'w').write(stripTags(open(inFile).read()).encode("utf-8"))

def main(argv):
    if len(sys.argv) <> 3:
        print 'Usage:\t\t', sys.argv[0], 'input.html output.txt'
        return 1
    stripTagsFromFile(sys.argv[1], sys.argv[2])
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
[/code]

Unfortunately, for many web pages, for example http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location, I get something like this (I'm showing only the first few lines):

[code]
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
Education Manager Job In London With Caleeda | Great Jobs In Teaching
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-15255540-21']);
_gaq.push(['_trackPageview']);
_gaq.push(['_trackPageLoadTime']);
[/code]

Is there anything wrong with my script? I tried passing 'xml' as the second argument to BeautifulSoup's constructor, as well as 'html5lib' and 'lxml', but it doesn't help.

Is there an alternative to BeautifulSoup which would work better for this task? All I want is to extract the text that would be rendered in a browser for this web page.

Any help will be much appreciated.
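To make the goal concrete: the sample output above shows the doctype and the contents of a script element coming through as text. As a rough sketch of one possible alternative (not BeautifulSoup, and not a tested answer for the page above), the following uses only `html.parser` from the Python 3 standard library and drops everything inside `script` and `style` elements. The names `VisibleTextExtractor` and `strip_tags` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script and style elements."""

    # Assumption: only these two elements are treated as non-text here.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # The doctype and comments never reach this method; HTMLParser
        # routes them to handle_decl and handle_comment instead.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def strip_tags(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample markup, loosely modelled on the output shown above.
sample = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Education Manager Job</title>
<script type="text/javascript">var _gaq = _gaq || [];</script>
<style>body { color: black; }</style></head>
<body><p>Apply &amp; start soon.</p><!-- hidden --></body></html>"""

result = strip_tags(sample)
print(result)
```

This sketch still keeps the page title, and it puts every text node on its own line, so inline markup such as bold text inside a sentence gets split across lines. It also knows nothing about CSS, so elements hidden by a stylesheet are still included.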
