How do I parse Wikipedia XML dumps into one document per line?

picarcayl · Jan 18, 2013

For a project, I need to convert a Wikipedia XML dump into a plain-text corpus file with one document per line. I have found several tools that split the XML dump into many separate files, but that is not the format I need, and I fear that managing millions of small files will put unnecessary load on my already slow HDD. Can anyone suggest a good program for this?
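To make the target format concrete, here is a rough sketch of what I am after, using only the Python standard library. It assumes the usual dump layout of `<page>` elements containing a `<title>` and a `<text>` element; the function names (`iter_pages`, `to_line`, `write_corpus`) are my own placeholders, and the wiki markup is left in the output rather than stripped.

```python
import bz2
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Stream (title, text) pairs from a dump without loading it whole."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title = text = ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = ""
            root.clear()  # discard finished pages so memory use stays flat


def to_line(text):
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


def write_corpus(source, out):
    """Write one non-empty document per line to the text stream `out`."""
    for _title, text in iter_pages(source):
        line = to_line(text)
        if line:
            out.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical usage: python dump2lines.py dump.xml.bz2 corpus.txt
    with bz2.open(sys.argv[1], "rb") as src, \
            open(sys.argv[2], "w", encoding="utf-8") as dst:
        write_corpus(src, dst)
```

Everything goes into a single output file, so there are no millions of small files to manage. A real tool would also need to strip the wiki markup and probably skip redirects and non-article pages, which this sketch does not attempt.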
