XML to JSON preserving document order

While this subject has been covered multiple times, I can't seem to easily find anything that deals with how to maintain document order with semi-structured data, as discussed in this article. Here is the example from the article:

> But how do we convert textual content mixed up with elements? For example:
>
>     <e> some <a>textual</a> content</e>
>
> It obviously doesn't make sense in most cases to collect all text nodes in an array,
>
>     "e": { "#text": ["some", "content"], "a": "textual" }
>
> that doesn't preserve order or semantics.

Is there a simple way to deal with this problem? Ideally, if I take some XML, can I supply an XPath switch that would make some sections CDATA?

For example, I have the following quick script:

```python
#!/usr/bin/env python
import simplejson as json
import xmltodict


def main():
    xml = '''
<root>
  <p id="para-1">
    Bacon ipsum dolor sit amet boudin jerky fatback
    corned beef, beef ribs salami chicken frankfurter shankle sirloin
    shoulder short loin pork chop. T-bone hamburger pastrami turducken
    ball tip. Shoulder fatback strip steak kielbasa. Hamburger ground
    round ball tip sirloin biltong andouille. Strip steak chicken
    pancetta pork loin turducken. Ball tip filet mignon jerky boudin.
    <quote>
      Short loin corned beef andouille, pancetta rump drumstick
      t-bone bacon jerky. Brisket short loin meatball, turducken
      tail spare ribs frankfurter ground round. Pork belly pig
      chuck doner, swine ground round pork loin rump sausage
      ribeye frankfurter sirloin strip steak turducken. Short
      loin salami rump, chicken leberkas beef ribs pastrami.
      Bresaola leberkas venison sausage brisket frankfurter
      bacon. Pork loin short loin biltong jowl tongue. Ball tip
      doner sirloin pork belly beef cow.
    </quote>
  </p>
  <p id="para-2">
    Ham hock pig filet mignon, ham jowl beef ribs prosciutto pork belly hamburger
    t-bone kielbasa. Chuck jowl shoulder pork. Tongue strip steak fatback cow
    prosciutto chicken. Fatback kielbasa flank, meatball ham frankfurter short ribs
    pastrami tri-tip beef ribs capicola brisket rump biltong swine. Brisket beef
    kielbasa pancetta andouille venison flank jowl ham hock jerky shankle ball tip
    shoulder.
    <note>
      1. Yeah, right lisa. Some <em>Magical Animal </em></note>
  </p>
</root>
'''
    # Parse the XML into nested dicts, then dump them as JSON.
    o = xmltodict.parse(xml)
    print(json.dumps(o, indent=2))


if __name__ == '__main__':
    main()
```

which produces the following:

```
{
  "root": {
    "p": [
      {
        "@id": "para-1",
        "quote": "Short loin corned beef andouille, pancetta rump drumstick\n t-bone bacon jerky. Brisket short loin meatball, turducken\n tail spare ribs frankfurter ground round. Pork belly pig\n chuck doner, swine ground round pork loin rump sausage\n ribeye frankfurter sirloin strip steak turducken. Short\n loin salami rump, chicken leberkas beef ribs pastrami.\n Bresaola leberkas venison sausage brisket frankfurter\n bacon. Pork loin short loin biltong jowl tongue. Ball tip\n doner sirloin pork belly beef cow.",
        "#text": "Bacon ipsum dolor sit amet boudin jerky fatback\n corned beef, beef ribs salami chicken frankfurter shankle sirloin\n shoulder short loin pork chop. T-bone hamburger pastrami turducken\n ball tip. Shoulder fatback strip steak kielbasa. Hamburger ground\n round ball tip sirloin biltong andouille. Strip steak chicken\n pancetta pork loin turducken. Ball tip filet mignon jerky boudin."
      },
      {
        "@id": "para-2",
        "note": {
          "em": "Magical Animal",
          "#text": "1. Yeah, right lisa. Some"
        },
        "#text": "Ham hock pig filet mignon, ham jowl beef ribs prosciutto pork belly hamburger\n t-bone kielbasa. Chuck jowl shoulder pork. Tongue strip steak fatback cow\n prosciutto chicken. Fatback kielbasa flank, meatball ham frankfurter short ribs\n pastrami tri-tip beef ribs capicola brisket rump biltong swine. Brisket beef\n kielbasa pancetta andouille venison flank jowl ham hock jerky shankle ball tip\n shoulder."
      }
    ]
  }
}
```

Ideally I could call something like

```python
print(json.dumps(o, indent=2, cdata='//node()[ancestor::p]'))
```

(I'm tired, so this XPath may be off; I want all nodes that have an ancestor of `p`.) That would then produce the following (notice this preserves text order, as it puts the XML directly into the JSON):

```
{
  "root": {
    "p": [
      {
        "@id": "para-1",
        "#text": "Bacon ipsum dolor sit amet boudin jerky fatback\n corned beef, beef ribs salami chicken frankfurter shankle sirloin\n shoulder short loin pork chop. T-bone hamburger pastrami turducken\n ball tip. Shoulder fatback strip steak kielbasa. Hamburger ground\n round ball tip sirloin biltong andouille. Strip steak chicken\n pancetta pork loin turducken. Ball tip filet mignon jerky boudin. <quote>Short loin corned beef andouille, pancetta rump drumstick\n t-bone bacon jerky. Brisket short loin meatball, turducken\n tail spare ribs frankfurter ground round. Pork belly pig\n chuck doner, swine ground round pork loin rump sausage\n ribeye frankfurter sirloin strip steak turducken. Short\n loin salami rump, chicken leberkas beef ribs pastrami.\n Bresaola leberkas venison sausage brisket frankfurter\n bacon. Pork loin short loin biltong jowl tongue. Ball tip\n doner sirloin pork belly beef cow.</quote>"
      },
      {
        "@id": "para-2",
        "#text": "Ham hock pig filet mignon, ham jowl beef ribs prosciutto pork belly hamburger\n t-bone kielbasa. Chuck jowl shoulder pork. Tongue strip steak fatback cow\n prosciutto chicken. Fatback kielbasa flank, meatball ham frankfurter short ribs\n pastrami tri-tip beef ribs capicola brisket rump biltong swine. Brisket beef\n kielbasa pancetta andouille venison flank jowl ham hock jerky shankle ball tip\n shoulder. <note> 1. Yeah, right lisa. Some <em>Magical Animal </em></note>"
      }
    ]
  }
}
```

My questions are:

1. Does something like this exist, to convert XML/HTML (yes, my real data is both) to JSON while preserving semantics by supplying an XPath? If not, I can roll my own, but this seems like a common enough problem that it would have been addressed before.
2. What would be considered best practice here? Is my idea of converting to mixed content like this acceptable/desirable?
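
To make the mixed-content idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under some assumptions: it selects elements by tag name rather than by a real XPath, it uses the standard library's `xml.etree.ElementTree` instead of xmltodict, and the names `inner_xml`, `convert`, and `raw_tags` are made up for this example (they are not part of any library).

```python
import json
import xml.etree.ElementTree as ET


def inner_xml(elem):
    """Return the content of elem (text and child markup) as one string."""
    parts = [elem.text or ""]
    for child in elem:
        # ET.tostring also emits the child's tail text, so the text that
        # follows each child stays in document order.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()


def convert(elem, raw_tags):
    """Convert elem to a dict, using '@attr' and '#text' keys.

    Elements whose tag is in raw_tags (a hypothetical stand-in for the
    XPath switch) keep their whole content as an XML string.
    """
    node = {"@" + name: value for name, value in elem.attrib.items()}
    if elem.tag in raw_tags:
        node["#text"] = inner_xml(elem)
        return node
    for child in elem:
        value = convert(child, raw_tags)
        if child.tag in node:
            # A repeated tag becomes a list of converted children.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    # Outside raw_tags only the leading text is kept; tail text is dropped.
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    return node


if __name__ == "__main__":
    doc = ET.fromstring(
        '<root><p id="para-1">some <a>textual</a> content</p></root>'
    )
    print(json.dumps({doc.tag: convert(doc, raw_tags={"p"})}, indent=2))
```

Unlike xmltodict, this sketch always returns a dict for an element, even a leaf, and it ignores namespaces and comments. A real XPath switch would probably need lxml, which I have not tried here.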
 