Removing jQuery and CSS from an XML Document

mirou93

New Member
I'm using SgmlReader to convert HTML to XML. The output goes into an XmlDocument object, whose InnerText property I then use to extract the plain text from the website. I'm trying to get the text to look as clean as possible by removing any JavaScript. Looping through the XML and removing any \[code\]<script type="text/javascript">\[/code\] element is easy enough, but I've hit a brick wall with jQuery and styling that still end up in the text. Can anybody help me out?

Sample code:

Step one: I use the WebClient class to download the HTML, save it, then open the file with the TextReader class.

Step two: Create the SgmlReader and set its input stream to the text reader:
\[code\]
// set up SgmlReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = reader;

// create document
doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
\[/code\]

Step three: Once I have an XmlDocument, I use doc.InnerText to get my plain text.

Step four: I can easily remove JavaScript tags like so:
\[code\]
// Walk the live node list backwards so removals don't shift unvisited indexes.
XmlNodeList nodes = doc.GetElementsByTagName("script");
for (int i = nodes.Count - 1; i >= 0; i--)
{
    XmlElement script = (XmlElement)nodes[i];
    if (script.GetAttribute("type") == "text/javascript")
        script.ParentNode.RemoveChild(script);
}
\[/code\]

Some stuff still slips through. Here's an example of the output for one particular website I'm scraping:
\[code\]
Criminal and Civil Enforcement | Fraud | Office of Inspector General | U.S. Department of Health and Human Services#fancybox-right { right:-20px; } #fancybox-left { left:-20px; } #fancybox-right:hover span, #fancybox-right span #fancybox-right:hover span, #fancybox-right span { left:auto; right:0; } #fancybox-left:hover span, #fancybox-left span #fancybox-left:hover span, #fancybox-left span { right:auto; left:0; } #fancybox-overlay { /* background: url('/connections/images/wc-overlay.png'); *//* background: url('/connections/images/banner.png') center center no-repeat; */} $(document).ready(function(){$("a[rel=photo-show]").fancybox({'titlePosition' : 'over','overlayColor' : '#000','overlayOpacity' : 0.9});$(".title-under").fancybox({'titlePosition' : 'outside','overlayColor' : '#000','overlayOpacity' : 0.9}) });
\[/code\]

That jQuery and styling needs to be removed.
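One possible explanation, offered as an assumption rather than something the post confirms: the leftover CSS may be sitting inside a `<style>` element and the jQuery inside a `<script>` element that has no `type="text/javascript"` attribute, so neither gets removed by a pass that only targets typed scripts. Under that assumption, a minimal sketch would drop every `script` and `style` element, whatever its attributes, before reading InnerText. The class name `PlainTextExtractor`, the method `GetVisibleText`, and the sample markup in `Main` are hypothetical and used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of SgmlReader or the original post.
static class PlainTextExtractor
{
    // Removes all <script> and <style> elements, then returns the remaining text.
    // Assumption: the unwanted CSS and jQuery live inside those two element types.
    public static string GetVisibleText(XmlDocument doc)
    {
        // local-name() ignores any namespace prefix the converter may have added.
        const string xpath = "//*[local-name()='script' or local-name()='style']";

        // Copy the matches first so the document is not modified while iterating.
        var doomed = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            doomed.Add(node);
        }

        foreach (XmlNode node in doomed)
        {
            node.ParentNode?.RemoveChild(node);
        }

        return doc.InnerText;
    }

    static void Main()
    {
        // Small made-up sample; in the real flow, doc would come from doc.Load(sgmlReader).
        var doc = new XmlDocument();
        doc.LoadXml(
            "<html><head><title>Fraud</title>" +
            "<style>#fancybox-right { right:-20px; }</style></head>" +
            "<body><script>$(document).ready(function(){});</script>" +
            "<p>Enforcement</p></body></html>");

        Console.WriteLine(GetVisibleText(doc));
    }
}
```

If CSS or script text still shows up after this, it is probably not inside either element type in the converted XML, and inspecting doc.OuterXml around the leftover text should show which node actually holds it.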
 