Extract “style” information when scraping HTML data using XML in R

user60 · Jul 21, 2012

I used the script below to try to extract data from an HTML file that was converted from a PDF.

\[code\]
temp.html <- scan(file=filename, what="character")
pagetree <- htmlTreeParse(temp.html, error=function(...){}, useInternalNodes = TRUE)
tx.raw <- getNodeSet(pagetree, "//div")
\[/code\]

\[code\]tx.raw\[/code\] is a list, and one of its elements is shown below:

\[code\]
tx.raw[[170]]
<div style="position:absolute;top:985;left:748"> <nobr> <span class="ft03"> 971.72 </span> </nobr></div>
\[/code\]

The information I need is inside the \[code\]span\[/code\] (i.e. \[code\]971.72\[/code\]), but I also need the \[code\]style\[/code\] attribute of the \[code\]div\[/code\], because it tells me exactly where the piece of data in the \[code\]span\[/code\] is located in the PDF file. How can I extract the style information as well? Thanks.
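One possible approach, sketched here rather than taken from the thread: the XML package's `xmlGetAttr()` reads an attribute from a node, so it can be applied to each `div` alongside `xmlValue()` for the text. The inline `html` string below is a hypothetical stand-in for the converted file, the second `div` is made up for illustration, and the regular expressions assume the style always has the `top:<number>;left:<number>` form shown in the question.

```r
library(XML)

# Hypothetical stand-in for the converted page; the real input would be the file.
html <- '<html><body>
<div style="position:absolute;top:985;left:748"> <nobr> <span class="ft03"> 971.72 </span> </nobr></div>
<div style="position:absolute;top:1002;left:748"> <nobr> <span class="ft03"> 15.30 </span> </nobr></div>
</body></html>'

pagetree <- htmlParse(html, asText = TRUE)

# Keep only div nodes that carry a style attribute and contain a span.
divs <- getNodeSet(pagetree, "//div[@style][.//span]")

# Attribute value and text content of each div.
style <- sapply(divs, xmlGetAttr, "style")
value <- sapply(divs, function(d) gsub("^[[:space:]]+|[[:space:]]+$", "", xmlValue(d)))

# Assumption: positions appear as "top:<number>" and "left:<number>" in the style.
top  <- as.numeric(sub(".*top:[[:space:]]*(-?[0-9.]+).*",  "\\1", style))
left <- as.numeric(sub(".*left:[[:space:]]*(-?[0-9.]+).*", "\\1", style))

result <- data.frame(top = top, left = left, value = as.numeric(value))
print(result)
```

With the real file, `htmlParse(filename)` could replace the inline string, and the same `sapply()` calls should work on the `tx.raw` list from the question, since `getNodeSet()` returns the same kind of node list there.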
