HTML scraping and CSS queries

guguygugu

New Member
What are the advantages and disadvantages of the following libraries?

- PHP Simple HTML DOM Parser
- QP
- phpQuery

Of the above, I've used QP, which failed to parse invalid HTML, and Simple HTML DOM Parser, which does a good job but tends to leak memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you no longer need an object.

Are there any more scrapers? What are your experiences with them? I'm going to make this a community wiki; maybe we'll build a useful list of libraries for scraping.

I did some tests based on Byron's answer:

    <?php
    include("lib/simplehtmldom/simple_html_dom.php");
    include("lib/phpQuery/phpQuery/phpQuery.php");

    echo "<pre>";

    $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
    $data['pq'] = $data['dom'] = $data['simple_dom'] = array();

    // 1. Built-in DOMDocument + DOMXPath
    $timer_start = microtime(true);
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings about invalid markup
    $x = new DOMXPath($dom);
    foreach ($x->query("//a") as $node) {
        $data['dom'][] = $node->getAttribute("href");
    }
    foreach ($x->query("//img") as $node) {
        $data['dom'][] = $node->getAttribute("src");
    }
    foreach ($x->query("//input") as $node) {
        $data['dom'][] = $node->getAttribute("name");
    }
    $dom_time = microtime(true) - $timer_start;
    echo "dom: \t\t $dom_time . Got ".count($data['dom'])." items \n";

    // 2. phpQuery (iterating a result yields plain DOMElement nodes,
    //    so attributes are read with getAttribute())
    $timer_start = microtime(true);
    $doc = phpQuery::newDocument($html);
    foreach ($doc->find("a") as $node) {
        $data['pq'][] = $node->getAttribute("href");
    }
    foreach ($doc->find("img") as $node) {
        $data['pq'][] = $node->getAttribute("src");
    }
    foreach ($doc->find("input") as $node) {
        $data['pq'][] = $node->getAttribute("name");
    }
    $time = microtime(true) - $timer_start;
    echo "PQ: \t\t $time . Got ".count($data['pq'])." items \n";

    // 3. PHP Simple HTML DOM Parser
    $timer_start = microtime(true);
    $simple_dom = new simple_html_dom();
    $simple_dom->load($html);
    foreach ($simple_dom->find("a") as $node) {
        $data['simple_dom'][] = $node->href;
    }
    foreach ($simple_dom->find("img") as $node) {
        $data['simple_dom'][] = $node->src;
    }
    foreach ($simple_dom->find("input") as $node) {
        $data['simple_dom'][] = $node->name;
    }
    $simple_dom_time = microtime(true) - $timer_start;
    echo "simple_dom: \t $simple_dom_time . Got ".count($data['simple_dom'])." items \n";

    echo "</pre>";

and got (times in seconds):

    dom:         0.00359296798706 . Got 115 items
    PQ:          0.010568857193 . Got 115 items
    simple_dom:  0.0770139694214 . Got 115 items
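Since tolerance for invalid HTML was the sticking point above, here is a minimal sketch of the plain DOMDocument + DOMXPath approach run against deliberately broken markup. The helper name collectAttributes and the sample markup are made up for illustration and are not part of any of the libraries discussed. It uses libxml_use_internal_errors() to buffer parser warnings rather than the @ operator. Unlike the benchmark, it only returns elements that actually carry the attribute.

```php
<?php
// Hypothetical helper (not from any library discussed above): collect one
// attribute from every element with the given tag name, using only PHP's
// bundled DOM extension.
function collectAttributes(string $html, string $tag, string $attribute): array
{
    // Buffer libxml warnings about malformed markup instead of using @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the buffered warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $values = [];
    // Only match elements that actually have the attribute.
    foreach ($xpath->query('//' . $tag . '[@' . $attribute . ']') as $node) {
        $values[] = $node->getAttribute($attribute);
    }
    return $values;
}

// Made-up sample markup: unclosed <p> and <a>, an unquoted attribute value,
// and no closing </html>.
$html = '<html><body><p>Unclosed paragraph <a href="/one">first<a href=/two>second</a>'
      . '<img src="logo.png"><input name="q"></body>';

print_r(collectAttributes($html, 'a', 'href'));
print_r(collectAttributes($html, 'img', 'src'));
print_r(collectAttributes($html, 'input', 'name'));
```

The same helper could stand in for the three XPath loops in the benchmark, at the cost of re-parsing the document on every call, so for timing comparisons it would be fairer to parse once and reuse the DOMXPath object.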
 