ClolaLoaltpat
New Member
I am developing an application that is, to put it simply, a niche search engine. Within the application I have a function crawl() which crawls a website and then uses the _collectData() function to store the relevant data from each page in the "products" table, as described in the function. The visited pages are stored in a database.

The crawler works pretty well, just as described, except for two things: timeout and memory. I've managed to correct the timeout error, but the memory problem remains. I know that simply increasing the memory_limit is not actually fixing the problem.

The function is run by visiting "EXAMPLE.COM/products/crawl".

Is a memory leak inevitable with a PHP web crawler, or is there something I'm doing wrong / not doing? Thanks in advance. (Code below.)

[code]
function crawl() {
    $this->_crawl('http://www.example.com/', 'http://www.example.com');
}

/***
 * This function finds all links in $start and collects
 * data from them, as well as recursively crawling them.
 *
 * @param $start  the webpage where the crawler starts
 * @param $domain the domain in which to stay
 ***/
function _crawl($start, $domain) {
    $dom = new DOMDocument();
    @$dom->loadHTMLFile($start);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a"); // get all <a> elements

    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href'); // get href value
        if (!(strpos($url, 'http') !== false)) { // check for relative links
            $url = $domain . '/' . $url;
        }
        // if this link has not already been crawled (i.e. does not exist in the database)
        if ($this->Page->find('count', array('conditions' => array('Page.url' => $url))) < 1
            && (strpos($url, $domain) !== false)) {
            $this->Page->create();
            $this->Page->set('url', $url);
            $this->Page->set('indexed', date('Y-m-d H:i:s'));
            $this->Page->save();          // add this url to the database
            $this->_collectData($url);    // collect this link's data
            $this->_crawl($url, $domain); // crawl this link
        }
    }
}
[/code]
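For what it's worth, the direction I've been wondering about: since each level of the recursion keeps its own $dom, $xpath, and $hrefs alive until every page beneath it has been crawled, maybe replacing the recursion with an explicit queue would let each page's DOMDocument be freed before the next one loads. This is just a rough, untested sketch of what I mean, reusing the same Page model and _collectData(); the $queue handling is my own guess, not code I've run:

[code]
function crawl() {
    $domain = 'http://www.example.com';
    $queue = array($domain . '/'); // URLs still to be crawled

    while (!empty($queue)) {
        $start = array_shift($queue);

        // only one DOMDocument is alive at a time, instead of one per recursion level
        $dom = new DOMDocument();
        if (!@$dom->loadHTMLFile($start)) {
            continue; // skip pages that fail to load
        }
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a"); // get all <a> elements

        for ($i = 0; $i < $hrefs->length; $i++) {
            $url = $hrefs->item($i)->getAttribute('href');
            if (strpos($url, 'http') === false) { // resolve relative links
                $url = $domain . '/' . $url;
            }
            if (strpos($url, $domain) !== false
                && $this->Page->find('count', array('conditions' => array('Page.url' => $url))) < 1) {
                $this->Page->create();
                $this->Page->set('url', $url);
                $this->Page->set('indexed', date('Y-m-d H:i:s'));
                $this->Page->save();       // add this url to the database
                $this->_collectData($url); // collect this link's data
                $queue[] = $url;           // crawl it later instead of recursing now
            }
        }

        unset($dom, $xpath, $hrefs); // drop references so this page's DOM can be freed
    }
}
[/code]

Would something like this actually keep the memory flat, or is the DOM not the real culprit here?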