Is there a way to keep entities intact while parsing HTML with DOMDocument?

lerbubeschab

I have this function to ensure every img tag has an absolute URL:

\[code\]
function absoluteSrc($html, $encoding = 'utf-8')
{
    $dom = new DOMDocument();
    // Workaround to use proper encoding
    $prehtml  = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset={$encoding}\"></head><body>";
    $posthtml = "</body></html>";

    if ($dom->loadHTML($prehtml . trim($html) . $posthtml)) {
        foreach ($dom->getElementsByTagName('img') as $img) {
            if ($img instanceof DOMElement) {
                $src = $img->getAttribute('src');
                if (strpos($src, 'http://') !== 0) {
                    $img->setAttribute('src', 'http://my.server/' . $src);
                }
            }
        }
        $html = $dom->saveHTML();
        // Remove remains of workaround / DomDocument additions
        $cut_start  = strpos($html, '<body>') + 6;
        $cut_length = -1 * (1 + strlen($posthtml));
        $html = substr($html, $cut_start, $cut_length);
    }
    return $html;
}
\[/code\]

It works fine, but it returns decoded entities as Unicode characters:

\[code\]
$html = <<<EOHTML
<p><img src="images/lorem.jpg" alt="lorem" align="left">Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
Cum magna. Suscipit sed vel tincidunt urna.<br>
Vel consequat pretium Curabitur faucibus justo adipiscing elit.
<img src="others/ipsum.png" alt="ipsum" align="right"></p>
<center>&copy; Dr&nbsp;Jekyll & Mr&nbsp;Hyde</center>
EOHTML;

echo absoluteSrc($html);
\[/code\]

Outputs:

\[code\]
<p><img src="http://my.server/images/lorem.jpg" alt="lorem" align="left">Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
Cum magna. Suscipit sed vel tincidunt urna.<br>
Vel consequat pretium Curabitur faucibus justo adipiscing elit.
<img src="http://my.server/others/ipsum.png" alt="ipsum" align="right"></p>
<center>© Dr Jekyll & Mr Hyde</center>
\[/code\]

As you can see in the last line:
  • &copy; is translated to © (U+00A9),
  • &nbsp; to non-breaking space (U+00A0),
  • & to &
I would like them to remain the same as in input string.
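
To make the goal concrete, here is a minimal sketch of one possible workaround: hide the entity references from the parser before loading, then put them back after serialising. The helper names `protectEntities` and `restoreEntities` and the `[[ENT:...]]` placeholder format are assumptions made up for this illustration; they are not part of DOMDocument. The sketch also assumes the input never contains literal text of the form `[[ENT:...]]`, and it has not been checked against entities inside URL attributes such as `src`, where the serialiser may alter the placeholder.

```php
<?php
// Hypothetical helpers, not part of DOMDocument.

// Replace each entity reference (&copy;, &#160;, &#xA0;) with a plain-text
// placeholder so the HTML parser has nothing to decode.
function protectEntities($html)
{
    return preg_replace('/&(#?[a-zA-Z0-9]+);/', '[[ENT:$1]]', $html);
}

// Turn the placeholders back into the original entity references.
function restoreEntities($html)
{
    return preg_replace('/\[\[ENT:(#?[a-zA-Z0-9]+)\]\]/', '&$1;', $html);
}

// Round trip on a plain string, with no DOM parsing in between.
$sample    = '<center>&copy; Dr&nbsp;Jekyll</center>';
$protected = protectEntities($sample);
$restored  = restoreEntities($protected);

echo $protected, "\n";
echo $restored, "\n";
```

With these in place, the call from the question would presumably become `echo restoreEntities(absoluteSrc(protectEntities($html)));`. A bare `&` that is not part of an entity reference is left untouched by the helpers, so whatever DOMDocument does with it still applies.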
 