Hello,
I am trying to load an HTML document into the DOM using PHP and then convert it into my own XML format (built with string concatenation). I only look for specific tags in the document, and those tags become part of the resulting XML (inside CDATA).
I am having a problem with entities. The script can load any document from the web, so those HTML documents may contain any of these entities:
&iexcl;
&cent;
&pound;
&curren;
....
My problem is that DOM converts those entities into their printable characters.
I tried setting
$doc->substituteEntities = false;
If the XML I output is encoded as UTF-8 there is no problem as such: it displays fine in the browser or can be saved to a file. The only issue is that the entities get converted, for example &nbsp; becomes a plain non-breaking space character. I want DOM to leave the entities untouched as I traverse the document. All the entities in the document come back from DOM as part of XML_TEXT_NODE nodes.
So if I try to use PHP's htmlentities($nodeValue) to convert them back to their entity equivalents, it attaches meaningless characters to them. For example:
&Acirc;&iexcl;
&Acirc;&cent;
is the result when passed through htmlentities. Note the added &Acirc;.
So this is my problem. I have tried for a few hours but haven't found a solution.
Also, is there a simpler way to dump everything inside an element, rather than traversing each child node recursively?
Have you tried using DOMEntityReference? Maybe if the entities are appearing as individual text nodes, you could strip them out and replace each one with a DOMEntityReference?
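A rough sketch of that idea, untested against real-world pages. It assumes the entity has already been decoded into a UTF-8 character inside a text node, and it only handles the non-breaking space; the helper name nbspToEntityRefs is made up for illustration.

```php
<?php
// Hypothetical helper: replace every U+00A0 in a text node with an
// &nbsp; entity reference node, keeping the surrounding text as-is.
function nbspToEntityRefs(DOMDocument $doc, DOMText $text): void
{
    // "\xC2\xA0" is the UTF-8 byte sequence for U+00A0.
    $parts = explode("\xC2\xA0", $text->nodeValue);
    if (count($parts) === 1) {
        return; // no non-breaking space in this node
    }

    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i > 0) {
            // One entity reference for each separator that was removed.
            $fragment->appendChild($doc->createEntityReference('nbsp'));
        }
        if ($part !== '') {
            $fragment->appendChild($doc->createTextNode($part));
        }
    }

    // Swap the original text node for the text/entity-reference sequence.
    $text->parentNode->replaceChild($fragment, $text);
}
```

After calling this on a text node, saveXML() on the parent element should write the reference out as &nbsp; instead of the raw character. Other entities would need their own mapping.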
hth
--Robin
Thanks for the reply.
I had already found the solution the next day.
DOM was converting entities to their equivalent printable characters. I passed them through htmlentities to convert them back, but it was adding those extra useless entities.
To solve this, I also had to pass a third parameter to htmlentities: the encoding. I passed 'utf-8' and it correctly converted them back to their entity equivalents without adding anything extra.
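For anyone hitting the same thing, here is a small sketch of the difference. The sample markup is made up. The ISO-8859-1 call is there only to reproduce the stray entities: older PHP versions (before 5.4) used that encoding by default, while newer ones default to UTF-8, so passing the encoding explicitly is the safe choice.

```php
<?php
// Load a tiny HTML snippet; DOM decodes the entities into UTF-8 characters.
$doc = new DOMDocument();
$doc->loadHTML('<p>&iexcl;&nbsp;&cent;</p>');
$nodeValue = $doc->getElementsByTagName('p')->item(0)->nodeValue;

// Wrong: the UTF-8 bytes are read as ISO-8859-1, so each two-byte
// character turns into two entities (the first byte becomes &Acirc;).
$wrong = htmlentities($nodeValue, ENT_QUOTES, 'ISO-8859-1');

// Right: tell htmlentities that the input is UTF-8.
$right = htmlentities($nodeValue, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // &Acirc;&iexcl;&Acirc;&nbsp;&Acirc;&cent;
echo $right, "\n"; // &iexcl;&nbsp;&cent;
```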