[Resolved] Frustration dealing with XML fragments

liunx · Jun 1, 2008

I've lost the day trying to do something that I thought was going to be completely simple to do, and it's been one headbanging experience after another...

I'm getting a ZIP file from a Java program. The file is POSTed to my system, and will be sitting in a directory for me to process. Inside the ZIP file is a single text file, which is actually an XML "fragment" - there's no <?xml ?> header, and no root node.

It's basically a set of data readings:

<data time="hh:mm:ss" date="yyyymmdd" value=""><someinfo></someinfo><otherinfo></otherinfo></data>

Just rows and rows of that.

So I thought I'd use the zip_* functions and read the file in, slap an <?xml ?> and <root> </root> around the returned info, and load that into a DomDocument() and have an XML file to play with.

Nope.

I seem to be able to read the data from the zip file. But it turns out the file coming back is encoded in either UTF-8 or UTF-16 ... there's a Byte Order Mark that I have to ignore on the return from the zip_* functions.

When I plunk the extra info on and try to do a loadXML() call on the string I build, I'm getting parse errors.

I finally gave up on this approach, and thought I'd try a more brute force and ignorance approach by turning the returned string into an array of records by splitting on "<data".

That's gotten me nowhere, either. I can't figure out the incantation to break the string up when it's encoded in UTF-8 (or is it really UTF-16??). I tried split() and preg_split(), but I only ever end up with one huge record.

I've been Googling and searching php.net. All I've found are a large number of articles and blog entries bemoaning the poor support for Unicode in PHP in general, and tutorials and examples that have either full-fledged XML files to load, or are only using ASCII characters.

So ... Can someone point me to something that can either tell me how to get the XML file I want, or tell me how to split the string into an array of records, each on "<data"?

Or point me to something that will explain how to work with Unicode/UTF-8/UTF-16 better so that I can beat on this some more?

Thanks.

Since it's coming from Java, I suspect it's been saved in a Unicode format.

You should probably first put the encoding type in there. I think it's UTF-16 with Java (I may be mistaken). However, seeing exactly what is being added would be appreciated.
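For anyone wanting to check exactly which bytes sit at the start of the string, a minimal sketch along these lines may help. The function name detect_bom and the sample string are made up for illustration; only strncmp(), substr() and bin2hex() are standard PHP. Note that it does not distinguish UTF-32LE, whose Byte Order Mark also starts with FF FE.

```php
<?php
// Hypothetical helper: guess the encoding of a byte string from its
// Byte Order Mark. Returns null when no known BOM is present.
function detect_bom($bytes)
{
    $boms = array(
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    );
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// Made-up sample: "<data" as UTF-16LE with a BOM in front.
$sample = "\xFF\xFE<\x00d\x00a\x00t\x00a\x00";

echo bin2hex(substr($sample, 0, 4)), "\n"; // prints fffe3c00
echo detect_bom($sample), "\n";            // prints UTF-16LE
```

The hex dump also shows why a byte-wise search for "<data" finds nothing in UTF-16: every ASCII character is followed by a 00 byte.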

Also, is this specifically PHP 5 related, or more of a Coding question? I'm thinking the latter.

Your reply made me try Googling (is that a verb now?) with some different keywords, and I found a blog entry that helped me solve the problem. The entry is "PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss" (http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss).

The key was to use mb_convert_encoding($filecontentstr, "UTF-8", "UTF-16LE");

I have to use UTF-16LE - using UTF-16 doesn't work right. I know it's little-endian because I'm on Windows, and the Byte Order Mark indicates it too.
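Put together, the whole approach might look roughly like the sketch below. It assumes the mbstring and DOM extensions are loaded. The function name fragment_to_dom and the sample rows are hypothetical and only there for illustration; this is not the original code.

```php
<?php
// Hypothetical helper: turn a UTF-16LE XML fragment into a DOMDocument.
// Returns null if the wrapped string still fails to parse.
function fragment_to_dom($raw)
{
    // Convert the bytes to UTF-8, naming the endianness explicitly.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

    // If the BOM came through the conversion (EF BB BF in UTF-8), drop it:
    // nothing may come before the XML declaration.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = substr($utf8, 3);
    }

    // Add the missing declaration and a single root element.
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<root>' . $utf8 . '</root>';

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return null;
    }
    return $doc;
}

// Made-up sample data standing in for the file read from the ZIP.
$rows = '<data time="12:00:00" date="20080601" value="1"><someinfo></someinfo><otherinfo></otherinfo></data>'
      . '<data time="12:00:05" date="20080601" value="2"><someinfo></someinfo><otherinfo></otherinfo></data>';
$raw = "\xFF\xFE" . mb_convert_encoding($rows, 'UTF-16LE', 'UTF-8');

$doc = fragment_to_dom($raw);
echo $doc->getElementsByTagName('data')->length, "\n"; // prints 2
```

Once the string is UTF-8, "<data" is plain single-byte text again, so splitting on it with preg_split() should also work as expected.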

Anyway, with that in place, I get the file to work properly and can now proceed to the part that I thought was going to be hard (and is turning out to be easier than I thought...)
