I've lost the day trying to do something that I thought was going to be completely simple to do, and it's been one headbanging experience after another...
I'm getting a ZIP file from a Java program. The file is POSTed to my system, and will be sitting in a directory for me to process. Inside the ZIP file is a single text file, which is actually an XML "fragment" - there's no <?xml ?> header, and no root node.
It's basically a set of data readings:
<data time="hh:mm:ss" date="yyyymmdd" value=""><someinfo></someinfo><otherinfo></otherinfo></data>
Just rows and rows of that.
So I thought I'd use the zip_* functions and read the file in, slap an <?xml ?> header and <root> </root> around the returned info, and load that into a DOMDocument() and have an XML file to play with.
Nope.
I seem to be able to read the data from the zip file. But it turns out the file coming back is encoded in either UTF-8 or UTF-16 ... there's a Byte Order Mark that I have to ignore on the return from the zip_* functions.
When I plunk the extra info on and try to do a loadXML() call on the string I build, I'm getting parse errors.
I finally gave up on this approach, and thought I'd try a more brute force and ignorance approach by turning the returned string into an array of records by splitting on "<data".
That's gotten me nowhere, either. I can't figure out the incantation to break the string up when it's encoded in UTF-8 (or is it really UTF-16??). I tried split() and preg_split(), but I only ever end up with one huge record.
I've been Googling and searching php.net. All I've found are a large number of articles and blog entries bemoaning the poor support for Unicode in PHP in general, and tutorials and examples that have either full-fledged XML files to load, or are only using ASCII characters.
So ... Can someone point me to something that can either tell me how to get the XML file I want, or tell me how to split the string into an array of records, each on "<data"?
Or point me to something that will explain how to work with Unicode/UTF-8/UTF-16 better so that I can beat on this some more?
Thanks.

Since it's coming from Java, I suspect it's been saved in a Unicode format.
You should probably first put the encoding type in there; I think it's UTF-16 with Java (I may be mistaken). However, seeing exactly what is being added would be appreciated.
Also, is this specifically PHP 5 related, or more general Coding? I'm thinking the latter.

Your reply made me try Googling (is that a verb now?) with some different keywords, and I found a blog entry that helped me solve the problem. Entry is at PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss (http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss)
The key was to use mb_convert_encoding($filecontentstr, "UTF-8", "UTF-16LE");
I have to use UTF-16LE - using UTF-16 doesn't work right. I know it's little-endian because I'm on Windows, and the Byte Order Mark indicates it too.
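For anyone who lands here later, this is roughly how the conversion step could look if you let the Byte Order Mark pick the endianness instead of hard-coding it. It's only a sketch: utf16_to_utf8() is a made-up helper name, and falling back to UTF-16LE when there is no BOM is my assumption based on what this particular file looked like.

```php
<?php
/**
 * Hypothetical helper: convert a UTF-16 string to UTF-8.
 * The two-byte BOM, if present, selects the endianness and is dropped
 * so it does not end up as a stray character in the converted text.
 */
function utf16_to_utf8($raw)
{
    $bom = substr($raw, 0, 2);

    if ($bom === "\xFF\xFE") {          // FF FE marks little-endian
        $from = 'UTF-16LE';
        $raw  = substr($raw, 2);
    } elseif ($bom === "\xFE\xFF") {    // FE FF marks big-endian
        $from = 'UTF-16BE';
        $raw  = substr($raw, 2);
    } else {
        $from = 'UTF-16LE';             // assumption: no BOM means little-endian
    }

    return mb_convert_encoding($raw, 'UTF-8', $from);
}

// Build a small UTF-16LE sample with a BOM, the way the zipped file arrived.
$sample = "\xFF\xFE" . mb_convert_encoding('<data value="1"></data>', 'UTF-16LE', 'UTF-8');
$utf8   = utf16_to_utf8($sample);
```

Once the string is UTF-8, the "<data" markers are plain single-byte sequences again, so ordinary string functions can find them.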
Anyway, with that in place, I get the file to work properly and can now proceed to the part that I thought was going to be hard (and is turning out to be easier than I thought...)
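In case it helps someone else, here is a minimal sketch of the wrapping step once the fragment is already UTF-8. load_fragment() is a hypothetical name, the <root> wrapper is just the one I described above, and the sample readings are made-up values for illustration.

```php
<?php
/**
 * Hypothetical helper: wrap a UTF-8 XML fragment in a declaration and a
 * single root element, then parse it with DOMDocument.
 */
function load_fragment($utf8Fragment)
{
    // Drop a leading UTF-8 BOM (EF BB BF) if the conversion left one behind.
    if (substr($utf8Fragment, 0, 3) === "\xEF\xBB\xBF") {
        $utf8Fragment = substr($utf8Fragment, 3);
    }

    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<root>' . $utf8Fragment . '</root>';

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('Could not parse the wrapped fragment');
    }

    return $doc;
}

// Made-up sample rows in the same shape as the real readings.
$fragment = '<data time="10:15:00" date="20240101" value="3.5"><someinfo></someinfo><otherinfo></otherinfo></data>' . "\n"
          . '<data time="10:16:00" date="20240101" value="3.7"><someinfo></someinfo><otherinfo></otherinfo></data>';

$doc  = load_fragment($fragment);
$rows = $doc->getElementsByTagName('data');

foreach ($rows as $row) {
    echo $row->getAttribute('date'), ' ',
         $row->getAttribute('time'), ' ',
         $row->getAttribute('value'), "\n";
}
```

With the document loaded this way, there's no need for the split-on-"<data" fallback at all; getElementsByTagName() hands back one node per reading.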