How to convert an originally Latin-1 char[] from SAX parser to a proper UTF-8 String?

shortimer

New Member
I've been trying to use the Java SAX parser to parse an XML file in the ISO-8859-1 character encoding. This otherwise goes well, but non-ASCII characters such as ö are giving me a headache. In short, the ContentHandler.characters(...) method gives me weird characters, and you cannot even use a char array to construct a String with a specified encoding.

Here's a complete minimal working example in two files:

latin1.xml:
\[code\]
<?xml version='1.0' encoding='ISO-8859-1' standalone='no' ?>
<x>Motörhead</x>
\[/code\]

That file is saved in the said Latin-1 format, so hexdump gives this:
\[code\]
$ hexdump -C latin1.xml
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 27 31  |<?xml version='1|
00000010  2e 30 27 20 65 6e 63 6f  64 69 6e 67 3d 27 49 53  |.0' encoding='IS|
00000020  4f 2d 38 38 35 39 2d 31  27 20 73 74 61 6e 64 61  |O-8859-1' standa|
00000030  6c 6f 6e 65 3d 27 6e 6f  27 20 3f 3e 0a 3c 78 3e  |lone='no' ?>.<x>|
00000040  4d 6f 74 f6 72 68 65 61  64 3c 2f 78 3e           |Mot.rhead</x>|
\[/code\]

So the "ö" is encoded with a single byte, f6, as you'd expect.

Then, here's the Java file, saved in the UTF-8 format:

MySAXHandler.java:
\[code\]
import java.io.File;
import java.io.FileReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MySAXHandler extends DefaultHandler {

    private static final String FILE = "latin1.xml"; // Edit this to point to the correct file

    @Override
    public void characters(char[] ch, int start, int length) {
        char[] dstCharArray = new char[length];
        System.arraycopy(ch, start, dstCharArray, 0, length);
        String strValue = new String(dstCharArray);
        System.out.println("Read: '" + strValue + "'");
        assert("Motörhead".equals(strValue));
    }

    private XMLReader getXMLReader() {
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            XMLReader xmlReader = saxParser.getXMLReader();
            xmlReader.setContentHandler(new MySAXHandler());
            return xmlReader;
        } catch (Exception ex) {
            throw new RuntimeException("Epic fail.", ex);
        }
    }

    public void go() {
        try {
            XMLReader reader = getXMLReader();
            reader.parse(new InputSource(new FileReader(new File(FILE))));
        } catch (Exception ex) {
            throw new RuntimeException("The most epic fail.", ex);
        }
    }

    public static void main(String[] args) {
        MySAXHandler tester = new MySAXHandler();
        tester.go();
    }
}
\[/code\]

The result of running this program is that it outputs
\[code\]
Read: 'Mot?rhead'
\[/code\]
(the ö is replaced with a "? in a box") and then crashes due to an assertion error. If you look into the char array, you'll see that the char that encodes the letter ö consists of three bytes. They don't make any sense to me, as in UTF-8 an ö should be encoded with two bytes.

What I have tried

I have tried converting the character array to a string, then getting the bytes of that string and passing them to another String constructor with a charset encoding parameter. I have also played with CharBuffers and tried to find something in the Locale class that might solve this problem, but nothing I try seems to work.
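One likely cause, offered as a sketch rather than a confirmed diagnosis: FileReader decodes the file with the platform default charset before the parser ever sees it, so the encoding='ISO-8859-1' declaration is never applied to the raw bytes. If the default is UTF-8, the lone f6 byte is invalid and gets replaced with U+FFFD, which happens to take three bytes in UTF-8. Handing the parser a byte stream lets it detect the encoding itself. The class name Latin1SaxDemo and the helper parseText below are hypothetical, and the XML is built in memory so the example is self-contained:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class Latin1SaxDemo {

    /** Hypothetical helper: parses raw XML bytes and returns the collected character data. */
    public static String parseText(byte[] xmlBytes) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The chars are already decoded; no charset conversion is needed here.
                text.append(ch, start, length);
            }
        };
        // A byte stream (not a Reader) lets the parser honour the XML encoding declaration.
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // \u00f6 is ö; written as an escape so the source file's own encoding does not matter.
        String xml = "<?xml version='1.0' encoding='ISO-8859-1'?><x>Mot\u00f6rhead</x>";
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1); // ö becomes the single byte f6
        String text = parseText(latin1Bytes);
        System.out.println("Equals expected: " + "Mot\u00f6rhead".equals(text));
    }
}
```

In the original program, the equivalent change would presumably be replacing new FileReader(new File(FILE)) with a java.io.FileInputStream on the same file. Note also that a Java String holds UTF-16 code units internally, so there is no "UTF-8 String" to convert to; an encoding only comes into play when converting to or from bytes.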
 