Parsing PDF file using Apache PDFBox

Jonah · Apr 1, 2013

I am trying to modify the contents of a PDF document using PDFBox. I used this example as it is, but observed that the text in my PDF file is getting split at the character level (or worse). For example, the string \[code\]EM? what it is:\[/code\] gets split into:

\[code\]COSString{E}COSString{M?}COSString{ }COSString{w}COSString{hat }COSString{it }COSString{is}COSString{:}\[/code\]

(This is what I see when printing the \[code\]cosString\[/code\] in the above-mentioned code.) As far as I can see, there are only Latin characters in the file, and the encoding is ISO-8859-1.

Any ideas?

Regards,
Salil
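To make the observation concrete: the fragments printed above, concatenated in order, give back the original string, so nothing is lost, only split. Below is a minimal sketch using plain Java strings as stand-ins for the printed COSString values. It makes no PDFBox calls, and the class and method names (`FragmentJoin`, `join`) are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class FragmentJoin {
    // Stand-ins for the COSString values printed from the content stream,
    // in the order they appeared (copied from the output quoted above).
    static final List<String> FRAGMENTS =
            Arrays.asList("E", "M?", " ", "w", "hat ", "it ", "is", ":");

    // Concatenate the fragments in order, with no separator.
    static String join(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String fragment : fragments) {
            sb.append(fragment);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Compare the rejoined text with the string as it reads in the PDF.
        String joined = join(FRAGMENTS);
        System.out.println(joined);
        System.out.println(joined.equals("EM? what it is:"));
    }
}
```

So any search-and-replace on the text would apparently have to work across several string tokens rather than on one token at a time.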
