Java - PDFBox - ReplaceString - Issues with parsed tokens (possibly encoding?)

I've been struggling with an issue related to PDFBox and PDF editing. My task is to edit a couple of strings in a given PDF file and to output a mirrored copy of the file with the edited strings in it. I've been told that this problem has been solved with PDFBox in the past, so I'm expected to do the same. The function I am using is this:
public void doIt(String inputFile, String outputFile, String strToFind, String message)
        throws IOException, COSVisitorException
{
    // the document
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load(inputFile);
        List pages = doc.getDocumentCatalog().getAllPages();
        for (int i = 0; i < pages.size(); i++)
        {
            PDPage page = (PDPage) pages.get(i);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();
            for (int j = 0; j < tokens.size(); j++)
            {
                Object next = tokens.get(j);
                if (next instanceof PDFOperator)
                {
                    PDFOperator op = (PDFOperator) next;
                    // Tj and TJ are the two operators that display
                    // strings in a PDF
                    if (op.getOperation().equals("Tj"))
                    {
                        // Tj takes one operand, and that is the string
                        // to display, so let's update that operand
                        COSString previous = (COSString) tokens.get(j - 1);
                        String string = previous.getString();
                        string = string.replaceFirst(strToFind, message);
                        previous.reset();
                        previous.append(string.getBytes("ISO-8859-1"));
                    }
                    else if (op.getOperation().equals("TJ"))
                    {
                        COSArray previous = (COSArray) tokens.get(j - 1);
                        for (int k = 0; k < previous.size(); k++)
                        {
                            Object arrElement = previous.getObject(k);
                            if (arrElement instanceof COSString)
                            {
                                COSString cosString = (COSString) arrElement;
                                String string = cosString.getString();
                                string = string.replaceFirst(strToFind, message);
                                cosString.reset();
                                cosString.append(string.getBytes("ISO-8859-1"));
                            }
                        }
                    }
                }
            }
            // now that the tokens are updated we will replace the
            // page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
            page.setContents(updatedStream);
        }
        doc.save(outputFile);
    }
    finally
    {
        if (doc != null)
        {
            doc.close();
        }
    }
}
This is the code from the ReplaceString example shipped with PDFBox (https://svn.apache.org/repos/asf/pdfbox/tags/1.5.0/pdfbox/src/main/java/org/apache/pdfbox/examples/pdmodel/ReplaceString.java).
The file I have been given, however, is not modified at all by this function; nothing happens. Upon further inspection, I decided to analyze the sequence of tokens produced by the parser. Everything is parsed correctly except the COSString elements, which contain gibberish characters that look wrongly encoded (a bunch of random symbols and numbers). I tried parsing other documents, and the function works with some of them, but not with everything I passed as input: a LaTeX output file was modified correctly and had correctly encoded COSStrings, whereas other automatically generated PDFs produced no results and gibberish COSString content. I am also fairly sure the rest of the structure is read correctly, since I rebuild the output into a different file and the output file looks exactly the same as the input, which seems to mean that the file structure is analyzed correctly. The file contains Identity-H encoded fonts.
I tried parsing the very same file using PDFTextStripper (which extracts text from PDFs), and it returns the correct text output, using this:
PDFTextStripper pdfStripper = new PDFTextStripper("UTF-8");
String result = pdfStripper.getText(doc);
System.out.println(result);
Could it be an encoding issue? Can I tell the PDFStreamParser (or whoever holds the responsibility) to force an encoding on read? Is it even an encoding issue, given that the text extraction works correctly?
Thanks in advance for the help.

Some files use font subsets. Let's say that the subset uses only the characters E, G, L and O, mapped to the codes 1, 2, 3 and 4. GOOGLE would then appear in the file as the byte values 2, 4, 4, 2, 3 and 1.
Now if you want to change GOOGLE into APPLE you'll have three problems:
1) your subset doesn't contain the characters A and P
2) the size will be different
3) it is quite possible that the string you're searching for is split into several parts
Btw the current version is 1.8.10. The ReplaceString utility has been removed in the upcoming 2.0 version to avoid giving the illusion that characters can easily be replaced.
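To see this for yourself, you can dump the raw bytes of every COSString in the content stream; with Identity-H fonts those bytes are 2-byte glyph IDs rather than text, which is why getString() shows gibberish. A rough diagnostic sketch against the 1.x API used above (the file name is a placeholder):
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public class DumpStrings
{
    public static void main(String[] args) throws Exception
    {
        PDDocument doc = PDDocument.load("input.pdf"); // placeholder
        try
        {
            List pages = doc.getDocumentCatalog().getAllPages();
            for (Object p : pages)
            {
                PDPage page = (PDPage) p;
                PDFStreamParser parser = new PDFStreamParser(page.getContents().getStream());
                parser.parse();
                for (Object token : parser.getTokens())
                {
                    if (token instanceof COSString)
                    {
                        dump((COSString) token); // Tj operand
                    }
                    else if (token instanceof COSArray)
                    {
                        COSArray arr = (COSArray) token; // TJ operand
                        for (int k = 0; k < arr.size(); k++)
                        {
                            if (arr.getObject(k) instanceof COSString)
                            {
                                dump((COSString) arr.getObject(k));
                            }
                        }
                    }
                }
            }
        }
        finally
        {
            doc.close();
        }
    }

    private static void dump(COSString s)
    {
        // for a simply encoded font these bytes are close to ASCII;
        // for Identity-H they are glyph IDs with no public reverse mapping
        System.out.println(Arrays.toString(s.getBytes()));
    }
}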
This answer is somewhat speculative, because you haven't linked to a PDF.

Inside a PDF, text can be stored in two places:
Content stream
XObject inside the page resources
Inside the content stream, text is mostly associated with the Tj or TJ operator. But the text associated with Tj or TJ is not always ASCII; it may be raw byte values. We can extract text from these byte values by mapping character codes to Unicode values using the proper encoding and mapping. While extracting text we use that encoding and mapping, but we have no reverse mapping to tell which character code a given glyph belongs to. So basically we should replace the character codes of the string to be replaced with the character codes of the new string.
Example:
1. (Text) Tj
2. (12 45 5 3) Tj
Also, we should replace the string in the content stream as well as in the XObjects (if present) inside the resources.
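To make that concrete, here is a minimal, hypothetical helper for the byte-level replacement. It assumes you have already worked out both sequences of character codes (e.g. the 12 45 5 3 bytes above); deriving those codes from the font's CMap is the hard part and is not shown here.
// Replace the first occurrence of 'find' inside 'data' with 'repl'.
// All three arrays hold character codes, not Unicode text.
static byte[] replaceFirstCodes(byte[] data, byte[] find, byte[] repl) {
    outer:
    for (int i = 0; i <= data.length - find.length; i++) {
        for (int j = 0; j < find.length; j++) {
            if (data[i + j] != find[j]) {
                continue outer;
            }
        }
        byte[] out = new byte[data.length - find.length + repl.length];
        System.arraycopy(data, 0, out, 0, i);
        System.arraycopy(repl, 0, out, i, repl.length);
        System.arraycopy(data, i + find.length, out, i + repl.length,
                data.length - i - find.length);
        return out;
    }
    return data; // pattern not found, leave the string untouched
}
The result would then go back through cosString.reset() and cosString.append(...), as in the example program above.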
So I think this might be helpful.
Good luck!

Related

iText unable to read whitespace in PDF using Java

I am trying to read a PDF file through iText. The program successfully reads the PDF file but fails to include the spaces.
Program:
public void parse(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    PdfReaderContentParser pdfReaderContentParser = new PdfReaderContentParser(reader);
    TextExtractionStrategy strategy = null;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        String text = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
        System.out.println(text);
    }
}
Here is the data I need to get from the PDF (it is shown as an image in the original post). When the program reads the PDF, the output is:
DATE MODE PARTICULARS DEPOSITS WITHDRAWALS BALANCE
01-04-2017 B/F 54,396.82
As you can see in the image, DATE is 01-04-2017, MODE is empty, PARTICULARS is B/F, DEPOSITS and WITHDRAWALS are also empty, and BALANCE is 54,396.82.
I need the same data in text form, e.g.:
DATE        MODE    PARTICULARS    DEPOSITS    WITHDRAWALS    BALANCE
01-04-2017          B/F                                       54,396.82
Need help, thanks in advance.
You are extracting text from the PDF, and the result is correct: it is not missing spaces, because there are no spaces in the raw text.
However (I missed that earlier, so I'm editing), you are using a LocationTextExtractionStrategy, which is "table-aware". This is good, but at the end getTextFromPage discards that table-aware information.
So instead you could create your own strategy implementation that extends LocationTextExtractionStrategy and adds a getTabulatedText() method to emit the text with spaces inserted where you want them. Take inspiration from getResultantText() and see how it inserts a single space between cells; in your code you would insert as many spaces (or tabs) as needed. See this answer for an example.
MyTextExtractionStrategy strategy = new MyTextExtractionStrategy();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    String rawText = PdfTextExtractor.getTextFromPage(reader, i, strategy);
    String tabulatedText = strategy.getTabulatedText();
    System.out.println(tabulatedText);
}
(Maybe there is a strategy implementation that already does this, but I don't know of one.)
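For illustration, here is one shape such a subclass could take. This is a sketch, not a tested implementation: MyTextExtractionStrategy and getTabulatedText() are the hypothetical names used above, and the 4-points-per-space scale and 2-point line tolerance are arbitrary constants you would tune for your document.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class MyTextExtractionStrategy extends LocationTextExtractionStrategy {

    private static class PositionedChunk {
        final String text;
        final float x, y;
        PositionedChunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private final List<PositionedChunk> chunks = new ArrayList<PositionedChunk>();

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        super.renderText(renderInfo);
        // remember each chunk together with the start point of its baseline
        Vector start = renderInfo.getBaseline().getStartPoint();
        chunks.add(new PositionedChunk(renderInfo.getText(),
                start.get(Vector.I1), start.get(Vector.I2)));
    }

    // rebuild the text line by line, padding horizontal gaps with spaces
    public String getTabulatedText() {
        List<PositionedChunk> sorted = new ArrayList<PositionedChunk>(chunks);
        Collections.sort(sorted, new Comparator<PositionedChunk>() {
            public int compare(PositionedChunk a, PositionedChunk b) {
                int byLine = Float.compare(b.y, a.y); // top of page first
                return byLine != 0 ? byLine : Float.compare(a.x, b.x);
            }
        });
        StringBuilder sb = new StringBuilder();
        PositionedChunk prev = null;
        for (PositionedChunk c : sorted) {
            if (prev != null) {
                if (Math.abs(prev.y - c.y) > 2f) { // 2pt tolerance: new line
                    sb.append('\n');
                } else {
                    // roughly one space per 4pt of gap, at least one
                    int pad = Math.max(1, (int) ((c.x - prev.x) / 4f) - prev.text.length());
                    for (int s = 0; s < pad; s++) {
                        sb.append(' ');
                    }
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }
}
Note that reusing one instance across pages accumulates chunks from all of them; create a fresh instance per page if you want per-page output.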

Reading UTF-16 produces unexpected results

I use the beaglebuddy Java library in an Android project for reading/writing ID3 tags of mp3 files. I'm having an issue with reading the text that was previously written using the same library and could not find anything related in their docs.
Assume I write the following info:
MP3 mp3 = new MP3(pathToFile);
mp3.setLeadPerformer("Jon Skeet");
mp3.setTitle("A Million Rep");
mp3.save();
Looking at the source code of the library, I see that UTF-16 encoding is explicitly set, internally it calls
protected ID3v23Frame setV23Text(String text, FrameType frameType) {
    return this.setV23Text(Encoding.UTF_16, text, frameType);
}
and
protected ID3v23Frame setV23Text(Encoding encoding, String text, FrameType frameType) {
    ID3v23FrameBodyTextInformation frameBody = null;
    ID3v23Frame frame = this.getV23Frame(frameType);
    if (frame == null) {
        frame = this.addV23Frame(frameType);
    }
    frameBody = (ID3v23FrameBodyTextInformation) frame.getBody();
    frameBody.setEncoding(encoding);
    frameBody.setText(encoding == Encoding.UTF_16 ? Utility.getUTF16String(text) : text);
    return frame;
}
At a later point, I read the data back and it gives me some weird Chinese characters:
mp3.getLeadPerformer(); // 䨀漀渀 匀欀攀攀琀
mp3.getTitle(); // 䄀 䴀椀氀氀椀漀渀 刀攀瀀
I took a look at the built-in Utility.getUTF16String(String) method:
public static String getUTF16String(String string) {
    String text = string;
    byte[] bytes = string.getBytes(Encoding.UTF_16.getCharacterSet());
    if (bytes.length < 2 || bytes[0] != -2 || bytes[1] != -1) {
        byte[] bytez = new byte[bytes.length + 2];
        bytes[0] = -2;
        bytes[1] = -1;
        System.arraycopy(bytes, 0, bytez, 2, bytes.length);
        text = new String(bytez, Encoding.UTF_16.getCharacterSet());
    }
    return text;
}
I'm not quite getting the point of setting the first two bytes to -2 and -1 respectively. Is this a marker stating that the string is UTF-16 encoded?
However, when I explicitly call this method while reading the data, the result is readable but always has some cryptic characters prepended at the start:
Utility.getUTF16String(mp3.getLeadPerformer()); // ��Jon Skeet
Utility.getUTF16String(mp3.getTitle()); // ��A Million Rep
Since the count of those characters seems to be constant, I created a temporary workaround by simply cutting them off.
Fields like "comments" where the author does not explicitly enforce UTF-16 when writing are read without any issues.
I'm really curious about what's going on here and appreciate any suggestions.
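For context: bytes -2 and -1 are 0xFE 0xFF, the UTF-16 big-endian byte order mark (BOM), and Chinese-looking output is exactly what you get when UTF-16BE bytes are decoded with the wrong endianness. A minimal, plain-JDK sketch (independent of the beaglebuddy library) demonstrating both effects:
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // Java's "UTF-16" charset writes a big-endian BOM: 0xFE 0xFF, i.e. -2, -1 as signed bytes
        byte[] withBom = "Jon Skeet".getBytes(StandardCharsets.UTF_16);
        System.out.printf("%02X %02X%n", withBom[0], withBom[1]); // prints: FE FF

        // decoding big-endian UTF-16 bytes as little-endian swaps every byte pair,
        // which lands in the CJK range and reproduces the garbled strings above
        byte[] be = "Jon Skeet".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(new String(be, StandardCharsets.UTF_16LE)); // 䨀漀渀 匀欀攀攀琀
    }
}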

JESS Userfunction writes "BS" instead of "/home" to a file

I'm using JESS for my expert system implementation and I have a userfunction. It writes some strings to a text file.
public Value call(ValueVector vv, Context context) throws JessException {
    Rete engine = context.getEngine();
    int size = vv.size();
    String[] params = new String[size - 1];
    for (int i = 0; i < size - 1; i++)
        params[i] = vv.get(i + 1).stringValue(context);
    engine.eval("(printout file " + params[2] + ")");
    return new Value(params[1], RU.STRING);
}
params[2] contains /home/username/folder. When it prints out to a file, I get the following in the file (the BS is displayed with a black background, btw):
BSusername/folder
I'm not sure what's going on here. Any ideas?
In addition, I've never had this problem when I print out from JESS code.
The unquoted text /home/ is being parsed as a regular expression; the printed value is somewhat unpredictable. You need to include double quotes in your built-up command so the path is seen as a quoted string.
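A minimal sketch of that fix, reusing the params array from the question: add escaped double quotes around the path inside the command string, so Jess sees a quoted string instead of bare /home/... text.
// builds: (printout file "/home/username/folder")
engine.eval("(printout file \"" + params[2] + "\")");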

Apache POI: find characters in Word document without spaces

I want to read the number of characters without spaces in a Word document using Apache POI.
I can get the number of characters with spaces using the SummaryInformation.getCharCount() method as in the following code:
public void countCharacters() throws FileNotFoundException, IOException {
    File wordFile = new File(BASE_PATH, "test.doc");
    POIFSFileSystem p = new POIFSFileSystem(new FileInputStream(wordFile));
    HWPFDocument doc = new HWPFDocument(p);
    SummaryInformation props = doc.getSummaryInformation();
    int numOfCharsWithSpaces = props.getCharCount();
    System.out.println(numOfCharsWithSpaces);
}
However there seems to be no method for returning the number of characters without spaces.
How do I find this value?
If you want to base this on the metadata of the document, all you will get is estimates (according to the Microsoft specs). There are essentially two values which you can play around with:
GKPIDSI_CHARCOUNT (which is what you already accessed in your own code sample)
GKPIDDSI_CCHWITHSPACES
Don't ask me about the exact differences of those two values, though. I haven't designed this stuff...
Below is a code sample to illustrate the access to them (GKPIDDSI_CCHWITHSPACES is a little awkward):
HWPFDocument document = [...];
SummaryInformation summaryInformation = document.getSummaryInformation();
System.out.println("GKPIDSI_CHARCOUNT: " + summaryInformation.getCharCount());

DocumentSummaryInformation documentSummaryInformation = document.getDocumentSummaryInformation();
Integer count = null;
for (Property property : documentSummaryInformation.getProperties()) {
    if (property.getID() == 0x11) { // 0x11 is the property ID of GKPIDDSI_CCHWITHSPACES
        count = (Integer) property.getValue();
        break;
    }
}
System.out.println("GKPIDDSI_CCHWITHSPACES: " + count);
The moment at which Word's internal algorithm that updates those values kicks in is rather unpredictable to me. So what you see in Word's own statistics may not necessarily be the same as when running the above code.
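If a metadata estimate is not good enough, a possible alternative (a sketch of my own, not part of the answer above) is to extract the document text with WordExtractor and count the non-whitespace characters directly:
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

HWPFDocument doc = [...];
WordExtractor extractor = new WordExtractor(doc);
String allText = extractor.getText();
// strip all whitespace (spaces, tabs, line breaks) and count what remains
int charsWithoutSpaces = allText.replaceAll("\\s+", "").length();
System.out.println("Characters without spaces: " + charsWithoutSpaces);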

How can I parse all elements of an HTML file with Jsoup?

File input = new File("1727209867.htm");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.facebook.com/people/Alison-Vella/1727209867");
I am trying to parse this HTML file, which is saved locally on my system, but the parse does not cover all of the HTML, so I can't reach the information I need. With this code, parsing only works for 6k characters, while the HTML file actually has 60k characters.
This is not possible in Jsoup itself, but there is a workaround:
final File input = new File("example.html");
final int maxLength = 6000; // limit of chars to read
InputStream is = new FileInputStream(input); // open the file for reading
StringBuilder sb = new StringBuilder(maxLength); // init the "buffer" with the required size
int count = 0; // count of chars read
int c; // char for reading
while ((c = is.read()) != -1 && count < maxLength) // read a single char until the limit is reached
{
    sb.append((char) c); // save the char into the buffer
    count++; // increment the count of chars read
}
Document doc = Jsoup.parse(sb.toString()); // parse the HTML from the buffer
Explained:
Read the file char by char into a buffer until you reach the limit.
Parse the text from the buffer and process it with Jsoup.
Problem: this won't take care of closing tags etc.; it will stop reading exactly at the limit.
(Possible) solutions:
ignore this and stop exactly where you are, then parse it and "fix" or drop the dangling HTML
if you are at the limit, read on until you reach the next closing tag or > character (see the sketch below)
if you are at the limit, read on until you reach the next block tag
if you are at the limit, read on until a specific tag or comment
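A sketch of the second option (my own illustration, under the same assumptions as the workaround above): after hitting the limit, keep reading until the next > so no tag is cut in half.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LimitedParse {
    public static void main(String[] args) throws IOException {
        final int maxLength = 6000; // limit of chars to read
        StringBuilder sb = new StringBuilder(maxLength);
        try (InputStream is = new FileInputStream("example.html")) {
            int c = -1;
            int count = 0;
            // read up to maxLength chars, as in the workaround above
            while ((c = is.read()) != -1 && count < maxLength) {
                sb.append((char) c);
                count++;
            }
            // then keep reading until the current tag is closed
            while (c != -1 && sb.length() > 0 && sb.charAt(sb.length() - 1) != '>') {
                if ((c = is.read()) != -1) {
                    sb.append((char) c);
                }
            }
        }
        Document doc = Jsoup.parse(sb.toString());
        System.out.println(doc.title());
    }
}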
