I met some problems when I used PDFBOX to extract text. There are Tyep3 embedded fonts in my PDF, but the numbers cannot be displayed normally when extracting this part. Can someone give me some guidance? thank you
My version is 2.0.22
The correct output is [USD-001], the wrong output is [USD- ]
public static String readPDF(File file) throws IOException {
RandomAccessBufferedFileInputStream rbi = null;
PDDocument pdDocument = null;
String text = "";
try {
rbi = new RandomAccessBufferedFileInputStream(file);
PDFParser parser = new PDFParser(rbi);
parser.setLenient(false);
parser.parse();
pdDocument = parser.getPDDocument();
PDFTextStripper textStripper = new PDFTextStripper();
text = textStripper.getText(pdDocument);
} catch (IOException e) {
e.printStackTrace();
} finally {
rbi.close();
}
return text;
}
I tried to use PDFBOX to convert the PDF to an image and found that everything was fine. I just wanted to get it as normal text
PDFDebugger output
The pdf file : http://tmp.link/f/6249a07f6e47f
There are a number of aspects of this file making text extraction difficult.
First of all the font itself boycotts text extraction. In its ToUnicode stream we find the mappings:
1 begincodespacerange
<00> <ff> endcodespacerange
2 beginbfchar
<22> <0000> <23> <0000> endbfchar
I.e. the two character codes of interest both are mapped to U+0000, not to U+0030 ('0') and U+0031 ('1') as they should have been.
Also the Encoding is not helping at all:
<</Type/Encoding/Differences[ 0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g121/g122]>>
The glyph names /g121 and /g122 don't have a standardized meaning either.
PdfBox for text extraction works with these two properties of a font and, therefore, fails here.
Adobe Acrobat, on the other hand, also makes use of ActualText during text extraction.
In the file there are such entries. Unfortunately, though, they are erroneous, like this for the digit '0':
/P <</MCID 23>>/Span <</ActualText<FEFF0030>>>BDC
The BDC instruction only expects a single name and a single dictionary. The above sequence of name, dictionary, name, and dictionary, therefore, is invalid.
Due to that Adobe Acrobat also used to not extract the actual text here. Only recently, probably as recently as the early 2022 releases, Acrobat started extracting a '0' here.
Actually one known "trick" to prevent one's PDFs to be text extracted by regular text extractor programs is to add incorrect ToUnicode and Encoding information but correct ActualText entries.
So it's possible the error in your file is actually an application of this trick, maybe even by design with the erroneous ActualText twist to lead text extractors with some ActualText support astray while still allowing copy&paste from Adobe Acrobat.
Related
How can I get infromation about the structure of pdf, I mean text or pic? I need my programm to move pdf without text in other folder, but now I'm getting just an empty txt file.
try (FileWriter writer = new FileWriter(outputFile)) {
PDDocument document = new PDDocument().load(file);
PDFTextStripper pdfTextStripper = new PDFTextStripper();
String text = pdfTextStripper.getText(document);
writer.write(text);
document.close();
} catch (IOException e){
e.printStackTrace();
}
Also, have a problem with getting text from saved in pdf web-pages. It looks like:
I think there is something wrong with encoding, but don't know what to do
Your code works alright, your text viewer assumes a wrong encoding.
Using your code and the same PDFBox version as you I get proper extracted text:
But when I force my viewer to assume UTF-16 encoding, I get something very similar to what you get:
The file itself does not indicate any specific encoding by a BOM or anything:
Thus, your text viewer either incorrectly guesses UTF-16 encoding or is configured to use it.
Thus, either switch your text viewer to use UTF-8 or explicitly tell your FileWriter to use UTF-16.
Depending on your specific installation, the file encoding might actually be different. As my UTF-16 view looks so very much like yours, though, the encoding very likely is at least similar to UTF-8, probably some ISO 8859-x...
I am using PDFBOX as a utility in my selenium automation for export testing . We are comparing actual exported pdf file with the expected ones using pdfbox and then pass/fail test accordingly. This works pretty much smoothly . However recently I came across actual exported file , which looks as same as expected one (as far as data is concerned) , however when comparing it with pdfbox , it is failing
Expected pdf file
Actual pdf file
Below is the general utility i am using to compare pdf files
private static void arePDFFilesEqual(File pdfFile1, File pdfFile2) throws IOException
{
LOG.info("Comparing PDF files ("+pdfFile1+","+pdfFile2+")");
PDDocument pdf1 = PDDocument.load(pdfFile1);
PDDocument pdf2 = PDDocument.load(pdfFile2);
PDPageTree pdf1pages = pdf1.getDocumentCatalog().getPages();
PDPageTree pdf2pages = pdf2.getDocumentCatalog().getPages();
try
{
if (pdf1pages.getCount() != pdf2pages.getCount())
{
String message = "Number of pages in the files ("+pdfFile1+","+pdfFile2+") do not match. pdfFile1 has "+pdf1pages.getCount()+" no pages, while pdf2pages has "+pdf2pages.getCount()+" no of pages";
LOG.debug(message);
throw new TestException(message);
}
PDFTextStripper pdfStripper = new PDFTextStripper();
LOG.debug("pdfStripper is :- " + pdfStripper);
LOG.debug("pdf1pages.size() is :- " + pdf1pages.getCount());
for (int i = 0; i < pdf1pages.getCount(); i++)
{
pdfStripper.setStartPage(i + 1);
pdfStripper.setEndPage(i + 1);
String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!pdf1PageText.equals(pdf2PageText))
{
String message = "Contents of the files ("+pdfFile1+","+pdfFile2+") do not match on Page no: " + (i + 1)+" pdf1PageText is : "+pdf1PageText+" , while pdf2PageText is : "+pdf2PageText;
LOG.debug(message);
System.out.println("fff");
LOG.debug("pdf1PageText is " + pdf1PageText);
LOG.debug("pdf2PageText is " + pdf2PageText);
String difference = StringUtils.difference(pdf1PageText, pdf2PageText);
LOG.debug("difference is "+difference);
throw new TestException(message+" [[ Difference is ]] "+difference);
}
}
LOG.info("Returning True , as PDF Files ("+pdfFile1+","+pdfFile2+") get matched");
} finally {
pdf1.close();
pdf2.close();
}
}
Eclipse shows this differences in console
https://s3.amazonaws.com/uploads.hipchat.com/95223/845692/9Ex0QW2fFeRqu8s/upload.png
I can see it is failing because of symbols like (curley braces , {} , hash # , exclamation mark !) however i don't know how to fix this one ..
Can anyone please tell me how to fix this one ?
However recently I came across actual exported file , which looks as same as expected one (as far as data is concerned) , however when comparing it with pdfbox , it is failing
That this might happen, should not surprise you. After all your test does not compare the looks of the pages in question but the results of text extraction.
While the look of textual data on the pages depends on the drawing instructions for the glyphs in question in the respective (in case of your files) embedded font file, the result of text extraction of the same textual data on the pages depends on the ToUnicode table or Encoding value of the PDF font information structures for that font file.
And indeed, while the textual data of the expected and the actual document use the same glyphs of the respective fonts, the ToUnicode tables in the expected and the actual document for one font claim that certain glyphs represent different Unicode code points.
The font in question has these three glyphs:
The ToUnicode map for that font in your expected document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F125> <F128> ]
which claim that these three characters correspond to U+0000, U+F125, and U+F128.
The ToUnicode map for that font in your actual document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F126> <F129> ]
which claim that these three characters correspond to U+0000, U+F126, and U+F129.
Thus, your test correctly has found a difference between expected and actual document, so its failure result is correct. Thus, you don't have to fix anything, the software producing the actual document has an issue!
(One could argue that the differences are inside Unicode private use areas and don't matter. In that case you'd have to update your test to ignore differences of characters from Unicode private use areas. But that should have been told you before you started creating tests.)
This is a tough one, since similar or even the same Unicode characters might have different byte representation, depending on font, encoding and other factors during PDF generation.
A possible solution I can think of if you can safely assume that the relevant text pieces are represented by 8 bit characters:
String stripUnicode(String s) {
StringBuilder sb = new StringBuilder(s.length());
for (char c : s.toCharArray()) {
if (c <= 0xFF) {
sb.append(c);
}
}
return sb.toString();
}
...
String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!stripUnicode(pdf1PageText).equals(stripUnicode(pdf2PageText)))
...
If you need Unicode support, you need to implement your own custom comparison algorithm that is able to identify similar characters and treat them as equal.
I require to replace a HWPFDocument paragraph text of .doc file if it contains a particular text using java. It replaces the text. But the process writes the output text in a strange way. Please help me to rectify this issue.
Code snippet used:
public static HWPFDocument processChange(HWPFDocument doc)
{
try
{
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++)
{
Paragraph paragraph = range.getParagraph(i);
if (paragraph.text().contains("Place Holder"))
{
String text = paragraph.text();
paragraph.replaceText(text, "*******");
}
}
}
catch (Exception ex)
{
ex.printStackTrace();
}
return doc;
}
Input:
Place Holder
Textvalue1
Textvalue2
Textvalue3
Output:
*******Textvalue1
Textvalue1
Textvalue2
Textvalue3
The HWPF library is not in a perfect state for changing / writing .doc files. (At least at the last time that I looked. Some time ago I developed a custom variant of HWPF for my client which - among many other things - provides correct replace and save operations, but that library is not publicly available.)
If you absolutely must use .doc files and Java you may get away by replacing with strings of exactly same length. For instance "12345" -> "abc__" (_ being spaces or whatever works for you). It might make sense to find the absolute location of the to be replaced string in the doc file (using HWPF) and then changing it in the doc file directly (without using HWPF).
Word file format is very complicated and "doing it right" is not a trivial task. Unless you are willing to spend many man months, it will also not be possible to fix part of the library so that just saving works. Many data structures must be handled very precisely and a single "slip up" lets Word crash on the generated output file.
I am using pdfbox to fill up a form in my pdf file, application is able to show number of available fields on the form but it returns the following error
Messages:
Error: Don't know how to calculate the position for non-simple fonts
File: org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java
Line number: 616
Code
.....
while (fieldsIter.hasNext()) {
PDField field = (PDField) fieldsIter.next();
setField(pdf, field.getPartialName(), "My input");
//setField(pdf, field.getFullyQualifiedName(), "My input");
}
.....
public void setField(PDDocument pdfDocument, String name, String value) throws
IOException {
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField(name);
if (field != null) {
field.setValue(value);
} else {
System.err.println("No field found with name:" + name);
}
}
Please let me know if you need any other part of the code.
There seems to be a bug in pdfbox that occurs in various circumstances where it can't find the relevant font. The only workaround I've been able to find is just skipping the update appearance code that's run in PDTextbox.setValue. You can "force" the value to be updated by just doing:
COSString fieldValue = new COSString("Awesome field value");
textbox.getDictionary().setItem(COSName.V, fieldValue);
Presumably, in all but corner cases, the PDF viewer can handle rendering the font, and just setting the V item for the field should suffice. Subjectively the documents I've generated open fine in Acrobat and OS X preview.
Related issue:
https://issues.apache.org/jira/browse/PDFBOX-1550
Edit to add: by default acrobat seems to create fields with a visible text area size of 0. For documents with this problem, you can get around this by adding textbox.getDictionary().setItem(COSName.AP, null); and hoping that the reader can handle rendering the appearance correctly.
Use another PDF if that works means PDFBox is not compatible with the first pdf.
I had the same issue with a PDF and I ended up solving it by editing the PDF form field (with Abobe Acrobat Pro) and setting an specific font.
The problem was that the problematic field didn't specify any font at all.
Hope it helps!
I also had this issue, the problem was that the PDF did not embed the font used.
Moreover the Preflight tools from Acrobat pro seemed unable to fix the PDF. I ended up recreating the PDF and now it is working fine.
This happend to me by using Adobe Pro and merging to PDFs into one. After this the resulting PDF was not able to be used with PDFBox also the original PDFs worked. After a little research i found out, that the merge process destryos the font information. Just reset the fonts and it should work!
Best regards!
I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text itself that I am extracting. However I cannot seem to find any way of getting that information.
Is it possible at all to use PDFBox to get the colour information of a document and if so, how would I go about doing so?
Many thanks.
All color informations should be stored in the class PDGraphicsState and the used color (stroking/nonstroking etc.) depends on the used text rendering mode (via pdfbox mailing list).
Here is a small sample I tried:
After creating a pdf with just one line ("Sample" written in RGB=[146,208,80]), the following program will output:
DeviceRGB
146.115
208.08
80.07
Here's the code:
PDDocument doc = null;
try {
doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
engine.processStream(page, page.findResources(), page.getContents().getStream());
PDGraphicsState graphicState = engine.getGraphicsState();
System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
float colorSpaceValues[] = graphicState.getStrokingColor().getColorSpaceValue();
for (float c : colorSpaceValues) {
System.out.println(c * 255);
}
}
finally {
if (doc != null) {
doc.close();
}
Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.
As I understand it, as PDFStreamEngine processes a page stream, it sets various variable states depending on what operators it is processing at the moment. So when it hits green text, it will change the PDGraphicsState because it will encounter appropriate operators. So for CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace as defined by mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor and so on.
In this case, the PDGraphicsState hasn't changed because the document has just text and the text it has is in just one style. For something more advanced, you would need to extend PDFStreamEngine (just like PageDrawer, PDFTextStripper and other classes do) to do something when color changes. You could also write your own mappings in your own .properties file.