Preflight validate() is invalid but in Console valid - java

Currently I'm decoding a Base64-encoded file on the console:
base64 -di "myfile.txt" > mypdf.pdf
This produces a valid PDF file.
But when I try this in Java:
DataSource dataSource = new ByteArrayDataSource(
new ByteArrayInputStream(Base64.getDecoder().decode(pdf.getEncodedContent())));
PreflightParser parser = new PreflightParser(dataSource);
parser.parse();
try (PreflightDocument document = parser.getPreflightDocument()) {
document.validate();
return !document.isEncrypted();
}
catch (ValidationException ex) {
return false;
}
I always get a ValidationException (the PDF is reported as not valid).
I think I need to change the configuration. I've already tried the following but that doesn't seem to help:
PreflightConfiguration config = document.getContext().getConfig();
config.setLazyValidation(true);
Error message:
test.pdf is not valid: Unable to parse font metadata due to : Excepted xpacket 'end' attribute (must be present and placed in first)

I've solved this problem. For those who are interested:
The validation worked perfectly; the PDF files were not correct, even though the reader/browser could open them (the PDF reader/browser did not show any warnings or error messages).
Open your PDFs in a text or hex editor and check at least that the first two lines and the last line look like those of a standard PDF:
%PDF-1.7
%µµµµ
...
%%EOF
If not, the PDF was generated incorrectly and the validation will fail.
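The header and trailer check described above can also be done in code before handing the bytes to Preflight. The following is only a rough sketch under stated assumptions: the class and method names are hypothetical, and the 1024-byte search window for the %%EOF marker is an assumption, not something taken from the answer. Passing this check says nothing about PDF/A conformance; it only catches files that are obviously truncated or wrapped in other data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a cheap sanity check, not a PDF validator.
final class PdfEnvelopeCheck {

    /** True if data starts with "%PDF-" and "%%EOF" occurs near the end. */
    static boolean hasPdfHeaderAndTrailer(byte[] data) {
        byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[i] != header[i]) {
                return false;
            }
        }
        // Assumption: look for the end marker in the last 1024 bytes only.
        int tailStart = Math.max(0, data.length - 1024);
        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        String tail = new String(data, tailStart, data.length - tailStart,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n1 0 obj\nendobj\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hasPdfHeaderAndTrailer(sample));
    }
}
```

In the question's code, the decoded byte array from Base64.getDecoder().decode(...) could be passed to such a check first, so that a broken upload is rejected with a clearer message than a generic validation failure.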

Related

PDFBOX digit garble

I ran into a problem when using PDFBox to extract text. My PDF contains embedded Type3 fonts, and the digits set in these fonts are not extracted correctly. Can someone give me some guidance? Thank you.
My PDFBox version is 2.0.22.
The correct output is [USD-001]; the actual output is [USD- ].
public static String readPDF(File file) throws IOException {
    String text = "";
    // try-with-resources closes the input even if the constructor or the parser fails
    // (the original finally block could throw a NullPointerException on rbi.close())
    try (RandomAccessBufferedFileInputStream rbi = new RandomAccessBufferedFileInputStream(file)) {
        PDFParser parser = new PDFParser(rbi);
        parser.setLenient(false);
        parser.parse();
        // the document must be closed as well, otherwise its resources leak
        try (PDDocument pdDocument = parser.getPDDocument()) {
            PDFTextStripper textStripper = new PDFTextStripper();
            text = textStripper.getText(pdDocument);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return text;
}
I tried to use PDFBOX to convert the PDF to an image and found that everything was fine. I just wanted to get it as normal text
PDFDebugger output
The pdf file : http://tmp.link/f/6249a07f6e47f
There are a number of aspects of this file making text extraction difficult.
First of all the font itself boycotts text extraction. In its ToUnicode stream we find the mappings:
1 begincodespacerange
<00> <ff> endcodespacerange
2 beginbfchar
<22> <0000> <23> <0000> endbfchar
I.e. both character codes of interest are mapped to U+0000, not to U+0030 ('0') and U+0031 ('1') as they should have been.
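To spot such mappings without reading the stream by eye, the bfchar entries can be scanned for targets that consist only of zeros. This is a minimal sketch with hypothetical class and method names; it assumes you have already copied the text between beginbfchar and endbfchar (for example from PDFDebugger) into a string, and it does not handle bfrange sections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting the body of a bfchar section.
final class BfCharCheck {

    // One entry is a source code and a target, both written as <hex>.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Returns the source codes (as hex text) whose target is all zeros, i.e. U+0000. */
    static List<String> codesMappedToNul(String bfcharBody) {
        List<String> suspicious = new ArrayList<>();
        Matcher m = PAIR.matcher(bfcharBody);
        while (m.find()) {
            if (m.group(2).matches("0+")) {
                suspicious.add(m.group(1));
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // The two entries quoted from the file's ToUnicode stream.
        System.out.println(codesMappedToNul("<22> <0000> <23> <0000>")); // [22, 23]
    }
}
```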
Also the Encoding is not helping at all:
<</Type/Encoding/Differences[ 0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g121/g122]>>
The glyph names /g121 and /g122 don't have a standardized meaning either.
PDFBox relies on these two properties of a font for text extraction and, therefore, fails here.
Adobe Acrobat, on the other hand, also makes use of ActualText during text extraction.
In the file there are such entries. Unfortunately, though, they are erroneous, like this for the digit '0':
/P <</MCID 23>>/Span <</ActualText<FEFF0030>>>BDC
The BDC instruction only expects a single name and a single dictionary. The above sequence of name, dictionary, name, and dictionary, therefore, is invalid.
Because of that, Adobe Acrobat also used to not extract the actual text here. Only recently, probably as recently as the early 2022 releases, did Acrobat start extracting a '0' here.
Actually, one known "trick" to prevent one's PDFs from being text-extracted by regular text extraction programs is to add incorrect ToUnicode and Encoding information but correct ActualText entries.
So it's possible the error in your file is actually an application of this trick, maybe even by design, with the erroneous ActualText twist meant to lead text extractors with some ActualText support astray while still allowing copy & paste from Adobe Acrobat.
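For reference, the hex string in the ActualText entry quoted above, FEFF0030, is a UTF-16BE text string with a leading byte order mark, and it decodes to the digit '0'. The sketch below shows that decoding with plain Java; the class and method names are hypothetical, and the fallback for strings without a byte order mark is simplified to ISO-8859-1, which is an assumption (PDFDocEncoding differs from it for some codes).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes the content of a PDF hex string such as <FEFF0030>.
final class ActualTextDecode {

    static String decodeHexString(String hex) {
        // An odd number of digits is padded with a trailing 0.
        String digits = (hex.length() % 2 == 0) ? hex : hex + "0";
        byte[] bytes = new byte[digits.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        boolean hasUtf16Bom = bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF;
        if (hasUtf16Bom) {
            // Skip the two BOM bytes and decode the rest as UTF-16BE.
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: simplified stand-in for PDFDocEncoding.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeHexString("FEFF0030")); // 0
    }
}
```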

MalformedByteSequenceException is thrown when trying to print jasper report [duplicate]

public void openReport() {
try {
HashMap params = new HashMap();
params.put("aapor", 19);
JasperReport jasperReport1 = JasperCompileManager.compileReport("C:/Users/emidemi.emidemi-PC/Documents/NetBeansProjects/FleetManager/src/FleetManager/newReport5.jasper");
JasperPrint jasperPrint1 = JasperFillManager.fillReport(jasperReport1, params, conn.getConn());
JRViewer viewer = new JRViewer(jasperPrint1);
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
Above is my script.
This is my error:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
BUILD SUCCESSFUL (total time: 7 seconds)
Does anyone know why this is occurring and how to fix it?
It's a character encoding problem. Have you tried changing the encoding declaration at the beginning of the report?
E.g. for the Central European alphabet, change:
<?xml version="1.0" encoding="UTF-8"?>
to
<?xml version="1.0" encoding="CP1250"?>
You have a list of different character encoding standards here:
http://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings
Hope it works
You are trying to compile a .jasper file, which is already compiled. Replace newReport5.jasper with newReport5.jrxml.
If you want to work with the .jasper file directly, load it like this:
JasperReport jasperReport = (JasperReport)JRLoader.loadObject(new File("filename.jasper"));
When does this exception occur: at compile time or at execution? Usually this problem means that your input IS NOT UTF-8.
If you are entirely sure that it should be UTF-8, try this:
1. Create a NEW EMPTY file and encode it as UTF-8.
2. Copy the whole text from your old file to the new one.
3. Save the new one and check if it works with the new file. If it does, your old file was not proper UTF-8.
4. If not, post your input file (the .jrxml).
When I have problems like this I try to find the offending character; a hex editor helps.
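Instead of searching in a hex editor, the position of the first offending byte can also be found programmatically with a strict UTF-8 decoder. This is a minimal sketch using only java.nio; the class and method names are hypothetical. In practice the byte array would come from something like Files.readAllBytes on the .jrxml file.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: locates the first byte that is not valid UTF-8.
final class Utf8Check {

    /** Returns the byte offset of the first invalid UTF-8 sequence, or -1 if the data is valid. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never produces more chars than it consumes bytes.
        CharBuffer out = CharBuffer.allocate(data.length + 1);
        CoderResult result = decoder.decode(in, out, true);
        // On an error the input position is left at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        // 0xE8 is a single-byte accented letter in several legacy encodings,
        // but here it is not part of a valid UTF-8 sequence.
        byte[] legacy = {'K', 'r', (byte) 0xE8, 'e', 'k'};
        System.out.println(firstInvalidOffset(legacy)); // 2
    }
}
```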

Java - Issue with data extraction from PDF (PDFBox - 2.02)

I am trying to extract data from a PDF file that contains data in separate tables and convert it to Excel. Based on this link, as my need is more or less the same, I am using the PDFBox jar to do the extraction.
To test whether I can extract the data from the different tables in the PDF at all, I tried the code below. It does not extract the data and fails with a "Corrupt object reference" error; I don't know what that means.
To see whether the PDF itself has a problem, I checked it with https://online2pdf.com, which successfully converted the PDF file to Excel, so I believe the PDF file is fine.
I hope the issue is clear. What needs to be done to extract the data from the PDF?
Error message:
2016-07-21 13:49:11 WARN BaseParser:682 - Corrupt object reference at offset 6371
2016-07-21 13:49:11 WARN BaseParser:682 - Corrupt object reference at offset 6373
java.io.IOException: Expected string 'null' but missed at character 'u' at offset 6376
at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1017)
at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1000)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:879)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:651)
at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:479)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at main.Test.readPDF(Test.java:170)
at main.Test.main(Test.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Code :
public static void main(String[] args){
    File filePDF = new File("C:\\test.pdf");
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(filePDF)) {
        PDFTextStripper s = new PDFTextStripper();
        String content = s.getText(document);
        System.out.println(content);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I finally found a library (PDFxStream) that extracts all the data from the PDF in this case. Although it is a paid product, it is able to extract the complete information, which the other paid ones were not able to do.
The only drawback is that it returns the data as a String, which I still need to parse to get the specific information.

How to avoid parsing strange characters

While processing an XML file, the StAX parser encountered the following line:
<node id="281224530" lat="48.8975614" lon="8.7055191" version="8" timestamp="2015-06-07T22:47:39Z" changeset="31801740" uid="272351" user="Krte�?ek">
and as you see there is a strange character at the end of the line, and when the parser reaches that line the program stops and gives me the following error:
Exception in thread "main" javax.xml.stream.XMLStreamException: ParseError
at [row,col]:[338019,145]
Message: Invalid byte 2 of 2-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at com.example.Main.main(Main.java:46)
Is there anything I should change in the Eclipse settings to avoid that error?
Update
code:
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = null;
try {
parser = factory.createXMLStreamReader(in);
} catch (XMLStreamException e) {
// TODO Auto-generated catch block
e.printStackTrace();
Log.d(TAG, "newParser",
"e/createXMLStreamReader: " + e.getMessage());
}
This is not about Eclipse; it is about the encoding of your file. There are two cases:
1) The file is corrupted, i.e. it contains bytes that are not valid in the declared encoding.
2) The file is not actually UTF-8 encoded, while the parser reads it as UTF-8 (as declared in the XML header or by default). In that case, check that you are reading the file contents with the right encoding.
If you edited and saved your XML file in Eclipse, that can cause the problem when Eclipse is not configured to use UTF-8. See this question: How to support UTF-8 encoding in Eclipse
Otherwise you probably don't need to change anything in your code. You just need correctly UTF-8-encoded content.
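For the second case, if you happen to know the real encoding of the file, one option is to decode the bytes yourself and pass the parser a Reader instead of an InputStream, so that the parser receives characters rather than raw bytes. The sketch below is only an illustration: the class and method names are hypothetical, the sample uses ISO-8859-1 purely as a stand-in for whatever the real encoding is, and re-saving the file as proper UTF-8 remains the cleaner fix.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reads the "user" attribute of the first element.
final class XmlReaderDecode {

    static String readUserAttribute(byte[] xml, Charset actualCharset) {
        // Decode with the charset the file was really written in.
        Reader chars = new InputStreamReader(new ByteArrayInputStream(xml), actualCharset);
        try {
            XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(chars);
            try {
                while (parser.hasNext()) {
                    if (parser.next() == XMLStreamConstants.START_ELEMENT) {
                        return parser.getAttributeValue(null, "user");
                    }
                }
                return null;
            } finally {
                parser.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Declared as UTF-8 but written as ISO-8859-1 (assumption for the demo).
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><node user=\"Krte\u00e8ek\"/>";
        byte[] bytes = doc.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Krte\u00e8ek".equals(readUserAttribute(bytes, StandardCharsets.ISO_8859_1)));
    }
}
```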

Freemarker converting HTML ISO tags when reading ftl file

I am trying to output curly quotes in an HTML file that I am generating in Freemarker. The template file contains:
Kevin’s
When the HTML file is generated, it comes out as:
Kevin?s
At first I thought that the issue was happening during the generation of the HTML file. But I was able to track down the conversion to when the template was read in. Does anyone know how to prevent Freemarker from doing this conversion when reading the template? My code for the conversion:
// Freemarker configuration object
Configuration cfg = new Configuration(new Version(2, 3, 21));
try
{
cfg.setDirectoryForTemplateLoading(new File("templates"));
cfg.setDefaultEncoding("UTF-8");
cfg.setTemplateExceptionHandler(TemplateExceptionHandler.HTML_DEBUG_HANDLER);
// Load template from source folder
Template template = cfg.getTemplate("curly.html");
template.setEncoding("UTF-8");
// Build the data-model
Map<String, Object> data = new HashMap<String, Object>();
// Console output
Writer out = new OutputStreamWriter(System.out);
template.process(data, out);
out.flush();
}
catch (IOException e)
{
e.printStackTrace();
}
catch (TemplateException e)
{
e.printStackTrace();
}
If the template file really contained the quote as an HTML entity, the output would contain that same entity (FreeMarker doesn't resolve HTML entities), so I suppose you mean that the curly quote is in the template as a single character. In that case the most probable culprit has nothing to do with FreeMarker: new OutputStreamWriter(System.out). You omitted the encoding parameter of the constructor, so it uses the system default encoding. Even if you do specify it, your console has a fixed encoding (which is not necessarily the system default, BTW). So try writing the output to a file, explicitly specifying UTF-8 for the OutputStreamWriter. If the output is still wrong, check that you really used UTF-8 both to create the template file and to read the output file.
BTW, that template.setEncoding call is not necessary. Remove it.
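The effect of the writer's charset can be reproduced without FreeMarker. The sketch below (hypothetical class and method names, writing into an in-memory buffer rather than a file) sends the same string through an OutputStreamWriter once with US-ASCII, which cannot represent the curly quote, and once with UTF-8. A writer whose charset lacks the character replaces it with '?', which matches the symptom in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: what an OutputStreamWriter does with a curly quote.
final class WriterEncodingDemo {

    /** Writes text through an OutputStreamWriter using the given charset and returns the bytes. */
    static byte[] writeWith(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "Kevin\u2019s"; // U+2019 is the right single quotation mark
        byte[] ascii = writeWith(text, StandardCharsets.US_ASCII);
        byte[] utf8 = writeWith(text, StandardCharsets.UTF_8);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // Kevin?s
        System.out.println(text.equals(new String(utf8, StandardCharsets.UTF_8))); // true
    }
}
```

In the question's code, the equivalent change would be to pass a charset as the second argument of the OutputStreamWriter constructor and to write to a file rather than to System.out.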
