I have a Tibetan PDF file and I want to extract its content. I tried the following three programs to read the file, but the output I got isn't what I wanted.
Code 1:
import java.io.IOException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
public class iTextReadDemo {
public static void main(String[] args) {
try {
PdfReader reader = new PdfReader("");
String page = PdfTextExtractor.getTextFromPage(reader, 1);
System.out.println("Page Content:\n\n" + page + "\n\n");
} catch (IOException e) {
e.printStackTrace();
}
}
}
// See more at:
// http://www.quicklyjava.com/read-pdf-file-in-java-using-itext/#sthash.iAhF00Kj.dpuf
Code 2:
import java.io.FileOutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.PageSize;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;
import com.lowagie.text.pdf.PdfWriter;
public class MainClass {
public static void main(String[] args) throws Exception {
PdfReader reader = new PdfReader("");
byte[] bs = new byte[100];
byte[] streamBytes = reader.getPageContent(1);
for(byte b: streamBytes){
System.out.print((char)b);
}
}
}
Code 3:
package pdfBox;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class PDFTest {
public static void main(String[] args) throws Exception {
PDDocument pd;
File input = new File("C:\\Users\\Administrator\\Desktop\\tibetan Dictionary pdf/藏英英藏词典 - 副本.pdf");
pd = PDDocument.load(input);
PDFTextStripper reader = new PDFTextStripper("utf-8");
String pageText = reader.getText(pd);
System.out.println(pageText);
}
}
And this is the relevant part of the Maven POM dependencies:
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itextpdf</artifactId>
<version>5.5.3</version>
</dependency>
<dependency>
<groupId>com.lowagie</groupId>
<artifactId>itext</artifactId>
<version>4.2.1</version>
</dependency>
<dependency>
<groupId>org.swinglabs</groupId>
<artifactId>pdf-renderer</artifactId>
<version>1.0.5</version>
</dependency>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>1.8.7</version>
</dependency>
What is wrong?
Is what this answer says right?
https://answers.acrobatusers.com/Can-I-convert-PDF-Word-Doc-Tibetan-script-addition-English-language-q219757.aspx
The quality of exported content from a PDF is directly related to the quality of the PDF's "build" (what is under the hood, not what you "see"). Poor quality export indicates a poorly built PDF. Nothing you can do other than ask the originator of the PDF to do a better job.
In a Java backend server application I want to decode a QR code embedded into a PDF file using the zxing library.
I adapted the example from https://gist.github.com/JoelGeraci-Datalogics/dd9e214d4c584d61f5b1 to work with the pdfbox library as follows:
ReadBarcodeFromPDF.java
import com.google.zxing.*;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.common.HybridBinarizer;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
public class ReadBarcodeFromPDF {
private static final List<String> urlList = Arrays.asList(
"http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_QRCode.pdf",
"http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_DataMatrix.pdf",
"http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_PDF417.pdf"
);
public static void main(String[] args) throws Exception {
for (String url : urlList) {
readCode(url);
}
}
private static String readCode(String url) throws MalformedURLException, IOException, NotFoundException {
URLConnection connection = new URL(url).openConnection();
connection.connect();
try (InputStream inputStream = connection.getInputStream()) {
try (PDDocument document = PDDocument.load(inputStream)) {
PDFRenderer pdfRenderer = new PDFRenderer(document);
BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300, ImageType.BINARY);
LuminanceSource luminanceSource = new BufferedImageLuminanceSource(bufferedImage);
HybridBinarizer hybridBinarizer = new HybridBinarizer(luminanceSource);
BinaryBitmap bitmap = new BinaryBitmap(hybridBinarizer);
MultiFormatReader reader = new MultiFormatReader();
Hashtable<DecodeHintType, Object> hints = new Hashtable<>();
hints.put(DecodeHintType.POSSIBLE_FORMATS, Arrays.asList(
BarcodeFormat.QR_CODE,
BarcodeFormat.DATA_MATRIX,
BarcodeFormat.PDF_417
));
hints.put(DecodeHintType.PURE_BARCODE, Boolean.FALSE);
hints.put(DecodeHintType.TRY_HARDER, Boolean.TRUE);
reader.setHints(hints);
Result result = reader.decodeWithState(bitmap);
return result.getText();
}
}
}
}
The example is supposed to recognize codes from PDF documents with different code types:
QR Code: http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_QRCode.pdf
Data Matrix: http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_DataMatrix.pdf
PDF417: http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_PDF417.pdf
But in fact it recognizes only the PDF417. The QR Code (which I am interested in) and the Data Matrix (which I am not interested in) are not recognized.
The output is
com.google.zxing.NotFoundException
com.google.zxing.NotFoundException
11/15/2010 Tony Blue
I also tried:
- Different arguments in the line pdfRenderer.renderImageWithDPI(0, 300, ImageType.BINARY): different values for dpi (2nd argument) between 100 and 300, and all available values for imageType (3rd argument)
- The newest version of the library (3.5.0)
pom.xml
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.26</version>
</dependency>
<dependency>
<groupId>com.google.zxing</groupId>
<artifactId>javase</artifactId>
<version>3.3.3</version>
</dependency>
Your code works if you remove this line:
hints.put(DecodeHintType.PURE_BARCODE, Boolean.FALSE);
I suspect that ZXing only checks whether there is an entry for the key, not what its value is. The Javadoc for PURE_BARCODE says: "Doesn't matter what it maps to; use Boolean.TRUE."
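To illustrate the difference between a presence check and a value check on a hints map, here is a minimal, self-contained sketch. It does not use ZXing at all: the Hint enum and both helper methods are hypothetical stand-ins, and the sketch only shows why mapping a key to Boolean.FALSE still counts as "set" when a library tests for the key's presence.

```java
import java.util.EnumMap;
import java.util.Map;

public class HintPresenceDemo {
    // Hypothetical stand-in for a hint enum; this is not the ZXing type.
    enum Hint { PURE_BARCODE, TRY_HARDER }

    // Presence-based check: true whenever the key exists, whatever it maps to.
    static boolean isSetByPresence(Map<Hint, Object> hints, Hint hint) {
        return hints != null && hints.containsKey(hint);
    }

    // Value-based check: true only when the key maps to Boolean.TRUE.
    static boolean isSetByValue(Map<Hint, Object> hints, Hint hint) {
        return hints != null && Boolean.TRUE.equals(hints.get(hint));
    }

    public static void main(String[] args) {
        Map<Hint, Object> hints = new EnumMap<>(Hint.class);
        hints.put(Hint.PURE_BARCODE, Boolean.FALSE);

        // The key is present, so a presence check treats the hint as enabled.
        System.out.println("presence: " + isSetByPresence(hints, Hint.PURE_BARCODE));
        // A value check would treat the same entry as disabled.
        System.out.println("value:    " + isSetByValue(hints, Hint.PURE_BARCODE));

        // Leaving the key out is the only way to disable a presence-checked hint.
        hints.remove(Hint.PURE_BARCODE);
        System.out.println("removed:  " + isSetByPresence(hints, Hint.PURE_BARCODE));
    }
}
```

If a library follows the presence-based pattern, the safe way to turn a hint off is to not put the key into the map at all.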
This seems to be a tough question for Play. Many people have asked it, but it is still not clear how to get the raw bytes of a request body in Java when the content type is not set.
There is a solution in Scala, but it does not work for my case.
I want to use Play to build an HTTP mock server in a Java test.
@BodyParser.Of(BodyParser.Raw.class) has no effect:
package org.dan;
import org.junit.Test;
import play.mvc.BodyParser;
import play.mvc.Result;
import play.server.Server;
//import static play.mvc.Controller.request;
import static play.mvc.Results.ok;
import static play.routing.RoutingDsl.fromComponents;
import static play.server.Server.forRouter;
import static play.mvc.Http.Context.Implicit.request;
public class DemoPlayTest {
@Test
public void run() throws InterruptedException {
Server server = forRouter(
9001,
(components) ->
fromComponents(components)
.POST("/echo")
.routeTo(DemoPlayTest::action)
.build());
Thread.sleep(1111111);
}
@BodyParser.Of(BodyParser.Raw.class)
public static Result action() {
return ok("Gut: " + request().body().asRaw() + "\n");
}
}
Testing:
$ curl -v -X POST -d hello http://localhost:9001/echo
Gut: null
Dependencies:
<play.version>2.6.17</play.version>
<dependency>
<groupId>com.typesafe.play</groupId>
<artifactId>play-server_2.11</artifactId>
<version>${play.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.typesafe.play</groupId>
<artifactId>play-akka-http-server_2.11</artifactId>
<version>${play.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.typesafe.play</groupId>
<artifactId>play-java_2.11</artifactId>
<version>${play.version}</version>
<scope>test</scope>
</dependency>
package org.dan;
import org.junit.Test;
import play.mvc.Result;
import play.server.Server;
import static play.mvc.Controller.request;
import static play.mvc.Results.ok;
import static play.routing.RoutingDsl.fromComponents;
import static play.server.Server.forRouter;
public class DemoPlayTest {
@Test
public void run() throws InterruptedException {
Server server = forRouter(
9001,
(components) ->
fromComponents(components)
.POST("/echo")
.routeTo(DemoPlayTest::action)
.build());
Thread.sleep(1111111);
}
public static Result action() {
final String body = new String(request().body().asBytes().toArray());
return ok("Gut: " + body + "\n");
}
}
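One note on the decoding step in the code above: new String(byte[]) without a charset argument uses the JVM's default charset, so the result can differ between machines. A minimal sketch, assuming the client sends UTF-8 (an assumption; the encoding is not stated here), with a hypothetical helper named decodeUtf8 that is not part of Play:

```java
import java.nio.charset.StandardCharsets;

public class BodyDecodeDemo {
    // Hypothetical helper: decodes raw request bytes as UTF-8 instead of
    // relying on the platform default charset.
    static String decodeUtf8(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "h\u00e9llo" contains one non-ASCII character, which takes two bytes in UTF-8.
        byte[] body = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        String text = decodeUtf8(body);
        System.out.println(body.length + " bytes decoded to " + text.length() + " chars");
    }
}
```

In the action above, the same idea would mean passing StandardCharsets.UTF_8 as the second argument to the String constructor.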
I don't know why this code fails. I am using sample code from iText, but it gives an error even with all dependencies imported. Below is the code I am using in the project. Please help.
https://developers.itextpdf.com/examples/security/digital-signatures-white-paper/digital-signatures-chapter-5
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.Security;
import java.util.ArrayList;
import org.bouncycastle.jce.provider.BouncyCastleProvider;
import com.itextpdf.text.log.LoggerFactory;
import com.itextpdf.text.log.SysoLogger;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.security.PdfPKCS7;
public class PdfReaderExample {
public static final String EXAMPLE1 = "/opt/doc.pdf";
public PdfPKCS7 verifySignature(AcroFields fields, String name) throws GeneralSecurityException, IOException {
System.out.println("Signature covers whole document: " + fields.signatureCoversWholeDocument(name));
System.out.println("Document revision: " + fields.getRevision(name) + " of " + fields.getTotalRevisions());
PdfPKCS7 pkcs7 = fields.verifySignature(name);
System.out.println("Integrity check OK? " + pkcs7.verify());
return pkcs7;
}
public void verifySignatures(String path) throws IOException, GeneralSecurityException {
System.out.println(path);
PdfReader reader = new PdfReader(path);
AcroFields fields = reader.getAcroFields();
ArrayList<String> names = fields.getSignatureNames();
for (String name : names) {
System.out.println("===== " + name + " =====");
verifySignature(fields, name);
}
System.out.println();
}
public static void main(String[] args) throws IOException, GeneralSecurityException {
LoggerFactory.getInstance().setLogger(new SysoLogger());
BouncyCastleProvider provider = new BouncyCastleProvider();
Security.addProvider(provider);
PdfReaderExample app = new PdfReaderExample();
app.verifySignatures(EXAMPLE1);
}
}
===== Signature2 =====
Signature covers whole document: true
Document revision: 1 of 1
Exception in thread "main" java.lang.VerifyError: (class: org/bouncycastle/cms/CMSSignedHelper, method: <clinit> signature: ()V) Incompatible argument to function
at org.bouncycastle.cms.CMSSignedData.<clinit>(Unknown Source)
at org.bouncycastle.tsp.TimeStampToken.getSignedData(Unknown Source)
at org.bouncycastle.tsp.TimeStampToken.<init>(Unknown Source)
at com.itextpdf.text.pdf.security.PdfPKCS7.<init>(PdfPKCS7.java:402)
at com.itextpdf.text.pdf.AcroFields.verifySignature(AcroFields.java:2419)
at com.itextpdf.text.pdf.AcroFields.verifySignature(AcroFields.java:2372)
at PdfReaderExample.verifySignature(PdfReaderExample.java:20)
at PdfReaderExample.verifySignatures(PdfReaderExample.java:32)
at PdfReaderExample.main(PdfReaderExample.java:42)
Process finished with exit code 1
This is my POM file:
<dependencies>
<dependency>
<groupId>com.sparkjava</groupId>
<artifactId>spark-core</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itextpdf</artifactId>
<version>5.5.12</version>
</dependency>
<dependency>
<groupId>org.bouncycastle</groupId>
<artifactId>bcprov-jdk15on</artifactId>
<version>1.58</version>
</dependency>
</dependencies>
I am using this sample PDFBox code to encrypt a PDF file and disable printing. Encryption succeeds, but printing is not disabled.
What could be the issue?
Here's the dependencies section of my pom.xml
<dependencies>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.6</version>
</dependency>
<dependency>
<groupId>org.bouncycastle</groupId>
<artifactId>bcprov-jdk15</artifactId>
<version>1.46</version>
</dependency>
</dependencies>
and below is the source code
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.pdmodel.encryption.StandardProtectionPolicy;
public class Test {
public static void main(String[] args) throws Exception {
PDDocument doc = PDDocument.load(new File("/tmp/Test.pdf"));
int keyLength = 128;
AccessPermission ap = new AccessPermission();
ap.setCanPrint(false);
StandardProtectionPolicy spp = new StandardProtectionPolicy("Admin", "Password", ap);
spp.setEncryptionKeyLength(keyLength);
spp.setPermissions(ap);
doc.protect(spp);
doc.save("/tmp/Test-Encrypted.pdf");
doc.close();
}
}
I need to extract text from a PDF file using Java. I found iText, but it doesn't work the way I wanted it to. Here's my code:
package com.itextpdf.mavenproject1;
import com.itextpdf.forms.PdfAcroForm;
import com.itextpdf.forms.fields.PdfButtonFormField;
import com.itextpdf.forms.fields.PdfFormField;
import com.itextpdf.io.font.FontConstants;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.PdfString;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.action.PdfAction;
import com.itextpdf.kernel.pdf.annot.PdfAnnotation;
import com.itextpdf.kernel.pdf.annot.PdfTextAnnotation;
import com.itextpdf.kernel.pdf.canvas.PdfCanvas;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.test.annotations.WrapToTest;
import java.io.File;
import java.io.IOException;
public class zczytywanie {
public static void main(String args[]) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader("D:/pdf/pdf"));
String page= PdfTextExtractor.getTextFromPage(pdfDoc, 1);
System.out.println(page);
}
}
It tells me there is an error on the line where I use PdfTextExtractor (PdfDocument cannot be converted to PdfPage), although I had read that pdfDoc has to be a PdfReader.
It doesn't work with
PdfReader pdfDoc = new PdfReader("D:/pdf/pdf");
either.
You can try PDFBox or Tika. Here is an example using PDFBox.
Add the PDFBox jar dependency to your pom.xml.
<dependencies>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.23</version>
</dependency>
</dependencies>
Java class
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;
public class TestPDF {
    public static void main(String[] args) {
        // try-with-resources closes the document when done
        try (PDDocument document = PDDocument.load(new File("/path_to_your_pdf_file"))) {
            if (!document.isEncrypted()) {
                // Extract the text of all pages
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                System.out.println("Text:" + pdfFileInText);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}