How do I use iText7 to OCR a multi page PDF?

How do I use iText7 to OCR a multi page PDF? - java

I have used iText PdfRender, which converts a non-OCR PDF to an image, after which I used iText PdfOcr to convert that image to an OCR'd PDF. Is there a tool that lets me perform this process in one step?
It would also be helpful if there was some documentation on how I can process multi-page PDFs using PdfRender, which I can't seem to find. The following is the code I used to convert 1 image to an OCR PDF document.
import com.itextpdf.pdfocr.OcrPdfCreator;
import com.itextpdf.pdfocr.tesseract4.Tesseract4LibOcrEngine;
import com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
public class img2pdfocr {
static final Tesseract4OcrEngineProperties tesseract4OcrEngineProperties = new Tesseract4OcrEngineProperties();
private static List LIST_IMAGES_OCR = Arrays.asList(new File("image1.jpg"));
private static String OUTPUT_PDF = "F:\\ITEXT_workspace\\jumpstart\\bizdoc.pdf";
public static void main(String[] args) throws IOException {
final Tesseract4LibOcrEngine tesseractReader = new Tesseract4LibOcrEngine(tesseract4OcrEngineProperties);
tesseract4OcrEngineProperties.setPathToTessData(new File("F:\\ITEXT_workspace\\jumpstart\\TESS_DATA_FOLDER"));
OcrPdfCreator ocrPdfCreator = new OcrPdfCreator(tesseractReader);
try (PdfWriter writer = new PdfWriter(OUTPUT_PDF)) {
ocrPdfCreator.createPdf(LIST_IMAGES_OCR, writer).close();
}
}
}
EDIT
As pointed out in the comments, I need not use pdfRender, iText core itself can be used to extract images from a PDF. Use this for the code. You can check on this documentation

Related

How to generate QR code with some text using JAVA?

I want to generate QR code with some text using JAVA like this.
please check this image. This is how I want to generate my QR code.
(with user name and event name text)
This is my code and this generate only (QR) code, (not any additional text). If anyone know how to generate QR code with text please help me.
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import com.google.zxing.BarcodeFormat;
import com.google.zxing.EncodeHintType;
import com.google.zxing.MultiFormatWriter;
import com.google.zxing.client.j2se.MatrixToImageWriter;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.qrcode.decoder.ErrorCorrectionLevel;
public class Create_QR {
public static void main(String[] args) {
try {
String qrCodeData = "This is the text";
String filePath = "C:\\Users\\Nirmalw\\Desktop\\Projects\\QR\\test\\test_img\\my_QR.png";
String charset = "UTF-8"; // or "ISO-8859-1"
Map < EncodeHintType, ErrorCorrectionLevel > hintMap = new HashMap < EncodeHintType, ErrorCorrectionLevel > ();
hintMap.put(EncodeHintType.ERROR_CORRECTION, ErrorCorrectionLevel.L);
BitMatrix matrix = new MultiFormatWriter().encode(new String(qrCodeData.getBytes(charset), charset),
BarcodeFormat.QR_CODE, 500, 500, hintMap);
MatrixToImageWriter.writeToFile (matrix, filePath.substring(filePath.lastIndexOf('.') + 1), new File(filePath));
System.out.println("QR Code created successfully!");
} catch (Exception e) {
System.err.println(e);
}
}
}

You can generate a QR code with text in Java using Free Spire.Barcode for Java API. First, download the API's jar from this link or install it from Maven Repository:
<repositories>
<repository>
<id>com.e-iceblue</id>
<name>e-iceblue</name>
<url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>e-iceblue</groupId>
<artifactId>spire.barcode.free</artifactId>
<version>5.1.1</version>
</dependency>
</dependencies>
Next, refer to the following code sample:
import com.spire.barcode.BarCodeGenerator;
import com.spire.barcode.BarCodeType;
import com.spire.barcode.BarcodeSettings;
import com.spire.barcode.QRCodeECL;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
public class GenerateQRCode {
public static void main(String []args) throws IOException {
//Instantiate a BarcodeSettings object
BarcodeSettings settings = new BarcodeSettings();
//Set barcode type
settings.setType(BarCodeType.QR_Code);
//Set barcode data
String data = "https://stackoverflow.com/";
settings.setData(data);
//Set barcode module width
settings.setX(2);
//Set error correction level
settings.setQRCodeECL(QRCodeECL.M);
//Set top text
settings.setTopText("User Name");
//Set bottom text
settings.setBottomText("Event Name");
//Set text visibility
settings.setShowText(false);
settings.setShowTopText(true);
settings.setShowBottomText(true);
//Set border visibility
settings.hasBorder(false);
//Instantiate a BarCodeGenerator object based on the specific settings
BarCodeGenerator barCodeGenerator = new BarCodeGenerator(settings);
//Generate QR code image
BufferedImage bufferedImage = barCodeGenerator.generateImage();
//save the image to a .png file
ImageIO.write(bufferedImage,"png",new File("QR_Code.png"));
}
}
The following is the generated QR code image with text:

To generate a QR code in Java, we need to use a third-party library named ZXing (Zebra Crossing). It is a popular API that allows us to process with QR code. With the help of the library, we can easily generate and read the QR code. Before moving towards the Java program, we need to add the ZXing library to the project. We can download it from the official site.
zxing core-3.3.0.jar
zxing javase-3.3.0.jar
After downloading, add it to the classpath. Or add the following dependency in pom.xml file.
<dependencies>
<dependency>
<groupId>com.google.zxing</groupId>
<artifactId>core</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>com.google.zxing</groupId>
<artifactId>javase</artifactId>
<version>3.3.0</version>
</dependency>
</dependencies>
Let's create a Java program that generates a QR code.
GenerateQrCode.java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import com.google.zxing.BarcodeFormat;
import com.google.zxing.EncodeHintType;
import com.google.zxing.MultiFormatWriter;
import com.google.zxing.NotFoundException;
import com.google.zxing.WriterException;
import com.google.zxing.client.j2se.MatrixToImageWriter;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.qrcode.decoder.ErrorCorrectionLevel;
public class GenerateQRCode
{
//static function that creates QR Code
public static void generateQRcode(String data, String path, String charset, Map map, int h, int w) throws WriterException, IOException
{
//the BitMatrix class represents the 2D matrix of bits
//MultiFormatWriter is a factory class that finds the appropriate Writer subclass for the BarcodeFormat requested and encodes the barcode with the supplied contents.
BitMatrix matrix = new MultiFormatWriter().encode(new String(data.getBytes(charset), charset), BarcodeFormat.QR_CODE, w, h);
MatrixToImageWriter.writeToFile(matrix, path.substring(path.lastIndexOf('.') + 1), new File(path));
}
//main() method
public static void main(String args[]) throws WriterException, IOException, NotFoundException
{
//data that we want to store in the QR code
String str= "THE HABIT OF PERSISTENCE IS THE HABIT OF VICTORY.";
//path where we want to get QR Code
String path = "C:\\Users\\Anubhav\\Desktop\\QRDemo\\Quote.png";
//Encoding charset to be used
String charset = "UTF-8";
Map<EncodeHintType, ErrorCorrectionLevel> hashMap = new HashMap<EncodeHintType, ErrorCorrectionLevel>();
//generates QR code with Low level(L) error correction capability
hashMap.put(EncodeHintType.ERROR_CORRECTION, ErrorCorrectionLevel.L);
//invoking the user-defined method that creates the QR code
generateQRcode(str, path, charset, hashMap, 200, 200);//increase or decrease height and width accodingly
//prints if the QR code is generated
System.out.println("QR Code created successfully.");
}
}

Decoding QR Code embedded into PDF with ZXing on Java backend

In a Java backend server application I want to decode a QR code embedded into a PDF file using the zxing library.
I adapted the example from https://gist.github.com/JoelGeraci-Datalogics/dd9e214d4c584d61f5b1 to work with the pdfbox library as follows:
ReadBarcodeFromPDF.java
import com.google.zxing.*;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.common.HybridBinarizer;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
public class ReadBarcodeFromPDF {
private static final List<String> urlList = Arrays.asList(
"http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_QRCode.pdf",
"http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_DataMatrix.pdf",
"http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_PDF417.pdf"
);
public static void main(String[] args) throws Exception {
for (String url : urlList) {
readCode(url);
}
}
private static String readCode(String url) throws MalformedURLException, IOException, NotFoundException {
URLConnection connection = new URL(url).openConnection();
connection.connect();
try (InputStream inputStream = connection.getInputStream()) {
try (PDDocument document = PDDocument.load(inputStream)) {
PDFRenderer pdfRenderer = new PDFRenderer(document);
BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300, ImageType.BINARY);
LuminanceSource luminanceSource = new BufferedImageLuminanceSource(bufferedImage);
HybridBinarizer hybridBinarizer = new HybridBinarizer(luminanceSource);
BinaryBitmap bitmap = new BinaryBitmap(hybridBinarizer);
MultiFormatReader reader = new MultiFormatReader();
Hashtable<DecodeHintType, Object> hints = new Hashtable<>();
hints.put(DecodeHintType.POSSIBLE_FORMATS, Arrays.asList(
BarcodeFormat.QR_CODE,
BarcodeFormat.DATA_MATRIX,
BarcodeFormat.PDF_417
));
hints.put(DecodeHintType.PURE_BARCODE, Boolean.FALSE);
hints.put(DecodeHintType.TRY_HARDER, Boolean.TRUE);
reader.setHints(hints);
Result result = reader.decodeWithState(bitmap);
return result.getText();
}
}
}
}
The example is supposed to recognize codes from PDF documents with different code types:
QR Code: http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_QRCode.pdf
Data Matrix: http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_DataMatrix.pdf
PDF417: http://dev.datalogics.com/cookbook/forms/ReadBarcodeImage_PDF417.pdf
But it fact it recognizes only the PDF417. The QR Code (in which I am interested) and the data matrix (in which I am not interested) are not recognized.
The output is
com.google.zxing.NotFoundException
com.google.zxing.NotFoundException
11/15/2010 Tony Blue
I also tried
Different arguments in line
pdfRenderer.renderImageWithDPI(0, 300, ImageType.BINARY);
Different values for dpi (2nd argument) between 100 and 300
All available values for imageType (3rd argument)
Newest version of the library (3.5.0)
pom.xml
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.26</version>
</dependency>
<dependency>
<groupId>com.google.zxing</groupId>
<artifactId>javase</artifactId>
<version>3.3.3</version>
</dependency>

Your code works if one removes
hints.put(DecodeHintType.PURE_BARCODE, Boolean.FALSE);
I suspect that ZXIng only checks whether there is an entry for the key and not its value. The javadoc mentions "Doesn't matter what it maps to; use Boolean.TRUE."

Using PdfCleanUpTool or PdfAutoSweep causes some text to change to bold, line weights increase and double points change to hearts

I have weird problem when I try to use iText 7. Some parts of the PDF are modified (text to change to bold, line weights increase and double points change to hearts). In iText version 5.4.4 this didn't happened, but every version since that I have tried cause this same problem (5 or 7).
Does anyone have a clued why this is happening and is there anything I could to do to bypass this problem? Any help would be appreciated!
If more information is needed, I will try to provide it.
Below is simple code that I used to test iText 7.
Example PDF Files
package javaapplication1;
import com.itextpdf.kernel.colors.ColorConstants;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.pdfcleanup.PdfCleanUpLocation;
import com.itextpdf.pdfcleanup.PdfCleanUpTool;
import com.itextpdf.pdfcleanup.autosweep.ICleanupStrategy;
import com.itextpdf.pdfcleanup.autosweep.PdfAutoSweep;
import com.itextpdf.pdfcleanup.autosweep.RegexBasedCleanupStrategy;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
public class JavaApplication1 {
public static final String DEST = "D:/TEMP/TEMP/PDF/result/orientation_result.pdf";
public static final String DEST2 = "D:/TEMP/TEMP/PDF/result/orientation_result2.pdf";
public static final String DEST3 = "D:/TEMP/TEMP/PDF/result/orientation_result3.pdf";
public static final String SRC = "D:/TEMP/TEMP/PDF/TEST_PDF.pdf";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new JavaApplication1().manipulatePdf(DEST);
new JavaApplication1().manipulatePdf2(DEST2);
try (PdfDocument pdf = new PdfDocument(new PdfReader(SRC), new PdfWriter(DEST3))) {
final ICleanupStrategy cleanupStrategy = new RegexBasedCleanupStrategy(Pattern.compile("2019", Pattern.CASE_INSENSITIVE)).setRedactionColor(ColorConstants.PINK);
final PdfAutoSweep autoSweep = new PdfAutoSweep(cleanupStrategy);
autoSweep.cleanUp(pdf);
} catch (Exception e) {
System.out.println(e.toString());
}
}
protected void manipulatePdf(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC), new PdfWriter(dest));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
// The arguments of the PdfCleanUpLocation constructor: the number of page to be cleaned up,
// a Rectangle defining the area on the page we want to clean up,
// a color which will be used while filling the cleaned area.
PdfCleanUpLocation location = new PdfCleanUpLocation(1, new Rectangle(97, 405, 383, 40),
ColorConstants.GRAY);
cleanUpLocations.add(location);
PdfCleanUpTool cleaner = new PdfCleanUpTool(pdfDoc, cleanUpLocations);
cleaner.cleanUp();
pdfDoc.close();
}
protected void manipulatePdf2(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC), new PdfWriter(dest));
// If the second argument is true, then regions to be erased are extracted from the redact annotations
// contained inside the given document. If the second argument is false (that's default behavior),
// then use PdfCleanUpTool.addCleanupLocation(PdfCleanUpLocation)
// method to set regions to be erased from the document.
PdfCleanUpTool cleaner = new PdfCleanUpTool(pdfDoc, true);
cleaner.cleanUp();
pdfDoc.close();
}
}

Sorry for late response. I managed now verify that updating to latest versions corrected this problem. Thanks to #mkl pointing this out.
Problem solved

How can I add pTab elements to docx4j while converting document to pdf

I'm getting some error while converting document to pdf using docx4j library in Java. Sadly, my error is this
NOT IMPLEMENTED support for w:pict without v:imagedata
and it's showing up on the converted pdf instead of displaying the error in my java terminal.
I have gone through some article and questions,thus found this converting docx to pdf . However, I am uncertain how to use this in my code or convert it. This is my code :
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.Map;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.model.structure.SectionWrapper;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class docTopdf {
public static void main(String[] args) {
try {
InputStream is = new FileInputStream(
new File(
"test.docx"));
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(is);
List<SectionWrapper> sections = wordMLPackage.getDocumentModel().getSections();
for (int i = 0; i < sections.size(); i++) {
wordMLPackage.getDocumentModel().getSections().get(i)
.getPageDimensions();
}
PhysicalFonts.discoverPhysicalFonts();
#Deprecated
Map<String, PhysicalFont> physicalFonts = PhysicalFonts.getPhysicalFonts();
// 2) Prepare Pdf settings
#Deprecated
PdfSettings pdfSettings = new PdfSettings();
// 3) Convert WordprocessingMLPackage to Pdf
#Deprecated
org.docx4j.convert.out.pdf.PdfConversion conversion = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(
wordMLPackage);
#Deprecated
OutputStream out = new FileOutputStream(
new File(
"test.pdf"));
conversion.output(out, pdfSettings);
} catch (Throwable e) {
e.printStackTrace();
}
}
}
And my pom.xml
<dependency>
<groupId>org.docx4j</groupId>
<artifactId>docx4j</artifactId>
<version>3.2.1</version>
</dependency>
any help would be appreciated as I am noob to this conversion. Thanks in advance

Creating a PDF via XSL FO doesn't support w:pict without v:imagedata (ie a graphic which isn't a simple image).
Whilst you could suppress the message by configuring logging appropriately, your PDF output would be lossy.
Your options are to correct the input docx (ie use an image instead of whatever you currently have), or to use a PDF converter with appropriate support. For one option, see https://www.docx4java.org/blog/2020/03/documents4j-for-pdf-output/

Identifying file type in Java

Please help me to find out the type of the file which is being uploaded.
I wanted to distinguish between excel type and csv.
MIMEType returns same for both of these file. Please help.

I use Apache Tika which identifies the filetype using magic byte patterns and globbing hints (the file extension) to detect the MIME type. It also supports additional parsing of file contents (which I don't really use).
Here is a quick and dirty example on how Tika can be used to detect the file type without performing any additional parsing on the file:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import org.apache.tika.metadata.HttpHeaders;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaMetadataKeys;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.helpers.DefaultHandler;
public class Detector {
public static void main(String[] args) throws Exception {
File file = new File("/pats/to/file.xls");
AutoDetectParser parser = new AutoDetectParser();
parser.setParsers(new HashMap<MediaType, Parser>());
Metadata metadata = new Metadata();
metadata.add(TikaMetadataKeys.RESOURCE_NAME_KEY, file.getName());
InputStream stream = new FileInputStream(file);
parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
stream.close();
String mimeType = metadata.get(HttpHeaders.CONTENT_TYPE);
System.out.println(mimeType);
}
}

I hope this will help. Taken from an example not from mine:
import javax.activation.MimetypesFileTypeMap;
import java.io.File;
class GetMimeType {
public static void main(String args[]) {
File f = new File("test.gif");
System.out.println("Mime Type of " + f.getName() + " is " +
new MimetypesFileTypeMap().getContentType(f));
// expected output :
// "Mime Type of test.gif is image/gif"
}
}
Same may be true for excel and csv types. Not tested.

I figured out a cheaper way of doing this with java.nio.file.Files
public String getContentType(File file) throws IOException {
return Files.probeContentType(file.toPath());
}
- or -
public String getContentType(Path filePath) throws IOException {
return Files.probeContentType(filePath);
}
Hope that helps.
Cheers.

A better way without using javax.activation.*:
URLConnection.guessContentTypeFromName(f.getAbsolutePath()));

If you are already using Spring this works for csv and excel:
import org.springframework.mail.javamail.ConfigurableMimeFileTypeMap;
import javax.activation.FileTypeMap;
import java.io.IOException;
public class ContentTypeResolver {
private FileTypeMap fileTypeMap;
public ContentTypeResolver() {
fileTypeMap = new ConfigurableMimeFileTypeMap();
}
public String getContentType(String fileName) throws IOException {
if (fileName == null) {
return null;
}
return fileTypeMap.getContentType(fileName.toLowerCase());
}
}
or with javax.activation you can update the mime.types file.

The CSV will start with text and the excel type is most likely binary.
However the simplest approach is to try to load the excel document using POI. If this fails try to load the file as a CSV, if that fails its possibly neither type.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How do I use iText7 to OCR a multi page PDF? - java

Related

How to generate QR code with some text using JAVA?

Decoding QR Code embedded into PDF with ZXing on Java backend

Using PdfCleanUpTool or PdfAutoSweep causes some text to change to bold, line weights increase and double points change to hearts

How can I add pTab elements to docx4j while converting document to pdf

Identifying file type in Java

Categories

Resources