Converting Scanned Image In PDF to Text using Tesseract OCR

I load a PDF document and extract each scanned page's content as a BufferedImage. When I run OCR on that image, the result is empty.
The code is pasted below:
public static void main(String[] args) {
PDDocument document = null;
try {
//mini-cog.pdf Optometry.pdf
document = PDDocument.load(new File("D:\\McLaren\\Optometry.pdf"));
PDPageTree pages = document.getPages();
Iterator iter = pages.iterator();
while (iter.hasNext()) {
PDPage page = (PDPage) iter.next();
PDResources resources = page.getResources();
for (COSName c : resources.getXObjectNames()) {
PDXObject o = resources.getXObject(c);
if (o instanceof PDImageXObject) {
BufferedImage image = ((PDImageXObject) o).getImage();
System.out.println("Width ====>> "+image.getWidth());
System.out.println("Height ====>> "+image.getHeight());
ocr(image);
}
}
} // end while loop
}
catch (IOException ex) {
System.out.println("" + ex);
}
}
public static void ocr(BufferedImage image) {
try {
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
System.load("C:\\Program Files\\Tesseract-OCR\\gsdll64.dll");
Tesseract tessInst = new Tesseract();
tessInst.setDatapath("D:\\tesseract\\");
tessInst.setLanguage("eng");
String result = tessInst.doOCR(image);
System.out.println(result);
}
catch (TesseractException e) {
e.printStackTrace();
}
}
The OCR result for the BufferedImage is an empty string.
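The question does not say what the extracted image actually looks like. One cheap diagnostic, sketched here under the assumption that the XObject might be blank (for example a mask layer) or in a pixel format the OCR engine handles poorly, is to check whether the image has any content and to copy it into an 8-bit grayscale buffer before passing it to doOCR. The class and method names (ImageSanityCheck, isUniform, toGray) are hypothetical and not part of PDFBox or Tess4J.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class ImageSanityCheck {

    /** Returns true when every pixel has the same RGB value as the top-left pixel. */
    public static boolean isUniform(BufferedImage image) {
        int first = image.getRGB(0, 0);
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                if (image.getRGB(x, y) != first) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Copies the image into a new 8-bit grayscale buffer of the same size. */
    public static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.drawImage(src, 0, 0, null);
        g.dispose();
        return gray;
    }
}
```

If isUniform(image) returns true for an extracted image, the empty OCR result probably comes from the image itself rather than from Tesseract, and rendering the whole page instead of extracting individual XObjects may be worth trying.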

Related

Java Unable to load arabic character U+062C into PDF on PDFBox even though it exists on the font

The font I am using is arial.ttf. Apart from this specific character (U+062C), all other characters seem to work fine.
When it reaches this character, it throws:
java.lang.IllegalArgumentException: U+062C ('afii57420') is not available in this font Helvetica (generic: ArialMT) encoding: StandardEncoding with differences
Code below:
PDFont font = PDType0Font.load(pdfDocumentTemplate, this.resourceLoader.getResource("classpath:arial.ttf").getInputStream(), false);
PDAcroForm pdAcroForm = pdAcroFormOptional.get();
String fontName = pdAcroForm.getDefaultResources()
.add(font)
.getName();
for (GenerateDocumentPlaceholder placeholder : command.getPlaceholderList()) {
PDTextField field = (PDTextField) pdAcroForm.getField(placeholder.getName());
try {
if (field != null) {
field.setDefaultAppearance("/" + fontName + " 0 Tf 0 g");
field.setValue(placeholder.getValue());
}
} catch (Exception ex) {
log.error("Error while updating field {}", placeholder.getValue(), ex);
}
}
And here's the font with the character (the original screenshot is not preserved):
Any ideas?

I fixed it by replacing the form's default resources in the PDF like this:
private String getPdfFontName(PDAcroForm pdAcroForm, PDDocument pdfDocumentTemplate, String resource) throws ConstraintViolatedException {
String fontName = "";
try {
PDResources pdResources = new PDResources();
pdAcroForm.setDefaultResources(pdResources);
PDFont font = PDType0Font.load(pdfDocumentTemplate, this.resourceLoader.getResource(resource).getInputStream());
pdResources.put(COSName.getPDFName("Helv"), font);
fontName = pdResources.add(font).getName();
} catch (Exception ex) {
log.warn("Error while adding arabic font", ex);
}
return fontName;
}
And then using it like this:
String fontName = this.getPdfFontName(pdAcroForm, pdfDocumentTemplate, "classpath:Arial.ttf");
for (GenerateDocumentPlaceholder placeholder : command.getPlaceholderList()) {
PDTextField field = (PDTextField) pdAcroForm.getField(placeholder.getName());
try {
if (field != null) {
field.setDefaultAppearance("/" + fontName + " 0 Tf 0 g");
field.setValue(placeholder.getValue());
}
} catch (Exception ex) {
log.error("Error while updating field {}", placeholder.getValue(), ex);
}
}
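The exception names Helvetica with StandardEncoding, a Latin encoding, so any character outside that range has to be drawn with an embedded Unicode font. When debugging failures like this, it can help to log which code points a field value actually contains. The following is a minimal JDK-only sketch; CodePointReport and nonLatin1 are hypothetical names, and the U+00FF cut-off is only a rough heuristic chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointReport {

    /** Lists every code point above U+00FF in the value as "U+XXXX NAME". */
    public static List<String> nonLatin1(String value) {
        List<String> out = new ArrayList<>();
        value.codePoints()
             .filter(cp -> cp > 0xFF)
             .forEach(cp -> out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }
}
```

For example, calling CodePointReport.nonLatin1(placeholder.getValue()) just before field.setValue(...) shows which characters need to be covered by the loaded font.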

Flatten signatures in pdf with PDFBOX java

I want to flatten a PDF that contains form signatures. With the code below, I can still delete the signature in the generated PDF. What I want is a final PDF from which nothing can be deleted.
private static void flattenPDF(String src, String dst) throws IOException {
PDDocument doc = null;
try {
doc = PDDocument.load(new File(src));
} catch (IOException e) {
System.out.println("Exception: " + e.getMessage());
}
PDDocumentCatalog catalog = doc.getDocumentCatalog();
PDAcroForm acroForm = catalog.getAcroForm();
if(acroForm == null) acroForm = new PDAcroForm(PDDocument.load(new File(src)));
PDResources resources = new PDResources();
List<PDField> fields = new ArrayList<>(acroForm.getFields());
processFields(fields, resources);
acroForm.setDefaultResources(resources);
try {
acroForm.flatten();
doc.save(dst);
doc.close();
} catch (IOException e) {
System.out.println("Exception: " + e.getMessage());
}
}
private static void processFields(List<PDField> fields, PDResources resources) {
fields.stream().forEach(f -> {
f.setReadOnly(true);
COSDictionary cosObject = f.getCOSObject();
String value = cosObject.getString(COSName.DV) == null ?
cosObject.getString(COSName.V) : cosObject.getString(COSName.DV);
System.out.println("Setting " + f.getFullyQualifiedName() + ": " + value);
try {
f.setValue(value);
} catch (IOException e) {
if (e.getMessage().matches("Could not find font: /.*")) {
String fontName = e.getMessage().replaceAll("^[^/]*/", "");
System.out.println("Adding fallback font for: " + fontName);
resources.put(COSName.getPDFName(fontName), PDType1Font.HELVETICA);
try {
f.setValue(value);
} catch (IOException e1) {
e1.printStackTrace();
}
} else {
e.printStackTrace();
}
}
if (f instanceof PDNonTerminalField) {
processFields(((PDNonTerminalField) f).getChildren(), resources);
}
});
}
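The fallback branch in processFields recovers the missing font name from the exception text with a regular expression. The sketch below isolates that one step so it can be checked on its own. It assumes the message has the form "Could not find font: /Name", as the code above does; this wording is not verified against PDFBox, and FontNameFromError is a hypothetical helper.

```java
public class FontNameFromError {

    /**
     * Returns the font name after the first '/' in a message such as
     * "Could not find font: /Helv", or null if the message has another shape.
     */
    public static String missingFontName(String message) {
        if (message == null || !message.matches("Could not find font: /.*")) {
            return null;
        }
        // Drop everything up to and including the first slash.
        return message.replaceAll("^[^/]*/", "");
    }
}
```

Matching on message text is fragile, because the wording can change between library versions, so treating a null result as "unknown error" keeps the fallback from registering a font under a wrong name.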

Printing plain text files to PDF printer using javax.print results in an empty file

I need to create a PDF file from plain text files. I assumed the simplest method would be to read these files and print them to a PDF printer.
My problem is that when I print to a PDF printer, the result is an empty PDF file. When I print to Microsoft XPS Document Writer, the file is created in plain text format, not in OXPS format.
I would be satisfied with a two- or three-step solution (e.g. converting to XPS first and then to PDF using Ghostscript, or something similar).
I have tried several PDF printers, including CutePDF, Microsoft PDF writer, and Bullzip PDF. The result is the same for each one.
The environment is Java 1.7/1.8 on Windows 10.
private void print() {
try {
DocFlavor flavor = DocFlavor.SERVICE_FORMATTED.PRINTABLE;
PrintRequestAttributeSet patts = new HashPrintRequestAttributeSet();
PrintService[] ps = PrintServiceLookup.lookupPrintServices(flavor, patts);
if (ps.length == 0) {
throw new IllegalStateException("No Printer found");
}
System.out.println("Available printers: " + Arrays.asList(ps));
PrintService myService = null;
for (PrintService printService : ps) {
if (printService.getName().equals("Microsoft XPS Document Writer")) { //
myService = printService;
break;
}
}
if (myService == null) {
throw new IllegalStateException("Printer not found");
}
myService.getSupportedDocFlavors();
DocPrintJob job = myService.createPrintJob();
FileInputStream fis1 = new FileInputStream("o:\\k\\t1.txt");
Doc pdfDoc = new SimpleDoc(fis1, DocFlavor.INPUT_STREAM.AUTOSENSE, null);
HashPrintRequestAttributeSet pr = new HashPrintRequestAttributeSet();
pr.add(OrientationRequested.PORTRAIT);
pr.add(new Copies(1));
pr.add(MediaSizeName.ISO_A4);
PrintJobWatcher pjw = new PrintJobWatcher(job);
job.print(pdfDoc, pr);
pjw.waitForDone();
fis1.close();
} catch (PrintException ex) {
Logger.getLogger(Docparser.class.getName()).log(Level.SEVERE, null, ex);
} catch (Exception ex) {
Logger.getLogger(Docparser.class.getName()).log(Level.SEVERE, null, ex);
}
}
class PrintJobWatcher {
boolean done = false;
PrintJobWatcher(DocPrintJob job) {
job.addPrintJobListener(new PrintJobAdapter() {
public void printJobCanceled(PrintJobEvent pje) {
allDone();
}
public void printJobCompleted(PrintJobEvent pje) {
allDone();
}
public void printJobFailed(PrintJobEvent pje) {
allDone();
}
public void printJobNoMoreEvents(PrintJobEvent pje) {
allDone();
}
void allDone() {
synchronized (PrintJobWatcher.this) {
done = true;
System.out.println("Printing done ...");
PrintJobWatcher.this.notify();
}
}
});
}
public synchronized void waitForDone() {
try {
while (!done) {
wait();
}
} catch (InterruptedException e) {
}
}
}
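The PrintJobWatcher above waits with wait/notify and has no timeout, so it blocks forever if the driver never reports an event. The same idea can be sketched with a CountDownLatch and a bounded wait. LatchPrintJobListener and awaitDone are hypothetical names; only PrintJobAdapter and PrintJobEvent come from javax.print.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import javax.print.event.PrintJobAdapter;
import javax.print.event.PrintJobEvent;

public class LatchPrintJobListener extends PrintJobAdapter {

    private final CountDownLatch done = new CountDownLatch(1);

    // Any terminal event releases the waiting thread.
    @Override public void printJobCanceled(PrintJobEvent pje)     { done.countDown(); }
    @Override public void printJobCompleted(PrintJobEvent pje)    { done.countDown(); }
    @Override public void printJobFailed(PrintJobEvent pje)       { done.countDown(); }
    @Override public void printJobNoMoreEvents(PrintJobEvent pje) { done.countDown(); }

    /** Waits up to the timeout; returns false on timeout or interruption. */
    public boolean awaitDone(long timeout, TimeUnit unit) {
        try {
            return done.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            return false;
        }
    }
}
```

A possible use is to register the listener with job.addPrintJobListener(listener) before calling job.print(...), then call listener.awaitDone(30, TimeUnit.SECONDS) and treat a false result as "no completion event received".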

If you can install LibreOffice, it is possible to use the Java UNO API to do this.
There is a similar example here which will load and save a file: Java Convert Word to PDF with UNO. This could be used to convert your text file to PDF.
Alternatively, you could take the text file and send it directly to the printer using the same API.
The following JARs give access to the UNO API. Ensure these are in your class path:
[Libre Office Dir]/URE/java/juh.jar
[Libre Office Dir]/URE/java/jurt.jar
[Libre Office Dir]/URE/java/ridl.jar
[Libre Office Dir]/program/classes/unoil.jar
[Libre Office Dir]/program
The following code will then take your sourceFile and print to the printer named "Local Printer 1".
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import com.sun.star.beans.PropertyValue;
import com.sun.star.frame.XComponentLoader;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.view.XPrintable;
public class DirectPrintTest
{
public static void main(String args[])
{
// set to the correct name of your printers
String printer = "Local Printer 1";// "Microsoft Print to PDF";
File sourceFile = new File("c:/projects/WelcomeTemplate.doc");
if (!sourceFile.canRead()) {
throw new RuntimeException("Can't read:" + sourceFile.getPath());
}
com.sun.star.uno.XComponentContext xContext = null;
try {
// get the remote office component context
xContext = com.sun.star.comp.helper.Bootstrap.bootstrap();
System.out.println("Connected to a running office ...");
// get the remote office service manager
com.sun.star.lang.XMultiComponentFactory xMCF = xContext
.getServiceManager();
Object oDesktop = xMCF.createInstanceWithContext(
"com.sun.star.frame.Desktop", xContext);
com.sun.star.frame.XComponentLoader xCompLoader = (XComponentLoader) UnoRuntime
.queryInterface(com.sun.star.frame.XComponentLoader.class,
oDesktop);
StringBuffer sUrl = new StringBuffer("file:///");
sUrl.append(sourceFile.getCanonicalPath().replace('\\', '/'));
List<PropertyValue> loadPropsList = new ArrayList<PropertyValue>();
PropertyValue pv = new PropertyValue();
pv.Name = "Hidden";
pv.Value = Boolean.TRUE;
loadPropsList.add(pv);
PropertyValue[] loadProps = new PropertyValue[loadPropsList.size()];
loadPropsList.toArray(loadProps);
// Load a Writer document, which will be automatically displayed
com.sun.star.lang.XComponent xComp = xCompLoader
.loadComponentFromURL(sUrl.toString(), "_blank", 0,
loadProps);
// Querying for the interface XPrintable on the loaded document
com.sun.star.view.XPrintable xPrintable = (XPrintable) UnoRuntime
.queryInterface(com.sun.star.view.XPrintable.class, xComp);
// Setting the property "Name" for the favoured printer (name of
// IP address)
com.sun.star.beans.PropertyValue propertyValue[] = new com.sun.star.beans.PropertyValue[1]; // only index 0 is used; a larger array would leave null entries
propertyValue[0] = new com.sun.star.beans.PropertyValue();
propertyValue[0].Name = "Name";
propertyValue[0].Value = printer;
// Setting the name of the printer
xPrintable.setPrinter(propertyValue);
propertyValue[0] = new com.sun.star.beans.PropertyValue();
propertyValue[0].Name = "Wait";
propertyValue[0].Value = Boolean.TRUE;
// Printing the loaded document
System.out.println("sending print");
xPrintable.print(propertyValue);
System.out.println("closing doc");
((com.sun.star.util.XCloseable) UnoRuntime.queryInterface(
com.sun.star.util.XCloseable.class, xPrintable))
.close(true);
System.out.println("closed");
System.exit(0);
} catch (Exception e) {
e.printStackTrace(System.err);
System.exit(1);
}
}
}

Thank you all. After two days of struggling with various types of printers (I also gave the CUPS PDF printer a chance, but I could not make it print in landscape mode), I ended up using Apache PDFBox.
It's only a proof-of-concept solution, but it works and fits my needs. I hope it will be useful to somebody.
(The cleanTextContent() method removes some ESC control characters from the line to be printed.)
public void txt2pdf() {
float POINTS_PER_INCH = 72;
float POINTS_PER_MM = 1 / (10 * 2.54f) * POINTS_PER_INCH;
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy.MM.dd HH:m.ss");
PDDocument doc = null;
try {
doc = new PDDocument();
PDPage page = new PDPage(new PDRectangle(297 * POINTS_PER_MM, 210 * POINTS_PER_MM));
doc.addPage(page);
PDPageContentStream content = new PDPageContentStream(doc, page);
//PDFont pdfFont = PDType1Font.HELVETICA;
PDFont pdfFont = PDTrueTypeFont.loadTTF(doc, new File("c:\\Windows\\Fonts\\lucon.ttf"));
float fontSize = 10;
float leading = 1.1f * fontSize;
PDRectangle mediabox = page.getMediaBox();
float margin = 20;
float startX = mediabox.getLowerLeftX() + margin;
float startY = mediabox.getUpperRightY() - margin;
content.setFont(pdfFont, fontSize);
content.beginText();
content.setLeading(leading);
content.newLineAtOffset(startX, startY);
BufferedReader fis1 = new BufferedReader(new InputStreamReader(new FileInputStream("o:\\k\\t1.txt"), "cp852"));
String inString;
//content.setRenderingMode(RenderingMode.FILL_STROKE);
float currentY = startY + 60;
float hitOsszesenOffset = 0;
int pageNumber = 1;
while ((inString = fis1.readLine()) != null) {
currentY -= leading;
if (currentY <= margin) {
content.newLineAtOffset(0, (mediabox.getLowerLeftX()-35));
content.showText("Date Generated: " + dateFormat.format(new Date()));
content.newLineAtOffset((mediabox.getUpperRightX() / 2), (mediabox.getLowerLeftX()));
content.showText(String.valueOf(pageNumber++)+" lap");
content.endText();
float yCordinate = currentY+30;
float sX = mediabox.getLowerLeftY()+ 35;
float endX = mediabox.getUpperRightX() - 35;
content.moveTo(sX, yCordinate);
content.lineTo(endX, yCordinate);
content.stroke();
content.close();
PDPage new_Page = new PDPage(new PDRectangle(297 * POINTS_PER_MM, 210 * POINTS_PER_MM));
doc.addPage(new_Page);
content = new PDPageContentStream(doc, new_Page);
content.beginText();
content.setFont(pdfFont, fontSize);
content.newLineAtOffset(startX, startY);
currentY = startY;
}
String ss = new String(inString.getBytes(), "UTF8");
ss = cleanTextContent(ss);
if (!ss.isEmpty()) {
if (ss.contains("JAN") || ss.contains("SUMMARY")) {
content.setRenderingMode(RenderingMode.FILL_STROKE);
}
content.newLineAtOffset(0, -leading);
content.showText(ss);
}
content.setRenderingMode(RenderingMode.FILL);
}
content.newLineAtOffset((mediabox.getUpperRightX() / 2), (mediabox.getLowerLeftY()));
content.showText(String.valueOf(pageNumber++));
content.endText();
fis1.close();
content.close();
doc.save("o:\\k\\t1.pdf");
} catch (IOException ex) {
Logger.getLogger(Document_Creation.class.getName()).log(Level.SEVERE, null, ex);
} finally {
if (doc != null) {
try {
doc.close();
} catch (IOException ex) {
Logger.getLogger(Document_Creation.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
}
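The page size in txt2pdf comes from converting millimetres to PDF points (72 points per inch, 25.4 mm per inch). The arithmetic can be sketched separately as below. PageUnits and its methods are hypothetical helpers, and linesPerPage is only a rough estimate that assumes equal top and bottom margins; it does not reproduce the exact page-break logic of the code above.

```java
public class PageUnits {

    public static final float POINTS_PER_INCH = 72f;
    public static final float MM_PER_INCH = 25.4f;

    /** Converts millimetres to PDF user-space points. */
    public static float mmToPoints(float mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    /** Estimates how many text lines fit between equal top and bottom margins. */
    public static int linesPerPage(float pageHeightPt, float marginPt, float leadingPt) {
        return (int) Math.floor((pageHeightPt - 2 * marginPt) / leadingPt);
    }
}
```

For A4 landscape, mmToPoints(297) is about 841.89 and mmToPoints(210) is about 595.28. With the 20 pt margin and 11 pt leading used above, the estimate comes to 50 lines per page.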

How to get pictures and tables from .docx document using apache poi?

I tried to extract the whole content of a .docx file into a text area in Java, but all I get is the text, without images or tables. Any advice? Thanks in advance.
My code is:
try{
JFileChooser chooser = new JFileChooser();
chooser.showOpenDialog(null);
XWPFDocument doc = new XWPFDocument(new FileInputStream(chooser.getSelectedFile()));
XWPFWordExtractor extract = new XWPFWordExtractor(doc);
content.setText(extract.getText());
content.setFont(new Font("Serif", Font.ITALIC, 16));
content.setLineWrap(true);
content.setWrapStyleWord(true);
content.setBackground(Color.white);
} catch(Exception e){
JOptionPane.showMessageDialog(null, e);
}
}

To extract tables, use List<XWPFTable> table = doc.getTables().
See the example below:
public static void readWordDocument() {
try {
String fileName = "C:\\sample.docx";
if(!(fileName.endsWith(".doc") || fileName.endsWith(".docx"))) {
throw new FileFormatException();
} else {
XWPFDocument doc = new XWPFDocument(new FileInputStream(fileName));
List<XWPFTable> table = doc.getTables();
for (XWPFTable xwpfTable : table) {
List<XWPFTableRow> row = xwpfTable.getRows();
for (XWPFTableRow xwpfTableRow : row) {
List<XWPFTableCell> cell = xwpfTableRow.getTableCells();
for (XWPFTableCell xwpfTableCell : cell) {
if(xwpfTableCell!=null)
{
System.out.println(xwpfTableCell.getText());
List<XWPFTable> itable = xwpfTableCell.getTables();
if(itable.size()!=0)
{
for (XWPFTable xwpfiTable : itable) {
List<XWPFTableRow> irow = xwpfiTable.getRows();
for (XWPFTableRow xwpfiTableRow : irow) {
List<XWPFTableCell> icell = xwpfiTableRow.getTableCells();
for (XWPFTableCell xwpfiTableCell : icell) {
if(xwpfiTableCell!=null)
{
System.out.println(xwpfiTableCell.getText());
}
}
}
}
}
}
}
}
}
}
} catch(FileFormatException e) {
e.printStackTrace();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
To extract images, use List<XWPFPictureData> piclist = docx.getAllPictures().
See the example below:
public static void extractImages(String src){
try{
//create file inputstream to read from a binary file
FileInputStream fs=new FileInputStream(src);
//create office word 2007+ document object to wrap the word file
XWPFDocument docx=new XWPFDocument(fs);
//get all images from the document and store them in the list piclist
List<XWPFPictureData> piclist=docx.getAllPictures();
//traverse through the list and write each image to a file
Iterator<XWPFPictureData> iterator=piclist.iterator();
int i=0;
while(iterator.hasNext()){
XWPFPictureData pic=iterator.next();
byte[] bytepic=pic.getData();
BufferedImage imag=ImageIO.read(new ByteArrayInputStream(bytepic));
ImageIO.write(imag, "jpg", new File("D:/imagefromword"+i+".jpg"));
i++;
}
}catch(Exception e){
e.printStackTrace(); // report the failure instead of exiting silently
System.exit(-1);
}
}
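The example above re-encodes every picture as JPEG. Pictures embedded in a .docx can be in other formats, and ImageIO.read returns null when no registered reader understands the data. An alternative, sketched here as an assumption rather than as part of the answer, is to write the original bytes from pic.getData() unchanged and pick the file extension from the leading signature bytes. ImageTypeSniffer is a hypothetical helper that recognises only PNG, JPEG and GIF.

```java
public class ImageTypeSniffer {

    /** Guesses a file extension from the first bytes; falls back to "bin". */
    public static String extensionFor(byte[] data) {
        if (data == null) {
            return "bin";
        }
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47)) return "png"; // \x89 P N G
        if (startsWith(data, 0xFF, 0xD8, 0xFF))       return "jpg"; // JPEG start-of-image marker
        if (startsWith(data, 0x47, 0x49, 0x46, 0x38)) return "gif"; // G I F 8
        return "bin";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Inside the loop, the bytes could then be saved with something like Files.write(Paths.get("D:/imagefromword" + i + "." + ImageTypeSniffer.extensionFor(bytepic)), bytepic), which avoids decoding the image at all.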

edit .doc file header java

I need to edit the header of .doc and .docx files while preserving the document's style.
I tried the following:
POI API: I managed to read the file header but couldn't find out how to replace text in it and save the result with the original style.
public static void mFix(String iFilePath , HashMap<String, String> iOldNewCouples)
{
aOldNewCouples = iOldNewCouples;
try {
if(iFilePath==null)
return;
File file = new File(iFilePath);
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document); // read the doc as rtf
String fileData = extractor.getHeaderText();
String fileDataResult =fileData ;
for (Entry<String, String> entry : aOldNewCouples.entrySet())
{
if(fileData.contains(entry.getKey())) {
System.out.println("replace " +entry.getKey());
fileDataResult = fileDataResult.replace(entry.getKey(), entry.getValue()); // build on the previous result so earlier replacements are kept
}
}
document.getHeaderStoryRange().replaceText(fileData, fileDataResult);
saveWord(iFilePath ,document);
fis.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace( );
}
}
private static void saveWord(String filePath, HWPFDocument doc) throws FileNotFoundException, IOException
{
FileOutputStream fileOutputStream = null;
try{
fileOutputStream = new FileOutputStream(new File(filePath.replace(".doc", "-test.doc")));
BufferedOutputStream buffOutputStream = new BufferedOutputStream(fileOutputStream);
doc.write(buffOutputStream);
buffOutputStream.close();
fileOutputStream.close();
}
finally{
if( fileOutputStream != null)
fileOutputStream.close();
}
}
I also tried the docx4j API for .docx: I found out how to edit the header but not how to keep the style.
public static void mFix(String iFilePath , HashMap<String, String> iOldNewCouples) {
aOldNewCouples = iOldNewCouples;
WordprocessingMLPackage output;
try {
output = WordprocessingMLPackage.load(new java.io.File(iFilePath));
replaceText(output.getDocumentModel().getSections().get(0).getHeaderFooterPolicy().getDefaultHeader());
output.save(new File(iFilePath));
}
catch (Exception e) {
e.printStackTrace();
}
}
public static void replaceText(ContentAccessor c) throws Exception
{
for (Object p: c.getContent())
{
if (p instanceof ContentAccessor)
replaceText((ContentAccessor) p);
else if (p instanceof JAXBElement)
{
Object v = ((JAXBElement) p).getValue();
if (v instanceof ContentAccessor)
replaceText((ContentAccessor) v);
else if (v instanceof org.docx4j.wml.Text)
{
org.docx4j.wml.Text t = (org.docx4j.wml.Text) v;
String text = t.getValue();
if (text != null)
{
boolean flag = false;
for (Entry<String, String> entry : aOldNewCouples.entrySet())
{
if(text.contains(entry.getKey())) {
flag =true;
text = text.replaceAll(entry.getKey(), entry.getValue());
t.setSpace("preserve");
t.setValue(text);
}
}
}
}
}
}
}
I would like to see examples for these APIs.
If there is another free solution for Java projects, please describe it with an example.
Thanks,
Tami
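Independent of which library edits the header, two details are easy to get wrong when applying a map of old/new pairs to a string: each replacement should build on the result of the previous one, and String.replaceAll treats its first argument as a regular expression, whereas String.replace treats it literally. A minimal JDK-only sketch of the literal, cumulative variant follows. HeaderTextReplacer is a hypothetical helper, and it does not address the style-preservation part of the question.

```java
import java.util.Map;

public class HeaderTextReplacer {

    /** Applies every old/new pair in iteration order, literally and cumulatively. */
    public static String applyAll(String text, Map<String, String> oldToNew) {
        String result = text;
        for (Map.Entry<String, String> entry : oldToNew.entrySet()) {
            // replace() is literal; keys such as "$x" or "a.b" are not treated as regex
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Passing a LinkedHashMap keeps the replacement order predictable, which matters when one replacement value contains another key.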
