Java - Text Extraction from PDF using OCR - java

I have a pdf file (some part of it given below), and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file. (However it worked with other file, that has simple text).
What other OCR libraries are capable of doing it?
Please Help.
Thank you.

I tried with PDFBox and it produced satisfactory results.
Here is the code to extract text from PDF using PDFBox:
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.util.*;
public class PDFTest {
public static void main(String[] args){
PDDocument pd;
BufferedWriter wr;
try {
File input = new File("C:/BillOCR/data/bill.pdf"); // The PDF file from where you would like to extract
File output = new File("D:/SampleText.txt"); // The text file where you are going to store the extracted data
pd = PDDocument.load(input);
System.out.println(pd.getNumberOfPages());
System.out.println(pd.isEncrypted());
pd.save("CopyOfBill.pdf"); // Creates a copy called "CopyOfInvoice.pdf"
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1); //Start extracting from page 3
stripper.setEndPage(1); //Extract till page 5
wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
stripper.writeText(pd, wr);
if (pd != null) {
pd.close();
}
// I use close() to flush the stream.
wr.close();
} catch (Exception e){
e.printStackTrace();
}
}
}

Related

how to judge if the file is doc or docx in POI

The title may be a little confusing. The simplest method must be judging by extension name just like:
// is represents the InputStream
if (filePath.endsWith("doc")) {
WordExtractor ex = new WordExtractor(is);
text = ex.getText();
ex.close();
} else if(filePath.endsWith("docx")) {
XWPFDocument doc = new XWPFDocument(is);
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
text = extractor.getText();
extractor.close();
}
This works in most cases. But I have found that for certain file whose extension is doc (a docx file essentially) if you open using winrar, you will find xml files. As it is known that a docx file is a zip file consists of xml files.
I believe this problem must not be rare. But I have not found any information about this. Obviously, judging by extension name to read a doc or docx is not appropriate.
In my case, I have to read a lot of files. And I will even read the doc or docx inside a compressed file, zip, 7z or even rar. Hence, I have to read content by inputStream instead of a File or something else. So how to know whether a file is .docx or .doc format from Apache POI is totally not suitable for my case with ZipInputStream.
What is the best way to judge a file is a doc or docx? I want a solution to read the content from a file which may be doc or docx. But not only just simply judge if it is a doc or docx. Apparently, ZipInpuStream is not a good method for my case. And I believe it is not a appropriate method for others either. Why do I have to judge if the file is doc or docx by an exception?
Using the current stable apache poi version 3.17 you may use FileMagic. But internally this will of course also have a look into the files.
Example:
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
public class ReadWord {
static String read(InputStream is) throws Exception {
System.out.println(FileMagic.valueOf(is));
String text = "";
if (FileMagic.valueOf(is) == FileMagic.OLE2) {
WordExtractor ex = new WordExtractor(is);
text = ex.getText();
ex.close();
} else if(FileMagic.valueOf(is) == FileMagic.OOXML) {
XWPFDocument doc = new XWPFDocument(is);
XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
text = extractor.getText();
extractor.close();
}
return text;
}
public static void main(String[] args) throws Exception {
InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); //really a binary OLE2 Word file
System.out.println(read(is));
is.close();
is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); //a OOXML Word file named *.doc
System.out.println(read(is));
is.close();
is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); //really a OOXML Word file
System.out.println(read(is));
is.close();
}
}
try {
new ZipFile(new File("/Users/giang/Documents/a.doc"));
System.out.println("this file is .docx");
} catch (ZipException e) {
System.out.println("this file is not .docx");
e.printStackTrace();
}

How to read raw data, say only text, from a file(word document, excel) without format? [duplicate]

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
* This class is used to read .doc and .docx files
*
* #author Developer
*
*/
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
class TextExtractor {
private OutputStream outputstream;
private ParseContext context;
private Detector detector;
private Parser parser;
private Metadata metadata;
private String extractedText;
public TextExtractor() {
context = new ParseContext();
detector = new DefaultDetector();
parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
outputstream = new ByteArrayOutputStream();
metadata = new Metadata();
}
public void process(String filename) throws Exception {
URL url;
File file = new File(filename);
if (file.isFile()) {
url = file.toURI().toURL();
} else {
url = new URL(filename);
}
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);
input.close();
}
public void getString() {
//Get the text into a String object
extractedText = outputstream.toString();
//Do whatever you want with this String object.
System.out.println(extractedText);
}
public static void main(String args[]) throws Exception {
if (args.length == 1) {
TextExtractor textExtractor = new TextExtractor();
textExtractor.process(args[0]);
textExtractor.getString();
} else {
throw new Exception();
}
}
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSytem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for(String rawText : extractor.getParagraphText()) {
String text = extractor.stripFields(rawText);
System.out.println(text);
}
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(input, handler, metadata, new ParseContext());
String text = handler.toString();
Try this, works for me and is purely a POI solution. You will have to look for the HWPFDocument counterpart though. Make sure the document you are reading predates Word 97, else use XWPFDocument like I do.
InputStream inputstream = new FileInputStream(m_filepath);
//read the file
XWPFDocument adoc= new XWPFDocument(inputstream);
//and place it in a xwpf format
aString = new XWPFWordExtractor(adoc).getText();
//gets the full text
Now if you want certain parts you can use the getparagraphtext but dont use the text extractor, use it directly on the paragraph like this
for (XWPFParagraph p : adoc.getParagraphs())
{
System.out.println(p.getParagraphText());
}

Converting a pdf to word document using java

I've successfully converted JPEG to Pdf using Java, but don't know how to convert Pdf to Word using Java, the code for converting JPEG to Pdf is given below.
Can anyone tell me how to convert Pdf to Word (.doc/ .docx) using Java?
import java.io.FileOutputStream;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;
public class JpegToPDF {
public static void main(String[] args) {
try {
Document convertJpgToPdf = new Document();
PdfWriter.getInstance(convertJpgToPdf, new FileOutputStream(
"c:\\java\\ConvertImagetoPDF.pdf"));
convertJpgToPdf.open();
Image convertJpg = Image.getInstance("c:\\java\\test.jpg");
convertJpgToPdf.add(convertJpg);
convertJpgToPdf.close();
System.out.println("Successfully Converted JPG to PDF in iText");
} catch (Exception i1) {
i1.printStackTrace();
}
}
}
In fact, you need two libraries. Both libraries are open source. The first one is iText, it is used to extract the text from a PDF file. The second one is POI, it is ued to create the word document.
The code is quite simple:
//Create the word document
XWPFDocument doc = new XWPFDocument();
// Open the pdf file
String pdf = "myfile.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// Read the PDF page by page
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
// Extract the text
String text=strategy.getResultantText();
// Create a new paragraph in the word document, adding the extracted text
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
run.setText(text);
// Adding a page break
run.addBreak(BreakType.PAGE);
}
// Write the word document
FileOutputStream out = new FileOutputStream("myfile.docx");
doc.write(out);
// Close all open files
out.close();
reader.close();
Beware: With the used extraction strategy, you will lose all formatting. But you can fix this, by inserting your own, more complex extraction strategy.
You can use 7-pdf library
have a look at this it may help :
http://www.7-pdf.de/sites/default/files/guide/manuals/library/index.html
PS: itext has some issues when given file is non RGB image, try this out!!
Although it's far from being a pure Java solution OpenOffice/LibreOfffice allows one to connect to it through a TCP port; it's possible to use that to convert documents. If this looks like an acceptable solution, JODConverter can help you.

error in converting word document to pdf using iText

The below is the code that i used to convert a word document to pdf. After compiling the code, the PDF file is generated. But the file contains some junk characters along with the word document content. Please help me to know what modification should i do to get rid of the junk characters.
The code i used is:
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
public class PdfConverter
{
private void createPdf(String inputFile, String outputFile)//, boolean isPictureFile)
{
Document pdfDocument = new Document();
String pdfFilePath = outputFile;
try
{
FileOutputStream fileOutputStream = new FileOutputStream(pdfFilePath);
PdfWriter writer = null;
writer = PdfWriter.getInstance(pdfDocument, fileOutputStream);
writer.open();
pdfDocument.open();
/*if (isPictureFile)
{
pdfDocument.add(com.lowagie.text.Image.getInstance(inputFile));
}
else
{ */
File file = new File(inputFile);
pdfDocument.add(new Paragraph(org.apache.commons.io.FileUtils.readFileToString(file)));
//}
pdfDocument.close();
writer.close();
System.out.println("PDF has been generted");
}
catch (Exception exception)
{
System.out.println("Document Exception!" + exception);
}
}
public static void main(String args[])
{
PdfConverter pdfConversion = new PdfConverter();
pdfConversion.createPdf("C:/test.doc", "C:/test.pdf");//, true);
}
}
Thanks for you help.
Only because you name your class PdfConverter you don't have one. All you do is reading the binary content as a String and writing this as one paragraph (and that's what you see). This approach will definitively not be successful. See https://stackoverflow.com/questions/437394 for a similar question.
If you are interested just in the content of your word document, you might want to give Apache POI - the Java API for Microsoft Documents a try to read your the document not at binary level but on a hight abstraction level. If your Word document has a simple (and I mean a really simple) structure you might get reasonable results.
To do this, you will have to read the doc file correctly and then use the read data to create the PDF file.
What you are doing right now is that you are reading data from doc file, which is having garbage values since you are not using proper API to read the data, and then storing the obtained garbage data in the PDF file. Hence the issue.

Images are not appearing good when converted from .docx to pdf

I converted .docx file to .pdf file, the text is converting fine, but the images in the .docx file is not appearing, instead it is represented as some special characters, below is my code:
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
public class PDFConversion {
/**
* 14. This method is used to convert the given file to a PDF format 15.
*
* #param inputFile
* - Name and the path of the file 16.
* #param outputFile
* - Name and the path where the PDF file to be saved 17.
* #param isPictureFile
* 18.
*/
private void createPdf(String inputFile, String outputFile, boolean isPictureFile) {
/**
* 22. Create a new instance for Document class 23.
*/
Document pdfDocument = new Document();
String pdfFilePath = outputFile;
try {
FileOutputStream fileOutputStream = new FileOutputStream(pdfFilePath);
PdfWriter writer = null;
writer = PdfWriter.getInstance(pdfDocument, fileOutputStream);
writer.open();
pdfDocument.open();
/**
* 34. Proceed if the file given is a picture file 35.
*/
if (isPictureFile) {
pdfDocument.add(com.lowagie.text.Image.getInstance(inputFile));
}
/**
* 41. Proceed if the file given is (.txt,.html,.doc etc) 42.
*/
else {
File file = new File(inputFile);
pdfDocument.add(new Paragraph(org.apache.commons.io.FileUtils
.readFileToString(file)));
}
pdfDocument.close();
writer.close();
} catch (Exception exception) {
System.out.println("Document Exception!" + exception);
}
}
public static void main(String args[]) {
PDFConversion pdfConversion = new PDFConversion();
pdfConversion.createPdf("C:/Users/LENOVO/Downloads/The_JFileChooser_Component.doc",
"E:/The_JFileChooser_Component.pdf", false);
// For other files
// pdfConversion.createPdf("C:/shunmuga/sample.html",
// "C:/shunmuga/sample.pdf", false);
}
}
I'm not sure what it could be, but for some alternatives have a look at:
Apose.Words Library for Java it has some really cool features one of them being docx to pdf conversion by a few simple lines (and it's reliable):
Document doc = new Document("d:/test/mydoc.docx");
doc.Save("d:/test/Out.pdf", SaveFormat.Pdf);
Docx4j
which can be used to convert docx and many others to PDF, it does this by first using HTML/XML based on IText then converts it to a PDF (All libararies are included within docx4j, just added the itext link for completeness):
org.docx4j.convert.out.pdf.PdfConversion c
= new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);//using xml
// = new org.docx4j.convert.out.pdf.viaHTML.Conversion(wordMLPackage);//using html
// = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);//using itext libs
If that's not enough it has sample source code for you to try.
xdocreport also comes with a lot of samples for conversion (haven't downloaded them, but it should have the doc/docx to PDF converter source)

Categories

Resources