I converted a .docx file to a .pdf file. The text converts fine, but the images in the .docx file do not appear; instead they are rendered as special characters. Below is my code:
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;

public class PDFConversion {

    /**
     * This method is used to convert the given file to PDF format.
     *
     * @param inputFile     name and path of the file
     * @param outputFile    name and path where the PDF file is to be saved
     * @param isPictureFile whether the input is a picture file
     */
    private void createPdf(String inputFile, String outputFile, boolean isPictureFile) {
        // Create a new instance of the Document class
        Document pdfDocument = new Document();
        String pdfFilePath = outputFile;
        try {
            FileOutputStream fileOutputStream = new FileOutputStream(pdfFilePath);
            PdfWriter writer = PdfWriter.getInstance(pdfDocument, fileOutputStream);
            writer.open();
            pdfDocument.open();
            if (isPictureFile) {
                // Proceed if the given file is a picture file
                pdfDocument.add(com.lowagie.text.Image.getInstance(inputFile));
            } else {
                // Proceed if the given file is text-like (.txt, .html, .doc etc.)
                File file = new File(inputFile);
                pdfDocument.add(new Paragraph(
                        org.apache.commons.io.FileUtils.readFileToString(file)));
            }
            pdfDocument.close();
            writer.close();
        } catch (Exception exception) {
            System.out.println("Document Exception!" + exception);
        }
    }

    public static void main(String args[]) {
        PDFConversion pdfConversion = new PDFConversion();
        pdfConversion.createPdf("C:/Users/LENOVO/Downloads/The_JFileChooser_Component.doc",
                "E:/The_JFileChooser_Component.pdf", false);
        // For other files:
        // pdfConversion.createPdf("C:/shunmuga/sample.html",
        //         "C:/shunmuga/sample.pdf", false);
    }
}
I'm not sure what causes it, but for some alternatives have a look at:
Aspose.Words for Java: it has some really cool features, one of them being docx-to-PDF conversion in a few simple lines (and it's reliable):
Document doc = new Document("d:/test/mydoc.docx");
doc.save("d:/test/Out.pdf", SaveFormat.PDF);
Docx4j
which can be used to convert docx and many other formats to PDF. It does this by first converting to XSL-FO or HTML, or by using iText directly, as the commented alternatives below show (all libraries are included within docx4j; I just added the iText link for completeness):
org.docx4j.convert.out.pdf.PdfConversion c
    = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage); // using XSL-FO
//  = new org.docx4j.convert.out.pdf.viaHTML.Conversion(wordMLPackage);  // using HTML
//  = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage); // using the iText libs
If that's not enough, it has sample source code for you to try.
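For completeness, here is a minimal end-to-end sketch of the docx4j route. It assumes docx4j 3.x, where the Docx4J facade drives the same FO-based conversion shown above; the file paths are placeholders:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.docx4j.Docx4J;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class DocxToPdf {
    public static void main(String[] args) throws Exception {
        // Load the .docx into a WordprocessingMLPackage (placeholder path)
        WordprocessingMLPackage wordMLPackage =
                WordprocessingMLPackage.load(new File("d:/test/mydoc.docx"));
        // Docx4J.toPDF runs the XSL-FO based conversion internally
        OutputStream os = new FileOutputStream("d:/test/out.pdf");
        Docx4J.toPDF(wordMLPackage, os);
        os.flush();
        os.close();
    }
}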
xdocreport also comes with a lot of samples for conversion (I haven't downloaded them, but it should include the doc/docx-to-PDF converter source).
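A hedged sketch of xdocreport's XWPF converter, assuming the older org.apache.poi.xwpf.converter packaging (package names differ between versions, and the paths are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class XdocreportDocxToPdf {
    public static void main(String[] args) throws Exception {
        // Load the docx with POI's XWPF model (placeholder path)
        XWPFDocument document = new XWPFDocument(new FileInputStream("in.docx"));
        // Hand it to xdocreport's converter, which renders the PDF via iText
        OutputStream out = new FileOutputStream("out.pdf");
        PdfConverter.getInstance().convert(document, out, PdfOptions.create());
        out.close();
    }
}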
Related
The title may be a little confusing. The simplest method is to judge by the extension name, like this:
// "is" is the InputStream for the file
if (filePath.endsWith("doc")) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if (filePath.endsWith("docx")) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}
This works in most cases. But I have found that for certain files whose extension is doc (essentially docx files), if you open them with WinRAR you will find XML files inside; as is well known, a docx file is a zip archive consisting of XML files.
I believe this problem cannot be rare, but I have not found any information about it. Obviously, judging by the extension name to decide between doc and docx is not appropriate.
In my case I have to read a lot of files, and I will even read the doc or docx inside a compressed file: zip, 7z or even rar. Hence I have to read the content from an InputStream rather than from a File. So the approach from "How to know whether a file is .docx or .doc format from Apache POI" is totally unsuitable for my case, because it relies on ZipInputStream.
What is the best way to judge whether a file is doc or docx? I want a solution that reads the content from a file which may be either doc or docx, not one that merely guesses the format. Apparently ZipInputStream is not a good method for my case, and I believe it is not an appropriate method for others either. Why should I have to decide between doc and docx by catching an exception?
Using the current stable Apache POI version 3.17, you can use FileMagic. But internally, of course, this also has to look into the file's content.
Example:
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;

import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

    static String read(InputStream is) throws Exception {
        // FileMagic.valueOf peeks at the header bytes; the stream must support
        // mark/reset, which is why a BufferedInputStream is used below.
        FileMagic fileMagic = FileMagic.valueOf(is);
        System.out.println(fileMagic);
        String text = "";
        if (fileMagic == FileMagic.OLE2) {
            WordExtractor ex = new WordExtractor(is);
            text = ex.getText();
            ex.close();
        } else if (fileMagic == FileMagic.OOXML) {
            XWPFDocument doc = new XWPFDocument(is);
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            text = extractor.getText();
            extractor.close();
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); // really a binary OLE2 Word file
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); // an OOXML Word file named *.doc
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); // really an OOXML Word file
        System.out.println(read(is));
        is.close();
    }
}
Another, cruder, check is to try opening the file as a zip archive; a .docx is a zip container while a binary .doc is not:

import java.io.File;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

try {
    // A .docx is a zip container, so opening it as a ZipFile succeeds
    new ZipFile(new File("/Users/giang/Documents/a.doc"));
    System.out.println("this file is .docx");
} catch (ZipException e) {
    // A binary .doc is not a zip, so a ZipException is thrown
    System.out.println("this file is not .docx");
    e.printStackTrace();
}
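If you would rather not rely on an exception, and you need to stay stream-based, you can peek at the leading bytes yourself: a ZIP container (and therefore .docx) starts with "PK", while a binary OLE2 .doc starts with 0xD0 0xCF 0x11 0xE0. A minimal sketch (the helper name isDocx is mine):

import java.io.IOException;
import java.io.InputStream;

public class DocFormatSniffer {

    // Hypothetical helper: true if the stream looks like a ZIP container (.docx).
    // The stream must support mark/reset, e.g. be wrapped in a BufferedInputStream.
    static boolean isDocx(InputStream is) throws IOException {
        byte[] header = new byte[4];
        is.mark(4);
        int read = is.read(header);
        is.reset(); // rewind so the caller can still parse the full stream
        // ZIP local file header: 0x50 0x4B 0x03 0x04 ("PK\3\4");
        // a binary OLE2 .doc would instead start with 0xD0 0xCF 0x11 0xE0.
        return read == 4
                && header[0] == 0x50 && header[1] == 0x4B
                && header[2] == 0x03 && header[3] == 0x04;
    }
}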
I have a PDF file (some part of it is given below) and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file (it worked with another file that has simple text).
What other OCR libraries are capable of doing this?
Please help. Thank you.
I tried PDFBox and it produced satisfactory results.
Here is the code to extract text from a PDF using PDFBox:
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTest {
    public static void main(String[] args) {
        PDDocument pd;
        BufferedWriter wr;
        try {
            File input = new File("C:/BillOCR/data/bill.pdf"); // the PDF from which to extract text
            File output = new File("D:/SampleText.txt");       // the text file where the extracted data is stored
            pd = PDDocument.load(input);
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            pd.save("CopyOfBill.pdf"); // creates a copy called "CopyOfBill.pdf"
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1); // start extracting from page 1
            stripper.setEndPage(1);   // extract up to page 1 only
            wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
            stripper.writeText(pd, wr);
            if (pd != null) {
                pd.close();
            }
            // close() also flushes the stream
            wr.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I am using the iText PDF library to convert PDF to text.
Below is my code to convert a PDF to a text file using Java.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PdfConverter {

    /** The original PDF that will be parsed. */
    public static final String pdfFileName = "jdbc_tutorial.pdf";
    /** The resulting text file. */
    public static final String RESULT = "preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
            System.out.println(strategy.getResultantText());
        }
        out.flush();
        out.close();
        reader.close();
    }

    /**
     * Main method.
     * @param args no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new PdfConverter().parsePdf(pdfFileName, RESULT);
    }
}
The above code works for extracting text from a PDF. But my requirement is to ignore the header and footer and extract only the content of the PDF file.
Because your PDF has headers and footers, they may be marked as artifacts (if not, they are just text or content placed at the header or footer position). If they are marked as artifacts, you can extract them using the ParseTaggedPdf example. You can also make use of ExtractPageContentArea if ParseTaggedPdf doesn't work. You can check a few examples related to these.
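For the tagged route, here is a minimal sketch using iText 5's TaggedPdfReaderTool, which dumps the structure tree (where artifacts are marked) to XML; the file names are placeholders:

import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.TaggedPdfReaderTool;

public class ParseTagged {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("tagged.pdf"); // placeholder
        FileOutputStream os = new FileOutputStream("structure.xml");
        // Walks the structure tree; this only works if the PDF is actually tagged
        new TaggedPdfReaderTool().convertToXml(reader, os);
        os.close();
        reader.close();
    }
}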
That solution is general and depends on the file. If you really need an alternative, you can use Apache libraries like PDFBox or Tika, or others like PDFTextStream. The solution I'm giving below won't work if you have to stick with iText and can't move to other libraries. In PDFBox you can use PDFTextStripperByArea or PDFTextStripper; look at the Javadoc or some examples if you need to know how to use them. A minimal sketch follows.
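A minimal sketch of the PDFBox route, assuming PDFBox 2.x; the file name and the region coordinates are placeholder guesses that you would tune so the rectangle excludes the header and footer bands:

import java.awt.geom.Rectangle2D;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class StripBody {
    public static void main(String[] args) throws Exception {
        PDDocument doc = PDDocument.load(new File("input.pdf")); // placeholder
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        // Region covering the body only; the coordinates here are guesses
        stripper.addRegion("body", new Rectangle2D.Double(0, 60, 595, 700));
        for (PDPage page : doc.getPages()) {
            stripper.extractRegions(page);
            System.out.println(stripper.getTextForRegion("body"));
        }
        doc.close();
    }
}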
Using iText, I found one example on this site: http://what-when-how.com/itext-5/parsing-pdfs-part-2-itext-5/
In it you create a rectangle that defines the bounds of the text you want to extract.
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
// Create the rectangle bounding the text to keep
Rectangle rect = new Rectangle(70, 80, 420, 500);
// Create a filter based on the rectangle
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Apply the filter to the text extraction strategy
    strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), filter);
    out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
As the web page describes, it should work even if the PDF is not tagged.
You can read specific locations of a PDF file: just mark the areas you need to get text from and leave out the areas where the header and footer appear. I have done this, and the complete code is in "itext reading specific location from pdf file runs in intellij and gives desired output but executable jar throws error".
I have an HTML file with a large number of columns (you can find a sample at this link).
Now when I try to convert it to PDF using Flying Saucer (the jar recompiled to work with iText 2.1.x), the generated PDF has truncated columns.
Is there some way to make Flying Saucer either break the table or increase the page width according to the HTML content?
This is the code that I am using:
String doc = file.toURI().toURL().toString();
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc);
String outputFile = "test.pdf";
OutputStream os = new FileOutputStream(outputFile);
renderer.layout();
renderer.createPDF(os);
os.flush();
os.close();
where file is the HTML file that I am trying to convert.
Use the YAHP library. This is the best library I have worked with so far for converting HTML to PDF. It is written on top of Flying Saucer, which is a big disappointment compared to its popularity (it won't even render simple input text boxes), so I turned to the YAHP library, which is excellent for your case.
Try this code after you get all the jars related to this library.
import java.io.File;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.allcolor.yahp.converter.CYaHPConverter;
import org.allcolor.yahp.converter.IHtmlToPdfTransformer;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyHtmlSerializer;
import org.htmlcleaner.TagNode;

public class YahpHtmlToPdf {

    @SuppressWarnings({ "unchecked", "rawtypes" })
    public static void main(String[] args) {
        try {
            // Clean up the input HTML first with HtmlCleaner
            CleanerProperties props = new CleanerProperties();
            props.setTranslateSpecialEntities(true);
            props.setTransResCharsToNCR(true);
            props.setOmitComments(true);
            TagNode tagNode = new HtmlCleaner(props).clean(new File("C:\\Users\\MyComputer\\Desktop\\aspose.html"));
            String newString = new PrettyHtmlSerializer(props).getAsString(tagNode, "ISO-8859-1");

            CYaHPConverter converter = new CYaHPConverter();
            File fout = new File("C:\\sample\\aspose.pdf");
            FileOutputStream out = new FileOutputStream(fout);
            Map properties = new HashMap();
            List headerFooterList = new ArrayList();
            properties.put(IHtmlToPdfTransformer.PDF_RENDERER_CLASS,
                    IHtmlToPdfTransformer.FLYINGSAUCER_PDF_RENDERER);
            converter.convertToPdf(newString, IHtmlToPdfTransformer.A1P,
                    headerFooterList, "file:///temp/", out, properties);
            out.flush();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Here is a screenshot of the PDF generated with your HTML (screenshot not reproduced). You can specify the page size like this: IHtmlToPdfTransformer.A1P.
It's possible to change the page size if you know the size you need.
Use the CSS @page rule for this:
Please note: in this example I'm working on the HTML with Jsoup (see the comments).
/*
 * This part is optional - Jsoup is used for cleaning the HTML and inserting
 * the style tag into the head. You can use anything else for doing this.
 *
 * If you use Jsoup, make sure you set the proper charset (2nd parameter).
 *
 * Note: this is NOT a W3C Document but a Jsoup one.
 */
Document doc = Jsoup.parse(file, null);

/*
 * Here you specify the page size you need (size: width height).
 * Inserting this piece of HTML is the key part!
 */
doc.head().append("<style type=\"text/css\"><!--@page { size:50.0cm 20.0cm; }--></style>");

ITextRenderer renderer = new ITextRenderer();

/*
 * This part is Jsoup-related: 'doc.toString()' does nothing else than
 * return the HTML of 'doc' as a string.
 *
 * You can set it like in your code too.
 */
renderer.setDocumentFromString(doc.toString());

final String outputFile = "test.pdf";
OutputStream os = new FileOutputStream(outputFile);
renderer.layout();
renderer.createPDF(os);
os.flush();
os.close();
With this code you'll get a PDF where the whole table is on a landscape page (you may have to change the width/height for your needs).
Below is the code that I used to convert a Word document to PDF. After compiling the code, the PDF file is generated, but the file contains some junk characters along with the Word document content. Please help me understand what modification I should make to get rid of the junk characters.
The code I used is:
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;

public class PdfConverter {

    private void createPdf(String inputFile, String outputFile) { //, boolean isPictureFile)
        Document pdfDocument = new Document();
        String pdfFilePath = outputFile;
        try {
            FileOutputStream fileOutputStream = new FileOutputStream(pdfFilePath);
            PdfWriter writer = PdfWriter.getInstance(pdfDocument, fileOutputStream);
            writer.open();
            pdfDocument.open();
            /* if (isPictureFile) {
                pdfDocument.add(com.lowagie.text.Image.getInstance(inputFile));
            } else { */
            File file = new File(inputFile);
            pdfDocument.add(new Paragraph(org.apache.commons.io.FileUtils.readFileToString(file)));
            // }
            pdfDocument.close();
            writer.close();
            System.out.println("PDF has been generated");
        } catch (Exception exception) {
            System.out.println("Document Exception!" + exception);
        }
    }

    public static void main(String args[]) {
        PdfConverter pdfConversion = new PdfConverter();
        pdfConversion.createPdf("C:/test.doc", "C:/test.pdf"); //, true);
    }
}
Thanks for your help.
Just because you name your class PdfConverter doesn't mean you have one. All you do is read the binary content as a String and write it out as one paragraph (and that's what you see). This approach will definitely not be successful. See https://stackoverflow.com/questions/437394 for a similar question.
If you are interested just in the content of your Word document, you might want to give Apache POI - the Java API for Microsoft Documents - a try, to read your document not at the binary level but at a higher abstraction level. If your Word document has a simple (and I mean a really simple) structure, you might get reasonable results.
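For instance, here is a minimal sketch, assuming a binary .doc file, that combines POI's HWPF WordExtractor with the same com.lowagie classes you already import; the file paths are placeholders, and this recovers plain text only, not images or formatting:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class DocTextToPdf {
    public static void main(String[] args) throws Exception {
        // Read the .doc at a higher abstraction level instead of as raw bytes
        WordExtractor extractor = new WordExtractor(new FileInputStream("C:/test.doc"));
        String text = extractor.getText();
        extractor.close();

        // Write the extracted plain text into a PDF paragraph
        Document pdf = new Document();
        PdfWriter.getInstance(pdf, new FileOutputStream("C:/test.pdf"));
        pdf.open();
        pdf.add(new Paragraph(text));
        pdf.close();
    }
}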
To fix this, you will have to read the doc file correctly and then use the data you read to create the PDF file.
What you are doing right now is reading data from the doc file, which yields garbage values because you are not using a proper API to read it, and then storing that garbage in the PDF file. Hence the issue.