Converting PDF document containing graphs and tables to Word Document

I am trying to convert a PDF document to a Word file using Java. I found a code snippet on the Internet that converts a PDF document to Word, but the layout of the resulting Word document is badly off: images, tables and graphs are not preserved, and everything appears as plain text paragraphs.
The code I have written is given below.
XWPFDocument doc = new XWPFDocument();
String pdf = "D:\\xyz.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // SimpleTextExtractionStrategy returns only the page's text;
    // images and table structure are not extracted
    TextExtractionStrategy strategy =
        parser.processContent(i, new SimpleTextExtractionStrategy());
    String text = strategy.getResultantText();
    XWPFParagraph p = doc.createParagraph();
    XWPFRun run = p.createRun();
    run.setText(text);
    run.addBreak(BreakType.PAGE);
}
Can anyone help?

Related

Itext pdf - text alignment to right

I am using the iText PDF API to generate a PDF. I am trying to align some text to the right-hand side of the page. I have tried padding the text with spaces manually, but it is not working for some reason (code shown below). If there is a way of doing it dynamically, that would be great.
String dest = "\\location\\";
PdfWriter writer;
writer = new PdfWriter(dest);
// Creating a PdfDocument
PdfDocument pdf = new PdfDocument(writer);
// Creating a Document
Document document = new Document(pdf);
// Creating a String
String para1 = "TEXT";
// Left-padding with spaces up to 50 characters
while (para1.length() < 50) {
    para1 = " " + para1;
}
// Creating Paragraphs
Paragraph paragraph1 = new Paragraph(para1);
//paragraph1.setAlignment(Element.ALIGN_CENTER);
// Adding Paragraphs to document
document.add(paragraph1);
// Closing the document
document.close();
Thanks in advance!

In iText 7, the class com.itextpdf.layout.element.Paragraph has a setTextAlignment method. I hope this is what you are looking for:
...
paragraph1.setTextAlignment(TextAlignment.RIGHT);
...
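Padding with spaces cannot right-align text reliably, because glyph widths differ in a proportional font. What an alignment setting does, in effect, is measure the text and start it at the right edge minus that width. A minimal sketch of that arithmetic, assuming made-up glyph widths (the `GLYPH_WIDTHS` table and the `rightAlignedX` helper are hypothetical and not part of iText):

```java
import java.util.HashMap;
import java.util.Map;

final class RightAlignSketch {
    // Hypothetical advance widths in points; a real layout engine
    // reads these from the font's metrics.
    static final Map<Character, Double> GLYPH_WIDTHS = new HashMap<>();
    static {
        GLYPH_WIDTHS.put('T', 6.0);
        GLYPH_WIDTHS.put('E', 6.0);
        GLYPH_WIDTHS.put('X', 7.0);
        GLYPH_WIDTHS.put(' ', 3.0);
    }
    // Assumed width for any glyph missing from the table above.
    static final double DEFAULT_WIDTH = 5.0;

    // Sums the advance widths of all characters in the text.
    static double textWidth(String text) {
        double width = 0;
        for (char c : text.toCharArray()) {
            width += GLYPH_WIDTHS.getOrDefault(c, DEFAULT_WIDTH);
        }
        return width;
    }

    // X coordinate at which the text must start so that it ends
    // exactly at the right margin.
    static double rightAlignedX(String text, double pageWidth, double rightMargin) {
        return pageWidth - rightMargin - textWidth(text);
    }

    public static void main(String[] args) {
        // 595 pt is roughly the width of an A4 page; 36 pt is an assumed margin.
        System.out.println(rightAlignedX("TEXT", 595.0, 36.0));
    }
}
```

Because the start position depends on the measured width rather than on a character count, this is best left to the library's alignment setting.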

I'm using com.itextpdf:itextpdf:5.5.10, where the API is a bit different:
paragraph1.setAlignment(com.itextpdf.text.Element.ALIGN_RIGHT);

Convert DOC [HWPFDocument] to pdf [with font, Table and images] using java

I am converting a .doc file to PDF using the following code:
POIFSFileSystem fs = null;
Document Pdfdocument = new Document();
fs = new POIFSFileSystem(new FileInputStream(srcFile));
HWPFDocument doc = new HWPFDocument(fs);
// WordExtractor returns plain text only; fonts, tables and images are dropped
WordExtractor we = new WordExtractor(doc);
PdfWriter writer = PdfWriter.getInstance(Pdfdocument,
    new FileOutputStream(targetFile));
Pdfdocument.open();
writer.setPageEmpty(true);
Pdfdocument.newPage();
writer.setPageEmpty(true);
String[] paragraphs = we.getParagraphText();
for (int i = 0; i < paragraphs.length; i++) {
    Pdfdocument.add(new Paragraph(paragraphs[i]));
}
// Close the document so the PDF is written out completely
Pdfdocument.close();
This generates a PDF without formatting or images; even the fonts are missing, since WordExtractor extracts only text.
Is there any other way to convert the document with its fonts and images? I need the conversion from .doc (HWPFDocument), not from .docx.
I have referred to these SO links:
Convert doc to pdf using Apache POI
https://stackoverflow.com/a/6210694/6032482
how to convert doc,docx files to pdf in java programatically
and many more, but found that they all use WordExtractor.
Note: I can't use LibreOffice or Aspose.
Can it be done using:
Apache POI
DOCX4j
iText

Unable to read unicode character in pdf using java

I am trying to convert a PDF document that contains Tamil Unicode characters into a Word document, retaining all the formatting. I am not able to read the Unicode characters in the PDF; they appear as junk characters in Word. I am using the code below. Can someone please help?
public static void main(String[] args) throws IOException {
    System.out.println("Document converted started");
    XWPFDocument doc = new XWPFDocument();
    String pdf = "D:\\sample1.pdf";
    PdfReader reader = new PdfReader(pdf);
    // InputStreamReader isr = new InputStreamReader(reader,"UTF8");
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        TextExtractionStrategy strategy = parser.processContent(i,
            new SimpleTextExtractionStrategy());
        System.out.println(strategy.getResultantText());
        String text = strategy.getResultantText();
        XWPFParagraph p = doc.createParagraph();
        XWPFRun run = p.createRun();
        // run.setFontFamily(new Font("Arial"));
        run.setFontSize(14);
        run.setText(text);
        // run.addBreak(BreakType.PAGE);
    }
    FileOutputStream out = new FileOutputStream("D:\\tamildoc.docx");
    doc.write(out);
    out.close();
    reader.close();
    System.out.println("Document converted successfully");
}

You can use the Apache PDFBox library (https://pdfbox.apache.org/download.cgi). Calling getText(PDDocument doc) on a PDFTextStripper returns a plain String with the text content of the PDF file.
Here is an example:
UploadedFile file = new UploadedFile(fileName);
InputStream is = file.getInputStream();
PDDocument doc = PDDocument.load(is);
String content = new PDFTextStripper().getText(doc);
doc.close();
After that, you can write the text to your file.
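Before writing the extracted string to Word, it may help to check whether it actually contains Tamil code points; if it does not, the junk is being produced during extraction rather than by Word or the font. A minimal sketch using only the JDK (the class and method names are made up for illustration, and the ratio is a rough heuristic, not a guarantee of correct extraction):

```java
final class TamilTextCheck {
    // Returns the fraction of letters in the text that belong to the
    // Unicode Tamil block (U+0B80..U+0BFF). Combining vowel signs are
    // not letters and are ignored. Returns 0.0 for text without letters.
    static double tamilLetterRatio(String text) {
        int letters = 0;
        int tamilLetters = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!Character.isLetter(codePoint)) {
                continue;
            }
            letters++;
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.TAMIL) {
                tamilLetters++;
            }
        }
        return letters == 0 ? 0.0 : (double) tamilLetters / letters;
    }

    public static void main(String[] args) {
        // "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD" is the word "Tamil" in Tamil script.
        System.out.println(tamilLetterRatio("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"));
        System.out.println(tamilLetterRatio("abc"));
    }
}
```

A ratio near zero for a page that is visibly Tamil suggests the PDF's fonts do not map glyphs to Unicode, in which case no change on the Word side will fix the output.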

How to make a multiple pages docx?

InputStream is = Thread.currentThread().getContextClassLoader().getResourceAsStream(TEMPLATE);
XWPFDocument document = new XWPFDocument(is);
List<IBodyElement> elements = document.getBodyElements();
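// Remove from the end: removing by ascending index shifts the remaining
// elements down, so every other element would be skipped
for (int i = elements.size() - 1; i >= 0; i--) {
    document.removeBodyElement(i);
}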
CTBody body = document.getDocument().getBody();
CTSectPr docSp = body.getSectPr();
CTPageSz pageSize = docSp.getPgSz();
CTPageMar margin = docSp.getPgMar();
BigInteger pageWidth = pageSize.getW();
pageWidth = pageWidth.add(BigInteger.ONE);
BigInteger totalMargins = margin.getLeft().add(margin.getRight());
BigInteger contentWidth = pageWidth.subtract(totalMargins);
...
XWPFTable table = document.createTable(totalRows, totalColumns);
Starting from a template, I create an XWPFDocument and add a table to it. How could I add multiple tables, each on its own page? That is, how do I insert a page break?

I am just a beginner using POI to generate .docx files, but I have figured out how to insert a page break. Once you have created an XWPFParagraph, you can insert a page break through one of its runs:
XWPFDocument document = new XWPFDocument(is);
...
XWPFParagraph paragraph = document.createParagraph();
XWPFRun run = paragraph.createRun();
run.addBreak(BreakType.PAGE);
Hope this helps.

Another way is to set the page break on the XWPFParagraph itself:
XWPFDocument document = new XWPFDocument(is);
...
XWPFParagraph paragraph = document.createParagraph();
paragraph.setPageBreak(true);

Apache POI HWPF - problem in convert doc file to pdf

I am currently working on a Java project that uses Apache POI.
In this project I want to convert a .doc file to a PDF file. The conversion succeeds, but I only get plain text in the PDF, without any text style or colour.
My PDF file is black and white, while my .doc file is coloured and uses different text styles.
This is my code:
POIFSFileSystem fs = null;
Document document = new Document();
try {
    System.out.println("Starting the test");
    fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    OutputStream file = new FileOutputStream(new File("/document/test.pdf"));
    PdfWriter writer = PdfWriter.getInstance(document, file);
    Range range = doc.getRange();
    document.open();
    writer.setPageEmpty(true);
    document.newPage();
    writer.setPageEmpty(true);
    String[] paragraphs = we.getParagraphText();
    for (int i = 0; i < paragraphs.length; i++) {
        org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
        // CharacterRun run = pr.getCharacterRun(i);
        // run.setBold(true);
        // run.setCapitalized(true);
        // run.setItalic(true);
        paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
        System.out.println("Length:" + paragraphs[i].length());
        System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());
        // add the paragraph to the document
        document.add(new Paragraph(paragraphs[i]));
    }
    System.out.println("Document testing completed");
} catch (Exception e) {
    System.out.println("Exception during test");
    e.printStackTrace();
} finally {
    // close the document
    document.close();
}
Please help me.
Thanks in advance.

If you look at Apache Tika, there's a good example of reading some style information from an HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.
The Tika class is
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
One thing to note about Word documents is that everything in any one Character Run has the same formatting applied to it. A Paragraph is therefore made up of one or more Character Runs. Some styling is applied to the Paragraph, and other styling is applied to the runs. Depending on which formatting interests you, it may therefore be on the paragraph or on the run.
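To make the paragraph/run structure concrete, here is a minimal sketch that models a paragraph as a list of runs and renders it as HTML. The `Run` class and `toHtml` method are hypothetical stand-ins written for illustration; they are neither the POI API nor Tika's actual code, and only bold and italic are handled:

```java
import java.util.Arrays;
import java.util.List;

final class StyledRunSketch {
    // Hypothetical model: a run is a stretch of text that shares one
    // set of character formatting.
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    // Renders one paragraph (a list of runs) as an HTML <p> element,
    // wrapping each run in the tags that match its formatting.
    static String toHtml(List<Run> paragraph) {
        StringBuilder html = new StringBuilder("<p>");
        for (Run run : paragraph) {
            String piece = escape(run.text);
            if (run.italic) {
                piece = "<i>" + piece + "</i>";
            }
            if (run.bold) {
                piece = "<b>" + piece + "</b>";
            }
            html.append(piece);
        }
        return html.append("</p>").toString();
    }

    // Escapes the characters that are significant in HTML text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<Run> paragraph = Arrays.asList(
            new Run("Plain, ", false, false),
            new Run("bold", true, false),
            new Run(" and ", false, false),
            new Run("bold italic", true, true));
        // prints <p>Plain, <b>bold</b> and <b><i>bold italic</i></b></p>
        System.out.println(toHtml(paragraph));
    }
}
```

The same loop shape applies when the target is a PDF instead of HTML: each run becomes one styled chunk, and paragraph-level properties such as alignment are set once per paragraph.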

If you use WordExtractor, you will get text only. Try using the CharacterRun class instead; it gives you the style along with the text. Please refer to the following sample code.
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    org.apache.poi.hwpf.usermodel.Paragraph poiPara = range.getParagraph(i);
    // Each CharacterRun is a stretch of text with uniform formatting
    for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
        CharacterRun run = poiPara.getCharacterRun(j);
        System.out.println("Color " + run.getColor());
        System.out.println("Font size " + run.getFontSize());
        System.out.println("Font Name " + run.getFontName());
        System.out.println(run.isBold() + " " + run.isItalic() + " " + run.getUnderlineCode());
        System.out.println("Text is " + run.text());
    }
}
