I have an HTML file with a large number of columns (you can find the sample at this link).
When I try to convert it to PDF using Flying Saucer (jar link, recompiled to work with iText 2.1.X), the generated PDF has truncated columns.
Is there some way to make Flying Saucer either break the table or increase the width of the page according to the HTML content?
This is the code that I am using:
String doc = file.toURI().toURL().toString();
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc);
String outputFile = "test.pdf";
OutputStream os = new FileOutputStream(outputFile);
renderer.layout();
renderer.createPDF(os);
os.flush();
os.close();
Here, file is the HTML file that I am trying to convert.
Use the YAHP library. It is the best library I have worked with so far for converting HTML to PDF. It is written on top of Flying Saucer, which on its own is a big disappointment compared to its popularity: it won't even render simple input text boxes. So I turned to the YAHP library, which works well for your case.
Try this code after you get all the jars related to this library.
import java.io.File;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.allcolor.yahp.converter.CYaHPConverter;
import org.allcolor.yahp.converter.IHtmlToPdfTransformer;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyHtmlSerializer;
import org.htmlcleaner.TagNode;
public class YahpHtmlToPdf {
@SuppressWarnings({ "unchecked", "rawtypes" })
public static void main(String[] args) {
try{
CleanerProperties props = new CleanerProperties();
props.setTranslateSpecialEntities(true);
props.setTransResCharsToNCR(true);
props.setOmitComments(true);
TagNode tagNode = new HtmlCleaner(props).clean(new File("C:\\Users\\MyComputer\\Desktop\\aspose.html"));
String newString=new PrettyHtmlSerializer(props).getAsString(tagNode, "ISO-8859-1");
CYaHPConverter converter = new CYaHPConverter();
File fout = new File("C:\\sample\\aspose.pdf");
FileOutputStream out = new FileOutputStream(fout);
Map properties = new HashMap();
List headerFooterList = new ArrayList();
properties.put(IHtmlToPdfTransformer.PDF_RENDERER_CLASS,IHtmlToPdfTransformer.FLYINGSAUCER_PDF_RENDERER);
converter.convertToPdf(newString,IHtmlToPdfTransformer.A1P,headerFooterList, "file:///temp/",out,properties);
out.flush();
out.close();
}catch(Exception e){
e.printStackTrace();
}
}
}
This is a screenshot of the PDF generated from your HTML. You can specify the page size like this: IHtmlToPdfTransformer.A1P.
It's possible to change the page size if you know the size you need.
Use the @page rule for this:
Please note: in this example I'm working on the HTML with Jsoup (see comment).
/*
* This part is optional - Jsoup is used for cleaning the html and inserting the style tag into the head.
* You can use everything else for doing this.
*
* If you use Jsoup, make sure you set the proper charset (2nd parameter).
*
* Note: this is NOT a W3C Document but a Jsoup one.
*/
Document doc = Jsoup.parse(file, null);
/*
* Here you specify the page size you need (size: width height).
* Inserting this html is the key part!
*/
doc.head().append("<style type=\"text/css\"><!--@page { size:50.0cm 20.0cm; }--></style>");
ITextRenderer renderer = new ITextRenderer();
/*
* This part is Jsoup related. 'doc.toString()' does nothing other than
* return the HTML of 'doc' as a string.
*
* You can set it like in your code too.
*/
renderer.setDocumentFromString(doc.toString());
final String outputFile = "test.pdf";
OutputStream os = new FileOutputStream(outputFile);
renderer.layout();
renderer.createPDF(os);
os.flush();
os.close();
With this code you'll get a PDF where the whole table fits on a landscape page (you may have to change the width / height for your needs).
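If the page size has to be chosen at run time, the @page rule can be built and injected with plain string handling before the HTML is handed to the renderer. The following is only a minimal sketch using the JDK alone: PageSizeCss, pageRule and injectIntoHead are hypothetical helper names, not part of Flying Saucer or Jsoup, and the sketch assumes the document contains a closing head tag.

```java
import java.util.Locale;

final class PageSizeCss {
    private static final String HEAD_CLOSE = "</head>";

    // Builds an @page rule; width and height are given in centimetres.
    static String pageRule(double widthCm, double heightCm) {
        if (widthCm <= 0 || heightCm <= 0) {
            throw new IllegalArgumentException("page dimensions must be positive");
        }
        // Locale.ROOT keeps the decimal separator a dot, as CSS requires.
        return String.format(Locale.ROOT, "@page { size: %.1fcm %.1fcm; }", widthCm, heightCm);
    }

    // Inserts a <style> element just before the closing head tag.
    // Returns the HTML unchanged when no closing head tag is found.
    static String injectIntoHead(String html, String css) {
        for (int i = 0; i + HEAD_CLOSE.length() <= html.length(); i++) {
            if (html.regionMatches(true, i, HEAD_CLOSE, 0, HEAD_CLOSE.length())) {
                String style = "<style type=\"text/css\">" + css + "</style>";
                return html.substring(0, i) + style + html.substring(i);
            }
        }
        return html;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title></head><body></body></html>";
        System.out.println(injectIntoHead(html, pageRule(50.0, 20.0)));
    }
}
```

The resulting string could then be passed to renderer.setDocumentFromString(...) as in the code above.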
We are writing Java code to read Word documents (.docx) into our program using Apache POI.
We are stuck when we encounter formulas and chemical equations inside the document.
We managed to read the formulas, but we have no idea how to locate their index in the surrounding string.
INPUT (format is *.docx)
text before formulae **CHEMICAL EQUATION** text after
OUTPUT (format shall be HTML) we designed
text before formulae text after **CHEMICAL EQUATION**
We are unable to fetch the string and reconstruct to its original form.
Question
Is there any way to locate the position of images and formulas within the stripped line, so that they can be restored to their original place when the string is reconstructed, instead of being appended at the end of the string?
If the needed format is HTML, then the Word text content together with the Office MathML equations can be read in the following way.
In Reading equations & formula from Word (Docx) to html and save database using java I have provided an example which gets all Office MathML equations out of a Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of their text context.
If one wants to get those OMath elements in context with the other elements in the paragraph, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The following example uses the XmlCursor to get text runs together with OMath elements from the paragraph.
The transformation from Office MathML into MathML is done using the same XSLT approach as in Reading equations & formula from Word (Docx) to html and save database using java. That answer also describes where the OMML2MML.XSL comes from.
The file Formula.docx looks like:
Code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.apache.xmlbeans.XmlCursor;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;
/*
needs the full ooxml-schemas-1.4.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadTextWithFormulasAsHTML {
static File stylesheet = new File("OMML2MML.XSL");
static TransformerFactory tFactory = TransformerFactory.newInstance();
static StreamSource stylesource = new StreamSource(stylesheet);
//method for getting MathML from oMath
static String getMathML(CTOMath ctomath) throws Exception {
Transformer transformer = tFactory.newTransformer(stylesource);
Node node = ctomath.getDomNode();
DOMSource source = new DOMSource(node);
StringWriter stringwriter = new StringWriter();
StreamResult result = new StreamResult(stringwriter);
transformer.setOutputProperty("omit-xml-declaration", "yes");
transformer.transform(source, result);
String mathML = stringwriter.toString();
stringwriter.close();
//The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
//We don't need these since we want to use the MathML in HTML, not in XML.
//So ideally we should change the OMML2MML.XSL to not do so.
//But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
mathML = mathML.replaceAll("xmlns:mml", "xmlns");
mathML = mathML.replaceAll("mml:", "");
return mathML;
}
//method for getting HTML including MathML from XWPFParagraph
static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {
StringBuffer textWithFormulas = new StringBuffer();
//using a cursor to go through the paragraph from top to bottom
XmlCursor xmlcursor = paragraph.getCTP().newCursor();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
//elements w:r are text runs within the paragraph
//simply append the text data
textWithFormulas.append(xmlcursor.getTextValue());
} else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
//we have oMath
//append the oMath as MathML
textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
}
} else if (tokentype.isEnd()) {
//we have to check whether we are at the end of the paragraph
xmlcursor.push();
xmlcursor.toParent();
if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
break;
}
xmlcursor.pop();
}
}
return textWithFormulas.toString();
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));
//using a StringBuffer for appending all the content as HTML
StringBuffer allHTML = new StringBuffer();
//loop over all IBodyElements - should be self-explanatory
for (IBodyElement ibodyelement : document.getBodyElements()) {
if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
allHTML.append("<p>");
allHTML.append(getTextAndFormulas(paragraph));
allHTML.append("</p>");
} else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
XWPFTable table = (XWPFTable)ibodyelement;
allHTML.append("<table border=1>");
for (XWPFTableRow row : table.getRows()) {
allHTML.append("<tr>");
for (XWPFTableCell cell : row.getTableCells()) {
allHTML.append("<td>");
for (XWPFParagraph paragraph : cell.getParagraphs()) {
allHTML.append("<p>");
allHTML.append(getTextAndFormulas(paragraph));
allHTML.append("</p>");
}
allHTML.append("</td>");
}
allHTML.append("</tr>");
}
allHTML.append("</table>");
}
}
document.close();
//creating a sample HTML file
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("result.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
writer.write("<!DOCTYPE html>\n");
writer.write("<html lang=\"en\">");
writer.write("<head>");
writer.write("<meta charset=\"utf-8\"/>");
//using MathJax for helping all browsers to interpret MathML
writer.write("<script type=\"text/javascript\"");
writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
writer.write(">");
writer.write("</script>");
writer.write("</head>");
writer.write("<body>");
writer.write(allHTML.toString());
writer.write("</body>");
writer.write("</html>");
writer.close();
Desktop.getDesktop().browse(new File("result.html").toURI());
}
}
Result:
Just tested this code using apache poi 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for apache poi 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 for what ooxml libraries are needed for what apache poi version.
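The document-order idea behind the XmlCursor loop can also be tried without POI. The following is only a minimal sketch using the JDK's DOM parser on a hand-written paragraph fragment: ParagraphOrderSketch, flatten, sampleParagraph and the [FORMULA] placeholder are hypothetical, and only direct children of the paragraph are visited, so formulas nested in other elements (for example inside m:oMathPara) would need extra handling. In the real code the placeholder would be replaced by the MathML produced by getMathML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ParagraphOrderSketch {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // A hand-written stand-in for the XML of one Word paragraph.
    static String sampleParagraph() {
        return "<w:p xmlns:w=\"" + W_NS + "\" xmlns:m=\"" + M_NS + "\">"
            + "<w:r><w:t>text before </w:t></w:r>"
            + "<m:oMath><m:r><m:t>H2O</m:t></m:r></m:oMath>"
            + "<w:r><w:t> text after</w:t></w:r>"
            + "</w:p>";
    }

    // Walks the direct children of the paragraph in document order, so the
    // formula placeholder lands between the text runs that surround it.
    static String flatten(String paragraphXml) {
        Document dom;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            dom = factory.newDocumentBuilder().parse(new InputSource(new StringReader(paragraphXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse paragraph XML", e);
        }
        StringBuilder out = new StringBuilder();
        for (Node child = dom.getDocumentElement().getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            if (W_NS.equals(child.getNamespaceURI()) && "r".equals(child.getLocalName())) {
                out.append(child.getTextContent()); // text run
            } else if (M_NS.equals(child.getNamespaceURI()) && "oMath".equals(child.getLocalName())) {
                out.append("[FORMULA]"); // the MathML would be inserted here
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(flatten(sampleParagraph())); // prints: text before [FORMULA] text after
    }
}
```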
XWPFParagraph paragraph; // the paragraph currently being processed
String formulas = "";
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
formulas = formulas + getMathML(ctomath);
}
With the above code I am able to extract the math formulas from a given paragraph of a docx file.
To display the formulas in an HTML page, I am converting them to MathML and rendering them with MathJax on the page. This works.
But the problem is: is it possible to get the position of a formula within the given paragraph, so that I can display the formula at its exact location in the paragraph when rendering it as an HTML page?
I have a PDF file (part of it is shown below) and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file. (It did work with another file that has simple text.)
What other OCR libraries are capable of doing it?
Please Help.
Thank you.
I tried with PDFBox and it produced satisfactory results.
Here is the code to extract text from PDF using PDFBox:
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.util.*;
public class PDFTest {
public static void main(String[] args){
PDDocument pd;
BufferedWriter wr;
try {
File input = new File("C:/BillOCR/data/bill.pdf"); // The PDF file from where you would like to extract
File output = new File("D:/SampleText.txt"); // The text file where you are going to store the extracted data
pd = PDDocument.load(input);
System.out.println(pd.getNumberOfPages());
System.out.println(pd.isEncrypted());
pd.save("CopyOfBill.pdf"); // Creates a copy called "CopyOfBill.pdf"
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1); // Start extracting from page 1
stripper.setEndPage(1); // Stop extracting after page 1
wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
stripper.writeText(pd, wr);
if (pd != null) {
pd.close();
}
// I use close() to flush the stream.
wr.close();
} catch (Exception e){
e.printStackTrace();
}
}
}
I've successfully converted JPEG to PDF using Java, but I don't know how to convert PDF to Word using Java. The code for converting JPEG to PDF is given below.
Can anyone tell me how to convert PDF to Word (.doc / .docx) using Java?
import java.io.FileOutputStream;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;
public class JpegToPDF {
public static void main(String[] args) {
try {
Document convertJpgToPdf = new Document();
PdfWriter.getInstance(convertJpgToPdf, new FileOutputStream(
"c:\\java\\ConvertImagetoPDF.pdf"));
convertJpgToPdf.open();
Image convertJpg = Image.getInstance("c:\\java\\test.jpg");
convertJpgToPdf.add(convertJpg);
convertJpgToPdf.close();
System.out.println("Successfully Converted JPG to PDF in iText");
} catch (Exception i1) {
i1.printStackTrace();
}
}
}
In fact, you need two libraries. Both are open source. The first one is iText, which is used to extract the text from the PDF file. The second one is POI, which is used to create the Word document.
The code is quite simple:
//Create the word document
XWPFDocument doc = new XWPFDocument();
// Open the pdf file
String pdf = "myfile.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// Read the PDF page by page
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
// Extract the text
String text=strategy.getResultantText();
// Create a new paragraph in the word document, adding the extracted text
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
run.setText(text);
// Adding a page break
run.addBreak(BreakType.PAGE);
}
// Write the word document
FileOutputStream out = new FileOutputStream("myfile.docx");
doc.write(out);
// Close all open files
out.close();
reader.close();
Beware: with the extraction strategy used here, you will lose all formatting. You can work around this by plugging in your own, more complex extraction strategy.
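One related detail, as far as I know: line breaks inside a string passed to run.setText are not rendered as line breaks by Word, so the extracted page text tends to collapse into one block. A minimal sketch of a workaround, assuming the extracted text uses ordinary line terminators, is to split it into lines first and then write one line at a time. ExtractedTextLines and splitLines are hypothetical names, and the POI calls appear only as comments so that the block runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractedTextLines {
    // Splits page text on any line terminator (\n, \r\n, \r, ...).
    static List<String> splitLines(String pageText) {
        List<String> lines = new ArrayList<>();
        if (pageText == null || pageText.isEmpty()) {
            return lines;
        }
        for (String line : pageText.split("\\R")) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line\r\nthird line";
        for (String line : splitLines(text)) {
            // With POI this would be roughly:
            //   run.setText(line);
            //   run.addBreak();
            System.out.println(line);
        }
    }
}
```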
You can use the 7-PDF library.
Have a look at this, it may help:
http://www.7-pdf.de/sites/default/files/guide/manuals/library/index.html
PS: iText has some issues when the given file is a non-RGB image, so try this out!
Although it's far from being a pure Java solution, OpenOffice/LibreOffice allows one to connect to it through a TCP port, and it's possible to use that to convert documents. If this looks like an acceptable solution, JODConverter can help you.
Below is the code that I used to convert a Word document to PDF. After compiling and running the code, the PDF file is generated, but it contains some junk characters along with the Word document content. Please help me understand what I should change to get rid of the junk characters.
The code I used is:
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
public class PdfConverter
{
private void createPdf(String inputFile, String outputFile)//, boolean isPictureFile)
{
Document pdfDocument = new Document();
String pdfFilePath = outputFile;
try
{
FileOutputStream fileOutputStream = new FileOutputStream(pdfFilePath);
PdfWriter writer = null;
writer = PdfWriter.getInstance(pdfDocument, fileOutputStream);
writer.open();
pdfDocument.open();
/*if (isPictureFile)
{
pdfDocument.add(com.lowagie.text.Image.getInstance(inputFile));
}
else
{ */
File file = new File(inputFile);
pdfDocument.add(new Paragraph(org.apache.commons.io.FileUtils.readFileToString(file)));
//}
pdfDocument.close();
writer.close();
System.out.println("PDF has been generated");
}
catch (Exception exception)
{
System.out.println("Document Exception!" + exception);
}
}
public static void main(String args[])
{
PdfConverter pdfConversion = new PdfConverter();
pdfConversion.createPdf("C:/test.doc", "C:/test.pdf");//, true);
}
}
Thanks for your help.
Naming your class PdfConverter does not make it one. All you do is read the binary content of the file as a String and write it out as one paragraph (and that's what you see). This approach will definitely not be successful. See https://stackoverflow.com/questions/437394 for a similar question.
If you are interested only in the content of your Word document, you might want to give Apache POI - the Java API for Microsoft Documents a try, to read the document not at the binary level but at a higher level of abstraction. If your Word document has a simple (and I mean a really simple) structure, you might get reasonable results.
To do this, you will have to read the doc file correctly and then use the data you read to create the PDF file.
What you are doing right now is reading the raw data of the doc file, which yields garbage values because you are not using a proper API to read it, and then storing that garbage in the PDF file. Hence the issue.
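To see why the raw bytes are not text, it can help to look at the first bytes of the file. Legacy .doc files are OLE2 compound files that start with the bytes D0 CF 11 E0 A1 B1 1A E1, and .docx files are ZIP archives that start with PK\u0003\u0004. The following is only a minimal sketch with the JDK alone; OfficeFileSniffer and detect are hypothetical names, and the check only looks at the container signature, not at whether the content really is a Word document.

```java
import java.nio.charset.StandardCharsets;

final class OfficeFileSniffer {
    // Signature of an OLE2 compound file (used by legacy .doc, .xls, .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Signature of a ZIP local file header (used by .docx, .xlsx, .pptx).
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // Returns "OLE2", "ZIP" or "UNKNOWN" depending on the leading bytes.
    static String detect(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipHeader = "PK\u0003\u0004".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(zipHeader));
        System.out.println(detect("plain text".getBytes(StandardCharsets.UTF_8)));
    }
}
```

A file that reports OLE2 or ZIP here has to be opened with a library that understands the container (such as POI), rather than read as a String.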
I converted a .docx file to a .pdf file. The text converts fine, but the images in the .docx file do not appear; instead they are shown as some special characters. Below is my code:
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
public class PDFConversion {
/**
* This method is used to convert the given file to a PDF format
*
* @param inputFile
* - Name and the path of the file
* @param outputFile
* - Name and the path where the PDF file to be saved
* @param isPictureFile
*/
private void createPdf(String inputFile, String outputFile, boolean isPictureFile) {
/**
* Create a new instance for Document class
*/
Document pdfDocument = new Document();
String pdfFilePath = outputFile;
try {
FileOutputStream fileOutputStream = new FileOutputStream(pdfFilePath);
PdfWriter writer = null;
writer = PdfWriter.getInstance(pdfDocument, fileOutputStream);
writer.open();
pdfDocument.open();
/**
* Proceed if the file given is a picture file
*/
if (isPictureFile) {
pdfDocument.add(com.lowagie.text.Image.getInstance(inputFile));
}
/**
* Proceed if the file given is (.txt,.html,.doc etc)
*/
else {
File file = new File(inputFile);
pdfDocument.add(new Paragraph(org.apache.commons.io.FileUtils
.readFileToString(file)));
}
pdfDocument.close();
writer.close();
} catch (Exception exception) {
System.out.println("Document Exception!" + exception);
}
}
public static void main(String args[]) {
PDFConversion pdfConversion = new PDFConversion();
pdfConversion.createPdf("C:/Users/LENOVO/Downloads/The_JFileChooser_Component.doc",
"E:/The_JFileChooser_Component.pdf", false);
// For other files
// pdfConversion.createPdf("C:/shunmuga/sample.html",
// "C:/shunmuga/sample.pdf", false);
}
}
I'm not sure what it could be, but for some alternatives have a look at:
Aspose.Words library for Java: it has some really cool features, one of them being docx to PDF conversion in a few simple lines (and it's reliable):
Document doc = new Document("d:/test/mydoc.docx");
doc.save("d:/test/Out.pdf", SaveFormat.PDF);
Docx4j
which can be used to convert docx and many other formats to PDF. It does this by first producing HTML/XML (based on iText) and then converting that to a PDF (all libraries are included within docx4j; I just added the iText link for completeness):
org.docx4j.convert.out.pdf.PdfConversion c
= new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);//using xml
// = new org.docx4j.convert.out.pdf.viaHTML.Conversion(wordMLPackage);//using html
// = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);//using itext libs
If that's not enough it has sample source code for you to try.
xdocreport also comes with a lot of samples for conversion (I haven't downloaded them, but they should include the doc/docx to PDF converter source).
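As for why the images show up as special characters with the code in the question: a .docx file is a ZIP archive in which the images are stored as separate binary parts, normally under word/media/, so reading the whole file as one String mixes compressed and binary data into the text. The following is only a minimal sketch with the JDK alone that lists those parts; DocxMediaLister, listMedia and fakeDocx are hypothetical names, and fakeDocx builds a tiny stand-in archive in memory so that the block runs without a real document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxMediaLister {
    // Returns the names of all archive entries stored under word/media/.
    static List<String> listMedia(InputStream docx) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().startsWith("word/media/")) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Builds a small in-memory archive that mimics the layout of a .docx file.
    static byte[] fakeDocx() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/media/image1.png"));
            zip.write(new byte[] {1, 2, 3}); // placeholder bytes, not a real image
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(listMedia(new ByteArrayInputStream(fakeDocx())));
    }
}
```

For a real file, a FileInputStream on the .docx could be passed to listMedia instead of the in-memory archive.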