I'm developing an Arabic OCR application in Java that extracts Arabic text from images and then saves the text to a Microsoft Word file. For this I use the Apache POI library.
My problem is that the word order is fine when I extract the text, but when I save it to a Word file the word order gets messed up and looks mirrored.
For example:
But after saving it as a Word file:
Here is the code for saving the Word file:
public class SavingStringAsWordDoc {
File f=theGUI.toBeSavedWord;
public void saveAsWorddd (){
String st=TesseractPerformer.toBeShown;
try(FileOutputStream fout=new FileOutputStream(f);XWPFDocument docfile=new XWPFDocument()){
XWPFParagraph paraTit=docfile.createParagraph();
paraTit.setAlignment(ParagraphAlignment.LEFT);
XWPFRun paraTitRun=paraTit.createRun();
paraTitRun.setBold(true);
paraTitRun.setFontSize(15);
paraTit.setAlignment(ParagraphAlignment.RIGHT);
docfile.createParagraph().createRun().setText(st); //content to be written
docfile.write(fout); //adding to output stream
} catch(IOException e){
System.out.println("IO ERROR:"+e);
}
}
}
I noticed one thing which might help in understanding the problem:
if I copy the messed-up text from the Word file and then paste it using the (Keep Text Only) paste option, the order of the paragraph is fixed.
This needs bidirectional (bidi) text direction support, which is not yet implemented by default in XWPF of apache poi. But the underlying object org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr supports it. So we must get this underlying object from the XWPFParagraph and then switch Bidi on.
Example:
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;
public class CreateWord {
public static void main(String[] args) throws Exception {
String content = Files.readString(new File("ArabicTextFile.txt").toPath(), StandardCharsets.UTF_16);
XWPFDocument document = new XWPFDocument();
XWPFParagraph paragraph = document.createParagraph();
// set bidirectional text support on
CTP ctp = paragraph.getCTP();
CTPPr ctppr = ctp.getPPr();
if (ctppr == null) ctppr = ctp.addNewPPr();
ctppr.addNewBidi().setVal(STOnOff.ON);
XWPFRun run=paragraph.createRun();
run.setBold(true);
run.setFontSize(22);
run.setText(content);
FileOutputStream out = new FileOutputStream("CreateWord.docx");
document.write(out);
out.close();
document.close();
}
}
My ArabicTextFile.txt contains the text
هذا هو النص باللغة العربية لاختبار النص باللغة العربية
in UTF-16 encoding (Unicode).
Result in Word:
I've been trying to copy Hebrew data from Excel files into a document.
While the letters themselves were copied correctly, it got a bit messy whenever symbols were involved.
For example: instead of (text), I got )text(
This is my code so far:
XWPFParagraph newPara = document.insertNewParagraph(cursor);
newPara.setAlignment (ParagraphAlignment.RIGHT);
CTP ctp = newPara.getCTP();
CTPPr ctppr;
if ((ctppr = ctp.getPPr()) == null) ctppr = ctp.addNewPPr();
ctppr.addNewBidi().setVal(STOnOff.ON);
XWPFRun newParaRun = newPara.createRun();
newParaRun.setText(name);
I've tried some "bidirectional text direction support" (bidi) lines
(taken from here:
how change text direction(not paragraph alignment) in document in apache poi word?(XWPF) )
but that is not it, nor does it have to do with alignment...
With older word processing applications there seem to be problems when LTR characters and RTL characters are mixed in one text run. Using special BiDi control characters might then be the solution. See https://en.wikipedia.org/wiki/Bidirectional_text#Table_of_possible_BiDi_character_types.
See also bidirectional with word document using Aphace POI.
Using this the following works:
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;
public class CreateWordRTLParagraph {
public static void main(String[] args) throws Exception {
XWPFDocument doc= new XWPFDocument();
XWPFParagraph paragraph = doc.createParagraph();
XWPFRun run = paragraph.createRun();
run.setText("Paragraph 1 LTR");
paragraph = doc.createParagraph();
CTP ctp = paragraph.getCTP();
CTPPr ctppr;
if ((ctppr = ctp.getPPr()) == null) ctppr = ctp.addNewPPr();
ctppr.addNewBidi().setVal(STOnOff.ON);
run = paragraph.createRun();
String line = "(שָׁלוֹם)";
run.setText("\u202E" + line + "\u202C");
paragraph = doc.createParagraph();
run = paragraph.createRun();
run.setText("Paragraph 3 LTR");
FileOutputStream out = new FileOutputStream("WordDocument.docx");
doc.write(out);
out.close();
doc.close();
}
}
It puts U+202E RIGHT-TO-LEFT OVERRIDE (RLO) before the text line, which mixes LTR characters (( and )) with RTL characters (שָׁלוֹם), and U+202C POP DIRECTIONAL FORMATTING (PDF) after that text line. That tells the word processing software exactly where RTL starts and ends. This produces correct output for me in MS Word 365 and WordPad.
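The wrapping step can be pulled into a small helper. This is only a sketch: the class name BidiMarks and both method names are made up for illustration. overrideRtl does what the answer does (RLO ... PDF). embedRtl is an assumed alternative using U+202B RIGHT-TO-LEFT EMBEDDING, which sets an RTL context without forcing every character to RTL; it may suit lines that also contain Latin words or digits, but it has not been tested against Word here.

```java
final class BidiMarks {
    // Unicode directional formatting characters
    static final char RLE = '\u202B'; // RIGHT-TO-LEFT EMBEDDING
    static final char PDF = '\u202C'; // POP DIRECTIONAL FORMATTING
    static final char RLO = '\u202E'; // RIGHT-TO-LEFT OVERRIDE

    // Same approach as in the answer: force the whole line to be laid out RTL.
    static String overrideRtl(String text) {
        return RLO + text + PDF;
    }

    // Assumed alternative: open an RTL embedding and let the bidi algorithm
    // resolve the direction of each character inside it.
    static String embedRtl(String text) {
        return RLE + text + PDF;
    }
}
```

With this in place, the run in the example would be filled with run.setText(BidiMarks.overrideRtl(line)).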
As of apache poi 5.0.0, .setVal(STOnOff.ON) is no longer possible for Bidi, but .setVal(true) can be used:
//ctppr.addNewBidi().setVal(STOnOff.ON); // up to apache poi 4.1.2
ctppr.addNewBidi().setVal(true); // from apache poi 5.0.0 on
I am unable to add text containing blank lines as separate paragraphs to a Word document.
If I try to add the following text, which contains 3 different paragraphs:
Some text here.
Another text here.
Another one here.
What I get is "1. Some text here. 2. Another text here. 3. Another one here." as if they were all one paragraph.
Is it possible to add a text containing blank lines as separate paragraphs to a Word document using Apache POI?
public static void addingMyParagraphs(XWPFDocument doc, String text) throws InvalidFormatException, IOException {
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
run.setText(text);
run.setFontFamily("Times new Roman");
}
In the method below, the myText variable is a TextArea that is part of a JavaFX application.
public void CreatingDocument() throws IOException, InvalidFormatException {
String theText = myText.getText();
addingMyParagraphs(doc, theText);
FileOutputStream output = new FileOutputStream("MyDocument.docx");
doc.write(output);
output.close();
}
}
You need to split your text into "paragraphs" and add each paragraph separately to your WORD document. This has nothing to do with JavaFX.
Here is an example that uses text blocks to simulate the text that is entered into the [JavaFX] TextArea. Explanations after the code.
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class PoiWord0 {
public static void main(String[] args) {
String text = """
1. Some text here.
2. Another text here.
3. Another one here.
""";
String[] paras = text.split("(?m)^[ \\t]*\\r?\\n");
try (XWPFDocument doc = new XWPFDocument();
FileOutputStream output = new FileOutputStream("MyDocument.docx")) {
for (String para : paras) {
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
run.setText(para.stripTrailing());
}
doc.write(output);
}
catch (IOException xIo) {
xIo.printStackTrace();
}
}
}
I assume that a paragraph delimiter is a blank line, so I split the text on the blank lines. This still leaves the trailing newline character in each element of the array. I use stripTrailing() to remove that newline.
Now I have an array of paragraphs, so I simply add a new paragraph to the [WORD] document for each array element.
Note that the above code was written using JDK 15.
The regex for splitting the text came from the SO question entitled Remove empty line from a multi-line string with Java
try-with-resources was added in Java 7.
stripTrailing() was added in JDK 11
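For readers who want the splitting step on its own, without text blocks and without POI, here is a minimal sketch. The class and method names (ParagraphSplitter, splitOnBlankLines) are made up for illustration. It uses the same regex as above and additionally drops empty pieces, which can appear when the text starts with a blank line. It still needs JDK 11 because of stripTrailing().

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSplitter {
    // Splits text on blank lines (lines that are empty or hold only spaces/tabs).
    static List<String> splitOnBlankLines(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String part : text.split("(?m)^[ \\t]*\\r?\\n")) {
            // each piece still ends with its own newline, so strip it
            String paragraph = part.stripTrailing();
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }
}
```

Each element of the returned list can then be passed to its own doc.createParagraph().createRun().setText(...) call, as in the loop above.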
I'm trying to convert Word to PDF. My code is:
public static void main(String[] args) {
try {
XWPFDocument document = new XWPFDocument();
document.createStyles();
XWPFParagraph paragraph = document.createParagraph();
XWPFRun title = paragraph.createRun();
title.setText("gLETS GO");
PdfOptions options = PdfOptions.create();
OutputStream out = new FileOutputStream(new File("C:/Users/pepe/Desktop/DocxToPdf1.pdf"));
PdfConverter.getInstance().convert(document, out, options);
System.out.println("Done");
} catch (Exception e) {
e.printStackTrace();
}
}
I'm getting error:
fr.opensagres.poi.xwpf.converter.core.XWPFConverterException: org.apache.xmlbeans.XmlException: error: Unexpected end of file after null
at fr.opensagres.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:71)
at fr.opensagres.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:39)
Caused by: org.apache.xmlbeans.XmlException: error: Unexpected end of file
I have tried other solutions but they don't work. I created a Java project; can someone help me, or suggest another way to do this?
This is probably a duplicate of Trying to make simple PDF document with Apache poi. But let's have a complete example again to show how to create a new XWPFDocument from scratch using the latest apache poi 4.1.2, which can then be converted to PDF using PdfConverter of fr.opensagres.poi.xwpf.converter version 2.0.2 and iText.
As said there, the default *.docx documents created by apache poi lack some content which PdfConverter needs.
There must be a styles document, even if it is empty.
And there must be section properties for the page having at least the page size set. To fulfill this we must add some additional code to our program. Unfortunately this then needs the full jar of all of the schemas, ooxml-schemas-1.4.jar, as mentioned in Faq-N10025.
And because we need to change the underlying low-level objects, the document must be written so that the underlying objects are committed. Otherwise the XWPFDocument which we hand over to the PdfConverter will be incomplete.
Minimal complete working example:
import java.io.*;
import java.math.BigInteger;
//needed jars: fr.opensagres.poi.xwpf.converter.core-2.0.2.jar,
// fr.opensagres.poi.xwpf.converter.pdf-2.0.2.jar,
// fr.opensagres.xdocreport.itext.extension-2.0.2.jar,
// itext-4.2.1.jar
import fr.opensagres.poi.xwpf.converter.pdf.PdfOptions;
import fr.opensagres.poi.xwpf.converter.pdf.PdfConverter;
//needed jars: apache poi and its dependencies
// and additionally: ooxml-schemas-1.4.jar
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.util.Units;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
public class XWPFToPDFConverterSampleMin {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
// there must be a styles document, even if it is empty
XWPFStyles styles = document.createStyles();
// there must be section properties for the page having at least the page size set
CTSectPr sectPr = document.getDocument().getBody().addNewSectPr();
CTPageSz pageSz = sectPr.addNewPgSz();
pageSz.setW(BigInteger.valueOf(12240)); //12240 Twips = 12240/20 = 612 pt = 612/72 = 8.5"
pageSz.setH(BigInteger.valueOf(15840)); //15840 Twips = 15840/20 = 792 pt = 792/72 = 11"
// filling the body
XWPFParagraph paragraph = document.createParagraph();
XWPFRun title = paragraph.createRun();
title.setText("gLETS GO");
//document must be written so underlying objects will be committed
ByteArrayOutputStream out = new ByteArrayOutputStream();
document.write(out);
document.close();
document = new XWPFDocument(new ByteArrayInputStream(out.toByteArray()));
PdfOptions options = PdfOptions.create();
PdfConverter converter = (PdfConverter)PdfConverter.getInstance();
converter.convert(document, new FileOutputStream("XWPFToPDFConverterSampleMin.pdf"), options);
document.close();
}
}
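The page size in the example above is given in twips (1 pt = 20 twips, 1 inch = 72 pt), as the code comments work out for US Letter. For other paper sizes the same arithmetic can be wrapped in a small helper. This is a sketch only; the class name PageUnits and its methods are hypothetical and not part of apache poi.

```java
import java.math.BigInteger;

final class PageUnits {
    static final int TWIPS_PER_POINT = 20;
    static final int POINTS_PER_INCH = 72;

    // inches -> points -> twips, rounded to the nearest whole twip
    static BigInteger inchesToTwips(double inches) {
        return BigInteger.valueOf(Math.round(inches * POINTS_PER_INCH * TWIPS_PER_POINT));
    }

    // 1 inch = 25.4 mm
    static BigInteger millimetresToTwips(double millimetres) {
        return inchesToTwips(millimetres / 25.4);
    }
}
```

PageUnits.inchesToTwips(8.5) and PageUnits.inchesToTwips(11) give the 12240 and 15840 used above. For A4 (210 mm x 297 mm) the helper yields 11906 and 16838, which could be passed to pageSz.setW(...) and pageSz.setH(...) instead.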
I would not suggest using apache poi for this, since its library for converting Word to PDF has been discontinued. As of today I don't think there is any open source library that does the conversion on its own (they require dependencies; some need MS Word to be installed, etc.). The best way I can think of (it will only work if you are deploying the project on a Linux machine) is to install LibreOffice (open source) on the Linux machine and run this:
String command = "libreoffice --headless --convert-to pdf " + inputPath + " --outdir " + outputPath;
try {
Runtime.getRuntime().exec(command);
} catch (IOException e) {
e.printStackTrace();
}
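Two things to watch with the snippet above: Runtime.exec(String) splits the command on whitespace, so paths containing spaces break, and the call returns before the conversion has finished. A possible variant using ProcessBuilder is sketched below. It assumes, like the snippet above, that a libreoffice executable is on the PATH; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class LibreOfficeConverter {
    // Builds the same command line as above, one list element per argument,
    // so that paths containing spaces stay intact.
    static List<String> buildCommand(Path input, Path outputDir) {
        return Arrays.asList("libreoffice", "--headless", "--convert-to", "pdf",
                input.toString(), "--outdir", outputDir.toString());
    }

    // Starts LibreOffice, waits until it exits and returns its exit code.
    static int convert(Path input, Path outputDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outputDir))
                .inheritIO() // show LibreOffice's own messages on the console
                .start();
        return process.waitFor();
    }
}
```

A call such as LibreOfficeConverter.convert(Paths.get("in.docx"), Paths.get("out")) would then block until the PDF has been written or LibreOffice has failed; the file and directory names here are placeholders.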
We are writing Java code to read a Word document (.docx) into our program using Apache POI.
We are stuck when we encounter formulas and chemical equations inside the document.
We managed to read the formulas, but we have no idea how to locate their index in the surrounding string.
INPUT (format is *.docx)
text before formulae **CHEMICAL EQUATION** text after
OUTPUT we produce (format shall be HTML)
text before formulae text after **CHEMICAL EQUATION**
We are unable to fetch the string and reconstruct it in its original form.
Question
Is there any way to locate the position of the image and formulae within the stripped line, so that they can be restored to their original position when the string is reconstructed, instead of being appended at the end of the string?
If the needed format is HTML, then Word text content together with Office MathML equations can be read the following way.
In Reading equations & formula from Word (Docx) to html and save database using java I have provided an example which gets all Office MathML equations out of a Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of the text context.
If one wants to get those OMath elements in context with the other elements in the paragraph, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The following example uses the XmlCursor to get text runs together with OMath elements from the paragraph.
The transformation from Office MathML into MathML is done using the same XSLT approach as in Reading equations & formula from Word (Docx) to html and save database using java. That answer also describes where the OMML2MML.XSL comes from.
The file Formula.docx looks like:
Code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.apache.xmlbeans.XmlCursor;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;
/*
needs the full ooxml-schemas-1.4.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadTextWithFormulasAsHTML {
static File stylesheet = new File("OMML2MML.XSL");
static TransformerFactory tFactory = TransformerFactory.newInstance();
static StreamSource stylesource = new StreamSource(stylesheet);
//method for getting MathML from oMath
static String getMathML(CTOMath ctomath) throws Exception {
Transformer transformer = tFactory.newTransformer(stylesource);
Node node = ctomath.getDomNode();
DOMSource source = new DOMSource(node);
StringWriter stringwriter = new StringWriter();
StreamResult result = new StreamResult(stringwriter);
transformer.setOutputProperty("omit-xml-declaration", "yes");
transformer.transform(source, result);
String mathML = stringwriter.toString();
stringwriter.close();
//The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
//We don't need this since we want to use the MathML in HTML, not in XML.
//So ideally we should change the OMML2MML.XSL to not do so.
//But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
mathML = mathML.replaceAll("xmlns:mml", "xmlns");
mathML = mathML.replaceAll("mml:", "");
return mathML;
}
//method for getting HTML including MathML from XWPFParagraph
static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {
StringBuffer textWithFormulas = new StringBuffer();
//using a cursor to go through the paragraph from top to bottom
XmlCursor xmlcursor = paragraph.getCTP().newCursor();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
//elements w:r are text runs within the paragraph
//simply append the text data
textWithFormulas.append(xmlcursor.getTextValue());
} else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
//we have oMath
//append the oMath as MathML
textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
}
} else if (tokentype.isEnd()) {
//we have to check whether we are at the end of the paragraph
xmlcursor.push();
xmlcursor.toParent();
if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
break;
}
xmlcursor.pop();
}
}
return textWithFormulas.toString();
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));
//using a StringBuffer for appending all the content as HTML
StringBuffer allHTML = new StringBuffer();
//loop over all IBodyElements - should be self-explanatory
for (IBodyElement ibodyelement : document.getBodyElements()) {
if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
allHTML.append("<p>");
allHTML.append(getTextAndFormulas(paragraph));
allHTML.append("</p>");
} else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
XWPFTable table = (XWPFTable)ibodyelement;
allHTML.append("<table border=1>");
for (XWPFTableRow row : table.getRows()) {
allHTML.append("<tr>");
for (XWPFTableCell cell : row.getTableCells()) {
allHTML.append("<td>");
for (XWPFParagraph paragraph : cell.getParagraphs()) {
allHTML.append("<p>");
allHTML.append(getTextAndFormulas(paragraph));
allHTML.append("</p>");
}
allHTML.append("</td>");
}
allHTML.append("</tr>");
}
allHTML.append("</table>");
}
}
document.close();
//creating a sample HTML file
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("result.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
writer.write("<!DOCTYPE html>\n");
writer.write("<html lang=\"en\">");
writer.write("<head>");
writer.write("<meta charset=\"utf-8\"/>");
//using MathJax for helping all browsers to interpret MathML
writer.write("<script type=\"text/javascript\"");
writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
writer.write(">");
writer.write("</script>");
writer.write("</head>");
writer.write("<body>");
writer.write(allHTML.toString());
writer.write("</body>");
writer.write("</html>");
writer.close();
Desktop.getDesktop().browse(new File("result.html").toURI());
}
}
Result:
Just tested this code using apache poi 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for apache poi 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 for what ooxml libraries are needed for what apache poi version.
XWPFParagraph paragraph;
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
formulas=formulas + getMathML(ctomath);
}
With the above code I am able to extract the math formulas from a given paragraph of a docx file.
To display a formula in an HTML page I convert it to MathML and render it with MathJax on the page. This I am able to do.
But the problem is: is it possible to get the position of the formula within the given paragraph, so that I can display the formula at the exact location in the paragraph when rendering it as an HTML page?
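One way to keep the position is to walk the children of the paragraph element in document order, instead of asking for all oMath elements at once. The sketch below illustrates that idea with the JDK's DOM parser only, as a stand-in for the XmlBeans objects apache poi exposes. The class name, the method name and the [FORMULA] placeholder are assumptions for illustration, and it only looks at w:r and m:oMath elements directly below w:p (not m:oMathPara).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class ParagraphOrderSketch {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Returns the run text of one w:p element, with a placeholder at each formula position.
    static String textWithPlaceholders(String paragraphXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(paragraphXml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        NodeList children = dom.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) continue;
            if (W_NS.equals(child.getNamespaceURI()) && "r".equals(child.getLocalName())) {
                out.append(child.getTextContent()); // text run
            } else if (M_NS.equals(child.getNamespaceURI()) && "oMath".equals(child.getLocalName())) {
                out.append("[FORMULA]"); // here the converted MathML would be appended
            }
        }
        return out.toString();
    }
}
```

In a real program the placeholder line would append the MathML produced by getMathML(...) for that element, so the formula ends up between the surrounding runs rather than at the end of the string.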
I've successfully converted JPEG to PDF using Java, but I don't know how to convert PDF to Word using Java. The code for converting JPEG to PDF is given below.
Can anyone tell me how to convert PDF to Word (.doc/.docx) using Java?
import java.io.FileOutputStream;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;
public class JpegToPDF {
public static void main(String[] args) {
try {
Document convertJpgToPdf = new Document();
PdfWriter.getInstance(convertJpgToPdf, new FileOutputStream(
"c:\\java\\ConvertImagetoPDF.pdf"));
convertJpgToPdf.open();
Image convertJpg = Image.getInstance("c:\\java\\test.jpg");
convertJpgToPdf.add(convertJpg);
convertJpgToPdf.close();
System.out.println("Successfully Converted JPG to PDF in iText");
} catch (Exception i1) {
i1.printStackTrace();
}
}
}
In fact, you need two libraries. Both are open source. The first one is iText; it is used to extract the text from a PDF file. The second one is POI; it is used to create the Word document.
The code is quite simple:
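// Required imports (iText 5 and Apache POI)
import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import org.apache.poi.xwpf.usermodel.*;
//Create the word document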
XWPFDocument doc = new XWPFDocument();
// Open the pdf file
String pdf = "myfile.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// Read the PDF page by page
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
// Extract the text
String text=strategy.getResultantText();
// Create a new paragraph in the word document, adding the extracted text
XWPFParagraph p = doc.createParagraph();
XWPFRun run = p.createRun();
run.setText(text);
// Adding a page break
run.addBreak(BreakType.PAGE);
}
// Write the word document
FileOutputStream out = new FileOutputStream("myfile.docx");
doc.write(out);
// Close all open files
out.close();
reader.close();
Beware: with the extraction strategy used here, you will lose all formatting. But you can fix this by plugging in your own, more complex extraction strategy.
You can use the 7-PDF library.
Have a look at this, it may help:
http://www.7-pdf.de/sites/default/files/guide/manuals/library/index.html
PS: iText has some issues when the given file is a non-RGB image. Try this out!
Although it's far from being a pure Java solution, OpenOffice/LibreOffice allows one to connect to it through a TCP port; it's possible to use that to convert documents. If this looks like an acceptable solution, JODConverter can help you.