docx4j adding style to paragraph destroys document

I'm trying to add a paragraph containing a styled headline followed by some plain, unformatted text. The following code corrupts the document.
Edit: After running the code and opening the document in Word, I get the error message "Unspecified Error", Location: Part /word/document.xml, Line 1, Column 0.
ObjectFactory factory = new ObjectFactory();
P complete = factory.createP();
org.docx4j.wml.P headline=factory.createP();
R hrun = factory.createR();
Text htxt = new Text();
hrun.getContent().add(htxt);
htxt.setValue(View_Beta.this.falseAlarmChoice.getSelectedItem().toString());
headline.getContent().add(hrun);
org.docx4j.wml.PPr pPr = factory.createPPr();
headline.setPPr(pPr);
org.docx4j.wml.PPrBase.PStyle pStyle = factory.createPPrBasePStyle();
pPr.setPStyle(pStyle);
pStyle.setVal("Title");
complete.getContent().add(headline);
P ptext = factory.createP();
R rtext = factory.createR();
Text ttext = new Text();
rtext.getContent().add(ttext);
ptext.getContent().add(rtext);
ttext.setValue(falseAlarmChoice.getSelectedItem()
+ falseAlarmDsc.getText());
complete.getContent().add(ptext);
//add to document context
View_Beta.this.c.insertAtPos(complete,
paragraphlst.getSelectedIndex());

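In your code,
complete.getContent().add(headline)
adds a paragraph inside a paragraph, which is invalid per the Open XML spec. The same applies to complete.getContent().add(ptext). Insert headline and ptext into the document as two separate paragraphs instead of wrapping them in complete.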

Related

Itext pdf - text alignment to right

I am using the iText PDF API to generate a PDF, and I am trying to align some text to the right-hand side of the page. I tried padding the text manually with spaces, but that does not work (code below). If there is a way to do this dynamically, that would be great.
String dest = "\\location\\";
PdfWriter writer;
writer = new PdfWriter(dest);
// Creating a PdfDocument
PdfDocument pdf = new PdfDocument(writer);
// Creating a Document
Document document = new Document(pdf);
// Creating a String
String para1 = "TEXT";
//Spacing length
while (para1.length() < 50) {
para1 = " " + para1;
}
//Creating Paragraphs
Paragraph paragraph1 = new Paragraph(para1);
//paragraph1.setAlignment(Element.ALIGN_CENTER);
//Adding Paragraphs to document
document.add(paragraph1);
// Closing the document
document.close();
Thanks in advance!

In iText 7, the class com.itextpdf.layout.element.Paragraph has a setTextAlignment method. I hope this is what you are looking for:
...
paragraph1.setTextAlignment(TextAlignment.RIGHT);
...

I'm using com.itextpdf:itextpdf:5.5.10, where the API is slightly different:
paragraph1.setAlignment(com.itextpdf.text.Element.ALIGN_RIGHT);

Text associated to PDF paragraph in document content object with PDFBox

I'm trying to get the text associated with a paragraph while navigating through the content tree of a PDF file. I am using PDFBox and cannot find the link between the paragraph and the text that it contains (see code below):
public class ReadPdf {
    public static void main(String[] args) throws IOException {
        MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File(
                "C:/Users/wip.txt")));
        RandomAccessFile raf = new RandomAccessFile(new File(
                "C:/Users/mypdf.pdf"), "r");
        PDFParser parser = new PDFParser(raf);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        out.write(cosDoc.getXrefTable().toString());
        out.write(cosDoc.getObjects().toString());
        PDDocument document = parser.getPDDocument();
        document.getClass();
        COSParser cosParser = new COSParser(raf);
        PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();
        for (Object kid : treeRoot.getKids()) {
            for (Object kid2 : ((PDStructureElement) kid).getKids()) {
                PDStructureElement kid2c = (PDStructureElement) kid2;
                // Compare strings with equals(), not ==
                if ("P".equals(kid2c.getStandardStructureType())) {
                    for (Object kid3 : kid2c.getKids()) {
                        if (kid3 instanceof PDStructureElement) {
                            PDStructureElement kid3c = (PDStructureElement) kid3;
                        } else {
                            for (Entry<COSName, COSBase> entry : kid2c.getCOSObject().entrySet()) {
                                // Print all the keys in the paragraph COSDictionary
                                System.out.println(entry.getKey().toString());
                                System.out.println(entry.getValue().toString());
                            }
                        }
                    }
                }
            }
        }
    }
}
When I print the contents I get the following Keys:
/P : Reference to Parent
/A : Format of the paragraph
/K : Position of the paragraph in the section
/C : Name of the paragraph (!= text)
/Pg : Reference to the page
Example output:
COSName{K}
COSInt{2}
COSName{Pg}
COSObject{12, 0}
COSName{C}
COSName{Normal}
COSName{A}
COSObject{434, 0}
COSName{S}
COSName{Normal}
COSName{P}
COSObject{421, 0}
None of these points to the actual text inside the paragraph.
I know the relation can be recovered, because Acrobat shows it when I open the document.

I found a way to do this by parsing the content stream of a page.
According to chapter 10.6.3 of the PDF specification, each piece of text in the content stream is tagged with /P and /MCID, and that MCID number matches an attribute of the tag (PDStructureElement in PDFBox), which can be read from its COSObject.
1) To get the text and the MCID:
PDPage pdPage; // the page to read, obtained elsewhere
String tokenString = "";
Iterator<PDStream> inputStream = pdPage.getContentStreams();
while (inputStream.hasNext()) {
    try {
        PDFStreamParser parser2 = new PDFStreamParser((PDStream) inputStream.next());
        parser2.parse();
        List<Object> tokens = parser2.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            tokenString = tokenString + tokens.get(j).toString();
        }
        // Parse the string here. Chapter 5 of the specification explains the
        // operators Tj (actual text), Tm, BDC, BT, ET and EMC, as well as MCID.
    } catch (IOException e) {
        e.printStackTrace();
    }
}
2) Then, to get the tags and the attribute that matches the MCID:
PDStructureElement pDStructureElement; // a tag from the structure tree
pDStructureElement.getCOSObject().getInt(COSName.K);
That should do it. In documents without tags (document.getDocumentCatalog().getStructureTreeRoot() has no children) this match cannot be performed, but the text can still be read using step 1.
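The string parsing in step 1 could look roughly like the following. This is only a simplified sketch with plain java.util.regex, not PDFBox code: it assumes the input is the raw content-stream text, that each marked-content dictionary contains nothing but the MCID, and that text is shown with simple (…) Tj operators (no TJ arrays, escaped parentheses or nested marked content). The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class McidTextSketch {
    // One marked-content sequence: "<</MCID n>> BDC ... EMC" (simplified assumption)
    private static final Pattern MARKED =
            Pattern.compile("/MCID\\s+(\\d+)\\s*>>\\s*BDC(.*?)EMC", Pattern.DOTALL);
    // One text-showing operator: "(some text) Tj" (simplified assumption)
    private static final Pattern SHOWN =
            Pattern.compile("\\((.*?)\\)\\s*Tj", Pattern.DOTALL);

    /** Maps each MCID found in the content stream to the text shown inside it. */
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher marked = MARKED.matcher(contentStream);
        while (marked.find()) {
            int mcid = Integer.parseInt(marked.group(1));
            StringBuilder text = new StringBuilder();
            Matcher shown = SHOWN.matcher(marked.group(2));
            while (shown.find()) {
                text.append(shown.group(1));
            }
            // If the same MCID occurs more than once, concatenate its text
            result.merge(mcid, text.toString(), String::concat);
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "/P <</MCID 0>> BDC BT /F1 12 Tf (Hello ) Tj (world) Tj ET EMC\n"
                + "/P <</MCID 2>> BDC BT (Second) Tj ET EMC";
        System.out.println(textByMcid(stream));
    }
}
```

The text for a given tag would then be looked up with the value read from /K in step 2, e.g. textByMcid(stream).get(2) for the paragraph whose /K is 2.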

Converting PDF document containing graphs and tables to Word Document

I am trying to convert a PDF document to a Word file using Java. I found a code snippet online that converts a PDF to Word, but the layout of the resulting Word document is poor: images, tables and graphs are out of place, and everything is displayed as plain strings of paragraphs and words.
The code I have written is given below.
XWPFDocument doc = new XWPFDocument();
String pdf = "D:\\xyz.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    TextExtractionStrategy strategy = (TextExtractionStrategy)
            parser.processContent(i, new SimpleTextExtractionStrategy());
    String text = strategy.getResultantText();
    XWPFParagraph p = doc.createParagraph();
    XWPFRun run = p.createRun();
    run.setText(text);
    run.addBreak(BreakType.PAGE);
}
Any help would be appreciated.

How do you find/replace a placeholder in a .docx file with Apache POI?

I have a file, "template.docx", that should contain placeholders (e.g. [serial number]) that can be replaced with a string or maybe a table. I am using Apache POI and no, I cannot use docx4j.
Is there a way to have the program iterate over all occurrences of "[serial number]" and replace them with a string? Many of these tags will be inside a large table, so is there an Apache POI equivalent of pressing Ctrl+F in Word and using Replace All?
Any suggestions would be appreciated, thanks.

An XWPFDocument (docx) has different kinds of sub-elements, such as XWPFParagraph, XWPFTable and XWPFNumbering.
Once you create the XWPFDocument object via:
document = new XWPFDocument(inputStream);
you can iterate through all of its paragraphs:
document.getParagraphsIterator();
Each paragraph consists of multiple XWPFRuns, which are blocks of text with the same styling. Sometimes text with identical styling is still split across several XWPFRuns; if that happens, look into ways of avoiding the split, so that you can identify your placeholders without having to merge multiple runs within the same paragraph. From here on, assume that a placeholder is never split across runs. You can then iterate over the XWPFRuns of each paragraph and look for text matching your placeholder, something like this:
XWPFParagraph para = (XWPFParagraph) xwpfParagraphElement;
for (XWPFRun run : para.getRuns()) {
if (run.getText(0) != null) {
String text = run.getText(0);
Matcher expressionMatcher = expression.matcher(text);
if (expressionMatcher.find() && expressionMatcher.groupCount() > 0) {
System.out.println("Expression Found...");
}
}
}
Here expressionMatcher is a Matcher based on a regular expression for the particular placeholder. Try a regex that also matches optional text before and after the placeholder, e.g. \([]*)(PlaceHolderGroup)([]*)^; in my experience this works best.
Once you find the right XWPFRun, extract the text of interest from it and build the replacement text, which should be easy enough. Then overwrite the run's previous text with the new text:
run.setText(text, 0);
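The match-and-replace step on the text of a single run can be sketched with plain java.util.regex as follows. This is a minimal sketch that works on a String rather than on an XWPFRun; the class name, method name and sample values are hypothetical. It assumes the placeholder is a literal such as [serial number], so Pattern.quote is used to keep the square brackets from being read as a character class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderReplace {
    /**
     * Replaces every literal occurrence of placeholder in runText with value.
     * Hypothetical helper: runText stands for the result of run.getText(0).
     */
    static String replacePlaceholder(String runText, String placeholder, String value) {
        // Quote the placeholder so "[" and "]" are matched literally
        Pattern pattern = Pattern.compile(Pattern.quote(placeholder));
        // Quote the replacement so "$" and "\" in the value are not treated as group references
        return pattern.matcher(runText).replaceAll(Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        System.out.println(replacePlaceholder("S/N: [serial number]", "[serial number]", "A-1042"));
    }
}
```

The returned string is what would be passed to run.setText(text, 0) for that run.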
If you instead want to replace the whole XWPFRun with a completely new XWPFRun, or insert a new paragraph or table after the paragraph that owns the run, you will probably run into two problems: (a) a ConcurrentModificationException, because you cannot modify the list of XWPFRuns while iterating over it, and (b) finding the position at which to insert the new element. To resolve both, collect the paragraphs after which a new element is to be inserted in a List<XWPFParagraph>. Once you have that list, iterate over it, and for each paragraph get a cursor and insert the new element at that cursor:
for (XWPFParagraph para: paras) {
XmlCursor cursor = (XmlCursor) para.getCTP().newCursor();
XWPFTable newTable = para.getBody().insertNewTbl(cursor);
//Generate your XWPF table based on what's inside para with your own logic
}
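The collect-first, modify-later idea is independent of POI and can be illustrated with ordinary lists. In this sketch plain strings stand in for paragraphs and the string "<table>" stands in for the inserted element; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredInsert {
    /**
     * Returns a copy of paragraphs with "<table>" inserted after every
     * paragraph that contains marker.
     */
    static List<String> insertAfterMatches(List<String> paragraphs, String marker) {
        // Pass 1: read only, and remember where the matches are
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            if (paragraphs.get(i).contains(marker)) {
                hits.add(i);
            }
        }
        // Pass 2: modify, walking backwards so earlier indices stay valid
        List<String> result = new ArrayList<>(paragraphs);
        for (int k = hits.size() - 1; k >= 0; k--) {
            result.add(hits.get(k) + 1, "<table>");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("intro", "data: [t]", "outro");
        System.out.println(insertAfterMatches(doc, "[t]"));
    }
}
```

Because no list is changed while it is being iterated, no ConcurrentModificationException can occur; in the POI version the cursor plays the role of the remembered index.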
To create an XWPFTable, read this.
Hope this helps someone.

// Text nodes are w:t elements in the Word document
final String XPATH_TO_SELECT_TEXT_NODES = "//w:t";
try {
    // Open the input file
    String fileName = "test.docx";
    // split() takes a regex, so the dot has to be escaped
    String[] splited = fileName.split("\\.");
    File dir = new File("D:\\temp\\test.docx");
    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new FileInputStream(dir));
    // Build a list of "text" elements
    List<?> texts = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
    // Placeholder text -> replacement text
    HashMap<String, String> mappings = new HashMap<String, String>();
    mappings.put("1", "one");
    mappings.put("2", "two");
    // Loop through all "text" elements
    Text text = null;
    for (Object obj : texts) {
        text = (Text) ((JAXBElement<?>) obj).getValue();
        String textToReplace = text.getValue();
        if (mappings.containsKey(textToReplace)) {
            text.setValue(mappings.get(textToReplace));
        }
    }
    wordMLPackage.save(new java.io.File("D:/temp/forPrint.docx")); // your path
} catch (Exception e) {
    // Do not swallow errors silently
    e.printStackTrace();
}

Updating the text of a XWPFParagraph using Apache POI

I have been able to loop through all paragraphs in a document and get at their text, and I have read and understood how to create a document from scratch. But how can I update or replace the text in a paragraph? I can call createRun on a paragraph, but that just adds a new piece of text to it.
...
FileInputStream fis = new FileInputStream("Muu.docx");
XWPFDocument myDoc = new XWPFDocument(fis);
XWPFParagraph[] myParas = myDoc.getParagraphs();
...
My theory is that I need to get at the existing "run" in the paragraph I want to change (or delete the paragraph and add it again), but I cannot find methods to do that.

You can't change the text of an XWPFParagraph directly. An XWPFParagraph is made up of one or more XWPFRun instances, and these provide the way to set the text.
To change the text, your code would want to be something like:
public void changeText(XWPFParagraph p, String newText) {
    List<XWPFRun> runs = p.getRuns();
    // Remove every run except the first, going backwards so the indices stay valid
    for (int i = runs.size() - 1; i > 0; i--) {
        p.removeRun(i);
    }
    // Reuse the first run, or create one if the paragraph has none
    XWPFRun run = runs.isEmpty() ? p.createRun() : runs.get(0);
    run.setText(newText, 0);
}
That will ensure you only have one text run (the first one), and will replace all the text with what you provided.
