I have been struggling to follow a code sample from XDocReport (an open source project).
I followed this tutorial from the website:
https://code.google.com/p/xdocreport/wiki/DocxReportingJavaMainListFieldInTable
I used the Freemarker template style.
It would not iterate and create the table; I just get back $variable as text in the output doc.
Then I dug further and discovered that the tutorial on the website was probably not updated for the newer version. I found some more examples at this URL, which contains a zip file.
https://code.google.com/p/xdocreport/downloads/detail?name=docxandfreemarker-1.0.4-sample.zip
I still could not get it to work.
I was hoping someone would have a working code sample that takes a java collection and populates a table in a Word document.
I hope one of the developers of XDocReport, angelo.zerr, would give some input on this.
Sincerely,
P
What is the problem with https://code.google.com/p/xdocreport/wiki/DocxReportingJavaMainListFieldInTable?
I suggest that you create an issue on the XDocReport forum with a very simple case (a simple Java main + docx).
It seems that the issue was the template. If one sets up a mail merge field in a Word template and doesn't use it in the Java program, the program complains that it can't find the variable, or something to that effect. And if you just delete the mail merge text in the document, it may still exist as a mail merge field in the Word document.
So it seems one needs to be very careful about how things are set up in the template.
I think the API should be able to ignore a field that is set up in the template but not referenced in the code. Still, cleaning up the template solved the problem.
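One way to check a template for leftover fields is to look inside the docx itself. The following is a minimal sketch using only the JDK, not part of XDocReport: it assumes the main document part is stored at word/document.xml inside the docx zip and that each field instruction appears as contiguous text such as "MERGEFIELD name". The class and method names (MergeFieldScan, findMergeFields, readDocumentXml) are made up for illustration. Word can split one instruction across several runs, and this sketch would miss those cases.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the MERGEFIELD names still present in a docx template.
final class MergeFieldScan {

    // Matches "MERGEFIELD name" (optionally quoted) in the raw document XML.
    private static final Pattern MERGEFIELD =
            Pattern.compile("MERGEFIELD\\s+\"?([^\\s\"\\\\<]+)");

    // Returns the field names in document order, duplicates included.
    static List<String> findMergeFields(String documentXml) {
        List<String> names = new ArrayList<>();
        Matcher m = MERGEFIELD.matcher(documentXml);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Reads word/document.xml out of the docx (a docx file is a zip archive).
    static String readDocumentXml(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("No word/document.xml in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MergeFieldScan template.docx");
            return;
        }
        for (String name : findMergeFields(readDocumentXml(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Any name printed here that the Java program never puts into the report context is a candidate for the "can't find the variable" complaint described above.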
I have to do a project with OpenNLP, strictly in Italian. Since it's almost impossible to find existing resources for this language, my idea is to create a simple model myself. After reading some posts on this platform, I plan to try this using the model-builder addon.
First of all, is it possible to achieve my goal with this addon?
If so, referring to this other post, what kind of file is meant by "modelOutFile"? In my case I don't have an existing model.
N.B.: the addon uses some deprecated functions (such as nameFinderME.train()).
Naively, I tried to pass as a "modelOutFile" a simple empty file "model.bin", but, of course I bumped into an error:
Cannot invoke "java.util.Properties.getProperty(String)" because "manifest" is null
Furthermore, I used a few names and sentences for the test (I only wanted to know if this worked), not the large amount requested (15000 sentences at least).
I'm open to other suggestions instead of the use of modelbuilder addons.
Hope someone can help me.
I have tried with PDFTextStripperByArea and PDPageContentStream classes to extract the number values from my pdf file. They work fine!
But my requirement is to use PDFTable or PDFTableExtractor class to read the pdf contents. Can you tell me what is the maven dependency and jar file I need to use to access the above said classes?
Also mention the required methods to get the values from a particular position.
I have another doubt. Can we extract table-formatted data from a PDF file as it is? I mean the data in rows and columns, with the table lines. If a page contains some text and a table, can we read only the table headers and rows? I have uploaded my page to GitHub. From that image, I only need the values of Gross premium, GST and Total Payable. Please let me know whether it's possible.
First, don't use classes from the com.lowagie packages.
That code is old, obsolete and no longer supported. Furthermore, this code belonged to the very early version of iText.
Afterwards a thorough investigation was done into the intellectual property rights of all the code (since iText has had a lot of contributors). When you use the old code, you may (unknowingly) be using code for which you do not have the copyright.
Second, if you just want to solve the problem of extracting numbers and tables from a PDF document, have a look at pdf2Data. It's an iText add-on that makes things a lot easier.
It gives you a nice UI, where you can build templates for data extraction. Then you can call a single method to match an existing (XML) template against an input PDF document, and you'd get a datastructure that contains all the information about the match.
http://pdf2data.online/
PDFTable
I have found two PDFTable classes:
com.lowagie.text.pdf.PdfPTable
com.itextpdf.text.pdf.PdfPTable
Documentation for both classes (this may help you find the methods you need):
https://www.coderanch.com/how-to/javadoc/itext-2.1.7/com/lowagie/text/pdf/PdfPTable.html
http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfPTable.html
If you want to use these classes, you can copy the dependency into your pom.xml from:
https://mvnrepository.com/artifact/com.itextpdf/itextpdf
https://mvnrepository.com/artifact/com.lowagie/itext - As mentioned in this link, this artifact was moved to com.itextpdf
You can find examples of how to use these classes here:
https://developers.itextpdf.com/examples/itext-action-second-edition/chapter-4
https://www.programcreek.com/java-api-examples/index.php?api=com.lowagie.text.pdf.PdfPTable
What I want to do is take a Word doc/docx template that already has pre-designed headers and footers, and replace certain words with values that apply to that document, generated from user input that has been saved in MySQL. I already have a working program that gets the user input and saves it to MySQL. However, I'm a little confused about how the Word manipulation would fit into this.
I found docx4j and a tutorial that shows what I am looking for here, and have found example code on another question on this site here. As I'm a beginner with this, the things I'm confused about are:
I understand JAXB is used for converting to and from XML. Why is this relevant in a situation like this? Or if it's not, in what case would it be?
I am seeing two versions of loading:
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File("P:\\Engineering\\Projects\\Naming\\EX_TEMP.docx"));
........ and the second example:
private WordprocessingMLPackage getTemplate(String name) throws Docx4JException, FileNotFoundException {
    WordprocessingMLPackage template = WordprocessingMLPackage.load(new FileInputStream(new File(name)));
    return template;
}
(In the second version, where would you put the file directory, or how can you specify the file you want to load?)
What does hyperlinkresolver do and why is it necessary? (second link)
What is applying binding in this situation? (second link)
What is the content accessor? (first link)
Am I going about this the right way, or is there an easier/better way of doing this?
I am using Eclipse with Java on a Windows 7 if that helps.
I would appreciate any help, thanks!
Also if anyone has any examples with good comments or explanations, that would be helpful!
You probably ought to take a step back and decide which approach to injecting your data you want to take. Docx4j supports three approaches:
replacing variables on the document surface (brittle but simple)
mail merge (using MERGEFIELD), good for legacy documents
content control data binding (your 2nd link; the modern/sophisticated/powerful approach, but you need to understand XML, and may be overkill here)
For answers to most of your specific questions, please take the time to read docx4j's Getting Started guide.
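To make the first approach concrete, here is a minimal, library-free sketch of variable replacement on a plain string. It is not docx4j code: the ${key} placeholder syntax, the class name PlaceholderFill and the sample keys are assumptions for illustration. It also hints at why the approach is brittle: in a real docx, Word may split a placeholder such as ${customer} across several runs in the XML, so the literal text is no longer contiguous and a simple match fails.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fills ${key} placeholders in a string from a map.
final class PlaceholderFill {

    // Matches ${key}, capturing the key.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String fill(String text, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Placeholders with no value are left untouched so they stay visible.
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("customer", "Acme Ltd");
        // prints: Dear Acme Ltd, ref ${orderId}
        System.out.println(fill("Dear ${customer}, ref ${orderId}", values));
    }
}
```

In the scenario from the question, the map would be filled from the values stored in MySQL.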
I am working on content parsing. I ran a sample program against a sample page; please see the link below:
http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL&utm_source=top-menu&utm_medium=website&utm_campaign=performance&utm_content=key-sector
From the above page I parsed the table data and stored it in a Java object.
BSE and NSE are not my exact requirement; I have just taken them as a sample. The page is built with tables that have no ids or classes, so in my example I parsed the data using XPath.
This is my XPath:
/html/body/table[4]/tbody/tr/td/table[2]/tbody/tr[2]/td[2]/font/table[2]
I selected the table with this expression and the parsing works fine. The problem is that if they change the website structure in the future, my program will certainly stop working. Is there another way to parse the data dynamically, store it in a database, and display results based on a condition, even if the page structure changes? I used the JSOUP API for this. Are there other APIs that give better support for this type of requirement?
If you're trying to parse a page without any clear id/class to select your nodes, you have to rely on something else. Spelling out the whole tree is indeed the weakest way of doing it: if anything is added or changed, everything will collapse.
You could try relying on the color: //table[@bgcolor="#c9d0e0"], the "GET MORE INFO" field: //table[tr/td//text()="GET MORE INFO"], the "More Info" text that appears on every line: //table[.//td//text()=" More Info "]...
The idea is to find something ideally unique (if you can't find any unique criterion, table[color condition selecting a few tables][2] is still stronger than walking the whole tree), present every time, and use that as an id.
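A minimal sketch of this idea, using only the JDK's javax.xml.xpath. It assumes the markup is well-formed XHTML, which a real page usually is not, so a tolerant HTML parser would be needed first. The sample markup and the class name AnchoredXPath are invented for illustration and do not reproduce the actual page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: selects nodes by a stable feature instead of a full path.
final class AnchoredXPath {

    static NodeList select(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Invented sample: a layout table, then the data table nested in a div.
        String page = "<html><body>"
                + "<table><tr><td>menu</td></tr></table>"
                + "<div><table bgcolor=\"#c9d0e0\">"
                + "<tr><td>INFY</td><td>GET MORE INFO</td></tr>"
                + "</table></div>"
                + "</body></html>";

        // Both expressions find the data table wherever it sits in the tree.
        NodeList byColor = select(page, "//table[@bgcolor='#c9d0e0']");
        NodeList byText = select(page, "//table[tr/td//text()='GET MORE INFO']");
        // prints: 1 1
        System.out.println(byColor.getLength() + " " + byText.getLength());
    }
}
```

Moving the data table into another wrapper element would not affect either expression, whereas an absolute path such as /html/body/table[4]/... would break.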
I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The second part of this is getting a "reduced" resume back out of the database. In other words, I want to search the uploaded content I now have and print new resumes only for people who have, let's say, Java programming experience. So I can go into my database, find the people who originally listed Java as a skill, and output a new set of resumes that are still in a nice templated format and contain only the relevant info, instead of ALL the content.
I have been writing some software in Java to do this. It basically uses a docx template and overwrites the items in customXML that are bound to the content controls in the doc, so the new data shows up and can be saved as a new docx with that custom data.
This seems really cumbersome to me, and has some limitations. For one, let's say my template has a place for 3 skills, and a particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I don't want to have to go back into my software and edit source code to change that additional data's XML tag to bold instead of italic.
I was reading up on using InfoPath to create a form that I could use to get the input, connecting to some SharePoint data source or something similar to store the stripped-out data. However, I can't find out whether it is possible with SharePoint to get the data back out in a nicely formatted way. What would the general steps for this be? I couldn't find very much about this topic with some quick googling.
Thanks
You could set up the skills:
<skills>
  <skill>..</skill>
  <skill>..</skill>
</skills>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.
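On the data side, this means the program only has to emit one <skill> element per entry, however many there are. A minimal sketch with the JDK's DOM API follows. The class name SkillsXml and the sample skills are invented, and storing the result as the custom XML part, as well as the repeat control itself, are left to docx4j and the template and are not shown.

```java
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: builds the <skills> container for any number of skills.
final class SkillsXml {

    static Document build(List<String> skills) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element container = doc.createElement("skills");
        doc.appendChild(container);
        for (String skill : skills) {
            // One <skill> child per entry; no formatting markup is needed here,
            // because the formatting lives in the template.
            Element item = doc.createElement("skill");
            item.setTextContent(skill);
            container.appendChild(item);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build(Arrays.asList("Java", "Grails", "SQL", "XML", "Groovy"));
        // prints: 5
        System.out.println(doc.getElementsByTagName("skill").getLength());
    }
}
```

Because the XML carries data only, changing bold to italic becomes an edit to the template rather than to the source code.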