I need to display a Microsoft Word doc on my web page and then parse the doc for further processing.
Use something like: http://poi.apache.org/document/index.html
to parse your Word document and extract the data, then render it as HTML to the client browser.
But that sounds like a poor approach. It would be better to use a format that can be displayed directly in the browser, like PDF. Then you don't have to map between the Word doc styles and your webpage styles.
You can use Apache POI to parse the doc file.
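For example, a minimal sketch using POI's HWPF module for classic .doc files (the file name is just a placeholder; for .docx there is the XWPF counterpart):

import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Extract the plain text of a .doc file; you can then wrap it in your own HTML for the browser
try (FileInputStream in = new FileInputStream("example.doc")) {
    HWPFDocument doc = new HWPFDocument(in);
    WordExtractor extractor = new WordExtractor(doc);
    String text = extractor.getText();
    System.out.println(text);
}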
I have the following link:
https://hero.epa.gov/hero/ws/swift.cfc?method=getProjectRIS&project_id=993&getallabstracts=true
I want to parse this XML to get only the text, like:
Provider: HERO - 2.xx
DBvendor=EPA
Text-encoding=UTF-8
How can I parse it?
Well, it's not a text file, it's an HTML file. If you open the file in a browser and select View Source, you will be able to see text enclosed in <char> tags.
When it's opened in a browser, these tags and other HTML content are interpreted and the output is rendered on the page (that's why it looks like plain text). If you want to implement similar behavior in Java, you should look into PhantomJS and/or Jsoup examples.
It looks like a text file, but it is an XML file and the browser just displays its text content.
To verify, right-click and look at the page source.
You can use a library like Jsoup to parse the file and get the contents.
https://jsoup.org/cookbook/introduction/parsing-a-document
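A small sketch, assuming the response really is XML and the interesting text sits in <char> elements as the other answer describes (the tag name is an assumption; check the page source and adjust the selector):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

Document doc = Jsoup.connect("https://hero.epa.gov/hero/ws/swift.cfc?method=getProjectRIS&project_id=993&getallabstracts=true")
        .parser(Parser.xmlParser())   // parse as XML, not HTML
        .get();
for (Element line : doc.select("char")) {   // tag name taken from the other answer; verify it in the source
    System.out.println(line.text());
}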
I am trying to use Jsoup to parse the HTML from the URL http://www.threadflip.com/shop/search/john%20hardy
Jsoup only seems to get the data from the line
<![CDATA[ window.gon= ..............
Does anyone know why this would be?
Document doc = Jsoup.connect("http://www.threadflip.com/shop/search/john%20hardy").get();
The site you are trying to parse loads most of its content asynchronously via AJAX calls. Jsoup does not interpret JavaScript and therefore does not act like a browser. It seems that the store is filled by calling their API:
http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30
So you may need to load the API URL directly in order to read the data you want. Note that the response is JSON, not HTML, so the Jsoup HTML parser is not much help here. But there are great JSON libraries available; I use JSON-Simple.
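A rough sketch with json-simple (the "items" key and the field names are guesses based on the kind of response such an API typically returns; adapt them to the actual structure):

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

String api = "http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30";
try (Reader reader = new InputStreamReader(new URL(api).openStream(), "UTF-8")) {
    JSONObject response = (JSONObject) new JSONParser().parse(reader);
    JSONArray items = (JSONArray) response.get("items");   // key name is a guess
    for (Object o : items) {
        JSONObject item = (JSONObject) o;
        System.out.println(item.get("title"));              // field name is a guess
    }
}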
Alternatively, you may switch to Selenium WebDriver, which actually remote-controls a real browser. That should have no trouble accessing all the items on the page.
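For illustration, a minimal Selenium sketch (the CSS selector and the hard-coded wait are placeholders you would need to adapt to the real page):

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

WebDriver driver = new FirefoxDriver();
driver.get("http://www.threadflip.com/shop/search/john%20hardy");
// crude wait for the AJAX-loaded items; a WebDriverWait would be cleaner
Thread.sleep(5000);
List<WebElement> items = driver.findElements(By.cssSelector(".item")); // selector is a guess
for (WebElement item : items) {
    System.out.println(item.getText());
}
driver.quit();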
This is not a duplicate question. I had searched and tried many options before posting this question.
We have a web page in which the user should be able to input data in text boxes, text areas, images and also rich text editors. This data has to be filled into an existing report, like filling in the blanks.
I was able to achieve this functionality using Apache FOP when the user input is simple text. But Apache FOP doesn't work if the user input is rich text (HTML format): FOP will not render the HTML, and it just pushes the HTML code (e.g. <strong> XYZ </strong>) into the PDF.
I tried using iText, but the setback here is that even though iText supports rendering HTML to PDF, it is not able to place the images included in <img> tags in the PDF file.
I could try to create a PDF with the iText API block by block, but the problem is that the rich text entered by the user cannot be embedded in between, since building a PDF block by block and converting HTML to PDF cannot be done together in iText. Or at least that is what I think from my experience.
Is there any other way to create a PDF file from Java with images, rich text rendered as-is, headers, and footers?
iText provides the capability to convert HTML data to PDF. Below is a snippet to do it.
Let's assume the HTML data is available as an InputStream (if it's a String, we can convert it to an InputStream using Apache Commons IOUtils).
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;

InputStream htmlData; // HTML data that needs to be converted to PDF
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Document document = new Document();
PdfWriter pdfWriter = PdfWriter.getInstance(document, outputStream);
document.open();
// convert the HTML with the built-in convenience method
XMLWorkerHelper.getInstance().parseXHtml(pdfWriter, document, htmlData);
document.close();
// outputStream now holds the generated PDF data
I am working as a Social Media Developer for Aspose, and to add rich text to a form field in a PDF file, you can try our Aspose.Pdf for Java API. Check the following sample code:
// Open a PDF document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:\\data\\input.pdf");
// Find the rich text box field by its field name
RichTextBoxField textBoxField1 = (RichTextBoxField) pdfDocument.getForm().get("textbox1");
// Set the field value
textBoxField1.setValue("<strong> XYZ </strong>");
// Save the modified PDF
pdfDocument.save("c:\\data\\output2.pdf");
I am not trying to market or promote this product. This API actually solved our problem, so I thought of mentioning it as it might help fellow developers. Please let me know if this is against your policy.
I finally realized that my requirement cannot be achieved with FOP, iText, Aspose, Flying Saucer, or JODConverter.
I found a paid API, Sferyx. This API can render very complex HTML to PDF while almost preserving the original style, and it also renders the images included in the HTML. We are still exploring this API and will post what other features it provides.
Let's look at this example:
I've got HTML-tagged text:
<font size="100">Example text</font>
I have an *.odt (OpenDocument Text) document where I want to place this HTML text with formatting that depends on the HTML tags (in this example the font tag should be omitted and the text Example text should have a 100-point font size in the resulting *.odt file).
I prefer (but this is not a strong requirement) to use the OpenOffice UNO API for Java to achieve this. Is there any way to inject this HTML text into the body of the *.odt document with a simple UNO API built-in HTML-to-odt converter or something like that? Or do I have to manually go through the HTML tags in the text and then use the OO UNO API to place the text with specific formatting (e.g. font size)?
OK, this is what I've done to achieve this (using the OpenOffice UNO API with Java):
Load the odt document where we want to place the HTML text.
Go to the place where you want to insert the HTML text.
Save the HTML text in a temp file on the system (maybe it is possible without saving, with an http URL, but I haven't tested it).
Insert the HTML into the odt following these instructions, passing the URL of the temp HTML file (remember to convert the system path to an OO URL); a rough sketch of this step follows.
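A sketch of that last step with the UNO API, assuming xTextDocument is the already loaded target document and the temp file path is just an example:

import com.sun.star.beans.PropertyValue;
import com.sun.star.document.XDocumentInsertable;
import com.sun.star.text.XText;
import com.sun.star.text.XTextCursor;
import com.sun.star.uno.UnoRuntime;

XText xText = xTextDocument.getText();
XTextCursor xCursor = xText.createTextCursor();   // position this wherever the HTML should go
XDocumentInsertable xInsertable =
        (XDocumentInsertable) UnoRuntime.queryInterface(XDocumentInsertable.class, xCursor);
// OO expects a file URL, not a system path
String htmlUrl = new java.io.File("/tmp/snippet.html").toURI().toString();
xInsertable.insertDocumentFromURL(htmlUrl, new PropertyValue[0]);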
Maybe you can use JODConverter, or you can use the XSLT from xhtml2odt.
I'm trying to use Freemarker to convert an XML Word document to a standard DOC. For example:
I generate a Word document (A.doc) and then save it as an XML Word document (A.xml).
With FreeMarker, I import A.xml and export it as 2003 Word (B.doc).
In POI, I import the converted DOC (B.doc). (POI can't read XML docs.)
The problem is that the converted document isn't really a DOC, it's still an XML doc, so POI fails to open it.
How can I use FreeMarker to generate a real DOC, not an XML Word document?
I'm using Linux.
Your approach probably won't work because FreeMarker is designed for generating text output. Classic Word DOC files are not very "textual", so I think FreeMarker is not the right tool for your task.
(Side note: RTF, on the other hand, is a text format, so it might work.)
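To sketch the RTF idea only (template and field names are made up, not part of the question): save the Word document as RTF, replace the variable parts with FreeMarker placeholders, and process it as plain text.

import java.io.File;
import java.io.FileWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;
import freemarker.template.Configuration;
import freemarker.template.Template;

Configuration cfg = new Configuration(Configuration.VERSION_2_3_23);
cfg.setDirectoryForTemplateLoading(new File("templates"));
cfg.setDefaultEncoding("UTF-8");
Template template = cfg.getTemplate("report.rtf");   // an RTF file containing ${...} placeholders
Map<String, Object> model = new HashMap<>();
model.put("title", "Example");
try (Writer out = new FileWriter("B.rtf")) {
    template.process(model, out);
}

Word opens the resulting RTF directly; whether POI can consume it for your later processing step is a separate question.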