I have the following link:
https://hero.epa.gov/hero/ws/swift.cfc?method=getProjectRIS&project_id=993&getallabstracts=true
I want to parse this XML and get only the text, like:
Provider: HERO - 2.xx
DBvendor=EPA
Text-encoding=UTF-8
How can I parse it?
Well, it's not a text file, it's an HTML file. If you open it in a browser and select "view source", you will see the text enclosed in <char> tags.
When it's opened in a browser, these tags and the other HTML content are interpreted and the output is rendered on the page (that's why it looks like text). If you want to implement similar behavior in Java, look into PhantomJS and/or Jsoup examples.
It looks like a text file, but it is an XML file and the browser just displays its text content.
To verify, right-click and look at the page source.
You can use a library like Jsoup to parse the file and get the contents.
https://jsoup.org/cookbook/introduction/parsing-a-document
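If adding Jsoup is not an option, the JDK's built-in DOM parser can also strip the tags from well-formed XML. The following is a minimal sketch using that parser instead of Jsoup. The class name, the method name, the <root> wrapper and the sample string are made up for illustration; only the <char> tag name comes from the answer above, and the real response may be structured differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlTextExtractor {

    // Parses an XML string and returns the concatenated text of all elements,
    // with the tags themselves dropped.
    static String extractText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample; the structure of the real response is an assumption.
        String sample = "<root><char>Provider: HERO - 2.xx\n"
                + "DBvendor=EPA\nText-encoding=UTF-8</char></root>";
        System.out.println(extractText(sample));
    }
}
```

To read from the link directly, DocumentBuilder.parse also accepts a URI string in place of the InputSource. This only works if the server really returns well-formed XML.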
Related
I'm trying to export a text area (for which I use ckeditor) into a Word document. I'm using JSP, and setting HTTP headers of a target page to receive the textarea value in request scope:
<%@ page contentType="application/vnd.ms-word" %>
response.setHeader("Content-Disposition", "attachment;filename=responseLetter.doc")
...
<%=textAreaReqScopeValue%>
However, the formatting and styles from my CKEditor source (example below) are lost when the Word document is generated:
<p>Dear Anonymous,</p><p>This is in response to your <strong><em><u>request regarding your continued ...
Is there any way to keep the formatting, either by generating the Word document or through CKEditor?
Using googoose.js or html-doc.js solved my problem. An Open XML library should be used to process the HTML tags for the MS Word output.
I'm generating PDFs from HTML pages with an application. Sometimes the PDF is formatted correctly (with styles); other times, the styles are missing.
In the log file I can see the message "Error in rendering".
We build the HTML in a string buffer and convert it to a PDF file. I'm not sure why the formatting is sometimes missing from the generated PDF.
So sometimes the CSS file (styles) is applied when the HTML is converted, and sometimes it isn't.
I'm guessing that you use an external CSS file. If I were you, I would put the CSS directly inside your HTML file, in the head element, like this:
<style>
body {background-color:#fff}
h1 {color:#eee}
</style>
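Since the question builds the HTML in a string buffer, the same idea can be applied in Java before the markup goes to the PDF converter. This is only a sketch: the class and method names are hypothetical, it assumes a lowercase </head> tag, and whether it removes the rendering error depends on the converter, which the question does not name.

```java
class InlineCss {

    // Returns the HTML with the given CSS embedded in a <style> block.
    // The block is placed just before </head>; if there is no </head>,
    // it is prepended to the document instead.
    static String withInlineCss(String html, String css) {
        String styleBlock = "<style>\n" + css + "\n</style>\n";
        int headEnd = html.indexOf("</head>");
        if (headEnd < 0) {
            return styleBlock + html;
        }
        return html.substring(0, headEnd) + styleBlock + html.substring(headEnd);
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><h1>Title</h1></body></html>";
        String css = "body {background-color:#fff}\nh1 {color:#eee}";
        System.out.println(withInlineCss(html, css));
    }
}
```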
I need to display a Microsoft Word document on my web page and then parse it for further processing.
Use something like Apache POI: http://poi.apache.org/document/index.html
to parse your Word document and extract the data. Then render it as HTML in the client's browser.
But that sounds like a poor approach. It would be better to use a format that can be displayed directly in the browser, like PDF. Then you don't have to translate between the Word document's styles and your web page's styles.
You can use Apache POI to parse the doc file.
Consider this example:
I have HTML-tagged text:
<font size="100">Example text</font>
I have an *.odt (OpenDocument Text) document into which I want to insert this HTML text, formatted according to its HTML tags (in this example, the font tag should be omitted and the text "Example text" should have a 100-point font size in the resulting *.odt file).
I would prefer (though this is not a strict requirement) to use the OpenOffice UNO API for Java. Is there a way to inject this HTML text into the body of the *.odt document using a built-in HTML-to-ODT converter in the UNO API or something similar, or do I have to walk through the HTML tags manually and then use the UNO API to place the text with specific formatting (e.g. font size)?
OK, this is what I did to achieve this (using the OpenOffice UNO API with Java):
Load the odt document where you want to insert the HTML text.
Go to the place where you want to insert the HTML text.
Save the HTML text to a temporary file on the system (it may be possible to skip this and use an HTTP URL, but I haven't tested that).
Insert the HTML into the odt by following these instructions, passing the URL of the temporary HTML file (remember to convert the system path to an OO path).
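The temp-file step and the path conversion can be sketched with the standard Java library alone. The class and method names below are hypothetical, and the UNO insert call itself is left out because the answer's linked instructions are not reproduced here. The sketch assumes the "OO path" is an ordinary file: URL, which is what Path.toUri() produces.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class HtmlTempFile {

    // Writes the HTML fragment to a temporary file and returns its location
    // as a file: URL string (for example file:///tmp/fragment123.html).
    static String writeHtmlAndGetUrl(String html) {
        try {
            Path tmp = Files.createTempFile("fragment", ".html");
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            tmp.toFile().deleteOnExit(); // clean up when the JVM exits
            return tmp.toUri().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = writeHtmlAndGetUrl("<font size=\"100\">Example text</font>");
        System.out.println(url); // pass this URL to the UNO insert step
    }
}
```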
Maybe you can use JODConverter, or the XSLT from xhtml2odt.
I want to display a regular XML file (or any other file) as plain text in a JEditorPane. I don't want the content rendered as an HTML page; it should appear exactly as in the file, line breaks included. The XML file is located on the local system.
Do you need syntax highlighting,
or just plain text?
If you need syntax highlighting, try this:
http://java-sl.com/xml_editor_kit.html
If not, just use a JTextArea or a plain JEditorPane with the default editor kit, and call setText() to set your content.
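A minimal sketch of the plain-text option, assuming the file is UTF-8 encoded. The class and method names are hypothetical; the only Swing calls used are the JTextArea and JScrollPane constructors, setEditable() and setText().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;

class PlainTextViewer {

    // Reads the whole file into a String without interpreting any markup.
    static String readFile(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a read-only, scrollable text area showing the raw file content.
    static JScrollPane createView(Path file) {
        JTextArea area = new JTextArea();
        area.setEditable(false);
        area.setText(readFile(file));
        return new JScrollPane(area);
    }
}
```

The returned scroll pane can be added to any existing frame or panel. Because the text is set as a plain string, tags in the XML show up as literal characters.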