Getting the name of a w3c Document - java

Is this possible? For example, if I run this jSoup code:
Document source = Jsoup.connect("http://www.xmlFile.com/file.xml").get();
// Convert the Jsoup object to a w3c Document object with Apache.
return DOMBuilder.jsoup2DOM(source);
I now have a w3c Document object. Is there any way of getting file.xml from that object? I've not found anything online that informs me either way.

There's a method on Document called getDocumentURI() that might do what you want. Whether the URI is set to something or null seems to depend on how the Document was created. I don't know how JSoup creates the documents, so your mileage may vary.
http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/Document.html#getDocumentURI()
If JSoup doesn't set it, you could call setDocumentURI() manually in your method, and then use getDocumentURI() wherever you need it.
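For example, a minimal sketch (untested, reusing the conversion code from the question):
org.w3c.dom.Document w3cDoc = DOMBuilder.jsoup2DOM(source);
// If jsoup2DOM doesn't carry the URI over, stamp it on yourself:
w3cDoc.setDocumentURI(source.baseUri()); // e.g. "http://www.xmlFile.com/file.xml"
// ...later, wherever you only have the w3c Document:
String uri = w3cDoc.getDocumentURI();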

Related

How can I use XPath to create a new xml content?

I have the following problem. In a Java application I have to create new XML content using XPath (I have always used it to parse XML files and obtain the values inside their tags; can I also use it to build new XML content?).
My final result (which has to be saved in a database CLOB field, not in an .xml file, but I think this is not important) has to be something like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Messaggio>
  <Intestazione>
    <Da>06655971007</Da>
    <A>01392380547</A>
    <id>69934</id>
    <idEnel/>
    <DataInvio>2015-05-06</DataInvio>
    <DataRicezione/>
    <InRisposta/>
    <TipoDoc>Ricevuta</TipoDoc>
  </Intestazione>
  <Documenti>
    <Ricevuta>
      <Testata>
        <Documento>
          <Tipo>380</Tipo>
          <NumeroDocumento>ff</NumeroDocumento>
          <Stato>KO</Stato>
          <Data>2014-03-10</Data>
        </Documento>
      </Testata>
      <Dettaglio>
        <Messaggio>
          <Codice>000</Codice>
          <Descrizione>Documento NON Conforme / NON dovuto</Descrizione>
        </Messaggio>
      </Dettaglio>
    </Ricevuta>
  </Documenti>
</Messaggio>
So what I need to do is programmatically add the nodes and their content (the content is obtained from a model object).
Can I do it using XPath? How?
Thanks
XPath is an API to locate nodes in an XML document. It can't create new nodes or manipulate existing ones. So what you need to do is locate the nodes to modify using XPath and then use the API of the found nodes to make the changes.
But in your case, you're starting with an empty document. Have a look at frameworks like JDOM 2 to build XML documents from scratch. This tutorial should get you started: http://www.studytrails.com/java/xml/jdom2/java-xml-jdom2-example-usage.jsp
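For instance, a minimal JDOM 2 sketch (untested; the values would come from your model object) that builds the start of the document above:
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

Element intestazione = new Element("Intestazione");
intestazione.addContent(new Element("Da").setText("06655971007"));
intestazione.addContent(new Element("A").setText("01392380547"));
intestazione.addContent(new Element("TipoDoc").setText("Ricevuta"));

Element root = new Element("Messaggio");
root.addContent(intestazione);

Document doc = new Document(root);
// Serialize to a String, ready for the CLOB field:
String xml = new XMLOutputter(Format.getPrettyFormat()).outputString(doc);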
You can't. XPath is a matching technology, not a content creation technology. Possibly you are looking for XSLT?

Missing fields when rendering Lotus Notes document to RTF, DXL with Java API

I'm attempting to render a Notes document to RTF, then DXL, using the Java API. Once I have the DXL, I'm converting it to HTML with an XSL stylesheet. My goal is to produce an HTML document that displays as closely as possible to the document rendering in the Notes client.
However, computed fields are missing from the rendered RTF and DXL.
Here is the code used to generate the DXL:
private String renderDocumentToDxl(lotus.domino.Document lotusDocument)
        throws Exception {
    Database db = getDatabase();
    lotus.domino.Document tmp = db.createDocument();
    RichTextItem rti = tmp.createRichTextItem("Body");
    lotusDocument.computeWithForm(true, false);
    lotusDocument.save();
    lotusDocument.renderToRTItem(rti);
    DxlExporter dxlExporter = getSession().createDxlExporter();
    dxlExporter.setOutputDOCTYPE(false);
    dxlExporter.setConvertNotesBitmapsToGIF(true);
    return dxlExporter.exportDxl(tmp);
}
Fields added to the document by the call to computeWithForm are not present in the generated DXL.
Is there any way to get the computed fields into the generated DXL with the Java API? Or is there a better way to generate an HTML representation of a Notes document using the Domino Java API?
I'm not quite clear on your objective. There are two possibilities:
1) You want the items from lotusDocument to exist in tmp, and to be exported as actual tag data in the DXL. Your code does not do this.
2) You want the values of the non-hidden items from lotusDocument to exist as text within the rich text Body item in tmp, and you want those values to be included within the DXL that is exported from tmp, as text within the tag for the Body item. This should be what your code is doing.
If you expected the former, then that's not what renderToRTItem does. What it does is the latter. I.e., it gives you a snapshot of the values of the items in lotusDocument, but if and only if they would be displayed to a user who opens the document. You do not get the items themselves, and they won't appear separately in the DXL. If that's all you expected, and it's not happening, then there's something else going wrong and you haven't given enough information here to figure it out.
If you wanted the former, i.e., the actual items from lotusDocument to exist as separate tag elements within the DXL exported from tmp, then you should be using
lotusDocument.copyAllItems(tmp, true);
or sequences of
Item tmpItem = lotusDocument.getFirstItem(itemName);
tmp.copyItem(tmpItem, "");
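For example, an untested sketch of a variant of your method (the method name is made up) that exports the actual items as tag data:
private String exportItemsAsDxl(lotus.domino.Document lotusDocument)
        throws Exception {
    Database db = getDatabase();
    lotus.domino.Document tmp = db.createDocument();
    lotusDocument.copyAllItems(tmp, true); // copy the items, replacing duplicates
    DxlExporter dxlExporter = getSession().createDxlExporter();
    dxlExporter.setOutputDOCTYPE(false);
    return dxlExporter.exportDxl(tmp);
}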
You can get the HTML representation of a RichText field with the URL
http://server/db.nsf/view/docunid/RichTextFieldname?OpenField
So, save your tmp document, get the docunid, and read the result via HTTP from the URL
http://server/db.nsf/0/tmpdocunid/Body?OpenField
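A minimal sketch of reading that URL from Java (untested; the server name, database path and the tmpDocUnid variable are placeholders):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://server/db.nsf/0/" + tmpDocUnid + "/Body?OpenField");
StringBuilder html = new StringBuilder();
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream(), "UTF-8"))) {
    String line;
    while ((line = in.readLine()) != null) {
        html.append(line).append('\n');
    }
}
// html now holds the rendered Body field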
You don't need to call lotusDocument.computeWithForm, as lotusDocument.renderToRTItem already executes the form's input translation and validation formulas.
Be aware that with both methods the form's LotusScript code won't be executed, just in case your fields get calculated that way.
In case you can use XPages, this would be an alternative: http://linqed.eu/2014/07/11/getting-html-from-any-richtext-item/

Microdata extraction from HTML in Java

I really need help to extract Microdata which is embedded in HTML5. My purpose is to get structured data from a webpage just like this Google tool: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but found no workable solution.
Currently, I use the any23 library but I can't find any documentation, only the Javadocs, which don't provide enough information for me.
I use any23's Microdata Extractor but I'm getting stuck at the third parameter: "org.w3c.dom.Document in". I can't parse HTML content into a w3c DOM. I have used JTidy as well as JSoup, but the DOM objects in these libraries don't fit the extractor's constructor. In addition, I also have doubts about the second parameter of the Microdata Extractor.
I hope that someone can help me work with any23 or suggest another library that can solve this extraction issue.
Edit: I found a solution myself by using the same approach the any23 command line tool uses. Here is the snippet of code:
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputInputStream = doc.openInputStream();
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(document, new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These lines of code only extract microdata from HTML and write it out in JSON format. I tried to use MicrodataExtractor, which can change the output format to others (RDF, Turtle, ...), but the input document seems to only accept XML format. It throws "Document didn't start" when I put in an HTML document.
If anyone has found a way to use MicrodataExtractor, please leave an answer here.
Thank you.
XPath is generally the way to consume HTML or XML.
Have a look at: How to read XML using XPath in Java
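For instance, a minimal sketch with the standard javax.xml.xpath API (untested; the file name and the itemprop value are placeholders, and this assumes the input is well-formed XML/XHTML):
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse("page.xml");
XPath xpath = XPathFactory.newInstance().newXPath();
// Grab the text of the first element carrying a microdata itemprop="name":
String name = xpath.evaluate("//*[@itemprop='name']", doc);
System.out.println(name);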

Unusual output when using Jsoup in Java

I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I don't have enough rep to post pictures as I am new to this site, but it's basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url)
{
    String html = getUrl("http://www.wikipedia.org/" + url);
    Document doc = Jsoup.parse(html);
    String contentText = doc.select("*").first().text();
    System.out.println(contentText);
}
It appears to get all the information but in the wrong format!
I appreciate any help given
Thanks in advance
Here are some suggestions for you. When fetching a general webpage that doesn't require HTTP header fields such as cookie or user-agent to be set, just call:
Document doc = Jsoup.connect("givenURL").get();
This function reads the webpage using a GET request. When you select elements using *, it returns every element, that is, all the elements of the document. Hence, calling doc.select("*").first() returns the #root element. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first()); // prints the whole document
System.out.println(doc); // also prints the whole document, which makes the line above pointless
System.out.println(doc.select("*").first() == doc); // checks whether they are equal; prints true
I am assuming that you are just playing around to learn this API. Although selectors are very powerful, a good start would be trying the general document manipulation functions, e.g. doc.getElementsByTag().
However, on my local machine I was able to fetch the Document and parse it successfully using your getUrl() function!
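Putting that together, an untested rework of your method (the en.wikipedia.org host and the #mw-content-text selector are assumptions about Wikipedia's article markup):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static void scrapeTopic(String url) throws IOException {
    // Let Jsoup do the fetching and parsing in one step:
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/" + url).get();
    // Select only the article body instead of the #root element:
    String contentText = doc.select("#mw-content-text").text();
    System.out.println(contentText);
}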

Selenium 2: Detect content type of link destinations

I am using the Selenium 2 Java API to interact with web pages. My question is: How can I detect the content type of link destinations?
Basically, this is the background: Before clicking a link, I want to be sure that the response is an HTML file. If not, I need to handle it in another way. So, let's say there is a download link for a PDF file. The application should directly read the contents of that URL instead of opening it in the browser.
The goal is to have an application which automatically knows whether the current location is an HTML, PDF, XML or whatever document, so it can use appropriate parsers to extract useful information.
Update
Added bounty: I will award it to the best solution that allows me to get the content type of a given URL.
As Jochen suggests, the way to get the Content-Type without also downloading the content is an HTTP HEAD request, and Selenium WebDriver does not seem to offer functionality like that. You'll have to find another library to help you fetch the content type of a URL.
A Java library that can do this is Apache HttpComponents, especially HttpClient.
(The following code is untested)
HttpClient httpclient = new DefaultHttpClient();
HttpHead httphead = new HttpHead("http://foo/bar");
HttpResponse response = httpclient.execute(httphead);
Header contenttypeheader = response.getFirstHeader("Content-Type");
System.out.println(contenttypeheader);
The project publishes Javadoc for HttpClient; the documentation for the HttpClient interface contains a nice example.
You can figure out the content type while processing the data coming in.
Not sure why you need to figure this out first.
If you do, use the HEAD method and look at the Content-Type header.
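For example, a minimal sketch with plain java.net, no extra library (the URL is a placeholder):
import java.net.HttpURLConnection;
import java.net.URL;

HttpURLConnection conn =
        (HttpURLConnection) new URL("http://foo/bar").openConnection();
conn.setRequestMethod("HEAD"); // headers only, no body download
String contentType = conn.getContentType(); // e.g. "text/html; charset=UTF-8"
conn.disconnect();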
You can retrieve all the URLs from the DOM, and then parse the last few characters of each URL (using a Java regex) to determine the link type.
You can parse the characters following the last dot. For example, in the URL http://yoursite.com/whatever/test.pdf, extract pdf, and apply your test logic accordingly.
Am I oversimplifying your problem?
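If that fits your case, a short sketch of the extension check (using the example URL above):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Capture whatever follows the last dot at the end of the URL path:
Pattern ext = Pattern.compile("\\.([A-Za-z0-9]+)$");
Matcher m = ext.matcher("http://yoursite.com/whatever/test.pdf");
if (m.find() && m.group(1).equalsIgnoreCase("pdf")) {
    // treat as a PDF download instead of clicking the link
}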
