I am new to parsers. I would like to fetch specific data from a website, and I understand I need a parser for that. How do I get started with parsers? What do I need to download?
What would the Java code to fetch data from a website using a parser look like?
My advice would be to use an open source HTML parser such as HTMLCleaner - http://htmlcleaner.sourceforge.net/
You can use HTMLCleaner (or similar) to create a representation of the web page DOM, and then use this to extract whatever information you want from the web pages.
The process looks something like this:
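import java.net.URL;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

URL url = new URL("website you want to load");
HtmlCleaner cleaner = new HtmlCleaner(); // the class is spelled HtmlCleaner
TagNode root = cleaner.clean(url.openStream());
// perform queries on the DOM to extract information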
I have a student database (Oracle 11g) and need to create a separate module that generates a student's details as a well-formatted Word document. Given a student ID, it should produce a presentable .docx file containing all of the student's information (a kind of biodata). I'm not sure how to start; I have been exploring python-docx and docx4j (Java). Any suggestions on how to achieve this? Is there a tool that can do it?
Your help is highly appreciated.
You could extract the data from Oracle into an XML format, then use content control data binding in your Word document to bind elements in the XML.
All you need to do is inject the XML into the docx as a custom xml part, and Word will display the results automatically.
docx4j can help you inject the XML. If you don't want to rely on Word to display the results, you can also use docx4j to apply the bindings.
Or you could try simple variable replacement: https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/VariableReplace.java
If you want a simple way to format your Word document directly from Java, you can try pxDoc.
The screenshot below provides an example of the code and of the document generated from an Authors/Books model: however you query the data from your database, it is easy to render it in a well-formatted document.
(Screenshot: simple document generation example)
Regarding your use case, you could also generate a document for all students at once. In the context of the screenshot example:
for (author:library.authors) {
    var filename = 'c:/MyDocuments/'+author.name+'.docx'
    document fileName:filename {
        /** Content of my document */
    }
}
I have the following problem. In a Java application I have to create new XML content using XPath. (I have always used XPath to parse XML files and read the values inside their tags; can I also use it to build new XML content?)
So my final result (which has to be saved in a database CLOB field rather than in an .xml file, though I think that is not important) has to be something like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Messaggio>
  <Intestazione>
    <Da>06655971007</Da>
    <A>01392380547</A>
    <id>69934</id>
    <idEnel/>
    <DataInvio>2015-05-06</DataInvio>
    <DataRicezione/>
    <InRisposta/>
    <TipoDoc>Ricevuta</TipoDoc>
  </Intestazione>
  <Documenti>
    <Ricevuta>
      <Testata>
        <Documento>
          <Tipo>380</Tipo>
          <NumeroDocumento>ff</NumeroDocumento>
          <Stato>KO</Stato>
          <Data>2014-03-10</Data>
        </Documento>
      </Testata>
      <Dettaglio>
        <Messaggio>
          <Codice>000</Codice>
          <Descrizione>Documento NON Conforme / NON dovuto</Descrizione>
        </Messaggio>
      </Dettaglio>
    </Ricevuta>
  </Documenti>
</Messaggio>
So what I need to do is programmatically add the nodes and their content (the content is obtained from a model object).
Can I do it using XPath? How?
Thanks
XPath is a query language for locating nodes in an XML document. It can't create new nodes or manipulate existing ones. So what you need to do is locate the nodes to modify using XPath and then use the API of the nodes you found to make the changes.
But in your case, you're starting with an empty document. Have a look at frameworks like JDOM 2 to build XML documents from scratch. This tutorial should get you started: http://www.studytrails.com/java/xml/jdom2/java-xml-jdom2-example-usage.jsp
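As a rough sketch of building such a document from scratch: the example below uses the DOM API that ships with the JDK instead of JDOM 2, only so that it runs without extra dependencies. The class name MessageBuilder and the helper child are made up for illustration, and only part of the Intestazione block is built. The resulting String is what you would store in the CLOB column.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class MessageBuilder {

    // Hypothetical helper: appends <name>text</name> to parent,
    // or an empty element when text is null.
    static Element child(Document doc, Element parent, String name, String text) {
        Element e = doc.createElement(name);
        if (text != null) {
            e.setTextContent(text);
        }
        parent.appendChild(e);
        return e;
    }

    // Builds a cut-down version of the message and serializes it to a String.
    static String build(String from, String to, String id) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("Messaggio");
        doc.appendChild(root);

        Element header = child(doc, root, "Intestazione", null);
        child(doc, header, "Da", from);
        child(doc, header, "A", to);
        child(doc, header, "id", id);
        child(doc, header, "idEnel", null);

        // Serialize the DOM tree; the encoding matches the target document.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("06655971007", "01392380547", "69934"));
    }
}
```

In real code the arguments would come from your model object. The same structure maps closely onto JDOM 2 if you prefer that library.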
You can't. XPath is a matching technology, not a content creation technology. Possibly you are looking for XSLT?
I really need help extracting Microdata embedded in HTML5. My goal is to get structured data from a web page, just like this Google tool: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but have not found a workable solution.
Currently I use the any23 library, but I can't find any documentation, only Javadocs, which don't provide enough information for me.
I use any23's Microdata Extractor but I am stuck at the third parameter: "org.w3c.dom.Document in". I can't parse HTML content into a W3C DOM. I have tried JTidy as well as JSoup, but the DOM objects in these libraries do not fit the Extractor constructor. I am also unsure about the second parameter of the Microdata Extractor.
I hope someone can help me with any23 or suggest another library that can solve this extraction problem.
Edit: I found a solution myself by doing what the any23 command-line tool does. Here is the code snippet:
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputInputStream = doc.openInputStream();
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(tagSoupParser.getDOM(),new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These lines of code only extract Microdata from HTML and write it out in JSON format. I tried to use MicrodataExtractor, which can change the output format to others (RDF, Turtle, ...), but its input document seems to accept only XML. It throws "Document didn't start" when I pass in an HTML document.
If anyone finds a way to use MicrodataExtractor, please leave an answer here.
Thank you.
XPath is generally the way to consume HTML or XML.
Have a look at: How to read XML using XPath in Java
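A minimal sketch of reading values with the XPath API that ships with the JDK (javax.xml.xpath). The class name XPathRead and the sample library/book document are made up for illustration. Note that this parser expects well-formed XML, so real-world HTML usually has to be cleaned up first.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathRead {

    // Returns the string value of the first node matching the expression,
    // or an empty string when nothing matches.
    static String first(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    // Returns how many nodes match the expression.
    static int count(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library><book><title>A</title></book>"
                + "<book><title>B</title></book></library>";
        System.out.println(first(xml, "/library/book[2]/title"));
        System.out.println(count(xml, "//book"));
    }
}
```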
I'm new to the development scene, so please forgive me if I don't make sense.
I'm trying to access an XML file located in my EJB directory, where it has to stay. I need to parse it into a JavaScript-accessible object, preferably JSON, so that I can manipulate it dynamically with JavaScript / Angular.
I'm using JBoss, and the file's location is something like
/FOO-ejb/src/main/resources/Config.xml, which is obviously not accessible through the web since it does not reside under a web server root directory.
Java is the back end, and I can't seem to find any other way to access this file and serve it to the front end.
I'm leaning towards using a service within the EJB to access and parse the file, then using a REST service to serve the object to the front end, or else writing a JSP that reads in the file, parses it, and so on.
Are there any better solutions for this?
Thank you everyone for your time!
I don't think what you want to do is achievable purely on the client, since it would mean using JavaScript to access the server's file system, which is not possible. HTML5 does offer a File API, but it cannot be used to read arbitrary files from the file system.
So I'd say the direction you're heading in is the most appropriate, and probably the easier one: even if you found a way to do it in JavaScript, it would be browser-dependent or some odd workaround that could break in future browser versions.
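A minimal sketch of the server-side half of that approach, assuming Config.xml is a flat list of simple elements under one root (the real file may be nested, in which case this needs extending). The class name ConfigToJson is hypothetical. Files under src/main/resources normally end up at the root of the classpath, hence the "/Config.xml" lookup. The JSON is assembled by hand only to keep the example dependency-free; a JSON library would be the safer choice in production, and the REST endpoint that returns the string is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ConfigToJson {

    // Minimal JSON string quoting: handles backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Turns <root><a>1</a><b>x</b></root> into {"a":"1","b":"x"}.
    static String toJson(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
        NodeList children = doc.getDocumentElement().getChildNodes();
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes and comments
            }
            if (json.length() > 1) {
                json.append(',');
            }
            json.append(quote(n.getNodeName()))
                .append(':')
                .append(quote(n.getTextContent().trim()));
        }
        return json.append('}').toString();
    }

    // Reads Config.xml from the classpath root (assumed location).
    static String loadFromClasspath() throws Exception {
        try (InputStream in = ConfigToJson.class.getResourceAsStream("/Config.xml")) {
            if (in == null) {
                throw new IllegalStateException("Config.xml not found on the classpath");
            }
            return toJson(in);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config><host>localhost</host><port>8080</port></config>";
        System.out.println(toJson(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

A REST resource or servlet would then return the result of loadFromClasspath() with the content type application/json, and the Angular side would fetch it with an ordinary HTTP call.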
I used Apache Abdera in a servlet in the past to parse an XML feed and convert it to JSON. Abdera is good at that and worked perfectly for me. After getting the JSON object I just had to write it to the response, and on the client side I used an AJAX call to the servlet to get the JSON object.
The code was something like this:
try {
    PrintWriter result = response.getWriter();
    // Create the Abdera object and the client that processes the request.
    Abdera abderaObj = new Abdera();
    AbderaClient client = new AbderaClient(abderaObj);
    AbderaClient.registerTrustManager(); // For SSL connections.
    // Send the HTTP request for the Atom feed through AbderaClient.
    ClientResponse resp = client.get( "http://url/to/your/feed" );
    // If the response was OK...
    if (resp.getType() == ResponseType.SUCCESS) {
        // Get the document as a Feed.
        Document<Feed> doc = resp.getDocument();
        // Create a JSON writer to convert the Atom feed.
        Writer json = abderaObj.getWriterFactory().getWriter("json");
        // Convert the (XML) Atom feed into JSON and write it to the response.
        doc.writeTo(json, result);
    }
} catch (Exception ex) {
    ex.printStackTrace(System.out);
}
I have a request as a String in XML format, which I need to parse to obtain the request data from it.
The XML string consists of a lot of sub-tags, and the data in it is supplied as CDATA; there are also > and < characters in it.
I want to use the StAX approach for this.
Please suggest whether there are any drawbacks to this.
I need this inside a Java web service.
You can parse your XML file with PHP's native functions.
Example:
$path = '/my/file/xml.xml';
$file = file_get_contents($path);
$xml = new SimpleXMLElement($file);
And get nodes with :
$value1 = $xml->node[0]->value;
$value2 = $xml->node[1]->value;
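There should not be any issues with using the StAX API. Note that StAX is a streaming pull parser rather than a DOM-based one; it handles CDATA sections and escaped characters such as &lt; and &gt; without special treatment.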
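A minimal sketch of what that could look like with the StAX API in the JDK (javax.xml.stream). The class name StaxRequestParser and the sample element names are made up. It collects the text of each leaf element into a map, so a repeated element name overwrites the earlier value; a real request format may need something richer.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxRequestParser {

    // Collects the text of every leaf element, keyed by element name.
    static Map<String, String> parse(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent text and CDATA chunks into single text events.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // Requests usually come from outside, so do not process DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        Map<String, String> values = new LinkedHashMap<>();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            String current = null;              // most recently opened element
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        current = reader.getLocalName();
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        // current is only set if no child element was opened
                        // since the start tag, i.e. this is a leaf element.
                        if (current != null) {
                            values.put(current, text.toString().trim());
                            current = null;
                        }
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<request><id>42</id>"
                + "<payload><![CDATA[a < b && c > d]]></payload>"
                + "<note>x &lt; y</note></request>";
        System.out.println(parse(xml));
    }
}
```

The main trade-off compared with DOM is that StAX only moves forward through the document, so you have to track context yourself (as the current variable does here); in exchange it does not hold the whole document in memory.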