Microdata extraction from HTML in Java


I really need help extracting Microdata embedded in HTML5. My goal is to get structured data from a webpage, just like this Google tool does: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but have not found a workable solution.
Currently I use the any23 library, but I can't find any documentation other than the Javadocs, which don't give me enough information.
I use any23's Microdata Extractor but am stuck at the third parameter: "org.w3c.dom.Document in". I can't parse HTML content into a W3C DOM. I have tried JTidy as well as JSoup, but the DOM objects in these libraries are not compatible with the Extractor constructor. I am also unsure about the second parameter of the Microdata Extractor.
I hope someone can help me with any23 or suggest another library that can solve this extraction problem.
Edit: I found a solution myself by doing what the any23 command-line tool does. Here is the code snippet:
// value holds the URL of the page to fetch
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputInputStream = doc.openInputStream();
// TagSoup tolerates malformed HTML and yields an org.w3c.dom.Document
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
// Write the extracted Microdata as JSON into an in-memory buffer
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(document, new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These lines of code only extract Microdata from HTML and write it in JSON format. I tried to use MicrodataExtractor, which can change the output format to others (RDF, Turtle, ...), but the input document seems to accept only XML. It throws "Document didn't start" when I pass in an HTML document.
If anyone finds a way to use MicrodataExtractor, please leave an answer here.
Thank you.

XPath is generally the way to consume HTML or XML.
Have a look at: How to read XML using XPath in Java
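To make the XPath suggestion concrete, here is a minimal sketch that uses only the JDK. It assumes the page has already been turned into well-formed XML (for example by a tidying parser), and it simplifies Microdata by reading each property value from the element's text, whereas real Microdata takes some values from attributes such as content, href or src. The class and method names are made up for illustration and are not part of any23.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists Microdata properties found in well-formed markup.
class MicrodataXPathSketch {

    /** Returns "itemprop=text" for every element that carries an itemprop attribute. */
    static List<String> itemProps(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every element, at any depth, that has an itemprop attribute.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//*[@itemprop]", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                // Simplification: use the element text as the property value.
                result.add(e.getAttribute("itemprop") + "=" + e.getTextContent().trim());
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalStateException("Input is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) {
        String page = "<div itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\">"
                + "<span itemprop=\"name\">Jane Doe</span>"
                + "<span itemprop=\"jobTitle\">Editor</span></div>";
        itemProps(page).forEach(System.out::println);
    }
}
```

The JDK parser rejects typical real-world HTML, so in practice the DOM would come from a lenient parser such as the TagSoupParser used in the question, and the same XPath expression could then be evaluated against that Document.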

Related

Create a DOCX reading data from Oracle database

I have a student database (Oracle 11g) and need to create a separate module that generates a student's details as a well-formatted Word document. When I supply a student ID, I need all of the student's information (a kind of biodata) in a presentable docx file. I'm not sure how to start; I have been exploring python-docx and the Java library docx4j. I need suggestions on how to achieve this. Is there a tool I can use for this?
Your help is highly appreciated.

You could extract the data from Oracle into an XML format, then use content control data binding in your Word document to bind elements in the XML.
All you need to do is inject the XML into the docx as a custom xml part, and Word will display the results automatically.
docx4j can help you inject the XML. If you don't want to rely on Word to display the results, you can also use docx4j to apply the bindings.
Or you could try simple variable replacement: https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/VariableReplace.java

If you want a simple way to format your Word document directly from Java, you can try pxDoc.
The screenshot below provides an example of code and the document generated from an Authors/Books model: however you request the data from your database, it is easy to render it in a well-formatted document.
[Screenshot: simple document generation example]
Regarding your use case, you could also generate a document for all students at once. In the context of the screenshot example:
for (author:library.authors) {
    var filename = 'c:/MyDocuments/'+author.name+'.docx'
    document fileName:filename {
        /** Content of my document */
    }
}

Quotation-Marks causing IllegalNameException when parsing HTML with JDom2

Good evening everyone!
I'm trying to parse an HTML page in Java with JDOM2 to access some information from it.
My code looks like this (the package names are added just for this code block; I don't have them in my real source):
//Here goes the reading of the site into my String "string" (using NekoHTML)
org.xml.sax.InputSource is = new InputSource();
is.setCharacterStream(new StringReader(string));
org.cyberneko.html.parsers.DOMParser parser = new DOMParser();
parser.parse(is);
org.jdom2.input.DOMBuilder builder = new DOMBuilder();
org.jdom2.Document doc = builder.build(parser.getDocument());
This works fine for everything except one special case: when the site contains quotation marks within an element. Here is an example of what I mean:
Der "realismo mágico" und die Phantastische...
So, after that wonderful tag I get the following error trace:
SEVERE: org.jdom2.IllegalNameException: The name "literatur"" is not legal for JDOM/XML attributes: XML name 'literatur"' cannot contain the character """.
So my question is: what are my options for handling this error? Is there a feature in NekoHTML I can use for this (via "setFeature()"), or something within JDOM?
If not: are there other libraries suitable for scraping websites that can cope with a quotation mark within a tag?
Thanks for your time!

Okay, I solved the problem as follows:
Since there wasn't any dependency on NekoHTML, I switched to jTidy as the parser, which does the job in this case.
Question answered.

Validate HTML code programmatically

I am trying to validate a String of HTML code. That is, when the HTML syntax is wrong I want to know, perhaps in the form of a false return value.
I am currently using JTidy, but it doesn't tell me there was bad syntax; it just corrects it. I don't need it corrected, I just need to know whether the syntax is good or bad.
JTidy code:
String s = "<td>cookie<td>"; // bad syntax.
Tidy tidy = new Tidy();
InputStream stream = new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
tidy.parse(stream, System.out);
Any help is appreciated.

Java has a built-in DOM parser.
Use the DOM parser to check. It will also report errors.
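As a rough illustration of that idea, the sketch below feeds the string to the JDK's DOM parser and returns false when parsing fails. Note the assumption baked in: this checks XML well-formedness, which is stricter than HTML validity, so legitimate HTML such as an unclosed <br> would also be reported as bad. The class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether markup parses as well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from
            // printing its own messages to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser found a syntax problem (e.g. an unclosed tag).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<td>cookie<td>"));   // false
        System.out.println(isWellFormed("<td>cookie</td>"));  // true
    }
}
```

If HTML-specific rules matter (optional end tags, void elements), a dedicated HTML validator would be a better fit than an XML parser.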

What is the easiest way to extract plain text from an xml document?

I have some ebooks in XML format. The books' pages are marked using processing instructions (e.g. <?pg 01?>). I need to extract the content of the book as plain text, one page at a time, and save each page as a text file. What's the best way of doing this?

The easiest way, assuming you need to integrate this into a Java program (as the tag implies), is probably to use a SAX parser such as XMLReader provides. You write a ContentHandler callback for text and processing instructions.
When your p-i handler is called, you open a new output file.
When your text handler is called, you copy the character data to the currently open output file.
This tutorial has some helpful example code.
However if you don't need to integrate this into a Java program, I might use XSLT 2.0 (Saxon is free). XSLT 1.0 will not allow multiple output documents, but XSLT 2.0 will, and it will also make grouping by "milestone markup" (your "pg" processing instructions) easier. If you're interested in this approach, just ask... and give more info about the structure of the input document.
P.S. Even if you do need to integrate this into a Java program, you can call XSLT from Java - Saxon for example is written in Java. However I think if you're just processing PI's and text, it would be less effort to use a SAX parser.
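The SAX approach described above could be sketched roughly as follows, using only the JDK. To keep the example short and easy to check, it collects each page's text in memory instead of opening a file per page; the class name, the method name and the sample book are assumptions for illustration, and text that appears before the first "pg" instruction is simply dropped.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: splits a book into pages at <?pg NN?> instructions.
class PageSplitterSketch {

    /** Maps each page label (the PI data, e.g. "01") to the text that follows it. */
    static Map<String, String> splitPages(String xml) {
        Map<String, StringBuilder> pages = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // null until the first pg instruction

            @Override
            public void processingInstruction(String target, String data) {
                if ("pg".equals(target)) {
                    // A new page starts here: switch to a fresh buffer.
                    current = new StringBuilder();
                    pages.put(data == null ? "" : data.trim(), current);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Copy character data into the page that is currently open.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the book", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        pages.forEach((label, text) -> result.put(label, text.toString()));
        return result;
    }

    public static void main(String[] args) {
        String book = "<book><?pg 01?><p>First page.</p><?pg 02?><p>Second page.</p></book>";
        splitPages(book).forEach((label, text) -> System.out.println(label + ": " + text));
    }
}
```

To get one file per page, each map entry could be written out afterwards, or the handler could open a new writer inside processingInstruction as the answer suggests.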

I would probably use Castor to do this. It's a Java tool that lets you specify bindings to Java objects, which you can then output as text to a file.

You need an ebook renderer for the format your books are in (and I highly doubt that it's XML if they use backslashes as processing instructions). Also, XPath works wonders if all you want to do is get the actual text, simply use //text() for all the text.
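For reference, the //text() expression mentioned above can be evaluated with the JDK's built-in XPath support. This is only a sketch under the assumption that the input is well-formed XML; the class and method names are invented, and the result is the plain concatenation of all text nodes with no page splitting.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: concatenates every text node of an XML document.
class XPathTextSketch {

    static String allText(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // //text() selects all text nodes in document order.
            NodeList textNodes = (NodeList) xpath.evaluate(
                    "//text()", new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < textNodes.getLength(); i++) {
                out.append(textNodes.item(i).getNodeValue());
            }
            return out.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(allText(
                "<book><title>Moby Dick</title><p>Call me <i>Ishmael</i>.</p></book>"));
    }
}
```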

You could try converting it to YAML and editing it in a word processor--then a simple macro should fix it right up.
I just browsed for this XML to YAML conversion utility--it's small but I didn't test it or anything.
http://svn.pyyaml.org/pyyaml-legacy/trunk/experimental/XmlYaml/convertyaml_map.py

Use an XSL stylesheet with <xsl:output method="text"/>.
You can even debug stylesheets in eclipse nowadays.
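A stylesheet like that can be run from Java with the JDK's javax.xml.transform API. The sketch below assumes the simplest possible stylesheet: it declares only the text output method and relies on XSLT's built-in template rules, which copy text nodes and skip processing instructions and comments. The class name and sample input are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: strips markup by applying a text-output stylesheet.
class XsltTextSketch {

    // No templates of its own: the built-in rules emit the text nodes.
    private static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "</xsl:stylesheet>";

    static String toPlainText(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toPlainText(
                "<book><title>Moby Dick</title><p>Call me <i>Ishmael</i>.</p></book>"));
    }
}
```

The JDK ships an XSLT 1.0 processor, so splitting the output into one file per page would still need either extra Java code or an XSLT 2.0 processor, as noted in the earlier answer.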

You can do this with Apache Tika like this:
byte[] value = /* your XML content as a byte array */;
Parser parser = new XMLParser();
org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(new ByteArrayInputStream(value), textHandler, metadata, context);
return textHandler.toString();
If you are using Maven, you'd probably want both of the dependencies below:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.13</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.13</version>
</dependency>

Getting started with a parser in Java code

I am new to parsers. I would like to fetch specific data from a website, and I need to use a parser for that. How do I get started with parsers? What do I need to download?
What would the code look like to fetch data from a website using a parser in Java?

My advice would be to use an open source HTML parser such as HtmlCleaner - http://htmlcleaner.sourceforge.net/
You can use HtmlCleaner (or similar) to create a representation of the web page DOM, and then use this to extract whatever information you want from the web pages.
The process looks something like this:
URL url = new URL("website you want to load");
HtmlCleaner cleaner = new HtmlCleaner(); // the class is spelled HtmlCleaner
TagNode htmlNode = cleaner.clean(url.openStream());
// perform queries on the DOM to extract information
