How to parse random HTML pages with Apache Tika (and XPath) [duplicate]

This question already has an answer here:
java tika how to convert html to plain text retaining specific element
I'm new to Tika and struggling to understand it.
What I want to achieve is extracting the href of each link on an HTML page (which can be any webpage).
As a first attempt, I just tried to extract the links as such (or even just the first one) using XPath. But I can never get it right and the handler is always empty.
(In this example, I've removed the xhtml: namespace bits because otherwise I got a SAX error.)
The code example is below. Thanks so much for any help :)
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
org.apache.tika.sax.xpath.Matcher anchorLinkContentMatcher = xhtmlParser.parse("//body//a");
ContentHandler handler = new MatchingContentHandler(
        new ToXMLContentHandler(), anchorLinkContentMatcher);
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
try {
    parser.parse(urlContentStream, handler, metadata, pcontext);
    System.out.println(handler);
} catch (Exception e) {
    // ...
}

I found an answer (at least enough to get something working, even if it isn't the final version; I got something out of the handler).
The answer is at java tika how to convert html to plain text retaining specific element
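For reference, the gist of that approach is to keep the xhtml: prefix (Tika's HtmlParser maps everything into the XHTML namespace before the matcher sees it) rather than removing it, and to stick to Tika's restricted XPath subset. Below is a minimal sketch, assuming Tika 1.x; the exact XPath expression and the example URL are illustrative assumptions, not the linked answer verbatim:

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

public class TikaAnchorExtractor {
    public static void main(String[] args) throws Exception {
        // Each step carries the xhtml: prefix because HtmlParser emits
        // all elements in the XHTML namespace
        XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
        Matcher anchorMatcher =
                xhtmlParser.parse("/xhtml:html/xhtml:body//xhtml:a/descendant::node()");

        // Only the SAX events matching the XPath reach the wrapped handler
        ContentHandler handler =
                new MatchingContentHandler(new ToXMLContentHandler(), anchorMatcher);

        try (InputStream stream = new URL("https://example.com/").openStream()) {
            new HtmlParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        System.out.println(handler); // XML serialization of the matched content
    }
}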

Related

Microdata extraction from HTML in Java

I really need help extracting Microdata embedded in HTML5. My purpose is to get structured data from a webpage, just like this tool from Google: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but found no workable solution.
Currently I use the any23 library, but I can't find any documentation, only javadocs, which don't provide enough information for me.
I use any23's Microdata Extractor but get stuck at the third parameter: "org.w3c.dom.Document in". I can't turn HTML content into a w3c DOM. I have tried JTidy as well as JSoup, but the DOM objects in those libraries don't match the Extractor constructor. In addition, I'm also unsure about the 2nd parameter of the Microdata Extractor.
I hope someone can help me with any23 or suggest another library that can solve this extraction issue.
Edit: I found the solution myself, by doing the same thing the any23 command-line tool does. Here is the snippet of code:
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputStream = doc.openInputStream();
// TagSoup tolerates real-world (non-well-formed) HTML and yields a w3c DOM
TagSoupParser tagSoupParser = new TagSoupParser(documentInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(document, new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These lines of code only extract microdata from HTML and write it out in JSON format. I tried to use MicrodataExtractor, which can change the output to other formats (RDF, Turtle, ...), but it seems to only accept XML input. It throws "Document didn't start" when I put in an HTML document.
If anyone has found a way to use MicrodataExtractor, please leave the answer here.
Thank you.
XPath is generally the way to consume HTML or XML.
Have a look at: How to read XML using XPath in Java
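For a self-contained illustration of that approach, the JDK's built-in javax.xml.xpath API can pull href attributes out of a well-formed document. A minimal sketch; real-world HTML usually has to be tidied into XML first (e.g. with TagSoup or JTidy):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathHrefExample {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href='http://example.com/'>link</a></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Select every href attribute of every anchor in the document
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < hrefs.getLength(); i++) {
            System.out.println(hrefs.item(i).getNodeValue());
        }
    }
}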

How to save a Jsoup Document to an HTML file?

I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:
myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();
How should I write this object to an HTML file?
The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.
Some information in a JavaScript element can be lost when parsing it, for example the "timestamp" in the source of an Instagram media page.
Use doc.outerHtml().
import java.io.File;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPage() throws Exception {
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();
    // outerHtml() serializes the whole document, doctype and all
    FileUtils.writeStringToFile(new File("filename.html"), doc.outerHtml(), StandardCharsets.UTF_8);
}
Don't forget to catch exceptions. Add the Apache commons-io dependency (or download the jar) for an easy and quick way to save files in UTF-8.
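If you build with Maven, the dependency looks something like this (the version shown is just an example; the Charset overload of writeStringToFile needs commons-io 2.3 or later):

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version>
</dependency>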
The fact that some elements are ignored must be due to Jsoup's attempt at normalization.
To get the server's exact output without any form of normalization, use this:
Connection.Response html = Jsoup.connect("PUT_URL_HERE").execute();
System.out.println(html.body());
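If the goal is to save the unmodified page rather than print it, the response body can be written straight to disk as bytes; bodyAsBytes() avoids even charset re-encoding. A small sketch with an example URL and filename:

import java.nio.file.Files;
import java.nio.file.Paths;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class SaveRawPage {
    public static void main(String[] args) throws Exception {
        Connection.Response response = Jsoup.connect("http://www.example.net").execute();
        // bodyAsBytes() returns exactly what the server sent, with no
        // parsing or normalization applied by Jsoup
        Files.write(Paths.get("raw-page.html"), response.bodyAsBytes());
    }
}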

Getting paragraph count from Tika for both Word and PDF

I have a scenario where I need to reconcile two documents: a Word (.docx) document and a PDF. The two are supposed to be "identical" to each other (the PDF is just a PDF version of the DOCX file), meaning they should contain the same text, content, etc.
Specifically, I need to make sure that both documents contain the same number of paragraphs. So I need to read the DOCX, get the paragraph count, then read the PDF and grab its paragraph count. If both numbers are the same, then I'm in business.
It looks like Apache Tika (I'm interested in 1.3) is the right tool for the job here. I see in this source file that Tika supports the notion of paragraph counting, but I'm trying to figure out how to get the count from both documents. Here's my best attempt, but I'm choking on connecting some of the final dots:
InputStream docxStream = new FileInputStream("some-doc.docx");
InputStream pdfStream = new FileInputStream("some-doc.pdf");
ContentHandler handler = new DefaultContentHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
Parser parser = new OfficeParser();
ParseContext pc = new ParseContext();
parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);
docxStream.close();
pdfStream.close();
int docxParagraphCount = docxMeta.getXXX(???);
int pdfParagraphCount = pdfMeta.getXXX(???);
if (docxParagraphCount == pdfParagraphCount)
    setInBusiness(myself, true);
So I ask: have I set this up correctly or am I way off base? If off-base, please lend me some help to get me back on track. And if I have set things up correctly, then how do I get the desired counts out of the two Metadata instances? Thanks in advance.
First up, Tika will only give you back the metadata contained within your documents. It won't compute anything for you. So, if one of your documents lacks the paragraph-count metadata, you're out of luck. If one of your documents has duff data (i.e. the program that wrote the file got it wrong), you're also out of luck.
Otherwise, your code is nearly there, but not quite. You most likely want to use DefaultParser or AutoDetectParser: OfficeParser is for the Microsoft file formats only, while the other two automatically load all the available parsers and pick the correct one.
The property you want is PARAGRAPH_COUNT, which comes from the Office metadata namespace. Your code would be something like:
TikaInputStream docxStream = TikaInputStream.get(new File("some-doc.docx"));
TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf"));

// org.xml.sax.helpers.DefaultHandler is a no-op ContentHandler; we only
// want the metadata, not the text
ContentHandler handler = new DefaultHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
ParseContext pc = new ParseContext();
Parser parser = TikaConfig.getDefaultConfig().getParser();

parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);

// Office is org.apache.tika.metadata.Office; getInt returns null when the
// property is absent (see the null-safe variant below)
int docxParagraphCount = docxMeta.getInt(Office.PARAGRAPH_COUNT);
int pdfParagraphCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT);
If you don't care about the text at all, only the metadata, pass in a dummy content handler, as shown above.
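One caveat worth spelling out: Metadata.getInt returns an Integer that is null when the property is absent, so unboxing it straight into an int can throw a NullPointerException. A null-safe variant of the final comparison, building on the code above:

Integer docxCount = docxMeta.getInt(Office.PARAGRAPH_COUNT);
Integer pdfCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT);
if (docxCount != null && docxCount.equals(pdfCount)) {
    // both documents report the same paragraph count
    setInBusiness(myself, true);
}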

Premature end of file while parsing an XML file on Android

I'm trying to read an XML file from an Android app using XOM as the XML library. I'm trying this:
Builder parser = new Builder();
Document doc = parser.build(context.openFileInput(XML_FILE_LOCATION));
But I'm getting nu.xom.ParsingException: Premature end of file, even when the file is empty.
I need to parse a very simple XML file, and I'm ready to use another library instead of XOM, so let me know if there's a better one, or just a solution to the problem using XOM.
In case it helps, I'm using Xerces to get the parser.
Edit:
PS: The purpose of this wasn't to parse an empty file; the file just happened to be empty on the first run, which exposed this error.
If you follow this post to the end, it seems that this has to do with Xerces and the fact that it's an empty file, and they didn't reach a solution on the Xerces side.
So I handled the issue as follows:
Document doc = null;
try {
    Builder parser = new Builder();
    doc = parser.build(context.openFileInput(XML_FILE_LOCATION));
} catch (ParsingException ex) { // other catch blocks are required for other exceptions
    // The file failed to parse (here: it was empty), so create a new root
    // element and a new document; they get filled with XML data elsewhere
    // in the code and saved.
    Element root = new Element("root");
    doc = new Document(root);
}
And then I can do whatever I want with doc. You can add extra checks to make sure that the cause really is an empty file (like checking the file size, as indicated by one of sam's comments on the question).
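That size check can also be done up front, before parsing at all. A minimal sketch with a hypothetical loadOrCreate helper (the method name and structure are mine, not from the original post):

import java.io.File;

import android.content.Context;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;

// Hypothetical helper: test for an empty file before parsing instead of
// relying on the ParsingException to signal it
Document loadOrCreate(Context context, String name) throws Exception {
    File f = context.getFileStreamPath(name);
    if (!f.exists() || f.length() == 0) {
        // nothing to parse yet, so start a fresh document
        return new Document(new Element("root"));
    }
    return new Builder().build(context.openFileInput(name));
}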
An empty file is not a well-formed XML document. Throwing a ParsingException is the right thing to do here.

Extracting text from documents of unknown content type

Is there a parser for the application/octet-stream type within Apache Tika? I suppose it's a non-parsable stream.
I just need to parse ODS documents, MS Office documents and PDF files. It seems that new Tika().parseToString(file); is enough. But I can't figure out what happens when the content type is not detected and application/octet-stream is returned as the default. Is there still a chance to extract text from documents that really are one of those types, but whose type the content-type detector failed to detect?
What else should I try before returning the document to the user and telling them it's an unsupported format?
Or is a resulting application/octet-stream content type really a signal that we can't read this? Or does it mean "you must figure out your own way to deal with this"?
If the detector doesn't know what the file is, it'll return application/octet-stream.
And if the detector doesn't know what it is, then Tika won't be able to pick a suitable Parser for it. (You'll end up with the EmptyParser, which does nothing.)
If you can, pass in the name of your file when you do the detection and parsing, as that'll help with the detection in some cases:
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
// pass the context we just created (the original snippet discarded it)
parser.parse(input, textHandler, metadata, context);
Also, it's worth checking the supported formats part of the Tika website to ensure that the documents you have are ones with a Parser - http://tika.apache.org/0.9/formats.html
If your documents are in a format that isn't currently supported, then you have two choices (neither an immediate fix). One is to help write a new parser (this requires finding a suitable Java library for the format). The other is to wrap a command-line based parser (this requires finding an executable for your platform that can do the XHTML generation, then wiring that in).
