Parsing HTML issues with Apache Tika - java

I am crawling a webpage and after crawling it extract all the links from that webpage and then I am trying to parse all the url using Apache Tika and BoilerPipe by using below code so for some url it is parsing very well but for some I get error like this. And it shows some error on HTMLParser.java: line number 102. This is line number 102 in HTMLParser.java
String parsedText = tika.parseToString(htmlStream, md);
I have provided the HTMLParse code also.
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser#67c28a6a
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.tika.Tika.parseToString(Tika.java:357)
at edu.uci.ics.crawler4j.crawler.HTMLParser.parse(HTMLParser.java:102)
at edu.uci.ics.crawler4j.crawler.WebCrawler.handleHtml(WebCrawler.java:227)
at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:299)
at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:118)
at java.lang.Thread.run(Unknown Source)
Caused by: java.util.zip.ZipException: invalid block type
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.ZipInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:114)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 8 more
This is my HTMLParser.java file-
public void parse(String htmlContent, String contextURL) {
InputStream htmlStream = null;
text = null;
title = null;
metaData = new HashMap<String, String>();
urls = new HashSet<String>();
char[] chars = htmlContent.toCharArray();
bulletParser.setCallback(textExtractor);
bulletParser.parse(chars);
try {
text = articleExtractor.getText(htmlContent);
} catch (BoilerpipeProcessingException e) {
e.printStackTrace();
}
if (text == null){
text = textExtractor.text.toString().trim();
}
title = textExtractor.title.toString().trim();
try {
Metadata md = new Metadata();
String utfHtmlContent = new String(htmlContent.getBytes(),"UTF-8");
htmlStream = new ByteArrayInputStream(utfHtmlContent.getBytes());
//The below line is at the line number 102 according to error above
String parsedText = tika.parseToString(htmlStream, md);
//very unlikely to happen
if (text == null){
text = parsedText.trim();
}
processMetaData(md);
} catch (Exception e) {
e.printStackTrace();
} finally {
IOUtils.closeQuietly(htmlStream);
}
bulletParser.setCallback(linkExtractor);
bulletParser.parse(chars);
Iterator<String> it = linkExtractor.urls.iterator();
String baseURL = linkExtractor.base();
if (baseURL != null) {
contextURL = baseURL;
}
int urlCount = 0;
while
(it.hasNext()) {
String href = it.next();
href = href.trim();
if (href.length() == 0) {
continue;
}
String hrefWithoutProtocol = href.toLowerCase();
if (href.startsWith("http://")) {
hrefWithoutProtocol = href.substring(7);
}
if (hrefWithoutProtocol.indexOf("javascript:") < 0
&& hrefWithoutProtocol.indexOf("#") < 0) {
URL url = URLCanonicalizer.getCanonicalURL(href, contextURL);
if (url != null) {
urls.add(url.toExternalForm());
urlCount++;
if (urlCount > MAX_OUT_LINKS) {
break;
}
}
}
}
}
Any suggestions will be appreciated.

Sounds like a malformed OOXML document (.docx, .xlsx, etc.). To check whether the problem still occurs with the latest Tika version, you can download the tika-app jar and run it like this:
java -jar tika-app-1.0.jar --text http://url.of.the/troublesome/document.docx
This should print out the text contained in the document. If it doesn't work, please file a bug report with the URL of the troublesome document (or attach the document if it's not publicly available).

I had same issue, I found that documents(docx) files which I was trying to parse was not actually simple document, It was form developed in microsoft word with text and input fields beside label text.
I removed such files from folder and post rest of all files to Solr engine for parsing and indexing, It worked.

Related

Unsupported data type when getting mail JPG images

I'm trying to get the inline images of a mail, for which I have the following code:
protected void setCidAttachments(Message message, MensajeEmail mensajeEmail) {
try {
MimeMultipart mimeMultipart = (MimeMultipart) message.getDataHandler().getContent();
for (int k = 0; k < mimeMultipart.getCount(); k++) {
MimeBodyPart part = (MimeBodyPart) mimeMultipart.getBodyPart(k);
processPart(part, mensajeEmail);
}
}
catch (Exception e) {
log.error("Error obtendo adxuntos con cid", e);
}
}
private void processPart (BodyPart part, MensajeEmail mensajeEmail) throws MessagingException, IOException {
String type = getContentType(part);
StringBuilder content = new StringBuilder(mensajeEmail.getContenido());
if (isImage(type) && part.getDataHandler() != null && part.getDataHandler().getContent() != null) {
if (part.getDataHandler().getContent() instanceof MimeMultipart) {
MimeMultipart p = (MimeMultipart) part.getDataHandler().getContent();
for (int i = 0; i < p.getCount(); i++) {
BodyPart subpart = p.getBodyPart(i != p.getCount() - 1 ? i + 1 : i);
processPart(subpart, mensajeEmail);
}
} else {
mensajeEmail.setContenido(getInlineImage(part, content));
}
}
}
private String getInlineImage (BodyPart part, StringBuilder content) throws MessagingException, IOException {
Base64 decoder64 = new Base64();
ByteArrayOutputStream bos = new ByteArrayOutputStream();
// Get type
String type = getContentType(part);
// Get Content-ID
String contentId = getContentId(part);
// Replace
if (contentId.length() > 0) {
part.getDataHandler().writeTo(bos);
int start = content.indexOf("src=\"cid:" + contentId + "\"") + 5;
if (start > 4) {
int length = contentId.length() + 4;
content.replace(start, start + length, "data:" + (isImage(type) ? type : "image/png;") + " base64," + decoder64.encodeToString(bos.toByteArray()));
}
}
bos.close();
return content.toString();
}
private String getContentId (BodyPart part) throws MessagingException {
Enumeration headers = part.getAllHeaders();
while (headers.hasMoreElements()) {
Header header = (Header)headers.nextElement();
if (header.getName().equalsIgnoreCase("Content-ID"))
return cleanContentId(header.getValue());
}
return "";
}
private String getContentType (BodyPart part) throws MessagingException {
return part.getContentType().split(" ")[0];
}
private boolean isImage (String mime) {
return !mime.equals("text/html;") && !mime.equals("text/plain;");
}
private String cleanContentId (String contentId) {
if (contentId.charAt(0) == '<') contentId = contentId.substring(1, contentId.length() - 1);
return contentId;
}
This works perfectly fine when I send PNG images (which makes me think my code is indeed correct). However, when I try to send a JPG image, I get the following exception:
javax.activation.UnsupportedDataTypeException: Unknown image type image/jpeg; name=sony-car-796x418.jpg
at org.apache.geronimo.activation.handlers.AbstractImageHandler.getContent(AbstractImageHandler.java:57)
at javax.activation.DataSourceDataContentHandler.getContent(DataHandler.java:795)
at javax.activation.DataHandler.getContent(DataHandler.java:542)
at es.enxenio.fcpw.plinper.daemons.email.AbstractProtocoloObtencionEmail.processPart(AbstractProtocoloObtencionEmail.java:378)
...
Is the framework really not able to work with JPG images? Is there some way I can fix this?
EDIT: Gmail doesn't even let me send JPG images so it's probably not a very common format for mail images, which makes me think might not be widely implemented and that could be the reason why Java doesn't seem to be able to work with it
I found the problem. This line
if (isImage(type) && part.getDataHandler() != null && part.getDataHandler().getContent()
shouldn't check whether the type is an image but whether it is a multipart. Otherwise we could be processing a jpg image as a multipart. For some reason png images don't mind this and that's why it was working. Here are the changed parts of the code:
protected void setCidAttachments(Message message, MensajeEmail mensajeEmail) {
try {
processPart(message, mensajeEmail);
}
catch (Exception e) {
log.error("Error obtendo adxuntos con cid", e);
}
}
private void processPart(Part part, MensajeEmail mensajeEmail) throws MessagingException, IOException {
String type = getContentType(part);
StringBuilder content = new StringBuilder(mensajeEmail.getContenido());
if (isMultipart(type) && part.getDataHandler() != null && part.getDataHandler().getContent() != null && part.getDataHandler().getContent() instanceof MimeMultipart) {
MimeMultipart multipart = (MimeMultipart) part.getDataHandler().getContent();
for (int i = 0; i < multipart.getCount(); i++) {
BodyPart subpart = multipart.getBodyPart(i);
processPart(subpart, mensajeEmail);
}
}
else {
mensajeEmail.setContenido(getInlineImage(part, content));
}
}
private boolean isMultipart(String mime) {
return (Pattern.matches("multipart/.*", mime));
}
I got a similar exception running an app on eclipse osgi with java 11 and with bundles javax.mail.glassfish 1.4.1 and javax.activation 1.1.0 (got these 2 from https://eclipse.org/orbit):
javax.activation.UnsupportedDataTypeException: Unknown image type image/jpeg; name=image001.jpg
at org.apache.geronimo.activation.handlers.AbstractImageHandler.getContent(AbstractImageHandler.java:57)
at javax.activation.DataHandler.getContent(DataHandler.java:147)
at javax.mail.internet.MimeBodyPart.getContent(MimeBodyPart.java:652)
at my.code.calling.getcontent.MyClass(MyClass.java:802)
The package org.apache.geronimo.activation.handlers is included in the javax.transaction 1.1.0 bundle.
I resolved the problem by #-commenting the image/gif, image/jpeg handlers in the file META-INF/mailcap inside the javax.activation bundle:
## <apache license disclaimer> http://www.apache.org/licenses/LICENSE-2.0
##
## $Rev$ $Date: 2008/04/09 19:25:23 $
##
text/plain;; x-java-content-handler=org.apache.geronimo.activation.handlers.TextPlainHandler
text/html;; x-java-content-handler=org.apache.geronimo.activation.handlers.TextHtmlHandler
text/xml;; x-java-content-handler=org.apache.geronimo.activation.handlers.TextXmlHandler
#image/gif;; x-java-content-handler=org.apache.geronimo.activation.handlers.ImageGifHandler
#image/jpeg;; x-java-content-handler=org.apache.geronimo.activation.handlers.ImageJpegHandler
multipart/*;; x-java-content-handler=org.apache.geronimo.activation.handlers.MultipartHandler
There's no image/png here, that's why pngs are not a problem in the first place.
So by commenting gif and jpeg handlers, attachments of these types are now handled like pngs: getContent() will just yield an InputStream, instead of an AWT Image, which I think those geronimo ImageHandlers would produce if everything worked as intended.
Some internals: Geronimo AbstractImageHandler of javax.activation 1.1.0 tries to determine the type of image from javax.mail.glassfish 1.4.1 method IMAPBodyPart.getContent(), but this returns the mime-type incl. parameters, e.g. "image/jpeg; name=sony-car-796x418.jpg", which isn't understood and ultimately leads to the UnsupportedDataTypeException.
javax.mail.glassfish also has an META-INF/mailcap file, whose image/* section interestingly looks like this:
# can't support image types because java.awt.Toolkit doesn't work on servers
#
#image/gif;; x-java-content-handler=com.sun.mail.handlers.image_gif
#image/jpeg;; x-java-content-handler=com.sun.mail.handlers.image_jpeg
That could be a lead, I still did get the original jpeg exception also in a gui application, though.
Another thing, this error doesn't occur for me when running the same setup with java 8 instead of 11, probably got something to do with java 8 having its own javax.activation package.

Gate- Loading a gapp file is taking time .Hoe can I reduce it?

I am working on GATE related project. So, Here I am creating a pipeline through GATE. I am using that .gapp file in my java code. So, for loading a .gapp file, it takes around 10 seconds, which is too much for my application.
How can I solve this issue?
Second problem is that, I have to do System.exit after processing a document to release the memory, if I didn't then I got OutofMemoryError.
So, how can I solve these issues?
My code is like:
public class GateMainClass {
CorpusController application = null;
Corpus corpus = null;
public void processApp(String gateHome, String gapFilePath, String docSourcePath, String isSingleDocument) throws ResourceInstantiationException {
try {
if (!Gate.isInitialised()) {
Gate.runInSandbox(true);
Gate.setGateHome(new File(gateHome));
Gate.setPluginsHome(new File(gateHome, "plugins"));
Gate.init();
}
application = (CorpusController) PersistenceManager.loadObjectFromFile(new File(gapFilePath));
corpus = Factory.newCorpus("main");
application.setCorpus(corpus);
if(isSingleDocument.equals(Boolean.TRUE.toString())) {
Document doc = Factory.newDocument(new File(docSourcePath).toURI().toURL());
corpus.add(doc);
} else {
File[] files;
File folder = new File(docSourcePath);
files = folder.listFiles(new FileUtil.CustomFileNameFilter(".xml"));
Arrays.sort(files, LastModifiedFileComparator.LASTMODIFIED_REVERSE);
for (int i = 0; i < files.length; i++) {
Document doc = Factory.newDocument(files[i].toURI().toURL());
corpus.add(doc);
}
}
application.execute();
} catch (Exception e) {
e.printStackTrace();
} finally {
corpus.clear();
}
}
And my gapp file is like:
1.Document Reset PR
2.Annie English Tokenizer.
3.ANnie Gazetteer.
4.Annie Sentence Spiliter.
5.Annie POS Tagger.
6.GATE morphological Analyser.
7.Flexible gazetteer.
8.HTML markup transfer
9.Main Jape file.

Android Studio ignores some breakpoints, while others work

in Android Studio I would like to analyze my variables with breakpoints, but some of them are just ignored, while others work. I have no idea why.
I am especially interested in the variables of the following method.
Do have any idea why the debugger ignores the breakpoints in it?
private Entry readEntry(XmlPullParser parser) throws XmlPullParserException, IOException {
parser.require(XmlPullParser.START_TAG, ns, "searchresults");
String longitude = null;
String latitude = null;
String place_id = null;
while (parser.next() != XmlPullParser.END_TAG) {
if (parser.getEventType() != XmlPullParser.START_TAG) {
continue;
}
String name = parser.getName();
if (name.equals("place")) {
longitude = readLongitude(parser);
latitude = readLatitude(parser);
place_id = readPlace_id(parser);
} else {
skip(parser);
}
}
return new Entry(longitude, latitude, place_id);
}
As you might have guessed I am trying to parse an XML document in an Async Task and this is one step of the way. I am trying to find out, what I did wrong while parsing this xml document:
<searchresults timestamp="Fri, 26 Aug 16 10:38:53 +0000" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" querystring="warmbandwerk" polygon="false" exclude_place_ids="129799074,151481629" more_url="http://open.mapquestapi.com/nominatim/v1/search?format=xml&exclude_place_ids=129799074,151481629&accept-language=de,en-US;q=0.7,en;q=0.3&q=warmbandwerk">
<place place_id="129799074" osm_type="way" osm_id="220625788" place_rank="30" boundingbox="51.4947849,51.497683,6.7382846,6.746241" lat="51.49646145" lon="6.74226302912559" display_name="Warmbandwerk 1, Kaiser-Wilhelm-Straße, ThyssenKrupp Steel Europe AG, Hamborn, Duisburg, Regierungsbezirk Düsseldorf, Nordrhein-Westfalen, 47166, Deutschland" class="building" type="yes" importance="0.101"/>
<place place_id="151481629" osm_type="relation" osm_id="2917388" place_rank="30" boundingbox="51.4860521,51.4891516,6.7050164,6.7144352" lat="51.48741265" lon="6.70882226405504" display_name="Warmbandwerk 2, Werkstraße 1, ThyssenKrupp Steel Europe AG, Beeckerwerth, Meiderich/Beeck, Duisburg, Regierungsbezirk Düsseldorf, Nordrhein-Westfalen, 47139, Deutschland" class="building" type="yes" importance="0.101"/>
</searchresults>
edit: Because someone asked which breakpoints do not work. In the above method none of them work. In the method below all of them work until the for-statement. The for-statement and return breakpoints are ignored.
private String loadXmlFromNetwork(String urlString) throws XmlPullParserException, IOException {
InputStream stream = null;
// Instantiate the parser
XMLParser XMLParser = new XMLParser();
List<XMLParser.Entry> entries = null;
StringBuilder htmlString = new StringBuilder();
htmlString.append("<h3>" + getResources().getString(R.string.page_title) + "</h3>");
htmlString.append("<em>" + getResources().getString(R.string.updated));
try {
stream = downloadUrl(urlString);
entries = XMLParser.parse(stream);
// Makes sure that the InputStream is closed after the app is
// finished using it.
} finally {
if (stream != null) {
stream.close();
}
}
// StackOverflowXmlParser returns a List (called "entries") of Entry objects.
// Each Entry object represents a single post in the XML feed.
// This section processes the entries list to combine each entry with HTML markup.
// Each entry is displayed in the UI as a link that optionally includes
// a text summary.
for (XMLParser.Entry entry : entries) {
htmlString.append("<p><a href='");
htmlString.append(entry.longitude);
htmlString.append(entry.latitude);
htmlString.append(entry.place_id);
}
return htmlString.toString();
}

Save file from a website with java

I'm trying to build a jsoup based java app to automatically download English subtitles for films (I'm lazy, I know. It was inspired from a similar python based app). It's supposed to ask you the name of the film and then download an English subtitle for it from subscene.
I can make it reach the download link but I get an Unhandled content type error when I try to 'go' to that link. Here's my code
public static void main(String[] args) {
try {
String videoName = JOptionPane.showInputDialog("Title: ");
subscene(videoName);
}
catch (Exception e) {
System.out.println(e.getMessage());
}
}
public static void subscene(String videoName){
try {
String siteName = "http://www.subscene.com";
String[] splits = videoName.split("\\s+");
String codeName = "";
String text = "";
if(splits.length>1){
for(int i=0;i<splits.length;i++){
codeName = codeName+splits[i]+"-";
}
videoName = codeName.substring(0, videoName.length());
}
System.out.println("videoName is "+videoName);
// String url = "http://www.subscene.com/subtitles/"+videoName+"/english";
String url = "http://www.subscene.com/subtitles/title?q="+videoName+"&l=";
System.out.println("url is "+url);
Document doc = Jsoup.connect(url).get();
Element exact = doc.select("h2.exact").first();
Element yuel = exact.nextElementSibling();
Elements lis = yuel.children();
System.out.println(lis.first().children().text());
String hRef = lis.select("div.title > a").attr("href");
hRef = siteName+hRef+"/english";
System.out.println("hRef is "+hRef);
doc = Jsoup.connect(hRef).get();
Element nonHI = doc.select("td.a40").first();
Element papa = nonHI.parent();
Element link = papa.select("a").first();
text = link.text();
System.out.println("Subtitle is "+text);
hRef = link.attr("href");
hRef = siteName+hRef;
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
Jsoup.connect(hRef).get(); //<-- Here's where the problem lies
}
catch (java.io.IOException e) {
System.out.println(e.getMessage());
}
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted because this way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving using java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend to verify the content before saving.
Because some servers will just return a HTML page when the file cannot be found, i.e. a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>"))
{
throw new IllegalStateException("invalid file content");
}
...
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
Looks like jsoup expects that the result of Jsoup.connect(hRef) should be an HTML or some text that it's able to parse, that's why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually and the last URL you're trying to access returns a content type of application/x-zip-compressed, thus the cause of the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection, URL or use a third party library like Apache HttpComponents to fire a GET request and retrieve the result as an InputStream, wrap it into a proper writer and write your file into your disk.
Here's an example about doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ( (length = bis.read(buffer)) > 0 ) {
out.write(buffer, 0, length);
}
out.close();
in.close();

Specific characters not rendering properly in Java

I have an issue when displaying strings received from a server in a JTable. Some specific characters appear as little white squares instead of "é" or "à" etc. I tried a lot of things but none of them fixed my problem. I'm working with Eclipse under Windows. The server was developped using Visual Studio 2010.
The server sends an XML file using tinyXML2, the client uses JDom to read it. The font used is "Dialog". The server takes the strings from an Oracle database.
I assume this is an encoding problem, but I haven't been able to fix it yet.
Does anyone have an idea ?
Thx
Arnaud
EDIT : As requested, this is how I use JDom
public static Player fromXML(Element e)
{
Player result = new Player();
String e_text = null;
try
{
e_text = e.getChildText(XMLTags.XML_Player_playerId);
if (e_text != null) result.setID(Integer.parseInt(e_text));
e_text = e.getChildText(XMLTags.XML_Player_lastName);
if (e_text != null) result.setName(e_text);
e_text = e.getChildText(XMLTags.XML_Player_point_scored);
if (e_text != null) result.addSpecial(STAT_SCORED, Double.parseDouble(e_text));
e_text = e.getChildText(XMLTags.XML_Player_point_scored_last);
if (e_text != null) result.addSpecial(STAT_SCORED_LAST, Double.parseDouble(e_text));
}
catch (Exception ex) {
ex.printStackTrace();
}
return result;
}
public static Document load(String filename) {
File XMLFile = new File(CLIENT_TO_SERVER, filename);
SAXBuilder sxb = new SAXBuilder();
Document document = new Document();
try
{
document = sxb.build(new File(XMLFile.getPath()));
} catch(Exception e){e.printStackTrace();}
return document;
}
read the file using correct encoding, something like:
document = sxb.build(new BufferedReader(new InputStreamReader(new FileInputStream(XMLFile.getPath()), "UTF8")));
Note: 1. 1st determine which char encoding used in that file. specify that charset instead of UTF8 above.
Incase encoding is not known or it's being generated from various systems with different encoding, you may use 'encoding detector library of Mozilla'. #see https://code.google.com/p/juniversalchardet/
need to handle UnsupportedEncodingException

Categories

Resources