Apache Nutch 2.3.1 Fetcher giving Invalid uri exception - java

I have configured Apache Nutch 2.3.1 with Hadoop ecosystem. I have to fetch some person-arabic script websites. Nutch is giving exception for few URLs at fetch time. Following is an example exception
java.lang.IllegalArgumentException: Invalid uri 'http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html': escaped absolute path not valid
at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:77)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:173)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:245)
at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:564)

I've been able to reproduce this issue even on the 1.x branch. The problem is that the Java URI class that the Apache HTTP client library uses internally doesn't support non escaped UTF-8 characters:
From the JavaDoc documentation for java.net.URI:
Character categories
RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:
alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
digit The US-ASCII decimal digit characters, '0' through '9'
alphanum All alpha and digit characters
unreserved All alphanum characters together with those in the string "_-!.~'()*"
punct The characters in the string ",;:$&+="
reserved All punct characters together with those in the string "?/[]#"
escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)
The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.
Properly escaped the URL would look more like:
http://agahi.safirak.com/ads/850/%D9%BE%DB%8C%DA%86-%D8%A8%D9%86%D8%AF-%D8%A8%D8%A7%D8%AF%DB%8C-%D9%87%D9%81%D8%AA%DB%8C%D8%B1%DB%8C-1800-%D8%AF%D9%88%D8%B1-%D8%A8%D8%A7%D8%AF%DB%8C-%D8%AC%DB%8C%D8%B3%D9%88%D9%86.html
Actually if you open the example URL on Chrome and then copy the URL from the address bar, you'll get the escaped representation. Feel free to open an issue for this (otherwise I'll do it). In the mean time you could try to use the protocol-http plugin which does not uses the Apache HTTP client. I've tested locally and the parsechecker works fine:
➜ local (master) ✗ bin/nutch parsechecker "http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html"
fetching: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
robots.txt whitelist not configured.
parsing: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
contentType: text/html
signature: 048b390ab07464f5d61ae09646253529
---------
Url
---------------
http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: پیچ بند بادی هفتیری 1800 دور بادی جیسون-نیازمندی سفیرک
Outlinks: 76
outlink: toUrl: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html anchor:
outlink: toUrl: http://agahi.safirak.com/assets/fonts/font-awesome/css/font-awesome.min.css anchor:
outlink: toUrl: http://agahi.safirak.com/assets/css/bootstrap.css anchor:
...

Related

Why does reading a test resource with ':' in filename result in a NPE?

I am trying to determine how to read a file from resources with the special character : in it's name. I am traversing the directory in my main code which returns File listings that I use to ingest the file without issue.
In my unit test doing something as simple as:
this.getClass().getResource("/filename:containing.character")
returns a java.lang.NullPointerException due to the presence of the :, presumably because it thinks it is a protocol. What is the work around for this? I cannot rename the file as the filename encodes some information I am looking to test in this case.
Special characters for referencing resources
Quoting from the IETF RFC 2396 specification by T. Berners-Lee et al., accessible from here: https://www.ietf.org/rfc/rfc2396.txt
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
reserved = ";" | "/" | "?" | ":" | "#" | "&" | "=" | "+" |
"$" | ","
and
URI Syntactic Components
The URI syntax is dependent upon the scheme. In general, absolute
URI are written as follows:
<scheme>:<scheme-specific-part>
Consequence
Java's implementation of the URL class (8, 11, 14) is compliant with the IETF RFC 2396 specification document. Therefore, references to resources should/must not contain any colon : symbol.
This will result in a MalformedURLException when parsed with your input filename:containing.character (see OpenJDK implementation of URL, line 652). Next, this will result in a null value in URLClassPath (line 1254) as the previous instance of MalformedURLException is not thrown yet "converted" to a "non successful" response to this.getClass().getResource(..) call.
Conclusion
Avoid colon (:) symbols in your filenames OR
Rename affected files before processing them as "resources" via URLs in Java.
In addition to #MWiesner answer.
What we can do about if we already have files like that?
You can specify by parameter the folder in which those files are contained. And then traverse the list of children and find the desired file by simple name comparison.

XML entity references with unicode paths

I'm caching XML entities so I don't have to fetch them from server, resulting in XML header tags like
<!ENTITY % xhtml-special-local SYSTEM "/Users/test/Library/Application Support/test/xhtml-special.ent" > %xhtml-special-local;
This works great unless the username contains öäå or similar non-ascii characters. With these, I get the following parser error
java.net.MalformedURLException: no protocol: /Users/test/Library/Application Support/testööö/xhtml-special.ent
How should the entity path be escaped to be accepted by the parser?
This can be fixed by prepending file:/// to the path, like so
<!ENTITY % xhtml-special-local SYSTEM "file:////Users/test/Library/Application Support/test/xhtml-special.ent" > %xhtml-special-local;
It should be a URI, not a filename. That means it should have "file://" at the start.
Secondly, URIs only allow ASCII characters. Some systems are more flexible and accept IRIs (an extension of URIs that permits non-ASCII characters), but there's nothing in the specs that endorses this. To be portable you need to use %XX escaping for non-ASCII characters. The easiest way to do this if you're in Java is with File.toURI().
System.err.println(new File(
"/Users/test/Library/Application Support/testööö/xhtml-special.ent"
).toURI().toASCIIString());
outputs
file:/Users/test/Library/Application%20Support/test%C3%B6%C3%B6%C3%B6/xhtml-special.ent
The sequence %C3%B6 is the hex representation of the two octets making up the UTF-8 encoding of the Unicode character ö.

Java Play Framework, WSRequestHolder, how to specify encoding of query parameters?

I am using Play 2.3 Java application, I am sending Get request to a server and I include special characters in query parameters, like Š, which is sent as %C5%A0 but server understand only Windows-1250 characters. In this case it expects %8A (see encoding https://www.w3schools.com/tags/ref_urlencode.asp)
example:
wsRequestHolder.setQueryParameter("city", "Plavecký Štvrtok");
How can I set encoding of sending query paremeters via WSRequestHolder to something different than UTF-8?
There is no implicit way to define the encoding of HTTP Query parameters for WSRequestHandlers in Play.
The RFC 3986 - Uniform Resource Identifier (URI) only defines that characters not available in the ASCII charset must be encoded in a certain way.
So its up to you to convert the String into the proper encoding that is supported by the server. Play will then escape it to be a valid URI only consisting of ASCII characters.
ws.RequestHolder.setQueryParameter("city", new String("Plavecký Štvrtok".getBytes(), "Cp1250")
See supported encodings in Java 8 and what their canonical names are.

Different behavior when space is encoded as + and %20 in a URL

Pages with spaces in the URL don't get correctly translated:
i.e.
http://www.streetinsider.com/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html
or
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
Gives 404. Please note "Press Releases" is encoded as "Press%20Releases".
However following two versions work fine where "Press Releases" is encoded as "Press+Releases".
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
The article parses fine with plus signs or HEX spaces %20.
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
Both + and %20 represent spaces. Then why this behavior.
And also, in java what could I use to get the correct encoded URL
Both + and %20 represent spaces
Only in query strings. Elsewhere in a URL a plus is a plus, not a space. In this case the web server gives you the same content for the two different URLs
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
and
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
but the two URLs are distinct, they're not alternative representations of the same URL.
Officially + might only be used in the query string (after ?).
This is what URLEncoder is for:
"?x=" + URLEncoder.encode("Hello World", "UTF-8");
"?x=" + URLEncoder.encode("ŝi estas ĉarma", "UTF-8");
?x=Hello+World
?x=%C5%9Di+estas+%C4%89arma
The more universal class URI, obeys the specification for spaces to be replaced, using %.
URI uri = new URI("http", "www.streetinsider.com",
"/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
"?x=ŝi estas ĉarma");
String u = uri.toString();
http://www.streetinsider.com/Press%20Releases/National%20Trends%20
Reflected%20in%20Plano%20Housing%20Market/9778767.html#?x=ŝi%20estas%20ĉarma
One sometime encounters URI as generalisation for File and others, and then has to be careful not introducing %20 in file names.
So probably there is a partial remapping on streetinsider of + or even %20 as it seems; in order to reach the same code.
Your statement
Both + and %20 represent spaces.
is not exactly true in all cases.
Space characters may only be encoded as "+" in one context: application/x-www-form-urlencoded key-value pairs.
The RFC-1866 (HTML 2.0 specification), paragraph 8.2.1. subparagraph 1. says: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped").
Here is an example of such a string in URL where RFC-1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses (in other cases, spaces should be encoded to %20). This way of encoding form data is also given in later HTML specifications, for example, look for relevant paragraphs about application/x-www-form-urlencoded in HTML 4.01 Specification, and so on.
The URL that you have provided is not a form data containing key/value pairs, it's just a path to a 9778767.html file:
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
So, it is illegal to use pluses here. The correct URL in this case should have been the following:
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html

Chinese character in URL with Java

I used the following line in Firefox's URL field :
http://www.baidu.com/s?wd=你
This line was generated by my Java program.
The last Chinese character in the URL field sometimes became: %C4%E3 [Correct]
Other times it became: %E4%BD%A0 [Incorrect]
I tried to use the URL with IE. It shows up still as 你, but the result page search field shows the character as 浣. Could this be a UTF-8 or UTF-16 encoding problem? How do I get the correct code %C4%E3 from the char 你 with my Java program?
URLEncoder.encode(string, encoding)

Categories

Resources