Grammar vs regular expression for parsing URLs?


The BNF for URLs is given in RFC 1738:
http://www.w3.org/Addressing/rfc1738.txt
What I need to do is extract the URLs from HTML text. I was wondering whether I can represent the BNF as a regular expression, built up like this:
String alpha = "[a-zA-Z]";
String alphadigit = "[a-zA-Z0-9]";
String domainlabel = alphadigit+"|"+alphadigit+"("+alphadigit+"|-)*?"+alphadigit;
//String toplabel = alpha+"|"+alpha+"("+alphadigit+"|-)*?"+alphadigit;
String toplabel = "com|org|net|mil|edu|(co\\.[a-z]+)";
String hostname = "(("+domainlabel+")\\.)*("+toplabel+")";
String hostport = hostname;
String lowalpha = "([a-z])";
String hialpha = "([A-Z])";
String alpha = "("+lowalpha+"|"+hialpha+")";
String digit = "([0-9])";
String safe = "($|-|_|.|\\+)";
String extra = "(!|\\*|'|\\(|\\)|,)";
//String national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`";
String punctuation = "(<|>|#|%|\")";
String reserved = "(;|/|?|:|#|&|=)";
String hex = "("+digit+"[A-Fa-f]"+")";
String escape = "(%"+hex+hex+")";
String unreserved = "("+alpha+"|"+digit+"|"+safe+"|"+extra+")";
String uchar = "("+unreserved+"|"+escape+")";
String hsegment = "(("+uchar+"|;|:|#|&|=)*)";
String search = "("+uchar+"|;|:|#|&|=)?)";
String hpath = hsegment+"(/"+hsegment+")*";
//String httpurl = "http://"+hostport+"(/"+hpath+"(?"+search+")?)?";
String httpurl = "http://"+hostport+"/"+hpath;
The final regex:
http://(([a-zA-Z0-9]|[a-zA-Z0-9]([a-zA-Z0-9]|-)*?[a-zA-Z0-9])\.)*(com|org|net|mil|edu|(co\.[a-z]+))/(((((([a-z])|([A-Z]))|([0-9])|($|-|_|.|\+)|(!|\*|'|\(|\)|,))|(%(([0-9])[A-Fa-f])(([0-9])[A-Fa-f])))|;|:|#|&|=)*)(/(((((([a-z])|([A-Z]))|([0-9])|($|-|_|.|\+)|(!|\*|'|\(|\)|,))|(%(([0-9])[A-Fa-f])(([0-9])[A-Fa-f])))|;|:|#|&|=)*))*
So, as you can see, I turned the whole BNF into one big regular expression, which will be used with the java.util.regex classes to extract the URLs from text. Is this the correct approach? If it is, then why do we need to write a context-free grammar at all? What disadvantages does the regex approach have?
Besides, a grammar-based parser, say for a programming language, uses the grammar to validate whether the code follows the grammar rules and shows error messages otherwise. The grammar also gives us a syntax tree, which is used to evaluate the expression. For URLs we don't evaluate anything; we just need to extract the URLs from the rest of the text.
I ask because I was previously trying to parse email addresses. After an exhaustive search for regular expressions, none of them turned out to be 100% accurate, and some comments pointed out the limitations of regex in matching the exact BNF form of email addresses in the RFC. Hence a grammar (instead of a regex) might be required, and hence my question for URLs.
Thanks

Well, I think your problem could be solved more easily with a couple of heuristics about what an http link looks like in free text. This can also be faster than such a complicated regexp, especially on large texts:
An http link (URL) starts with the distinctive prefix http://
From start to end, a URL doesn't contain certain characters (whitespace, for example). When you come across such a character, you have found the end of the URL.
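The two heuristics above can be sketched as a plain character scan. This is only an illustration: the class name UrlScanner, the method name, and the exact set of terminating characters are assumptions, not something the answer specifies.

```java
import java.util.ArrayList;
import java.util.List;

class UrlScanner {
    // Characters assumed to end a URL in free text or HTML (an illustrative choice).
    private static final String TERMINATORS = " \t\r\n\"'<>";
    private static final String PREFIX = "http://";

    // Returns every substring that starts with "http://" and runs up to the next terminator.
    static List<String> extractHttpUrls(String text) {
        List<String> urls = new ArrayList<>();
        int start = text.indexOf(PREFIX);
        while (start >= 0) {
            int end = start + PREFIX.length();
            while (end < text.length() && TERMINATORS.indexOf(text.charAt(end)) < 0) {
                end++;
            }
            // Skip a bare "http://" that is followed directly by a terminator.
            if (end > start + PREFIX.length()) {
                urls.add(text.substring(start, end));
            }
            start = text.indexOf(PREFIX, end);
        }
        return urls;
    }
}
```

A scan like this does not validate the URL against the RFC grammar; it only finds candidates, which may be enough for extraction.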

If the URLs you are extracting are inside tags (such as the href attribute of an anchor tag), then I'd recommend using JSoup to parse and inspect the HTML.
http://jsoup.org/
Within the body text, I'm certain a simpler regex approach is possible, perhaps matching on the protocol (http://).
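A simpler regex along those lines might look like the sketch below. The pattern (the http:// prefix followed by anything that is not whitespace, a quote, or an angle bracket) and the class name SimpleUrlRegex are assumptions for illustration, not a pattern taken from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleUrlRegex {
    // "http://" followed by one or more characters that are not whitespace, quotes, or angle brackets.
    private static final Pattern HTTP_URL = Pattern.compile("http://[^\\s\"'<>]+");

    static List<String> findAll(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = HTTP_URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

One known weakness of such a loose pattern is that trailing punctuation in prose (a full stop right after the link, for instance) is captured as part of the match.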

Related

Can't make sense of this piece of java code

I'm working on a legacy system, and I ran into this piece of code that I can't make sense of.
String note = URLDecoder.decode(URLEncoder.encode(
message.replaceAll("\\<.*?\\>", ""),
"UTF-8").replace("%0D%0A", "<br>"), "UTF-8");
What does this do, and why is the string encoded and then decoded again?
FYI: This "message" is appended to an email which is sent.

Reading from the inside out: first, replaceAll removes all tags (like <tag>).
Then the result is URL-encoded, and replace swaps every encoded CRLF (carriage return and line feed, %0D%0A) for a <br> tag. Finally, the string is decoded again.
UTF-8 is the character encoding used to convert between raw bytes and actual characters. The W3C (World Wide Web Consortium) states that UTF-8 should be used.

From the coding perspective, breaking it down into the following steps may make it easier to understand:
String updatedMessage = message.replaceAll("\\<.*?\\>", "");
System.out.println(updatedMessage);
String encodedMessage = URLEncoder.encode(updatedMessage, "UTF-8");
System.out.println(encodedMessage);
String updatedEncodedMessage = encodedMessage.replace("%0D%0A", "<br>");
System.out.println(updatedEncodedMessage);
String note = URLDecoder.decode(updatedEncodedMessage, "UTF-8");
System.out.println(note);
Only the first step (replaceAll) uses a regex; the rest is plain string replacement plus URL encoding and decoding.
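To see the effect on a concrete message, the steps can be wrapped in a small method. The class name NoteFormatter, the method names, and the sample input are made up for illustration. The second method is an assumption about what the legacy code was trying to achieve: for ordinary text, the encode/replace/decode round trip appears to give the same result as replacing CRLF directly.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class NoteFormatter {
    // Same sequence of operations as the legacy one-liner.
    static String toNote(String message) throws UnsupportedEncodingException {
        String withoutTags = message.replaceAll("\\<.*?\\>", "");   // strip anything that looks like a tag
        String encoded = URLEncoder.encode(withoutTags, "UTF-8");   // CRLF becomes %0D%0A, space becomes +
        String withBreaks = encoded.replace("%0D%0A", "<br>");      // encoded CRLF becomes a literal <br>
        return URLDecoder.decode(withBreaks, "UTF-8");              // everything else is decoded back
    }

    // Hypothetical simplification without the encode/decode round trip.
    static String toNoteDirect(String message) {
        return message.replaceAll("\\<.*?\\>", "").replace("\r\n", "<br>");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(toNote("<b>Hello</b> world\r\nBye"));
    }
}
```

The newly inserted <br> survives the final decode because URLDecoder only rewrites + and %xx sequences and leaves other characters alone.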

Replacing special character from a String in Android

I have a String that is used as a folder/file name; I create a folder or file with that string. The string may contain characters that are not allowed in a folder or file name,
e.g.
String folder = "ArslanFolder 20/01/2013";
So I want to replace these characters with "_".
Here are the characters:
private static final String ReservedChars = "|\\?*<\":>+[]/'";
What would the regular expression for that be? I know about replaceAll(), but I need to build a regular expression for it.

Use this code:
String folder = "ArslanFolder 20/01/2013 ? / '";
String result = folder.replaceAll("[|?*<\":>+\\[\\]/']", "_");
And the result would be:
ArslanFolder 20_01_2013 _ _ _
You didn't say that spaces should be replaced, so the spaces are kept; you can add the space to the character class if necessary.

I used one of these:
String alphaOnly = input.replaceAll("[^\\p{Alpha}]+","");
String alphaAndDigits = input.replaceAll("[^\\p{Alpha}\\p{Digit}]+","");
See this link:
Replace special characters

Try this:
replaceAll("[\\W]", "_");
It replaces every non-word character (anything other than a letter, a digit, or an underscore) with an underscore. Note that this includes spaces.

This is the correct solution:
String result = inputString.replaceAll("[\\\\|?\u0000*<\":>+\\[\\]/']", "_");
Kent's answer is good, but it doesn't include the NUL and \ characters.
This is also a safe way to replace characters in file names that come from user input, for example.
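For repeated use, the same character class can be compiled once and wrapped in a helper. This is a sketch only: the class name FileNameSanitizer is made up, and NUL is written as the regex escape \x00 instead of a Java Unicode escape, which should match the same character.

```java
import java.util.regex.Pattern;

class FileNameSanitizer {
    // Character class covering \ | ? NUL * < " : > + [ ] / ' (the reserved set from the question plus NUL).
    private static final Pattern RESERVED = Pattern.compile("[\\\\|?\\x00*<\":>+\\[\\]/']");

    // Replaces every reserved character with an underscore; spaces are left alone.
    static String sanitize(String name) {
        return RESERVED.matcher(name).replaceAll("_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("ArslanFolder 20/01/2013"));
    }
}
```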

How to extract query string from a URL of a web-page using java

From the following URL of the OathCallBack page, I want to extract access_token and token_type using Java. Any idea how to do it?
http://myserver.com/OathCallBack#state=/profile&access_token=ya29.AHES6ZQLqtYrPKuw2pMzURJtWuvINspm8-Vf5x-MZ5YzqVy5&token_type=Bearer&expires_in=3600
I tried the following, but I was unable to extract the required information.
{
String scheme = req.getScheme(); // http
String serverName = req.getServerName(); // myserver.com
int serverPort = req.getServerPort(); // 80
String contextPath = req.getContextPath();
String servletPath = req.getServletPath();
String pathInfo = req.getPathInfo(); // return null and exception
String queryString = req.getQueryString(); // return null
}
Edit: thank you, everyone, for the replies. This is how Google does it; see
http://developers.google.com/accounts/docs/OAuth2Login
That page contains the following link:
http://accounts.google.com/o/oauth2/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile&state=%2Fprofile&redirect_uri=https%3A%2F%2Foauth2-login-demo.appspot.com%2Foauthcallback&response_type=token&client_id=812741506391.apps.googleusercontent.com
When you click that link, you get the access_token for your Gmail account, and that token comes after the # sign.

Some characters cannot be part of a URL (for example, the space) and some other characters have a special meaning in a URL: for example, the character # can be used to further specify a subsection (or fragment) of a document; the character = is used to separate a name from a value.
See http://en.wikipedia.org/wiki/Query_string for more.

It looks like the '#' should be a '?'.
In a normal URL, the parameters are passed as key-value pairs following a '?', with multiple parameters chained together using '&'. A URL might look as follows:
http://someserver.com/somedir/somepage.html?param1=value1&param2=value2&param3=value3
Normally the Java servlet container returns everything after the '?' when you call getQueryString(), but because there is no '?' it returns null.
As @Sandeep Nair has suggested, getRequestURL() should return the full URL to you, and you could parse it with a regular expression to get the information you want. A possible regular expression would be along the lines of:
(?<=access_token=)[a-zA-Z0-9.-]*
However, getRequestURL() does NOT normally return the query string, so this method relies on there being a '#' rather than a '?' and is therefore probably not a great solution.
I would advise that you find out why you are getting a '#' instead of a '?' and try to get this changed. If you can, the servlet container will manage the URL parameters for you, and calls to request.getParameter("access_token") and request.getParameter("token_type") will return both values as strings.

You get the query string by calling
String queryString = req.getQueryString();
It correctly returns null in your case, as there is no query string. The characters after "#" are the fragment (anchor) part of the URL, which is only visible to the browser and is not sent to the server.
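If the complete URL, including the part after "#", is available as a string (for example because client-side code passed it on, since the servlet request itself will not contain it), the fragment can be split into key-value pairs with java.net.URI. The class name FragmentParams and the choice to return a Map are assumptions for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

class FragmentParams {
    // Splits the fragment of a URL ("#a=1&b=2") into name/value pairs, keeping their order.
    static Map<String, String> parse(String url) throws URISyntaxException {
        Map<String, String> params = new LinkedHashMap<>();
        String fragment = new URI(url).getFragment(); // null when the URL has no '#'
        if (fragment == null) {
            return params;
        }
        // Caveat: getFragment() returns the decoded fragment, so an escaped '&' or '='
        // inside a value would be split incorrectly by this simple approach.
        for (String pair : fragment.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }
}
```

With the URL from the question, parse(url).get("access_token") and parse(url).get("token_type") would give the two values being asked for.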

How to encode URL to avoid special characters in Java? [duplicate]

I need Java code to encode a URL so that special characters such as spaces, %, and & are escaped.

URL construction is tricky because different parts of the URL have different rules for what characters are allowed: for example, the plus sign is reserved in the query component of a URL because it represents a space, but in the path component of the URL, a plus sign has no special meaning and spaces are encoded as "%20".
RFC 2396 explains (in section 2.4.2) that a complete URL is always in its encoded form: you take the strings for the individual components (scheme, authority, path, etc.), encode each according to its own rules, and then combine them into the complete URL string. Trying to build a complete unencoded URL string and then encode it separately leads to subtle bugs, like spaces in the path being incorrectly changed to plus signs (which an RFC-compliant server will interpret as real plus signs, not encoded spaces).
In Java, the correct way to build a URL is with the URI class. Use one of the multi-argument constructors that takes the URL components as separate strings, and it'll escape each component correctly according to that component's rules. The toASCIIString() method gives you a properly-escaped and encoded string that you can send to a server. To decode a URL, construct a URI object using the single-string constructor and then use the accessor methods (such as getPath()) to retrieve the decoded components.
Don't use the URLEncoder class! Despite the name, that class actually does HTML form encoding, not URL encoding. It's not correct to concatenate unencoded strings to make an "unencoded" URL and then pass it through a URLEncoder. Doing so will result in problems (particularly the aforementioned one regarding spaces and plus signs in the path).
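A minimal sketch of the approach described above, using the five-argument java.net.URI constructor (scheme, authority, path, query, fragment). The host, path, and query values are placeholders, and the class name UriBuilding is made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriBuilding {
    // Builds an escaped URL string from unencoded components.
    static String build(String host, String path, String query) throws URISyntaxException {
        URI uri = new URI("http", host, path, query, null); // each component is quoted by its own rules
        return uri.toASCIIString();
    }

    // Recovers the decoded path from an already-encoded URL string.
    static String decodedPath(String encodedUrl) throws URISyntaxException {
        return new URI(encodedUrl).getPath();
    }

    public static void main(String[] args) throws URISyntaxException {
        String url = build("www.example.com", "/path with blanks/", "q=hello world");
        System.out.println(url);
        System.out.println(decodedPath(url));
    }
}
```

Note that spaces come out as %20 in both the path and the query here, not as plus signs.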

I also spent quite some time on this issue, so here is my solution:
String urlString2Decode = "http://www.test.com/äüö/path with blanks/";
String decodedURL = URLDecoder.decode(urlString2Decode, "UTF-8");
URL url = new URL(decodedURL);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
String decodedURLAsString = uri.toASCIIString();

If you don't want to do it manually, use the Apache Commons Codec library. The class you are looking for is org.apache.commons.codec.net.URLCodec:
final String url = "http://www.google.com?....";
final String urlSafe = new org.apache.commons.codec.net.URLCodec().encode(url);

Here is my solution, which is pretty easy:
Instead of encoding the URL itself, I encoded the parameter I was passing. The parameter is user input and may contain any unexpected string of special characters, so this worked fine for me :)
String review="User input"; /*USER INPUT AS STRING THAT WILL BE PASSED AS PARAMTER TO URL*/
try {
review = URLEncoder.encode(review,"utf-8");
review = review.replace(" " , "+");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
String URL = "www.test.com/test.php"+"?user_review="+review;

I would echo what Wyzard wrote but add that:
for query parameters, HTML encoding is often exactly what the server is expecting; outside these, it is correct that URLEncoder should not be used
the most recent URI spec is RFC 3986, so you should refer to that as a primary source
I wrote a blog post a while back about this subject: Java: safe character handling and URL building

What is the most efficient way to format UTF-8 strings in java?

I am doing the following:
String url = String.format(WEBSERVICE_WITH_CITYSTATE, cityName, stateName);
String urlUtf8 = new String(url.getBytes(), "UTF8");
Log.d(TAG, "URL: [" + urlUtf8 + "]");
Reader reader = WebService.queryApi(url);
The output that I am looking for is essentially to get the city name with blanks (e.g., "Overland Park") to be formatted as Overland%20Park.
Is this the best way?

Assuming you are actually wanting to encode your string for use in a URL (ie, "Overland Park" can also be formatted as "Overland+Park") you want URLEncoder.encode(url, "UTF-8"). Other unsafe characters will be converted to the %xx format you are asking for.

The simple answer is to use URLEncoder.encode(...) as stated by @Recurse. However, if part or all of the URL has already been encoded, then this can lead to double encoding. For example:
http://foo.com/pages/Hello%20There
or
http://foo.com/query?keyword=what%3f
Another concern with URLEncoder.encode(...) is that it doesn't understand that certain characters should be escaped in some contexts and not others. So for example, a '?' in a query parameter should be escaped, but the '?' that marks the start of the "query part" should not be escaped.
I think that a safer way to add missing escapes would be the following:
String safeURI = new URI(url).toASCIIString();
However, I haven't tested this ...
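One caveat with the single-string URI constructor is that it rejects raw spaces with a URISyntaxException, so it cannot add the missing escape for "Overland Park". The sketch below contrasts the three options; the class name CityUrl, the host example.com, and the /weather/ path are placeholders, not the real web service.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

class CityUrl {
    // Form encoding: a space becomes '+', which suits query parameter values.
    static String formEncoded(String city) throws UnsupportedEncodingException {
        return URLEncoder.encode(city, "UTF-8");
    }

    // Path encoding via the four-argument URI constructor (scheme, host, path, fragment): a space becomes %20.
    static String pathUrl(String host, String city) throws URISyntaxException {
        return new URI("http", host, "/weather/" + city, null).toASCIIString();
    }

    // True if the single-string constructor accepts the text as-is.
    static boolean parsesAsUri(String text) {
        try {
            new URI(text);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

So for the Overland%20Park form asked about in the question, building the URL from its components is one way to get it; formEncoded gives the Overland+Park form instead.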
