Get domain name from given url - java

Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can contain http/https. Here is the Java code that I wrote. Though it seems to work fine, is there a better approach, or are there edge cases where it could fail?
public static String getDomainName(String url) throws MalformedURLException{
if(!url.startsWith("http") && !url.startsWith("https")){
url = "http://" + url;
}
URL netUrl = new URL(url);
String host = netUrl.getHost();
if(host.startsWith("www")){
host = host.substring("www".length()+1);
}
return host;
}
Input: http://google.com/blah
Output: google.com

If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems -- its equals method does a DNS lookup which means code using it can be vulnerable to denial of service attacks when used with untrusted inputs.
"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.
public static String getDomainName(String url) throws URISyntaxException {
URI uri = new URI(url);
String domain = uri.getHost();
return domain.startsWith("www.") ? domain.substring(4) : domain;
}
should do what you want.
Though It seems to work fine, is there any better approach or are there some edge cases, that could fail.
Your code as written fails for the valid URLs:
httpfoo/bar -- relative URL with a path component that starts with http.
HTTP://example.com/ -- protocol is case-insensitive.
//example.com/ -- protocol relative URL with a host
www/foo -- a relative URL with a path component that starts with www
wwwexample.com -- domain name that does not start with www. (with the dot) but does start with www
Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.
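As a rough sketch of how java.net.URI copes with the inputs listed above: the class name DomainNames, the Optional return type, and the lower-casing of the host are my own assumptions, not part of the original answer. Relative references such as httpfoo/bar have no host at all, so the sketch reports "no domain" instead of throwing a NullPointerException.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: a null-safe variant of getDomainName built on java.net.URI.
final class DomainNames {
    static Optional<String> getDomainName(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                // Relative references (e.g. "httpfoo/bar", "www/foo") have no host.
                return Optional.empty();
            }
            // Host names are case-insensitive, so normalise before comparing.
            host = host.toLowerCase(Locale.ROOT);
            return Optional.of(host.startsWith("www.") ? host.substring(4) : host);
        } catch (URISyntaxException e) {
            // Input that java.net.URI rejects is treated as "no domain" here.
            return Optional.empty();
        }
    }
}
```

Whether an empty result or an exception is the better signal for unparseable input depends on the caller; the Optional is only one way to make the "no host" case explicit.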
If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy"
disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the
potential five components of a URI reference.
The following line is the regular expression for breaking-down a
well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis).
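A minimal sketch of applying the Appendix B expression from Java follows. The class name Rfc3986Splitter and its two methods are made up for illustration. Group 2 is the scheme and group 4 the authority, which may still include user info and a port.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the RFC 3986 Appendix B regular expression.
final class Rfc3986Splitter {
    private static final Pattern URI_REFERENCE = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /** Returns the scheme (group 2), or null when the reference has none. */
    static String scheme(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        return m.matches() ? m.group(2) : null;
    }

    /** Returns the authority (group 4), or null when the reference has none. */
    static String authority(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        return m.matches() ? m.group(4) : null;
    }
}
```

Unlike java.net.URI, this does no validation, so an input with a raw space in the path still yields an authority. That is the point when dealing with messy inputs, but it also means the result should be treated with care.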

import java.net.*;
import java.io.*;
public class ParseURL {
public static void main(String[] args) throws Exception {
URL aURL = new URL("http://example.com:80/docs/books/tutorial"
+ "/index.html?name=networking#DOWNLOADING");
System.out.println("protocol = " + aURL.getProtocol()); //http
System.out.println("authority = " + aURL.getAuthority()); //example.com:80
System.out.println("host = " + aURL.getHost()); //example.com
System.out.println("port = " + aURL.getPort()); //80
System.out.println("path = " + aURL.getPath()); // /docs/books/tutorial/index.html
System.out.println("query = " + aURL.getQuery()); //name=networking
System.out.println("filename = " + aURL.getFile()); // /docs/books/tutorial/index.html?name=networking
System.out.println("ref = " + aURL.getRef()); //DOWNLOADING
}
}

Here is a short and simple line using InternetDomainName.topPrivateDomain() in Guava: InternetDomainName.from(new URL(url).getHost()).topPrivateDomain().toString()
Given http://www.google.com/blah, that will give you google.com. Or, given http://www.google.co.mx, it will give you google.co.mx.
As Sa Qada commented in another answer on this post, this question has been asked earlier: Extract main domain name from a given url. The best answer to that question is from Satya, who suggests Guava's InternetDomainName.topPrivateDomain()
public boolean isTopPrivateDomain()
Indicates whether this domain name is composed of exactly one
subdomain component followed by a public suffix. For example, returns
true for google.com and foo.co.uk, but not for www.google.com or
co.uk.
Warning: A true result from this method does not imply that the
domain is at the highest level which is addressable as a host, as many
public suffixes are also addressable hosts. For example, the domain
bar.uk.com has a public suffix of uk.com, so it would return true from
this method. But uk.com is itself an addressable host.
This method can be used to determine whether a domain is probably the
highest level for which cookies may be set, though even that depends
on individual browsers' implementations of cookie controls. See RFC
2109 for details.
Putting that together with URL.getHost(), which the original post already contains, gives you:
import com.google.common.net.InternetDomainName;
import java.net.URL;
public class DomainNameMain {
public static void main(final String... args) throws Exception {
final String urlString = "http://www.google.com/blah";
final URL url = new URL(urlString);
final String host = url.getHost();
final InternetDomainName name = InternetDomainName.from(host).topPrivateDomain();
System.out.println(urlString);
System.out.println(host);
System.out.println(name);
}
}

I wrote a method (see below) which extracts a url's domain name and which uses simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there's no "://" contained) and the first subsequent "/" (or index String.length() if there's no subsequent "/"). The remaining, preceding "www(_)*." bit is chopped off. I'm sure there'll be cases where this won't be good enough but it should be good enough in most cases!
Mike Samuel's post above says that the java.net.URI class could do this (and was preferred to the java.net.URL class) but I encountered problems with the URI class. Notably, URI.getHost() gives a null value if the url does not include the scheme, i.e. the "http(s)" bit.
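To illustrate the null host: without a scheme, java.net.URI treats example.com/path as a relative path, so getHost() returns null. One possible workaround, sketched below, is to retry the input as a network-path reference by prefixing "//". The class and method names are hypothetical, and the retry assumes that a scheme-less input really does start with a host.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper showing the scheme-less case and one way around it.
final class SchemelessHost {
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null && !url.startsWith("//")) {
                // Assumption: the input starts with a host, so "//" + url
                // turns it into a network-path reference that has an authority.
                host = new URI("//" + url).getHost();
            }
            return host;
        } catch (URISyntaxException e) {
            return null; // not parseable even with the prefix
        }
    }
}
```

The retry guesses that the first path segment is a host, so a genuinely relative path such as docs/index.html would be misread as the host docs.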
/**
* Extracts the domain name from {@code url}
* by means of String manipulation
* rather than using the {@link URI} or {@link URL} class.
*
* @param url is non-null.
* @return the domain name within {@code url}.
*/
public String getUrlDomainName(String url) {
String domainName = new String(url);
int index = domainName.indexOf("://");
if (index != -1) {
// keep everything after the "://"
domainName = domainName.substring(index + 3);
}
index = domainName.indexOf('/');
if (index != -1) {
// keep everything before the '/'
domainName = domainName.substring(0, index);
}
// check for and remove a preceding 'www'
// followed by any sequence of characters (non-greedy)
// followed by a '.'
// from the beginning of the string
domainName = domainName.replaceFirst("^www.*?\\.", "");
return domainName;
}

I did a little preprocessing before creating the URI object:
if (url.startsWith("http:/")) {
if (!url.contains("http://")) {
url = url.replaceAll("http:/", "http://");
}
} else {
url = "http://" + url;
}
URI uri = new URI(url);
String domain = uri.getHost();
return domain.startsWith("www.") ? domain.substring(4) : domain;

In my case I only needed the main domain and not the subdomain (no "www" or whatever the subdomain is):
public static String getUrlDomain(String url) throws URISyntaxException {
URI uri = new URI(url);
String domain = uri.getHost();
String[] domainArray = domain.split("\\.");
if (domainArray.length == 1) {
return domainArray[0];
}
return domainArray[domainArray.length - 2] + "." + domainArray[domainArray.length - 1];
}
With this method the url "https://rest.webtoapp.io/llSlider?lg=en&t=8" will have for domain "webtoapp.io".

val host = url.split("/")[2]

All the above are good. This one seems really simple to me and easy to understand. Excuse the quotes. I wrote it for Groovy inside a class called DataCenter.
static String extractDomainName(String url) {
int start = url.indexOf('://')
if (start < 0) {
start = 0
} else {
start += 3
}
int end = url.indexOf('/', start)
if (end < 0) {
end = url.length()
}
String domainName = url.substring(start, end)
int port = domainName.indexOf(':')
if (port >= 0) {
domainName = domainName.substring(0, port)
}
domainName
}
And here are some junit4 tests:
@Test
void shouldFindDomainName() {
assert DataCenter.extractDomainName('http://example.com/path/') == 'example.com'
assert DataCenter.extractDomainName('http://subpart.example.com/path/') == 'subpart.example.com'
assert DataCenter.extractDomainName('http://example.com') == 'example.com'
assert DataCenter.extractDomainName('http://example.com:18445/path/') == 'example.com'
assert DataCenter.extractDomainName('example.com/path/') == 'example.com'
assert DataCenter.extractDomainName('example.com') == 'example.com'
}

Try this one, which uses java.net.URL:
JOptionPane.showMessageDialog(null, getDomainName(new URL("https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains")));
public String getDomainName(URL url){
String strDomain;
String[] strhost = url.getHost().split(Pattern.quote("."));
String[] strTLD = {"com","org","net","int","edu","gov","mil","arpa"};
if(Arrays.asList(strTLD).indexOf(strhost[strhost.length-1])>=0)
strDomain = strhost[strhost.length-2]+"."+strhost[strhost.length-1];
else if(strhost.length>2)
strDomain = strhost[strhost.length-3]+"."+strhost[strhost.length-2]+"."+strhost[strhost.length-1];
else
strDomain = strhost[strhost.length-2]+"."+strhost[strhost.length-1];
return strDomain;}

There is a similar question, Extract main domain name from a given url. If you take a look at this answer, you will see that it is very easy. You just need to use java.net.URL and the String utility split.

One approach that worked for all of my cases uses the Guava library and a regex in combination.
public static String getDomainNameWithGuava(String url) throws MalformedURLException,
URISyntaxException {
String host =new URL(url).getHost();
String domainName="";
try{
domainName = InternetDomainName.from(host).topPrivateDomain().toString();
}catch (IllegalStateException | IllegalArgumentException e){
domainName= getDomain(url,true);
}
return domainName;
}
getDomain() can be any common method with regex.

private static final String hostExtractorRegexString = "(?:https?://)?(?:www\\.)?(.+\\.)(com|au\\.uk|co\\.in|be|in|uk|org\\.in|org|net|edu|gov|mil)";
private static final Pattern hostExtractorRegexPattern = Pattern.compile(hostExtractorRegexString);
public static String getDomainName(String url){
if (url == null) return null;
url = url.trim();
Matcher m = hostExtractorRegexPattern.matcher(url);
if(m.find() && m.groupCount() == 2) {
return m.group(1) + m.group(2);
}
return null;
}
Explanation :
The regex has 4 groups. The first two are non-capturing groups and the next two are capturing groups.
The first non-capturing group is "http://" or "https://" or ""
The second non-capturing group is "www." or ""
The second capturing group is the top level domain
The first capturing group is anything after the non-capturing groups and before the top level domain
The concatenation of the two capturing groups will give us the domain/host name.
PS : Note that you can add any number of supported domains to the regex.

If the URL comes from user input, this method gives the most appropriate host name. If none is found, it gives back the input URL.
private String getHostName(String urlInput) {
urlInput = urlInput.toLowerCase();
String hostName=urlInput;
if(!urlInput.equals("")){
if(urlInput.startsWith("http") || urlInput.startsWith("https")){
try{
URL netUrl = new URL(urlInput);
String host= netUrl.getHost();
if(host.startsWith("www")){
hostName = host.substring("www".length()+1);
}else{
hostName=host;
}
}catch (MalformedURLException e){
hostName=urlInput;
}
}else if(urlInput.startsWith("www")){
hostName=urlInput.substring("www".length()+1);
}
return hostName;
}else{
return "";
}
}

To get the actual domain name, without the subdomain, I use:
private String getDomainName(String url) throws URISyntaxException {
String hostName = new URI(url).getHost();
if (!hostName.contains(".")) {
return hostName;
}
String[] host = hostName.split("\\.");
return host[host.length - 2];
}
Note that this won't work with second-level domains (like .co.uk).
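To make the second-level-domain caveat concrete, here is a small sketch that keeps three labels when the last two form a known multi-label suffix. The class name and the tiny suffix set are assumptions for illustration only. Real code should consult the full Public Suffix List (for example through Guava's InternetDomainName, as suggested in another answer) rather than a hand-written set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: "last two labels" plus a tiny sample of multi-label suffixes.
final class RegistrableDomainSketch {
    // Illustrative sample only; the real Public Suffix List is far longer.
    private static final Set<String> MULTI_LABEL_SUFFIXES =
            new HashSet<>(Arrays.asList("co.uk", "org.uk", "com.au", "co.in"));

    static String registrableDomain(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] labels = lower.split("\\.");
        int n = labels.length;
        if (n <= 2) {
            return lower; // e.g. "localhost" or "example.com"
        }
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        if (MULTI_LABEL_SUFFIXES.contains(lastTwo)) {
            // The last two labels are a suffix, so the domain needs a third label.
            return labels[n - 3] + "." + lastTwo;
        }
        return lastTwo;
    }
}
```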

// groovy
def hostname = { url -> url[(url.indexOf('://') + 3)..-1].split('/')[0] }
hostname('http://hello.world.com/something') // return 'hello.world.com'
hostname('docker://quay.io/skopeo/stable') // return 'quay.io'

const val WWW = "www."
fun URL.domain(): String {
val domain: String = this.host
return if (domain.startsWith(ConstUtils.WWW)) {
domain.substring(ConstUtils.WWW.length)
} else {
domain
}
}

I use a regex solution:
public static String getDomainName(String url) {
return url.replaceAll("http(s)?://|www\\.|wap\\.|/.*", "");
}
It cleans url from "http/https/www./wap." and from all unnecessary things after / like "/questions" in "https://stackoverflow.com/questions" and we get just "stackoverflow.com"
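One thing to watch with the unanchored pattern is that "www\\." and "wap\\." are removed wherever they occur, so https://awww.example.com/x becomes aexample.com. A hedged variant that only strips them at the start of the host might look like this. The class name and the two pattern constants are my own, and the port, if any, is left in place.

```java
import java.util.regex.Pattern;

// Hypothetical anchored variant of the replaceAll approach.
final class RegexDomain {
    // Optional scheme, then an optional leading "www." or "wap.", only at the start.
    private static final Pattern PREFIX =
            Pattern.compile("^(?:https?://)?(?:www\\.|wap\\.)?", Pattern.CASE_INSENSITIVE);
    // Everything from the first '/', '?' or '#' to the end of the string.
    private static final Pattern TAIL = Pattern.compile("[/?#].*$");

    static String getDomainName(String url) {
        String withoutPrefix = PREFIX.matcher(url).replaceFirst("");
        return TAIL.matcher(withoutPrefix).replaceFirst("");
    }
}
```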

Related

Using Selenium and JUnit to parse HTML Document for links

NullPointerException at if (hrefAttr.contains("?"))
I'm running into a problem. I'm using selenium and JUnit to parse through links and compare them to a list of links provided from a CSV file.
Everything was going well until I realized that I have to test the URLs and the query strings separately. I attempted to create an if statement saying that if the href attribute contained a "?" split the entire URL into an array containing two strings. The URL destination being the first string indexed and the query string being the second string indexed. and return the URL destination and append it to an ID. If there was no "?" in the URL string, just return the URL string and append it to an ID
I think the logic looks accurate, but I keep getting a NullPointerException at line 76 (where the hrefAttr.contains("?") condition is located). Code below:
public static ArrayList<String> getURLSFromHTML(WebDriver driver) {
// prepares variable for array of html link URLs
ArrayList <String> pageLinksList = new ArrayList<String>();
// prepares array to place all of the <a></a> tags found in the HTML
List <WebElement> aElements = driver.findElements(By.tagName("a"));
// loops through all the <a></a> tags found in the HTML
for (WebElement aElement : aElements) {
/*
* grabs the href attribute value and stores it into a variable
* grabs the QA_ID attribute value and stores it in a variable
* concatenates the QA_ID value with the href value and stores them in a variable
*/
String hrefAttr = aElement.getAttribute("href");
String QA_ID = aElement.getAttribute("QA_ID");
String linkConcat;
if (hrefAttr.contains("?")) {
String[] splitHref = hrefAttr.split("\\?");
String URL = splitHref[0];
linkConcat = QA_ID + "_" + URL;
} else {
linkConcat = QA_ID + "_" + hrefAttr;
}
String urlIgnoreAttr = aElement.getAttribute("URL_ignore");
String combIgnore = QA_ID + "_" + urlIgnoreAttr;
String combIgnoreVal = "ignore";
/*
* if the QA_ID is not null then add value to pageLinksList
* if URL_ignore attribute="ignore" in html, then add combIgnore value to pageLinksList
* else add linkConcat to pageLinksList
*/
if(!Objects.isNull(QA_ID)) {
if (Objects.equals(urlIgnoreAttr, combIgnoreVal)) {
pageLinksList.add(combIgnore);
}else {
pageLinksList.add(linkConcat);
}
}
}
System.out.println(pageLinksList);
return pageLinksList;
}
Please help!
The obvious solution is to check for null:
if (hrefAttr != null && hrefAttr.contains("?")) {
String[] splitHref = hrefAttr.split("\\?");
String URL = splitHref[0];
linkConcat = QA_ID + "_" + URL;
} else {
linkConcat = QA_ID + "_" + hrefAttr;
}
An anchor tag without an href attribute can still be valid. Without the HTML source we cannot explain why the href attributes are missing. The else branch will not throw an NPE, but its result may be useless when hrefAttr == null.
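If the "null" string in the result is a concern, one option is to move the split into a small null-safe helper and let the caller skip anchors that have no href. This is only a sketch; HrefUtil and stripQuery are hypothetical names, not part of Selenium.

```java
// Hypothetical helper: returns the URL without its query string, or null for a missing href.
final class HrefUtil {
    static String stripQuery(String href) {
        if (href == null) {
            return null; // anchor had no href attribute; caller decides what to do
        }
        int queryStart = href.indexOf('?');
        return queryStart >= 0 ? href.substring(0, queryStart) : href;
    }
}
```

In the loop, a null result could then be used to continue to the next anchor instead of adding an entry like QA_ID + "_null" to the list.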

How to check if the subdomain is also from same domain using java

I have a list of URLs and I need to filter them by a specific domain and its subdomains. Say I have some domains like
http://www.example.com
http://test.example.com
http://test2.example.com
I need to extract the URLs that belong to the domain example.com.
I was working on a project that required me to determine if two URLs are from the same subdomain (even when there are nested domains). I worked up a modification from the guide above. This has held up pretty well thus far:
public static boolean isOneSubdomainOfTheOther(String a, String b) {
try {
URL first = new URL(a);
String firstHost = first.getHost();
firstHost = firstHost.startsWith("www.") ? firstHost.substring(4) : firstHost;
URL second = new URL(b);
String secondHost = second.getHost();
secondHost = secondHost.startsWith("www.") ? secondHost.substring(4) : secondHost;
/*
Test if one is a substring of the other
*/
if (firstHost.contains(secondHost) || secondHost.contains(firstHost)) {
String[] firstPieces = firstHost.split("\\.");
String[] secondPieces = secondHost.split("\\.");
String[] longerHost = {""};
String[] shorterHost = {""};
if (firstPieces.length >= secondPieces.length) {
longerHost = firstPieces;
shorterHost = secondPieces;
} else {
longerHost = secondPieces;
shorterHost = firstPieces;
}
//int longLength = longURL.length;
int minLength = shorterHost.length;
int i = 1;
/*
Compare from the tail of both host and work backwards
*/
while (minLength > 0) {
String tail1 = longerHost[longerHost.length - i];
String tail2 = shorterHost[shorterHost.length - i];
if (tail1.equalsIgnoreCase(tail2)) {
//move up one place to the left
minLength--;
} else {
//domains do not match
return false;
}
i++;
}
if (minLength == 0) //shorter host exhausted. Is a sub domain
return true;
}
} catch (MalformedURLException ex) {
ex.printStackTrace();
}
return false;
}
Figure I'd leave it here for future reference of a similar problem.
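For the simpler case in the question, where the target domain is already known, a shorter sketch of the same tail comparison is to check whether the host equals the domain or ends with "." plus the domain. The class and method names below are assumptions. It uses java.net.URI for parsing and treats unparseable input as not matching.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: does the URL's host equal the domain or sit underneath it?
final class SubdomainCheck {
    static boolean belongsTo(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            String h = host.toLowerCase(Locale.ROOT);
            String d = domain.toLowerCase(Locale.ROOT);
            // The leading dot stops "notexample.com" from matching "example.com".
            return h.equals(d) || h.endsWith("." + d);
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```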
I understand you are probably looking for a fancy solution using URL class or something but it is not required. Simply think of a way to extract "example.com" from each of the urls.
Note: example.com is essentially a different domain than say example.net. Thus extracting just "example" is technically the wrong thing to do.
We can divide a sample url say:
http://sub.example.com/page1.html
Step 1: Split the url with delimiter " / " to extract the part containing the domain.
Each such part may be looked at in form of the following blocks (which may be empty)
[www][subdomain][basedomain]
Step 2: Discard "www" (if present). We are left with [subdomain][basedomain]
Step 3: Split the string with delimiter " . "
Step 4: Find the total number of strings generated from the split. If there are 2 strings, both of them are the target domain (example and com). If there are >=3 strings, get the last 3 strings. If the length of last string is 3, then the last 2 strings comprise the domain (example and com). If the length of last string is 2, then the last 3 strings comprise the domain (example and co and uk)
I think this should do the trick (I do hope this wasn't a homework :D )
//You may clean this method to make it more optimum / better
private String getRootDomain(String url){
String[] domainKeys = url.split("/")[2].split("\\.");
int length = domainKeys.length;
int dummy = domainKeys[0].equals("www")?1:0;
if(length-dummy == 2)
return domainKeys[length-2] + "." + domainKeys[length-1];
else{
if(domainKeys[length-1].length() == 2) {
return domainKeys[length-3] + "." + domainKeys[length-2] + "." + domainKeys[length-1];
}
else{
return domainKeys[length-2] + "." + domainKeys[length-1];
}
}
}

How to combine 2 java methods into one efficiently

I'm trying to create a validate java class that receives 4 inputs from an object passed as 1 from the requester. The class needs to convert float inputs to string and evaluate each input to meet a certain format and then throw exceptions complete with error message and code when it fails.
What I have is in two methods, and I would like to know if there is a better way to combine these two methods into one validate method for the main class to call. I don't seem to be able to get around using the pattern/matcher concept to ensure the inputs are formatted correctly. Any help you can give would be very much appreciated.
public class Validator {
private static final String MoneyPattern ="^\\d{1,7}(\\.\\d{1,2})$" ;
private static final String PercentagePattern = "^\\d{1,3}\\.\\d{1,2}$";
private static final String CalendarYearPattern = "^20[1-9][0-9]$";
private int errorcode = 0;
private String errormessage = null;
public Validator(MyInput input){
}
private boolean verifyInput(){
String Percentage = ((Float) input.getPercentage().toString();
String Income = ((Float) input.getIncome().toString();
String PublicPlan = ((Float) input.getPublicPlan().toString();
String Year = ((Float) input.getYear();
try {
if (!doesMatch(Income, MoneyPattern)) {
errormessage = errormessage + "income,";
}
if (!doesMatch(PublicPlan, MoneyPattern)) {
errormessage = errormessage + "insurance plan,";
}
if (!doesMatch(Percentage, PercentagePattern)) {
errormessage = errormessage + "Percentage Plan,";
}
if (!doesMatch(Year, CalendarYearPattern)) {
errormessage = errormessage + "Year,";
}
} catch (Exception e){
errorcode = 111;
errormessage = e.getMessage();
}
}
private boolean doesMatch(String s, String pattern) throws Exception{
try {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(s);
if (!s.equals("")){
if(m.find()){
return true;
} else {
return false;
}
}else {
return false;
}
} catch (PatternSyntaxException pse){
errorcode = 111;
errormessage = pse.getMessage();
}
}
}
This code is borked from the word "go". You have a constructor into which you pass a MyInput reference, but there's no code in the ctor and no private data member to receive it. It looks like you expect to use input in your verifyInput() method, but it's a NullPointerException waiting to happen.
Your code doesn't follow the Sun Java coding standards; variable names should be lower case.
Why you wouldn't do that input validation in the ctor, when you actually receive the value, is beyond me. Perhaps you really meant to pass it into that verifyInput() method.
I would worry about correctness and readability before concerning myself with efficiency.
I'd have methods like this:
public boolean isValidMoney(String money) {
// put the regex here
}
public boolean isValidYear(String year) {
// the regex here
}
I think I'd prefer a real Money class to a String. There's no abstraction whatsoever.
Here's one bit of honesty:
private static final String CalendarYearPattern = "^20[1-9][0-9]$";
I guess you either don't think this code will still be running in the 22nd century or you won't be here to maintain it.
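A minimal sketch of what those two methods could look like, reusing the patterns from the question unchanged. The class name InputValidator and the null handling are my assumptions. Note that the money pattern, as written in the question, requires a decimal part.

```java
import java.util.regex.Pattern;

// Hypothetical validator with one small method per rule and precompiled patterns.
final class InputValidator {
    private static final Pattern MONEY = Pattern.compile("^\\d{1,7}(\\.\\d{1,2})$");
    private static final Pattern CALENDAR_YEAR = Pattern.compile("^20[1-9][0-9]$");

    static boolean isValidMoney(String money) {
        return money != null && MONEY.matcher(money).matches();
    }

    static boolean isValidYear(String year) {
        return year != null && CALENDAR_YEAR.matcher(year).matches();
    }
}
```

Each rule can then be tested on its own, and a combined validate method only has to collect the names of the fields whose check returned false.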
One way of doing this would be with DynamicBeans.
package com.acme.validator;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.beanutils.PropertyUtils;
public class Validator {
//A simple optimisation of the pattern
private static final Pattern MoneyPattern = Pattern.compile("^\\d{1,7}(\\.\\d{1,2})$");
private static final Pattern PercentagePattern = Pattern.compile("^\\d{1,3}\\.\\d{1,2}$");
private static final Pattern CalendarYearPattern = Pattern.compile("^20[1-9][0-9]$");
public String validate(MyInput input) {
String errormessage = "";
/*
* Setting these up as Maps.
* Ideally this would be a 'simple bean'
* but that goes beyond the scope of the
* original question
*/
Map<String,Pattern> patternMap = new HashMap<String,Pattern>();
patternMap.put("percentage", PercentagePattern);
patternMap.put("publicPlan", MoneyPattern);
patternMap.put("income", MoneyPattern);
patternMap.put("year", CalendarYearPattern);
Map<String,String> errorMap = new HashMap<String,String>();
errorMap.put("percentage", "Percentage Plan,");
errorMap.put("publicPlan", "insurance plan,");
errorMap.put("income", "income,");
errorMap.put("year", "Year,");
for (String key : patternMap.keySet()) {
try {
String match = ((Float) PropertyUtils.getSimpleProperty(input, key)).toString();
Matcher m = patternMap.get(key).matcher(match);
if ("".equals(match) || !m.find()) {
errormessage = errormessage + errorMap.get(key);
}
} catch (Exception e) {
errormessage = e.getMessage(); //since getMessage() could be null, you need to work out some way of handling this in the response
//don't know the point of the error code so remove this altogether
break; //Assume an exception trumps any validation failure
}
}
return errormessage;
}
}
I've made a few assumptions about the validation rules (for simplicity used 2 maps but you could also use a single map and a bean containing both the Pattern and the Message and even the 'error code' if that is important).
The key 'flaw' in your original setup and what would hamper the solution above, is that you are using 'year' as Float in the input bean.
(new Float(2012)).toString()
The above returns "2012.0". This will always fail your pattern. When you start messing about with the different types of objects potentially in the input bean, you may need to consider ensuring they are String at the time of creating the input bean and not, as is the case here, when they are retrieved.
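A quick way to see the problem, and one possible way around it if the year has to stay a float in the bean, is sketched here. YearText and its methods are hypothetical names. Rounding to an int before converting drops the ".0" that Float.toString adds.

```java
import java.util.regex.Pattern;

// Hypothetical demo of why a Float year fails the pattern and how rounding helps.
final class YearText {
    private static final Pattern CALENDAR_YEAR = Pattern.compile("^20[1-9][0-9]$");

    /** Renders a float that holds a whole number without the trailing ".0". */
    static String wholeNumberText(float value) {
        return Integer.toString(Math.round(value));
    }

    static boolean isCalendarYear(String text) {
        return CALENDAR_YEAR.matcher(text).matches();
    }
}
```

Storing the year as an int or a String in the input bean, as suggested above, avoids the conversion altogether.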
Good Luck with the rest of your Java experience.

How to parse a cookie string

I would like to take a Cookie string (as it might be returned in a Set-Cookie header) and be able to easily modify parts of it, specifically the expiration date.
I see there are several different Cookie classes, such as BasicClientCookie, available but I don't see any easy way to parse the string into one of those objects.
I see in api level 9 they added HttpCookie which has a parse method, but I need something to work in previous versions.
Any ideas?
Thanks
How about java.net.HttpCookie:
List<HttpCookie> cookies = HttpCookie.parse(header);
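A small sketch of how that could be used to change the expiration, assuming a platform where java.net.HttpCookie is available. The class name, method name and sample header are made up for illustration. HttpCookie models expiry as a max-age in seconds, so that is what gets changed.

```java
import java.net.HttpCookie;
import java.util.List;

// Hypothetical helper: parse one Set-Cookie value and override its lifetime.
final class CookieExpiry {
    static HttpCookie withMaxAge(String header, long maxAgeSeconds) {
        List<HttpCookie> cookies = HttpCookie.parse(header);
        HttpCookie cookie = cookies.get(0); // assumes the header holds one cookie
        cookie.setMaxAge(maxAgeSeconds);    // lifetime in seconds from now
        return cookie;
    }
}
```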
I believe you'll have to parse it out manually. Try this:
BasicClientCookie parseRawCookie(String rawCookie) throws Exception {
String[] rawCookieParams = rawCookie.split(";");
String[] rawCookieNameAndValue = rawCookieParams[0].split("=");
if (rawCookieNameAndValue.length != 2) {
throw new Exception("Invalid cookie: missing name and value.");
}
String cookieName = rawCookieNameAndValue[0].trim();
String cookieValue = rawCookieNameAndValue[1].trim();
BasicClientCookie cookie = new BasicClientCookie(cookieName, cookieValue);
for (int i = 1; i < rawCookieParams.length; i++) {
String rawCookieParamNameAndValue[] = rawCookieParams[i].trim().split("=");
String paramName = rawCookieParamNameAndValue[0].trim();
if (paramName.equalsIgnoreCase("secure")) {
cookie.setSecure(true);
} else {
if (rawCookieParamNameAndValue.length != 2) {
throw new Exception("Invalid cookie: attribute not a flag or missing value.");
}
String paramValue = rawCookieParamNameAndValue[1].trim();
if (paramName.equalsIgnoreCase("expires")) {
// getDateTimeInstance needs both a date style and a time style
Date expiryDate = DateFormat.getDateTimeInstance(DateFormat.FULL, DateFormat.FULL)
.parse(paramValue);
cookie.setExpiryDate(expiryDate);
} else if (paramName.equalsIgnoreCase("max-age")) {
// Max-Age is given in seconds, so convert it to milliseconds
long maxAge = Long.parseLong(paramValue);
Date expiryDate = new Date(System.currentTimeMillis() + maxAge * 1000L);
cookie.setExpiryDate(expiryDate);
} else if (paramName.equalsIgnoreCase("domain")) {
cookie.setDomain(paramValue);
} else if (paramName.equalsIgnoreCase("path")) {
cookie.setPath(paramValue);
} else if (paramName.equalsIgnoreCase("comment")) {
cookie.setComment(paramValue);
} else {
throw new Exception("Invalid cookie: invalid attribute name.");
}
}
}
return cookie;
}
I haven't actually compiled or run this code, but it should be a strong start. You'll probably have to mess with the date parsing a bit: I'm not sure that the date format used in cookies is actually the same as DateFormat.FULL. (Check out this related question, which addresses handling the date format in cookies.) Also, note that there are some cookie attributes not handled by BasicClientCookie such as version and httponly.
Finally, this code assumes that the name and value of the cookie appear as the first attribute: I'm not sure if that's necessarily true, but that's how every cookie I've ever seen is ordered.
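On the date parsing: a hedged sketch for one commonly seen cookie date layout, such as "Sat, 30-Jul-2011 01:22:34 GMT", is below. The pattern string is my assumption about the format. Servers emit several variants, so real code may need to try more than one pattern. CookieDates and parseExpires are hypothetical names.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper for one cookie "expires" layout; other layouts exist.
final class CookieDates {
    static Date parseExpires(String value) {
        // Locale.US so that day and month names are parsed as English text.
        SimpleDateFormat format =
                new SimpleDateFormat("EEE, dd-MMM-yyyy HH:mm:ss zzz", Locale.US);
        try {
            return format.parse(value);
        } catch (ParseException e) {
            return null; // caller can fall back to another pattern
        }
    }
}
```

A new SimpleDateFormat is created per call because the class is not thread-safe.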
You can use Apache HttpClient's facilities for that.
Here's an excerpt from CookieJar:
CookieSpec cookieSpec = new BrowserCompatSpec();
List<Cookie> parseCookies(URI uri, List<String> cookieHeaders) {
ArrayList<Cookie> cookies = new ArrayList<Cookie>();
int port = (uri.getPort() < 0) ? 80 : uri.getPort();
boolean secure = "https".equals(uri.getScheme());
CookieOrigin origin = new CookieOrigin(uri.getHost(), port,
uri.getPath(), secure);
for (String cookieHeader : cookieHeaders) {
BasicHeader header = new BasicHeader(SM.SET_COOKIE, cookieHeader);
try {
cookies.addAll(cookieSpec.parse(header, origin));
} catch (MalformedCookieException e) {
L.d(e);
}
}
return cookies;
}
Funnily enough, the java.net.HttpCookie class cannot parse the domain and/or path parts of cookie strings that this very class has converted to strings.
For example:
HttpCookie newCookie = new HttpCookie("cookieName", "cookieValue");
newCookie.setDomain("cookieDomain.com");
newCookie.setPath("/");
As this class implements neither Serializable nor Parcelable, it's tempting to store cookies as strings. So you write something like:
saveMyCookieAsString(newCookie.toString());
This statement will save the cookie in the following format:
cookieName="cookieValue";$Path="/";$Domain="cookiedomain.com"
And then you want to restore this cookie, so you get the string:
String cookieAsString = restoreMyCookieString();
and try to parse it:
List<HttpCookie> cookiesList = HttpCookie.parse(cookieAsString);
StringBuilder myCookieAsStringNow = new StringBuilder();
for(HttpCookie httpCookie: cookiesList) {
myCookieAsStringNow.append(httpCookie.toString());
}
now myCookieAsStringNow.toString(); produces
cookieName=cookieValue
Domain and path parts are just gone. The reason: parse method is case sensitive to words like "domain" and "path". Possible workaround: provide another toString() method like:
public static String httpCookieToString(HttpCookie httpCookie) {
StringBuilder result = new StringBuilder()
.append(httpCookie.getName())
.append("=")
.append("\"")
.append(httpCookie.getValue())
.append("\"");
if(!TextUtils.isEmpty(httpCookie.getDomain())) {
result.append("; domain=")
.append(httpCookie.getDomain());
}
if(!TextUtils.isEmpty(httpCookie.getPath())){
result.append("; path=")
.append(httpCookie.getPath());
}
return result.toString();
}
I find it funny (especially, for classes like java.net.HttpCookie which are aimed to be used by a lot and lot of people) and I hope it will be useful for someone.
With a regular expression like :
([^=]+)=([^\;]+);\s?
you can parse a cookie like this :
.COOKIEAUTH=5DEF0BF530F749AD46F652BDF31C372526A42FEB9D40162167CB39C4D43FC8AF1C4B6DF0C24ECB1945DFF7952C70FDA1E4AF12C1803F9D089E78348C4B41802279897807F85905D6B6D2D42896BA2A267E9F564814631B4B31EE41A483C886B14B5A1E76FD264FB230E87877CB9A4A2A7BDB0B0101BC2C1AF3A029CC54EE4FBC;
expires=Sat, 30-Jul-2011 01:22:34 GMT;
path=/; HttpOnly
in a few lines of code.
If you happen to have Netty HTTP codec installed, you can also use io.netty.handler.codec.http.cookie.ServerCookieDecoder.LAX|STRICT. Very convenient.
The advantage of Yanchenko's approach with Apache HttpClient is that it validates the cookies against the spec, based on the origin. The regular expression approach won't do that, but perhaps you don't need it to.
public class CookieUtil {
public List<Cookie> parseCookieString(String cookies) {
List<Cookie> cookieList = new ArrayList<Cookie>();
Pattern cookiePattern = Pattern.compile("([^=]+)=([^\\;]*);?\\s?");
Matcher matcher = cookiePattern.matcher(cookies);
while (matcher.find()) {
int groupCount = matcher.groupCount();
System.out.println("matched: " + matcher.group(0));
for (int groupIndex = 0; groupIndex <= groupCount; ++groupIndex) {
System.out.println("group[" + groupIndex + "]=" + matcher.group(groupIndex));
}
String cookieKey = matcher.group(1);
String cookieValue = matcher.group(2);
Cookie cookie = new BasicClientCookie(cookieKey, cookieValue);
cookieList.add(cookie);
}
return cookieList;
}
}
I've attached a small example using yanchenko's regex. It needs to be tweaked just a little. Without the '?' quantifier on the trailing ';', the last attribute of any cookie will not be matched. After that, if you care about the other attributes, you can use Doug's code, properly encapsulated, to parse the other match groups.
Edit: Also, note the '*' quantifier on the value of the cookie itself. Values are optional and you can get cookies like "de=", i.e. no value at all. Looking at the regex again, I don't think it will handle the secure and discard cookie attributes, which do not have an '='.
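To illustrate the last point, here is a minimal sketch that splits on ';' instead of using the regex, so attributes without an '=' are still picked up. The class name SetCookieSketch and the method parse are hypothetical, and the sketch assumes no value contains a ';' (quoted values are not handled).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of any library.
final class SetCookieSketch {

    /**
     * Splits a Set-Cookie style string on ';' and returns the parts in order.
     * The first entry is the cookie's own name/value pair. Attributes without
     * an '=' (Secure, HttpOnly, Discard) are stored with an empty value.
     * Attribute names are lower-cased; the cookie name is kept as written.
     */
    static Map<String, String> parse(String header) {
        Map<String, String> parts = new LinkedHashMap<>();
        boolean first = true;
        for (String piece : header.split(";")) {
            String trimmed = piece.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing ';'
            }
            int eq = trimmed.indexOf('=');
            String name = eq < 0 ? trimmed : trimmed.substring(0, eq).trim();
            String value = eq < 0 ? "" : trimmed.substring(eq + 1).trim();
            parts.put(first ? name : name.toLowerCase(Locale.ROOT), value);
            first = false;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("de=; expires=Sat, 30-Jul-2011 01:22:34 GMT; path=/; HttpOnly"));
    }
}
```

The empty value "de=" and the flag attribute HttpOnly both end up in the map, which is what the regex above misses.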
If you want to parse to javax.servlet.http.Cookie, you can first parse to java.net.HttpCookie and then convert to Cookie. In theory the two may be incompatible because of the cookie version.
HttpCookie httpCookie = ...
Cookie cookie = toServletCookie(httpCookie);
private static boolean isNotEmpty(String str) {
return !(str == null || str.trim().isEmpty());
}
public static Cookie toServletCookie(HttpCookie httpCookie) {
Cookie cookie = new Cookie(httpCookie.getName(), httpCookie.getValue());
if (isNotEmpty(httpCookie.getDomain())) {
cookie.setDomain(httpCookie.getDomain());
}
if (isNotEmpty(httpCookie.getPath())) {
cookie.setPath(httpCookie.getPath());
}
cookie.setHttpOnly(httpCookie.isHttpOnly());
cookie.setSecure(httpCookie.getSecure());
if (isNotEmpty(httpCookie.getComment())) {
cookie.setComment(httpCookie.getComment());
}
cookie.setMaxAge((int) Math.min(httpCookie.getMaxAge(), Integer.MAX_VALUE));
return cookie;
}
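The snippet above leaves open how the HttpCookie is obtained. One option from the standard library is HttpCookie.parse, sketched below. The class name HttpCookieParseSketch, the method parseFirst and the sample header are assumptions made for illustration.

```java
import java.net.HttpCookie;
import java.util.List;

// Hypothetical wrapper for illustration only.
final class HttpCookieParseSketch {

    /** Parses a Set-Cookie header value and returns the first cookie found. */
    static HttpCookie parseFirst(String setCookieHeader) {
        // HttpCookie.parse throws IllegalArgumentException on malformed input.
        List<HttpCookie> cookies = HttpCookie.parse(setCookieHeader);
        return cookies.get(0);
    }

    public static void main(String[] args) {
        HttpCookie cookie = parseFirst("lang=en; Path=/; HttpOnly");
        System.out.println(cookie.getName() + "=" + cookie.getValue()
                + " path=" + cookie.getPath()
                + " httpOnly=" + cookie.isHttpOnly());
    }
}
```

The resulting HttpCookie can then be passed to toServletCookie as shown above.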
CookieManager cookieManager = new CookieManager();
CookieHandler.setDefault(cookieManager);
HttpCookie cookie = new HttpCookie("lang", "en");
cookie.setDomain("Your URL");
cookie.setPath("/");
cookie.setVersion(0);
cookieManager.getCookieStore().add(new URI("https://Your URL/"), cookie);
List<HttpCookie> Cookies = cookieManager.getCookieStore().get(new URI("https://Your URL/"));
String s = Cookies.get(0).getValue();
val headers = ..........
val headerBuilder = Headers.Builder()
headers?.forEach {
val values = it.split(";")
values.forEach { v ->
if (v.contains("=")) {
headerBuilder.add(v.replace("=", ":"))
}
}
}
val builtHeaders = headerBuilder.build()

How to normalize a URL in Java?

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.
Strategies include adding trailing slashes, https => http, etc. The Wikipedia page lists many.
Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller and fewer dependencies is better.
I'll handcode something for now and keep an eye on this question.
EDIT: I want to aggressively normalize to count URLs as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, utm_campaign. For example, I ignore subdomain if the title is the same.
Have you taken a look at the URI class?
http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
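As a small illustration of what URI.normalize() covers, the sketch below runs it on a URL with "." and ".." path segments. The class and method names are hypothetical. Note that only the path is touched; the host keeps its original case.

```java
import java.net.URI;

// Hypothetical demo class for illustration only.
final class UriNormalizeSketch {

    /** Removes "." and ".." segments from the path; nothing else changes. */
    static String normalizePath(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // prints http://ACME.com/a/c?x=1
        System.out.println(normalizePath("http://ACME.com/a/./b/../c?x=1"));
    }
}
```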
I found this question last night, but none of the answers was what I was looking for, so I wrote my own. Here it is in case somebody wants it in the future:
/**
* - Convert the scheme and host to lowercase (done by java.net.URL)
* - Normalize the path (done by java.net.URI)
* - Add the port number (unless it is unset or the default port 80).
* - Remove the fragment (the part after the #).
* - Remove trailing slash.
* - Sort the query string params.
* - Remove some query string params like "utm_*" and "*session*".
*/
public class NormalizeURL
{
public static String normalize(final String taintedURL) throws MalformedURLException
{
final URL url;
try
{
url = new URI(taintedURL).normalize().toURL();
}
catch (URISyntaxException e) {
throw new MalformedURLException(e.getMessage());
}
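// String.replace treats "/$" literally; replaceAll is needed for the regex to strip a trailing slash.
final String path = url.getPath().replaceAll("/$", "");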
final SortedMap<String, String> params = createParameterMap(url.getQuery());
final int port = url.getPort();
final String queryString;
if (params != null)
{
// Some params are only relevant for user tracking, so remove the most commons ones.
for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
{
final String key = i.next();
if (key.startsWith("utm_") || key.contains("session"))
{
i.remove();
}
}
// Avoid a dangling "?" when every parameter was filtered out.
queryString = params.isEmpty() ? "" : "?" + canonicalize(params);
}
else
{
queryString = "";
}
return url.getProtocol() + "://" + url.getHost()
+ (port != -1 && port != 80 ? ":" + port : "")
+ path + queryString;
}
/**
* Takes a query string, separates the constituent name-value pairs, and
* stores them in a SortedMap ordered by lexicographical order.
* @return Null if there is no query string.
*/
private static SortedMap<String, String> createParameterMap(final String queryString)
{
if (queryString == null || queryString.isEmpty())
{
return null;
}
final String[] pairs = queryString.split("&");
final Map<String, String> params = new HashMap<String, String>(pairs.length);
for (final String pair : pairs)
{
if (pair.length() < 1)
{
continue;
}
String[] tokens = pair.split("=", 2);
for (int j = 0; j < tokens.length; j++)
{
try
{
tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
}
catch (UnsupportedEncodingException ex)
{
ex.printStackTrace();
}
}
switch (tokens.length)
{
case 1:
{
if (pair.charAt(0) == '=')
{
params.put("", tokens[0]);
}
else
{
params.put(tokens[0], "");
}
break;
}
case 2:
{
params.put(tokens[0], tokens[1]);
break;
}
}
}
return new TreeMap<String, String>(params);
}
/**
* Canonicalize the query string.
*
* @param sortedParamMap Parameter name-value pairs in lexicographical order.
* @return Canonical form of query string.
*/
private static String canonicalize(final SortedMap<String, String> sortedParamMap)
{
if (sortedParamMap == null || sortedParamMap.isEmpty())
{
return "";
}
final StringBuffer sb = new StringBuffer(350);
final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();
while (iter.hasNext())
{
final Map.Entry<String, String> pair = iter.next();
sb.append(percentEncodeRfc3986(pair.getKey()));
sb.append('=');
sb.append(percentEncodeRfc3986(pair.getValue()));
if (iter.hasNext())
{
sb.append('&');
}
}
return sb.toString();
}
/**
* Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
* according to the RFC, so we make the extra replacements.
*
* @param string Decoded string.
* @return Encoded string per RFC 3986.
*/
private static String percentEncodeRfc3986(final String string)
{
try
{
return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
}
catch (UnsupportedEncodingException e)
{
return string;
}
}
}
Because you also want to identify URLs that refer to the same content, you may find this paper from WWW 2007 interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It offers a nice theoretical approach.
No, there is nothing in the standard libraries to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, etc.
e.g. http://ACME.com/./foo%26bar becomes:
http://acme.com/foo&bar
URI's normalize() does not do this.
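A minimal sketch of one such extra step, lower-casing the scheme and host on top of URI.normalize(), might look like the following. The class and method names are hypothetical. Percent-escapes are deliberately left untouched here, because which escapes are safe to decode depends on the URL component, so this covers only part of the canonicalization described above.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration only.
final class CaseNormalizeSketch {

    /** Normalizes the path and lower-cases scheme and host; escapes are kept as-is. */
    static String lowerCaseSchemeAndHost(String url) {
        URI uri = URI.create(url).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            return uri.toString(); // relative or opaque URI: leave it alone
        }
        // Rebuild from the raw components so nothing is encoded twice.
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme().toLowerCase(Locale.ROOT)).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseSchemeAndHost("HTTP://ACME.com/./foo%26bar"));
    }
}
```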
The RL library:
https://github.com/backchatio/rl
goes quite a ways beyond java.net.URI.normalize().
It's in Scala, but I imagine it should be usable from Java.
You can do this with the Restlet framework using Reference.normalize(). You should also be able to remove the elements you don't need quite conveniently with this class.
In Java, normalize parts of a URL
Example of a URL: https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment
protocol: https
domain name: i0.wp.com
subdomain: i0
port: 55
path: /lplresearch.com/wp-content/feb.png
query: ?ssl=1
parameters: &myvar=2
fragment: #myfragment
Code to do the URL parsing:
import java.util.*;
import java.util.regex.*;
public class regex {
public static String getProtocol(String the_url){
Pattern p = Pattern.compile("^(http|https|smtp|ftp|file|pop)://.*");
Matcher m = p.matcher(the_url);
// group() is only valid after a successful match attempt
return m.matches() ? m.group(1) : null;
}
public static String getParameters(String the_url){
// '#' is left out of the character class so the fragment is not captured with the query
Pattern p = Pattern.compile(".*(\\?[-a-zA-Z0-9_.!$&'()*+,;=]+)(#.*)?$");
Matcher m = p.matcher(the_url);
return m.matches() ? m.group(1) : null;
}
public static String getFragment(String the_url){
Pattern p = Pattern.compile(".*(#.*)$");
Matcher m = p.matcher(the_url);
return m.matches() ? m.group(1) : null;
}
public static void main(String[] args){
String the_url =
"https://i0.wp.com:55/lplresearch.com/" +
"wp-content/feb.png?ssl=1&myvar=2#myfragment";
System.out.println(getProtocol(the_url));
System.out.println(getFragment(the_url));
System.out.println(getParameters(the_url));
}
}
Prints
https
#myfragment
?ssl=1&myvar=2
You can then push and pull on the parts of the URL until they are up to muster.
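As an alternative to the regular expressions above, java.net.URI exposes the same parts through getters. The sketch below is one way to collect them; the class name UriPartsSketch and the method parts are hypothetical, and the getters return the values without the leading '?' or '#'.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class UriPartsSketch {

    /** Collects the components of an absolute URL in a readable order. */
    static Map<String, String> parts(String url) {
        URI uri = URI.create(url);
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("protocol", uri.getScheme());
        parts.put("domain name", uri.getHost());
        parts.put("port", String.valueOf(uri.getPort())); // -1 when no port is given
        parts.put("path", uri.getPath());
        parts.put("query", uri.getQuery());
        parts.put("fragment", uri.getFragment());
        return parts;
    }

    public static void main(String[] args) {
        String theUrl = "https://i0.wp.com:55/lplresearch.com/"
                + "wp-content/feb.png?ssl=1&myvar=2#myfragment";
        parts(theUrl).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```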
I have a simple way to solve it. Here is my code:
public static String normalizeURL(String oldLink)
{
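int pos=oldLink.indexOf("://");
// No "://" found: prepend a scheme instead of letting substring(-1) throw
String newLink = pos < 0 ? "http://"+oldLink : "http"+oldLink.substring(pos);
return newLink;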
}
