Using Selenium and JUnit to parse HTML Document for links - java

NullPointerException at if (hrefAttr.contains("?"))
I'm running into a problem. I'm using Selenium and JUnit to parse through links and compare them to a list of links provided from a CSV file.
Everything was going well until I realized that I have to test the URLs and the query strings separately. I attempted to create an if statement: if the href attribute contains a "?", split the entire URL into an array of two strings, the URL destination being the first element and the query string the second, then return the URL destination and append it to an ID. If there is no "?" in the URL string, just return the URL string and append it to an ID.
I think the logic is accurate, but I keep getting a NullPointerException at Line 76 (where the hrefAttr.contains("?") condition is located). Code below:
public static ArrayList<String> getURLSFromHTML(WebDriver driver) {
    // prepares variable for array of html link URLs
    ArrayList<String> pageLinksList = new ArrayList<String>();
    // prepares array to place all of the <a></a> tags found in the HTML
    List<WebElement> aElements = driver.findElements(By.tagName("a"));
    // loops through all the <a></a> tags found in the HTML
    for (WebElement aElement : aElements) {
        /*
         * grabs the href attribute value and stores it into a variable
         * grabs the QA_ID attribute value and stores it in a variable
         * concatenates the QA_ID value with the href value and stores them in a variable
         */
        String hrefAttr = aElement.getAttribute("href");
        String QA_ID = aElement.getAttribute("QA_ID");
        String linkConcat;
        if (hrefAttr.contains("?")) {
            String[] splitHref = hrefAttr.split("\\?");
            String URL = splitHref[0];
            linkConcat = QA_ID + "_" + URL;
        } else {
            linkConcat = QA_ID + "_" + hrefAttr;
        }
        String urlIgnoreAttr = aElement.getAttribute("URL_ignore");
        String combIgnore = QA_ID + "_" + urlIgnoreAttr;
        String combIgnoreVal = "ignore";
        /*
         * if the QA_ID is not null then add value to pageLinksList
         * if URL_ignore attribute="ignore" in html, then add combIgnore value to pageLinksList
         * else add linkConcat to pageLinksList
         */
        if (!Objects.isNull(QA_ID)) {
            if (Objects.equals(urlIgnoreAttr, combIgnoreVal)) {
                pageLinksList.add(combIgnore);
            } else {
                pageLinksList.add(linkConcat);
            }
        }
    }
    System.out.println(pageLinksList);
    return pageLinksList;
}
Please help!

The obvious solution is to check for null:
if (hrefAttr != null && hrefAttr.contains("?")) {
    String[] splitHref = hrefAttr.split("\\?");
    String URL = splitHref[0];
    linkConcat = QA_ID + "_" + URL;
} else {
    linkConcat = QA_ID + "_" + hrefAttr;
}
An anchor tag without an href attribute can still be valid. Without the HTML source we cannot explain the reason for the missing href attributes. The else branch will not throw an NPE, but it may be useless when hrefAttr == null.
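If anchors without an href should simply be skipped rather than recorded with a null value, one option (a sketch only, not the asker's required behaviour) is to pull the null check and the query-string trimming into a small helper:
    // Sketch: returns the href with any query string removed,
    // or null when the <a> element has no href attribute at all.
    private static String hrefWithoutQuery(WebElement aElement) {
        String hrefAttr = aElement.getAttribute("href");
        if (hrefAttr == null) {
            return null; // caller can skip this anchor entirely
        }
        int queryStart = hrefAttr.indexOf('?');
        return queryStart >= 0 ? hrefAttr.substring(0, queryStart) : hrefAttr;
    }
Inside the loop from the question, a null return could then trigger a continue before linkConcat is built.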

Related

java: Can't check this string

I can't convert the String into JSON, and it seems that would be superfluous for the entire string anyway.
I was thinking JSON might help me out here, but it doesn't seem to give me what I want, or I don't know how it would work.
How can I check the string?
I need to check:
METHOD: GET and URL: http://google.com/
and also check that the BODY contains the fields userId, replId and view (keys only, no values)
I was trying to find a way to check that:
if (msg.contains("METHOD: GET") && msg.contains("URL: http://google.com/") && msg.contains("BODY: etc...")) {
    System.out.println("ok");
}
It doesn't work. Some values in the BODY are dynamic, which is why the check won't pass for BODY with such a hardcoded String. And I guess there are better ways to do that.
I'd like to have something like:
Assert.assertEquals(
msg,
the expected value for METHOD, which contains GET); // same here for URL: http://google.com/
Assert.assertEquals(
msg,
the expected value for BODY that has userId, replId, and view fields); // or make this assertion for each field separately, such as there is an assertion for the userId field, the same assertions for replId and view
And here's the String:
String msg = "METHOD: GET\n" +
"URL: http://google.com/\n" +
"token: 32Asdd1QQdsdsg$ff\n" +
"code: 200\n" +
"stand: test\n" +
"BODY: {\"userId\":\"11022:7\",\"bdaId\":\"110220\",\"replId\":\"fffDss0400rDF\",\"local\":\"not\",\"ttpm\":\"000\",\"view\":true}";
I can't think of any way to check that. Any ideas?
You can use the java.util.List Interface (of type String) and place the string contents into that list. Then you can use the List#contains() method, for example:
String msg = "METHOD: GET\n" +
"URL: http://google.com/\n" +
"token: 32Asdd1QQdsdsg$ff\n" +
"code: 200\n" +
"stand: test\n" +
"BODY: {\"userId\":\"11022:7\",\"bdaId\":\"110220\",\"replId\":\"fffDss0400rDF\",\"local\":\"not\",\"ttpm\":\"000\",\"view\":true}";
// Split contents of msg into list.
java.util.List<String> list = Arrays.asList(msg.split("\n"));
if (list.contains("METHOD: GET")) {
    System.out.println("YUP! Got: --> 'METHOD: GET'");
} else {
    System.out.println("NOPE! Don't have: --> 'METHOD: GET'");
}
I've tried to use Assert:
String[] arr1 = msg.split("\n");
Map<String, String> allFieldsMessage = new HashMap<>();
for (String s : arr1) {
    String key = s.trim().split(": ")[0];
    String value = s.trim().split(": ")[1];
    allFieldsMessage.put(key, value);
}
Assert.assertEquals(
        allFieldsMessage.get("METHOD"),
        "GET"
);
And the same for URL. But my problem is the BODY part. I thought maybe I could parse that particular part of the String into JSON and then only check the necessary keys.
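That is a workable approach; below is a minimal sketch, assuming the org.json library (JSONObject) and JUnit 4 are on the classpath. It cuts the JSON out after the "BODY: " prefix and asserts only that the keys exist, ignoring the dynamic values:
import org.json.JSONObject;
import org.junit.Assert;
import org.junit.Test;

public class MessageBodyTest {

    @Test
    public void bodyContainsExpectedKeys() {
        String msg = "METHOD: GET\n" +
                "URL: http://google.com/\n" +
                "BODY: {\"userId\":\"11022:7\",\"bdaId\":\"110220\",\"replId\":\"fffDss0400rDF\",\"view\":true}";

        // Isolate the JSON that follows "BODY: " and parse it.
        String body = msg.substring(msg.indexOf("BODY: ") + "BODY: ".length());
        JSONObject json = new JSONObject(body);

        // Assert only that the keys are present; their (dynamic) values are ignored.
        Assert.assertTrue("userId missing", json.has("userId"));
        Assert.assertTrue("replId missing", json.has("replId"));
        Assert.assertTrue("view missing", json.has("view"));
    }
}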

How to append the id to the url after assigning the id to the String

Scenario: I need to append the id to the url.
What I have done :
I have taken the last id from the table and stored it in a list:
Then I get the text of the id and store it in a String.
List<WebElement> id = driver.findElements(By.xpath("(//table[contains(@class,'mat-table')]//tr/td[1])[last()]"));
int rowsize = id.size();
for (int i = 0; i < rowsize; i++) {
    String text = id.get(i).getText();
    System.out.println("Get the id:" + text);
Then I use that text and append it to the URL:
    String confirmationURL = "https://test-websites.net/#/email?type=confirm";
    String newurl = confirmationURL + "&id=text"; // this is the wrong part: I am passing the literal "text" as the id, but I need to use the id I got from the list
    driver.get(newurl);
So basically the url should be like: https://test-websites.net/#/email?type=confirm&id=47474
Can someone pls give inputs on what should be done?
You can create a new list of URLs and use the add method to append the text.
List<WebElement> id = driver.findElements(By.xpath("(//table[contains(@class,'mat-table')]//tr/td[1])[last()]"));
String confirmationURL = "https://test-websites.net/#/email?type=confirm";
List<String> newurls = new ArrayList<String>();
int rowsize = id.size();
for (int i = 0; i < rowsize; i++) {
    String text = id.get(i).getText();
    System.out.println("Get the id:" + text);
    newurls.add(confirmationURL + "&id=" + text);
}
After successful execution of this code, you'd have a newurls list with URLs ending with the ids from the (//table[contains(@class,'mat-table')]//tr/td[1])[last()] xpath.
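If the goal is to then open each confirmation link, a short illustrative follow-up would be to iterate over that list:
    // Visit each generated confirmation URL in turn.
    for (String newurl : newurls) {
        driver.get(newurl);
        // ... verify the confirmation page here ...
    }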

Java: Get properties of an object by parsing XML-file

I have a question regarding XML and parsing it. I use JDOM to parse my XML file, but I have a little problem.
A sample of my XML file looks like this:
<IO name="Bus" type="Class">
<ResourceAttribute name="Bandwidth" type="KiloBitPerSecond" value="50" />
</IO>
Bus is an object instance of the class IO. The object has the name and type properties. Additionally it has some attributes, like in the sample the attribute Bandwidth with the value 50 and the datatype KiloBitPerSecond.
So when I want to loop over the file with:
for (Element packages : listPackages) {
    Map<String, Values> valueMap = new HashMap<String, Values>();
    List<Element> objectInstanceList = packages.getChildren();
    for (Element objects : objectInstanceList) {
        List<Element> listObjectClasses = objects.getChildren();
        for (Element classes : listObjectClasses) {
            List<Element> listObjectAttributes = classes.getChildren();
            for (Element objectAttributes : listObjectAttributes) {
                List<Attribute> listAttributes = objectAttributes.getAttributes();
                for (Attribute attributes : listAttributes) {
                    String name = attributes.getName();
                    String value = attributes.getValue();
                    AttributeType datatype = attributes.getAttributeType();
                    Values v = new Values(name, datatype, value);
                    valueMap.put(classes.getName(), v);
                    System.out.println(name + ":" + value);
                }
            }
        }
    }
    //System.out.println(valueMap);
}
Values is a class which defines the object attribute:
public class Values {
    private String name;
    //private AttributeType datatype;
    private String value;
That's the rest of the code. I have two questions relating to that. The first one has more priority at the moment.
How do I get the values of the object (Attribute.Name = Bandwidth; Attribute.Value = 50)? Instead I get
name:Bus
type:Class
I thought about an additional for-loop, but the JDOM Attribute class doesn't have a method called getAttributes().
That's just second priority, because without question 1 I cannot go further. As you see in the sample, an Attribute has 3 properties: name, type and value. How can I extract that triple out of the sample? JDOM seems to only know 2 properties for an Attribute, name and value.
Thanks a lot in advance, and hopefully I managed to express myself.
Edit: Added an additional for-loop in it, so the output now is:
name:Bandwidth
type:KiloBitPerSecond
value:50
That means name is the name of that property and value is the value of that name. I didn't know that. At least question one is clearer now and I can try working on 2; the new information makes 2 clearer to me.
In XML the opening tags of elements are enclosed between < and > (or />); after the < comes the name of the element, then a list of attributes in the format name="value". An element can be closed inline with /> or with a closing tag </[element name]>.
It would be preferable to use recursion to parse your XML instead of hard-to-read and hard-to-maintain nested for loops.
Here is how it could look:
@Test
public void parseXmlRec() throws JDOMException, IOException {
    String xml = "<root>"
            + "<Package>"
            + "<IO name=\"Bus\" type=\"Class\">\r\n"
            + "    <ResourceAttribute name=\"Bandwidth\" type=\"KiloBitPerSecond\" value=\"50\" />\r\n"
            + "</IO>"
            + "</Package>"
            + "</root>";
    InputStream is = new ByteArrayInputStream(xml.getBytes());
    SAXBuilder sb = new SAXBuilder();
    Document document = sb.build(is);
    is.close();
    Element root = document.getRootElement();
    List<Element> children = root.getChildren();
    for (Element element : children) {
        parseelement(element);
    }
}

private void parseelement(Element element) {
    System.out.println("Element: " + element.getName());
    String name = element.getAttributeValue("name");
    if (name != null) {
        System.out.println("name: " + name);
    }
    String type = element.getAttributeValue("type");
    if (type != null) {
        System.out.println("type: " + type);
    }
    String value = element.getAttributeValue("value");
    if (value != null) {
        System.out.println("value: " + value);
    }
    List<Element> children = element.getChildren();
    if (children != null) {
        for (Element child : children) {
            parseelement(child);
        }
    }
}
This outputs:
Element: Package
Element: IO
name: Bus
type: Class
Element: ResourceAttribute
name: Bandwidth
type: KiloBitPerSecond
value: 50
While parsing, check the name of each element and instantiate the corresponding objects. For that I would suggest writing a separate method to handle each element. For example:
void parsePackage(Element packageElement) { ... }
void parseIO(Element ioElement) { ... }
void parseResourceAttribute(Element resourceAttributeElement) { ... }
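As an illustration of that last method (not code from the question), parseResourceAttribute could map the three XML attributes onto the Values class shown earlier, assuming Values has, or is given, a constructor that accepts the name, type and value as strings:
    // Hypothetical mapping of a <ResourceAttribute .../> element onto the Values class.
    private Values parseResourceAttribute(Element resourceAttributeElement) {
        String name = resourceAttributeElement.getAttributeValue("name");   // e.g. "Bandwidth"
        String type = resourceAttributeElement.getAttributeValue("type");   // e.g. "KiloBitPerSecond"
        String value = resourceAttributeElement.getAttributeValue("value"); // e.g. "50"
        return new Values(name, type, value);
    }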

How to check if the subdomain is also from same domain using java

I have a list of URLs and I need to filter by a specific domain and its subdomains. Say I have some domains like
http://www.example.com
http://test.example.com
http://test2.example.com
I need to extract the URLs which are from the domain example.com.
I was working on a project that required me to determine if two URLs are from the same subdomain (even when there are nested domains). I worked up a modification of the guide above. It has held up pretty well thus far:
public static boolean isOneSubdomainOfTheOther(String a, String b) {
    try {
        URL first = new URL(a);
        String firstHost = first.getHost();
        firstHost = firstHost.startsWith("www.") ? firstHost.substring(4) : firstHost;
        URL second = new URL(b);
        String secondHost = second.getHost();
        secondHost = secondHost.startsWith("www.") ? secondHost.substring(4) : secondHost;
        /*
         * Test if one is a substring of the other
         */
        if (firstHost.contains(secondHost) || secondHost.contains(firstHost)) {
            String[] firstPieces = firstHost.split("\\.");
            String[] secondPieces = secondHost.split("\\.");
            String[] longerHost = {""};
            String[] shorterHost = {""};
            if (firstPieces.length >= secondPieces.length) {
                longerHost = firstPieces;
                shorterHost = secondPieces;
            } else {
                longerHost = secondPieces;
                shorterHost = firstPieces;
            }
            //int longLength = longURL.length;
            int minLength = shorterHost.length;
            int i = 1;
            /*
             * Compare from the tail of both hosts and work backwards
             */
            while (minLength > 0) {
                String tail1 = longerHost[longerHost.length - i];
                String tail2 = shorterHost[shorterHost.length - i];
                if (tail1.equalsIgnoreCase(tail2)) {
                    // move up one place to the left
                    minLength--;
                } else {
                    // domains do not match
                    return false;
                }
                i++;
            }
            if (minLength == 0) // shorter host exhausted. Is a sub domain
                return true;
        }
    } catch (MalformedURLException ex) {
        ex.printStackTrace();
    }
    return false;
}
Figured I'd leave it here for future reference for a similar problem.
I understand you are probably looking for a fancy solution using the URL class or something, but it is not required. Simply think of a way to extract "example.com" from each of the URLs.
Note: example.com is essentially a different domain than say example.net. Thus extracting just "example" is technically the wrong thing to do.
We can divide a sample URL, say:
http://sub.example.com/page1.html
Step 1: Split the url with delimiter " / " to extract the part containing the domain.
Each such part may be looked at in form of the following blocks (which may be empty)
[www][subdomain][basedomain]
Step 2: Discard "www" (if present). We are left with [subdomain][basedomain]
Step 3: Split the string with delimiter " . "
Step 4: Find the total number of strings generated from the split. If there are 2 strings, both of them are the target domain (example and com). If there are >=3 strings, get the last 3 strings. If the length of last string is 3, then the last 2 strings comprise the domain (example and com). If the length of last string is 2, then the last 3 strings comprise the domain (example and co and uk)
I think this should do the trick (I do hope this wasn't homework :D)
// You may clean this method to make it more optimum / better
private String getRootDomain(String url) {
    String[] domainKeys = url.split("/")[2].split("\\.");
    int length = domainKeys.length;
    int dummy = domainKeys[0].equals("www") ? 1 : 0;
    if (length - dummy == 2) {
        return domainKeys[length - 2] + "." + domainKeys[length - 1];
    } else {
        if (domainKeys[length - 1].length() == 2) {
            return domainKeys[length - 3] + "." + domainKeys[length - 2] + "." + domainKeys[length - 1];
        } else {
            return domainKeys[length - 2] + "." + domainKeys[length - 1];
        }
    }
}
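To come back to the original filtering question, the method above could then be used to keep only the URLs whose root domain is example.com; a small illustrative usage (the input list is made up):
List<String> urls = Arrays.asList(
        "http://www.example.com",
        "http://test.example.com",
        "http://test2.example.com",
        "http://www.other-site.net");

List<String> filtered = new ArrayList<String>();
for (String url : urls) {
    if (getRootDomain(url).equalsIgnoreCase("example.com")) {
        filtered.add(url);
    }
}
System.out.println(filtered); // [http://www.example.com, http://test.example.com, http://test2.example.com]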

Get domain name from given url

Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can contain http/https. Here is the Java code that I wrote. Though it seems to work fine, is there any better approach, or are there some edge cases that could fail?
public static String getDomainName(String url) throws MalformedURLException {
    if (!url.startsWith("http") && !url.startsWith("https")) {
        url = "http://" + url;
    }
    URL netUrl = new URL(url);
    String host = netUrl.getHost();
    if (host.startsWith("www")) {
        host = host.substring("www".length() + 1);
    }
    return host;
}
Input: http://google.com/blah
Output: google.com
If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems -- its equals method does a DNS lookup which means code using it can be vulnerable to denial of service attacks when used with untrusted inputs.
"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.
public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}
should do what you want.
Though It seems to work fine, is there any better approach or are there some edge cases, that could fail.
Your code as written fails for the valid URLs:
httpfoo/bar -- relative URL with a path component that starts with http.
HTTP://example.com/ -- protocol is case-insensitive.
//example.com/ -- protocol relative URL with a host
www/foo -- a relative URL with a path component that starts with www
wwwexample.com -- a domain name that does not start with www. but does start with www
Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.
If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy"
disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the
potential five components of a URI reference.
The following line is the regular expression for breaking-down a
well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis).
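As an illustration (not part of the RFC text), that expression can be dropped straight into java.util.regex; group 4 is the authority (host and optional port) component:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Rfc3986Host {
    // Appendix B regular expression from RFC 3986.
    private static final Pattern URI_REFERENCE =
            Pattern.compile("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    public static String authority(String uriReference) {
        Matcher m = URI_REFERENCE.matcher(uriReference);
        // The pattern can match any input, so matches() always succeeds here.
        return m.matches() ? m.group(4) : null; // group 4 = authority
    }

    public static void main(String[] args) {
        System.out.println(authority("http://www.example.com/docs?x=1")); // www.example.com
        System.out.println(authority("//example.com/path"));              // example.com
        System.out.println(authority("mailto:user@example.com"));         // null (no authority)
    }
}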
import java.net.*;
import java.io.*;
public class ParseURL {
    public static void main(String[] args) throws Exception {
        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                + "/index.html?name=networking#DOWNLOADING");
        System.out.println("protocol = " + aURL.getProtocol()); // http
        System.out.println("authority = " + aURL.getAuthority()); // example.com:80
        System.out.println("host = " + aURL.getHost()); // example.com
        System.out.println("port = " + aURL.getPort()); // 80
        System.out.println("path = " + aURL.getPath()); // /docs/books/tutorial/index.html
        System.out.println("query = " + aURL.getQuery()); // name=networking
        System.out.println("filename = " + aURL.getFile()); // /docs/books/tutorial/index.html?name=networking
        System.out.println("ref = " + aURL.getRef()); // DOWNLOADING
    }
}
Here is a short and simple line using InternetDomainName.topPrivateDomain() in Guava: InternetDomainName.from(new URL(url).getHost()).topPrivateDomain().toString()
Given http://www.google.com/blah, that will give you google.com. Or, given http://www.google.co.mx, it will give you google.co.mx.
As Sa Qada commented in another answer on this post, this question has been asked earlier: Extract main domain name from a given url. The best answer to that question is from Satya, who suggests Guava's InternetDomainName.topPrivateDomain()
public boolean isTopPrivateDomain()
Indicates whether this domain name is composed of exactly one
subdomain component followed by a public suffix. For example, returns
true for google.com and foo.co.uk, but not for www.google.com or
co.uk.
Warning: A true result from this method does not imply that the
domain is at the highest level which is addressable as a host, as many
public suffixes are also addressable hosts. For example, the domain
bar.uk.com has a public suffix of uk.com, so it would return true from
this method. But uk.com is itself an addressable host.
This method can be used to determine whether a domain is probably the
highest level for which cookies may be set, though even that depends
on individual browsers' implementations of cookie controls. See RFC
2109 for details.
Putting that together with URL.getHost(), which the original post already contains, gives you:
import com.google.common.net.InternetDomainName;
import java.net.URL;
public class DomainNameMain {
    public static void main(final String... args) throws Exception {
        final String urlString = "http://www.google.com/blah";
        final URL url = new URL(urlString);
        final String host = url.getHost();
        final InternetDomainName name = InternetDomainName.from(host).topPrivateDomain();
        System.out.println(urlString);
        System.out.println(host);
        System.out.println(name);
    }
}
I wrote a method (see below) which extracts a url's domain name and which uses simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there's no "://" contained) and the first subsequent "/" (or index String.length() if there's no subsequent "/"). The remaining, preceding "www(_)*." bit is chopped off. I'm sure there'll be cases where this won't be good enough but it should be good enough in most cases!
Mike Samuel's post above says that the java.net.URI class could do this (and was preferred to the java.net.URL class) but I encountered problems with the URI class. Notably, URI.getHost() gives a null value if the url does not include the scheme, i.e. the "http(s)" bit.
/**
 * Extracts the domain name from {@code url}
 * by means of String manipulation
 * rather than using the {@link URI} or {@link URL} class.
 *
 * @param url is non-null.
 * @return the domain name within {@code url}.
 */
public String getUrlDomainName(String url) {
    String domainName = new String(url);
    int index = domainName.indexOf("://");
    if (index != -1) {
        // keep everything after the "://"
        domainName = domainName.substring(index + 3);
    }
    index = domainName.indexOf('/');
    if (index != -1) {
        // keep everything before the '/'
        domainName = domainName.substring(0, index);
    }
    // check for and remove a preceding 'www'
    // followed by any sequence of characters (non-greedy)
    // followed by a '.'
    // from the beginning of the string
    domainName = domainName.replaceFirst("^www.*?\\.", "");
    return domainName;
}
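A few illustrative calls (the inputs are arbitrary) showing what this string-based approach returns:
getUrlDomainName("http://www.example.com/some/path");  // "example.com"
getUrlDomainName("https://blog.example.com/article");  // "blog.example.com"
getUrlDomainName("example.com/path");                  // "example.com"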
I added a small treatment of the url before the URI object creation:
if (url.startsWith("http:/")) {
    if (!url.contains("http://")) {
        url = url.replaceAll("http:/", "http://");
    }
} else {
    url = "http://" + url;
}
URI uri = new URI(url);
String domain = uri.getHost();
return domain.startsWith("www.") ? domain.substring(4) : domain;
In my case I only needed the main domain and not the subdomain (no "www" or whatever the subdomain is):
public static String getUrlDomain(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    String[] domainArray = domain.split("\\.");
    if (domainArray.length == 1) {
        return domainArray[0];
    }
    return domainArray[domainArray.length - 2] + "." + domainArray[domainArray.length - 1];
}
With this method the url "https://rest.webtoapp.io/llSlider?lg=en&t=8" will give the domain "webtoapp.io".
val host = url.split("/")[2]
All the above are good. This one seems really simple to me and easy to understand. Excuse the quotes. I wrote it for Groovy inside a class called DataCenter.
static String extractDomainName(String url) {
    int start = url.indexOf('://')
    if (start < 0) {
        start = 0
    } else {
        start += 3
    }
    int end = url.indexOf('/', start)
    if (end < 0) {
        end = url.length()
    }
    String domainName = url.substring(start, end)
    int port = domainName.indexOf(':')
    if (port >= 0) {
        domainName = domainName.substring(0, port)
    }
    domainName
}
And here are some junit4 tests:
@Test
void shouldFindDomainName() {
    assert DataCenter.extractDomainName('http://example.com/path/') == 'example.com'
    assert DataCenter.extractDomainName('http://subpart.example.com/path/') == 'subpart.example.com'
    assert DataCenter.extractDomainName('http://example.com') == 'example.com'
    assert DataCenter.extractDomainName('http://example.com:18445/path/') == 'example.com'
    assert DataCenter.extractDomainName('example.com/path/') == 'example.com'
    assert DataCenter.extractDomainName('example.com') == 'example.com'
}
Try this one, using java.net.URL:
JOptionPane.showMessageDialog(null, getDomainName(new URL("https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains")));
public String getDomainName(URL url) {
    String strDomain;
    String[] strhost = url.getHost().split(Pattern.quote("."));
    String[] strTLD = {"com", "org", "net", "int", "edu", "gov", "mil", "arpa"};
    if (Arrays.asList(strTLD).indexOf(strhost[strhost.length - 1]) >= 0)
        strDomain = strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
    else if (strhost.length > 2)
        strDomain = strhost[strhost.length - 3] + "." + strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
    else
        strDomain = strhost[strhost.length - 2] + "." + strhost[strhost.length - 1];
    return strDomain;
}
There is a similar question, Extract main domain name from a given url. If you take a look at this answer, you will see that it is very easy. You just need to use java.net.URL and the String utility split.
One of the ways I did it, which worked for all of the cases, is using the Guava library and a regex in combination.
public static String getDomainNameWithGuava(String url) throws MalformedURLException, URISyntaxException {
    String host = new URL(url).getHost();
    String domainName = "";
    try {
        domainName = InternetDomainName.from(host).topPrivateDomain().toString();
    } catch (IllegalStateException | IllegalArgumentException e) {
        domainName = getDomain(url, true);
    }
    return domainName;
}
getDomain() can be any common method with regex.
private static final String hostExtractorRegexString = "(?:https?://)?(?:www\\.)?(.+\\.)(com|au\\.uk|co\\.in|be|in|uk|org\\.in|org|net|edu|gov|mil)";
private static final Pattern hostExtractorRegexPattern = Pattern.compile(hostExtractorRegexString);

public static String getDomainName(String url) {
    if (url == null) return null;
    url = url.trim();
    Matcher m = hostExtractorRegexPattern.matcher(url);
    if (m.find() && m.groupCount() == 2) {
        return m.group(1) + m.group(2);
    }
    return null;
}
Explanation:
The regex has 4 groups. The first two are non-capturing groups and the next two are capturing groups.
The first non-capturing group is "http" or "https" or ""
The second non-capturing group is "www." or ""
The second capturing group is the top level domain
The first capturing group is anything after the non-capturing groups and anything before the top level domain
The concatenation of the two capturing groups will give us the domain/host name.
PS: Note that you can add any number of supported domains to the regex.
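A few illustrative calls (the exact output depends on the TLD alternation in the regex above):
System.out.println(getDomainName("https://www.example.co.in/path")); // example.co.in
System.out.println(getDomainName("http://mail.google.com/mail"));    // mail.google.com
System.out.println(getDomainName("ftp://example.xyz/file"));         // null ("xyz" is not in the TLD list)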
If the url is user input, this method gives the most appropriate host name. If one is not found, it gives back the input url.
private String getHostName(String urlInput) {
    urlInput = urlInput.toLowerCase();
    String hostName = urlInput;
    if (!urlInput.equals("")) {
        if (urlInput.startsWith("http") || urlInput.startsWith("https")) {
            try {
                URL netUrl = new URL(urlInput);
                String host = netUrl.getHost();
                if (host.startsWith("www")) {
                    hostName = host.substring("www".length() + 1);
                } else {
                    hostName = host;
                }
            } catch (MalformedURLException e) {
                hostName = urlInput;
            }
        } else if (urlInput.startsWith("www")) {
            hostName = urlInput.substring("www".length() + 1);
        }
        return hostName;
    } else {
        return "";
    }
}
To get the actual domain name, without the subdomain, I use:
private String getDomainName(String url) throws URISyntaxException {
    String hostName = new URI(url).getHost();
    if (!hostName.contains(".")) {
        return hostName;
    }
    String[] host = hostName.split("\\.");
    return host[host.length - 2];
}
Note that this won't work with second-level domains (like .co.uk).
// groovy
def hostname = { url -> url[(url.indexOf('://') + 3)..-1].split('/')[0] }
hostname('http://hello.world.com/something') // returns 'hello.world.com'
hostname('docker://quay.io/skopeo/stable') // returns 'quay.io'
const val WWW = "www."

fun URL.domain(): String {
    val domain: String = this.host
    return if (domain.startsWith(WWW)) {
        domain.substring(WWW.length)
    } else {
        domain
    }
}
I use a regex solution:
public static String getDomainName(String url) {
    return url.replaceAll("http(s)?://|www\\.|wap\\.|/.*", "");
}
It cleans the url of "http/https/www./wap." and of everything after the /, like "/questions" in "https://stackoverflow.com/questions", so we get just "stackoverflow.com".
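A quick illustrative check of that behaviour (the sample URLs are arbitrary):
public static void main(String[] args) {
    System.out.println(getDomainName("https://stackoverflow.com/questions")); // stackoverflow.com
    System.out.println(getDomainName("http://www.example.com/a/b?x=1"));      // example.com
    System.out.println(getDomainName("wap.example.org"));                     // example.org
}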
