How to get name of website from any string url [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I am given a String that contains a valid URL.
I need to extract only the name of the website from that URL,
ignoring any subdomains.
For example:
http://www.yahoo.com => yahoo
www.google.co.in => google
http://in.com => in
http://india.gov.in/ => india
https://in.yahoo.com/ => yahoo
http://philotheoristic.tumblr.com/ => tumblr
https://in.movies.yahoo.com/ => yahoo
How can I do this?

You can make use of the URL class.
From Documentation - http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html
import java.net.*;
import java.io.*;
public class ParseURL {
public static void main(String[] args) throws MalformedURLException {
URL aURL = new URL("http://example.com:80/docs/books/tutorial"
+ "/index.html?name=networking#DOWNLOADING");
System.out.println("protocol = " + aURL.getProtocol());
System.out.println("authority = " + aURL.getAuthority());
System.out.println("host = " + aURL.getHost());
System.out.println("port = " + aURL.getPort());
System.out.println("path = " + aURL.getPath());
System.out.println("query = " + aURL.getQuery());
System.out.println("filename = " + aURL.getFile());
System.out.println("ref = " + aURL.getRef());
}
}
Here is the output displayed by the program:
protocol = http
authority = example.com:80
host = example.com // name of website
port = 80
path = /docs/books/tutorial/index.html
query = name=networking
filename = /docs/books/tutorial/index.html?name=networking
ref = DOWNLOADING
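So aURL.getHost() gives you the host name. To drop subdomains you can split the host on dots. Note that String.split takes a regular expression, so the dot must be escaped: aURL.getHost().split("\\."). The website name is not always the first element (for www.yahoo.com the first element is www), so you still have to pick the right label, for example by skipping a leading www and the trailing domain suffix.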
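Picking the right label is the hard part, because suffixes such as co.in span two labels. The following is a minimal sketch of one heuristic, not a definitive solution. The class name SiteNameSketch, the method siteName, and the short SECOND_LEVEL list are assumptions made for illustration; the list is deliberately incomplete, so unusual suffixes will be misclassified.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SiteNameSketch {
    // Hypothetical, incomplete list of second-level suffix labels (the "co" in co.in, etc.).
    private static final Set<String> SECOND_LEVEL =
            new HashSet<>(Arrays.asList("co", "com", "gov", "org", "net", "ac", "edu"));

    static String siteName(String url) {
        // URI only recognises the host when a scheme is present, so add one if missing.
        String withScheme = url.contains("://") ? url : "http://" + url;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in: " + url);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        if (labels.length < 2) {
            return labels[0]; // single-label host such as "localhost"
        }
        int index = labels.length - 2; // the label just before the top-level domain
        boolean countryCode = labels[labels.length - 1].length() == 2;
        if (countryCode && labels.length >= 3 && SECOND_LEVEL.contains(labels[index])) {
            index--; // step over the "co" in "google.co.in"
        }
        return labels[index];
    }

    public static void main(String[] args) {
        String[] samples = {"http://www.yahoo.com", "www.google.co.in", "http://in.com",
                "http://india.gov.in/", "https://in.movies.yahoo.com/"};
        for (String s : samples) {
            System.out.println(s + " => " + siteName(s));
        }
    }
}
```

For anything beyond a handful of known suffixes, a lookup against a maintained public-suffix list would be more robust than this hard-coded set.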

Regular expressions may help you:
String str = "www.google.co.in";
String [] res = str.split("(\\.|//)+(?=\\w)");
System.out.println(res[1]);
A regular expression is a way to represent a set of strings: the set of all strings that match the expression. In the code above, the argument to split is a regular expression that matches one or more "." or "//" separators followed by a word character.
These "." and "//" substrings are the separators used to split the string into parts.
"www.google.co.in" is split into: www, google, co, in. The solution takes the second element of the split array (index 1), so the result is google. Note that this only works when exactly one part, such as www, precedes the site name.
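To see where this split works and where it does not, here is a small sketch that applies the same regular expression to two of the question's inputs. The class name SplitDemo and the method parts are illustrative names only.

```java
import java.util.Arrays;

class SplitDemo {
    // Same separator expression as in the answer above.
    static String[] parts(String url) {
        return url.split("(\\.|//)+(?=\\w)");
    }

    public static void main(String[] args) {
        // No scheme: index 1 is the site name.
        System.out.println(Arrays.toString(parts("www.google.co.in")));
        // With a scheme, the scheme becomes element 0 and "www" moves to index 1.
        System.out.println(Arrays.toString(parts("http://www.yahoo.com")));
    }
}
```

So taking index 1 is only correct when exactly one part (either www or the scheme, but not both) comes before the site name.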

I found something similar, although the result is somewhat different: it returns the page title rather than the domain name.
http://www.yahoo.com => Yahoo
http://www.google.co.in => Google
http://in.com => In.com Offers Videos, News, Photos, Celebs, Live TV Channels.....
http://india.gov.in/ => National Portal of India
https://in.yahoo.com/ => Yahoo India
http://philotheoristic.tumblr.com/ => Philotheoristic
https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews & Hindi Movie Videos
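Here is the code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TitleExtractor {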
/* the CASE_INSENSITIVE flag accounts for
* sites that use uppercase title tags.
* the DOTALL flag accounts for sites that have
* line feeds in the title text */
private static final Pattern TITLE_TAG =
Pattern.compile("\\<title>(.*)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
/**
* @param url the HTML page
* @return title text (null if document isn't HTML or lacks a title tag)
* @throws IOException
*/
public static String getPageTitle(String url) throws IOException {
URL u = new URL(url);
URLConnection conn = u.openConnection();
// ContentType is an inner class defined below
ContentType contentType = getContentTypeHeader(conn);
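// getContentTypeHeader returns null when the response has no Content-Type header
if (contentType == null || !contentType.contentType.equals("text/html"))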
return null; // don't continue if not HTML
else {
// determine the charset, or use the default
Charset charset = getCharset(contentType);
if (charset == null)
charset = Charset.defaultCharset();
// read the response body, using BufferedReader for performance
InputStream in = conn.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
int n = 0, totalRead = 0;
char[] buf = new char[1024];
StringBuilder content = new StringBuilder();
// read until EOF or first 8192 characters
while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
content.append(buf, 0, n);
totalRead += n;
}
reader.close();
// extract the title
Matcher matcher = TITLE_TAG.matcher(content);
if (matcher.find()) {
/* replace any occurrences of whitespace (which may
* include line feeds and other uglies) as well
* as HTML brackets with a space */
return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
}
else
return null;
}
}
/**
* Loops through response headers until Content-Type is found.
* @param conn
* @return ContentType object representing the value of
* the Content-Type header
*/
private static ContentType getContentTypeHeader(URLConnection conn) {
int i = 0;
boolean moreHeaders = true;
do {
String headerName = conn.getHeaderFieldKey(i);
String headerValue = conn.getHeaderField(i);
if (headerName != null && headerName.equals("Content-Type"))
return new ContentType(headerValue);
i++;
moreHeaders = headerName != null || headerValue != null;
}
while (moreHeaders);
return null;
}
private static Charset getCharset(ContentType contentType) {
if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName))
return Charset.forName(contentType.charsetName);
else
return null;
}
/**
* Class holds the content type and charset (if present)
*/
private static final class ContentType {
private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
private String contentType;
private String charsetName;
private ContentType(String headerValue) {
if (headerValue == null)
throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
int n = headerValue.indexOf(";");
if (n != -1) {
contentType = headerValue.substring(0, n);
Matcher matcher = CHARSET_HEADER.matcher(headerValue);
if (matcher.find())
charsetName = matcher.group(1);
}
else
contentType = headerValue;
}
}
}
Making use of this class is simple:
String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
System.out.println(title);
Here is the link:
http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/
I hope this helps.

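There is no fully reliable way to work out the website name from a URL alone. But if you only need to cut out a particular part of the URL string, you can do it with string operations, for example:
if (url.endsWith("co.in")) {
int end = url.lastIndexOf(".co.in"); // start of the suffix
int start = url.lastIndexOf('.', end - 1) + 1; // just after the preceding dot, or 0 if there is none
website = url.substring(start, end);
}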

Related

Java how to verify hebrew text from the letter [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 4 years ago.
I need to verify the Hebrew text in the body of a letter.
The letter's body looks like this:
שלום,
תואם ייעוץ וידאו עם המטופל John Salivan. מועד הייעוץ נקבע לתאריך
23/02/2019 בשעה 20:45.
לביצוע הייעוץ יש להכנס
but my regex doesn't match the text:
public static void findBadLines(String fileName) {
Pattern regexp = Pattern.compile(".*שלום,.*תואם ייעוץ וידאו עם המטופל John Salivan. .*מועד הייעוץ נקבע לתאריך .* בשעה.*..*לביצוע הייעוץ יש להכנס .*");
Matcher matcher = regexp.matcher("");
Path path = Paths.get(fileName);
//another way of getting all the lines:
//Files.readAllLines(path, ENCODING);
try (
BufferedReader reader = Files.newBufferedReader(path, ENCODING);
LineNumberReader lineReader = new LineNumberReader(reader);
){
String line = null;
while ((line = lineReader.readLine()) != null) {
matcher.reset(line); //reset the input
if (!matcher.find()) {
String msg = "Line " + lineReader.getLineNumber() + " is bad: " + line;
throw new IllegalStateException(msg);
}
}
}
catch (IOException ex){
ex.printStackTrace();
}
}
final static Charset ENCODING = StandardCharsets.UTF_8;
}

If I understand correctly, you want to check whether a given input contains any Hebrew text?
If so, use the regex .*[\u0590-\u05ff]+.*
[\u0590-\u05ff]+ matches one or more Hebrew characters; the .* before and after it matches the rest of the input.
In your code:
Pattern regexp = Pattern.compile(".*[\u0590-\u05ff]+.*");
//...
matcher.reset(line); //reset the input
if (!matcher.matches()) {
String msg = "Line " + lineReader.getLineNumber() + " is bad: " + line;
throw new IllegalStateException(msg);
}
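Note that the loop in the question tests one line at a time, while the expected text spans several lines, so a pattern that describes the whole letter can never match a single line. Below is a minimal sketch of an alternative, assuming the whole body is available as one string: Pattern.DOTALL lets "." match line breaks as well. The class name LetterCheck, both method names, and the shortened letter pattern are illustrative assumptions, not the asker's full pattern.

```java
import java.util.regex.Pattern;

class LetterCheck {
    // Any character from the Unicode Hebrew block.
    private static final Pattern HEBREW = Pattern.compile("[\\u0590-\\u05FF]");

    // Shortened, illustrative pattern: the greeting word plus comma from the first line,
    // followed (possibly several lines later) by the word that precedes the time.
    // Both words are written as regex-level Unicode escapes to avoid source-encoding issues.
    private static final Pattern LETTER = Pattern.compile(
            "\\u05E9\\u05DC\\u05D5\\u05DD,.*\\u05D1\\u05E9\\u05E2\\u05D4", Pattern.DOTALL);

    static boolean containsHebrew(String text) {
        return HEBREW.matcher(text).find();
    }

    // Expects the complete body, including line breaks, not a single line.
    static boolean looksLikeLetter(String body) {
        return LETTER.matcher(body).find();
    }
}
```

With Files.readAllBytes (or Files.readString on newer JDKs) the file can be loaded as one string before calling looksLikeLetter.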

Splitting a String without spaces

I have the following string which is generated by an external program (OpenVAS) and returned to my program successfully as a string.
<create_target_response id="b4c8de55-94d8-4e08-b20e-955f97a714f1" status_text="OK, resource created" status="201"></create_target_response>
I am trying to split the string to get "b4c8d....14f1" without the quotation marks. I have tried all sorts of escaping and keep ending up in the else branch, which throws "String does not contain a Target ID". I have also tried removing the if statement that checks the string, but the issue remains. The goal is to get my id string into jTextField6. The String Lob contains the full string shown above.
if (Lob.contains("id=\"")){
// put the split here
String[] parts = Lob.split("id=\"");
String cut1 = parts[1];
String[] part2 = cut1.split("\"");
String TaskFinal = part2[0];
jTextField6.setText(TaskFinal);
}
else {
throw new IllegalArgumentException("String does not contain a Target ID");
}
} catch (IOException e) {
e.printStackTrace();
}
It seems I only need to escape the " and not the = (Java kicks up an error if i do)
Thanks in advance
EDIT: Code as it stands now using jSoup lib - The 'id' string won't display. Any ideas?
Thanks
private void jButton1ActionPerformed(java.awt.event.ActionEvent evt) {
// TODO add your handling code here:
String TargIP = jTextField1.getText(); // Get IP Address
String TargName = jTextField5.getText(); // Get Target Name
String Vag = "8d32ad99-ac84-4fdc-b196-2b379f861def";
String Lob = "";
final String dosCommand = "cmd /c omp -u admin -w admin --xml=\"<create_target><name>" + TargName + "</name><hosts>" + TargIP + "</hosts></create_target>\"";
3</comment><config id='daba56c8-73ec-11df-a475-002264764cea'/><target id='" + Vag + "'/></create_task>\"";
final String location = "C:\\";
try {
final Process process = Runtime.getRuntime().exec(
dosCommand + " " + location);
final InputStream in = process.getInputStream();
int ch;
while((ch = in.read()) != -1) {
System.out.print((char)ch);
Lob = String.valueOf((char)ch);
jTextArea2.append(Lob);
}
} catch (IOException e) {
e.printStackTrace();
}
String id = Jsoup.parse(Lob).getAllElements().attr("id");
System.out.println(id); // This doesn't output?
}

Split on the double-quote character ("). The odd-indexed tokens are then the attribute values.
String str = "<create_target_response id=\"b4c8de55-94d8-4e08-b20e-955f97a714f1\" status_text=\"OK, resource created\" status=\"201\"></create_target_response>";
String[] tokens = str.split("\\\"");
System.out.println(tokens[1]);
System.out.println(tokens[5]);
output:
b4c8de55-94d8-4e08-b20e-955f97a714f1
201

This will get you your job id more easily:
int idStart = Lob.indexOf("id=")+("id=\"").length();
System.out.println(Lob.substring(idStart,Lob.indexOf("\"",idStart)));

Everyone's telling you to use an XML parser (and they're right) but no one's showing you how.
Here goes:
String lob = ...
Using Jsoup from http://jsoup.org, actually an HTML parser but also handles XML neatly:
String id = Jsoup.parse(lob).getAllElements().attr("id");
// b4c8de55-94d8-4e08-b20e-955f97a714f1
With the built-in Java XML APIs, less concise but no additional libraries:
Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
.parse(new InputSource(new StringReader(lob)));
String id = dom.getDocumentElement().getAttribute("id");
// b4c8de55-94d8-4e08-b20e-955f97a714f1
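One likely reason the id does not print in the question's edited code is that Lob is overwritten with each single character inside the read loop, so only the last character reaches the parser. The sketch below accumulates the whole output first and then parses it with the built-in XML API. The class name TargetIdSketch and its method names are hypothetical, and a ByteArrayInputStream stands in for process.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TargetIdSketch {
    // Collects the complete stream contents instead of keeping only the last character.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Returns the id attribute of the root element ("" if the attribute is absent).
    static String rootId(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return dom.getDocumentElement().getAttribute("id");
        } catch (Exception e) {
            throw new IllegalArgumentException("Response is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<create_target_response id=\"b4c8de55-94d8-4e08-b20e-955f97a714f1\""
                + " status_text=\"OK, resource created\" status=\"201\"></create_target_response>";
        // Stand-in for the process output stream used in the question.
        InputStream fake = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(rootId(readAll(fake)));
    }
}
```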

This is a lot simpler than you're making it, to my mind. First, split on space, then check if an = is present. If it is, split on the =, and finally remove the " from the second token.
The tricky bit is the spaces inside of the "". This will require some regular expressions, which you can work out from this question.
Example
String input; // Assume this contains the whole string.
String pattern; // Have fun working out the regex.
String[] values = input.split(pattern);
for(String value : values)
{
if(value.contains("=")) {
String[] pair = value.split("=");
String key = pair[0];
String val = pair[1].replace("\"", ""); // strip the quotes; named val so it does not clash with the loop variable
// Do something with the values.
}
}
Advantage of my approach
Provided the input follows the format key="value" key="value", you can parse anything that comes through, rather than hard-coding the attribute names.
And if this is XML...
Then use an XML parser. There is a good (awesome) answer that explains why you shouldn't use String manipulation to parse XML/HTML. Here is the answer.
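For input that really is just a run of key="value" pairs, the pattern left as an exercise above could look something like the following sketch. It matches each pair directly instead of splitting, which sidesteps the problem of spaces inside the quotes. The class name AttributeSketch, the method parse, and the exact expression are assumptions for illustration, and it does not handle escaped quotes inside values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeSketch {
    // A name made of word characters, ':' or '-', then ="...", capturing name and value.
    private static final Pattern PAIR = Pattern.compile("([\\w:-]+)=\"([^\"]*)\"");

    // Returns the pairs in the order they appear in the input.
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "<create_target_response id=\"b4c8de55-94d8-4e08-b20e-955f97a714f1\""
                + " status_text=\"OK, resource created\" status=\"201\"></create_target_response>";
        System.out.println(parse(input));
    }
}
```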

You can use a regex to extract what is needed; what is more, it looks like the value of id is a UUID. Therefore:
private static final Pattern PATTERN
= Pattern.compile("\\bid=\"([^\"]+)\"");
// In code...
public String getId(final String input)
{
final Matcher m = PATTERN.matcher(input);
if (!m.find())
throw new IllegalArgumentException("String does not contain a Target ID");
final String uuid = m.group(1);
try {
UUID.fromString(uuid);
} catch (IllegalArgumentException ignored) {
throw new IllegalArgumentException("String does not contain a Target ID");
}
return uuid;
}

Extract jsp page from the whole URL

I am using request.getHeader("Referer") to get the URL of the page I came from. But this gives me the complete URL, for example http://hostname/name/myPage.jsp?param=7. Is there any way to extract myPage.jsp?param=7 from the whole URL, or do I need to process the string myself? I just need myPage.jsp?param=7.

You can simply reconstruct URL by using this function. Use only the things you need from this function.
public static String getUrl(HttpServletRequest req) {
String scheme = req.getScheme(); // http
String serverName = req.getServerName(); // hostname.com
int serverPort = req.getServerPort(); // 80
String contextPath = req.getContextPath(); // /mywebapp
String servletPath = req.getServletPath(); // /servlet/MyServlet
String pathInfo = req.getPathInfo(); // /a/b;c=123
String queryString = req.getQueryString(); // d=789
// Reconstruct original requesting URL
String url = scheme+"://"+serverName+":"+serverPort+contextPath+servletPath;
if (pathInfo != null) {
url += pathInfo;
}
if (queryString != null) {
url += "?"+queryString;
}
return url;
}
Or, if this function does not fulfil your need, you can always use String manipulation:
public static String extractFileName(String path) {
if (path == null) {
return null;
}
String newpath = path.replace('\\', '/');
int start = newpath.lastIndexOf("/");
if (start == -1) {
start = 0;
} else {
start = start + 1;
}
String pageName = newpath.substring(start, newpath.length());
return pageName;
}
Passing in /sub/dir/path.html returns path.html
Hope this helps. :)

Use the URI class: http://docs.oracle.com/javase/7/docs/api/java/net/URI.html
Build a new URI instance from the string you have (http://hostname/name/myPage.jsp?param=7) and then access its parts. What you want is probably the last segment of getPath() followed by "?" and getQuery(); getPath() on its own returns /name/myPage.jsp.
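A minimal sketch of the URI approach, assuming the Referer value is a well-formed absolute URL. The class name RefererPage and the method pageWithQuery are hypothetical names chosen for this example.

```java
import java.net.URI;

class RefererPage {
    // Returns the last path segment plus the query string, e.g. "myPage.jsp?param=7".
    static String pageWithQuery(String referer) {
        URI uri = URI.create(referer);
        String path = uri.getPath();
        if (path == null) {
            return null; // opaque URIs such as mailto: have no path
        }
        String page = path.substring(path.lastIndexOf('/') + 1);
        String query = uri.getRawQuery(); // null when the URL has no query part
        return query == null ? page : page + "?" + query;
    }

    public static void main(String[] args) {
        System.out.println(pageWithQuery("http://hostname/name/myPage.jsp?param=7"));
    }
}
```

getRawQuery is used here so that any percent-encoding in the parameters is returned unchanged.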

// the dot is escaped so that it only matches a literal "." before "jsp"
Pattern p = Pattern.compile("[a-zA-Z]+\\.jsp.*");
Matcher m = p.matcher("http://hostname/name/myPage.jsp?param=7");
if(m.find())
{
System.out.println(m.group());
}

java pattern to obtain the pagename with extension

For the URL http://questions/ask/stackoverflow.xhtml, the requirement is to obtain stackoverflow.
What pattern can be used to obtain this page name?
substring could be used, but I read that a Pattern/Matcher would perform better.

I would guess that a regular expression solution would be more complicated (and likely slower). Here's how I would do it without them:
public static String getFilename(String s) {
int lastSlash = s.lastIndexOf("/");
if (lastSlash < 0) return null;
int nextDot = s.indexOf(".", lastSlash);
return s.substring(lastSlash+1, (nextDot<0) ? s.length() : nextDot);
}
String url = "http://questions/ask/stackoverflow.xhtml";
getFilename(url); // => "stackoverflow"
Of course, if the URL doesn't have a filename then you'll get the hostname instead. You're probably best off parsing a URL, extracting the file part of it, and removing the path and extension. Something like this:
public static String getFilename2(String s) {
URL url = null;
try {
url = new URL(s);
} catch (MalformedURLException mue) { return null; }
String filePart = url.getFile();
if (filePart.equals("")) return "";
File f = new File(filePart);
String filename = f.getName();
int lastDot = filename.lastIndexOf(".");
return (lastDot<0) ? filename : filename.substring(0, lastDot);
}

For that particular URL you can use:
String url = "http://questions/ask/stackoverflow.xhtml";
String pname = url.split("/")[4].split("\\.")[0];
For a more general Pattern-based solution (more flexible, though not faster), consider this:
String url = "http://questions/ask/stackoverflow.xhtml";
Pattern pt = Pattern.compile("/(?![^/]*/)([^.]*)\\.");
Matcher matcher = pt.matcher(url);
if(matcher.find()) {
System.out.println("Matched: [" + matcher.group(1) + ']');
// prints Matched: [stackoverflow]
}

Retrieve Google results programmatically

How do I create a Java program that enters the words "Hello World" into Google and then retrieves the html from the results page? I'm not trying to use the Robot class.

URL url = new URL("http://www.google.com/search?q=hello+world");
url.openStream(); // returns an InputStream which you can read with e.g. a BufferedReader
If you make repeated programmatic requests to Google in this way, it will soon start redirecting you to "we're sorry but you look like a robot" pages.
You may be better off using Google's Custom Search API.

To perform a Google search from a program, you need a developer API key and a custom search engine ID. You can get both from the URLs below.
Google Developers Console: https://cloud.google.com/console/project
Google Custom Search: https://www.google.com/cse/all
Once you have both the key and the ID, use them in the program below, replacing apiKey and customSearchEngineKey with your own values.
For step-by-step information, see http://www.basicsbehind.com/google-search-programmatically/
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
public class CustomGoogleSearch {
final static String apiKey = "AIzaSyAFmFdHiFK783aSsdbq3lWQDL7uOSbnD-QnCnGbY";
final static String customSearchEngineKey = "00070362344324199532843:wkrTYvnft8ma";
final static String searchURL = "https://www.googleapis.com/customsearch/v1?";
public static String search(String pUrl) {
try {
URL url = new URL(pUrl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
StringBuffer buffer = new StringBuffer();
while ((line = br.readLine()) != null) {
buffer.append(line);
}
return buffer.toString();
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
private static String buildSearchString(String searchString, int start, int numOfResults) {
String toSearch = searchURL + "key=" + apiKey + "&cx=" + customSearchEngineKey + "&q=";
// percent-encode spaces in the search query as %20
String newSearchString = searchString.replace(" ", "%20");
toSearch += newSearchString;
// specify response format as json
toSearch += "&alt=json";
// specify starting result number
toSearch += "&start=" + start;
// specify the number of results you need from the starting position
toSearch += "&num=" + numOfResults;
System.out.println("Search URL: " + toSearch);
return toSearch;
}
public static void main(String[] args) throws Exception {
String url = buildSearchString("BasicsBehind", 1, 10);
String result = search(url);
System.out.println(result);
}
}
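The program above only encodes spaces, so a query containing characters such as & or # would produce a broken URL. Here is a hedged sketch of building the same request URL with URLEncoder instead. The class name SearchUrlSketch and the method buildSearchUrl are illustrative; the endpoint and parameter names (key, cx, q, alt, start, num) are taken from the program above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlSketch {
    static final String SEARCH_URL = "https://www.googleapis.com/customsearch/v1?";

    // Encodes a single query-string value; URLEncoder turns spaces into '+'.
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String buildSearchUrl(String apiKey, String engineId, String query,
                                 int start, int numOfResults) {
        return SEARCH_URL
                + "key=" + enc(apiKey)
                + "&cx=" + enc(engineId)
                + "&q=" + enc(query)
                + "&alt=json"
                + "&start=" + start
                + "&num=" + numOfResults;
    }

    public static void main(String[] args) {
        // Placeholder credentials; substitute your own key and engine ID.
        System.out.println(buildSearchUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "hello world", 1, 10));
    }
}
```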
