java pattern to obtain the pagename with extension


For the URL http://questions/ask/stackoverflow.xhtml, the requirement is to obtain stackoverflow, the page name without its extension.
What pattern can be used to obtain this page name?
substring could be used, but I read that a Pattern/Matcher would perform better.

I would guess that a regular expression solution would be more complicated (and likely slower). Here's how I would do it without them:
public static String getFilename(String s) {
int lastSlash = s.lastIndexOf("/");
if (lastSlash < 0) return null;
int nextDot = s.indexOf(".", lastSlash);
return s.substring(lastSlash+1, (nextDot<0) ? s.length() : nextDot);
}
String url = "http://questions/ask/stackoverflow.xhtml";
getFilename(url); // => "stackoverflow"
Of course, if the URL doesn't have a filename then you'll get the hostname instead. You're probably best off parsing a URL, extracting the file part of it, and removing the path and extension. Something like this:
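// requires: import java.io.File; import java.net.MalformedURLException; import java.net.URL;
public static String getFilename2(String s) {
URL url;
try {
url = new URL(s);
} catch (MalformedURLException mue) { return null; }
// getPath() excludes the query string, which getFile() would include
String path = url.getPath();
if (path.equals("")) return "";
// File is used here only to strip the directory part of the path
String filename = new File(path).getName();
int lastDot = filename.lastIndexOf(".");
return (lastDot<0) ? filename : filename.substring(0, lastDot);
}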

For that particular URL you can use:
String url = "http://questions/ask/stackoverflow.xhtml";
String pname = url.split("/")[4].split("\\.")[0];
For a more general Pattern-based solution (more flexible as a regex, not necessarily faster), consider this:
String url = "http://questions/ask/stackoverflow.xhtml";
Pattern pt = Pattern.compile("/(?![^/]*/)([^.]*)\\.");
Matcher matcher = pt.matcher(url);
if(matcher.find()) {
System.out.println("Matched: [" + matcher.group(1) + ']');
// prints Matched: [stackoverflow]
}

Related

Java Matcher overreplacing groups

public class HexASCIITest {
public static void main(String[] args) throws DecoderException, UnsupportedEncodingException {
String test = "src=\"test/__test/path/path2/AA_5F00_20140915_5F00_15_5F00_11_5F00_55_5F00_image_5F005F00_name.jpg\"";
Pattern patternImages = Pattern.compile("src=\"[^\"]*?/__test/[^/]*?/[^/]*?/([^\"/]*?)\"");
Matcher matcherImages = patternImages.matcher(test);
while(matcherImages.find()) {
String imageName = matcherImages.group(1);
Pattern pattern = Pattern.compile("_((?:[01234567890ABCDEF]{4}){1,})_");
Matcher matcher = pattern.matcher(imageName);
while(matcher.find()) {
byte[] bytes = Hex.decodeHex(matcher.group(1).toCharArray());
String imagePath = new String(bytes, "latin1");
imagePath = imagePath.replaceAll("\0", "");
imageName = imageName.replaceFirst("_((?:[01234567890ABCDEF]{4}){1,})_", imagePath.trim());
}
System.out.println(imageName);
}
}
}
Hi, this program is supposed to turn the hex codes into ASCII, but I seem to have a logic problem. Could anyone help?
The initial image name is: AA_5F00_20140915_5F00_15_5F00_11_5F00_55_5F00_image_5F005F00_name.jpg
After all of the replacements: AA_15_11_55__image_5F005F00_name.jpg
This is not how it is supposed to work: the date 20140915 is gone and 5F005F00 is still there. Thank you for your help!

Found it.
The regex should be:
Pattern pattern = Pattern.compile("(([0123456789ABCDEF]{4}){1,})");
Then:
byte[] bytes = Hex.decodeHex(matcher.group(2).toCharArray());
And finally the replace:
imageName = imageName.replaceFirst(matcher.group(1), imagePath.trim());
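One caveat with the fix above: in a repeated group such as (...){1,}, group(2) holds only the last repetition, and a bare run of hex digits also matches plain numbers like 20140915. A possible alternative is a single pass with Matcher.appendReplacement, so that text which has already been replaced is never scanned again. The sketch below is only an illustration under two assumptions that the question does not confirm: every escape is delimited by underscores, and each group of four hex digits is one UTF-16LE code unit (so 5F00 is "_"). The class and method names are made up, and it uses only the JDK rather than commons-codec.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HexNameDecoder {
    // An escape is "_", one or more groups of four hex digits, then "_".
    private static final Pattern ESCAPE = Pattern.compile("_((?:[0-9A-F]{4})+)_");

    // Decodes every escape in one left-to-right pass.
    static String decode(String encoded) {
        Matcher m = ESCAPE.matcher(encoded);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String hex = m.group(1);
            byte[] bytes = new byte[hex.length() / 2];
            for (int i = 0; i < bytes.length; i++) {
                bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
            }
            // Assumption: the bytes are UTF-16LE, so "5F00" becomes "_".
            String decoded = new String(bytes, StandardCharsets.UTF_16LE);
            // quoteReplacement keeps "$" and "\" in the decoded text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Under those assumptions, the image name from the question keeps its date and the doubled escape becomes two underscores.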

Apply regex on url string

I have a URL like this:
http://abc-xyz.com/AppName/service?id=1234&isStudent&stream&shouldUpdateRecord=Y&courseType
I want to apply a regex before making a REST call to a third-party system. The regex should remove all keys that have no value, i.e. from the given URL it should remove "&isStudent", "&stream" and "&courseType", leaving:
http://abc-xyz.com/AppName/service?id=1234&shouldUpdateRecord=Y
Any pointers?

I can't do it in one regex, because the number of key-only parameters is variable. But I can do it with a short program like this:
public class Playground {
public static void main(String[] args) {
String testInput = "http://abc-xyz.com/AppName/service?id=1234&isStudent&stream&shouldUpdateRecord=Y&courseType";
String[] tokens = testInput.split("\\?");
String urlPrefix = tokens[0];
String paramString = tokens[1];
String[] params = paramString.split("&");
StringBuilder sb = new StringBuilder();
sb.append(urlPrefix + "?");
String keyValueRegex = "(\\w+)=(\\w+)";
String amp = ""; // first time special
for (String param : params) {
if (param.matches(keyValueRegex)) {
sb.append(amp + param);
amp = "&"; // second time and onwards
}
}
System.out.println(sb.toString());
}
}
The output of this program is this:
http://abc-xyz.com/AppName/service?id=1234&shouldUpdateRecord=Y
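For the sample URL, a single replaceAll with a lookahead also appears to be enough. This is a minimal sketch, not a general query-string parser: the class and method names are hypothetical, and it assumes that a key without a value is always introduced by "&", so a valueless key directly after "?" is left in place.

```java
public class ValuelessParams {
    // Removes every "&key" segment that is followed directly by "&", "#"
    // or the end of the string, i.e. that has no "=value" part.
    static String stripValuelessParams(String url) {
        return url.replaceAll("&[^&=#]+(?=&|#|$)", "");
    }
}
```

Calling stripValuelessParams on the URL from the question removes "&isStudent", "&stream" and "&courseType" and keeps the two key=value pairs.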

How to get name of website from any string url [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I am given a String which contains any valid URL.
I have to find only the name of the website from the given URL.
I also have to ignore subdomains.
For example:
http://www.yahoo.com => yahoo
www.google.co.in => google
http://in.com => in
http://india.gov.in/ => india
https://in.yahoo.com/ => yahoo
http://philotheoristic.tumblr.com/ => tumblr
https://in.movies.yahoo.com/ => yahoo
How can I do this?

You can make use of URL.
From the documentation: http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html
import java.net.*;
import java.io.*;
public class ParseURL {
public static void main(String[] args) throws MalformedURLException {
URL aURL = new URL("http://example.com:80/docs/books/tutorial"
+ "/index.html?name=networking#DOWNLOADING");
System.out.println("protocol = " + aURL.getProtocol());
System.out.println("authority = " + aURL.getAuthority());
System.out.println("host = " + aURL.getHost());
System.out.println("port = " + aURL.getPort());
System.out.println("path = " + aURL.getPath());
System.out.println("query = " + aURL.getQuery());
System.out.println("filename = " + aURL.getFile());
System.out.println("ref = " + aURL.getRef());
}
}
Here is the output displayed by the program:
protocol = http
authority = example.com:80
host = example.com // name of website
port = 80
path = /docs/books/tutorial/index.html
query = name=networking
filename = /docs/books/tutorial/index.html?name=networking
ref = DOWNLOADING
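So by using aURL.getHost() you can get the host name. To ignore subdomains you can split it on ".". Because String.split takes a regular expression, the dot must be escaped: aURL.getHost().split("\\.") returns the individual labels. Note that index 0 is the first label (for example www in www.yahoo.com), so you still have to pick the label that holds the site name.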
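Picking the right label requires knowing which trailing labels form the public suffix (com, co.in, gov.in and so on). The following is a rough sketch of that idea, not a complete solution: the SUFFIXES set is a tiny hand-written sample that covers only the hosts in the question (real code would consult the full Public Suffix List), and the class and method names are made up.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SiteName {
    // Illustrative sample only; NOT a complete list of public suffixes.
    private static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "co.in", "gov.in"));

    // Returns the label directly before the longest known suffix, or null.
    static String siteName(String input) {
        // Inputs such as "www.google.co.in" have no scheme; add one so URI can parse the host.
        String s = input.contains("://") ? input : "http://" + input;
        String host = URI.create(s).getHost();
        if (host == null) {
            return null;
        }
        String[] labels = host.toLowerCase().split("\\.");
        // Try the longest candidate suffix first, then shorter ones.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(suffix)) {
                return labels[i - 1];
            }
        }
        return null;
    }
}
```

With this sample set, siteName returns yahoo for https://in.movies.yahoo.com/ and google for www.google.co.in; hosts whose suffix is not in the set yield null.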

Regular expressions may help you:
String str = "www.google.co.in";
String [] res = str.split("(\\.|//)+(?=\\w)");
System.out.println(res[1]);
A regular expression is a way to represent a set of strings: the set of all strings that match the expression. In the code above, the string passed to split is a regular expression that matches one or more "." or "//" separators followed by a word character.
So these "." and "//" substrings are the separators used to split the string into parts.
"www.google.co.in" is split into www, google, co, in. Since the solution takes the second element of the array (index 1), the result is google. Note that this relies on the name being the second part, which does not hold for every URL in the question: http://www.yahoo.com is split into http:, www, yahoo, com, so index 1 is www.

I found something similar, although somewhat different: it extracts the page title rather than the domain name.
http://www.yahoo.com => Yahoo
http://www.google.co.in => Google
http://in.com => In.com Offers Videos, News, Photos, Celebs, Live TV Channels.....
http://india.gov.in/ => National Portal of India
https://in.yahoo.com/ => Yahoo India
http://philotheoristic.tumblr.com/ => Philotheoristic
https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews & Hindi Movie Videos
Here is the code:
public class TitleExtractor {
/* the CASE_INSENSITIVE flag accounts for
* sites that use uppercase title tags.
* the DOTALL flag accounts for sites that have
* line feeds in the title text */
private static final Pattern TITLE_TAG =
Pattern.compile("\\<title>(.*)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
/**
* @param url the HTML page
* @return title text (null if document isn't HTML or lacks a title tag)
* @throws IOException
*/
public static String getPageTitle(String url) throws IOException {
URL u = new URL(url);
URLConnection conn = u.openConnection();
// ContentType is an inner class defined below
ContentType contentType = getContentTypeHeader(conn);
if (contentType == null || !contentType.contentType.equals("text/html")) // null when no Content-Type header was sent
return null; // don't continue if not HTML
else {
// determine the charset, or use the default
Charset charset = getCharset(contentType);
if (charset == null)
charset = Charset.defaultCharset();
// read the response body, using BufferedReader for performance
InputStream in = conn.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
int n = 0, totalRead = 0;
char[] buf = new char[1024];
StringBuilder content = new StringBuilder();
// read until EOF or first 8192 characters
while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
content.append(buf, 0, n);
totalRead += n;
}
reader.close();
// extract the title
Matcher matcher = TITLE_TAG.matcher(content);
if (matcher.find()) {
/* replace any occurrences of whitespace (which may
* include line feeds and other uglies) as well
* as HTML brackets with a space */
return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
}
else
return null;
}
}
/**
* Loops through response headers until Content-Type is found.
* @param conn
* @return ContentType object representing the value of
* the Content-Type header
*/
private static ContentType getContentTypeHeader(URLConnection conn) {
int i = 0;
boolean moreHeaders = true;
do {
String headerName = conn.getHeaderFieldKey(i);
String headerValue = conn.getHeaderField(i);
if (headerName != null && headerName.equals("Content-Type"))
return new ContentType(headerValue);
i++;
moreHeaders = headerName != null || headerValue != null;
}
while (moreHeaders);
return null;
}
private static Charset getCharset(ContentType contentType) {
if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName))
return Charset.forName(contentType.charsetName);
else
return null;
}
/**
* Class holds the content type and charset (if present)
*/
private static final class ContentType {
private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
private String contentType;
private String charsetName;
private ContentType(String headerValue) {
if (headerValue == null)
throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
int n = headerValue.indexOf(";");
if (n != -1) {
contentType = headerValue.substring(0, n);
Matcher matcher = CHARSET_HEADER.matcher(headerValue);
if (matcher.find())
charsetName = matcher.group(1);
}
else
contentType = headerValue;
}
}
}
Making use of this class is simple:
String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
System.out.println(title);
Here is the link:
http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/
I hope it helps.

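There is no reliable way to work out the website name from a URL by string operations alone. But if you only need to cut out a particular part of the URL string, you can do it like this (assuming url is a bare host name ending in .co.in):
if (url.endsWith(".co.in")) {
int suffixStart = url.lastIndexOf(".co.in");
int nameStart = url.lastIndexOf('.', suffixStart - 1) + 1; // 0 if there is no subdomain
website = url.substring(nameStart, suffixStart);
}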

Splitting a String without spaces

I have the following string which is generated by an external program (OpenVAS) and returned to my program successfully as a string.
<create_target_response id="b4c8de55-94d8-4e08-b20e-955f97a714f1" status_text="OK, resource created" status="201"></create_target_response>
I am trying to split the string to get "b4c8d....14f1" without the quotation marks. I have tried all sorts of escaping and keep hitting the else branch, "String does not contain a Target ID". I have tried removing the if statement that checks the string, but the problem remains. The goal is to get the id string into jTextField6. The String Lob contains the full string shown above.
if (Lob.contains("id=\"")){
// put the split here
String[] parts = Lob.split("id=\"");
String cut1 = parts[1];
String[] part2 = cut1.split("\"");
String TaskFinal = part2[0];
jTextField6.setText(TaskFinal);
}
else {
throw new IllegalArgumentException("String does not contain a Target ID");
}
} catch (IOException e) {
e.printStackTrace();
}
It seems I only need to escape the " and not the = (Java reports an error if I do).
Thanks in advance.
EDIT: Code as it stands now, using the jSoup library. The 'id' string won't display. Any ideas?
private void jButton1ActionPerformed(java.awt.event.ActionEvent evt) {
// TODO add your handling code here:
String TargIP = jTextField1.getText(); // Get IP Address
String TargName = jTextField5.getText(); // Get Target Name
String Vag = "8d32ad99-ac84-4fdc-b196-2b379f861def";
String Lob = "";
final String dosCommand = "cmd /c omp -u admin -w admin --xml=\"<create_target><name>" + TargName + "</name><hosts>" + TargIP + "</hosts></create_target>\"";
3</comment><config id='daba56c8-73ec-11df-a475-002264764cea'/><target id='" + Vag + "'/></create_task>\"";
final String location = "C:\\";
try {
final Process process = Runtime.getRuntime().exec(
dosCommand + " " + location);
final InputStream in = process.getInputStream();
int ch;
while((ch = in.read()) != -1) {
System.out.print((char)ch);
Lob = String.valueOf((char)ch);
jTextArea2.append(Lob);
}
} catch (IOException e) {
e.printStackTrace();
}
String id = Jsoup.parse(Lob).getAllElements().attr("id");
System.out.println(id); // This doesn't output?
}

Split on the double quote character ("). That gives you all the attribute values.
String str = "<create_target_response id=\"b4c8de55-94d8-4e08-b20e-955f97a714f1\" status_text=\"OK, resource created\" status=\"201\"></create_target_response>";
String[] tokens = str.split("\\\"");
System.out.println(tokens[1]);
System.out.println(tokens[5]);
output:
b4c8de55-94d8-4e08-b20e-955f97a714f1
201

This will get you your job id more easily:
int idStart = Lob.indexOf("id=")+("id=\"").length();
System.out.println(Lob.substring(idStart,Lob.indexOf("\"",idStart)));

Everyone's telling you to use an XML parser (and they're right) but no one's showing you how.
Here goes:
String lob = ...
Using Jsoup from http://jsoup.org, actually an HTML parser but also handles XML neatly:
String id = Jsoup.parse(lob).getAllElements().attr("id");
// b4c8de55-94d8-4e08-b20e-955f97a714f1
With built-in Java XML APIs, less concise but no additional libraries:
Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
.parse(new InputSource(new StringReader(lob)));
String id = dom.getDocumentElement().getAttribute("id");
// b4c8de55-94d8-4e08-b20e-955f97a714f1
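The EDIT in the question assigns Lob = String.valueOf((char)ch) inside the read loop, which keeps only the last character read, so that is a likely reason the parser finds no id. Below is a minimal sketch that first accumulates the whole process output and then reads the attribute with the built-in XML API. The class and method names are hypothetical, and UTF-8 is an assumption about the encoding of the omp output.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TargetIdReader {
    // Reads the stream to the end and returns everything as one String.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Assumption: the external program writes UTF-8.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n); // append, do not overwrite
            }
        }
        return sb.toString();
    }

    // Parses the XML and returns the id attribute of the root element.
    static String extractId(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return dom.getDocumentElement().getAttribute("id");
    }
}
```

In the question's handler this would amount to something like Lob = TargetIdReader.readAll(process.getInputStream()) followed by extractId(Lob), with the text area updated once from the complete string.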

This is a lot simpler than you're making it, to my mind. First, split on space, then check if an = is present. If it is, split on the =, and finally remove the " from the second token.
The tricky bit is the spaces inside of the "". This will require some regular expressions, which you can work out from this question.
Example
String input; // Assume this contains the whole string.
String pattern; // Have fun working out the regex.
String[] values = input.split(pattern);
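for(String token : values)
{
if(token.contains("=")) {
String[] pair = token.split("=");
String key = pair[0];
// replace() removes the quotes; the loop variable is renamed so that "value" is not declared twice
String value = pair[1].replace("\"", "");
// Do something with the values.
}
}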
Advantage of my approach
Is that provided the input follows the format of key="value" key="value", you can parse anything that comes through, rather than hard coding the name of the attributes.
And if this is XML..
Then use an XML parser. There is a good (awesome) answer that explains why you shouldn't be using string manipulation to parse XML/HTML. Here is the answer.
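One way to sidestep the spaces inside the quotes is to match whole key="value" pairs instead of splitting. This is a minimal sketch under the same assumption as the answer, namely that the input follows the key="value" format with no escaped quotes inside values; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributePairs {
    // A key made of word characters, "=", then a double-quoted value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    // Collects every key="value" pair in order of appearance.
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }
}
```

For the string in the question, parse(...).get("id") would give the UUID and get("status_text") the text "OK, resource created", spaces included.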

You can use a regex to extract what is needed; what is more, it looks like the value of id is a UUID. Therefore:
private static final Pattern PATTERN
= Pattern.compile("\\bid=\"([^\"]+)\"");
// In code...
public String getId(final String input)
{
final Matcher m = PATTERN.matcher(input);
if (!m.find())
throw new IllegalArgumentException("String does not contain a Target ID");
final String uuid = m.group(1);
try {
UUID.fromString(uuid);
} catch (IllegalArgumentException ignored) {
throw new IllegalArgumentException("String does not contain a Target ID");
}
return uuid;
}

Extract jsp page from the whole URL

I am using request.getHeader("Referer") to get the URL of the previous page I came from. But this gives me the complete URL, for example http://hostname/name/myPage.jsp?param=7. Is there any way to extract myPage.jsp?param=7 from the whole URL, or do I need to process the string myself? I just need myPage.jsp?param=7.

You can reconstruct the URL with this function. Use only the parts of it that you need.
public static String getUrl(HttpServletRequest req) {
String scheme = req.getScheme(); // http
String serverName = req.getServerName(); // hostname.com
int serverPort = req.getServerPort(); // 80
String contextPath = req.getContextPath(); // /mywebapp
String servletPath = req.getServletPath(); // /servlet/MyServlet
String pathInfo = req.getPathInfo(); // /a/b;c=123
String queryString = req.getQueryString(); // d=789
// Reconstruct original requesting URL
String url = scheme+"://"+serverName+":"+serverPort+contextPath+servletPath;
if (pathInfo != null) {
url += pathInfo;
}
if (queryString != null) {
url += "?"+queryString;
}
return url;
}
Or, if this function does not fulfil your need, you can always use string manipulation:
public static String extractFileName(String path) {
if (path == null) {
return null;
}
String newpath = path.replace('\\', '/');
int start = newpath.lastIndexOf("/");
if (start == -1) {
start = 0;
} else {
start = start + 1;
}
String pageName = newpath.substring(start, newpath.length());
return pageName;
}
Passing in /sub/dir/path.html returns path.html.
Hope this helps. :)

See the URI class: http://docs.oracle.com/javase/7/docs/api/java/net/URI.html
Just build a new URI instance from the string you have (http://hostname/name/myPage.jsp?param=7) and then you can access its parts. What you want is probably a combination of getPath() and getQuery().
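Note that getPath() returns the whole path (/name/myPage.jsp) and getQuery() returns the query without the "?", so the two still need a little trimming and gluing. A minimal sketch, with a hypothetical class and method name, could look like this:

```java
import java.net.URI;

public class RefererPage {
    // Returns the last path segment plus the query string, if any.
    static String pageWithQuery(String referer) {
        URI uri = URI.create(referer);
        String path = uri.getPath();
        if (path == null) {
            return null; // opaque URI without a path
        }
        // Keep only what follows the last "/" of the path.
        String page = path.substring(path.lastIndexOf('/') + 1);
        String query = uri.getRawQuery();
        return query == null ? page : page + "?" + query;
    }
}
```

For http://hostname/name/myPage.jsp?param=7 this yields myPage.jsp?param=7. getRawQuery() is used so the query is returned exactly as it appeared in the header, without decoding.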

Pattern p = Pattern.compile("[a-zA-Z]+\\.jsp.*"); // the dot is escaped so that it matches a literal "."
Matcher m = p.matcher("http://hostname/name/myPage.jsp?param=7");
if(m.find())
{
System.out.println(m.group());
}
