Pattern from String

Pattern from String - java

i want to extract pattern from a string for ex:
string x== "1234567 - israel.ekpo#massivelogdata.net cc55ZZ35 1789 Hello Grok";
pattern its should generate is = "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}"
basically i want to create patter for logs generated in the java application. any idea how to do that ?

In past i've do some with reguar expression, but in my case the string having ever the same composition pattern or order.
I this case, you can done 3 matching pattern and make the find operation 3 times in order of pattern.
If not so, you must use an text analyzer or search tool.

It's recommended to use grow library to extract data from logs.
Example:
public final class GrokStage {
private static final void displayResults(final Map<String, String> results) {
if (results != null) {
for(Map.Entry<String, String> entry : results.entrySet()) {
System.out.println(entry.getKey() + "=" + entry.getValue());
}
}
}
public static void main(String[] args) {
final String rawDataLine1 = "1234567 - israel.ekpo#massivelogdata.net cc55ZZ35 1789 Hello Grok";
final String expression = "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}";
final GrokDictionary dictionary = new GrokDictionary();
// Load the built-in dictionaries
dictionary.addBuiltInDictionaries();
// Resolve all expressions loaded
dictionary.bind();
// Take a look at how many expressions have been loaded
System.out.println("Dictionary Size: " + dictionary.getDictionarySize());
Grok compiledPattern = dictionary.compileExpression(expression);
displayResults(compiledPattern.extractNamedGroups(rawDataLine1));
}
}
Output:
username=israel.ekpo#massivelogdata.net
password=cc55ZZ35
yearOfBirth=1789
Note:
This are the patterns used before:
EMAIL %{\S+}#%{\b\w+\b}\.%{[a-zA-Z]+}
USERNAME [a-zA-Z0-9._-]+
INT (?:[+-]?(?:[0-9]+))
More info about grok-patterns: BuiltInDictionary.java

Related

Language detections with apache Tika

I am currently trying to get along with Apache Tika and set up a language detection that checks all keyValues of my various properties files for the correct language of the respective file. Unfortunately the detection is not really good..All keys are not recognized with the correct language and I don't know how I can do it better. An api solution is out of the question, because I have the order to find a free way and most free connections only allow 1000 calls per day (in german alone I have more than 14000 keys).
If you know how I can make the current code better or maybe have another solution, please let me know!
Thanks a lot,
Pascal
Thats my Current code:
import java.util.Set;
import org.apache.tika.language.LanguageIdentifier;
public class detect {
#SuppressWarnings("deprecation")
public static void main(String[] args) throws Exception {
final MyPropAllKeys mPAK = new MyPropAllKeys("messages_forCheck.properties");
final Set<Object> keys = mPAK.getAllKeys();
for (final Object key : keys) {
final String keyString = key.toString();
final String keyValueString = mPAK.getPropertyValue(keyString);
detect(keyValueString, key);
}
}
public static void detect(String keyValueString, Object key) {
final LanguageIdentifier languageIdentifier = new LanguageIdentifier(keyValueString);
final String language = languageIdentifier.getLanguage();
if (!language.equals("de")) {
System.out.println(language + " " + key + ": " + keyValueString);
}
}
}
For Example thats some of the Results:
pt de.segal.baoss.platform.entity.BackgroundTaskType.MASS_INVOICE_DOCUMENT_CREATION: Rechnungsdokumente erzeugen
sk de.segal.baoss.purchase.supplier.creditorNumber: Kreditorennummer
no de.segal.baoss.module.crm.revenueLastYear: Umsatz vergangenes Jahr
no de.segal.baoss.module.op.customerReturn.action.createCreditEntry: Gutschrift erstellen
All are definitely German

Saving and Loading Trained Stanford classifier in java

I have a dataset of 1 million labelled sentences and using it for finding sentiment through Maximum Entropy. I am using Stanford Classifier for the same:-
public class MaximumEntropy {
static ColumnDataClassifier cdc;
public static float calMaxEntropySentiment(String text) {
initializeProperties();
float sentiment = (getMaxEntropySentiment(text));
return sentiment;
}
public static void initializeProperties() {
cdc = new ColumnDataClassifier(
"\\stanford-classifier-2016-10-31\\properties.prop");
}
public static int getMaxEntropySentiment(String tweet) {
String filteredTweet = TwitterUtils.filterTweet(tweet);
System.out.println("Reading training file");
Classifier<String, String> cl = cdc.makeClassifier(cdc.readTrainingExamples(
"\\stanford-classifier-2016-10-31\\labelled_sentences.txt"));
Datum<String, String> d = cdc.makeDatumFromLine(filteredTweet);
System.out.println(filteredTweet + " ==> " + cl.classOf(d) + " " + cl.scoresOf(d));
// System.out.println("Class score is: " +
// cl.scoresOf(d).getCount(cl.classOf(d)));
if (cl.classOf(d) == "0") {
return 0;
} else {
return 4;
}
}
}
My data is labelled 0 or 1. Now for each tweet the whole dataset is being read and it is taking a lot of time considering the size of dataset.
My query is that is there any way to first train the classifier and then load it when a tweet's sentiment is to be found. I think this approach will take less time. Correct me if I am wrong.
The following link provides this but there is nothing for JAVA API.
Saving and Loading Classifier
Any help would be appreciated.

Yes; the easiest way to do this is using Java's default serialization mechanism to serialize a classifier. A useful helper here is the IOUtils class:
IOUtils.writeObjectToFile(classifier, "/path/to/file");
To read the classifier:
Classifier<String, String> cl = IOUtils.readObjectFromFile(new File("/path/to/file");

Find and Delete a Sub-String in Java

I'm trying to find a Java sub-string and then delete it without deleting the rest of the string.
I am taking XML as input and would like to delete a deprecated tag, so for instance:
public class whatever {
public static void main(String[] args) {
String uploadedXML = "<someStuff>Bats!</someStuff> <name></name>";
CharSequence deleteRaise = "<name>";
// If an Addenda exists we continue with the process
if (xml_in.contains(deleteRaise)){
// delete
} else {
// Carry on
}
}
In there I would like to delete the <name> and </name> tags if they are included in the string while leaving <someStuff> and </someStuff>.
I already parsed the XML to a String so there's no problem there. I need to know how to find the specific strings and delete them.

You can use replaceAll(regex, str) to do this. If you're not familiar with regex, the ? just means there can be 0 or 1 occurrences of / in the string, so it covers <name> and </name>
String uploadedXML = "<someStuff>Bats!</someStuff> <name></name>";
String filter = "</?name>";
uploadedXML = uploadedXML.replaceAll(filter, "");
System.out.println(uploadedXML);
<someStuff>Bats!</someStuff>

String uploadedXML = "<someStuff>Bats!</someStuff> <name></name>";
String deleteRaise = "<name>";
String closeName = "</name>"
// If an Addenda exists we continue with the process
if (xml_in.contains(deleteRaise)){
uploadedXML.replace(uploadedXML.substring(uploadedXML.indexOf(deleteRaise),uploadedXML.indexOf(closeName)+1),"");
} else {
// Carry on
}enter code here

Get domain name from given url

Given a URL, I want to extract domain name(It should not include 'www' part). Url can contain http/https. Here is the java code that I wrote. Though It seems to work fine, is there any better approach or are there some edge cases, that could fail.
public static String getDomainName(String url) throws MalformedURLException{
if(!url.startsWith("http") && !url.startsWith("https")){
url = "http://" + url;
}
URL netUrl = new URL(url);
String host = netUrl.getHost();
if(host.startsWith("www")){
host = host.substring("www".length()+1);
}
return host;
}
Input: http://google.com/blah
Output: google.com

If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems -- its equals method does a DNS lookup which means code using it can be vulnerable to denial of service attacks when used with untrusted inputs.
"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.
public static String getDomainName(String url) throws URISyntaxException {
URI uri = new URI(url);
String domain = uri.getHost();
return domain.startsWith("www.") ? domain.substring(4) : domain;
}
should do what you want.
Though It seems to work fine, is there any better approach or are there some edge cases, that could fail.
Your code as written fails for the valid URLs:
httpfoo/bar -- relative URL with a path component that starts with http.
HTTP://example.com/ -- protocol is case-insensitive.
//example.com/ -- protocol relative URL with a host
www/foo -- a relative URL with a path component that starts with www
wwwexample.com -- domain name that does not starts with www. but starts with www.
Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.
If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy"
disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the
potential five components of a URI reference.
The following line is the regular expression for breaking-down a
well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
12 3 4 5 6 7 8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis).

import java.net.*;
import java.io.*;
public class ParseURL {
public static void main(String[] args) throws Exception {
URL aURL = new URL("http://example.com:80/docs/books/tutorial"
+ "/index.html?name=networking#DOWNLOADING");
System.out.println("protocol = " + aURL.getProtocol()); //http
System.out.println("authority = " + aURL.getAuthority()); //example.com:80
System.out.println("host = " + aURL.getHost()); //example.com
System.out.println("port = " + aURL.getPort()); //80
System.out.println("path = " + aURL.getPath()); // /docs/books/tutorial/index.html
System.out.println("query = " + aURL.getQuery()); //name=networking
System.out.println("filename = " + aURL.getFile()); ///docs/books/tutorial/index.html?name=networking
System.out.println("ref = " + aURL.getRef()); //DOWNLOADING
}
}
Read more

Here is a short and simple line using InternetDomainName.topPrivateDomain() in Guava: InternetDomainName.from(new URL(url).getHost()).topPrivateDomain().toString()
Given http://www.google.com/blah, that will give you google.com. Or, given http://www.google.co.mx, it will give you google.co.mx.
As Sa Qada commented in another answer on this post, this question has been asked earlier: Extract main domain name from a given url. The best answer to that question is from Satya, who suggests Guava's InternetDomainName.topPrivateDomain()
public boolean isTopPrivateDomain()
Indicates whether this domain name is composed of exactly one
subdomain component followed by a public suffix. For example, returns
true for google.com and foo.co.uk, but not for www.google.com or
co.uk.
Warning: A true result from this method does not imply that the
domain is at the highest level which is addressable as a host, as many
public suffixes are also addressable hosts. For example, the domain
bar.uk.com has a public suffix of uk.com, so it would return true from
this method. But uk.com is itself an addressable host.
This method can be used to determine whether a domain is probably the
highest level for which cookies may be set, though even that depends
on individual browsers' implementations of cookie controls. See RFC
2109 for details.
Putting that together with URL.getHost(), which the original post already contains, gives you:
import com.google.common.net.InternetDomainName;
import java.net.URL;
public class DomainNameMain {
public static void main(final String... args) throws Exception {
final String urlString = "http://www.google.com/blah";
final URL url = new URL(urlString);
final String host = url.getHost();
final InternetDomainName name = InternetDomainName.from(host).topPrivateDomain();
System.out.println(urlString);
System.out.println(host);
System.out.println(name);
}
}

I wrote a method (see below) which extracts a url's domain name and which uses simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there's no "://" contained) and the first subsequent "/" (or index String.length() if there's no subsequent "/"). The remaining, preceding "www(_)*." bit is chopped off. I'm sure there'll be cases where this won't be good enough but it should be good enough in most cases!
Mike Samuel's post above says that the java.net.URI class could do this (and was preferred to the java.net.URL class) but I encountered problems with the URI class. Notably, URI.getHost() gives a null value if the url does not include the scheme, i.e. the "http(s)" bit.
/**
* Extracts the domain name from {#code url}
* by means of String manipulation
* rather than using the {#link URI} or {#link URL} class.
*
* #param url is non-null.
* #return the domain name within {#code url}.
*/
public String getUrlDomainName(String url) {
String domainName = new String(url);
int index = domainName.indexOf("://");
if (index != -1) {
// keep everything after the "://"
domainName = domainName.substring(index + 3);
}
index = domainName.indexOf('/');
if (index != -1) {
// keep everything before the '/'
domainName = domainName.substring(0, index);
}
// check for and remove a preceding 'www'
// followed by any sequence of characters (non-greedy)
// followed by a '.'
// from the beginning of the string
domainName = domainName.replaceFirst("^www.*?\\.", "");
return domainName;
}

I made a small treatment after the URI object creation
if (url.startsWith("http:/")) {
if (!url.contains("http://")) {
url = url.replaceAll("http:/", "http://");
}
} else {
url = "http://" + url;
}
URI uri = new URI(url);
String domain = uri.getHost();
return domain.startsWith("www.") ? domain.substring(4) : domain;

In my case i only needed the main domain and not the subdomain (no "www" or whatever the subdomain is) :
public static String getUrlDomain(String url) throws URISyntaxException {
URI uri = new URI(url);
String domain = uri.getHost();
String[] domainArray = domain.split("\\.");
if (domainArray.length == 1) {
return domainArray[0];
}
return domainArray[domainArray.length - 2] + "." + domainArray[domainArray.length - 1];
}
With this method the url "https://rest.webtoapp.io/llSlider?lg=en&t=8" will have for domain "webtoapp.io".

val host = url.split("/")[2]

All the above are good. This one seems really simple to me and easy to understand. Excuse the quotes. I wrote it for Groovy inside a class called DataCenter.
static String extractDomainName(String url) {
int start = url.indexOf('://')
if (start < 0) {
start = 0
} else {
start += 3
}
int end = url.indexOf('/', start)
if (end < 0) {
end = url.length()
}
String domainName = url.substring(start, end)
int port = domainName.indexOf(':')
if (port >= 0) {
domainName = domainName.substring(0, port)
}
domainName
}
And here are some junit4 tests:
#Test
void shouldFindDomainName() {
assert DataCenter.extractDomainName('http://example.com/path/') == 'example.com'
assert DataCenter.extractDomainName('http://subpart.example.com/path/') == 'subpart.example.com'
assert DataCenter.extractDomainName('http://example.com') == 'example.com'
assert DataCenter.extractDomainName('http://example.com:18445/path/') == 'example.com'
assert DataCenter.extractDomainName('example.com/path/') == 'example.com'
assert DataCenter.extractDomainName('example.com') == 'example.com'
}

try this one : java.net.URL;
JOptionPane.showMessageDialog(null, getDomainName(new URL("https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains")));
public String getDomainName(URL url){
String strDomain;
String[] strhost = url.getHost().split(Pattern.quote("."));
String[] strTLD = {"com","org","net","int","edu","gov","mil","arpa"};
if(Arrays.asList(strTLD).indexOf(strhost[strhost.length-1])>=0)
strDomain = strhost[strhost.length-2]+"."+strhost[strhost.length-1];
else if(strhost.length>2)
strDomain = strhost[strhost.length-3]+"."+strhost[strhost.length-2]+"."+strhost[strhost.length-1];
else
strDomain = strhost[strhost.length-2]+"."+strhost[strhost.length-1];
return strDomain;}

There is a similar question Extract main domain name from a given url. If you take a look at this answer , you will see that it is very easy. You just need to use java.net.URL and String utility - Split

One of the way I did and worked for all of the cases is using Guava Library and regex in combination.
public static String getDomainNameWithGuava(String url) throws MalformedURLException,
URISyntaxException {
String host =new URL(url).getHost();
String domainName="";
try{
domainName = InternetDomainName.from(host).topPrivateDomain().toString();
}catch (IllegalStateException | IllegalArgumentException e){
domainName= getDomain(url,true);
}
return domainName;
}
getDomain() can be any common method with regex.

private static final String hostExtractorRegexString = "(?:https?://)?(?:www\\.)?(.+\\.)(com|au\\.uk|co\\.in|be|in|uk|org\\.in|org|net|edu|gov|mil)";
private static final Pattern hostExtractorRegexPattern = Pattern.compile(hostExtractorRegexString);
public static String getDomainName(String url){
if (url == null) return null;
url = url.trim();
Matcher m = hostExtractorRegexPattern.matcher(url);
if(m.find() && m.groupCount() == 2) {
return m.group(1) + m.group(2);
}
return null;
}
Explanation :
The regex has 4 groups. The first two are non-matching groups and the next two are matching groups.
The first non-matching group is "http" or "https" or ""
The second non-matching group is "www." or ""
The second matching group is the top level domain
The first matching group is anything after the non-matching groups and anything before the top level domain
The concatenation of the two matching groups will give us the domain/host name.
PS : Note that you can add any number of supported domains to the regex.

If the input url is user input. this method gives the most appropriate host name. if not found gives back the input url.
private String getHostName(String urlInput) {
urlInput = urlInput.toLowerCase();
String hostName=urlInput;
if(!urlInput.equals("")){
if(urlInput.startsWith("http") || urlInput.startsWith("https")){
try{
URL netUrl = new URL(urlInput);
String host= netUrl.getHost();
if(host.startsWith("www")){
hostName = host.substring("www".length()+1);
}else{
hostName=host;
}
}catch (MalformedURLException e){
hostName=urlInput;
}
}else if(urlInput.startsWith("www")){
hostName=urlInput.substring("www".length()+1);
}
return hostName;
}else{
return "";
}
}

To get the actual domain name, without the subdomain, I use:
private String getDomainName(String url) throws URISyntaxException {
String hostName = new URI(url).getHost();
if (!hostName.contains(".")) {
return hostName;
}
String[] host = hostName.split("\\.");
return host[host.length - 2];
}
Note that this won't work with second-level domains (like .co.uk).

// groovy
String hostname ={url -> url[(url.indexOf('://')+ 3)..-1].split('/')[0] }
hostname('http://hello.world.com/something') // return 'hello.world.com'
hostname('docker://quay.io/skopeo/stable') // return 'quay.io'

const val WWW = "www."
fun URL.domain(): String {
val domain: String = this.host
return if (domain.startsWith(ConstUtils.WWW)) {
domain.substring(ConstUtils.WWW.length)
} else {
domain
}
}

I use regex solution
public static String getDomainName(String url) {
return url.replaceAll("http(s)?://|www\\.|wap\\.|/.*", "");
}
It cleans url from "http/https/www./wap." and from all unnecessary things after / like "/questions" in "https://stackoverflow.com/questions" and we get just "stackoverflow.com"

How to normalize a URL in Java?

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.
Strategies include adding trailing slashes, https => http, etc. The Wikipedia page lists many.
Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller and fewer dependencies is better.
I'll handcode something for now and keep an eye on this question.
EDIT: I want to aggressively normalize to count URLs as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, utm_campaign. For example, I ignore subdomain if the title is the same.

Have you taken a look at the URI class?
http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()

I found this question last night, but there wasn't an answer I was looking for so I made my own. Here it is incase somebody in the future wants it:
/**
* - Covert the scheme and host to lowercase (done by java.net.URL)
* - Normalize the path (done by java.net.URI)
* - Add the port number.
* - Remove the fragment (the part after the #).
* - Remove trailing slash.
* - Sort the query string params.
* - Remove some query string params like "utm_*" and "*session*".
*/
public class NormalizeURL
{
public static String normalize(final String taintedURL) throws MalformedURLException
{
final URL url;
try
{
url = new URI(taintedURL).normalize().toURL();
}
catch (URISyntaxException e) {
throw new MalformedURLException(e.getMessage());
}
final String path = url.getPath().replace("/$", "");
final SortedMap<String, String> params = createParameterMap(url.getQuery());
final int port = url.getPort();
final String queryString;
if (params != null)
{
// Some params are only relevant for user tracking, so remove the most commons ones.
for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
{
final String key = i.next();
if (key.startsWith("utm_") || key.contains("session"))
{
i.remove();
}
}
queryString = "?" + canonicalize(params);
}
else
{
queryString = "";
}
return url.getProtocol() + "://" + url.getHost()
+ (port != -1 && port != 80 ? ":" + port : "")
+ path + queryString;
}
/**
* Takes a query string, separates the constituent name-value pairs, and
* stores them in a SortedMap ordered by lexicographical order.
* #return Null if there is no query string.
*/
private static SortedMap<String, String> createParameterMap(final String queryString)
{
if (queryString == null || queryString.isEmpty())
{
return null;
}
final String[] pairs = queryString.split("&");
final Map<String, String> params = new HashMap<String, String>(pairs.length);
for (final String pair : pairs)
{
if (pair.length() < 1)
{
continue;
}
String[] tokens = pair.split("=", 2);
for (int j = 0; j < tokens.length; j++)
{
try
{
tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
}
catch (UnsupportedEncodingException ex)
{
ex.printStackTrace();
}
}
switch (tokens.length)
{
case 1:
{
if (pair.charAt(0) == '=')
{
params.put("", tokens[0]);
}
else
{
params.put(tokens[0], "");
}
break;
}
case 2:
{
params.put(tokens[0], tokens[1]);
break;
}
}
}
return new TreeMap<String, String>(params);
}
/**
* Canonicalize the query string.
*
* #param sortedParamMap Parameter name-value pairs in lexicographical order.
* #return Canonical form of query string.
*/
private static String canonicalize(final SortedMap<String, String> sortedParamMap)
{
if (sortedParamMap == null || sortedParamMap.isEmpty())
{
return "";
}
final StringBuffer sb = new StringBuffer(350);
final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();
while (iter.hasNext())
{
final Map.Entry<String, String> pair = iter.next();
sb.append(percentEncodeRfc3986(pair.getKey()));
sb.append('=');
sb.append(percentEncodeRfc3986(pair.getValue()));
if (iter.hasNext())
{
sb.append('&');
}
}
return sb.toString();
}
/**
* Percent-encode values according the RFC 3986. The built-in Java URLEncoder does not encode
* according to the RFC, so we make the extra replacements.
*
* #param string Decoded string.
* #return Encoded string per RFC 3986.
*/
private static String percentEncodeRfc3986(final String string)
{
try
{
return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
}
catch (UnsupportedEncodingException e)
{
return string;
}
}
}

Because you also want to identify URLs which refer to the same content, I found this paper from the WWW2007 pretty interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It provides you with a nice theoretical approach.

No, there is nothing in the standard libraries to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, etc.
e.g. http://ACME.com/./foo%26bar becomes:
http://acme.com/foo&bar
URI's normalize() does not do this.

The RL library:
https://github.com/backchatio/rl
goes quite a ways beyond java.net.URL.normalize().
It's in Scala, but I imagine it should be useable from Java.

You can do this with the Restlet framework using Reference.normalize(). You should also be able to remove the elements you don't need quite conveniently with this class.

In Java, normalize parts of a URL
Example of a URL: https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment
protocol: https
domain name: i0.wp.com
subdomain: i0
port: 55
path: /lplresearch.com/wp-content/uploads/2019/01/feb.png?ssl=1
query: ?ssl=1"
parameters: &myvar=2
fragment: #myfragment
Code to do the URL parsing:
import java.util.*;
import java.util.regex.*;
public class regex {
public static String getProtocol(String the_url){
Pattern p = Pattern.compile("^(http|https|smtp|ftp|file|pop)://.*");
Matcher m = p.matcher(the_url);
return m.group(1);
}
public static String getParameters(String the_url){
Pattern p = Pattern.compile(".*(\\?[-a-zA-Z0-9_.#!$&''()*+,;=]+)(#.*)*$");
Matcher m = p.matcher(the_url);
return m.group(1);
}
public static String getFragment(String the_url){
Pattern p = Pattern.compile(".*(#.*)$");
Matcher m = p.matcher(the_url);
return m.group(1);
}
public static void main(String[] args){
String the_url =
"https://i0.wp.com:55/lplresearch.com/" +
"wp-content/feb.png?ssl=1&myvar=2#myfragment";
System.out.println(getProtocol(the_url));
System.out.println(getFragment(the_url));
System.out.println(getParameters(the_url));
}
}
Prints
https
#myfragment
?ssl=1&myvar=2
You can then push and pull on the parts of the URL until they are up to muster.

Im have a simple way to solve it. Here is my code
public static String normalizeURL(String oldLink)
{
int pos=oldLink.indexOf("://");
String newLink="http"+oldLink.substring(pos);
return newLink;
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Pattern from String - java

In past i've do some with reguar expression, but in my case the string having ever the same composition pattern or order. I this case, you can done 3 matching pattern and make the find operation 3 times in order of pattern. If not so, you must use an text analyzer or search tool.

Related

Language detections with apache Tika

Saving and Loading Trained Stanford classifier in java

Find and Delete a Sub-String in Java

Get domain name from given url

How to normalize a URL in Java?

Categories

Resources