Deleting all words matching a regex pattern


I would like to remove the character sequences like "htsap://" or "ftsap://" from a String. Is it possible?
Let me illustrate my needs with an example.
Actual input String:
"Every Web page has a http unique address called a URL (Uniform Resource Locator) which identifies where it is located on the Web. For "ftsap://"example, the URL for CSM Library's home page is: "htsap://"www.smccd.edu/accounts/csmlibrary/index.htm The basic parts of a URL often provide \"clues\" to htsap://where a web page originates and who might be responsible for the information at that page or site."
Expected resulting String:
"Every Web page has a http unique address called a URL (Uniform Resource Locator) which identifies where it is located on the Web. For example, the URL for CSM Library's home page is: www.smccd.edu/accounts/csmlibrary/index.htm The basic parts of a URL often provide \"clues\" to where a web page originates and who might be responsible for the information at that page or site."
Patterns I tried (I'm not sure this is the right approach):
((.*?)(?=("htsap://|ftsap://")))
and:
((.*?)(?=("htsap://|ftsap://")))(.*)
Could anyone suggest a pattern that works here?

Since you're escaping your quotes within your sample Strings, I'll assume you're working in Java.
You should try:
final String res = input.replaceAll("\"?\\w+://\"?", "");
A linked working example shows exactly what this regex matches.
How it works:
It matches and removes any sequence of alphanumeric characters (and underscores), followed by :// and possibly preceded and/or followed by ".
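As a quick check of the one-liner above, here is a minimal sketch that wraps it in a method and runs it on a shortened version of the input. The class name SchemeStripper and the sample string are made up for illustration. Keep in mind that the pattern removes any word followed by ://, so a genuine "http://" prefix would be stripped as well.

```java
class SchemeStripper {
    // Same pattern as in the answer: optional quote, word characters, "://", optional quote.
    static String strip(String input) {
        return input.replaceAll("\"?\\w+://\"?", "");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical sample based on the question's input.
        String sample = "For \"ftsap://\"example, see \"htsap://\"www.smccd.edu and htsap://where";
        System.out.println(strip(sample));
    }
}
```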
EDIT: How to achieve the same result using a Matcher?
final String input = "Every Web page has a http unique address called a URL (Uniform Resource Locator) which identifies where it is located on the Web. For \"ftsap://\"example, the URL for CSM Library's home page is: \"htsap://\"www.smccd.edu/accounts/csmlibrary/index.htm The basic parts of a URL often provide \"clues\" to htsap://where a web page originates and who might be responsible for the information at that page or site.";
final Pattern p = Pattern.compile("\"?\\w+://\"?");
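// Needs java.util.regex.Pattern and java.util.regex.Matcher.
final Matcher m = p.matcher(input);
final StringBuffer b = new StringBuffer();
while (m.find()) {
    m.appendReplacement(b, ""); // copy the text before the match and drop the match itself
}
m.appendTail(b); // copy whatever follows the last match
System.out.println(b.toString());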

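Use this regex:
"(ftsap|htsap).//"
and replace it with an empty string.
Regex explained: the group (ftsap|htsap) matches either prefix, the . matches any single character (here the colon), and // matches the two slashes. If the quotes are taken as part of the pattern, it matches only the quoted occurrences. The g flag, in engines that have one, applies the replacement to every match.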
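In Java, the alternation idea could be sketched as below. This is an adaptation rather than the answer's exact pattern: the colon is written literally and the surrounding quotes are made optional so that the unquoted occurrence is removed too. The class name KnownSchemeStripper is hypothetical. Java has no g flag; replaceAll already replaces every match.

```java
import java.util.regex.Pattern;

class KnownSchemeStripper {
    // Only the two bogus prefixes are targeted; real schemes such as http:// are left alone.
    private static final Pattern BOGUS = Pattern.compile("\"?(?:ftsap|htsap)://\"?");

    static String strip(String input) {
        return BOGUS.matcher(input).replaceAll("");
    }
}
```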

Related

Java Regex to Match URL

I need to create Regexes to match URLs of the following forms
/collected/{deliveryId}/deliverer/{userId}
/customer/{userId}/status/active
/users/{userId}/role
Where delivery-id and user-id are UUIDs in the form of: 124r23452-124234234-123123423534 and the other string parts are constant.
For the first one I tried something like this, but it didn't work:
String urlRegex = "[a-zA-Z-]*/collected/deliverer/(?=\\S*[-])([a-zA-Z-]+)";

You can try this pattern: \/collected\/\w{0,9}\/deliverer\/\w{0,} and test it on https://regex101.com/. The Wikipedia page on regular expressions also gives some good details on regex.
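Note that \w does not match a hyphen, so the pattern above would stop at the first - in an ID such as 124r23452-124234234-123123423534. A minimal Java sketch for all three URL forms follows, assuming an ID is any non-empty run of letters, digits, and hyphens. The class and constant names are made up for illustration, and the ID rule should be tightened if the real format is stricter.

```java
import java.util.regex.Pattern;

class RoutePatterns {
    // Assumption: an ID is one or more letters, digits, or hyphens.
    private static final String ID = "[A-Za-z0-9-]+";

    static final Pattern COLLECTED =
            Pattern.compile("/collected/" + ID + "/deliverer/" + ID);
    static final Pattern CUSTOMER_ACTIVE =
            Pattern.compile("/customer/" + ID + "/status/active");
    static final Pattern USER_ROLE =
            Pattern.compile("/users/" + ID + "/role");

    // matches() requires the whole path to fit the pattern, so no ^ or $ is needed.
    static boolean matches(Pattern route, String path) {
        return route.matcher(path).matches();
    }
}
```

For example, RoutePatterns.matches(RoutePatterns.USER_ROLE, "/users/124r23452-124234234-123123423534/role") returns true.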

Regex to validate wildcard domains with special conditions

I am looking to validate wildcards against Samsung Knox Firewall. Please see below the full criteria for all domains:
A list of URLs for specified domain names to block DNS resolution. The format of the URL must be compliant with RFC's standards and must also match one of the following rules:
Full URL: "www.google.com"
Partial URL: "android.com"; "www.samsung"; "google". The character "*" (wildcard) must be at the beginning and/or at the end of the URL, otherwise the URL is invalid.
Special case, matches any URL : "*"
Valid domains
The following examples are considered valid by Knox.
*.test.com
*test.com
*test
*test*
test.*
test1.test.*
Invalid domains
The following examples are considered invalid by Knox.
*test-
*test.
*test.com-
*test-.com
Is anybody able to offer a hand? I am struggling to accommodate all of the requirements with this one.
Current code:
(?=^\*|.*\*$)^(?:\*\.?)?(?:(?:[a-z0-9-]+(?(?=\.)(?<!-)\.(?!-)))+[a-z]+)(?:\.?\*)?$
Edit: Actually, it looks like conditional regex may not even be supported in Java.

BASED ON YOUR PROVIDED EXAMPLES
If you're trying to pre-filter the domains, then this one matches all of your "Valid" examples and rejects all of your "Invalid" examples
^[\w*]([\w*-]+[\w*])?(\.[\w*]([\w*-]+[\w*])?)*$
If there's a file or carriage return separated field with all of these in it that you're trying to test, you may want to use the "multiline" switch like so:
(?m)^[\w*]([\w*-]+[\w*])?(\.[\w*]([\w*-]+[\w*])?)*$
Since you tagged Java, that would be encoded into a Java string as follows:
"(?m)^[\\w*]([\\w*-]+[\\w*])?(\\.[\\w*]([\\w*-]+[\\w*])?)*$"
EDIT - Matching all the rules, in addition to your provided examples
This expression seems to work:
^(\*|(\*|\*\.)?\w+(\.\w+)*(\.\*|\*)?)$
Matching/Non-matching examples:
MATCHING        NON-MATCHING
------------    ------------
*               *test-
*.test.com      *test.
*test.com       *test.com-
*test           *test-.com
*test*          test*.com
test.*          test.*com
test1.test.*    -test.com
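The last expression can be tried from Java with a small sketch like the one below. The class name KnoxDomainCheck is hypothetical. Note that \w also accepts underscores, which may be looser than what Knox allows, and that the pattern avoids the conditional construct that Java does not support.

```java
import java.util.regex.Pattern;

class KnoxDomainCheck {
    // The expression from the answer, escaped for a Java string literal.
    static final Pattern DOMAIN =
            Pattern.compile("^(\\*|(\\*|\\*\\.)?\\w+(\\.\\w+)*(\\.\\*|\\*)?)$");

    static boolean isValid(String domain) {
        return DOMAIN.matcher(domain).matches();
    }
}
```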

How to determine the final URL from a link in Java

This is a link generated by Google Alerts, and I would like to find out where it redirects to. I need that target URL and have to retrieve it with Java. I have checked the response, but it contains no Location header for a redirect.
https://www.google.com/url?rct=j&sa=t&url=http://naija247news.com/2016/03/nigerian-bond-yields-rise-after-cbns-interest-rate-hike-aimed-at-luring-investors/&ct=ga&cd=CAIyGjA3ZmJiYzk0ZDM0N2U2MjU6Y29tOmVuOlVT&usg=AFQjCNGs7HsYSodEUnECfdAatG6KgY18DA

Maybe something like this:
String URL = "https://www.google.com/url?rct=j&sa=t&url=http://naija247news.com/2016/03/nigerian-bond-yields-rise-after-cbns-interest-rate-hike-aimed-at-luring-investors/&ct=ga&cd=CAIyGjA3ZmJiYzk0ZDM0N2U2MjU6Y29tOmVuOlVT&usg=AFQjCNGs7HsYSodEUnECfdAatG6KgY18DA";
String subStr = URL.substring(URL.indexOf("url=") + 4, URL.indexOf("&ct"));
The start index is the position of "url=" plus 4 (the length of "url="), and the end index is the position of "&ct". The basic idea is to cut out the URL you need and nothing more. This works for the link you posted; a different link may need a different end marker (here I look for &ct, which may not be present in another URL), so check several of your links to see how the target URL has to be cut out.
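An alternative that does not depend on which parameter follows url= is to split the query string and pick the url parameter by name. The sketch below assumes the target is always carried in a parameter called url, as in the posted link. The class name RedirectTarget is made up, and URLDecoder.decode(String, Charset) needs Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class RedirectTarget {
    // Returns the decoded value of the "url" query parameter, or null if there is none.
    static String targetOf(String link) {
        String query = URI.create(link).getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("url")) {
                return URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            }
        }
        return null;
    }
}
```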

Some information about pattern matching in a Java web crawler using the crawler4j library

I want to implement a very simple web crawler in Java, and I have found this library: crawler4j: http://code.google.com/p/crawler4j/
I need a crawler that does the following:
Start from a URL (specified by me) and check whether the current page contains a specific word, such as a person's name or a company name (this word is also specified by me).
If the word is found, the current page's URL has to be saved in a database.
So there is no semantic analysis, only syntactic analysis (the crawler has to try to match the web page content against some token specified by me).
I would like to know whether this token search (checking whether a word is contained in the current page) is a feature already implemented by the abstract class WebCrawler of crawler4j, or whether I have to implement it myself.

As noted by user1887511, it is dead simple to implement. Adapted from a linked example.
static String wordToFind = "...";
@Override
public void visit(Page page) {
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        // saveToDB stands for your own persistence code
        if (text.indexOf(wordToFind) != -1) {
            saveToDB(page.getWebURL().getURL());
        }
    }
}

You have to implement it yourself. A starting point in the code is the visit() method of your WebCrawler subclass: it is called when a page has been visited, and the parsed page is passed to you. You can then do whatever you want with the page text, for instance using regex patterns.
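The regex idea could look like the sketch below, which is independent of crawler4j and only takes the page text as a String. The class name WordFinder is hypothetical. It matches whole words case-insensitively and assumes the search word starts and ends with a letter or digit, since \b only marks a boundary next to a word character.

```java
import java.util.regex.Pattern;

class WordFinder {
    private final Pattern pattern;

    WordFinder(String word) {
        // Pattern.quote keeps characters such as '.' in the word from being read as regex syntax.
        this.pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
    }

    // True if the word occurs as a whole word anywhere in the text.
    boolean appearsIn(String pageText) {
        return pattern.matcher(pageText).find();
    }
}
```

Inside visit(), the text returned by htmlParseData.getText() would be the argument to appearsIn.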

Match two urls with regular expressions

I have a list of URLs, and I want to match them against this URL using regular expressions:
http://investor.somehost.com/*
Here * means anything after that point, in other words a wildcard.
String href = url.getURL();
Here href holds each URL in turn.
Suppose firstentry contains the URL above (http://investor.somehost.com/*).
How can I compare href with firstentry so that, if href starts with this URL, I can take some action?

If you just want to determine whether a String starts with a particular prefix, use startsWith(String prefix).
Example:
String href = "http://google.com/mail";
if(href.startsWith("http://google.com")) {
//... Do stuff
}

"^http://investor\\.somehost\\.com/"
will match any string starting with http://investor.somehost.com/. If you want only valid URLs, you could use
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?"
If you want to allow queries,
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?(\\?([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?"
If you also need fragments,
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?(\\?([-._~:#!$&'()*+,;=/?a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?(#([-._~:#!$&'()*+,;=/?a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?"
End any of these with $ if you don't want to allow trailing (non-URL) parts of the string.
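If the entries in the list are written with * as the wildcard, as in http://investor.somehost.com/*, one possible sketch is to turn each entry into a regex by quoting the literal parts and replacing every * with .* as shown below. The class name WildcardUrlMatcher is made up, and the sketch does not validate that the input is a well-formed URL.

```java
import java.util.regex.Pattern;

class WildcardUrlMatcher {
    // Converts a string with '*' wildcards into a Pattern; everything else is matched literally.
    static Pattern compile(String wildcard) {
        String[] parts = wildcard.split("\\*", -1); // -1 keeps a trailing empty part
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String wildcard, String href) {
        return compile(wildcard).matcher(href).matches();
    }
}
```

Because the dots in the host name are quoted, http://investorXsomehost.com/ is rejected, which a hand-written pattern with unescaped dots would accept.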

I have a regular expression in a linked post that extracts the domain part of a URL no matter where in a string it may occur. It's written for JavaScript, so remove the leading '/' and the trailing '/ig'. Use it to extract the domains and compare them with a simple equals check.
