IP and hostname detection - java

I know the basics of Java, but I am not very experienced with regex or patterns, so please excuse me if I'm asking something super simple.
I'm writing a method that detects IP addresses and hostnames. I used the regex from this answer here. The problem I am encountering, though, is that sentences without symbols are counted as host names.
Here's my code:
Pattern validHostname = Pattern.compile("^(([a-z]|[a-z][a-z0-9-]*[a-z0-9]).)*([a-z]|[a-z][a-z0-9-]*[a-z0-9])$",Pattern.CASE_INSENSITIVE);
Pattern validIpAddress = Pattern.compile("^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])([:]\\d\\d*\\d*\\d*\\d*)*$",Pattern.CASE_INSENSITIVE);
String msg = c.getMessage();
boolean found = false;
// Randomly picks from a list to replace the detected ip/hostname
int rand = (int) (Math.random() * whitelisted.size());
String replace = whitelisted.get(rand);
Matcher matchIP = validIpAddress.matcher(msg);
Matcher matchHost = validHostname.matcher(msg);
while (matchIP.find()) {
    if (adreplace)
        msg = msg.replace(matchIP.group(), replace);
    else
        msg = msg.replace(matchIP.group(), "");
    found = true;
    c.setMessage(msg);
}
while (matchHost.find()) {
    if (adreplace)
        msg = msg.replace(matchHost.group(), replace);
    else
        msg = msg.replace(matchHost.group(), "");
    found = true;
    c.setMessage(msg);
}
return c;

Description
Without sample text and desired output, I'll try my best to answer your question.
I would rewrite your host name expression like this:
A: ^(?:[a-z][a-z0-9-]*[a-z0-9](?=\.[a-z]|$)\.?)+$ allows single-word names like abcdefg.
B: ^(?=(?:.*?\.){2})(?:[a-z][a-z0-9-]*[a-z0-9](?=\.[a-z]|$)\.?)+$ requires the string to contain at least two periods, like abc.defg.com. It does not allow a period at the beginning or end, or consecutive periods. The number {2} inside the lookahead sets the minimum number of dots that must appear; you can change it as you see fit.
^ match the start of the string anchor
(?: start a non-capture group, which is slightly cheaper than a capture group
[a-z][a-z0-9-]*[a-z0-9] match text, taken from your original expression
(?=\.[a-z]|$) look ahead to see if the next character is a dot followed by an a-z character, or the end of the string
\.? consume a single dot if it exists
) close the non-capture group
+ require the contents of the non-capture group to occur 1 or more times
$ match the end of the string anchor
Host names:
A Allows host name without dots
B Requires the host name to contain dots (at least two, as written)
Live Demo with a sentence with no symbols
I would also rewrite the IP expression
^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?::\d*)?$
The major differences here are that I:
removed the repeated \d* from the end, because a run such as \d*\d*\d*\d* is equivalent to a single \d* (note that your original \d\d*\d*\d*\d* requires at least one digit, so its exact equivalent is \d+)
changed the character class [:] to a single character :
turned the capture groups (...) into non-capture groups (?:...), which perform a little better
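For reference, here is a minimal sketch of how expression B and the rewritten IP expression could be dropped into Java. The class and method names (HostCheck, isHost, isIp) are made up for illustration and are not from the question; note that every backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the question.
class HostCheck {
    // Expression B: a host name with at least two dots.
    static final Pattern HOST = Pattern.compile(
            "^(?=(?:.*?\\.){2})(?:[a-z][a-z0-9-]*[a-z0-9](?=\\.[a-z]|$)\\.?)+$",
            Pattern.CASE_INSENSITIVE);

    // Rewritten IP expression with an optional :port suffix.
    static final Pattern IP = Pattern.compile(
            "^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}"
            + "(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?::\\d*)?$");

    static boolean isHost(String s) {
        return HOST.matcher(s).matches();
    }

    static boolean isIp(String s) {
        return IP.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHost("abc.defg.com"));               // true
        System.out.println(isHost("a sentence with no symbols")); // false
        System.out.println(isIp("192.168.1.1:8080"));             // true
        System.out.println(isIp("256.1.1.1"));                    // false
    }
}
```

Because both patterns are anchored with ^ and $, they test a whole string. To scan a chat message you would probably split it into tokens first and test each token separately.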

Related

Replace repeated letters in a word with exception

I would like a regex (in Java) that replaces every repeated consonant with a single letter: all repeated consonants except the "nn" of an initial "inn".
Some examples should make this clearer:
asso > aso
assso > aso
assocco > asoco
innasso > innaso
I found a way to replace all repeated letters with
Pattern.compile("([^aeiou])+\\1").matcher(text).replaceAll("$1")
I found a way to recognize if a word does not start with "inn":
Pattern.compile("^(?!inn).+").matcher(text).matches()
but I don't know how to merge them, i.e. degeminate all geminate consonants except the initial 'nn' when the word starts with 'inn'.
Can anyone help me? (I would like to solve this with a regex, in order to apply replaceAll.)
Thank you
I'm not sure why you must do this all with a single regexp, but if you must... try using negative lookbehind:
Pattern.compile("((?<!^i(?=nn))[^aeiou])+\\1")
This gobbledygook broken down:
(?=X) means: Don't consume anything, just check if X occurs here. If not, it's not a match.
(?<!X) means 'negative lookbehind': it doesn't consume any characters, but it fails to match if X occurs immediately before this exact spot. With (?<!^i), for example, the match fails whenever the text just before this spot is an 'i' at the very start of the string.
(?<!^i(?=nn)) does not consume anything, but it fails for any position where the following holds: Immediately before the 'cursor' there is an i, and before that, the start of the string. After the 'cursor' there are 2 n's. If that all holds, fail. Otherwise do nothing (continue processing).
The rest is then just what you wrote already.
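As a rough sketch, the lookbehind idea can be wrapped in a small method and applied to one word at a time. Two things here are my own assumptions rather than part of the answer above: the names Degeminate and degeminate are made up, and the quantifier is moved onto the backreference (\\1+ instead of (...)+\\1) so that only a run of the same consonant is collapsed.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative.
class Degeminate {
    // Same lookbehind as above, but the repetition sits on the backreference,
    // so a consonant followed by one or more copies of itself is matched.
    static final Pattern DOUBLED =
            Pattern.compile("((?<!^i(?=nn))[^aeiou])\\1+");

    // Expects a single word, because ^ only matches at the start of the input.
    static String degeminate(String word) {
        return DOUBLED.matcher(word).replaceAll("$1");
    }

    public static void main(String[] args) {
        for (String w : new String[] {"asso", "assso", "assocco", "innasso"}) {
            System.out.println(w + " > " + degeminate(w));
        }
    }
}
```

For the four sample words from the question this should print aso, aso, asoco and innaso.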
One option could be to capture a word that starts with inn using a negative lookbehind (?<!\S) in group 1, and capture matching [^aeiou] in group 2 and repeat the backreference to that group 1 or more times.
(?<!\S)(inn)|([^aeiou\r\n])\2+
Explanation
(?<!\S) Negative lookbehind, assert what is on the left is not a non whitespace char
(inn) Capture group 1, match inn
| Or
( Capture group 2
[^aeiou\r\n] Match any char except the listed
)\2+ Close group and repeat 1+ times what was captured in group 2
Regex demo | Java demo
In the replacement use the 2 capturing groups $1$2
For example
final String regex = "(?<!\\S)(inn)|([^aeiou\\r\\n])\\2+";
final String string = "asso\n"
+ "assso\n"
+ "assocco\n"
+ "innasso";
final String subst = "$1$2";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll(subst);
System.out.println(result);
Output
aso
aso
asoco
innaso

Regex to match user and user#domain

A user can log in as "user" or as "user#domain". I only want to extract "user" in both cases. I am looking for a matcher expression to fit it, but I'm struggling.
final Pattern userIdPattern = Pattern.compile("(.*)[#]{0,1}.*");
final Matcher fieldMatcher = userIdPattern.matcher("user#test");
fieldMatcher.matches(); // group(1) throws IllegalStateException unless a match was attempted first
final String userId = fieldMatcher.group(1);
userId returns "user#test". I tried various expressions but it seems that nothing fits my requirement :-(
Any ideas?
If you use "(.*)[#]{0,1}.*" pattern with .matches(), the (.*) grabs the whole line first, then, when the regex index is still at the end of the line, the [#]{0,1} pattern triggers and matches at the end of the line because it can match 0 # chars, and then .* again matches at that very location as it matches any 0+ chars. Thus, the whole line lands in your Group 1.
You may use
String userId = s.replaceFirst("^([^#]+).*", "$1");
See the regex demo.
Details
^ - start of string
([^#]+) - Group 1 (referred to with $1 from the replacement pattern): any 1+ chars other than #
.* - the rest of the string.
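A small sketch of the replaceFirst approach, covering both login forms. The class and method names (UserId, extract) are hypothetical and only there to make the snippet runnable.

```java
// Hypothetical helper; the names are illustrative.
class UserId {
    // Keeps everything before the first '#'; the rest of the string is dropped.
    static String extract(String login) {
        return login.replaceFirst("^([^#]+).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("user"));      // user
        System.out.println(extract("user#test")); // user
    }
}
```

One thing to keep in mind: if the input starts with #, the pattern does not match at all and the string is returned unchanged.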
A little bit of googling came up with this:
(.*?)(?=#|$)
This will match everything before an optional #.
I would suggest keeping it simple and not relying on regex in this case if you are using java and have a simple case like you provided.
You could simply do something like this:
String userId = "user#test";
if (userId.indexOf("#") != -1)
    userId = userId.substring(0, userId.indexOf("#"));
// from here on userId will be "user".
This will always either strip out the "#test" or just skip stripping it out when it is not there.
Using regex in most cases makes the code less maintainable by another dev in the future because most devs are not very good with regular expressions, at least in my experience.
You included the # as optional, so the match tries to get the longest user name. Because you didn't add the restriction that a username may not contain #, it matched the longest possible string.
Just use:
[^#]*
as the matching subexpr for usernames (and use $0 to get the matched string)
Or you can use this one that can be used to find several matches (and to get both the user part and the domain part):
\b([^#\s]*)(#[^#\s]*)?\b
The \b anchors tie the match to word boundaries. The first group matches a run of non-space, non-# characters (the * allows any number; + would be better there, since a username must have at least one character), optionally followed by a # and another run of non-space, non-# characters. In this case, $0 matches the whole login string, $1 matches the username part, and $2 the #domain part. You can narrow that to just the domain part by adding another pair of parentheses, as in:
\b([^#\s]*)(#([^#\s]*))?\b
See demo.
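To illustrate the multi-match variant, here is a minimal sketch that collects the user part of every login found in a piece of text. It uses + for the username, as suggested above; the class name LoginScanner and the method users are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative.
class LoginScanner {
    // Group 1: user part. Group 3: domain without '#', or null if absent.
    static final Pattern LOGIN =
            Pattern.compile("\\b([^#\\s]+)(#([^#\\s]*))?\\b");

    // Returns the user part of every login found in the text.
    static List<String> users(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = LOGIN.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(users("alice#example bob")); // [alice, bob]
    }
}
```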

parsing numerical address

I have been trying to parse a numerical address from a string using regex.
So far, I have been able to successfully get the numerical address (partially) 63.88.73.26:80 from the string http://63.88.73.26:80/. However I have been trying to skip over the :80/, and have had no luck.
What I have tried so far is:
Pattern.compile("[0-999].*[0-999][\\p{Digit}]", Pattern.DOTALL);
However, it still includes :80.
I don't know what I am missing here. I have tried to check for \p{Digit} at the end, but that doesn't do much either.
Thanks for your time!
You are looking for a positive lookahead (?=...). This will match only if it is followed by a specific expression, the one in the lookahead's parentheses. In its simplest form you could have
[0-9\.]+(?=:[0-9]{0,4})
Though you may want to replace the [0-9\.]+ part (match 1 or more digits or full stops) with something more complete to check that you have a properly formed address.
Check out regexr.com where you can fiddle your expression to your heart's content until it works...
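In Java, the lookahead version could look roughly like this. The wrapper class LookaheadDemo and the method address are made up for the example; the pattern is the one given above with the backslash doubled for the string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative.
class LookaheadDemo {
    // Returns the run of digits and dots that is followed by ':' (the port
    // separator), or null if there is none. The ':' itself is not consumed.
    static String address(String url) {
        Matcher m = Pattern.compile("[0-9\\.]+(?=:[0-9]{0,4})").matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(address("http://63.88.73.26:80/")); // 63.88.73.26
    }
}
```

Since the lookahead requires a colon, a URL without an explicit port yields no match with this pattern.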
Note that Pshemo indicated the right approach with URL and getHost():
Gets the host name of this URL, if applicable. The format of the host conforms to RFC 2732, i.e. for a literal IPv6 address, this method will return the IPv6 address enclosed in square brackets ('[' and ']').
Thus, it is best to use the proper tool here:
import java.net.*;
....
String str = new URL("http:" + "//63.88.73.26:80/").getHost();
System.out.println(str); // => 63.88.73.26
See the Java demo
You mention that you want to learn regex, so let's inspect your pattern:
[0-999] - matches a single digit (0-9 creates a range that matches 0..9, and the two extra 9s are redundant and can be removed)
.* - any 0+ chars, greedily, i.e. up to the last...
[0-999] - see above (any 1 digit)
[\\p{Digit}] - any ASCII digit (in Java, \p{Digit} is the POSIX class [0-9] unless the UNICODE_CHARACTER_CLASS flag is set)
That means, you match a string starting with a digit and up to the last occurrence of 2 consecutive digits.
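A quick way to convince yourself that [0-999] is a one-character class and not a numeric range is the following sketch (the class name RangeDemo is just for illustration):

```java
// Hypothetical demo class; the name is illustrative.
class RangeDemo {
    // True only if the whole string is exactly one character from [0-999].
    static boolean singleChar(String s) {
        return s.matches("[0-999]");
    }

    public static void main(String[] args) {
        System.out.println(singleChar("5"));   // true
        System.out.println(singleChar("10"));  // false: two characters
        System.out.println(singleChar("999")); // false: three characters
    }
}
```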
You need a sequence of digits and dots. There are multiple ways to extract such strings.
Using verbose pattern with exact character specification together with how many occurrences you need: [0-9]{1,3}(?:\.[0-9]{1,3}){3} (the whole match - matcher.group() - holds the required value).
Using the "brute-force" character class approach (see Jonathan's answer), but I'd use a capturing group instead of a lookahead and use an unescaped dot since inside a character class it is treated as a literal dot: ([0-9.]+):[0-9] (now, the value is in matcher.group(1))
A "fancy" "get-string-between-two-strings" approach: all text other than : and / between http:// and : must be captured into a group - https?://([^:/]+): (again, the value is in matcher.group(1))
Some sample code (Approach #1):
Pattern ptrn = Pattern.compile("[0-9]{1,3}(?:\\.[0-9]{1,3}){3}");
Matcher matcher = ptrn.matcher("http://63.88.73.26:80/");
if (matcher.find()) {
    System.out.println(matcher.group());
}
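For comparison, approaches #2 and #3 could be sketched as follows. Both put the value in group 1, so one small helper covers them; the names HostExtract and firstGroup are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative.
class HostExtract {
    // Returns capturing group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://63.88.73.26:80/";
        // Approach #2: digits and dots followed by ':' and a digit.
        System.out.println(firstGroup("([0-9.]+):[0-9]", url));   // 63.88.73.26
        // Approach #3: everything between the scheme and the next ':'.
        System.out.println(firstGroup("https?://([^:/]+):", url)); // 63.88.73.26
    }
}
```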
Must read: Character Classes or Character Sets.

Matching regex groups with Java

I am trying to split a line with regex by using groups, but it's not working as I expected.
I want to match for example this line:
Ex. #1: temp name(this is the data)
and also this:
Ex. #2: temp name()
I used this regex:
[\s]*temp[\s]+[\s]*([A-Za-z]+)[\s]*[(]\s*(.*)+[)]\s*[{]\s*
which means: grab anything that starts with temp then put in group #1 the "name" then grab whatever inside the bracket and put it in group #2.
However, group #2 is always empty.
This is my code to fetch the data:
Pattern PATTERN = Pattern.compile("[\\s]*temp[\\s]+[\\s]*([A-Za-z]+)[\\s]*[(]\\s*(.*)+[)]\\s*");
Matcher m = PATTERN.matcher("temp name(this is the data)");
m.matches();
String name = m.group(1);
String data = m.group(2); // always empty
What am I doing wrong?
Your pattern doesn't match because it requires an open curly brace at the end, but your input doesn't have one.
Ignoring that small problem, the main problem is the little + after your capture group (.*)+. The plus requires one or more matches of .* and the group returned is the last match of the many. The term .* is greedy, so it consumes everything up to the bracket. The only way to match again is to consume nothing. So the last match of group 2 is blank.
To fix it, remove the + after group 2:
Pattern PATTERN = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Note also how I removed other unnecessary characters from your regex, e.g. the single-character character classes: [\\s] is identical to \\s. And \\s+\\s* is identical to just \\s+, because + is greedy.
I also removed the trailing curly bracket, which you can restore if your input data actually has it (your question showed input of "temp name(this is the data)", which has no trailing curly bracket).
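To see the difference side by side, here is a small sketch that runs the same input through both versions of the pattern and prints group 2 in brackets. The class and method names (GreedyGroupDemo, groupTwo) are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative.
class GreedyGroupDemo {
    // Returns group 2 if the whole input matches, otherwise null.
    static String groupTwo(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "temp name(this is the data)";
        String quantified = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)+[)]\\s*";
        String fixed = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*";
        // With (.*)+ the last repetition of the group is the empty string.
        System.out.println("[" + groupTwo(quantified, input) + "]"); // []
        // With (.*) the group keeps the text between the brackets.
        System.out.println("[" + groupTwo(fixed, input) + "]"); // [this is the data]
    }
}
```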
Your regex should be this:
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
You had (.*)+, which means one or more repetitions of .*. The group keeps only its last repetition, which is the empty string here, so group 2 ends up empty.
Testing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Example {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
        Matcher m = pattern.matcher("temp name(this is the data)");
        if (m.matches()) {
            System.out.println(m.group(1));
            System.out.println(m.group(2));
        }
    }
}
Output:
name
this is the data
[\s] is equivalent with \s
[\s]+[\s]* is equivalent with \s+
[(] is equivalent with \( (same for [)] and [}])
This would leave your regexp as:
\s*temp\s+([A-Za-z]+)\s*\(\s*(.*)+\)\s*\{\s*
Assuming you actually want to match temp name(...) { (your regexp is looking for a {, while in your question you do not specify that):
(.*)+ is your problem. You're saying: "Match any number (including 0) of characters and put them in a capture group, then repeat that at least once".
Regexps are greedy by default (= they consume as much as possible), so the capture group will first contain everything between the two brackets, then the + will try to match the entire group again, and will match it with "" (the empty string), as this fulfils the capture group's pattern. This will leave your capture group empty.
What you want instead is \s*temp\s+([A-Za-z]+)\s*\(\s*(.*)\)\s*\{\s*
The reason you are getting empty groups is that you are creating a capture group every time you put something between (), even if it is nested.
To make a group that doesn't capture, you can designate it as a non-capturing group by using ?:. For example, (?:sometest(this is the value we want)) will return just one group, while (sometest(this is the value we want)) will return 2 groups.
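The group counts can be checked directly with Matcher.groupCount(), which reports the number of capturing groups in the pattern. A minimal sketch, with a made-up class name and shortened sample patterns:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative.
class GroupCountDemo {
    // Number of capturing groups declared in the pattern.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groups("(?:sometest(value))")); // 1
        System.out.println(groups("(sometest(value))"));   // 2
    }
}
```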
For your particular regex, I have refined and simplified it, as you had capture groups you did not need.
Simple solution:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*\\{\\s*
given the input:
Ex. #1: temp name(this is the data) {
Ex. #2: temp name() {
$1 = name, $2 = data
Pay attention to the fact that your regex contains a trailing curly brace. You can modify the regex to match without it and it will result in this:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*
https://regex101.com/r/tD0tO0/1

Regex in Java: Capture last {n} words

Hi I am trying to do regex in java, I need to capture the last {n} words. (There may be a variable num of whitespaces between words). Requirement is it has to be done in regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?
You almost had it in your first attempt.
Just change + to * (that is, use \S*\s* rather than \S+\s+).
The plus sign requires at least one character, and because there is no whitespace after the last word, the match failed.
The asterisk, on the other hand, means zero or more, so it will work.
Look it live here: (?:\S*\s*){2}$
Using replaceAll method, you could try this regex: ((?:\\S*\\s*){2}$)|.
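A possible way to use that replaceAll regex for any n is sketched below. The replacement string "$1" and the names LastWords and lastWords are my assumptions; the idea is that every character outside the final n words is matched by the lone dot and replaced with an empty group 1.

```java
// Hypothetical helper; the names are illustrative.
class LastWords {
    // Keeps the last n whitespace-separated words and removes everything else.
    // Note: '.' does not match line terminators unless DOTALL is enabled,
    // so this sketch assumes single-line input.
    static String lastWords(String text, int n) {
        return text.replaceAll("((?:\\S*\\s*){" + n + "}$)|.", "$1");
    }

    public static void main(String[] args) {
        System.out.println(lastWords("The man is very tall.", 2)); // very tall.
    }
}
```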
Your regex contains, as you already mention, a greedy subpattern that eats up the whole string, and since (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job since the capturing group is quantified, and the Matcher.group(1) will be set to the last captured non-whitespaces with optional whitespaces. You need to set the capturing group around the quantified group.
Since you most likely are using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
    System.out.println(matcher.group(2)); // => very tall.
}
See IDEONE demo
