I have been trying to parse a numerical address from a string using regex.
So far, I have been able to successfully get the numerical address (partially) 63.88.73.26:80 from the string http://63.88.73.26:80/. However I have been trying to skip over the :80/, and have had no luck.
What I have tried so far is:
Pattern.compile("[0-999].*[0-999][\\p{Digit}]", Pattern.DOTALL);
However, it still includes :80.
I don't know what I am missing here. I have tried to check for \p{Digit} at the end, but that doesn't do much either.
Thanks for your time!
You are looking for a positive lookahead (?=...). This will match only if it is followed by a specific expression, the one in the positive lookahead's parentheses. In its simplest form you could have
[0-9\.]+(?=:[0-9]{0,4})
Though you may want to replace the [0-9\.]+ part (match one or more digits or full stops) with something more complete to check that you have a properly formed address.
Check out regexr.com where you can fiddle your expression to your heart's content until it works...
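For reference, here is a minimal sketch of how the lookahead expression could be applied from Java. The class and helper names (LookaheadHost, extract) are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadHost {
    // Digits and dots, matched only when a ':' plus up to four digits follows.
    // The lookahead checks for the port but does not consume it.
    private static final Pattern HOST = Pattern.compile("[0-9\\.]+(?=:[0-9]{0,4})");

    // Hypothetical helper: returns the first match, or null if there is none.
    static String extract(String input) {
        Matcher m = HOST.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("http://63.88.73.26:80/")); // prints 63.88.73.26
    }
}
```

Because the port sits inside the lookahead, m.group() returns only the address part.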
Note that Pshemo indicated the right approach with URL and getHost():
Gets the host name of this URL, if applicable. The format of the host conforms to RFC 2732, i.e. for a literal IPv6 address, this method will return the IPv6 address enclosed in square brackets ('[' and ']').
Thus, it is best to use the proper tool here:
import java.net.*;
....
String str = new URL("http:" + "//63.88.73.26:80/").getHost();
System.out.println(str); // => 63.88.73.26
See the Java demo
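If you are on a recent JDK where the URL constructors are deprecated (Java 20 onward, as far as I know), java.net.URI offers the same getHost() accessor. A small sketch, with UriHost and hostOf being names invented for this example:

```java
import java.net.URI;

final class UriHost {
    // Hypothetical helper: parses the string as a URI and returns its host part.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://63.88.73.26:80/");
        System.out.println(uri.getHost()); // prints 63.88.73.26
        System.out.println(uri.getPort()); // prints 80
    }
}
```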
You mention that you want to learn regex, so let's inspect your pattern:
[0-999] - matches any single digit (0-9 creates a range that matches 0..9, and the two extra 9s are redundant and can be removed)
.* - any 0+ chars, greedily, i.e. up to the last...
[0-999] - see above (any 1 digit)
[\\p{Digit}] - any ASCII digit (in Java, \p{Digit} is the same as [0-9] unless the UNICODE_CHARACTER_CLASS flag is set)
That means, you match a string starting with a digit and up to the last occurrence of 2 consecutive digits.
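To see the greedy behaviour in action, here is a quick sketch that runs the original pattern against the sample URL. GreedyDemo and firstMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyDemo {
    // Hypothetical helper: first match of regex in input (DOTALL, as in the question), or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // .* runs to the end of the input and then backtracks until two digits remain,
        // which is why the port is swallowed into the match.
        System.out.println(firstMatch("[0-999].*[0-999][\\p{Digit}]", "http://63.88.73.26:80/"));
        // prints 63.88.73.26:80
    }
}
```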
You need a sequence of digits and dots. There are multiple ways to extract such strings.
Using verbose pattern with exact character specification together with how many occurrences you need: [0-9]{1,3}(?:\.[0-9]{1,3}){3} (the whole match - matcher.group() - holds the required value).
Using the "brute-force" character class approach (see Jonathan's answer), but I'd use a capturing group instead of a lookahead and use an unescaped dot since inside a character class it is treated as a literal dot: ([0-9.]+):[0-9] (now, the value is in matcher.group(1))
A "fancy" "get-string-between-two-strings" approach: all text other than : and / between http:// and : must be captured into a group - https?://([^:/]+): (again, the value is in matcher.group(1))
Some sample code (Approach #1):
Pattern ptrn = Pattern.compile("[0-9]{1,3}(?:\\.[0-9]{1,3}){3}");
Matcher matcher = ptrn.matcher("http://63.88.73.26:80/");
if (matcher.find()) {
System.out.println(matcher.group());
}
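Approaches #2 and #3 can be sketched the same way. The only difference is that the value is read from matcher.group(1). The class and helper names below are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupExtract {
    // Approach #2: digits and dots captured in group 1, followed by ':' and a digit.
    static final Pattern CHAR_CLASS = Pattern.compile("([0-9.]+):[0-9]");
    // Approach #3: everything except ':' and '/' between the scheme and the next ':'.
    static final Pattern BETWEEN = Pattern.compile("https?://([^:/]+):");

    // Hypothetical helper: returns group 1 of the first match, or null.
    static String group1(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://63.88.73.26:80/";
        System.out.println(group1(CHAR_CLASS, url)); // prints 63.88.73.26
        System.out.println(group1(BETWEEN, url));    // prints 63.88.73.26
    }
}
```

Note that approach #3 also accepts host names, not only numeric addresses, since it does not look at the characters themselves.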
Must read: Character Classes or Character Sets.
Related
I want to validate a string in Java so that it allows only alphanumeric values, at most
one dot character, and underscore characters.
String fileName = (String) request.getParameter("read");
I need to validate the fileName retrieved from the request; it should
satisfy the above criteria.
I tried "^[a-zA-Z0-9_'.']*$", but this allows more than one dot character.
I need to validate my string in the following scenarios:
1. The filename contains only alphanumeric values.
2. It allows only one dot character (.), for example: fileRead.pdf,
fileWrite.txt, etc.
3. It allows underscore characters. All other symbols should be
rejected.
Can anyone help me with this?
You should use String.matches() method :
System.out.println("My_File_Name.txt".matches("\\w+\\.\\w+"));
You can also use java.util.regex package.
java.util.regex.Pattern pattern =
java.util.regex.Pattern.compile("\\w+\\.\\w+");
java.util.regex.Matcher matcher = pattern.matcher("My_File_Name.txt");
System.out.println(matcher.matches());
For more information about REGEX and JAVA, look at this page :
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
You could use two negative lookaheads here:
^((?!.*\..*\.)(?!.*_.*_)[A-Za-z0-9_.])*$
Each lookahead asserts that either a dot or an underscore does not occur two times, implying that it can occur at most once.
It wasn't completely clear whether you require one dot and/or underscore. I assumed not, but my regex could be easily modified to this requirement.
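Here is a rough sketch of how that expression behaves on a few file names. FileNameLookahead and isValid are names made up for the example. As written, the pattern also accepts the empty string, which you may or may not want.

```java
import java.util.regex.Pattern;

final class FileNameLookahead {
    // Each character must be alphanumeric, '_' or '.', and at no point may
    // two dots or two underscores still lie ahead in the string.
    private static final Pattern NAME =
        Pattern.compile("^((?!.*\\..*\\.)(?!.*_.*_)[A-Za-z0-9_.])*$");

    // Hypothetical helper: true when the whole name satisfies the pattern.
    static boolean isValid(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("fileRead.pdf"));     // true
        System.out.println(isValid("file.read.pdf"));    // false, two dots
        System.out.println(isValid("my_file_name.txt")); // false, two underscores
        System.out.println(isValid("file$.txt"));        // false, '$' is not allowed
    }
}
```

If underscores should be unlimited, dropping the second lookahead (?!.*_.*_) would presumably be enough.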
Demo
You can first count the special characters that have an occurrence limit.
Here is the code (using Spring's StringUtils):
int occurrences = StringUtils.countOccurrencesOf("123123..32131.3", ".");
or (using Apache Commons Lang's StringUtils):
int count = StringUtils.countMatches("123123..32131.3", ".");
If the count exceeds your limit, you can reject the string before the regex check.
If it does not, you can then run the alphanumeric check on the string.
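If you would rather not pull in a library for the counting step, the same two-step idea can be sketched with plain JDK calls. FileNameCheck, count and isValid are hypothetical names. This sketch assumes the dot is limited to one occurrence and underscores are unlimited.

```java
final class FileNameCheck {
    // Counts how often character c occurs in s.
    static long count(String s, char c) {
        return s.chars().filter(ch -> ch == c).count();
    }

    // Step 1: reject names with more than one dot.
    // Step 2: require that every character is alphanumeric, '_' or '.'.
    static boolean isValid(String name) {
        if (count(name, '.') > 1) {
            return false;
        }
        return name.matches("[A-Za-z0-9_.]+");
    }

    public static void main(String[] args) {
        System.out.println(isValid("fileRead.pdf"));    // true
        System.out.println(isValid("123123..32131.3")); // false, three dots
    }
}
```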
I have the following text
My thing 0.02
My thing 100.2
My thing 65
My thing
0.03
My thing
13
My thing
45.67 stuff
I want to extract 'My thing' and the number associated with it so that I can split it and put it into a map. (I know the keys will overwrite each other in this example; it's just the example I'm using here. My thing will actually be incorporated into its own map, so it isn't an issue.)
Mything=0.02,Mything=100.2,Mything=65,Mything=0.03,Mything=13,Mything=45.67
I tried
Pattern match_pattern = Pattern.compile(start.trim()+"\\n.*?\\d*\\.\\d*\\s",Pattern.DOTALL);
but this doesn't quite do what I want
The pattern for an integer or decimal might be \d+(\.\d+)?, so if you want to look for start followed by that number with optional whitespace in between, you might try the pattern start + "\\s*\\d+(\\.\\d+)?" (line breaks are whitespace as well) and apply the pattern to the multiline text (i.e. don't apply it to individual lines). If there can be anything in between (not just whitespace), you'll want to use .* along with the DOTALL flag instead of \s*.
Breakdown of the expression start + "\\s*\\d+(\\.\\d+)?"
start contains a subexpression which is provided from elsewhere. If you want to make sure it is treated as a literal (i.e. special characters like * etc. are not interpreted), wrap it with \Q and \E, i.e. "\\Q" + start + "\\E"
\s* (or \\s* in a Java string literal) means "zero or more whitespace characters", which also includes line breaks
\d+(\.\d+)? (or \\d+(\\.\\d+)? in a Java string literal) means "one or more digits followed by zero or one group consisting of a dot and one or more digits" - this means the "dot and one or more digits" part is optional but if there is a dot it must be followed by at least one digit.
Additional note: if you want to access the capturing groups e.g. to extract the number you'll want to use a non-capturing group for the optional part and wrap the entire (sub-)expression in a capturing group, e.g. (\d+(?:\.\d+)?). In that case, if you'd use Pattern and Matcher, you could access the number using group(1) - or if you wrap start in a group as well (like "(\\Q" + start + "\\E)\\s*(\\d+(?:\\.\\d+)?)") you'd get the first part as group(1) and the second part as group(2).
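Putting the pieces together, a sketch with a variable start could look like this. ThingParser and numbersAfter are invented names. Pattern.quote(start) is used here in place of manually concatenating \Q and \E. It does the same quoting, and also copes with a start that itself contains \E.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThingParser {
    // Hypothetical helper: collects every number that follows the literal text 'start',
    // allowing whitespace (including line breaks) in between.
    static List<String> numbersAfter(String start, String text) {
        Pattern p = Pattern.compile("(" + Pattern.quote(start) + ")\\s*(\\d+(?:\\.\\d+)?)");
        Matcher m = p.matcher(text);
        List<String> numbers = new ArrayList<>();
        while (m.find()) {
            numbers.add(m.group(2)); // group(1) is the key, group(2) the number
        }
        return numbers;
    }

    public static void main(String[] args) {
        String text = "My thing 0.02\nMy thing 100.2\nMy thing 65\nMy thing\n"
                + "0.03\nMy thing\n13\nMy thing\n 45.67 stuff\n";
        System.out.println(numbersAfter("My thing", text));
        // prints [0.02, 100.2, 65, 0.03, 13, 45.67]
    }
}
```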
If you simply want to extract the records you could do it like
String s = "My thing 0.02\nMy thing 100.2\nMy thing 65\nMy thing\n"+
"0.03\nMy thing\n13\nMy thing\n 45.67 stuff\n";
Matcher m = Pattern.compile("(My thing)\\s*(\\d+(?:\\.\\d+)?)").matcher(s);
Then loop through the matches and add them to the dictionary, or whatever... ;)
while (m.find()) {
// Add to dictionary, group 1 is key, 2 is value
System.out.println("Found: " + m.group(0)+ ":" + m.group(1)+":" + m.group(2));
}
See it here at ideone.
I know the basics of Java but I am not too experienced with regex or patterns, so please excuse me if I'm asking something super simple.
I'm writing a method that detects IP addresses and hostnames. I used the regex from this answer here. The problem I am encountering, though, is that sentences without symbols are counted as host names.
Here's my code:
Pattern validHostname = Pattern.compile("^(([a-z]|[a-z][a-z0-9-]*[a-z0-9]).)*([a-z]|[a-z][a-z0-9-]*[a-z0-9])$",Pattern.CASE_INSENSITIVE);
Pattern validIpAddress = Pattern.compile("^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])([:]\\d\\d*\\d*\\d*\\d*)*$",Pattern.CASE_INSENSITIVE);
String msg = c.getMessage();
boolean found=false;
//Randomly picks from a list to replace the detected ip/hostname
int rand=(int)(Math.random()*whitelisted.size());
String replace=whitelisted.get(rand);
Matcher matchIP = validIpAddress.matcher(msg);
Matcher matchHost = validHostname.matcher(msg);
while(matchIP.find()){
if(adreplace)
msg=msg.replace(matchIP.group(),replace);
else
msg=msg.replace(matchIP.group(),"");
found=true;
c.setMessage(msg);
}
while(matchHost.find()){
if(adreplace)
msg=msg.replace(matchHost.group(),replace);
else
msg=msg.replace(matchHost.group(),"");
found=true;
c.setMessage(msg);
}
return c;
Description
Without sample text and desired output, I'll try my best to answer your question.
I would rewrite your host name expression like this:
A: ^(?:[a-z][a-z0-9-]*[a-z0-9](?=\.[a-z]|$)\.?)+$ will allow single word names like abcdefg
B: ^(?=(?:.*?\.){2})(?:[a-z][a-z0-9-]*[a-z0-9](?=\.[a-z]|$)\.?)+$ requires the string to contain at least two periods, like abc.defg.com. This will not allow a period to appear at the beginning or end, or sequential periods. The number inside the lookahead {2} describes the minimum number of dots which must appear. You can change this number as you see fit.
^ match the start of the string anchor
(?: start non-capture group improves performance
[a-z][a-z0-9-]*[a-z0-9] match text, taken from your original expression
(?=\.[a-z]|$) look ahead to see if the next character is a dot followed by an a-z character, or the end of the string
\.? consume a single dot if it exists
) close the non-capture group
+ require the contents of the non-capture group to exist 1 or more times
$ match the end of the string anchor
Host names:
A Allows host name without dots
B Requires host name to contain dots (at least two, as written)
Live Demo with a sentence with no symbols
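As a rough check of expressions A and B from Java, here is a small sketch. HostNamePatterns and the two helper methods are hypothetical names. CASE_INSENSITIVE is carried over from the question's code.

```java
import java.util.regex.Pattern;

final class HostNamePatterns {
    // Expression A: dots are optional, so single-word names are accepted.
    private static final Pattern A = Pattern.compile(
        "^(?:[a-z][a-z0-9-]*[a-z0-9](?=\\.[a-z]|$)\\.?)+$", Pattern.CASE_INSENSITIVE);
    // Expression B: the leading lookahead demands at least two dots.
    private static final Pattern B = Pattern.compile(
        "^(?=(?:.*?\\.){2})(?:[a-z][a-z0-9-]*[a-z0-9](?=\\.[a-z]|$)\\.?)+$",
        Pattern.CASE_INSENSITIVE);

    static boolean matchesA(String s) { return A.matcher(s).matches(); }
    static boolean matchesB(String s) { return B.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(matchesA("abcdefg"));      // true
        System.out.println(matchesA("hello world"));  // false, a space is not allowed
        System.out.println(matchesB("abc.defg.com")); // true
        System.out.println(matchesB("example.com"));  // false, only one dot
    }
}
```

Since a plain word still passes A, B (or A with a lower dot count in the lookahead) is probably the better fit for filtering chat messages.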
I would also rewrite the IP expression
^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?::\d*)?$
The major differences here are that I:
removed the multiple \d* from the end because expression \d*\d*\d*\d*\d*\d* is equivalent to \d*
changed the character class [:] to a single character :
turned the capture groups (...) into non-capture groups (?:...), which perform a little better.
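A quick sketch of the rewritten IP expression in Java, with IpPattern and isIp as made-up names for the example:

```java
import java.util.regex.Pattern;

final class IpPattern {
    private static final String OCTET = "(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])";
    // Three "octet." groups, a final octet, then an optional ":port" suffix.
    private static final Pattern IP =
        Pattern.compile("^(?:" + OCTET + "\\.){3}" + OCTET + "(?::\\d*)?$");

    // Hypothetical helper: true when the whole string is an IPv4 address with optional port.
    static boolean isIp(String s) {
        return IP.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIp("63.88.73.26"));    // true
        System.out.println(isIp("63.88.73.26:80")); // true
        System.out.println(isIp("256.1.1.1"));      // false, 256 is out of range
    }
}
```

One thing to be aware of: because the port part is \d*, a trailing colon with no digits ("1.2.3.4:") is also accepted. Using (?::\d+)? instead would require at least one port digit.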
I'm confronted with a String:
[something] -number OR number [something]
I want to be able to cast the number. I do not know at which position it occurs. I cannot build a substring because there's no obvious separator.
Is there any method how I could extract the number from the String by matching a pattern like
[-]?[0..9]+
, where the minus is optional? The String can contain special characters, which actually drives me crazy defining a regex.
-?\b\d+\b
That's broken down by:
-? (optional minus sign)
\b word boundary
\d+ 1 or more digits
[EDIT 2] - nod to Alan Moore
Unfortunately Java doesn't have verbatim strings, so you'll have to escape the regex above as:
String regex = "-?\\b\\d+\\b";
I'd also recommend a site like http://regexlib.com/RETester.aspx or a program like Expresso to help you test and design your regular expressions
[EDIT] - after some good comments
I haven't done something like .*?(-?\d+).* (from @Voo) because I wasn't sure if you wanted to match the entire string, or just the digits. Both versions should tell you if there are digits in the string, and if you want the actual digits, use the first regex and look for group 0 (the whole match). There are clever ways to name groups or use multiple captures, but that would be a complicated answer to a straightforward question...
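For completeness, a minimal sketch of the first regex used from Java. NumberFinder and firstNumber are hypothetical names, and the sketch assumes the number fits into an int.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberFinder {
    // Optional minus sign, then a run of digits delimited by word boundaries.
    private static final Pattern NUMBER = Pattern.compile("-?\\b\\d+\\b");

    // Hypothetical helper: first number found in s, or null when there is none.
    // Assumes the digits fit into an int; Integer.valueOf throws otherwise.
    static Integer firstNumber(String s) {
        Matcher m = NUMBER.matcher(s);
        return m.find() ? Integer.valueOf(m.group()) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("[something] -42")); // prints -42
        System.out.println(firstNumber("17 [something]"));  // prints 17
    }
}
```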
I need to check that a file contains some amounts that match a specific format:
between 1 and 15 characters (numbers or ",")
may contain at most one "," separator for decimals
must at least have one number before the separator
this amount is supposed to be in the middle of a string, bounded by alphabetical characters (but we have to exclude the malformed files).
I currently have this:
\d{1,15}(,\d{1,14})?
But it does not match with the requirement as I might catch up to 30 characters here.
Unfortunately, for some reasons that are too long to explain here, I cannot simply pick a substring or use any other java call. The match has to be in a single, java-compatible, regular expression.
^(?=.{1,15}$)\d+(,\d+)?$
^ start of the string
(?=.{1,15}$) positive lookahead to make sure that the total length of string is between 1 and 15
\d+ one or more digit(s)
(,\d+)? optionally followed by a comma and more digits
$ end of the string (not really required as we already checked for it in the lookahead).
You might have to escape backslashes for Java: ^(?=.{1,15}$)\\d+(,\\d+)?$
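A short sketch of the anchored version in Java, to show how the lookahead caps the total length. AmountCheck and isAmount are names invented for the example.

```java
import java.util.regex.Pattern;

final class AmountCheck {
    // The lookahead limits the whole string to 1..15 characters;
    // the rest enforces digits with at most one comma followed by digits.
    private static final Pattern AMOUNT = Pattern.compile("^(?=.{1,15}$)\\d+(,\\d+)?$");

    // Hypothetical helper: true when the whole string is a well-formed amount.
    static boolean isAmount(String s) {
        return AMOUNT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAmount("123,45"));           // true
        System.out.println(isAmount("1234567890123456")); // false, 16 characters
        System.out.println(isAmount(",45"));              // false, no digit before the comma
    }
}
```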
update: If you're looking for this in the middle of another string, use word boundaries \b instead of string boundaries (^ and $).
\b(?=[\d,]{1,15}\b)\d+(,\d+)?\b
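Used from Java with find(), the word-boundary version might look like the sketch below. AmountFinder and find are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AmountFinder {
    // Word boundaries replace ^ and $, so the amount can sit inside a longer string.
    private static final Pattern AMOUNT =
        Pattern.compile("\\b(?=[\\d,]{1,15}\\b)\\d+(,\\d+)?\\b");

    // Hypothetical helper: first amount found in the text, or null.
    static String find(String text) {
        Matcher m = AMOUNT.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(find("total 123,45 EUR"));           // prints 123,45
        System.out.println(find("total 1234567890123456 EUR")); // prints null, too long
    }
}
```

Keep in mind that \b only matches between a word character and a non-word character, so an amount glued directly to letters (e.g. abc123def) is not found by this expression. If your amounts really touch the surrounding letters, lookarounds on the neighbouring characters may be a better fit. That is an assumption about your input, though.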
For java:
"\\b(?=[\\d,]{1,15}\\b)\\d+(,\\d+)?\\b"
More readable version:
"\\b(?=[0-9,]{1,15}\\b)[0-9]+(,[0-9]+)?\\b"