java regex tricky pattern - java

I've been stuck for a while on a regex that does the following:
split my sentences with this: "[\W+]"
but if it finds a word like this: "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), the word isn't split; it is kept whole.
Basically, I want to split a sentence into words, while also treating "aaa-aa" as a single word. I have successfully done that by creating two separate functions, one for splitting with \W, and another to find words like "aaa-aa". Finally, I add both results together and subtract the parts of each compound word.
For example, the sentence:
"Hello my-name is Richard"
First I collect {Hello, my, name, is, Richard}
then I collect {my-name}
then I add {my-name} to {Hello, my, name, is, Richard}
then I take {my} and {name} out of {Hello, my, name, is, Richard}.
result: {Hello, my-name, is, Richard}
This approach does what I need, but for parsing large files it becomes too heavy, because each sentence needs too many copies. So my question is: is there anything I can do to include everything in one pattern? Like:
"split the text using this pattern "[\W+]", but if you find a word like "aaa-aa", consider it one word and not two words."

If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want:
[\s-]{2,}|\s
To break that down, you first split on two or more whitespaces and/or hyphens, so a single '-' won't match and 'one-two' will be left alone, but something like 'one--two', 'one - two' or even 'one - --- - two' will be split into 'one' and 'two'. That still leaves the 'normal' case of a single whitespace ('one two') unmatched, so we add an or ('|') followed by a single whitespace (\s).
Note that the order of the alternatives is important: RE subexpressions separated by '|' are evaluated left-to-right, so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, when presented with something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
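As a rough sketch of how that split behaves (the class and method names here are made up for illustration), something like this can be used to try it on the sentence from the question:

```java
import java.util.Arrays;

final class HyphenSplit {
    // Alternative order matters: the run of two or more whitespace/hyphen
    // characters is tried before the single-whitespace case.
    static String[] words(String sentence) {
        return sentence.split("[\\s-]{2,}|\\s");
    }

    public static void main(String[] args) {
        // prints [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(words("Hello my-name is Richard")));
        // prints [one, two]
        System.out.println(Arrays.toString(words("one - two")));
        // prints [one, two]
        System.out.println(Arrays.toString(words("one--two")));
    }
}
```

Note that this only treats whitespace and hyphens as separators; other punctuation stays attached to the words.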
If you want to interactively play around with Java REs I can thoroughly recommend http://myregexp.com/signedJar.html which allows you to edit the RE and see it matching against a sample string as you edit the RE.

Why not use the pattern \\s+? This does exactly what you want without any tricks: it splits the text into words separated by whitespace.

Your description isn't clear enough, but why not just split it up by spaces?

I am not sure whether this pattern will work, because I don't have developer tools for Java, but you might try it. It uses character class subtraction (written in Java as an intersection with a negated class), which as far as I know is supported only in Java regex:
[\W&&[^-]]+
It means: match characters that are in [\W] and also in [^-], that is, non-word characters other than the hyphen.
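A minimal sketch of using that intersection pattern with String.split (the class and method names are hypothetical):

```java
final class IntersectionSplit {
    static String[] words(String sentence) {
        // [\W&&[^-]] = non-word characters, intersected with "anything but a hyphen",
        // so hyphens are never treated as separators.
        return sentence.split("[\\W&&[^-]]+");
    }

    public static void main(String[] args) {
        for (String word : words("Hello my-name is Richard")) {
            System.out.println(word);
        }
    }
}
```

One caveat with this approach: since a hyphen is never a separator, an input such as "aaa - aa" comes back with a lone "-" as one of the tokens, so it does not by itself cover the "aaa - aa" case from the question.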

Almost the same regular expression as in your previous question:
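import java.util.regex.Matcher;
import java.util.regex.Pattern;

String sentence = "Hello my-name is Richard";
// a word, optionally followed by one hyphen and another word, not touching other word characters
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group());
}
I just added the optional group (...)? so that non-hyphenated words are matched too.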

Related

Regular expression non-greedy but still

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression which allows me to parse out the numbers, but only if the two characters at the end match, i.e. I am looking for the number related to "jk". However, the following regular expression matches the whole string and thus returns "12", not "23", even though the middle part is matched non-greedily:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like "match abc followed by a number, but only if there is "jk" at the end before another instance of abc followed by a number appears".
Note: the texts/matches are an abstraction here. The actual text is more complicated, especially the things that can appear as "manyothercharshere"; I simplified it to show the underlying problem more clearly.
Use a regex like this:
.*abc([0-9]+).*?jk
I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
You need to use negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
Here (?!.*?abc) is a negative lookahead that makes sure to match an abc that is NOT followed by another abc, so the closest string between abc and jk is matched.
Being non-greedy does not change the rule, that the first match is returned. So abc([0-9]+).*?jk will find the first jk after “abcnumber” rather than the last one, but still match the first “abcnumber”.
One way to solve this is to state that the characters matched by the dot must not run into another abc([0-9]+):
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important to have the entire pattern being an exact match you can do it simpler:
.*(abc([0-9]+).*?jk)
In this case, it's group 1 which contains your intended match (and group 2 the number). The pattern uses a greedy match-all to ensure that the last possible "abcnumber" is matched within the group.
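To see the "dot must not run into another abc-number" idea in action, here is a small sketch in Java. The class and method names are invented for illustration, and the lookahead is shortened to abc[0-9] and made non-capturing, which is a slight simplification of the pattern above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ItemNumber {
    // abc<number>, then characters that never start another "abc<digit>", then "jk"
    private static final Pattern ITEM =
            Pattern.compile("abc([0-9]+)(?:(?!abc[0-9]).)*jk");

    // Returns the number of the first item that ends in "jk", or null if there is none.
    static String numberBeforeJk(String text) {
        Matcher m = ITEM.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "abc12..manycharshere...hi - abc23...manyothercharshere...jk";
        System.out.println(numberBeforeJk(text)); // prints 23
    }
}
```

If the real text spans several lines, the dot would need Pattern.DOTALL to cross line breaks.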
Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk

Java Regex to validate String

I have just bought a book on regex to try and get my head around it, but I'm still really struggling with it. I am trying to create a Java regex that validates a configuration string with these rules:
Can contain lowercase letters ([a-z])
Can contain commas (,) but only between words
Can contain colons (:) but they must be separated by words or a multiply sign (*)
Can contain hyphens (-) but they must be separated by words
Can contain a multiply sign (*) but if used it must be the only character before/between/after the colons
Cannot contain spaces; 'words' are delimited by a hyphen (-), a comma (,), a colon (:) or the end of the string
So for example the following would be true:
foo:bar
foo-bar:foo
foo,bar:foo
foo-bar,foo:bar,foo-bar
foo:bar:foo,bar
*:foo
foo:*
*:*:*
But the following would be false:
foo :bar
,foo:bar
foo-:bar
-foo:bar
foo,:bar-
foo:bar,
foo,*:bar
foo-*:bar
This is what I have so far:
^[a-z-]|*[:?][a-z-]|*[:?][a-z-]|*
Here is a regex that will work for all your cases:
([a-z]+([,-][a-z]+)*|\*)(:([a-z]+([,-][a-z]+)*|\*))*
Here is a detailed analysis:
One of the basic structures used to build complicated regular expressions like this is actually pretty simple, and has the form text(separator text)*. A regex of that form will match:
one text
one text, a separator, and another text
one text, a separator, another text, another separator, and yet another text
or more, just add another separator and a text to the end.
So here is a breakdown of the code:
[a-z]+([,-][a-z]+)* is an instance of the pattern I discussed above: the text here is [a-z]+, and the separator is [,-].
([a-z]+([,-][a-z]+)*|\*) allows an asterisk to be matched instead.
([a-z]+([,-][a-z]+)*|\*)(:([a-z]+([,-][a-z]+)*|\*))* is another instance of the pattern I discussed above: the text is ([a-z]+([,-][a-z]+)*|\*), and the separator is :.
If you plan to use this as a component of an even larger regex, in which the group matches will be important, I would recommend making the internal parens non-capturing and placing capturing parens around the entire regex, like so:
((?:[a-z]+(?:[,-][a-z]+)*|\*)(?::(?:[a-z]+(?:[,-][a-z]+)*|\*))*)
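For what it's worth, here is a sketch of how the text(separator text)* structure from the breakdown could be assembled and checked in Java. The class name, the TEXT constant and isValid are made up for this example, and the groups are written as non-capturing:

```java
import java.util.regex.Pattern;

final class ConfigValidator {
    // One "text" unit: words joined by ',' or '-', or a lone asterisk.
    private static final String TEXT = "(?:[a-z]+(?:[,-][a-z]+)*|\\*)";

    // text (':' text)*
    private static final Pattern CONFIG = Pattern.compile(TEXT + "(?::" + TEXT + ")*");

    static boolean isValid(String s) {
        // matches() requires the whole string to fit, so no ^ and $ are needed
        return CONFIG.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("foo-bar,foo:bar,foo-bar")); // prints true
        System.out.println(isValid("foo,*:bar"));               // prints false
    }
}
```

Building the pattern from a named piece keeps the repeated "text" part in one place, which may make later changes to the rules a little less error-prone.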
We rarely see somebody here who can define positive and negative test cases. That really makes life easier.
Here's my regex with a 95% solution:
"(([a-z]+|\\*)[:,-])*([a-z]+|\\*)" (JAVA-Version)
(([a-z]+|\*)[:,-])*([a-z]+|\*) (plain regex)
It simply differentiates between words (a-z or *) and separators (one of :-,); the string must contain at least one word, and words must be separated by a separator. It works for the positive cases and for the negative cases, except the last two negative ones.
One remark: such a complex "syntax" would in real life be implemented with a grammar definition tool like ANTLR (or a few years ago with lex/yacc, flex/bison). Regex can do that but will not be easy to maintain.

Need regex to separate comma separated values (Interface list from a router query output)

I have an input like this
RX Only : Gi1/0/15,Gi1/0/20,Gi1/0/17
I want to capture 1/0/15, 1/0/20, 1/0/17 from this. But this input changes. Sometimes there are only 2 comma separated values, sometimes 1 sometimes more than 3.
The regex I came up with only captures the first group. If I use the non-greedy operator, it captures the last one. What regex should I use to capture all these groups separately?
The language used would be Java.
It's often easier to just write the regex for the substrings you are interested in and then repeatedly call Matcher.find(), as opposed to trying to write a regex that matches the entire string and pulling what you want from a complex arrangement of groups.
Assuming what you are looking for are groups of three numbers separated by "/", then:
Pattern p = Pattern.compile("\\d+/\\d+/\\d+");
Matcher m = p.matcher(inputString);
while (m.find()) {
    // your triple is in group 0
    System.out.println(m.group(0));
}
Give a man a fish ... or
http://gskinner.com/RegExr/
Do you really have to use regex here? If the data formats are quite similar, you can just use the indexOf function combined with substring. You will have to find the : character and then look for commas starting from the next character. Then you check the position of \n and use the smaller index in order to retrieve a substring.
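A rough sketch of that indexOf/substring idea, under a couple of assumptions that are not in the question: every entry is prefixed with "Gi", and the list ends at the first line break or at the end of the string. The class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

final class InterfaceList {
    static List<String> parse(String line) {
        List<String> result = new ArrayList<>();
        int start = line.indexOf(':') + 1;      // first character after the colon
        int end = line.indexOf('\n', start);    // stop at the line break, if any
        if (end < 0) {
            end = line.length();
        }
        while (start < end) {
            int comma = line.indexOf(',', start);
            // use whichever comes first: the next comma or the end of the line
            int stop = (comma < 0 || comma > end) ? end : comma;
            String item = line.substring(start, stop).trim();
            if (item.startsWith("Gi")) {        // assumed prefix, dropped from the result
                item = item.substring(2);
            }
            if (!item.isEmpty()) {
                result.add(item);
            }
            start = stop + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1/0/15, 1/0/20, 1/0/17]
        System.out.println(parse("RX Only : Gi1/0/15,Gi1/0/20,Gi1/0/17"));
    }
}
```

This works for any number of comma separated values, but it is more tied to the exact layout of the line than the Matcher.find() approach.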

Possible Regular Expression Question

I have a simple program that looks up details of an IP you give it, and I will show you an example of some of my code
int regIndex = src.indexOf("Region:") + 16;
int endIndex = src.indexOf("<", regIndex);
String region = src.substring(regIndex, endIndex);
if(regIndex == 15) region = "None";
int counIndex = src.indexOf("Country:") + 17;
int couneIndex = src.indexOf(" <", counIndex);
String country = src.substring(counIndex, couneIndex);
As you can see, it is definitely not the most efficient way to do this. The website I am using gives the information like this: http://whatismyipaddress.com/ip/1.1.1.1
I have never really used Regular Expressions before, but it seems to me like there might be one that could really make this more efficient and easier to program, but I've been looking around and I'm pretty lost.
So basically my question is, how could I use a Regular Expression for this (Or if there is another more efficient way).
Any help would be great,
Thanks :)
You can do something like this:
String s = "bla Country: Australia <bla";
Pattern pattern = Pattern.compile("Country: (.*) [<]");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println("Country = " + matcher.group(1));
}
The source would look like this
<tr><th>Country:</th><td>Australia <img src="http://whatismyipaddress.com/images/flags/au.png" alt="au flag"> </td></tr>
Using a regular expression means matching a pattern.
The pattern that marks the data you want is pretty straightforward: Country:. You also need to match the tags that follow, written here as <\/th><td>. (Escaping the forward slash is needed in languages that delimit regexes with slashes; in Java it is harmless but not required.) Then comes the data you are looking for. I would suggest matching everything that is not a <, so [^<]; this is a negated character class, meaning any character that is not a <. To repeat it, add a + at the end, meaning at least one of the preceding character class.
So, the complete thing should look like this:
Country:<\/th><td>\s*([^<]+)\s*<
I also added parentheses; they put the matched text into a capturing group, so your result can be found in capturing group 1. I also added \s*, a whitespace character repeated 0 or more times, to match whitespace before or after your data, which I assume you don't need. Note that because [^<]+ is greedy, it can still pick up whitespace just before the <, so trim the captured value if that matters.
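In Java this could look roughly like the sketch below. The class and method names are invented, the "None" fallback simply mirrors the one in the question's code, and trim() is added because the greedy [^<]+ also captures the space before the following tag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CountryField {
    // Same pattern as above; the forward slash is left unescaped, which Java allows.
    private static final Pattern COUNTRY =
            Pattern.compile("Country:</th><td>\\s*([^<]+)\\s*<");

    static String country(String html) {
        Matcher m = COUNTRY.matcher(html);
        // group(1) is the captured text between <td> and the next '<'
        return m.find() ? m.group(1).trim() : "None";
    }

    public static void main(String[] args) {
        String row = "<tr><th>Country:</th><td>Australia <img src=\"x.png\"> </td></tr>";
        System.out.println(country(row)); // prints Australia
    }
}
```

The same shape should work for the Region field by swapping the label, assuming that row is laid out the same way in the page source.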
Firstly, there are some online sites that can help you develop a regular expression. They let you enter some text and a regular expression, and then show you the result of applying the expression to the text. This saves you having to write code as you develop the expression and expand your understanding. A good site I use a lot is FileFormat regex, because it allows me to test one expression against multiple test strings. A quick search also brought up regex Planet, RegExr and RegexPal. There are lots of others.
In terms of resources, the Java Pattern class reference is useful for Java development and I quite like regular-expression.info as well.
For your problem I used fileFormat.info and came up with this regex to match "http://whatismyipaddress.com/ip/1.1.1.1":
.*//([.\w]+)/.*/(\d+(?:\.\d+){3})
or as a Java string:
".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})"
A quick breakdown: anything (.*), followed by two slashes (//), followed by one or more dots or word characters (([.\w]+)), followed by a slash, any number of characters and another slash (/.*/), followed by at least one digit (\d+), followed by three sets of a dot and at least one digit ((?:\.\d+){3}). The dot in the IP part is escaped so that it matches only a literal dot. The sets of parentheses around the server name part and the IP part are called capturing groups, and you can use methods on the Java Matcher class to return the contents of these sections. The ?: in the second part of the IP address says that the parentheses are used to group the characters but are not to be treated as a capturing group.
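As an illustration of reading the two capturing groups with Matcher, here is a small sketch. The class and method names are hypothetical, and the dots in the IP part are escaped as \\. so they match only a literal dot:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IpUrl {
    private static final Pattern URL =
            Pattern.compile(".*//([.\\w]+)/.*/(\\d+(?:\\.\\d+){3})");

    // Returns {host, ip}, or null when the URL does not have the expected shape.
    static String[] hostAndIp(String url) {
        Matcher m = URL.matcher(url);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = hostAndIp("http://whatismyipaddress.com/ip/1.1.1.1");
        System.out.println(parts[0]); // prints whatismyipaddress.com
        System.out.println(parts[1]); // prints 1.1.1.1
    }
}
```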
This regex is not as strict or as flexible as it should be, but it's a starting point.
All of this can be researched on the above links.

Regex to detect number within String

I'm confronted with a String:
[something] -number OR number [something]
I want to be able to cast the number. I do not know at which position it occurs. I cannot build a sub-string because there's no obvious separator.
Is there any method how I could extract the number from the String by matching a pattern like
[-]?[0..9]+
, where the minus is optional? The String can contain special characters, which actually drives me crazy defining a regex.
-?\b\d+\b
That's broken down by:
-? (optional minus sign)
\b word boundary
\d+ 1 or more digits
[EDIT 2] - nod to Alan Moore
Unfortunately Java doesn't have verbatim strings, so you'll have to escape the regex above as:
String regex = "-?\\b\\d+\\b";
I'd also recommend a site like http://regexlib.com/RETester.aspx or a program like Expresso to help you test and design your regular expressions
[EDIT] - after some good comments
I haven't used something like .*?(-?\d+).* (from @Voo) because I wasn't sure whether you wanted to match the entire string or just the digits. Both versions will tell you whether there are digits in the string; if you want the actual digits, use the first regex and look at group 0. There are clever ways to name groups or use multiple captures, but that would be a complicated answer to a straightforward question...
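Put together, a minimal sketch of pulling the number out with the first regex might look like this. The class and method names are made up, and it assumes the number fits into an int (Integer.parseInt throws NumberFormatException otherwise):

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberFinder {
    private static final Pattern NUMBER = Pattern.compile("-?\\b\\d+\\b");

    // Returns the first standalone (optionally negative) number in the text, if any.
    static OptionalInt firstNumber(String text) {
        Matcher m = NUMBER.matcher(text);
        // group() with no argument is group 0, i.e. the whole match
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group())) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("[something] -42").getAsInt()); // prints -42
        System.out.println(firstNumber("17 [something]").getAsInt());  // prints 17
    }
}
```

Because of the word boundaries, digits glued to letters (as in "abc123") are not picked up, which may or may not be what you want.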
