regex to match and retrieve some tokens

The strings I am interested in look something like the following:
a1.foo, a2.bar, a3.whatever
Now I need to retrieve the number.
So I wrote this piece of code (in Java), thinking it would work, but it does not.
Could anyone please let me know what is wrong with my pattern?
final String testInput = "a2.foo";
Pattern p = Pattern.compile("a(\\d*)\\.([^\\w])");
Matcher matcher = p.matcher(testInput);
if (matcher.find())
{
System.out.println("n = " + matcher.group(1));
}
else
{
System.out.println("NOT MATCHED");
}
This prints NOT MATCHED, while I expected it to print 2

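Your regex fails because ([^\\w]) matches exactly one non-word character, while in your input the dot is followed by word characters (foo). You probably wanted one or more word characters, hence (\\w+).
Alternatively, you can use a lookahead that only checks that a dot follows the digits: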
Pattern.compile("a(\\d*)(?=\\.)");
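Both fixes can be tried side by side. This is only a minimal sketch, assuming every input has the form a<digits>.<word>; the class name TokenNumber and the helper extractNumber are illustrative names, not from the question. It also uses \\d+ rather than \\d*, on the assumption that at least one digit is required.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenNumber {
    // Fix 1: require one or more word characters after the dot.
    static final Pattern WITH_GROUPS = Pattern.compile("a(\\d+)\\.(\\w+)");
    // Fix 2: only assert that a dot follows the digits, without consuming it.
    static final Pattern WITH_LOOKAHEAD = Pattern.compile("a(\\d+)(?=\\.)");

    // Returns the digits after 'a', or null when the pattern is not found.
    static String extractNumber(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        for (String input : new String[] {"a1.foo", "a2.bar", "a3.whatever"}) {
            System.out.println(input + " -> n = " + extractNumber(WITH_GROUPS, input));
        }
        System.out.println("lookahead: n = " + extractNumber(WITH_LOOKAHEAD, "a2.foo"));
    }
}
```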

Related

How to parse string using regex

I'm pretty new to java, trying to find a way to do this better. Potentially using a regex.
String text = test.get(i).toString()
// text looks like this in string form:
// EnumOption[enumId=test,id=machine]
String checker = text.replace("[","").replace("]","").split(",")[1].split("=")[1];
// checker becomes machine
My goal is to parse that text string and just return back machine. Which is what I did in the code above.
But that looks ugly. I was wondering what kinda regex can be used here to make this a little better? Or maybe another suggestion?

Use a regex lookbehind:
(?<=\bid=)[^],]*
See Regex101.
(?<= ) // Start matching only after what matches inside
\bid= // Match "\bid=" (= word boundary then "id="),
[^],]* // Match and keep the longest sequence without any ']' or ','
In Java, use it like this:
import java.util.regex.*;
class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<=\\bid=)[^],]*");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(0));
}
}
}
This results in
machine

Assuming you’re using the Polarion ALM API, you should use the EnumOption’s getId method instead of deparsing and re-parsing the value via a string:
String id = test.get(i).getId();

Using the replace and split functions doesn't take the structure of the data into account.
If you want to use a regex, you can use a capturing group without any lookarounds, where the value of enumId can be anything except =, a comma, or ], and the value of id can be anything except ].
The value of id will be in capture group 1.
\bEnumOption\[enumId=[^=,\]]+,id=([^\]]+)\]
Explanation
\bEnumOption Match EnumOption preceded by a word boundary
\[enumId= Match [enumId=
[^=,\]]+, Match 1+ times any char except = , and ]
id= Match literally
( Capture group 1
[^\]]+ Match 1+ times any char except ]
) Close group 1
\] Match ]
Pattern pattern = Pattern.compile("\\bEnumOption\\[enumId=[^=,\\]]+,id=([^\\]]+)\\]");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output
machine
If there can be more comma separated values, you could also only match id making use of negated character classes [^][]* before and after matching id to stay inside the square bracket boundaries.
\bEnumOption\[[^][]*\bid=([^,\]]+)[^][]*\]
In Java
String regex = "\\bEnumOption\\[[^][]*\\bid=([^,\\]]+)[^][]*\\]";

A regex can of course be used, but sometimes is less performant, less readable and more bug-prone.
I would advise you not use any regex that you did not come up with yourself, or at least understand completely.
PS: I think your solution is actually quite readable.
Here's another non-regex version:
String text = "EnumOption[enumId=test,id=machine]";
text = text.substring(text.lastIndexOf('=') + 1);
text = text.substring(0, text.length() - 1);
Not doing you a favor, but the downvote hurt, so here you go:
String input = "EnumOption[enumId=test,id=machine]";
Matcher matcher = Pattern.compile("EnumOption\\[enumId=(.+),id=(.+)\\]").matcher(input);
if(!matcher.matches()) {
throw new RuntimeException("unexpected input: " + input);
}
System.out.println("enumId: " + matcher.group(1));
System.out.println("id: " + matcher.group(2));
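For a non-regex route that does not depend on id being the last field, the bracketed text can be split into key/value pairs. This is a rough sketch, assuming values never contain commas or square brackets; EnumOptionText and parseFields are made-up names for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EnumOptionText {
    // Parses "Name[key=value,key=value]" into an ordered key/value map.
    // Assumption: values contain no ',' '[' or ']' characters.
    static Map<String, String> parseFields(String text) {
        int open = text.indexOf('[');
        int close = text.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("unexpected input: " + text);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : text.substring(open + 1, close).split(",")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("unexpected field: " + pair);
            }
            // Everything before the first '=' is the key, the rest is the value.
            fields.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parseFields("EnumOption[enumId=test,id=machine]");
        System.out.println(fields.get("id"));
    }
}
```

Looking fields up by name keeps working if the order of the fields changes or more fields are added.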

Regex for extracting a string between a word and new line character in java [duplicate]

I'm new to using regex. I've been going through a rake of tutorials, but I haven't found one that applies to what I want to do.
I want to search for something and return everything following it, but not the search string itself.
e.g. "Some lame sentence that is awesome"
search for "sentence"
return "that is awesome"
Any help would be much appreciated
This is my regex so far
sentence(.*)
but it returns: sentence that is awesome
Pattern pattern = Pattern.compile("sentence(.*)");
Matcher matcher = pattern.matcher("some lame sentence that is awesome");
boolean found = false;
while (matcher.find())
{
System.out.println("I found the text: " + matcher.group().toString());
found = true;
}
if (!found)
{
System.out.println("I didn't find the text");
}

You can do this with "just the regular expression" as you asked for in a comment:
(?<=sentence).*
(?<=sentence) is a positive lookbehind assertion. This matches at a certain position in the string, namely at a position right after the text sentence without making that text itself part of the match. Consequently, (?<=sentence).* will match any text after sentence.
This is quite a nice feature of regex. However, in Java it only works for lookbehind subexpressions of bounded length, i.e. (?<=sentence|word|(foo){1,4}) is legal, but (?<=sentence\s*) isn't.
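In Java, the lookbehind version could be used roughly as follows. This is a sketch only; AfterKeyword and textAfter are hypothetical names. Since . does not match line terminators by default, the match stops at the end of the line, and the leading space after the keyword is part of the match unless it is trimmed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterKeyword {
    // Matches the rest of the line that follows the literal text "sentence".
    static final Pattern AFTER_SENTENCE = Pattern.compile("(?<=sentence).*");

    // Returns the text after the keyword (untrimmed), or null if the keyword is absent.
    static String textAfter(String input) {
        Matcher matcher = AFTER_SENTENCE.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String rest = textAfter("Some lame sentence that is awesome");
        System.out.println(rest == null ? "not found" : rest.trim());
    }
}
```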

Your regex "sentence(.*)" is right. To retrieve the contents of the group in parenthesis, you would call:
Pattern p = Pattern.compile( "sentence(.*)" );
Matcher m = p.matcher( "some lame sentence that is awesome" );
if ( m.find() ) {
String s = m.group(1); // " that is awesome"
}
Note the use of m.find() in this case (attempts to find anywhere on the string) and not m.matches() (would fail because of the prefix "some lame"; in this case the regex would need to be ".*sentence(.*)")

If the Matcher is initialized with str, then after a successful match you can get the part after the match with
str.substring(matcher.end())
Sample Code:
final String str = "Some lame sentence that is awesome";
final Matcher matcher = Pattern.compile("sentence").matcher(str);
if(matcher.find()){
System.out.println(str.substring(matcher.end()).trim());
}
Output:
that is awesome

You need to use the group(int) of your matcher - group(0) is the entire match, and group(1) is the first group you marked. In the example you specify, group(1) is what comes after "sentence".

You just need to use group(1) instead of group() in the following line, and the output will be what you expected:
System.out.println("I found the text: " + matcher.group(1));

Java extract only first letters/characters from String

Hello guys I want to extract only first letters from this String:
String str = "使 徒 行 傳 16:31 ERV-ZH";
I only want to get these characters:
使 徒 行 傳
and not include
ERV-ZH
Only the letters or characters before the numbers plus the colon.
Note that the leading characters are not always Chinese; they can also be English or other letters.
this is what I've tried:
str.split(" ")[0];
But I'm only getting the first letter. Do you have an idea how to achieve my requirement? Any help will be appreciated. Thanks.
NOTE:
Also, strings are dynamic so I only presented sample characters.

This should give you the desired output
String str = "使 徒 行 傳 16:31 ERV-ZH";
String[] test = str.split("\\d\\d:\\d\\d");
for (String s : test) {
System.out.println(s);
}
The first element will be the part before the time and so on
Edit: if you also need to handle times like 6:31 or 16:6, you could use this regex instead: "\\d{1,2}:\\d{1,2}"

You can use the regex ^([\\D\\s]+) to get what you need:
String str = "使 徒 行 傳 16:31 ERV-ZH";
String pattern = "^([\\D\\s]+)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(str);
if (m.find( )) {
System.out.println("Found value: " + m.group(0) );
} else {
System.out.println("NO MATCH");
}
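In the regex ^([\\D\\s]+):
^ anchors the match at the beginning of the string.
\\D matches any character that is not a digit, so the match stops at the first digit.
\\s matches whitespace; it is redundant here, because \\D already covers whitespace.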
Note that this will be the case for any string.

If you don't always have a date pattern that can be used as a delimiter in the middle, and are looking for a more generic solution, you could go with this: str.replaceAll("[^\\p{L}\\s]+.*", "")
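A small sketch of that replaceAll approach, assuming the input is a single line and the wanted prefix consists only of letters and whitespace. LeadingLetters and leadingLetters are illustrative names, the second input is a hypothetical English example, and the first is the string from the question written with Unicode escapes so the source stays ASCII-only.

```java
class LeadingLetters {
    // Removes everything from the first character that is neither a
    // letter (\p{L}) nor whitespace, then trims the trailing space.
    static String leadingLetters(String input) {
        return input.replaceAll("[^\\p{L}\\s]+.*", "").trim();
    }

    public static void main(String[] args) {
        // The question's sample string, written with Unicode escapes.
        System.out.println(leadingLetters("\u4F7F \u5F92 \u884C \u50B3 16:31 ERV-ZH"));
        // Hypothetical English input.
        System.out.println(leadingLetters("John 3:16 KJV"));
    }
}
```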

How do I find multiple substrings from one string using regex in Java?

I want to find every instance of a number, followed by a comma (no space), followed by any number of characters in a string. I was able to get a regex to find all the instances of what I was looking for, but I want to print them individually rather than all together. I'm new to regex in general, so maybe my pattern is wrong?
This is my code:
String test = "1 2,A 3,B 4,23";
Pattern p = Pattern.compile("\\d+,.+");
Matcher m = p.matcher(test);
while(m.find()) {
System.out.println("found: " + m.group());
}
This is what it prints:
found: 2,A 3,B 4,23
This is what I want it to print:
found: 2,A
found: 3,B
found: 4,23
Thanks in advance!

Try this regex, which stops each match lazily at the next space or at the end of the string:
Pattern p = Pattern.compile("\\d+,.+?(?= |$)");

You could take an easier route and split by space, then ignore anything without a comma:
String[] values = test.split(" ");
for (String value : values) {
if (value.contains(",")) {
System.out.println("found: " + value);
}
}

What you apparently left out of your requirements statement is where "any number of characters" is supposed to end. As it stands, it ends at the end of the string; from your sample output, it seems you want it to end at the first space.
Try this pattern: "\\d+,[^\\s]*"
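Plugged into the loop from the question, that pattern could be used like this. It is a sketch that assumes each item ends at the first whitespace character; \\S* is used as an equivalent shorthand for [^\\s]*, and NumberCommaTokens and findTokens are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberCommaTokens {
    // One or more digits, a comma, then everything up to the next whitespace.
    static final Pattern TOKEN = Pattern.compile("\\d+,\\S*");

    // Collects every match as a separate string.
    static List<String> findTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : findTokens("1 2,A 3,B 4,23")) {
            System.out.println("found: " + token);
        }
    }
}
```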

Regular expression matching "dictionary words"

I'm a Java user but I'm new to regular expressions.
I just want to have a tiny expression that, given a word (we assume that the string is only one word), answers with a boolean, telling if the word is valid or not.
An example: I want to catch all words that could plausibly be in a dictionary. So I just want words with characters from a-z and A-Z, a hyphen (for example: man-in-the-middle) and an apostrophe (like I'll or Tiffany's).
Valid words:
"food"
"RocKet"
"man-in-the-middle"
"kahsdkjhsakdhakjsd"
"JESUS", etc.
Non-valid words:
"gipsy76"
"www.google.com"
"me#gmail.com"
"745474"
"+-x/", etc.
I use this code, but it doesn't give the correct answer:
Pattern p = Pattern.compile("[A-Za-z&-&']");
Matcher m = p.matcher(s);
System.out.println(m.matches());
What's wrong with my regex?

Add a + after the character class to say "one or more of those characters".
Escape the hyphen with \ (or put it last).
Remove those & characters.
Here's the code:
Pattern p = Pattern.compile("[A-Za-z'-]+");
Matcher m = p.matcher(s);
System.out.println(m.matches());
Complete test:
String[] ok = {"food","RocKet","man-in-the-middle","kahsdkjhsakdhakjsd","JESUS"};
String[] notOk = {"gipsy76", "www.google.com", "me#gmail.com", "745474","+-x/" };
Pattern p = Pattern.compile("[A-Za-z'-]+");
for (String shouldMatch : ok)
if (!p.matcher(shouldMatch).matches())
System.out.println("Error on: " + shouldMatch);
for (String shouldNotMatch : notOk)
if (p.matcher(shouldNotMatch).matches())
System.out.println("Error on: " + shouldNotMatch);
(Produces no output.)

This should work:
"[A-Za-z'-]+"

But "-word" and "word-" are not valid words, so you can use this pattern instead:
WORD_EXP = "^[A-Za-z]+(-[A-Za-z]+)*$"

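Regex - /^([a-zA-Z]*('|-)?[a-zA-Z]+)*$/
You can use the regex above if you don't want consecutive ' or - characters. The trailing $ (or Matcher.matches() in Java) is needed so that the whole word is checked rather than just a prefix.
It then matches your text accurately.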
It accepts
man-in-the-middle
asd'asdasd'asd
It rejects following string
man--in--midle
asdasd''asd
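The same rule (letters only, with single hyphens or apostrophes allowed between letter runs) can also be written without nested optional parts. This is one possible sketch rather than the definitive pattern, and DictionaryWord and isValid are hypothetical names. It relies on Matcher.matches(), which requires the whole input to match, so no ^ or $ anchors are needed.

```java
import java.util.regex.Pattern;

class DictionaryWord {
    // A run of letters, optionally followed by groups of one separator
    // (' or -) plus another run of letters.
    static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:['-][A-Za-z]+)*");

    // True only if the entire string has the shape described above.
    static boolean isValid(String word) {
        return WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"food", "man-in-the-middle", "Tiffany's",
                            "man--in--midle", "asdasd''asd", "-word", "gipsy76"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValid(sample));
        }
    }
}
```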

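Hi Aloob, please check this one. It's a bit lengthy and there might be a shorter version, but still:
[A-z]*||[[A-z]*[-]*]*||[[A-z]*[-]*[']*]*
Be aware that [A-z] also matches the ASCII characters between Z and a ([, \, ], ^, _ and `), so [A-Za-z] is the safer character class.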

