Regular expression matching "dictionary words" - java

I'm a Java user, but I'm new to regular expressions.
I just want a small expression that, given a word (assume the string is only one word), returns a boolean telling whether the word is valid.
For example, I want to accept every word that could plausibly appear in a dictionary. So I only want words made of the characters a-z and A-Z, hyphens (for example: man-in-the-middle) and apostrophes (like I'll or Tiffany's).
Valid words:
"food"
"RocKet"
"man-in-the-middle"
"kahsdkjhsakdhakjsd"
"JESUS", etc.
Non-valid words:
"gipsy76"
"www.google.com"
"me#gmail.com"
"745474"
"+-x/", etc.
I use this code, but it doesn't give the correct answer:
Pattern p = Pattern.compile("[A-Za-z&-&']");
Matcher m = p.matcher(s);
System.out.println(m.matches());
What's wrong with my regex?

Add a + after the character class to say "one or more of those characters".
Escape the hyphen with \ (or put it last in the class).
Remove those & characters.
Here's the code:
Pattern p = Pattern.compile("[A-Za-z'-]+");
Matcher m = p.matcher(s);
System.out.println(m.matches());
Complete test:
String[] ok = {"food", "RocKet", "man-in-the-middle", "kahsdkjhsakdhakjsd", "JESUS"};
String[] notOk = {"gipsy76", "www.google.com", "me#gmail.com", "745474", "+-x/"};
Pattern p = Pattern.compile("[A-Za-z'-]+");
for (String shouldMatch : ok)
    if (!p.matcher(shouldMatch).matches())
        System.out.println("Error on: " + shouldMatch);
for (String shouldNotMatch : notOk)
    if (p.matcher(shouldNotMatch).matches())
        System.out.println("Error on: " + shouldNotMatch);
(Produces no output.)

This should work:
"[A-Za-z'-]+"

But "-word" and "word-" are not valid words, so you can use this pattern instead (note that it does not allow apostrophes):
WORD_EXP = "^[A-Za-z]+(-[A-Za-z]+)*$"
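If apostrophes should be allowed as well (the question mentions I'll and Tiffany's), the same idea can be extended. The following is only a minimal sketch: the class name WordCheck and the method isWord are made up for illustration, and the rule "runs of letters joined by single hyphens or apostrophes" is an assumption about what counts as a dictionary word.

```java
import java.util.regex.Pattern;

final class WordCheck {
    // Runs of letters joined by a single hyphen or apostrophe.
    // No ^ or $ is needed because matches() always tests the whole input.
    private static final Pattern WORD =
            Pattern.compile("[A-Za-z]+(?:['-][A-Za-z]+)*");

    // Hypothetical helper: true when s looks like a plausible word.
    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"food", "man-in-the-middle", "Tiffany's", "-word", "word-", "gipsy76"};
        for (String s : samples) {
            System.out.println(s + " -> " + isWord(s));
        }
    }
}
```

Because every separator must be followed by at least one letter, leading, trailing and doubled separators are rejected without any extra checks.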

Regex: /^([a-zA-Z]*('|-)?[a-zA-Z]+)*$/
You can use the regex above if you don't want successive "'" or "-" characters.
It matches your text more accurately.
It accepts
man-in-the-middle
asd'asdasd'asd
It rejects the following strings
man--in--midle
asdasd''asd

Hi Aloob, please check this one. It is a bit lengthy and there may be a shorter version, but still:
[A-z]*||[[A-z]*[-]*]*||[[A-z]*[-]*[']*]*

Related

How to parse string using regex

I'm pretty new to java, trying to find a way to do this better. Potentially using a regex.
String text = test.get(i).toString();
// text looks like this in string form:
// EnumOption[enumId=test,id=machine]
String checker = text.replace("[","").replace("]","").split(",")[1].split("=")[1];
// checker becomes machine
My goal is to parse that text string and just return back machine. Which is what I did in the code above.
But that looks ugly. I was wondering what kinda regex can be used here to make this a little better? Or maybe another suggestion?
Use a regex lookbehind:
(?<=\bid=)[^],]*
See Regex101.
(?<= ) // Start matching only after what matches inside
\bid= // Match "\bid=" (= word boundary then "id="),
[^],]* // Match and keep the longest sequence without any ']' or ','
In Java, use it like this:
import java.util.regex.*;
class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(?<=\\bid=)[^],]*");
        Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
        if (matcher.find()) {
            System.out.println(matcher.group(0));
        }
    }
}
This results in
machine
Assuming you’re using the Polarion ALM API, you should use the EnumOption’s getId method instead of deparsing and re-parsing the value via a string:
String id = test.get(i).getId();
Using the replace and split functions doesn't take the structure of the data into account.
If you want to use a regex, you can just use a capturing group without any lookarounds, where enum can be any value except a ] and comma, and id can be any value except ].
The value of id will be in capture group 1.
\bEnumOption\[enumId=[^=,\]]+,id=([^\]]+)\]
Explanation
\bEnumOption Match EnumOption preceded by a word boundary
\[enumId= Match [enumId=
[^=,\]]+, Match 1+ times any char except = , and ] followed by a comma
id= Match literally
( Capture group 1
[^\]]+ Match 1+ times any char except ]
)\] Close group 1 and match ]
Regex demo | Java demo
Pattern pattern = Pattern.compile("\\bEnumOption\\[enumId=[^=,\\]]+,id=([^\\]]+)\\]");
Matcher matcher = pattern.matcher("EnumOption[enumId=test,id=machine]");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output
machine
If there can be more comma separated values, you could also only match id making use of negated character classes [^][]* before and after matching id to stay inside the square bracket boundaries.
\bEnumOption\[[^][]*\bid=([^,\]]+)[^][]*\]
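In Java (with the brackets inside the character classes escaped, because Java treats an unescaped [ inside a class as the start of a nested class)
String regex = "\\bEnumOption\\[[^\\]\\[]*\\bid=([^,\\]]+)[^\\]\\[]*\\]";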
Regex demo
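As a rough illustration of how the multi-value pattern could be wrapped for reuse, here is a small sketch. The class name EnumOptionIds and the method extractId are hypothetical, and both brackets are escaped inside the character classes so the pattern does not depend on how Java parses an unescaped [ or ] within a class.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EnumOptionIds {
    // Stay inside one [...] section and capture the value that follows "id=".
    private static final Pattern ID = Pattern.compile(
            "\\bEnumOption\\[[^\\]\\[]*\\bid=([^,\\]]+)[^\\]\\[]*\\]");

    // Hypothetical helper: returns the id value, or empty when there is none.
    static Optional<String> extractId(String text) {
        Matcher m = ID.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractId("EnumOption[enumId=test,id=machine]").orElse("<none>"));
        System.out.println(extractId("EnumOption[enumId=test,id=machine,name=x]").orElse("<none>"));
    }
}
```

Returning an Optional keeps the "no id present" case explicit instead of relying on a null check at the call site.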
A regex can of course be used, but it is sometimes less performant, less readable and more bug-prone.
I would advise you not to use any regex that you did not come up with yourself, or at least understand completely.
PS: I think your solution is actually quite readable.
Here's another non-regex version:
String text = "EnumOption[enumId=test,id=machine]";
text = text.substring(text.lastIndexOf('=') + 1);
text = text.substring(0, text.length() - 1);
Not doing you a favor, but the downvote hurt, so here you go:
String input = "EnumOption[enumId=test,id=machine]";
Matcher matcher = Pattern.compile("EnumOption\\[enumId=(.+),id=(.+)\\]").matcher(input);
if(!matcher.matches()) {
throw new RuntimeException("unexpected input: " + input);
}
System.out.println("enumId: " + matcher.group(1));
System.out.println("id: " + matcher.group(2));

regex to match and retrieve some tokens

The strings I am interested in look something like the following:
a1.foo, a2.bar, a3.whatever
Now I need to retrieve the number.
So I wrote this piece of code (in Java), thinking it would work, but it does not.
Could anyone please let me know what is wrong with my pattern?
final String testInput = "a2.foo";
Pattern p = Pattern.compile("a(\\d*)\\.([^\\w])");
Matcher matcher = p.matcher(testInput);
if (matcher.find())
{
System.out.println("n = " + matcher.group(1));
}
else
{
System.out.println("NOT MATCHED");
}
This prints NOT MATCHED, while I expected it to print 2
Your regex is wrong because ([^\\w]) matches exactly one non-word character, and the f in foo is a word character. You probably wanted one or more word characters, hence (\\w+).
However you can use this lookahead:
Pattern.compile("a(\\d*)(?=\\.)");
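For comparison, the (\\w+) correction can be sketched as a small helper. The names TokenNumber and numberOf are invented for this example, and \\d+ is used instead of \\d* on the assumption that a token without digits should be rejected rather than produce an empty number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenNumber {
    // "a", one or more digits, a literal dot, then one or more word characters.
    private static final Pattern TOKEN = Pattern.compile("a(\\d+)\\.(\\w+)");

    // Hypothetical helper: returns the number embedded in tokens like "a2.foo".
    static int numberOf(String input) {
        Matcher m = TOKEN.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected input: " + input);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a1.foo", "a2.bar", "a3.whatever"}) {
            System.out.println("n = " + numberOf(s));
        }
    }
}
```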

How do I find multiple substrings from one string using regex in Java?

I want to find every instance of a number, followed by a comma (no space), followed by any number of characters in a string. I was able to get a regex to find all the instances of what I was looking for, but I want to print them individually rather than all together. I'm new to regex in general, so maybe my pattern is wrong?
This is my code:
String test = "1 2,A 3,B 4,23";
Pattern p = Pattern.compile("\\d+,.+");
Matcher m = p.matcher(test);
while(m.find()) {
System.out.println("found: " + m.group());
}
This is what it prints:
found: 2,A 3,B 4,23
This is what I want it to print:
found: 2,A
found: 3,B
found: 4,23
Thanks in advance!
try this regex
Pattern p = Pattern.compile("\\d+,.+?(?= |$)");
You could take an easier route and split by space, then ignore anything without a comma:
String[] values = test.split(" ");
for (String value : values) {
    if (value.contains(",")) {
        System.out.println("found: " + value);
    }
}
What you apparently left out of your requirements statement is where "any number of characters" is supposed to end. As it stands, it ends at the end of the string; from your sample output, it seems you want it to end at the first space.
Try this pattern: "\\d+,[^\\s]*"
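A short sketch of that last pattern in use, collecting each match separately. The class PairFinder and method findPairs are hypothetical names, and \\S is used as shorthand for [^\\s].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PairFinder {
    // Digits, a comma, then everything up to the next whitespace.
    private static final Pattern PAIR = Pattern.compile("\\d+,\\S*");

    // Hypothetical helper: returns every "number,rest" token found in text.
    static List<String> findPairs(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String pair : findPairs("1 2,A 3,B 4,23")) {
            System.out.println("found: " + pair);
        }
    }
}
```

Each call to find() resumes after the previous match, which is why stopping at whitespace yields one result per token rather than one long match.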

Regular expression to match unescaped special characters only

I'm trying to come up with a regular expression that can match only characters not preceded by a special escape sequence in a string.
For instance, in the string "Is ? stranded//?", I want to be able to replace the ? which hasn't been escaped with another string, so that I get this result: "Is Dave stranded?"
But for the life of me I have not been able to figure out a way. I have only come up with regular expressions that eat all the replaceable characters.
How do you construct a regular expression that matches only characters not preceded by an escape sequence?
Use a negative lookbehind, it's what they were designed to do!
(?<!//)[?]
To break it down:
(?<! #Start of the negative lookbehind. It checks that the text below does not immediately precede the current position.
// #The slashes you are trying to avoid.
) #End of the lookbehind.
[?] #Your special character list.
Only if the // cannot be found, it will progress with the rest of the search.
I think in Java it will need to be escaped again as a string something like:
Pattern p = Pattern.compile("(?<!//)[\\?]");
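A minimal sketch of the lookbehind used for the actual replacement follows. The class UnescapedReplace and method fill are made-up names, and it assumes the replacement value itself never contains the //? escape sequence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnescapedReplace {
    // A question mark that is not immediately preceded by "//".
    private static final Pattern UNESCAPED_Q = Pattern.compile("(?<!//)\\?");

    // Hypothetical helper: fills every unescaped ? with value, then unescapes "//?".
    static String fill(String template, String value) {
        String replaced = UNESCAPED_Q.matcher(template)
                // quoteReplacement keeps $ and \ in value from being treated as group references
                .replaceAll(Matcher.quoteReplacement(value));
        return replaced.replace("//?", "?");
    }

    public static void main(String[] args) {
        System.out.println(fill("Is ? stranded//?", "Dave"));
    }
}
```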
Try this Java code:
String str = "Is ? stranded//?";
Pattern p = Pattern.compile("(?<!//)([?])");
Matcher m = p.matcher(str);
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, m.group(1).replace("?", "Dave"));
}
m.appendTail(sb);
String s = sb.toString().replace("//", "");
System.out.println("Output: " + s);
OUTPUT
Output: Is Dave stranded?
I was thinking about this and have a second, simpler solution that avoids regexes. The other answers are probably better, but I thought I might post it anyway.
String input = "Is ? stranded//?";
String output = input
.replace("//?", "a717efbc-84a9-46bf-b1be-8a9fb714fce8")
.replace("?", "Dave")
.replace("a717efbc-84a9-46bf-b1be-8a9fb714fce8", "?");
Just protect the "//?" by replacing it with something unique (like a guid). Then you know any remaining question marks are fair game.
Use grouping. Here's one example:
import java.util.regex.*;
class Test {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("([^/][^/])(\\?)");
        String s = "Is ? stranded//?";
        Matcher m = p.matcher(s);
        String result = s;
        // find() checks for a match anywhere; matches() would require the whole string to match.
        if (m.find())
            result = m.replaceAll("$1XXX").replace("//", "");
        System.out.println(s + " -> " + result);
    }
}
Output:
$ java Test
Is ? stranded//? -> Is XXX stranded?
In this example, I'm:
first replacing any non-escaped ? with "XXX",
then, removing the "//" escape sequences.
EDIT Use if (m.find()) to ensure that you handle non-matching strings properly.
This is just a quick-and-dirty example. You need to flesh it out, obviously, to make it more robust. But it gets the general idea across.
Match on a set of characters OTHER than an escape sequence, then a regex special character. You could use an inverted character class ([^/]) for the first bit. Special case an unescaped regex character at the front of the string.
String aString = "Is ? stranded//?";
String regex = "(?<!//)[^a-z^A-Z^\\s^/]";
System.out.println(aString.replaceAll(regex, "Dave"));
The part of the regular expression [^a-z^A-Z^\\s^/] matches any character that is not a letter, a caret, whitespace or a forward slash.
The (?<!//) part does a negative lookbehind - see docco here for more info
This gives the output Is Dave stranded//?
try matching:
(^|(^.)|(.[^/])|([^/].))[special characters list]
I used this one:
((?:^|[^\\])(?:\\\\)*[ESCAPABLE CHARACTERS HERE])
Demo: https://regex101.com/r/zH1zO3/4

Author and time matching regex

I would like to use a regex in my Java program to recognize some features of my strings.
I have this type of string:
-Author- has wrote (-hh-:-mm-)
So, for example, I have a string like:
Cecco has wrote (15:12)
and I have to extract the author, hh and mm fields. Obviously I have some restrictions to consider:
hh and mm must be numbers
author doesn't have any restrictions
I have to consider the space between "has wrote" and (
How can I use a regex for this?
EDIT: I attach my snippet:
String mRegex = "(\\s)+ has wrote \\((\\d\\d):(\\d\\d)\\)";
Pattern mPattern = Pattern.compile(mRegex);
String[] str = {
"Cecco CQ has wrote (14:55)", //OK (matched)
"yesterday you has wrote that I'm crazy", //NO (different text)
"Simon has wrote (yesterday)", // NO (yesterday isn't numbers)
"John has wrote (22:32)", //OK
"James has wrote(22:11)", //NO (missed space between has wrote and ()
"Tommy has wrote (xx:ss)" //NO (xx and ss aren't numbers)
};
for(String s : str) {
Matcher mMatcher = mPattern.matcher(s);
while (mMatcher.find()) {
System.out.println(mMatcher.group());
}
}
homework?
Something like:
(.+) has wrote \((\d\d):(\d\d)\)
Should do the trick
() - mark groups to capture (there are three in the above)
.+ - any chars (you said no restrictions)
\d - any digit
\(\) escape the parens as literals instead of a capturing group
use:
Pattern p = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
To cope with an optional (HH:mm) at the end you need to start to use some dark regex voodoo:
Pattern p = Pattern.compile("(.+) has wrote\\s?(?:\\((\\d\\d):(\\d\\d)\\))?");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
m = p.matcher("Gareth has wrote");
if( m.matches()){
System.out.println(m.group(1));
// m.group(2) == null since it didn't match anything
}
The new unescaped pattern:
(.+) has wrote\s?(?:\((\d\d):(\d\d)\))?
\s? optionally matches a space (there might not be a space at the end if there isn't a (HH:mm) group)
(?: ... ) is a non-capturing group, i.e. it allows us to put ? after it to make it optional
I think #codinghorror has something to say about regex
The easiest way to figure out regular expressions is to use a testing tool before coding.
I use an eclipse plugin from http://www.brosinski.com/regex/
Using this I came up with the following result:
([a-zA-Z]*) has wrote \((\d\d):(\d\d)\)
Cecco has wrote (15:12)
Found 1 match(es):
start=0, end=23
Group(0) = Cecco has wrote (15:12)
Group(1) = Cecco
Group(2) = 15
Group(3) = 12
An excellent tutorial on regular expression syntax can be found at http://www.regular-expressions.info/tutorial.html
Well, just in case you didn't know, Matcher has a nice method, Matcher.group(int), that can pull out specific groups, i.e. the parts of the pattern enclosed by (). For example, if I wanted to match a number between two colons like:
:22:
I could use the regex ":(\\d+):" to match one or more digits between two colons, and then fetch just the digits with:
Matcher.group(1)
And then it's just a matter of parsing the String into an int. As a note, group numbering starts at 1; group(0) is the whole match, so Matcher.group(0) for the previous example would return :22:
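Applied to the "has wrote" lines from the question, the group(int) approach might look like the sketch below. WroteLine and parse are hypothetical names, the pattern is the one suggested in an earlier answer, and the author part is deliberately left unrestricted as the question asks.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WroteLine {
    private static final Pattern LINE =
            Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");

    // Hypothetical helper: returns {author, hh, mm} when the whole line matches, otherwise empty.
    static Optional<String[]> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2), m.group(3)});
    }

    public static void main(String[] args) {
        parse("Cecco has wrote (15:12)").ifPresent(parts -> {
            // The two-digit groups are guaranteed to be numeric, so parseInt cannot fail here.
            int hh = Integer.parseInt(parts[1]);
            int mm = Integer.parseInt(parts[2]);
            System.out.println(parts[0] + " wrote at minute " + (hh * 60 + mm) + " of the day");
        });
    }
}
```

Note that \\d\\d only checks for two digits; range checks such as hh below 24 would still have to be done after parsing.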
For your case, I think the regex bits you need to consider are
"[A-Za-z]" for alphabet characters (you could probably also safely use "\\w", which matches alphabet characters, as well as digits and _).
"\\d" for digits (1,2,3...)
"+" for indicating you want one or more of the previous character or group.
