Grouping regular expression - java

Here is my questions:
I have a very long string with so many values bounded by the different tags. Those values including chinese, english wording and digits.
I wanna to separate by specify pattern. The following is an example:
(I want to find a pattern xxxxxx where xxxx is chinese, english, digits or any notation but not include "<" or ">" as those two symbol is for identify the tags)
However, I found some strange for these pattern. The Pattern seems didn't recgonize the first two tag() but the second one
String a = "<f\"number\">4 <f\"number\"><f$n0>14 <h85><f$n0>4 <f$n0>2 <f$n0>2 7 -<f\"Times-Roman\">7<f\"number\">";
Pattern p = Pattern.compile("<f\"number\">[\\P{sc=Han}*\\p{sc=Han}*[a-z]*[A-Z]*[0-9]*^<>]*<f\"number\">");
Matcher m = p.matcher(a);
while(m.find()){
System.out.println(m.group());
}
The output is as same as my String a

The character class [\\P{sc=Han}*\\p{sc=Han}*[a-z]*[A-Z]*[0-9]*^<>]* matches 0 or more any character because \\P{sc=Han} and \\p{sc=Han} are opposite.
I guess you want:
Pattern p = Pattern.compile("<f\"number\">[\\P{sc=Han}a-zA-Z0-9]*<f\"number\">");
You may want to add spaces:
Pattern p = Pattern.compile("<f\"number\">[\\P{sc=Han}a-zA-Z0-9\s]*<f\"number\">");
or:
Pattern p = Pattern.compile("<f\"number\">[^<]*<f\"number\">");

Related

Repeating capture group in a regular expression

What would be the best way to parse the following string in Java using a single regex?
String:
someprefix foo=someval baz=anotherval baz=somethingelse
I need to extract someprefix, someval, anotherval and somethingelse. The string always contains a prefix value (someprefix in the example) and can have from 0 to 4 key-value pairs (foo=someval baz=anotherval baz=somethingelse in the example)
You can use this regex for capturing your intended text,
(?<==|^)\w+
Which captures a word that is preceded by either an = character or is at ^ start of string.
Sample java code for same,
Pattern p = Pattern.compile("(?<==|^)\\w+");
String s = "someprefix foo=someval baz=anotherval baz=somethingelse";
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group());
}
Prints,
someprefix
someval
anotherval
somethingelse
Live Demo

How to use regular expressions in this case?

the String is:"LinksImagesListCodeHt1233ddmlImagesConsider112dd2Download",I want to get "ImagesConsider112dd2Download". so I used this expression "Images.*?Download".but it matches "ImagesListCodeHt1233ddmlImagesConsider112dd2Download".what's the correct expression should be?
Temporarily,there is a ugly way to solve this problem:
Pattern p = Pattern.compile(StringUtils.reverse("Download")+ ".*?" + StringUtils.reverse("Images") );
String s = "LinksImagesListCodeHt1233ddmlImagesConsider112dd2Download";
s = StringUtils.reverse(s);
Matcher m = p.matcher(s);
while (m.find()){
m.end();
System.out.println(StringUtils.reverse(m.group()));
}
To match the text between Images to Download which does not contain the word Images inside you can use negative lookaround like this
Images((?!Images).)*Download
Explanation
Images -- Match literal string Images
(?!Images). -- Match a character that does not follow Images word
((?!Images).)* -- Match zero or more times

Regular expression not filtering out digits

Im using this line of code to extract all text between the two strings "Origin" and "//". I'm trying to exclude all digits but this doesn't work, It grabs everything including the digits. is my regex incorrect?
Pattern p = Pattern.compile(Pattern.quote("ORIGIN") + "(.*?[^0-9])" + Pattern.quote("//"), Pattern.DOTALL);
First of all: you have no need to Pattern.quote() either of ORIGIN or //; what is more, the text in your question suggests Origin, not ORIGIN, so I'll go with that instead.
Try this regex:
private static final Pattern PATTERN
= Pattern.compile("Origin([^0-9/]+)//");
Note: it disallows any slash between Origin and // as well, which may, or may not, be what you want; but since there are no examples in your question, this is as good a solution as I can muster.
What you want is not clear.
1) If you want to get only the text (without any number) even if there is number in:
Pattern p = Pattern.compile("ORIGIN(.*)//");
Matcher m = p.matcher(str);
if(m.find())
System.out.println(m.group(1).replaceAll("\\d+", ""));
2) If you want to get text without number :
Pattern p = Pattern.compile("ORIGIN([^0-9]+)//");
Matcher m = p.matcher(str);
if(m.find())
ystem.out.println(m.group(1));
3) Something else ???????
E.g :
String : ORIGINbla54bla//
1) String : blabla
2) No result (Pattern does not match)

Regular expression matching "dictionary words"

I'm a Java user but I'm new to regular expressions.
I just want to have a tiny expression that, given a word (we assume that the string is only one word), answers with a boolean, telling if the word is valid or not.
An example... I want to catch all words that is plausible to be in a dictionary... So, i just want words with chars from a-z A-Z, an hyphen (for example: man-in-the-middle) and an apostrophe (like I'll or Tiffany's).
Valid words:
"food"
"RocKet"
"man-in-the-middle"
"kahsdkjhsakdhakjsd"
"JESUS", etc.
Non-valid words:
"gipsy76"
"www.google.com"
"me#gmail.com"
"745474"
"+-x/", etc.
I use this code, but it won't gave the correct answer:
Pattern p = Pattern.compile("[A-Za-z&-&']");
Matcher m = p.matcher(s);
System.out.println(m.matches());
What's wrong with my regex?
Add a + after the expression to say "one or more of those characters":
Escape the hyphen with \ (or put it last).
Remove those & characters:
Here's the code:
Pattern p = Pattern.compile("[A-Za-z'-]+");
Matcher m = p.matcher(s);
System.out.println(m.matches());
Complete test:
String[] ok = {"food","RocKet","man-in-the-middle","kahsdkjhsakdhakjsd","JESUS"};
String[] notOk = {"gipsy76", "www.google.com", "me#gmail.com", "745474","+-x/" };
Pattern p = Pattern.compile("[A-Za-z'-]+");
for (String shouldMatch : ok)
if (!p.matcher(shouldMatch).matches())
System.out.println("Error on: " + shouldMatch);
for (String shouldNotMatch : notOk)
if (p.matcher(shouldNotMatch).matches())
System.out.println("Error on: " + shouldNotMatch);
(Produces no output.)
This should work:
"[A-Za-z'-]+"
But "-word" and "word-" are not valid. So you can uses this pattern:
WORD_EXP = "^[A-Za-z]+(-[A-Za-z]+)*$"
Regex - /^([a-zA-Z]*('|-)?[a-zA-Z]+)*/
You can use above regex if you don't want successive "'" or "-".
It will give you accurate matching your text.
It accepts
man-in-the-middle
asd'asdasd'asd
It rejects following string
man--in--midle
asdasd''asd
Hi Aloob please check with this, Bit lengthy, might be having shorter version of this, Still...
[A-z]*||[[A-z]*[-]*]*||[[A-z]*[-]*[']*]*

Author and time matching regex

I would to use a regex in my Java program to recognize some feature of my strings.
I've this type of string:
`-Author- has wrote (-hh-:-mm-)
So, for example, I've a string with:
Cecco has wrote (15:12)
and i've to extract author, hh and mm fields. Obviously I've some restriction to consider:
hh and mm must be numbers
author hasn't any restrictions
I've to consider space between "has wrote" and (
How can I can use regex?
EDIT: I attach my snippet:
String mRegex = "(\\s)+ has wrote \\((\\d\\d):(\\d\\d)\\)";
Pattern mPattern = Pattern.compile(mRegex);
String[] str = {
"Cecco CQ has wrote (14:55)", //OK (matched)
"yesterday you has wrote that I'm crazy", //NO (different text)
"Simon has wrote (yesterday)", // NO (yesterday isn't numbers)
"John has wrote (22:32)", //OK
"James has wrote(22:11)", //NO (missed space between has wrote and ()
"Tommy has wrote (xx:ss)" //NO (xx and ss aren't numbers)
};
for(String s : str) {
Matcher mMatcher = mPattern.matcher(s);
while (mMatcher.find()) {
System.out.println(mMatcher.group());
}
}
homework?
Something like:
(.+) has wrote \((\d\d):(\d\d)\)
Should do the trick
() - mark groups to capture (there are three in the above)
.+ - any chars (you said no restrictions)
\d - any digit
\(\) escape the parens as literals instead of a capturing group
use:
Pattern p = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
To cope with an optional (HH:mm) at the end you need to start to use some dark regex voodoo:
Pattern p = Pattern.compile("(.+) has wrote\\s?(?:\\((\\d\\d):(\\d\\d)\\))?");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
m = p.matcher("Gareth has wrote");
if( m.matches()){
System.out.println(m.group(1));
// m.group(2) == null since it didn't match anything
}
The new unescaped pattern:
(.+) has wrote\s?(?:\((\d\d):(\d\d)\))?
\s? optionally match a space (there might not be a space at the end if there isn't a (HH:mm) group
(?: ... ) is a none capturing group, i.e. allows use to put ? after it to make is optional
I think #codinghorror has something to say about regex
The easiest way to figure out regular expressions is to use a testing tool before coding.
I use an eclipse plugin from http://www.brosinski.com/regex/
Using this I came up with the following result:
([a-zA-Z]*) has wrote \((\d\d):(\d\d)\)
Cecco has wrote (15:12)
Found 1 match(es):
start=0, end=23
Group(0) = Cecco has wrote (15:12)
Group(1) = Cecco
Group(2) = 15
Group(3) = 12
An excellent turorial on regular expression syntax can be found at http://www.regular-expressions.info/tutorial.html
Well, just in case you didn't know, Matcher has a nice function that can draw out specific groups, or parts of the pattern enclosed by (), Matcher.group(int). Like if I wanted to match for a number between two semicolons like:
:22:
I could use the regex ":(\\d+):" to match one or more digits between two semicolons, and then I can fetch specifically the digits with:
Matcher.group(1)
And then its just a matter of parsing the String into an int. As a note, group numbering starts at 1. group(0) is the whole match, so Matcher.group(0) for the previous example would return :22:
For your case, I think the regex bits you need to consider are
"[A-Za-z]" for alphabet characters (you could probably also safely use "\\w", which matchers alphabet characters, as well as numbers and _).
"\\d" for digits (1,2,3...)
"+" for indicating you want one or more of the previous character or group.

Categories

Resources