How to exclude underscore from regex group java

I am working with the names of Excel files that have this format:
table_A_Apr_2000.xlsx
and I want an expression that gives me the groups as strings in the form below:
[table_A, Apr, 2000, .xlsx]
When I use this expression in my code:
String table="table_A";
String[] slist = {"table_A_Apr_2001.xlsx"};
Pattern p = Pattern.compile("^"+table+"|\\d+|\\D+|[^_]*");
for(int i=0; i<slist.length;i++){
Matcher m = p.matcher(slist[i]);
List<String> a = new ArrayList<String>();
while(m.find()){
a.add((m.group()));
}
System.out.println(a);
System.out.println("~~~~~");
}
it gives the following output:
[table_A, _Apr_, 2001, .xlsx, ]
but I want it to be:
[table_A, Apr, 2001, .xlsx]
Any suggestions would be much appreciated, especially for the pattern expression.

\\D matches every non-digit, which includes _. To exclude it, intersect \\D with [^_] using the && operator: try [\\D&&[^_]]+ instead of \\D+|[^_]*.
Alternatively, since \D is the negation of \d, De Morgan's law (~p AND ~q is the same as ~(p OR q)) lets you rewrite it as [^\\d_]+.
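A minimal sketch of this suggestion applied to the pattern from the question. The class and method names (UnderscoreFreeTokens, tokens) are made up for illustration, and Pattern.quote is an addition that is not in the original code; it only ensures a table name containing regex metacharacters is treated literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreFreeTokens {

    // Hypothetical helper: collects the table name, runs of digits, and runs of
    // characters that are neither digits nor underscores.
    static List<String> tokens(String table, String fileName) {
        Pattern p = Pattern.compile("^" + Pattern.quote(table) + "|\\d+|[^\\d_]+");
        Matcher m = p.matcher(fileName);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Underscores between tokens are skipped because no alternative can match them.
        System.out.println(tokens("table_A", "table_A_Apr_2001.xlsx"));
    }
}
```

Because [^\\d_]+ needs at least one character, the empty trailing match produced by [^_]* in the original pattern should disappear as well.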

You could use a formal regex matcher, but one option which might be workable here would be to do an intelligent split of the filename:
String filename = "table_A_Apr_2001.xlsx";
filename = filename.substring(0, filename.indexOf('.'));
String[] parts = filename.split("_(?=[^_]{3,})");
System.out.println("table: " + parts[0]);
System.out.println("month: " + parts[1]);
System.out.println("year: " + parts[2]);
table: table_A
month: Apr
year: 2001

Need help in regex matching

It may be very simple, but I am extremely new to regex and need to match a pattern in a string and extract a number from it. Below is my code with sample input and the required output. I tried to construct the Pattern using https://www.freeformatter.com/java-regex-tester.html, but the match itself returns false.
Pattern pattern = Pattern.compile(".*/(a-b|c-d|e-f)/([0-9])+(#[0-9]?)");
String str = "foo/bar/Samsung-Galaxy/a-b/1"; // need to extract 1.
String str1 = "foo/bar/Samsung-Galaxy/c-d/1#P2";// need to extract 2.
String str2 = "foo.com/Samsung-Galaxy/9090/c-d/69"; // need to extract 69
System.out.println("result " + pattern.matcher(str).matches());
System.out.println("result " + pattern.matcher(str1).matches());
System.out.println("result " + pattern.matcher(str2).matches());
All of the above print statements return false. I am using Java 8; is there any way to match the pattern and extract the digits from the string in a single statement?
It would be great if somebody could point me to how to debug/develop the regex. Please feel free to let me know if something is not clear in my question.

You may use
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
When used with matches(), the pattern above does not require explicit anchors, ^ and $.
Details
.* - any 0+ chars other than line break chars, as many as possible
/ - the rightmost / that is followed with the subsequent subpatterns
(?:a-b|c-d|e-f) - a non-capturing group matching any of the alternatives inside: a-b, c-d or e-f
/ - a / char
[^/]*? - any chars other than /, as few as possible
([0-9]+) - Group 1: one or more digits.
Java demo:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
for (String s : strs) {
Matcher m = pattern.matcher(s);
if (m.matches()) {
System.out.println(s + ": \"" + m.group(1) + "\"");
}
}
A replacing approach using the same regex with anchors added:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
String pattern = "^.*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)$";
for (String s : strs) {
System.out.println(s + ": \"" + s.replaceFirst(pattern, "$1") + "\"");
}
Output:
foo/bar/Samsung-Galaxy/a-b/1: "1"
foo/bar/Samsung-Galaxy/c-d/1#P2: "2"
foo.com/Samsung-Galaxy/9090/c-d/69: "69"
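The remark about anchors can be checked with a small sketch: matches() must consume the whole input, while find() is satisfied by a match anywhere in it. The class and method names here are hypothetical, and the sample input with trailing letters is borrowed only to show the difference.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {

    // Same pattern as in the answer, without ^ and $.
    static final Pattern P = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");

    // True only if the pattern covers the entire input (implicit anchoring).
    static boolean wholeInput(String s) {
        return P.matcher(s).matches();
    }

    // True if the pattern matches any part of the input.
    static boolean anywhere(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "foo/bar/Samsung-Galaxy/9090/a-b/69ace";
        System.out.println("matches(): " + wholeInput(s));
        System.out.println("find():    " + anywhere(s));
    }
}
```

With find() or replaceFirst(), the explicit ^ and $ are what keep the trailing "ace" from being ignored.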

Because you always want the last number in the string, I would simply use replaceAll with the regex .*?(\d+)$ :
String regex = ".*?(\\d+)$";
String strResult1 = str.replaceAll(regex, "$1");
System.out.println(strResult1.matches("\\d+") ? "result " + strResult1 : "no result");
String strResult2 = str1.replaceAll(regex, "$1");
System.out.println(strResult2.matches("\\d+") ? "result " + strResult2 : "no result");
String strResult3 = str2.replaceAll(regex, "$1");
System.out.println(strResult3.matches("\\d+") ? "result " + strResult3 : "no result");
If the string does not end with a number, replaceAll returns the input unchanged, so the check is whether the result consists only of digits.
Outputs
result 1
result 2
result 69

Here is a one-liner using String#replaceAll:
public String getDigits(String input) {
String number = input.replaceAll(".*/(?:a-b|c-d|e-f)/[^/]*?(\\d+)$", "$1");
return number.matches("\\d+") ? number : "no match";
}
System.out.println(getDigits("foo.com/Samsung-Galaxy/9090/c-d/69"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/a-b/some other text/1"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/9090/a-b/69ace"));
69
no match
no match
This works on the sample inputs you provided. Note that I added logic which will display no match for the case where ending digits could not be matched fitting your pattern. In the case of a non-match, we would typically be left with the original input string, which would not be all digits.
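As an alternative to the "no match" sentinel, here is a sketch that reads the group from a Matcher and returns an Optional. The class and method names (TrailingNumber, extract) are assumptions for illustration; the pattern is the one used above, relying on matches() instead of the explicit $.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingNumber {

    private static final Pattern P =
            Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?(\\d+)");

    // Returns the trailing digits of the last path segment, or empty if the
    // input does not fit the pattern.
    static Optional<String> extract(String input) {
        Matcher m = P.matcher(input);
        if (m.matches()) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("foo.com/Samsung-Galaxy/9090/c-d/69"));
        System.out.println(extract("foo/bar/Samsung-Galaxy/a-b/some other text/1"));
    }
}
```

This avoids confusing a real result with the sentinel text, at the cost of being more than one line.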

Using Regular Expression in Java to extract information from a String

I have one input String like this:
"I am Duc/N Ta/N Van/N"
The "/N" suffix marks a word as part of a person's name.
The expected output is:
Name: Duc Ta Van
How can I do it by using regular expression?

You can use Pattern and Matcher like this :
String input = "I am Duc/N Ta/N Van/N";
Pattern pattern = Pattern.compile("([^\\s]+)/N");
Matcher matcher = pattern.matcher(input);
String result = "";
while (matcher.find()) {
result+= matcher.group(1) + " ";
}
System.out.println("Name: " + result.trim());
Output
Name: Duc Ta Van
Another Solution using Java 9+
From Java9+ you can use Matcher::results like this :
String input = "I am Duc/N Ta/N Van/N";
String regex = "([^\\s]+)/N";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
String result = matcher.results().map(s -> s.group(1)).collect(Collectors.joining(" "));
System.out.println("Name: " + result); // Name: Duc Ta Van

Here is the regex to capture every "name" that is followed by /N:
(\w+)\/N
Now you just need to loop over every match in the String and concatenate them to get the result:
String pattern = "(\\w+)\\/N";
String test = "I am Duc/N Ta/N Van/N";
Matcher m = Pattern.compile(pattern).matcher(test);
StringBuilder sbNames = new StringBuilder();
while(m.find()){
sbNames.append(m.group(1)).append(" ");
}
System.out.println(sbNames.toString());
Duc Ta Van
That covers the hardest part; I will let you adapt it to your needs.
Note: in Java it is not required to escape a forward slash. I kept "(\\w+)\\/N" to use the same regex throughout the answer, but "(\\w+)/N" works as well.

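I've used "[/N]+" as the regular expression.
[] = matches any single character inside the set, here / or N
/ = matches the character / literally
N = matches the character N literally (case sensitive)
+ = matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Because it is a character class, it matches any run of / and N characters, so a capital N inside a name would match as well.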

Java Split String by colon on both side

Can you suggest an approach for splitting a String like this:
:31C:150318
:31D:150425 IN BANGLADESH
:20:314015040086
So I tried to parse that string with
:[A-za-z]|\\d:
but it is not working. Please suggest a regular expression that splits the string with 20, 31C, 31D etc. as keys and 150318, 150425 IN BANGLADESH etc. as values.
Using string.split(":") would not serve my purpose.
If a string is like:
:20: MY VALUES : ARE HERE
then it would split into 3 strings: key 20 would be associated with "MY VALUES", and "ARE HERE" would not be associated with key 20.

You can use matching instead of splitting, since you need to match a specific colon in the string.
The regex that captures the text between the first and second colon into Group 1 and everything after the second colon into Group 2 looks like this:
^:([^:]*):(.*)$
The ^ asserts the beginning of the string, ([^:]*) matches and captures into Group 1 zero or more characters other than :, and (.*) matches and captures into Group 2 the rest of the string. $ asserts the position at the end of a single-line string (as . matches any character but a line terminator unless Pattern.DOTALL is set).
String s = ":20:AND:HERE";
Pattern pattern = Pattern.compile("^:([^:]*):(.*)$");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println("Key: " + matcher.group(1) + ", Value: " + matcher.group(2) + "\n");
}
Result for this demo: Key: 20, Value: AND:HERE
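If all the fields arrive in one multi-line string, as in the question, the same idea can be sketched with Pattern.MULTILINE so that ^ and $ apply to each line. The class name ColonFields and the method parse are hypothetical, and the key class is narrowed to [^:\r\n]* (a small change from the regex above) so a key cannot run across a line break.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonFields {

    // MULTILINE makes ^ and $ match at the start and end of every line.
    private static final Pattern FIELD =
            Pattern.compile("^:([^:\\r\\n]*):(.*)$", Pattern.MULTILINE);

    // Collects key/value pairs in the order they appear in the text.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = ":31C:150318\n:31D:150425 IN BANGLADESH\n:20: MY VALUES : ARE HERE";
        parse(text).forEach((k, v) -> System.out.println("Key: " + k + ", Value: " + v));
    }
}
```

Values keep any colons and leading spaces they contain; call trim() on them if that is not wanted.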

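You can use the following pattern. It matches the leading colon(s), the key (captured in group 1) and the second colon, so splitting on it leaves only the value; use a Matcher to read the key from group 1: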
^[:]+([^:]+):

Try with split function of String class
String[] splited = string.split(":");
For your requirements:
String c = ":31D:150425 IN BANGLADESH:todasdsa";
c=c.substring(1);
System.out.println("C="+c);
String key= c.substring(0,c.indexOf(":"));
String value = c.substring(c.indexOf(":")+1);
System.out.println("key="+key+" value="+value);
Result:
C=31D:150425 IN BANGLADESH:todasdsa
key=31D value=150425 IN BANGLADESH:todasdsa

Certain strings that should be found by a working Regex are missed, and I need help identifying why

I have a set of strings, which I cycle through, checking those against the following set of regex, to try and separate the first small section from the rest of the string. The regex works in almost all cases, but unfortunately I have no idea why it fails occasionally. I’ve been using Pattern Matcher to print out the string, if the pattern is found.
Two example working strings:
98. SORGHUM Moench - Millets Annuals or rhizomatous perennials; inflorescence …
99. MISCANTHUS Andersson - Silver-grasses Rhizomatous perennials; inflorescence …
Two example failed strings:
100. ZEA L. - Maize Annuals; male and female inflorescences separate, the …
26. POA L. (Parodiochloa C.E. Hubb.) - Meadow-grasses Annuals or perennials with or without stolons or rhizomes; sheaths overlapping or some …
Regexes used so far:
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
Pattern endOfGenusTwo = Pattern.compile("(?<=(^\\d+" + genusNames[l].toUpperCase() + "))");
Pattern endOfGenusThree = Pattern.compile("(?<=(\\d+\\. " + genusNames[l] + "))");
Pattern endOfGenusFour = Pattern.compile("(?<=(\\d+" + genusNames[l] + "))");
Pattern endOfGenusFive = Pattern.compile("(?<=(\\. " + genusNames[l] + "))");
The first of these is the one that's producing reliable results so far.
Example Code
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
Matcher endOfGenusFinder = endOfGenus.matcher(descriptionPartBits[b]);
if (endOfGenusFinder.find()) {
System.out.print(descriptionPartBits[b] + ":- ");
System.out.print(genusNames[l] + "\n");
String[] genusNameBits = descriptionPartBits[b].split("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
}
Desired Output. This is what is produced by strings that work. Strings that don't work simply don't appear in the output:
98. SORGHUM Moench - Millets Annuals or rhizomatous perennials:- Sorghum
99. MISCANTHUS Andersson - Silver-grasses Rhizomatous perennials:- Miscanthus

From regex tutorial:
Lookahead and lookbehind, collectively called "lookaround", are
zero-length assertions just like the start and end of line, and start
and end of word anchors explained earlier in this tutorial.
Lookahead and lookbehind only return true or false.
So I changed your code example:
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. ZEA L))(.+)$");
// Matcher matcher = endOfGenus.matcher("98. SORGHUM Moench - Millets Annuals or rhizomatous perennials; inflorescence …");
Matcher matcher = endOfGenus.matcher("100. ZEA L. - Maize Annuals; male and female inflorescences separate, the …");
while (matcher.find()) {
String group1 = matcher.group(1);
String group2 = matcher.group(2);
System.out.println("group1=" + group1);
System.out.println("group2=" + group2);
}
Group 1 is matched by (^\\d+\\. ZEA L). Group 2 is matched by (.+).
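The same two groups can also be captured without a lookbehind. The following is only a sketch under assumptions: the class and method names are invented, Pattern.quote and Locale.ROOT are additions not present in the question, and nothing here claims to explain why the original strings failed.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GenusHeading {

    // Returns {heading, rest} when the line starts with "<number>. <GENUS>",
    // otherwise an empty Optional.
    static Optional<String[]> split(String genusName, String line) {
        // Quote the name so characters such as '.' are matched literally.
        String genus = Pattern.quote(genusName.toUpperCase(Locale.ROOT));
        Matcher m = Pattern.compile("^(\\d+\\. " + genus + ")(.*)$").matcher(line);
        if (m.find()) {
            return Optional.of(new String[] { m.group(1), m.group(2) });
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        split("Zea", "100. ZEA L. - Maize Annuals; male and female inflorescences separate")
                .ifPresent(parts -> System.out.println(parts[0] + " | " + parts[1]));
    }
}
```

Printing the lines for which the Optional is empty is one way to see which genus name was actually being tried against a failing string.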

How to pull numbers from a string/file name in Java?

Hopefully somebody can help me with this.. or at least point me in the right direction.
First off, I have a bunch of files with names such as:
vendor.2012-07-25
vendor.2012-07-25 2
ven_dor.2012-05-18
ven_dor.2012-05-18 2
Basically a vendor name (Sometimes one word, sometimes two with an underscore) + (period ".") + (year) + (month) + (day). Year, month, day are separated by (-). Possibly multiple files with the same name, denoted by a 2/3/4 etc after the date.
I obtain these as strings by doing file.getName(); where 'file' is the selected file from a JFileChooser
Then I need to chart some of the data based on date. Should I try to split the initial file name string by a "." first, so that the vendor and date are separated, and then split/divide up the remaining part by "-" to have the individual values for year/month/day?
I was thinking this could be a regex thing, but I'm pretty weak in that area.. so the double splitting is what I came up with. Anybody have input or suggestions? Thanks!

Indeed, you can use a regular expression:
String s = "vendor.2012-07-25 2";
Pattern p = Pattern.compile("([^.]+)\\.(\\d{4})-(\\d{2})-(\\d{2}) ?(\\d?)");
Matcher m = p.matcher(s);
if (m.find()) {
String vendorName = m.group(1);
String year = m.group(2);
String month = m.group(3);
String day = m.group(4);
String multipleFiles = m.group(5); // empty string when there is no trailing copy number
System.out.printf("%s %s %s %s %s", vendorName, year, month, day, multipleFiles);
}
Each expression wrapped with parentheses () is called a capturing group, and it basically tells the regex engine to save its content, so that it can be retrieved later on.
In sum, here's what each capturing group does:
([^.]+) - Everything but a dot (.), so we are basically capturing the vendor name part;
(\\d{4}) - \d matches a digit. \d{4} matches 4 digits (year);
(\\d{2}) - Month;
(\\d{2}) - Day;
(\\d?) - Matches an optional (?) last digit.
If you want to parse the date part as a java.util.Date instance, you can use a single capturing group for it, and then use SimpleDateFormat:
Pattern p = Pattern.compile("([^.]+)\\.(\\d{4}-\\d{2}-\\d{2}) ?(\\d?)");
Matcher m = p.matcher(s);
if (m.find()) {
String vendorName = m.group(1);
String dateString = m.group(2);
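SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
Date date = df.parse(dateString); // the checked ParseException must be caught or declared
String multipleFiles = m.group(3); // empty string when there is no trailing copy number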
}
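If java.time (Java 8 and later) is available, the captured date string can instead be handed to LocalDate.parse, which reads the ISO yyyy-MM-dd form by default. This is a sketch rather than part of the original answer; the class and method names are hypothetical and the pattern is the single-date-group one shown above.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VendorFileName {

    private static final Pattern P =
            Pattern.compile("([^.]+)\\.(\\d{4}-\\d{2}-\\d{2}) ?(\\d?)");

    // Extracts the date part of a name such as "ven_dor.2012-05-18 2".
    static LocalDate dateOf(String fileName) {
        Matcher m = P.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        // LocalDate.parse expects ISO-8601 (yyyy-MM-dd) and rejects invalid dates.
        return LocalDate.parse(m.group(2));
    }

    public static void main(String[] args) {
        LocalDate date = dateOf("ven_dor.2012-05-18 2");
        System.out.println(date.getYear() + " " + date.getMonthValue() + " " + date.getDayOfMonth());
    }
}
```

LocalDate gives direct access to year, month and day for charting, without any time zone involved.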

Use String.split on the . (the dot must be escaped as "\\." because split takes a regex). Take dotSplitString[1] as the part after vendor. or ven_dor.
Split that part on the space (spaceSplitString).
Parse the first part using DateFormat.parse(String) to get a Date.
If the 2nd part (of spaceSplitString) is present, use Integer.parseInt(spaceSplitString[1]).
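The steps above can be sketched roughly as follows. The class and method names are made up, the default copy number of 1 is an assumption, and the checked ParseException is wrapped so the helper is easier to call.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class SplitSteps {

    // Steps 1 and 2: split once on the dot, then split the remainder on the space.
    private static String[] dateAndCopy(String fileName) {
        String[] dotSplitString = fileName.split("\\.", 2);
        return dotSplitString[1].split(" ");
    }

    // Step 3: parse the first part as a date.
    static Date dateOf(String fileName) {
        try {
            return new SimpleDateFormat("yyyy-MM-dd").parse(dateAndCopy(fileName)[0]);
        } catch (ParseException e) {
            throw new IllegalArgumentException("No date in: " + fileName, e);
        }
    }

    // Step 4: read the optional copy number (assumed to be 1 when absent).
    static int copyNumberOf(String fileName) {
        String[] spaceSplitString = dateAndCopy(fileName);
        return spaceSplitString.length > 1 ? Integer.parseInt(spaceSplitString[1]) : 1;
    }

    public static void main(String[] args) {
        System.out.println(dateOf("ven_dor.2012-05-18 2"));
        System.out.println(copyNumberOf("ven_dor.2012-05-18 2"));
    }
}
```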

Use the StringTokenizer class from the Java API.
What you can do is:
StringTokenizer tokenizer = new StringTokenizer(file.getName(), ".");
tokenizer.nextElement();
You get the picture. Alternatively, you can use Scanner to parse it as well.

I tend to make use of StringTokenizers in my code a lot. To tokenize the above example you could use something akin to the following:
StringTokenizer tok = new StringTokenizer(filename, ".- "); // tokenizes on '.', '-' and ' ' so the copy number becomes its own token
String name = tok.nextToken();
int year = Integer.parseInt(tok.nextToken());
int month = Integer.parseInt(tok.nextToken());
int day = Integer.parseInt(tok.nextToken());
int cnt = 1; //default one copy of the file
if(tok.hasMoreTokens()){
cnt = Integer.parseInt(tok.nextToken());
}
...and so on.
That said, I endorse the regex solution above, if only because it looks less comprehensible to a layman. I am just including this here for completeness.
