Regex for JAVA to get optional group - java

I am trying to parse non-English text so that 用量 gives name=用量, and 用量2 gives name=用量 and number=2. I tried (\p{L}+)(\d*) on RegexPlanet and it works, but when I run it in Java I cannot get the 2 out of the second test case.
Here's the code:
String pt = "(?<name>\\p{L}+)(?<number>\\d*)";
Matcher m = Pattern.compile(pt).matcher(t.trim());
m.find();
System.out.println("Using [" + pt + "] vs [" + t + "] GC=>" +
m.groupCount());
NameID n = new NameID();
n.name = m.group(1);
if (m.groupCount() > 2) {
try {
String ind = m.group(2);
n.id = Integer.parseInt(ind);
} catch (Exception e) { }
}

String t = "用量2";
String pt = "^(?<name>\\p{L}+)(?<number>\\d*)$";
Matcher m = Pattern.compile(pt).matcher(t.trim());
if (m.matches()) {
String name = m.group("name");
Integer id = m.group("number").length() > 0 ? Integer.parseInt(m.group("number")) : null;
System.out.println("name=" + name + ", id=" + id); // name=用量, id=2
}
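Your regex works fine, but your Java code has some issues. See the Javadoc for groupCount():
Group zero denotes the entire pattern by convention. It is not included in this count.
Your pattern has two capturing groups, so groupCount() returns 2 and the check m.groupCount() > 2 is never true.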
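To make the groupCount() point concrete, here is a minimal sketch. The class name OptionalGroupDemo and the helper parse are made up for illustration; the pattern and group names are the ones from the code above. It shows that groupCount() depends only on the pattern, and that an optional \d* group that matched nothing comes back as an empty string rather than null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Same pattern as above; "name" and "number" are the group names from the question.
    static final Pattern NAME_ID = Pattern.compile("^(?<name>\\p{L}+)(?<number>\\d*)$");

    // Returns {name, number}; number is null when the input has no trailing digits.
    // Returns null when the input does not match at all.
    static String[] parse(String text) {
        Matcher m = NAME_ID.matcher(text.trim());
        if (!m.matches()) {
            return null;
        }
        // \d* can match zero digits, so this is "" (not null) when no number is present.
        String digits = m.group("number");
        return new String[] { m.group("name"), digits.isEmpty() ? null : digits };
    }

    public static void main(String[] args) {
        // "\u7528\u91cf" is 用量, written with escapes to avoid source-encoding problems.
        for (String t : new String[] { "\u7528\u91cf", "\u7528\u91cf2" }) {
            Matcher m = NAME_ID.matcher(t);
            m.matches();
            // groupCount() is a property of the pattern, so it is 2 for both inputs.
            System.out.println("groupCount=" + m.groupCount() + " number=[" + m.group("number") + "]");
        }
    }
}
```

So the test to make is on the content of the group (empty or not), not on groupCount().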

Related

Split filename into groups

Input:
"MyPrefix_CH-DE_ProductName.pdf"
Desired output:
["MyPrefix", "CH", "DE", "ProductName"]
CH is a country code, and it should come from a predefined list, eg. ["CH", "IT", "FR", "GB"]
Edit: prefix can contain _ and - as well but not CH or DE.
DE is a language code, and it should come from a predefined list, eg. ["EN", "IT", "FR", "DE"]
How do I do that?
I'm looking for a regex based solution here.
I'll assume that the extension is always pdf
String str = "MyPref_ix__CH-DE_ProductName.pdf";
String regex = "(.*)_(CH|IT|FR|GB)-(EN|IT|FR|DE)_(.*)\\.pdf";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
String[] res = new String[4];
if(matcher.matches()) {
res[0] = matcher.group(1);
res[1] = matcher.group(2);
res[2] = matcher.group(3);
res[3] = matcher.group(4);
}
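Since the question says the country and language codes come from predefined lists, the alternations could also be built from those lists instead of being typed into the regex. The following is only a sketch under that assumption; the class FileNamePattern and its methods are hypothetical names, and the extension is still assumed to be pdf.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FileNamePattern {
    // Joins the allowed codes into an alternation; Pattern.quote guards against
    // codes that happen to contain regex metacharacters.
    static String alternation(List<String> codes) {
        return codes.stream().map(Pattern::quote).collect(Collectors.joining("|"));
    }

    // Builds the same shape of regex as above, with the code lists plugged in.
    static Pattern build(List<String> countries, List<String> languages) {
        return Pattern.compile("(.*)_(" + alternation(countries) + ")-("
                + alternation(languages) + ")_(.*)\\.pdf");
    }

    // Returns [prefix, country, language, product], or null if the name does not fit.
    static String[] split(Pattern p, String fileName) {
        Matcher m = p.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        Pattern p = build(Arrays.asList("CH", "IT", "FR", "GB"),
                          Arrays.asList("EN", "IT", "FR", "DE"));
        // prints [MyPref_ix_, CH, DE, ProductName]
        System.out.println(Arrays.toString(split(p, "MyPref_ix__CH-DE_ProductName.pdf")));
    }
}
```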
You can try the following
String input = "MyPrefix_CH-DE_ProductName.pdf";
String[] segments = input.split("_");
String prefix = segments[0];
String countryCode = segments[1].split("-")[0];
String languageCode = segments[1].split("-")[1];
String fileName = segments[2].substring(0, segments[2].length() - 4);
System.out.println("prefix " + prefix);
System.out.println("countryCode " + countryCode);
System.out.println("languageCode " + languageCode);
System.out.println("fileName " + fileName);
This code does the split and creates an object from the result, which is more object-oriented.
package com.local;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
/**
* Hello world!
*
*/
public class App
{
public static void main( String[] args )
{
List<String> countries = Arrays.asList("CH", "IT", "FR", "GB");
List<String> languages = Arrays.asList("EN", "IT", "FR", "DE");
String filename = "MyPrefix_CH-DE_ProductName.pdf";
// Remove the extension
filename = filename.split("\\.")[0];
List<String> result = Arrays.asList(filename.split("[_\\-]"));
FileNameSplitResult resultOne = new FileNameSplitResult(result.get(0), result.get(1), result.get(2), result.get(3));
System.out.println(resultOne);
}
static class FileNameSplitResult{
String prefix;
String country;
String language;
String productName;
public FileNameSplitResult(String prefix, String country, String language, String productName) {
this.prefix = prefix;
this.country = country;
this.language = language;
this.productName = productName;
}
@Override
public String toString() {
return "FileNameSplitResult{" +
"prefix='" + prefix + '\'' +
", country='" + country + '\'' +
", language='" + language + '\'' +
", productName='" + productName + '\'' +
'}';
}
}
}
Result of execution:
FileNameSplitResult{prefix='MyPrefix', country='CH', language='DE', productName='ProductName'}
You can use String.split two times so you can first split by '_' to get the CH-DE string and then split by '-' to get the CountryCode and LanguageCode.
Updated after your edit, with input containing '_' and '-':
The following code scans through the input String to find countries matches. I changed the input to "My-Pre_fix_CH-DE_ProductName.pdf"
Check the following code:
public static void main(String[] args) {
String [] countries = {"CH", "IT", "FR", "GB"};
String input = "My-Pre_fix_CH-DE_ProductName.pdf";
//First scan to find country position
int index = -1;
for (int i=0; i<input.length()-4; i++){
for (String country:countries){
String match = "_" + country + "-";
String toMatch = input.substring(i, match.length()+i);
if (match.equals(toMatch)){
//Found index
index=i;
break;
}
}
}
String prefix = input.substring(0,index);
String remaining = input.substring(index+1);//remaining is CH-DE_ProductName.pdf
String [] countryLanguageProductCode = remaining.split("_");
String country = countryLanguageProductCode[0].split("-")[0];
String language = countryLanguageProductCode[0].split("-")[1];
String productName = countryLanguageProductCode[1].split("\\.")[0];
System.out.println("[\"" + prefix +"\", \"" + country + "\", \"" + language +"\", \"" + productName+"\"]");
}
It outputs:
["My-Pre_fix", "CH", "DE", "ProductName"]
You can use the following regex :
^(.*?)_(CH|IT|FR|GB)-(EN|IT|FR|DE)_(.*)$
Java code :
Pattern p = Pattern.compile("^(.*?)_(CH|IT|FR|GB)-(EN|IT|FR|DE)_(.*)$");
Matcher m = p.matcher(input);
if (m.matches()) {
String[] result = { m.group(1), m.group(2), m.group(3), m.group(4) };
}
Note that it would still fail if the prefix could contain a substring like _CH-EN_, and I don't think there's much that can be done about it besides sanitizing the inputs.
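To illustrate that caveat, here is a small sketch comparing the lazy (.*?) prefix with a greedy (.*) one on an ambiguous input. The class PrefixAmbiguityDemo and the sample file name are invented for the demonstration. Neither variant is "right" here; they just resolve the ambiguity in opposite directions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixAmbiguityDemo {
    static final String CORE = "_(CH|IT|FR|GB)-(EN|IT|FR|DE)_(.*)$";
    // Lazy prefix: stops at the first country-language pair.
    static final Pattern LAZY = Pattern.compile("^(.*?)" + CORE);
    // Greedy prefix: stops at the last country-language pair.
    static final Pattern GREEDY = Pattern.compile("^(.*)" + CORE);

    // Returns the captured prefix, or null if the input does not match.
    static String prefix(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String ambiguous = "My_CH-EN_x_CH-DE_Product.pdf";
        System.out.println(prefix(LAZY, ambiguous));   // prints My
        System.out.println(prefix(GREEDY, ambiguous)); // prints My_CH-EN_x
    }
}
```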
One more alternative, which is pretty much the same as @billal GHILAS's and @Aaron's answers but uses named groups. I find them handy because anyone who looks at the code later, myself included, can immediately see what the regex does.
String str = "My_Prefix_CH-DE_ProductName.pdf";
Pattern filePattern = Pattern.compile("(?<prefix>\\w+)_"
+ "(?<country>CH|IT|FR|GB)-"
+ "(?<language>EN|IT|FR|DE)_"
+ "(?<product>\\w+)\\.");
Matcher file = filePattern.matcher(str);
// group(...) throws IllegalStateException if no match was found, so check find() first
if (file.find()) {
System.out.println("Prefix: " + file.group("prefix"));
System.out.println("Country: " + file.group("country"));
System.out.println("Language: " + file.group("language"));
System.out.println("Product: " + file.group("product"));
}

java regular expression between characters java

My Java code:
String subjectString = "BYWW4 AterMux TP 46[_221] \n"
+ "FHTTY TC AterMux TP 9 \n"
+ "TUI_OO AterMux TP 2[_225] \n"
+ "F-UYRE TC AterMux TP 2 \n"
+ "RRRDSA AterMux TP 31[_256] ";
String textStr[] = subjectString.split("\n");
for (int i = 0; i < textStr.length; i++) {
String ResultString = null;
try {
Pattern regex = Pattern.compile("????????");
Matcher regexMatcher = regex.matcher(textStr[i]);
if (regexMatcher.find()) {
ResultString = regexMatcher.group();
System.out.println(ResultString); ///
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
}
I want the program to print the value after the word TP and before the [ on each line, to get a result like this: 4692231
You can use the regexp TP\s*(\d+)\[ (with doubled backslashes in Java code) and get the value with regexMatcher.group(1).
But you should not recompile it on each iteration of the loop; call Pattern.compile once per regexp.
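Put together, that could look roughly like the sketch below. The class TpValueExtractor and the method extract are names made up here; the regexp is the one from this answer. Note that with this pattern, lines that have no [ after the number produce no match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TpValueExtractor {
    // Compiled once and reused for every line.
    private static final Pattern TP_VALUE = Pattern.compile("TP\\s*(\\d+)\\[");

    // Collects the digits between "TP" and "[" from each line that has them.
    static List<String> extract(String text) {
        List<String> values = new ArrayList<>();
        for (String line : text.split("\n")) {
            Matcher m = TP_VALUE.matcher(line);
            if (m.find()) {
                values.add(m.group(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String subjectString = "BYWW4 AterMux TP 46[_221] \n"
                + "FHTTY TC AterMux TP 9 \n"
                + "TUI_OO AterMux TP 2[_225] \n"
                + "F-UYRE TC AterMux TP 2 \n"
                + "RRRDSA AterMux TP 31[_256] ";
        System.out.println(extract(subjectString)); // prints [46, 2, 31]
    }
}
```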

java.lang.ArrayIndexOutOfBoundsException :

I have a String "abc model 123 abcd1862893007509396 abcd2862893007509404". If I put a space between abcd1 and the number (e.g. abcd1 862893007509396), my code works fine, but if there is no space, as in abcd1862893007509396, I get a java.lang.ArrayIndexOutOfBoundsException. Please help.
Here is the code:
String text = "";
final String suppliedKeyword = "abc model 123 abcd1862893007509396 abcd2862893007509404";
String[] keywordarray = null;
String[] keywordarray2 = null;
String modelname = "";
String[] strIMEI = null;
if ( StringUtils.containsIgnoreCase( suppliedKeyword,"model")) {
keywordarray = suppliedKeyword.split("(?i)model");
if (StringUtils.containsIgnoreCase(keywordarray[1], "abcd")) {
keywordarray2 = keywordarray[1].split("(?i)abcd");
modelname = keywordarray2[0].trim();
if (keywordarray[1].trim().contains(" ")) {
strIMEI = keywordarray[1].split(" ");
for (int i = 0; i < strIMEI.length; i++) {
if (StringUtils.containsIgnoreCase(strIMEI[i],"abcd")) {
text = text + " " + strIMEI[i] + " "
+ strIMEI[i + 1];
System.out.println(text);
}
}
} else {
text = keywordarray2[1];
}
}
}
After looking at your code, the only thing I can see that would cause the error is
if (StringUtils.containsIgnoreCase(strIMEI[i],"abcd")) {
text = text + " " + strIMEI[i] + " "
+ strIMEI[i + 1];
System.out.println(text);
}
You are trying to access strIMEI[i + 1], which throws an ArrayIndexOutOfBoundsException when the last element of strIMEI contains "abcd", because there is no element after it.
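One way to avoid this is to check that a next element exists before reading it. The sketch below is only an illustration: the class ImeiPairs and the method collect are hypothetical, and it uses plain String methods instead of StringUtils so that it runs without Apache Commons. It keeps the original concatenation logic and only adds the bounds check.

```java
import java.util.Locale;

class ImeiPairs {
    // Mirrors the loop from the question, but only reads tokens[i + 1] when it exists.
    static String collect(String[] tokens) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].toLowerCase(Locale.ROOT).contains("abcd")) {
                text.append(' ').append(tokens[i]);
                if (i + 1 < tokens.length) { // guard against running past the end
                    text.append(' ').append(tokens[i + 1]);
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String tail = " 123 abcd1862893007509396 abcd2862893007509404";
        // No exception now, even though the last token contains "abcd".
        System.out.println(collect(tail.split(" ")));
    }
}
```

Whether the resulting text is what you actually want for the no-space input is a separate question; this only removes the exception.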

regex filename pattern match

I am using the regular expression below:
Pattern p = Pattern.compile("(.*?)(\\d+)?(\\..*)?");
while(new File(fileName).exists())
{
Matcher m = p.matcher(fileName);
if(m.matches()) { //group 1 is the prefix, group 2 is the number, group 3 is the suffix
fileName = m.group(1) + (m.group(2) == null ? "_copy" + 1 : (Integer.parseInt(m.group(2)) + 1)) + (m.group(3)==null ? "" : m.group(3));
}
}
This works fine for a filename like abc.txt, but if there is a file named abc1.txt, the above method gives abc2.txt. How can I change the regex or the expression (m.group(2) == null ? "_copy" + 1 : (Integer.parseInt(m.group(2)) + 1)) so that it returns abc1_copy1.txt as the new filename instead of abc2.txt, and then abc1_copy2.txt and so on?
Pattern p = Pattern.compile("(.*?)(_copy(\\d+))?(\\..*)?");
while(new File(fileName).exists())
{
Matcher m = p.matcher(fileName);
if (m.matches()) {
String prefix = m.group(1);
String numberMatch = m.group(3);
String suffix = m.group(4);
int copyNumber = numberMatch == null ? 1 : Integer.parseInt(numberMatch) + 1;
fileName = prefix;
fileName += "_copy" + copyNumber;
fileName += (suffix == null ? "" : suffix);
}
}
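If it helps, the renaming step can be pulled out of the loop into its own method so it can be tried on plain strings without touching the file system. This is a sketch only; the class CopyNamer and the method nextName are names I made up, and the pattern is the one from the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CopyNamer {
    // group 1: base name, group 3: existing copy number (if any), group 4: extension (if any)
    private static final Pattern NAME = Pattern.compile("(.*?)(_copy(\\d+))?(\\..*)?");

    // Returns the next candidate name: adds "_copy1", or increments an existing "_copyN".
    static String nextName(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            // '.' does not match line terminators, so such names would not match.
            throw new IllegalArgumentException("Unsupported file name: " + fileName);
        }
        String numberMatch = m.group(3);
        String suffix = m.group(4);
        int copyNumber = numberMatch == null ? 1 : Integer.parseInt(numberMatch) + 1;
        return m.group(1) + "_copy" + copyNumber + (suffix == null ? "" : suffix);
    }

    public static void main(String[] args) {
        System.out.println(nextName("abc1.txt"));       // prints abc1_copy1.txt
        System.out.println(nextName("abc1_copy1.txt")); // prints abc1_copy2.txt
    }
}
```

The while loop then becomes fileName = CopyNamer.nextName(fileName) for as long as the file exists.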
I'm not a Java guy, but in general, you should use library functions/classes for parsing filenames, as different platforms have different rules for them.
Look at:
http://people.apache.org/~jochen/commons-io/site/apidocs/org/apache/commons/io/FilenameUtils.html#getBaseName(java.lang.String)

Regex Issue With Multiple Groups

I'm trying to create a regex pattern to match the lines in the following format:
field[bii] = float4:.4f_degree // Galactic Latitude
field[class] = int2 (index) // Browse Object Classification
field[dec] = float8:.4f_degree (key) // Declination
field[name] = char20 (index) // Object Designation
field[dircos1] = float8 // 1st Directional Cosine
I came up with this pattern, which seemed to work, then suddenly seemed NOT to work:
field\[(.*)\] = (float|int|char)([0-9]|[1-9][0-9]).*(:(\.([0-9])))
Here is the code I'm trying to use (edit: provided full method instead of excerpt):
private static Map<String, String> createColumnMap(String filename) {
// create a linked hashmap mapping field names to their column types. Use LHM because I'm picky and
// would prefer to preserve the order
Map<String, String> columnMap = new LinkedHashMap<String, String>();
// define the regex patterns
Pattern columnNamePattern = Pattern.compile(columnNameRegexPattern);
try {
Scanner scanner = new Scanner(new FileInputStream(filename));
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
if (line.indexOf("field[") != -1) {
// get the field name
Matcher fieldNameMatcher = columnNamePattern.matcher(line);
String fieldName = null;
if (fieldNameMatcher.find()) {
fieldName = fieldNameMatcher.group(1);
}
String columnName = null;
String columnType = null;
String columnPrecision = null;
String columnScale = null;
//Pattern columnTypePattern = Pattern.compile(".*(float|int|char)([0-9]|[1-9][0-9])");
Pattern columnTypePattern = Pattern.compile("field\\[(.*)\\] = (float|int|char).*([0-9]|[1-9][0-9]).*(:(\\.([0-9])))");
Matcher columnTypeMatcher = columnTypePattern.matcher(line);
System.out.println(columnTypeMatcher.lookingAt());
if (columnTypeMatcher.lookingAt()) {
System.out.println(fieldName + ": " + columnTypeMatcher.groupCount());
int count = columnTypeMatcher.groupCount();
if (count > 1) {
columnName = columnTypeMatcher.group(1);
columnType = columnTypeMatcher.group(2);
}
if (count > 2) {
columnScale = columnTypeMatcher.group(3);
}
if (count >= 6) {
columnPrecision = columnTypeMatcher.group(6);
}
}
int precision = Integer.parseInt(columnPrecision);
int scale = Integer.parseInt(columnScale);
if (columnType.equals("int")) {
if (precision <= 4) {
columnMap.put(fieldName, "INTEGER");
} else {
columnMap.put(fieldName, "BIGINT");
}
} else if (columnType.equals("float")) {
if (columnPrecision==null) {
columnMap.put(fieldName,"DECIMAL(8,4)");
} else {
columnMap.put(fieldName,"DECIMAL(" + columnPrecision + "," + columnScale + ")");
}
} else {
columnMap.put(fieldName,"VARCHAR("+columnPrecision+")");
}
}
if (line.indexOf("<DATA>") != -1) {
scanner.close();
break;
}
}
scanner.close();
} catch (FileNotFoundException e) {
}
return columnMap;
}
When I get the groupCount from the Matcher object, it says there are 6 groups. However, they aren't matching the text, so I could definitely use some help... can anyone assist?
It's not entirely clear to me what you're after but I came up with the following pattern and it accepts all of your input examples:
field\\[(.*)\\] = (float|int|char)([1-9][0-9]?)?(:\\.([0-9]))?
using this code:
String columnName = null;
String columnType = null;
String columnPrecision = null;
String columnScale = null;
// Pattern columnTypePattern =
// Pattern.compile(".*(float|int|char)([0-9]|[1-9][0-9])");
// field\[(.*)\] = (float|int|char)([0-9]|[1-9][0-9]).*(:(\.([0-9])))
Pattern columnTypePattern = Pattern
.compile("field\\[(.*)\\] = (float|int|char)([1-9][0-9]?)?(:\\.([0-9]))?");
Matcher columnTypeMatcher = columnTypePattern.matcher(line);
boolean match = columnTypeMatcher.lookingAt();
System.out.println("Match: " + match);
if (match) {
int count = columnTypeMatcher.groupCount();
if (count > 1) {
columnName = columnTypeMatcher.group(1);
columnType = columnTypeMatcher.group(2);
}
if (count > 2) {
columnScale = columnTypeMatcher.group(3);
}
if (count > 4) {
columnPrecision = columnTypeMatcher.group(5);
}
System.out.println("Name=" + columnName + "; Type=" + columnType + "; Scale=" + columnScale + "; Precision=" + columnPrecision);
}
I think the problem with your regex was it needed to make the scale and precision optional.
field\[(.*)\] = (float|int|char)([0-9]|[1-9][0-9]).*(:(\.([0-9])))
The .* is overly broad, and there is a lot of redundancy in ([0-9]|[1-9][0-9]), and I think the parenthetical group that starts with : and preceding .* should be optional.
After removing all the ambiguity, I get
field\[([^\]]*)\] = (float|int|char)(0|[1-9][0-9]*)(?:[^:]*(:(\.([0-9]+))))?
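As a side note on the groupCount() checks: groupCount() is fixed by the pattern and does not say which groups actually matched. An optional group that did not take part in the match simply returns null from group(...). Here is a sketch of that idea using named groups. The class FieldLineParser, the method describe and the group names are all invented for illustration, and the pattern is a simplified variant of the ones above, not a drop-in replacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldLineParser {
    // size and precision are optional; name excludes ']' so it cannot run past the bracket.
    private static final Pattern FIELD = Pattern.compile(
            "field\\[(?<name>[^\\]]*)\\] = (?<type>float|int|char)"
            + "(?<size>\\d+)?(?::\\.(?<precision>\\d+))?");

    // Returns "name,type,size,precision" with "null" for optional parts that are absent,
    // or null if the line does not start with a field definition.
    static String describe(String line) {
        Matcher m = FIELD.matcher(line);
        if (!m.lookingAt()) {
            return null;
        }
        // group(...) returns null for an optional group that did not participate.
        return m.group("name") + "," + m.group("type") + ","
                + m.group("size") + "," + m.group("precision");
    }

    public static void main(String[] args) {
        // prints bii,float,4,4
        System.out.println(describe("field[bii] = float4:.4f_degree // Galactic Latitude"));
        // prints class,int,2,null
        System.out.println(describe("field[class] = int2 (index) // Browse Object Classification"));
    }
}
```

In real code you would test each group for null before calling Integer.parseInt on it.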
