How to extract members with regex - java

I have this string to parse and extract all elements between <>:
String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
I tried with this pattern, but it doesn't work (no result):
String pattern = "<#[C,U][0-9]+\\|[.]+>";
So in this example I want to extract:
<#C5712|user_name_toto>
<#U433|user_hola>
Then for each, I want to extract:
C or U element
ID (ie: 5712 or 433)
user name (ie: user_name_toto)
Thank you very much guys

The main problem I can see with your pattern is that it doesn't contain groups, so retrieving parts of a match is impossible without further parsing.
You define numbered groups within parentheses: (partOfThePattern).
From Java 7 onwards, you can also define named groups as follows: (?<theName>partOfThePattern).
The second problem is that [.] corresponds to a literal dot, not an "any character" wildcard.
The third problem is your last quantifier, which is greedy: once [.] is replaced with a real wildcard, it would consume the rest of the string starting from the first username.
Here's a self-contained example fixing all that:
String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
// | starting <#
// | | group 1: any 1 char
// | | | group 2: 1+ digits
// | | | | escaped "|"
// | | | | | group 3: 1+ non-">" chars, greedy
// | | | | | | closing >
// | | | | | |
Pattern p = Pattern.compile("<#(.)(\\d+)\\|([^>]+)>");
Matcher m = p.matcher(text);
while (m.find()) {
System.out.printf(
"C or U? %s%nUser ID: %s%nUsername: %s%n",
m.group(1), m.group(2), m.group(3)
);
}
Output
C or U? C
User ID: 5712
Username: user_name_toto
C or U? U
User ID: 433
Username: user_hola
Note
I'm not validating C vs U here (gives you another . example).
You can easily replace the initial (.) with (C|U) if you only have either. You can also have the same with ([CU]).
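The named-group syntax mentioned above can be sketched as follows. This is only an illustration: the group names type, id and name, and the class and method names, are my own choices and not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // Group names "type", "id" and "name" are illustrative assumptions.
    static final Pattern MENTION =
            Pattern.compile("<#(?<type>[CU])(?<id>\\d+)\\|(?<name>[^>]+)>");

    // Returns one "type/id/name" string per match found in the text.
    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            result.add(m.group("type") + "/" + m.group("id") + "/" + m.group("name"));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
        for (String entry : extract(text)) {
            System.out.println(entry);
        }
    }
}
```

Named groups make the calling code less fragile if you later add or reorder groups in the pattern.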

<#([CU])(\d+)\|(\w+)>
Where:
$1 --> C/U
$2 --> 5712/433
$3 --> user_name_toto/user_hola

Related

Variable increment (index++) not increase 1 each time

My program reads a file line by line, splits each line on the delimiter | (vertical bar), and stores the result in a String[]. However, because the column positions and the number of columns may change in the future, I use index++ to step through the tokens instead of hard-coded indexes 0, 1, 2, 3...
When I run the program, the index increases by more than 1 each time instead of by 1.
My code is as follows:
BufferedReader br = null;
String line = null;
String[] lineTokens = null;
int index = 1;
DataModel dataModel = new DataModel();
try {
br = new BufferedReader(new FileReader(filePath));
while((line = br.readLine()) != null) {
// check Group C only
if(line.contains("CCC")) {
lineTokens = line.split("\\|");
dataModel.setGroupID(lineTokens[index++]);
//System.out.println(index); The value of index not equal to 2 here. The value change each running time
dataModel.setGroupName(lineTokens[index++]);
//System.out.println(index);
// dataModel.setOthers(lineTokens[index++]); <- if the file add columns in the middle of the line in the future, this is required.
dataModel.setMemberID(lineTokens[index++]);
dataModel.setMemberName(lineTokens[index++]);
dataModel.setXXX(lineTokens[index++]);
dataModel.setYYY(lineTokens[index++]);
index = 1;
//blah blah below
}
}
br.close();
} catch (Exception ex) {
}
The file format is as follows:
Prefix | Group ID | Group Name | Member ID | Member Name | XXX | YYY
GroupInterface | AAA | Group A | 001 | Amy | XXX | YYY
GroupInterface | BBB | Group B | 002 | Tom | XXX | YYY
GroupInterface | AAA | Group A | 003 | Peter | XXX | YYY
GroupInterface | CCC | Group C | 004 | Sam | XXX | YYY
GroupInterface | CCC | Group C | 005 | Susan | XXX | YYY
GroupInterface | DDD | Group D | 006 | Parker| XXX | YYY
Instead of increasing by 1, index++ increases the index by more than 1. Why does this happen, and how can I solve it? Any help is highly appreciated.
Well, @SomeProgrammerDude slyly hinted at it, but I'll just come out and say it: when you reset index it should be set to zero, not 1.
By starting index at 1, you're always indexing one position ahead of where you should be, and you're probably eventually getting an IndexOutOfBoundsException that's being swallowed up by your empty catch clause.
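A minimal sketch of that idea, assuming the column layout shown in the question (prefix first, then group ID, group name, member ID, member name). The class and method names are hypothetical. The index is a local variable that starts at zero for every line, and nothing is caught, so an out-of-range access surfaces as an exception instead of being silently swallowed.

```java
import java.util.Arrays;

class LineIndexDemo {
    // Splits one pipe-delimited line and reads its columns in order.
    // Returns {groupId, groupName, memberId, memberName}.
    static String[] parse(String line) {
        String[] tokens = line.split("\\|");
        int index = 0; // local, so it starts fresh for every line

        String prefix = tokens[index++].trim(); // consumed, not returned
        String groupId = tokens[index++].trim();
        String groupName = tokens[index++].trim();
        String memberId = tokens[index++].trim();
        String memberName = tokens[index++].trim();

        return new String[] {groupId, groupName, memberId, memberName};
    }

    public static void main(String[] args) {
        String line = "GroupInterface | CCC | Group C | 004 | Sam | XXX | YYY";
        System.out.println(Arrays.toString(parse(line)));
    }
}
```

If a line has fewer columns than expected, parse throws an ArrayIndexOutOfBoundsException, which is much easier to diagnose than an empty catch block.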

write a grammar rule name in unicode [ANTLR 4]

I am still a beginner in ANTLR 4 and I was wondering if there is a way to write a grammar rule name in unicode. For example, the following rule is fine:
atomExp returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| '(' exp=additionExp ')' {$value = $exp.value;}
;
However, let's say I want to write the same rule, but instead of naming it "atomExp", I want to name it with the Arabic word "تعبير":
تعبير returns [double value]
: n=Number {$value = Double.parseDouble($n.text);}
| '(' exp=additionExp ')' {$value = $exp.value;}
;
but when I try to write it that way I get "no viable alternative" error. Can someone solve my problem please. Thanks in advance
When looking at the lexer grammar for ANTLR4, you can see that lexer and parser names support certain Unicode chars:
/** Allow unicode rule/token names */
ID : NameStartChar NameChar*;
fragment
NameChar
: NameStartChar
| '0'..'9'
| '_'
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment
NameStartChar
: 'A'..'Z'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
; // ignores | ['\u10000-'\uEFFFF] ;
INT : [0-9]+
;
But it appears that your ID تعبير does not comply with the NameChar* part of the ID rule.

Java Regex for custom function

I'm looking for a Regex pattern that matches the following, but I'm kind of stumped so far. I'm not sure how to grab the results of the two groups I want, marked by id, and attr.
Should match:
account[id].attr
account[anotherid].anotherattr
These should respectively return id, attr,
and anotherid, anotherattr
Any tips?
Here's a complete solution mapping your id -> attributes:
String[] input = {
"account[id].attr",
"account[anotherid].anotherattr"
};
// | literal for "account"
// | | escaped "["
// | | | group 1: any character
// | | | | escaped "]"
// | | | | | escaped "."
// | | | | | | group 2: any character
Pattern p = Pattern.compile("account\\[(.+)\\]\\.(.+)");
Map<String, String> output = new LinkedHashMap<String, String>();
// iterating over input Strings
for (String s: input) {
// matching
Matcher m = p.matcher(s);
// finding only once per input String. Change to a while-loop if multiple instances
// within single input
if (m.find()) {
// back-referencing group 1 and 2 as key -> value
output.put(m.group(1), m.group(2));
}
}
System.out.println(output);
Output
{id=attr, anotherid=anotherattr}
Note
In this implementation, "incomplete" inputs such as "account[anotherid]." will not be put in the Map as they don't match the Pattern at all.
In order to have these cases put as id -> null, you only need to add a ? at the end of the Pattern.
That will make the last group optional.
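To illustrate the note above, here is a small sketch with the trailing ? added to the pattern. The class and method names are hypothetical. When the second group does not participate in the match, m.group(2) returns null, and LinkedHashMap accepts null values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Same pattern as above, with "?" making the last group optional.
    static final Pattern P = Pattern.compile("account\\[(.+)\\]\\.(.+)?");

    // Maps each id to its attribute, or to null when the attribute is missing.
    static Map<String, String> map(String... inputs) {
        Map<String, String> output = new LinkedHashMap<>();
        for (String s : inputs) {
            Matcher m = P.matcher(s);
            if (m.find()) {
                output.put(m.group(1), m.group(2)); // group(2) may be null
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(map("account[id].attr", "account[anotherid]."));
        // prints {id=attr, anotherid=null}
    }
}
```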

Matching ${123...456} and extracting 2 numbers in Java?

What is the simplest, most succinct way to extract 2 integers from a String when I know the format will always be ${INT1...INT2}? E.g. "Hello ${123...456}" would extract 123, 456.
I would go with a Pattern with groups and back-references.
Here's an example:
String input = "Hello ${123...456}, bye ${789...101112}";
// | escaped "$"
// | | escaped "{"
// | | | first group (any number of digits)
// | | | | 3 escaped dots
// | | | | | second group (same as 1st)
// | | | | | | escaped "}"
Pattern p = Pattern.compile("\\$\\{(\\d+)\\.{3}(\\d+)\\}");
Matcher m = p.matcher(input);
// iterating over matcher's find for multiple matches
while (m.find()) {
System.out.println("Found...");
System.out.println("\t" + m.group(1));
System.out.println("\t" + m.group(2));
}
Output
Found...
123
456
Found...
789
101112
final String string = "${123...456}";
final String firstPart = string.substring(string.indexOf("${") + "${".length(), string.indexOf("..."));
final String secondPart = string.substring(string.indexOf("...") + "...".length(), string.indexOf("}"));
final int first = Integer.parseInt(firstPart);   // 123
final int second = Integer.parseInt(secondPart); // 456

How to replace multiple words with space in a string using Java

I tried to replace a list of words in a given string with the following code.
String Sample = " he saw a cat running of that pat's mat ";
String regex = "'s | he | of | to | a | and | in | that";
Sample = Sample.replaceAll(regex, " ");
The output is
[ saw cat running that pat mat ]
// minus the []
It still has the last word "that". Is there any way to modify the regex so that it also handles the last word?
Try:
String Sample = " he saw a cat running of that pat's mat remove 's";
String resultString = Sample.replaceAll("\\b( ?'s|he|of|to|a|and|in|that)\\b", "");
System.out.print(resultString);
saw cat running pat mat remove
DEMO
http://ideone.com/Yitobz
The problem is that you have consecutive words that you are trying to replace.
For example, consider the substring
[ of that ]
While replaceAll is running, [ of ] matches
[ of that ]
^ ^
and is replaced with a single space. Matching then resumes at the t, not at the leading space expected by
... | that | ...
What I think you can do to fix this is add word boundaries instead of spaces.
String regex = "'s\\b|\\bhe\\b|\\bof\\b|\\bto\\b|\\ba\\b|\\band\\b|\\bin\\b|\\bthat\\b";
or the shorter version as shown in Tuga's answer.
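A runnable sketch of the word-boundary approach, using the sample string from the question. The class and method names are hypothetical, and the final trim and whitespace collapsing are my own addition to tidy up the gaps left by the removed words.

```java
class WordBoundaryDemo {
    // Removes the listed words (and the possessive 's) wherever they
    // appear as whole words, then collapses the leftover whitespace.
    static String strip(String s) {
        String regex = "'s\\b|\\b(?:he|of|to|a|and|in|that)\\b";
        return s.replaceAll(regex, "")
                .trim()
                .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String sample = " he saw a cat running of that pat's mat ";
        System.out.println(strip(sample)); // prints: saw cat running pat mat
    }
}
```

Because \b matches a position and consumes no characters, removing "of" no longer takes away the space that the next word needs.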
It doesn't work because the " of " part is replaced first, which consumes the space that " that" needs in front of it.
You can fix it in a few ways:
String regex = "'s | he | of| to | a | and | in | that";
or
String regex = "'s | he | of | to | a | and | in |that ";
or simply call Sample = Sample.replaceAll(regex, " "); a second time.
