Java regex to read a property: what difference do double parentheses make? - java

Input format: column1 = emp_no
I want to extract:
key: column1
value: emp_no
First code:
String p1 = "column1 = emp_no";
String propertyRegexp = "^\\s*(\\w+)\\s*=\\s*(\\w+)\\s*$";
Pattern pattern = Pattern.compile(propertyRegexp);
Matcher matcher = pattern.matcher(p1);
System.out.println("groupCount: " + matcher.groupCount());
if(matcher.matches()) {
    for(int i = 0; i < matcher.groupCount(); i++) {
        System.out.println(i + ": " + matcher.group(i));
    }
}
first result:
groupCount: 2
0: column1 = emp_no
1: column1
The second captured value (emp_no) is not printed.
I then changed the second pair of parentheses to double parentheses.
Second code:
String p1 = "column1 = emp_no";
String propertyRegexp = "^\\s*(\\w+)\\s*=\\s*((\\w+))\\s*$";
Pattern pattern = Pattern.compile(propertyRegexp);
Matcher matcher = pattern.matcher(p1);
System.out.println("groupCount: " + matcher.groupCount());
if(matcher.matches()) {
    for(int i = 0; i < matcher.groupCount(); i++) {
        System.out.println(i + ": " + matcher.group(i));
    }
}
second result:
groupCount: 3
0: column1 = emp_no
1: column1
2: emp_no
This prints the results I want.
What is the difference between the regexes in the first and second code?

Change your code to:
String p1 = "column1 = emp_no";
String propertyRegexp = "^\\s*(\\w+)\\s*=\\s*(\\w+)\\s*$";
Pattern pattern = Pattern.compile(propertyRegexp);
Matcher matcher = pattern.matcher(p1);
System.out.println("groupCount: " + matcher.groupCount());
if(matcher.matches()) {
    for(int i = 1; i <= matcher.groupCount(); i++) { // note the changed bounds
        System.out.println(i + ": " + matcher.group(i));
    }
}
Group 0 always contains the entire matched string.
The actual capturing groups start at index 1.

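Groups in a regex are indexed from 0, but group 0 is added automatically by the regex engine to represent the entire match, and groupCount() does not count it. Your own groups have indexes 1 and 2. The extra parentheses in your second version only raised groupCount() to 3, which let the i < groupCount() loop reach index 2.
So your first attempt was almost correct; you simply need to change the loop from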
for(int i = 0; i < matcher.groupCount(); i++) {
to
for(int i = 1; i <= matcher.groupCount(); i++) {
// ^ ^
You can read more about groups in the official Java tutorial on regex: https://docs.oracle.com/javase/tutorial/essential/regex/groups.html
It includes an example showing how groups are numbered:
...capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
((A)(B(C)))
(A)
(B(C))
(C)
...
There is also a special group, group 0, which always represents the entire expression.
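To see the numbering in practice, here is a minimal sketch that lists every group for both regexes from the question. The class name GroupNumbering and the helper listGroups are made up for illustration, and the sketch assumes the whole input matches the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Hypothetical helper: returns group 0 (the whole match) followed by groups 1..groupCount().
    static List<String> listGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.matches()) {
            // groupCount() excludes group 0, so the last valid index is groupCount() itself.
            for (int i = 0; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String input = "column1 = emp_no";
        // Two capturing groups: indexes 1 and 2.
        System.out.println(listGroups("^\\s*(\\w+)\\s*=\\s*(\\w+)\\s*$", input));
        // Three capturing groups: the outer and inner parentheses both capture emp_no.
        System.out.println(listGroups("^\\s*(\\w+)\\s*=\\s*((\\w+))\\s*$", input));
    }
}
```

With the double parentheses, groups 2 and 3 hold the same text, which is why the extra pair adds nothing except a higher group count.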


Split a String at every capital letter

My String:
BByTTheWay. I want to split the string as B By T The Way BByTTheWay. That is, I want to split the string at every capital letter and put the original string, unchanged, at the end. This is what I have tried so far in Java:
public String breakWord(String fileAsString) throws FileNotFoundException, IOException {
    String allWord = "";
    String allmethod = "";
    String[] splitString = fileAsString.split(" ");
    for (int i = 0; i < splitString.length; i++) {
        String k = splitString[i].replaceAll("([A-Z])(?![A-Z])", " $1").trim();
        allWord = k.concat(" " + splitString[i]);
        allWord = Arrays.stream(allWord.split("\\s+")).distinct().collect(Collectors.joining(" "));
        allmethod = allmethod + " " + allWord;
        // System.out.print(allmethod);
    }
    return allmethod;
}
It gives me the output: B ByT The Way BByTTheWay. I hope the Stack Overflow community can help me solve this.
You may use this code:
Code 1
String s = "BByTTheWay";
Pattern p = Pattern.compile("\\p{Lu}\\p{Ll}*");
String out = p.matcher(s)
    .results()
    .map(MatchResult::group)
    .collect(Collectors.joining(" "))
    + " " + s;
//=> "B By T The Way BByTTheWay"
The regex \\p{Lu}\\p{Ll}* matches any Unicode uppercase letter followed by zero or more lowercase letters. Note that Matcher.results() requires Java 9 or later.
Or use String.split with a lookahead for an uppercase letter and join the pieces back together:
Code 2
String out = Arrays.stream(s.split("(?=\\p{Lu})"))
    .collect(Collectors.joining(" ")) + " " + s;
//=> "B By T The Way BByTTheWay"
Use
String s = "BByTTheWay";
Pattern p = Pattern.compile("[A-Z][a-z]*");
Matcher m = p.matcher(s);
String r = "";
while (m.find()) {
    r = r + m.group(0) + " ";
}
System.out.println(r + s);
Results: B By T The Way BByTTheWay
EXPLANATION
[A-Z]    any character from 'A' to 'Z'
[a-z]*   any character from 'a' to 'z', 0 or more times (matching as many as possible)
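The two patterns used in the answers above behave the same on BByTTheWay but differ on non-ASCII letters. The following sketch compares them; the class name UpperCaseSplit, the helper words and the accented sample string are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UpperCaseSplit {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> words(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String ascii = "BByTTheWay";
        // Both patterns give the same pieces for plain ASCII input.
        System.out.println(words("\\p{Lu}\\p{Ll}*", ascii));
        System.out.println(words("[A-Z][a-z]*", ascii));

        // Made-up sample with accented capitals, written with Unicode escapes.
        String accented = "\u00C9cole\u00C0Paris";
        // The Unicode classes treat the accented capitals as word starts.
        System.out.println(words("\\p{Lu}\\p{Ll}*", accented));
        // The ASCII classes only see the word starting with 'P'.
        System.out.println(words("[A-Z][a-z]*", accented));
    }
}
```

If the input is known to be ASCII only, either pattern is fine; otherwise the \\p{Lu}\\p{Ll}* form is the safer choice.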
Alternatively, you can do it without a regex by checking whether each character is an uppercase letter:
char[] chars = fileAsString.toCharArray();
StringBuilder fragment = new StringBuilder();
for (char ch : chars) {
    if (Character.isLetter(ch) && Character.isUpperCase(ch)) { // Unicode-aware check, not limited to A-Z
        fragment.append(" ");
    }
    fragment.append(ch);
}
String result = fragment.toString().concat(" " + fileAsString).trim(); // B By T The Way BByTTheWay

Merge lines that share a word-link

I'm having a small problem in Java. I have something like:
"Victor Fleming"
"Gone With"
"With The"
"The Wind."
So what the sentence should actually look like is
"Victor Fleming"
"Gone with the wind."
So I want to merge entries into a single sentence whenever the last word of one entry is the same as the first word of the next. If there is no such shared word, a new sentence starts, as in the "Victor Fleming" case, where "Fleming" is not the same as "Gone". What I've written so far:
List<String> separatedText = new ArrayList<>();
int i = 0;
while (i < mergedTextByHeightColor.size()) {
    if ((i < (mergedTextByHeightColor.size() - 3)) && !(mergedTextByHeightColor.get(i + 1).equals(mergedTextByHeightColor.get(i + 2)))) {
        separatedText.add(mergedTextByHeightColor.get(i) + " " + mergedTextByHeightColor.get(i + 1));
        i = i + 2;
    }
    String concatStr = "";
    while ((i < (mergedTextByHeightColor.size() - 3)) && (mergedTextByHeightColor.get(i + 1).equals(mergedTextByHeightColor.get(i + 2)))) {
        if (concatStr.contains(mergedTextByHeightColor.get(i))) {
            concatStr = mergedTextByHeightColor.get(i + 1) + " " + mergedTextByHeightColor.get(i + 3);
        } else {
            concatStr = mergedTextByHeightColor.get(i) + " " + mergedTextByHeightColor.get(i + 1) + " " + mergedTextByHeightColor.get(i + 3);
        }
        i = i + 3;
    }
    separatedText.add(concatStr);
}
We can store the sentences in a String array, then loop through each one.
Inside the loop, we check whether the last word of the previous item (found by splitting it into an array with .split(" ") and taking the last element) is equal to the first word of the current item. If it is, we remove the first word from the current item and append the rest to a StringBuilder.
If it isn't, we add the StringBuilder's current value to the list, start a new StringBuilder with the current element, and move on.
String[] sentences = {"Victor Fleming", "Gone With", "With The", "The Wind."};
List<String> newsentences = new ArrayList<>();
StringBuilder str = new StringBuilder();
for(int i = 0; i < sentences.length; i++) {
    String cur = sentences[i];
    if(i != 0) {
        String[] a = sentences[i-1].split(" ");
        String[] b = cur.split(" ");
        String last = a[a.length-1];
        String first = b[0];
        if(last.equalsIgnoreCase(first)) {
            str.append(cur.substring(first.length()));
        }else {
            newsentences.add(str.toString());
            str = new StringBuilder();
            str.append(cur);
        }
    }else {
        str.append(cur);
    }
}
newsentences.add(str.toString());
System.out.println(Arrays.toString(newsentences.toArray()));
Output:
[Victor Fleming, Gone With The Wind.]
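The same idea can be wrapped in a reusable method. This is only a sketch of the approach described above: the class name LineMerger and the method mergeLinked are hypothetical, and it assumes words are separated by single spaces. Unlike the snippet above, it returns an empty list for empty input instead of a list containing one empty string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineMerger {
    // Hypothetical helper: merges consecutive lines when the last word of one
    // equals (ignoring case) the first word of the next.
    static List<String> mergeLinked(List<String> lines) {
        List<String> merged = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String previousLast = null; // last word of the previous line, if any
        for (String line : lines) {
            String[] words = line.split(" ");
            String first = words[0];
            if (previousLast != null && previousLast.equalsIgnoreCase(first)) {
                // Shared word: drop it from this line and append the remainder.
                current.append(line.substring(first.length()));
            } else {
                // No link: flush the sentence built so far and start a new one.
                if (current.length() > 0) {
                    merged.add(current.toString());
                }
                current = new StringBuilder(line);
            }
            previousLast = words[words.length - 1];
        }
        if (current.length() > 0) {
            merged.add(current.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Victor Fleming", "Gone With", "With The", "The Wind.");
        System.out.println(mergeLinked(input));
    }
}
```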

Java matcher unable to find last group

I'm trying regex after a long time. I'm not sure if the issue is with regex or the logic.
String test = "project/components/content;contentLabel|contentDec";
String regex = "(([A-Za-z0-9-/]*);([A-Za-z0-9]*))";
Map<Integer, String> matchingGroups = new HashMap<>();
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(test);
//System.out.println("Input: " + test + "\n");
//System.out.println("Regex: " + regex + "\n");
//System.out.println("Matcher Count: " + matcher.groupCount() + "\n");
if (matcher != null && matcher.find()) {
    for (int i = 0; i < matcher.groupCount(); i++) {
        System.out.println(i + " -> " + matcher.group(i) + "\n");
    }
}
I was expecting the above to give me the output as below:
0 -> project/components/content;contentLabel|contentDec
1 -> project/components/content
2 -> contentLabel|contentDec
But when running the code the group extractions are off.
Any help would be really appreciated.
Thanks!
You have a few issues:
You're missing | in your second character class.
You have an unnecessary capture group around the whole regex.
When outputting the groups, you need to use <= matcher.groupCount() because matcher.group(0) is reserved for the whole match, so your capture groups are in group(1) and group(2).
This will work:
String test = "project/components/content;contentLabel|contentDec";
String regex = "([A-Za-z0-9-/]*);([A-Za-z0-9|]*)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(test);
if (matcher != null && matcher.find()) {
    for (int i = 0; i <= matcher.groupCount(); i++) {
        System.out.println(i + " -> " + matcher.group(i) + "\n");
    }
}
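To make the three issues concrete, this sketch dumps every group produced by the original regex and by the corrected one. The class name GroupDump and the helper groups are invented for this illustration; the regexes and the input are the ones from the question and the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDump {
    // Hypothetical helper: returns groups 0..groupCount() of the first match, or an empty list.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "project/components/content;contentLabel|contentDec";
        // Original regex: the outer parentheses become group 1, and the match stops at '|'.
        List<String> before = groups("(([A-Za-z0-9-/]*);([A-Za-z0-9]*))", test);
        // Corrected regex: two groups, and '|' is allowed in the second one.
        List<String> after = groups("([A-Za-z0-9-/]*);([A-Za-z0-9|]*)", test);
        for (int i = 0; i < before.size(); i++) {
            System.out.println("before " + i + " -> " + before.get(i));
        }
        for (int i = 0; i < after.size(); i++) {
            System.out.println("after " + i + " -> " + after.get(i));
        }
    }
}
```

In the original version the wanted values sit in groups 2 and 3, and the loop with i < groupCount() never reaches group 3.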

Java Pattern to match String beginning and end?

I have an input that looks like this : 0; expires=2016-12-27T16:52:39
I am trying to extract only the date from this, using Pattern and Matcher.
private String extractDateFromOutput(String result) {
    Pattern p = Pattern.compile("(expires=)(.+?)(?=(::)|$)");
    Matcher m = p.matcher(result);
    while (m.find()) {
        System.out.println("group 1: " + m.group(1));
        System.out.println("group 2: " + m.group(2));
    }
    return result;
}
Why does this matcher find more than 1 group ? The output is as follows:
group 1: expires=
group 2: 2016-12-27T17:04:39
How can I get only group 2 out of this?
Thank you !
Because you have used more than one capturing group in your regex.
Just remove the capturing groups around expires= and :: so that the date is the only group left:
Pattern p = Pattern.compile("expires=(.+?)(?=::|$)");
private String extractDateFromOutput(String result) {
    Pattern p = Pattern.compile("expires=(.+?)(?=::|$)");
    Matcher m = p.matcher(result);
    while (m.find()) {
        System.out.println("group 1: " + m.group(1));
        // there is no group 2; accessing it throws an IndexOutOfBoundsException
        //System.out.println("group 2: " + m.group(2));
    }
    return result;
}
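The method above still returns the whole input. If the goal is to hand the date back to the caller, a variant along these lines may help. It is a sketch only: the class name ExpiryParser, the method name extractDate and the choice to return null when nothing matches are assumptions, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpiryParser {
    // Same regex as in the answer: capture everything after "expires=" up to "::" or end of input.
    private static final Pattern EXPIRES = Pattern.compile("expires=(.+?)(?=::|$)");

    // Hypothetical helper: returns the captured date text, or null if there is no "expires=" part.
    static String extractDate(String result) {
        Matcher m = EXPIRES.matcher(result);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractDate("0; expires=2016-12-27T16:52:39"));
    }
}
```

The single colons inside the time do not end the capture, because the lookahead only stops at a double colon or at the end of the input.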

Getting regex data

I'm trying to use a Java regex to extract data. It's matching my data, but I can't get the group data. I'm trying to get the values 1, XmlAggregator, 268803451 and 3. Looking at the docs, I assumed that if I put () around \d+ and \w+, I would get the numbers and strings inside the groups. Any suggestions on how to change the regex?
String:
Span(trace_id:1, name:XmlAggregator, id:268803451, parent_id:3)
Java code:
String pattern="Span\\(trace_id:(\\d+), name:(\\w+), id:(\\d+), parent_id:(\\d+), (duration:(\\d+))*";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
int count = 0;
while(m.find()) {
    System.out.println("Match number "+count);
    System.out.println("start(): "+m.start());
    System.out.println("end(): "+m.end());
    System.out.println("Found value: " + m.group(count) );
    count++;
}
Output:
Match number 0
start(): 0
end(): 64
Found value: Span(trace_id:1, name:XmlAggregator, id:268803451, parent_id:3,
Hoping to get:
Found value: 1
Found value: XmlAggregator
Found value: 268803451
Found value: 3
You can access the capture groups (the parts of the match inside your unescaped parentheses) using the group method on your match result:
System.out.println("Trace ID = " + m.group(1));
System.out.println("Name = " + m.group(2));
// etc...
Note that you start counting the capture groups from 1, not 0. This is because group 0 corresponds to the entire matched string.
Each value is inside a group. Therefore you can loop over the number of groups matched and for each one print the group number, value, start index, etc.:
if(m.find()) {
    for(int count = 1; count <= m.groupCount(); count++) {
        System.out.println("Match number " + count);
        System.out.println("start(): " + m.start(count));
        System.out.println("end(): " + m.end(count));
        System.out.println("Found value: " + m.group(count));
    }
}
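Put together as a self-contained sketch, the extraction could look like this. The class name SpanParser and the method fields are hypothetical, and the pattern is a shortened form of the one in the question: it assumes the optional duration part is not needed and stops after parent_id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanParser {
    // Shortened pattern (assumption): the trailing duration part from the question is left out.
    private static final Pattern SPAN = Pattern.compile(
            "Span\\(trace_id:(\\d+), name:(\\w+), id:(\\d+), parent_id:(\\d+)");

    // Hypothetical helper: returns the captured values in group order, or an empty list if no match.
    static List<String> fields(String line) {
        Matcher m = SPAN.matcher(line);
        List<String> values = new ArrayList<>();
        if (m.find()) {
            // Start at 1 because group 0 is the whole match.
            for (int i = 1; i <= m.groupCount(); i++) {
                values.add(m.group(i));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "Span(trace_id:1, name:XmlAggregator, id:268803451, parent_id:3)";
        for (String value : fields(line)) {
            System.out.println("Found value: " + value);
        }
    }
}
```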
