How do you replace groups in a regular expression? - java

How, exactly, do you replace groups while appending them to a string buffer?
For Example:
(a)(b)(c)
How can you replace group 1 with d, group 2 with e and so on?
I'm working with the Java regex engine.
Thanks in advance.

You could use Matcher's appendReplacement together with appendTail.
Here is an example:
input: "hello bob How is your cat?"
regular expression: "(bob|cat)"
output: "hello alice How is your dog?"
public static void main(String[] args) {
Pattern p = Pattern.compile("(bob|cat)");
Matcher m = p.matcher("hello bob How is your cat?");
StringBuffer s = new StringBuffer();
while (m.find()) {
m.appendReplacement(s, doReplace(m.group(1)));
}
m.appendTail(s);
System.out.println(s.toString());
}
public static String doReplace(String s) {
if(s.equals("bob")) {
return "alice";
}
if(s.equals("cat")) {
return "dog";
}
return "";
}
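If the list of replacements grows, the if-chain can be swapped for a lookup table. The following is only a sketch of that idea: the class name MapReplacer, the method replaceWords and the "$5 dog" replacement are made up for illustration. It also wraps the replacement in Matcher.quoteReplacement, because appendReplacement treats $ and \ in the replacement string specially.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MapReplacer {
    // Hypothetical helper: replaces each matched word by its entry in the table.
    static String replaceWords(String input, Map<String, String> table) {
        Matcher m = Pattern.compile("(bob|cat)").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Fall back to the matched text when the table has no entry for it.
            String replacement = table.getOrDefault(m.group(1), m.group(1));
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("bob", "alice");
        table.put("cat", "$5 dog"); // contains '$', so it must be quoted
        System.out.println(replaceWords("hello bob How is your cat?", table));
        // prints: hello alice How is your $5 dog?
    }
}
```

Without quoteReplacement, the "$5" in the second replacement would be read as a reference to group 5 and throw an exception.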

You could use Matcher#start(group) and Matcher#end(group) to build a generic replacement method:
public static String replaceGroup(String regex, String source, int groupToReplace, String replacement) {
return replaceGroup(regex, source, groupToReplace, 1, replacement);
}
public static String replaceGroup(String regex, String source, int groupToReplace, int groupOccurrence, String replacement) {
Matcher m = Pattern.compile(regex).matcher(source);
for (int i = 0; i < groupOccurrence; i++)
if (!m.find()) return source; // pattern not met, may also throw an exception here
return new StringBuilder(source).replace(m.start(groupToReplace), m.end(groupToReplace), replacement).toString();
}
public static void main(String[] args) {
// replace with "%" what was matched by group 1
// input: aaa123ccc
// output: %123ccc
System.out.println(replaceGroup("([a-z]+)([0-9]+)([a-z]+)", "aaa123ccc", 1, "%"));
// replace with "!!!" what was matched the 4th time by the group 2
// input: a1b2c3d4e5
// output: a1b2c3d!!!e5
System.out.println(replaceGroup("([a-z])(\\d)", "a1b2c3d4e5", 2, 4, "!!!"));
}

Are you looking for something like this?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Program1 {
public static void main(String[] args) {
Pattern p = Pattern.compile("(a)(b)(c)");
String str = "111abc222abc333";
String out = null;
Matcher m = p.matcher(str);
out = m.replaceAll("z$3y$2x$1");
System.out.println(out);
}
}
This prints 111zcybxa222zcybxa333.
In the replacement string, $1, $2 and $3 stand for the text captured by groups 1, 2 and 3, so every "abc" is rewritten with its groups in reverse order.
As far as I know, though, there is no ready built-in method that lets you say, for example:
- replace group 3 with zzz
- replace group 2 with yyy
- replace group 1 with xxx
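Such a method can be written by hand on top of Matcher#start(int) and Matcher#end(int). Here is a minimal sketch; the class GroupReplacer and the method replaceGroups are hypothetical names, and the code assumes that the groups are not nested and appear left to right in the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupReplacer {
    // Replaces capture group i (1-based) of every match with replacements[i - 1].
    // Assumption: groups are not nested and occur left to right.
    static String replaceGroups(String regex, String source, String... replacements) {
        Matcher m = Pattern.compile(regex).matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the part of source already copied to out
        while (m.find()) {
            for (int g = 1; g <= m.groupCount() && g <= replacements.length; g++) {
                if (m.start(g) < 0) {
                    continue; // this group did not take part in the match
                }
                // Copy the untouched text before the group, then the replacement.
                out.append(source, last, m.start(g)).append(replacements[g - 1]);
                last = m.end(g);
            }
        }
        return out.append(source.substring(last)).toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceGroups("(a)(b)(c)", "111abc222abc333", "d", "e", "f"));
        // prints: 111def222def333
    }
}
```

Groups without a replacement (when fewer replacements than groups are passed) are left unchanged.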

Related

Finding six consecutive integers in three lines of string

I have written an OCR program in Java that scans documents and extracts all the text in them. My primary task is to find the invoice number, which is a run of six or more digits.
I tried substring(), but that is unreliable because the position of the number changes from document to document. It is, however, always somewhere in the first three lines of the OCR text.
I want to write Java 8 code that iterates through the first three lines and extracts this run of six consecutive digits.
I am using Tesseract for OCR.
Example:
,——— ————i_
g DAILYW RK SHE 278464
E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha
From this, I need to extract the number 278464.
Please help!
Try the following code, which uses a regex.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Test
{
public static void main(String[] args)
{
Pattern pattern = Pattern.compile("(?<=\\D)\\d{6}(?!\\d)");
String str = "g DAILYW RK SHE 278464";
Matcher matcher = pattern.matcher(str);
if(matcher.find()){
String s = matcher.group();
//278464
System.out.println(s);
}
}
}
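(?<=\\D) is a lookbehind: it asserts that the character just before the match is a non-digit, without including it in the match. Note that it needs such a character to exist, so a number at the very start of the string is not found.
\\d{6} matches exactly six digits.
(?!\\d) is a negative lookahead: it asserts that the character right after the match is not a digit, again without consuming it.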
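To see the difference at the start of a line, the lookbehind can be swapped for a negative one, (?<!\\d), which only demands that no digit precedes the match. The sketch below compares the two; the class LookaroundDemo, the helper firstMatch and the sample line are made up for the comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String atStart = "278464 DAILY WORK SHEET"; // made-up line, number comes first
        // Positive lookbehind needs a non-digit character before the number.
        System.out.println(firstMatch("(?<=\\D)\\d{6}(?!\\d)", atStart)); // prints: null
        // Negative lookbehind is also satisfied at the start of the string.
        System.out.println(firstMatch("(?<!\\d)\\d{6}(?!\\d)", atStart)); // prints: 278464
    }
}
```

Both variants still reject a seven-digit run, because \\d{6} combined with the lookarounds allows exactly six digits.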
It can be solved simply with \\d{6,} as shown below:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String args[]) {
// Tests
String[] textArr1 = { ",——— ————i_", "g DAILYW RK SHE 2784647",
"E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha" };
String[] textArr2 = { ",——— ————i_", "g DAILYW RK SHE ——— ————",
"E C 0 mp] on THE 278464 POUJER Hello, Mumbai, Co. Maha" };
String[] textArr3 = { ",——— 278464————i_", "g DAILYW RK SHE POUJER",
"E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha" };
System.out.println(getInvoiceNumber(textArr1));
System.out.println(getInvoiceNumber(textArr2));
System.out.println(getInvoiceNumber(textArr3));
}
static String getInvoiceNumber(String[] textArr) {
String invoiceNumber = "";
Pattern pattern = Pattern.compile("\\d{6,}");
for (String text : textArr) {
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
invoiceNumber = matcher.group();
}
}
return invoiceNumber;
}
}
Output:
2784647
278464
278464
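The question asks for the first three lines only, and the loop above keeps the last match rather than the first. A possible variant, assuming the OCR result arrives as one String with line breaks, is sketched here; InvoiceFinder and findInvoiceNumber are hypothetical names and the sample text is shortened.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvoiceFinder {
    private static final Pattern DIGITS = Pattern.compile("\\d{6,}");

    // Looks at the first three lines only and returns the first run of 6+ digits.
    static Optional<String> findInvoiceNumber(String ocrText) {
        String[] lines = ocrText.split("\\R"); // \R matches any line break (Java 8+)
        for (int i = 0; i < Math.min(3, lines.length); i++) {
            Matcher m = DIGITS.matcher(lines[i]);
            if (m.find()) {
                return Optional.of(m.group()); // stop at the first hit
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String text = "-- --i_\n"
                + "g DAILYW RK SHE 278464\n"
                + "E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha";
        System.out.println(findInvoiceNumber(text).orElse("not found")); // prints: 278464
    }
}
```

Returning an Optional makes the "no number found" case explicit instead of handing back an empty string.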
Check this code, which reads a file line by line:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static final Pattern p = Pattern.compile("(\\d{6,})");
public static void main(String[] args) {
try {
Scanner scanner = new Scanner(new File("here put your file path"));
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
// create a matcher for pattern p and the current line
Matcher m = p.matcher(line);
// if the pattern occurs in the current line...
if (m.find()) {
System.out.println(m.group(1)); // group 1: the first run of six or more digits on this line
}
}
scanner.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}

Why Regular expression matches one character less at the end?

The problem is that the last character never gets matched.
When I print the match using group(), it shows everything except the last character.
It is the same in all cases.
Below is the code and its output.
package mon;
import java.util.*;
import java.util.regex.*;
class HackerRank {
static void Pattern(String text) {
String p="\\d{1,2}|(0|1)\\d{2}|2[0-4]\\d|25[0-5]";
String pattern="(("+p+")\\.){3}"+p;
Pattern pi=Pattern.compile(pattern);
Matcher m=pi.matcher(text);
// System.out.println(m.group());
if(m.find() && m.group().equals(text))
System.out.println(m.group()+"true");
else
System.out.println(m.group()+" false");
}
public static void main(String[] args) {
Scanner sc=new Scanner(System.in);
while(sc.hasNext()) {
Pattern(sc.next());
}
sc.close();
}
}
I/P:000.12.12.034;
O/P:000.12.12.03 false
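You should properly group the alternatives inside the octet pattern. In your pattern the last octet is appended without a group, so its alternation applies to the whole expression: the first alternative becomes ((...)\.){3}\d{1,2}, which stops after two digits of the final octet.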
String p="(?:\\d{1,2}|[01]\\d{2}|2[0-4]\\d|25[0-5])";
// ^^^ ^
Then build the pattern like
String pattern = p + "(?:\\." + p + "){3}";
It will become a bit more efficient. Then, use matches to require a full string match:
if(m.matches()) {...
See a Java demo:
String p="(?:\\d{1,2}|[01]\\d{2}|2[0-4]\\d|25[0-5])";
String pattern = p + "(?:\\." + p + "){3}";
String text = "192.156.34.56";
// System.out.println(pattern); => (?:\d{1,2}|[01]\d{2}|2[0-4]\d|25[0-5])(?:\.(?:\d{1,2}|[01]\d{2}|2[0-4]\d|25[0-5])){3}
Pattern pi=Pattern.compile(pattern);
Matcher m=pi.matcher(text);
if(m.matches())
System.out.println(m.group()+" => true");
else
System.out.println("False");
// Output: 192.156.34.56 => true
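To reproduce the symptom from the question next to the fix, here is a small sketch. The class OctetDemo and its method names are made up; the two patterns are the ones quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OctetDemo {
    static final String UNGROUPED = "\\d{1,2}|(0|1)\\d{2}|2[0-4]\\d|25[0-5]";
    static final String GROUPED = "(?:\\d{1,2}|[01]\\d{2}|2[0-4]\\d|25[0-5])";

    // Pattern from the question: the last octet is appended without a group.
    static String findWithOriginal(String text) {
        Matcher m = Pattern.compile("((" + UNGROUPED + ")\\.){3}" + UNGROUPED).matcher(text);
        return m.find() ? m.group() : null;
    }

    // Fixed pattern: every octet is grouped and matches() demands a full match.
    static boolean isIp(String text) {
        return Pattern.compile(GROUPED + "(?:\\." + GROUPED + "){3}").matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(findWithOriginal("000.12.12.034")); // prints: 000.12.12.03
        System.out.println(isIp("000.12.12.034"));             // prints: true
        System.out.println(isIp("256.1.1.1"));                 // prints: false
    }
}
```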

Substring based on special characters

I have to extract the table names and column names from a SQL statement. To do this I split the from-clause data on spaces and stored all the elements in a list. However, some of the columns are now wrapped in function calls or other expressions.
For example, some of the columns:
max(TableName1.ColumnName1) --> TableName1.ColumnName1
concat('Q',TableName2.ColumnName2)} --> TableName2.ColumnName2
left(convert(varchar(90),TableName3.ColumnName3),1)}) --> TableName3.ColumnName3
So I now look at the strings that contain a dot (.).
The dot is my only hint: starting from it, I have to take the text to its left and right up to the nearest special character.
The special characters I might get include , ( and ).
import java.util.*;
import java.text.*;
import java.util.regex.*;
public class Parser {
private static Pattern p = Pattern.compile("(?![\\(\\,])([^\\(\\)\\,]*\\.[^\\(\\)\\,]+)(?=[\\)\\,])");
private static String getColumnName(String s) {
Matcher m = p.matcher(s);
while(m.find()) {
return m.group(1);
}
return "";
}
public static void main(String []args) {
String s1= "max(TableName1.ColumnName1)";
System.out.println(getColumnName(s1));
String s2= "concat('Q',TableName2.ColumnName2)}";
System.out.println(getColumnName(s2));
String s3= "left(convert(varchar(90),TableName3.ColumnName3),1)})";
System.out.println(getColumnName(s3));
}
}
Output:
TableName1.ColumnName1
TableName2.ColumnName2
TableName3.ColumnName3
You can use a regular expression like [(),{}] to split the string into tokens, and then just select the token with the "." sign in it (StringUtils below is assumed to be the Apache Commons Lang class). For example:
public static String getColumnName (String input) {
if (StringUtils.isEmpty(input)) return input;
String[] tokens = input.split("[(),{}]");
for (String token: tokens) {
if (token.contains(".")) return token;
}
return input;
}
public static void main(String args[]) throws Exception {
//The two tokens will be "max" and "TableName1.ColumnName1".
String test1 = "max(TableName1.ColumnName1)";
//The three tokens will be "concat", "'Q'" and "TableName2.ColumnName2".
String test2 = "concat('Q',TableName2.ColumnName2)}";
//The eight tokens will be "left", "convert", "varchar", "90", "",
//"TableName3.ColumnName3", "" and "1" (trailing empty strings are dropped).
String test3 = "left(convert(varchar(90),TableName3.ColumnName3),1)})";
System.out.println(getColumnName(test1));
System.out.println(getColumnName(test2));
System.out.println(getColumnName(test3));
}
The print out will give you:
TableName1.ColumnName1
TableName2.ColumnName2
TableName3.ColumnName3
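If an expression can contain more than one qualified column, another option is to match the "name.name" shape directly instead of splitting. This is only a sketch under the assumption that table and column names are plain unquoted identifiers; QualifiedNames and find are hypothetical names, and a dotted word inside a string literal would be picked up as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QualifiedNames {
    // identifier '.' identifier, e.g. TableName1.ColumnName1
    private static final Pattern NAME = Pattern.compile("[A-Za-z_]\\w*\\.[A-Za-z_]\\w*");

    // Collects every Table.Column reference in the order it appears.
    static List<String> find(String expression) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(expression);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(find("max(TableName1.ColumnName1)"));
        System.out.println(find("concat('Q',TableName2.ColumnName2)}"));
        System.out.println(find("left(convert(varchar(90),TableName3.ColumnName3),1)})"));
    }
}
```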

Regex back reference to match a number (or any char sequence) with itself

I am missing something basic here. I have this regex (.*)=\1 and I am using it to match 100=100, but it is failing. When I remove the backreference from the regex and keep the capturing group, it shows that the captured group is '100'. Why does it not work when I try to use the backreference?
package test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
public static void main(String[] args) {
String eqPattern = "(.*)=\1";
String input[] = {"1=1"};
testAndPrint(eqPattern, input); // this does not work
eqPattern = "(.*)=";
input = new String[]{"1=1"};
testAndPrint(eqPattern, input); // this works when the backreference is removed from the expr
}
static void testAndPrint(String regexPattern, String[] input) {
System.out.println("\n Regex pattern is "+regexPattern);
Pattern p = Pattern.compile(regexPattern, Pattern.CASE_INSENSITIVE);
boolean found = false;
for (String str : input) {
System.out.println("Testing "+str);
Matcher matcher = p.matcher(str);
while (matcher.find()) {
System.out.println("I found the text "+ matcher.group() +" starting at " + "index "+ matcher.start()+" and ending at index "+matcher.end());
found = true;
System.out.println("Group captured "+matcher.group(1));
}
if (!found) {
System.out.println("No match found");
}
}
}
}
When I run this, I get the following output
Regex pattern is (.*)=\1
Testing 100=100
No match found
Regex pattern is (.*)=
Testing 100=100
I found the text 100= starting at index 0 and ending at index 4
Group captured 100 --> If the group contains 100, why doesn't it match when I add \1 above?
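You have to escape the backslash in the pattern string. In a Java string literal, "\1" is an octal escape for the character U+0001, so the regex engine never sees a backreference.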
String eqPattern = "(.*)=\\1";
I think you need to escape the backslash.
String eqPattern = "(.*)=\\1";
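A quick way to check this is to compare the two string literals and run the escaped one. The sketch below uses made-up names (BackrefDemo, sameOnBothSides).

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // True when the text before '=' is repeated exactly after it.
    static boolean sameOnBothSides(String s) {
        return Pattern.matches("(.*)=\\1", s);
    }

    public static void main(String[] args) {
        String unescaped = "(.*)=\1";  // "\1" is an octal escape: one char, U+0001
        String escaped = "(.*)=\\1";   // backslash followed by '1': a backreference
        System.out.println(unescaped.length());         // prints: 6
        System.out.println(escaped.length());           // prints: 7
        System.out.println((int) unescaped.charAt(5));  // prints: 1
        System.out.println(sameOnBothSides("100=100")); // prints: true
        System.out.println(sameOnBothSides("100=200")); // prints: false
    }
}
```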

java tokenizer for strings

I have a text file and want to tokenize its lines -- but only the sentences with the # character.
For example, given...
Buah... Molt bon concert!! #Postconcert #gintonic
...I want to print only #Postconcert #gintonic.
I have already tried this code with some changes...
public class MyTokenizer {
/**
* @param args
*/
public static void main(String[] args) {
tokenize("Europe3.txt","allo.txt");
}
public static void tokenize(String sFile,String sFileOut) {
String sLine="", sToken="";
MyBufferedReaderWriter f = new MyBufferedReaderWriter();
f.openRFile(sFile);
MyBufferedReaderWriter fOut = new MyBufferedReaderWriter();
fOut.openWFile(sFileOut);
while ((sLine=f.readLine()) != null) {
//StringTokenizer st = new StringTokenizer(sLine, "#");
String[] tokens = sLine.split("\\#");
for (String token : tokens)
{
fOut.writeLine(token);
//System.out.println(token);
}
/*while (st.hasMoreTokens()) {
sToken = st.nextToken();
System.out.println(sToken);
}*/
}
f.closeRFile();
}
}
Can anyone help?
You can try something like this with a regex:
package com.stackoverflow.answers;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HashExtractor {
public static void main(String[] args) {
String strInput = "Buah... Molt bon concert!! #Postconcert #gintonic";
String strPattern = "(?:\\s|\\A)[##]+([A-Za-z0-9-_]+)";
Pattern pattern = Pattern.compile(strPattern);
Matcher matcher = pattern.matcher(strInput);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
As per the given example, split() would store the values like this:
tokens[0]=Buah... Molt bon concert!!
tokens[1]=Postconcert
tokens[2]=gintonic
So you just need to skip the first value and prepend '#' (if you need it in your output) to the other values.
Hope this helps.
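A sketch of that split-based idea follows. SplitHashTags and hashtagsBySplit are hypothetical names, and the code additionally assumes that a tag ends at the first whitespace, so ordinary words after a tag are dropped.

```java
import java.util.StringJoiner;

class SplitHashTags {
    // Rebuilds the hashtags of one line from the pieces produced by split("#").
    static String hashtagsBySplit(String line) {
        String[] tokens = line.split("#");
        StringJoiner out = new StringJoiner(" ");
        // tokens[0] is the text before the first '#', so start at index 1.
        for (int i = 1; i < tokens.length; i++) {
            // Keep only the first word of the piece; the rest is ordinary text.
            String word = tokens[i].trim().split("\\s+")[0];
            if (!word.isEmpty()) {
                out.add("#" + word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(hashtagsBySplit("Buah... Molt bon concert!! #Postconcert #gintonic"));
        // prints: #Postconcert #gintonic
    }
}
```

A line without any # yields an empty string, so the caller can skip it before writing to the output file.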
You have not specifically asked for this, but I assume you are trying to extract all the #hashtags from your text file.
To do this, regex is your friend:
String text = "Buah... Molt bon concert!! #Postconcert #gintonic";
System.out.println(getHashTags(text));
public Collection<String> getHashTags(String text) {
Pattern pattern = Pattern.compile("(#\\w+)");
Matcher matcher = pattern.matcher(text);
Set<String> htags = new HashSet<>();
while (matcher.find()) {
htags.add(matcher.group(1));
}
return htags;
}
Compile a pattern like #\w+: it matches everything that starts with a # followed by one or more (+) word characters (\w).
In a Java string literal the \ has to be escaped as \\.
Finally, put the expression in a capturing group by surrounding it with parentheses, (#\w+), to get access to the matched text.
For every match, add the first group to the set htags; in the end we have a set with all the hashtags in it (a HashSet does not guarantee any particular order):
[#gintonic, #Postconcert]
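If the tags should come out in the order they appear in the text, a LinkedHashSet can be used in place of the HashSet. A minimal sketch, with OrderedHashTags as a made-up class name:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedHashTags {
    private static final Pattern TAG = Pattern.compile("#\\w+");

    // Collects the distinct hashtags of the text in order of first appearance.
    static Set<String> getHashTags(String text) {
        Set<String> tags = new LinkedHashSet<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group()); // whole match, no capturing group needed
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(getHashTags("Buah... Molt bon concert!! #Postconcert #gintonic"));
        // prints: [#Postconcert, #gintonic]
    }
}
```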
