Replace text with data & matched group contents - java

I couldn't find a solution to this when searching (believe me, I spent a good amount of time on it), so here goes.
Goal:
Match regex in a string and replace it with something that contains the matched value.
Regex used currently:
\b(Connor|charries96|Foo|Bar)\b
For the record, I suck at regex, in case this isn't the best way to do it.
My current code (and several other methods I tried) can only replace the text with the first match it encounters if there are multiple matches.
private Pattern regexFromList(List<String> input) {
    if(input.size() < 1) {
        return null; // nothing to match; a method returning Pattern cannot return ""
    }
    StringBuilder builder = new StringBuilder();
    builder.append("\\b");
    builder.append("(");
    for(String s : input) {
        builder.append(s);
        if(!s.equals(input.get(input.size() - 1)))
        {
            builder.append("|");
        }
    }
    builder.append(")");
    builder.append("\\b");
    return Pattern.compile(builder.toString(), Pattern.CASE_INSENSITIVE);
}
Example input:
charries96's name is Connor.
Example result using TEST as the data to prepend the match with
TESTcharries96's name is TESTcharries96.
Desired result using example input:
TESTcharries96's name is TESTConnor.
Here is my current code for replacing the text:
if(highlight) {
    StringBuilder builder = new StringBuilder();
    Matcher match = pattern.matcher(event.getMessage());
    String string = event.getMessage();
    if (match.find()) {
        string = match.replaceAll("TEST" + match.group());
        // I do realise I'm using #replaceAll but that's mainly given it gives me the same result as other methods so why not just cut to the chase.
    }
    builder.append(string);
    return builder.toString();
}
EDIT:
Working example of desired result on RegExr

There are a few problems here:
You are taking the user input as is and building the regex from it:
builder.append(s);
If the user input contains special characters, they may be interpreted as metacharacters and cause unexpected behavior.
Always use Pattern.quote if you want to match a string exactly as it is passed in:
builder.append(Pattern.quote(s));
Matcher.replaceAll is a high-level function that resets the Matcher (starting the match all over again), searches for all the matches and performs the replacement. In your code, "TEST" + match.group() is evaluated only once, after the first find(), so every match is replaced with the text of the first one. Refer to the capturing group in the replacement string instead. In your case, it can be as simple as:
String result = match.replaceAll("TEST$1");
The StringBuilder should be thrown away along with the if statement.
Matcher.find and Matcher.group are lower-level functions for fine-grained control over what you do with each match.
When you perform the replacement yourself, you need to build the result with Matcher.appendReplacement and Matcher.appendTail.
A while loop (instead of an if statement) should be used with Matcher.find to search for and replace all matches.
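Both approaches described in this answer can be sketched side by side. This is a minimal illustration rather than a drop-in fix: the class and method names are made up for the example, the name list is assumed to be non-empty, and a StringBuffer is used because that appendReplacement overload is available on every Java version.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PrefixMatches {
    // Builds \b(name1|name2|...)\b; each name is quoted so metacharacters match literally.
    // Assumes the list is not empty.
    static Pattern regexFromList(List<String> names) {
        String alternatives = names.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // High-level variant: $1 is expanded to the text captured by group 1 for each match.
    static String prefixWithReplaceAll(Pattern pattern, String message) {
        return pattern.matcher(message).replaceAll("TEST$1");
    }

    // Low-level variant: find each match and build the result piece by piece.
    static String prefixWithLoop(Pattern pattern, String message) {
        Matcher m = pattern.matcher(message);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the matched text from being interpreted
            m.appendReplacement(out, Matcher.quoteReplacement("TEST" + m.group(1)));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = regexFromList(Arrays.asList("Connor", "charries96", "Foo", "Bar"));
        String input = "charries96's name is Connor.";
        System.out.println(prefixWithReplaceAll(p, input));
        System.out.println(prefixWithLoop(p, input));
    }
}
```

The loop variant is only worth the extra lines when the replacement has to be computed per match; for a fixed prefix, replaceAll with $1 is enough.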

Related

Delete some part of the string in beginning and some at last in java

I want dynamic code which will trim off some part of the String at the beginning and some part at the end. I am able to trim the last part, but I am not able to trim the initial part of the String up to a specific point. Only the first character is deleted in the output.
public static String removeTextAndLastBracketFromString(String string) {
    StringBuilder str = new StringBuilder(string);
    int i=0;
    do {
        str.deleteCharAt(i);
        i++;
    } while(string.equals("("));
    str.deleteCharAt(string.length() - 2);
    return str.toString();
}
This is my code. When I pass Awaiting Research(5056) as an argument, the output given is waiting Research(5056. I want to trim the initial part of such string till ( and I want only the digits as my output. My expected output here is - 5056. Please help.
You don't need loops (in your code), you can use String.substring(int, int) in combination with String.indexOf(char):
public static void main(String[] args) {
    // example input
    String input = "Awaiting Research(5056)";
    // find the parentheses and use their indexes to get the content
    String output = input.substring(
        input.indexOf('(') + 1, // beginIndex is inclusive, so add 1 to skip the '('
        input.indexOf(')')      // endIndex is exclusive, so the ')' is left out
    );
    // print the result
    System.out.println(output);
}
Output:
5056
Hint:
Only use this if you are sure the input will always contain a ( and a ) with indexOf('(') < indexOf(')') or handle IndexOutOfBoundsExceptions, which will occur on most Strings not matching the braces constraint.
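If the input is not guaranteed to contain both parentheses, the indexes can be checked before calling substring. The following is a small sketch of that idea; the class name, the method name and the choice of Optional as return type are assumptions made for the example, not part of the answer above.

```java
import java.util.Optional;

class ParenContent {
    // Returns the text between the first '(' and the next ')' after it,
    // or an empty Optional when the input does not contain such a pair.
    static Optional<String> between(String input) {
        int open = input.indexOf('(');
        // search for ')' only after the '(' so the order constraint holds
        int close = input.indexOf(')', open + 1);
        if (open < 0 || close < 0) {
            return Optional.empty();
        }
        return Optional.of(input.substring(open + 1, close));
    }

    public static void main(String[] args) {
        System.out.println(between("Awaiting Research(5056)").orElse("<none>"));
        System.out.println(between("no parentheses").orElse("<none>"));
    }
}
```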
If your goal is just to find a single numeric value in the string, try matching the string against a regex for the numeric value; the match then gives you the number separated from the rest of the string,
e.g.:
Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher("somestringwithnumberlike123");
if(matcher.find()) {
    System.out.println(matcher.group());
}
Using a regexp to extract what you need is a better option:
String test = "Awaiting Research(5056)";
Pattern p = Pattern.compile("([0-9]+)");
Matcher m = p.matcher(test);
if (m.find()) {
    System.out.println(m.group());
}
For your case, it is better to use a regular expression to extract the part you are interested in.
Pattern pattern = Pattern.compile("(?<=\\().*(?=\\))");
Matcher matcher = pattern.matcher("Awaiting Research(5056)");
if(matcher.find())
{
    return matcher.group();
}
It is much easier to solve the problem using e.g. String.indexOf(..) and String.substring(from, to). But if, for some reason, you want to stick to your approach, here are some hints.
Your code does what it does because:
string.equals("(") is only true if the given string is exactly "(".
The do {code} while (condition) loop executes code once even if condition is not true -> think about using the while (condition) {code} loop instead.
If you change the condition to check the character at i, your code would remove the first, third, fifth character and so on: after the first execution i is 1 and the char at i is now the third char of the original string (because the first has already been removed) -> think about always checking and removing the char at index 0.
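Applied to the original method, these hints could look roughly like the sketch below. It keeps the StringBuilder approach from the question and only changes the loop; the class name is made up, and returning an empty string when there is no ( is an assumption about the desired behavior.

```java
class TrimWithLoop {
    static String removeTextAndLastBracketFromString(String string) {
        StringBuilder str = new StringBuilder(string);
        // always check and remove the character at index 0 until '(' is in front
        while (str.length() > 0 && str.charAt(0) != '(') {
            str.deleteCharAt(0);
        }
        // remove the '(' itself, if one was found
        if (str.length() > 0) {
            str.deleteCharAt(0);
        }
        // remove a trailing ')' if there is one
        int last = str.length() - 1;
        if (last >= 0 && str.charAt(last) == ')') {
            str.deleteCharAt(last);
        }
        return str.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeTextAndLastBracketFromString("Awaiting Research(5056)"));
    }
}
```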

Java, getting portion of pattern partially matched by input

As the title says, I'd like to get the portion of the pattern that is partially matched by the input. Example:
Pattern: aabb
Input string: "aa"
At this point, I'll use the hitEnd() method of the Matcher class to find out if the pattern is being matched partially, as shown in this answer, but I'd also like to find out that specifically "aa" of "aabb" is matched.
Is there any way to do this in Java?
This may be dirty, but here we go...
Once you know that some string hitEnd, do a second processing:
1. Remove the last character from the string
2. Search with the original regex
3. If it matches, then you are done and you have the part of the string
4. If not, go to 1 and repeat the whole process until you match
If test strings can be long, performance may be a problem. So instead of moving position by position from last to first, try searching in blocks.
For example, considering a string of 1,000 chars:
1. Test 1000/2 characters: 1-500. For this example, we consider it matches
2. Test the first 500 chars + 500/2 (positions 1-750). For this example, we consider it does not match. So we know that the position must lie between 500 and 750
3. Now test 1-625 ((750+500)/2)... If it matches, the position must lie between 625 and 750. If it does not match, it must lie between 500 and 625
...
There is no such function in the Matcher class. However, you could achieve it, for example, this way:
public String getPartialMatching(String pattern, String input) {
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(input);
    int end = 0;
    while(m.find()){
        end = m.end();
    }
    if (m.hitEnd()) {
        return input.substring(end);
    } else {
        return null;
    }
}
First, iterate over all fully matched parts of the string and skip them. For example, with input = "aabbaa", m.hitEnd() will return false unless aabb is skipped first.
Second, check whether the remaining part of the string partially matches.

Extracting Twitter username from a given text (JAVA, Regex)

I believe the code is OK, the problem is the regex.
Basically I want to find a username mention (it starts with #), and then I want to extract the allowed username part from the given word.
For example if the text contains "#FOO!!" I want to extract only "foo", but I believe the problem is with my "split("[a-z0-9-_]+")[0]" part.
Btw, allowed symbols are numbers, letters, - and _
public static Set<String> getMentionedUsers(List<Tweet> tweets) {
    Set<String> mentioned = new HashSet<>();
    for (Tweet tweet : tweets) {
        String tweetToAnal = null;
        if (tweet.getText().contains("#")) tweetToAnal = tweet.getText();
        if (tweetToAnal == null) continue;
        String[] splited = tweetToAnal.split("\\s+");
        for (String elem : splited) {
            String newElem = "";
            if (elem.startsWith("#")) {
                newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
            }
            if (newElem.length() > 0) mentioned.add(newElem);
        }
    }
    return mentioned;
}
The problem is not in your regex but in your logic.
You are using the lines below to analyze usernames:
if (elem.startsWith("#")) {
    newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
}
If you debug your code step by step, you will notice that you consume the # (with substring(1)) and then split using a regex that matches the allowed characters, so the split consumes all the username characters as well. However, you don't want to consume these characters with the split method; you just want to capture them.
So, you can still use split, with the negated version of your character class:
split("[^a-z0-9-_]+")
        ^---- Notice the negated character class indicator
On the other hand, instead of splitting the whole text into multiple tokens to be analyzed further, you can use a regex with a capturing group and then grab the username you want. So, instead of having this code:
String[] splited = tweetToAnal.split("\\s+");
for (String elem : splited) {
    String newElem = "";
    if (elem.startsWith("#")) {
        newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
    }
    if (newElem.length() > 0) mentioned.add(newElem);
}
You can use much simpler code like this:
Matcher m = Pattern.compile("(?<=#)([\\w-]+)").matcher(tweetToAnal); // Analyze text with a regex that will capture usernames preceded by #
while (m.find()) { // Stores all usernames (without #)
    mentioned.add(m.group(1));
}
Btw, I didn't test the code, so I may have a typo, but you can get the idea. Anyway, the code is pretty simple to understand.
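For reference, here is a self-contained sketch of the capturing-group approach. It works on a plain String instead of the Tweet class from the question, lowercases the result as the question asks, and spells out the allowed characters (letters, digits, - and _) in the character class. The class and method names are made up for the example. Unlike the original loop, it also accepts a # in the middle of a word.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Mentions {
    // '#' followed by one or more allowed characters; group 1 is the username
    private static final Pattern MENTION = Pattern.compile("#([A-Za-z0-9_-]+)");

    static Set<String> mentionedUsers(String text) {
        Set<String> mentioned = new LinkedHashSet<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            // store the username without '#', normalized to lower case
            mentioned.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return mentioned;
    }

    public static void main(String[] args) {
        System.out.println(mentionedUsers("Hi #FOO!! and #bar_1-x, bye #foo"));
    }
}
```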
I'm not a Java-Person, but you can easily match twitter-usernames without the "#" using the following regex:
(?<=#)[\w-]+
which can be seen here. Of course you would need to escape special characters properly, but since I have no clue of Java, you would have to do that and the actual matching by yourself.

Replacing Strings with a number in it without a for loop

So I currently have this code;
for (int i = 1; i <= this.max; i++) {
    in = in.replace("{place" + i + "}", this.getUser(i)); // Get the place of a user.
}
Which works well, but I would like to just keep it simple (using Pattern matching)
so I used this code to check if it matches;
System.out.println(StringUtil.matches("{place5}", "\\{place\\d\\}"));
StringUtil's matches;
public static boolean matches(String string, String regex) {
    if (string == null || regex == null) return false;
    Pattern compiledPattern = Pattern.compile(regex);
    return compiledPattern.matcher(string).matches();
}
Which returns true. Then comes the next part I need help with: replacing the {place5} so I can parse the number. I could replace "{place" and "}", but what if there were multiple placeholders in a string ("{place5} {username}")? Then I can't do that anymore, as far as I'm aware. If you know a simple way to do that, please let me know; if not, I can just stick with the for loop.
then comes the next part I need help with, replacing the {place5} so I can parse the number
In order to obtain the number after {place, you can use
s = s.replaceAll(".*\\{place(\\d+)}.*", "$1");
The regex matches an arbitrary number of characters before the string we are searching for, then {place, then matches and captures 1 or more digits with (\d+), then }, and then matches the rest of the string with .*. Note that if the string contains newline characters, you should add (?s) at the beginning of the pattern. $1 in the replacement pattern "restores" the value we need.
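A short sketch of how this could be used follows. extractPlace applies the replaceAll call from this answer (with (?s) added); fill is an additional illustration, not part of the answer, that replaces every {placeN} in one pass with Matcher.appendReplacement. The class name, both method names and the users list standing in for getUser(i) are assumptions for the example, and the digits are limited to 9 so that Integer.parseInt cannot overflow.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceHolders {
    private static final Pattern PLACE = Pattern.compile("\\{place(\\d{1,9})\\}");

    // Returns the number of the last {placeN} in s; s is returned unchanged if there is none.
    static String extractPlace(String s) {
        return s.replaceAll("(?s).*\\{place(\\d+)}.*", "$1");
    }

    // Replaces each {placeN} with users.get(N - 1); unknown places are left as they are.
    static String fill(String in, List<String> users) {
        Matcher m = PLACE.matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int place = Integer.parseInt(m.group(1));
            String replacement = (place >= 1 && place <= users.size())
                    ? users.get(place - 1)
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractPlace("Winner: {place5} {username}"));
        System.out.println(fill("1st: {place1}, 2nd: {place2}, {username}", Arrays.asList("alice", "bob")));
    }
}
```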

The best way to find out whether part of a string is a potential RegEx match

How would you do this?
I have a string and some regexes. I iterate over the string, and in every iteration I need to know whether the part of the string seen so far (from index 0 to the current index) could still become a full match of one or more of the given regexes in later iterations.
Thank you for your help.
What about a code like this:
// put all of the *greedy* regexes into a list
List<String> regex = new ArrayList<String>();
// here is my text
String mytext = "...";
// iterate over the prefixes of my text
for (int i = 1; i <= mytext.length(); i++) {
    // prefix from position 0 up to (but not including) index i
    String tmp = mytext.substring(0, i);
    // try every regex on the prefix
    for (String reg : regex) {
        Pattern p = Pattern.compile(reg);
        Matcher m = p.matcher(tmp);
        // if found, do something with the match
        if (m.find()) {
            System.out.println("bingo: " + reg + " found in " + tmp);
        }
    }
}
You could use Matcher.lookingAt() to try to match as much as possible from a given input, but not requiring the whole input to match (.matches() would require the full input to match and .find() would not require the match to start at the beginning).
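The difference between the three methods is easiest to see on a few inputs. The helper below is only an illustration (its name and the comma-separated result format are made up); it runs matches(), lookingAt() and find() on fresh matchers for the same regex and input.

```java
import java.util.regex.Pattern;

class MatchModes {
    // Returns "matches,lookingAt,find" as three booleans for the given regex and input.
    static String modes(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        boolean matches = p.matcher(input).matches();     // whole input must match
        boolean lookingAt = p.matcher(input).lookingAt(); // match must start at index 0
        boolean find = p.matcher(input).find();           // match may start anywhere
        return matches + "," + lookingAt + "," + find;
    }

    public static void main(String[] args) {
        System.out.println(modes("\\d+", "123"));    // true,true,true
        System.out.println(modes("\\d+", "123abc")); // false,true,true
        System.out.println(modes("\\d+", "abc123")); // false,false,true
    }
}
```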
I don't believe the Java regular expression API provides such "incremental" or "step-by-step" search.
What you could do however, is to formulate your expression using reluctant quantifiers.
[...] The reluctant quantifiers, however, take the opposite approach: They start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string. [...]
If this isn't viable in your case, you could use the Matcher.region(int, int) method to incrementally increase the region used by the matcher.
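Growing the region step by step could look like the sketch below. It uses Matcher.region(int, int), the method that sets the region a matcher works on, and reuses one Matcher for every prefix instead of creating substrings. The class and method names are hypothetical, and the example only reports prefixes that already match in full; it does not decide whether a prefix could still match later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncrementalRegion {
    // Returns every end index i for which the prefix text[0, i) fully matches the regex.
    static List<Integer> fullMatchEnds(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<Integer> ends = new ArrayList<>();
        for (int i = 1; i <= text.length(); i++) {
            m.region(0, i);      // restrict the matcher to the first i characters
            if (m.matches()) {   // matches() must cover the whole region
                ends.add(i);
            }
        }
        return ends;
    }

    public static void main(String[] args) {
        System.out.println(fullMatchEnds("\\d{3}", "123abc"));
        System.out.println(fullMatchEnds("a+", "aaab"));
    }
}
```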
So I've been searching for alternatives to Java's standard RegEx library and found one that does the job well - JRegex
