How can I replace different matches with different replacements using a regex?
If I have two or more alternatives separated by |, I want to reference the string or substring that actually matched and choose the replacement based on it.
Suppose I have
Pattern p = Pattern.compile("man|woman|girls");
Matcher m = p.matcher("some string");
If the match is "man" I want to use a different replacement from when the match is "woman" or "girls".
I have looked through "Most efficient way to use replace multiple words in a string" but don't understand how to reference the match itself.
Consider improving your pattern by adding word boundaries, to prevent it from matching only part of a word; otherwise man would also match inside mandatory.
(BTW: if you want to replace words that share the same start, like man and manual, you should place manual before man in the regex, or man will consume the <man> part of manual, which prevents the remaining ual from being matched. So the correct order would be manual|man.)
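To make the ordering point concrete, here is a small sketch. The class and method names are made up for illustration. It shows which alternative wins when the regex has no word boundaries, and that \b stops man from matching inside mandatory.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class AlternationOrderDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Alternatives are tried left to right, so "man" wins here.
        System.out.println(firstMatch("man|manual", "manual"));   // man
        // With the longer word first, the whole word is matched.
        System.out.println(firstMatch("manual|man", "manual"));   // manual
        // Word boundaries prevent a match inside a longer word.
        System.out.println(firstMatch("\\bman\\b", "mandatory")); // null
    }
}
```

Note that once the whole alternation is wrapped in \b...\b, the engine backtracks to the longer alternative by itself, so the ordering matters mostly when the boundaries are missing.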
So your regex can look more like
Pattern p = Pattern.compile("\\b(man|woman|girls)\\b");
Matcher m = p.matcher("some text about woman and few girls");
The next thing you can do is store the pairs originalValue -> replacement in a collection that lets you easily look up the replacement for a value. The simplest way is to use a Map:
Map<String, String> replacementMap = new HashMap<>();
replacementMap.put("man", "foo");
replacementMap.put("woman", "bar");
replacementMap.put("girls", "baz");
Now your code can look like this:
StringBuffer sb = new StringBuffer();
while (m.find()) {
    String wordToReplace = m.group();
    // replace the found word with its replacement from the map
    m.appendReplacement(sb, replacementMap.get(wordToReplace));
}
m.appendTail(sb);
String replaced = sb.toString();
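Put together as a runnable sketch, it could look like the following. The class name, the method name and the "baz ($5)" replacement are made up for illustration. One addition that is not in the snippets above: Matcher.quoteReplacement, which is worth using whenever a replacement may contain $ or \, because appendReplacement otherwise treats them as group references and escapes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class for the loop shown above.
class WordReplacer {
    private static final Pattern WORDS = Pattern.compile("\\b(man|woman|girls)\\b");

    static String replaceWords(String text, Map<String, String> replacements) {
        Matcher m = WORDS.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Fall back to the matched word itself if the map has no entry for it.
            String replacement = replacements.getOrDefault(m.group(), m.group());
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> replacements = new HashMap<>();
        replacements.put("man", "foo");
        replacements.put("woman", "bar");
        replacements.put("girls", "baz ($5)"); // would throw without quoteReplacement
        System.out.println(replaceWords("some text about woman and few girls", replacements));
        // prints: some text about bar and few baz ($5)
    }
}
```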
You could do
str =
str.replace("woman", "REPLACEMENT1")
.replace("man", "REPLACEMENT2")
.replace("girls", "REPLACEMENT3");
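One caveat with chaining replace calls: each call runs on the output of the previous one, so a later call can rewrite text that an earlier call inserted. The sketch below uses made-up replacement words to show the effect. It is an illustration of the pitfall, not a claim about the placeholders used above.

```java
// Hypothetical demo of how chained String.replace calls can interfere.
class ChainedReplaceDemo {
    static String chained(String s) {
        // "woman" becomes "human", and then the "man" inside "human" is replaced too.
        return s.replace("woman", "human").replace("man", "person");
    }

    public static void main(String[] args) {
        System.out.println(chained("a woman and a man"));
        // prints: a huperson and a person
    }
}
```

A single pass over the input with a Matcher avoids this, because text that has already been substituted is never scanned again.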
Related
String s = "A..?-B^&';(,,,)G56.6C,,,M4788C..,,A1''";
String[] result = s.split("(?=[ABC])");
System.out.println(Arrays.toString(result));
Output:
[A..?-, B^&';(,,,)G56.6, C,,,M4788, C..,,, A1'']
Please refer to the split in the above case. I am trying to separate strings based on A, B or C. How can I get the same split strings into an ArrayList using Pattern and Matcher? I could not figure out how to group in the code below.
Pattern p = Pattern.compile("(?=[ABC])");
Matcher m = p.matcher(s);
List<String> matches = new ArrayList<>();
while (m.find()) {
matches.add(m.group());
}
Also, suppose I have a few characters before the first occurrence of A, B or C and I want to combine them with the first element in the ArrayList, e.g. ,,A..
Appreciate the help.
[ABC][^ABC]*
If I didn't omit any edge case, that should work with the code you provided.
For the extra question, you could possibly add (^[^ABC]*)* to the beginning, but that makes it slower and less readable, not to mention that it will only work for single-line strings. I would recommend just parsing the leading characters manually, treating them as the special case they are.
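A minimal sketch of that suggestion, with the leading characters handled as a special case in code rather than in the regex. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the pieces with a Matcher instead of split().
class AbcSplitter {
    private static final Pattern PIECE = Pattern.compile("[ABC][^ABC]*");

    static List<String> splitKeepingPrefix(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = PIECE.matcher(s);
        while (m.find()) {
            if (parts.isEmpty()) {
                // First piece: include anything that precedes the first A, B or C.
                parts.add(s.substring(0, m.end()));
            } else {
                parts.add(m.group());
            }
        }
        if (parts.isEmpty() && !s.isEmpty()) {
            parts.add(s); // no A, B or C at all: keep the input as one piece
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingPrefix("A..?-B^&';(,,,)G56.6C,,,M4788C..,,A1''"));
        // prints: [A..?-, B^&';(,,,)G56.6, C,,,M4788, C..,,, A1'']
        System.out.println(splitKeepingPrefix(",,A..B1"));
        // prints: [,,A.., B1]
    }
}
```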
I have some strings with equations in the following format ((a+b)/(c+(d*e))).
I also have a text file that contains the names of each variable, e.g.:
a velocity
b distance
c time
etc...
What would be the best way for me to write code so that it plugs in velocity everywhere a occurs, and distance for b, and so on?
Don't use String#replaceAll in this case if there is even a slight chance that a replacement you insert contains a substring you will want to replace later. For example, "distance" contains a, so if you later replace a with "velocity" you will end up with "disvelocityance".
It is the same problem as wanting to replace A with B and B with A. For this kind of text manipulation you can use appendReplacement and appendTail from the Matcher class. Here is an example:
String input = "((a+b)/(c+(d*e)))";
Map<String, String> replacementsMap = new HashMap<>();
replacementsMap.put("a", "velocity");
replacementsMap.put("b", "distance");
replacementsMap.put("c", "time");
StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile("\\b(a|b|c)\\b");
Matcher m = p.matcher(input);
while (m.find())
m.appendReplacement(sb, replacementsMap.get(m.group()));
m.appendTail(sb);
System.out.println(sb);
Output:
((velocity+distance)/(time+(d*e)))
This code finds each occurrence of a, b or c that isn't part of a longer word (there is no word character directly before or after it, which is enforced by \b, the word-boundary anchor). appendReplacement appends to the StringBuffer the text between the previous match (or the beginning of the input, for the first match) and the current match, followed by the replacement for the current match (which I get from the Map). appendTail appends to the StringBuffer the text after the last match.
Also, to make this code more dynamic, the regex should be generated automatically from the keys in the Map. You can use this code to do it:
StringBuilder regexBuilder = new StringBuilder("\\b(");
for (String word:replacementsMap.keySet())
regexBuilder.append(Pattern.quote(word)).append('|');
regexBuilder.deleteCharAt(regexBuilder.length()-1);//lets remove last "|"
regexBuilder.append(")\\b");
String regex = regexBuilder.toString();
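As an end-to-end sketch, the two steps can be combined with the variable file from the question. I am assuming each line of that file has the form "name description" separated by whitespace. The class and method names are made up, and reading the file itself is left out (the lines are passed in as a List).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper combining the map, the generated regex and the replacement loop.
class VariableExpander {
    // Parses lines such as "a velocity" into a name -> description map.
    static Map<String, String> parseNames(List<String> lines) {
        Map<String, String> names = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length == 2) {
                names.put(parts[0], parts[1]);
            }
        }
        return names;
    }

    static String expand(String equation, Map<String, String> names) {
        if (names.isEmpty()) {
            return equation;
        }
        // Builds \b(\Qa\E|\Qb\E|...)\b from the map keys.
        StringJoiner alternatives = new StringJoiner("|", "\\b(", ")\\b");
        for (String key : names.keySet()) {
            alternatives.add(Pattern.quote(key));
        }
        Matcher m = Pattern.compile(alternatives.toString()).matcher(equation);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(names.get(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> names = parseNames(Arrays.asList("a velocity", "b distance", "c time"));
        System.out.println(expand("((a+b)/(c+(d*e)))", names));
        // prints: ((velocity+distance)/(time+(d*e)))
    }
}
```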
I'd make a HashMap mapping the variable names to the descriptions, then iterate through all the characters in the string and replace each occurrence of a recognised key with its mapping.
I would use a StringBuilder to build up the new string.
Using a hashmap and iterating over the string as A Boschman suggested is one good solution.
Another solution would be to do what others have suggested and do a .replaceAll(); however, you would want to use a regular expression to specify that only the words matching the whole variable name and not a substring are replaced. A regex using word boundary '\b' matching will provide this solution.
String variable = "a";
String newVariable = "velocity";
str = str.replaceAll("\\b" + variable + "\\b", newVariable); // Strings are immutable, so keep the result
See http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
For string str, use the replaceAll() function:
str = str.toUpperCase(); //Prevent substitutions of characters in the middle of a word
str = str.replaceAll("A", "velocity");
str = str.replaceAll("B", "distance");
//etc.
First time posting.
Firstly, I know how to use both Pattern/Matcher and String.split.
My question is: which is best for me to use in my example, and why?
Suggestions for better alternatives are also welcome.
Task:
I need to extract an unknown NOUN between two known regexp in an unknown string.
My Solution:
get the Start and End of the noun (from Regexp 1&2) and substring to extract the noun.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
I need to locate the index position AFTER the first regex.
I need to locate the index position BEFORE the second regex.
A) I can use pattern matcher
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (m.find()) {
int afterRegex1 = m.end();
} else {
throw new IllegalArgumentException();
//TODO Exception Management;
}
B) I can use String Split
String[] split = line.split(regexp1, 2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
int afterRegex1 = line.indexOf(split[1]);
Which Approach should I use and why?
I don't know which is more efficient on time and memory.
Both are near enough as readable to myself.
I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
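To see the difference the reluctant quantifier makes, here is a small comparison. The class name, the helper and the extra "XccX" at the end of the test line are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo comparing a greedy and a reluctant capture group.
class ReluctantDemo {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String between(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The ending pattern appears a second time in the trailing unknown part.
        String line = "unknownXoooXNOUNXccccccXunknownXccX";
        System.out.println(between("Xo+X(.*?)Xc+X", line)); // NOUN
        System.out.println(between("Xo+X(.*)Xc+X", line));  // NOUNXccccccXunknown
    }
}
```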
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is a group 0, but that matches the whole pattern, and it's equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs
You should use String.split() for readability unless you're in a tight loop.
Per split()'s javadoc, split() does the equivalent of Pattern.compile(), which you can optimize away if you're in a tight loop.
It looks like you want to get a unique occurrence. For this do simply
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, compile the Pattern once and call pattern.matcher(input).replaceAll(...) instead.
In case your input contains line breaks, use Pattern.DOTALL or the inline (?s) modifier.
In case you want to use split, consider using Guava's Splitter. It behaves more sanely and also accepts a Pattern, which is good for speed.
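A sketch of the precompiled variant with DOTALL, under the assumption that the input contains exactly one occurrence of each marker. The class and method names are hypothetical. Note that when the pattern does not match, replaceAll returns the input unchanged, so callers may want to check for that case.

```java
import java.util.regex.Pattern;

// Hypothetical helper: precompiled pattern, DOTALL so that '.' also matches line breaks.
class DotallExtract {
    private static final Pattern NOUN =
            Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    static String extract(String input) {
        // The whole input is replaced by the content of group 1.
        return NOUN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // With DOTALL the noun may span a line break.
        System.out.println(extract("pre\nXoooXNO\nUNXccXpost").equals("NO\nUN")); // true
        // No match: the input comes back unchanged.
        System.out.println(extract("nothing to see")); // nothing to see
    }
}
```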
If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.
I'm trying to perform some super simple parsing of log files, so I'm using the String.split method like this:
String [] parts = input.split(",");
And it works great for input like:
a,b,c
Or
type=simple, output=Hello, repeat=true
Just to say something.
How can I escape the comma, so it doesn't match intermediate commas?
For instance, if I want to include a comma in one of the parts:
type=simple, output=Hello, world, repeate=true
I was thinking of something like:
type=simple, output=Hello\, world, repeate=true
But I don't know how to create the split to avoid matching the comma.
I've tried:
String [] parts = input.split("[^\,],");
But, well, it's not working.
You can solve it using a negative look behind.
String[] parts = str.split("(?<!\\\\), ");
Basically it says: split on each ", " that is not preceded by a backslash.
String str = "type=simple, output=Hello\\, world, repeate=true";
String[] parts = str.split("(?<!\\\\), ");
for (String s : parts)
System.out.println(s);
Output:
type=simple
output=Hello\, world
repeate=true
(ideone.com link)
If you happen to be stuck with the non-escaped comma-separated values, you could do the following (similar) hack:
String[] parts = str.split(", (?=\\w+=)");
Which says split on each ", " which is followed by some word-characters and an =
(ideone.com link)
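For reference, the lookahead variant can be tried with the following sketch. The class and method names are made up. It relies on the assumption that every field starts with word characters followed by =.

```java
import java.util.Arrays;

// Hypothetical demo of splitting on ", " only when a key= follows.
class LookaheadSplitDemo {
    static String[] splitFields(String input) {
        // The lookahead (?=\w+=) is checked but not consumed by the split.
        return input.split(", (?=\\w+=)");
    }

    public static void main(String[] args) {
        String input = "type=simple, output=Hello, world, repeate=true";
        System.out.println(Arrays.toString(splitFields(input)));
        // prints: [type=simple, output=Hello, world, repeate=true]
        for (String field : splitFields(input)) {
            System.out.println(field);
        }
        // prints three lines: type=simple / output=Hello, world / repeate=true
    }
}
```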
I'm afraid, there's no perfect solution for String.split. Using a matcher for the three parts would work. In case the number of parts is not constant, I'd recommend a loop with matcher.find. Something like this maybe
final String s = "type=simple, output=Hello, world, repeat=true";
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,|$)");
final Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group(1));
You'll probably want to skip the spaces after the comma as well:
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,\\s*|$)");
It's not really complicated, just note that you need four backslashes in order to match one.
Escaping works with the opposite of aioobe's answer (updated: aioobe now uses the same construct but I didn't know that when I wrote this), negative lookbehind
final String s = "type=simple, output=Hello\\, world, repeate=true";
final String[] tokens = s.split("(?<!\\\\),\\s*");
for(final String item : tokens){
System.out.println("'" + item.replace("\\,", ",") + "'");
}
Output:
'type=simple'
'output=Hello, world'
'repeate=true'
Reference:
Pattern: Special Constructs
I think
input.split("[^\\\\],");
should work. It will split at all commas that are not preceded by a backslash.
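One thing to check before relying on this: a character class such as [^\\\\] is part of the match, so split removes the character in front of each comma along with the comma itself. A lookbehind is not consumed. The sketch below compares the two on a made-up input. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical comparison of a consuming character class and a negative lookbehind.
class CommaSplitComparison {
    static String[] consuming(String input) {
        return input.split("[^\\\\],");   // matches one non-backslash character plus the comma
    }

    static String[] lookbehind(String input) {
        return input.split("(?<!\\\\),"); // matches only the comma
    }

    public static void main(String[] args) {
        String input = "type=simple,output=Hello\\,world,repeat=true";
        System.out.println(Arrays.toString(consuming(input)));
        // prints: [type=simpl, output=Hello\,worl, repeat=true]
        System.out.println(Arrays.toString(lookbehind(input)));
        // prints: [type=simple, output=Hello\,world, repeat=true]
    }
}
```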
BTW if you are working with Eclipse, I can recommend the QuickRex Plugin to test and debug Regexes.
I am not a beginner with regular expressions, but their use in Perl seems a bit different than in Java.
Anyway, I basically have a dictionary of shorthand words and their definitions. I want to iterate over the words in the dictionary and replace them with their meanings. What is the best way to do this in Java?
I have seen String.replaceAll(), String.replace(), as well as the Pattern/Matcher classes. I wish to do a case insensitive replacement along the lines of:
word =~ s/\s?\Q$short_word\E\s?/ \Q$short_def\E /sig
While I am at it, do you think that it is best to extract all the words from the string and then apply my dictionary or just apply the dictionary to the string? I know that I need to be careful, because the shorthand words could match parts of other shorthand meanings.
Hopefully this all makes sense.
Thanks.
Clarification:
Dictionary is something like:
lol:laugh out loud, rofl:rolling on the floor laughing, ll:like lemons
string is:
lol, i am rofl
replaced text:
laugh out loud, i am rolling on the floor laughing
Notice how the ll wasn't added anywhere.
The danger is false positives inside of normal words. "fell" != "felikes lemons"
One way is to split the words on whitespace (do multiple spaces need to be conserved?) and then loop over the list, applying an "if contains() { replace } else { output original }" check to each word.
My output class would be a StringBuffer
StringBuffer outputBuffer = new StringBuffer();
for(String s: split(inputText)) {
    outputBuffer.append(dictionary.containsKey(s) ? dictionary.get(s) : s);
}
Make your split method smart enough to return word delimiters also:
split("now is the time") -> now,<space>,is,<space>,the,<space><space>,time
Then you don't have to worry about conserving white space - the loop above will just append anything that isn't a dictionary word to the StringBuffer.
Here's a recent SO thread on retaining delimiters when regexing.
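A minimal sketch of such a delimiter-preserving split. I am assuming here that tokens are runs of word characters (\w+) and runs of everything else (\W+), rather than a pure whitespace split, so that punctuation such as the comma in "lol," does not stick to the word. I also assume the dictionary keys are lower case. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: tokenizes into words and delimiters, then looks up each token.
class DictionaryExpander {
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\W+");

    // Returns words and the delimiters between them, in order, so nothing is lost.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    static String expand(String text, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder();
        for (String token : tokenize(text)) {
            // Case-insensitive lookup; anything not in the dictionary is copied as-is.
            out.append(dictionary.getOrDefault(token.toLowerCase(Locale.ROOT), token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("lol", "laugh out loud");
        dictionary.put("rofl", "rolling on the floor laughing");
        dictionary.put("ll", "like lemons");
        System.out.println(expand("LOL, i am rofl but fell", dictionary));
        // prints: laugh out loud, i am rolling on the floor laughing but fell
    }
}
```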
If you insist on using regex, this would work (taking Zoltan Balazs' dictionary map approach):
Map<String, String> substitutions = loadDictionaryFromSomewhere();
int lengthOfShortestKeyInMap = 3; //Calculate
int lengthOfLongestKeyInMap = 3; //Calculate
StringBuffer output = new StringBuffer(input.length());
Pattern pattern = Pattern.compile("\\b(\\w{" + lengthOfShortestKeyInMap + "," + lengthOfLongestKeyInMap + "})\\b");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String candidate = matcher.group(1);
String substitute = substitutions.get(candidate);
if (substitute == null)
substitute = candidate; // no match, use original
matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
}
matcher.appendTail(output);
// output now contains the text with substituted words
If you plan to process many inputs, pre-compiling the pattern is more efficient than using String.split(), which compiles a new Pattern each call.
(edit) Compiling all of the keys into a single pattern yields a more efficient approach, like so:
Pattern pattern = Pattern.compile("\\b(lol|rtfm|rofl|wtf)\\b");
// rest of the method unchanged, don't need the shortest/longest key stuff
This allows the regex engine to skip over any words that happen to be short enough but aren't in the list, saving you a lot of map accesses.
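The single-pattern idea can be sketched as follows, with the alternation generated from the dictionary keys instead of being typed by hand. The class name is made up, and I am assuming the keys are stored in lower case so that a case-insensitive match can be looked up after lowercasing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one precompiled, case-insensitive pattern built from all keys.
class ShorthandReplacer {
    private final Map<String, String> dictionary;
    private final Pattern pattern;

    ShorthandReplacer(Map<String, String> dictionary) {
        this.dictionary = dictionary;
        List<String> keys = new ArrayList<>(dictionary.keySet());
        // Longest keys first, so a short key never shadows a longer one.
        keys.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder regex = new StringBuilder("\\b(?:");
        for (int i = 0; i < keys.size(); i++) {
            if (i > 0) {
                regex.append('|');
            }
            regex.append(Pattern.quote(keys.get(i)));
        }
        regex.append(")\\b");
        this.pattern = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    String replace(String input) {
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            String key = m.group().toLowerCase(Locale.ROOT);
            String substitute = dictionary.getOrDefault(key, m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(substitute));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("lol", "laugh out loud");
        dictionary.put("rofl", "rolling on the floor laughing");
        dictionary.put("ll", "like lemons");
        System.out.println(new ShorthandReplacer(dictionary).replace("LOL, I am Rofl; fell"));
        // prints: laugh out loud, I am rolling on the floor laughing; fell
    }
}
```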
The first thing that comes to mind is this:
...
// eg: lol -> laugh out loud
Map<String, String> dictionary;
ArrayList<String> originalText;
ArrayList<String> replacedText;
for (String string : originalText) {
    // Map has containsKey(), not contains()
    if (dictionary.containsKey(string)) {
        replacedText.add(dictionary.get(string));
    } else {
        replacedText.add(string);
    }
}
...
Or you could use a StringBuffer instead of the replacedText.