Help using Java Matcher to modify a group

I want to match on a regex and modify the match. Here is my function. Right now, my method doesn't change the input at all. What is wrong? Thanks.
Matcher abbrev_matcher = abbrev_p.matcher(buffer);
StringBuffer result = new StringBuffer();//must use stringbuffer here!
while (abbrev_matcher.find()){
//System.out.println("match found");
abbrev_matcher.appendReplacement(result, getReplacement(abbrev_matcher));
}
abbrev_matcher.appendTail(result);
private static String getReplacement(Matcher aMatcher){
StringBuilder temp = new StringBuilder(aMatcher.group(0));
for (int i = 0; i < temp.length(); i++){
if (temp.charAt(i) == '.'){
temp.deleteCharAt(i);
}
}
return temp.toString();
}

You just want to remove all the dots from the matched text? Here:
StringBuffer result = new StringBuffer();
while (abbrev_matcher.find()) {
abbrev_matcher.appendReplacement(result, "");
result.append(abbrev_matcher.group().replaceAll("\\.", ""));
}
abbrev_matcher.appendTail(result);
The reason for the appendReplacement(result, "") is because appendReplacement looks for $1, $2, etc., so it can replace them with capture groups. If you aren't passing string literals or other string constants to that method, it's best to avoid that processing step and use StringBuffer's append method instead. Otherwise it will tend to blow up if there are any dollar signs or backslashes in the replacement string.
As for your getReplacement method, in my tests it does change the matched string, but it doesn't do it correctly. For example, if the string is "...blah...", it returns ".blah.". That's because every time you call deleteCharAt(i) on the StringBuilder, you change the indexes of all subsequent characters, so the character that slides into position i is skipped when the loop increments i. You would have to iterate through the string backward to make that approach work, but it's not worth it; just start with an empty StringBuilder and build the string by appending instead of deleting. It's much more efficient as well as easier to manage.
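A minimal sketch of the append-based approach, under stated assumptions: the class name DotStripper, the method names, and the abbreviation pattern are made up for illustration (the question never shows abbrev_p). It also uses Matcher.quoteReplacement, a standard alternative to the appendReplacement(result, "") trick, so that any '$' or '\' in the replacement is treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotStripper {
    // Assumed pattern, for illustration only: two or more "letter + dot" pairs, e.g. "U.S.A."
    private static final Pattern ABBREV = Pattern.compile("(?:[A-Za-z]\\.){2,}");

    // Builds the result by appending every character that is not a dot,
    // so no index shifts underneath the loop.
    static String removeDots(String matched) {
        StringBuilder out = new StringBuilder(matched.length());
        for (int i = 0; i < matched.length(); i++) {
            char c = matched.charAt(i);
            if (c != '.') {
                out.append(c);
            }
        }
        return out.toString();
    }

    static String stripAbbreviationDots(String input) {
        Matcher m = ABBREV.matcher(input);
        StringBuffer result = new StringBuffer();
        while (m.find()) {
            // quoteReplacement stops '$' and '\' from being interpreted as group references
            m.appendReplacement(result, Matcher.quoteReplacement(removeDots(m.group())));
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDots("...blah..."));
        System.out.println(stripAbbreviationDots("Made in the U.S.A. by I.B.M. today."));
    }
}
```

With the assumed pattern, the trailing "today." is left alone because it contains only one letter-dot pair.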
Now that I think about it some more, the reason you aren't seeing any change may be that your code is throwing a StringIndexOutOfBoundsException, which you aren't seeing because the code runs in a try block and the corresponding catch block is empty (the classic Empty Catch Block anti-pattern). Is that the case?

Related

Alternative to successive String.replace

I want to replace some strings in a String input :
string=string.replace("<h1>","<big><big><big><b>");
string=string.replace("</h1>","</b></big></big></big>");
string=string.replace("<h2>","<big><big>");
string=string.replace("</h2>","</big></big>");
string=string.replace("<h3>","<big>");
string=string.replace("</h3>","</big>");
string=string.replace("<h4>","<b>");
string=string.replace("</h4>","</b>");
string=string.replace("<h5>","<small><b>");
string=string.replace("</h5>","</b><small>");
string=string.replace("<h6>","<small>");
string=string.replace("</h6>","</small>");
As you can see this approach is not the best, because each time I have to search for the portion to replace etc, and Strings are immutable... Also the input is large, which means that some performance issues are to be considered.
Is there any better approach to reduce the complexity of this code ?
Although StringBuilder.replace() is a huge improvement compared to String.replace(), it is still very far from being optimal.
The problem with StringBuilder.replace() is that if the replacement has different length than the replaceable part (applies to our case), a bigger internal char array might have to be allocated, and the content has to be copied, and then the replace will occur (which also involves copying).
Imagine this: you have a text with 10,000 characters. If you want to replace the "XY" substring found at position 1 (the 2nd character) with "ABC", the implementation may have to reallocate a char buffer which is at least larger by 1 and copy the old content to the new array; it then has to shift 9,997 characters (starting at position 3) to the right by 1 to fit "ABC" into the place of "XY", and finally copy the characters of "ABC" to the starting position 1. This has to be done for every replace! This is slow.
Faster Solution: Building Output On-The-Fly
We can build the output on-the-fly: parts that don't contain replaceable texts can simply be appended to the output, and if we find a replaceable fragment, we append the replacement instead of it. Theoretically it's enough to loop over the input only once to generate the output. Sounds simple, and it's not that hard to implement it.
Implementation:
We will use a Map preloaded with mappings of the replaceable-replacement strings:
Map<String, String> map = new HashMap<>();
map.put("<h1>", "<big><big><big><b>");
map.put("</h1>", "</b></big></big></big>");
map.put("<h2>", "<big><big>");
map.put("</h2>", "</big></big>");
map.put("<h3>", "<big>");
map.put("</h3>", "</big>");
map.put("<h4>", "<b>");
map.put("</h4>", "</b>");
map.put("<h5>", "<small><b>");
map.put("</h5>", "</b></small>");
map.put("<h6>", "<small>");
map.put("</h6>", "</small>");
And using this, here is the replacer code: (more explanation after the code)
public static String replaceTags(String src, Map<String, String> map) {
StringBuilder sb = new StringBuilder(src.length() + src.length() / 2);
for (int pos = 0;;) {
int ltIdx = src.indexOf('<', pos);
if (ltIdx < 0) {
// No more '<', we're done:
sb.append(src, pos, src.length());
return sb.toString();
}
sb.append(src, pos, ltIdx); // Copy chars before '<'
// Check if our hit is replaceable:
boolean mismatch = true;
for (Entry<String, String> e : map.entrySet()) {
String key = e.getKey();
if (src.regionMatches(ltIdx, key, 0, key.length())) {
// Match, append the replacement:
sb.append(e.getValue());
pos = ltIdx + key.length();
mismatch = false;
break;
}
}
if (mismatch) {
sb.append('<');
pos = ltIdx + 1;
}
}
}
Testing it:
String in = "Yo<h1>TITLE</h1><h3>Hi!</h3>Nice day.<h6>Hi back!</h6>End";
System.out.println(in);
System.out.println(replaceTags(in, map));
Output: (wrapped to avoid scroll bar)
Yo<h1>TITLE</h1><h3>Hi!</h3>Nice day.<h6>Hi back!</h6>End
Yo<big><big><big><b>TITLE</b></big></big></big><big>Hi!</big>Nice day.
<small>Hi back!</small>End
This solution is faster than using regular expressions, as that involves considerable overhead (compiling a Pattern, creating a Matcher, and so on), and regular expressions are also much more general than this task needs. Regex-based replacement also creates many temporary objects under the hood which are thrown away after the replace. Here I only use a StringBuilder (plus the char array under its hood) and the code iterates over the input String only once. This solution is also much faster than using StringBuilder.replace(), as detailed at the top of this answer.
Notes and Explanation
I initialized the StringBuilder in the replaceTags() method like this:
StringBuilder sb = new StringBuilder(src.length() + src.length() / 2);
So basically I created it with an initial capacity of 150% of the length of the original String. This is because our replacements are longer than the replaceable texts, so if replacing occurs, the output will obviously be longer than the input. Giving a larger initial capacity to StringBuilder will result in no internal char[] reallocation at all (of course the required initial capacity depends on the replaceable-replacement pairs and their frequency/occurrence in the input, but this +50% is a good upper estimation).
I also utilized the fact that all replaceable strings start with a '<' character, so finding the next potential replaceable position becomes blazing-fast:
int ltIdx = src.indexOf('<', pos);
It's just a simple loop and char comparisons inside String, and since it always starts searching from pos (and not from the start of the input), overall the code iterates over the input String only once.
And finally, to tell whether a replaceable String occurs at the potential position, we use the String.regionMatches() method to check the replaceable strings. This is also blazing-fast, as all it does is compare char values in a loop and return at the very first mismatching character.
And a PLUS:
The question doesn't mention it, but our input is an HTML document. HTML tags are case-insensitive which means the input might contain <H1> instead of <h1>.
To this algorithm this is not a problem. The regionMatches() in the String class has an overload which supports case-insensitive comparison:
boolean regionMatches(boolean ignoreCase, int toffset, String other,
int ooffset, int len);
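The two regionMatches overloads can be tried in isolation. This is only a small sketch; the class name RegionMatchDemo and the helper tagAt are hypothetical names introduced here, not part of the answer's code.

```java
class RegionMatchDemo {
    // Hypothetical helper: does 'key' occur in 'src' starting at index 'idx'?
    // No substring is created; characters are compared in place.
    static boolean tagAt(String src, int idx, String key, boolean ignoreCase) {
        return src.regionMatches(ignoreCase, idx, key, 0, key.length());
    }

    public static void main(String[] args) {
        String src = "Yo<H1>TITLE</H1>";
        int ltIdx = src.indexOf('<');                         // index 2
        System.out.println(tagAt(src, ltIdx, "<h1>", false)); // case-sensitive comparison
        System.out.println(tagAt(src, ltIdx, "<h1>", true));  // case-insensitive comparison
    }
}
```

For the sample input, the case-sensitive call reports no match for "<h1>" while the case-insensitive call does.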
So if we want to modify our algorithm to also find and replace input tags which are the same but are written using different letter case, all we have to modify is this one line:
if (src.regionMatches(true, ltIdx, key, 0, key.length())) {
Using this modified code, replaceable tags become case-insensitive:
Yo<H1>TITLE</H1><h3>Hi!</h3>Nice day.<H6>Hi back!</H6>End
Yo<big><big><big><b>TITLE</b></big></big></big><big>Hi!</big>Nice day.
<small>Hi back!</small>End
For performance - use StringBuilder.
For convenience you can use Map to store values and replacements.
Map<String, String> map = new HashMap<>();
map.put("<h1>","<big><big><big><b>");
map.put("</h1>","</b></big></big></big>");
map.put("<h2>","<big><big>");
...
StringBuilder builder = new StringBuilder(yourString);
for (String key : map.keySet()) {
replaceAll(builder, key, map.get(key));
}
... To replace all occurrences in a StringBuilder you can check here:
Replace all occurrences of a String using StringBuilder?
public static void replaceAll(StringBuilder builder, String from, String to)
{
int index = builder.indexOf(from);
while (index != -1)
{
builder.replace(index, index + from.length(), to);
index += to.length(); // Move to the end of the replacement
index = builder.indexOf(from, index);
}
}
Unfortunately StringBuilder doesn't provide a replace(string,string) method, so you might want to consider using Pattern and Matcher in conjunction with StringBuffer:
String input = ...;
StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile("</?(h1|h2|...)>");
Matcher m = p.matcher( input );
while( m.find() )
{
String match = m.group();
String replacement = ...; //get replacement for match, e.g. by lookup in a map
m.appendReplacement( sb, replacement );
}
m.appendTail( sb );
You could do something similar with StringBuilder but in that case you'd have to implement appendReplacement etc. yourself.
As for the expression you could also just try and match any html tag (although that might cause problems since regex and arbitrary html don't fit very well) and when the lookup doesn't have any result you just replace the match with itself.
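A minimal sketch of the idea above, filling in the lookup with a Map. The class name RegexTagReplacer, the simplified tag pattern (tags without attributes), and the two-entry map in main are assumptions for illustration only. Unknown tags are replaced with themselves, and Matcher.quoteReplacement keeps '$' and '\' in a replacement from being interpreted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTagReplacer {
    // Simplified assumption: an opening or closing tag with no attributes.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][A-Za-z0-9]*>");

    static String replaceTags(String input, Map<String, String> replacements) {
        Matcher m = TAG.matcher(input);
        StringBuffer sb = new StringBuffer(input.length());
        while (m.find()) {
            String match = m.group();
            // When the lookup has no result, the match is replaced with itself.
            String replacement = replacements.getOrDefault(match, match);
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("<h3>", "<big>");
        map.put("</h3>", "</big>");
        System.out.println(replaceTags("<h3>Hi!</h3><p>text</p>", map));
    }
}
```

The lookup here is case-sensitive; lower-casing the match before the lookup would be one way to accept <H3> as well.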
The particular example you provide seems to be HTML or XHTML. Trying to edit HTML or XML using regular expressions is fraught with problems. For the kind of editing you seem to be interested in doing, you should look at using XSLT. Another possibility is to use SAX, the streaming XML parser, and have your back-end write the edited output on the fly. If the text is actually HTML, you might be better off using a tolerant HTML parser, such as JSoup, to build a parsed representation of the document (like the DOM), and manipulate that before outputting it.
StringBuilder is backed by a char array. So, unlike String instances, it is mutable. Thus, you can call indexOf() and replace() on the StringBuilder.
I would do something like this
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
if (tagEquals(str, i, "h1")) {
sb.append("<big><big><big><b>");
i += 3; // skip the rest of "<h1>"; the loop's i++ consumes the final character
} else if (tagEquals(str, i, "/h1")) {
...
} else {
sb.append(str.charAt(i));
}
}
tagEquals is a helper function that checks whether the given tag name occurs at position i.
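The answer does not show tagEquals, so the version below is a hypothetical implementation, assuming it should report whether "<" + name + ">" starts at index i. The class name TagLoopDemo and the method convert are likewise made up for this sketch, which handles only h1 to keep it short.

```java
class TagLoopDemo {
    // Hypothetical helper: true if "<name>" starts at index i of str.
    // startsWith(prefix, offset) simply returns false for out-of-range offsets.
    static boolean tagEquals(String str, int i, String name) {
        return str.startsWith("<", i)
                && str.startsWith(name, i + 1)
                && str.startsWith(">", i + 1 + name.length());
    }

    static String convert(String str) {
        StringBuilder sb = new StringBuilder(str.length() * 2);
        for (int i = 0; i < str.length(); i++) {
            if (tagEquals(str, i, "h1")) {
                sb.append("<big><big><big><b>");
                i += "h1".length() + 1;  // jump to the closing '>'; the loop's i++ steps past it
            } else if (tagEquals(str, i, "/h1")) {
                sb.append("</b></big></big></big>");
                i += "/h1".length() + 1;
            } else {
                sb.append(str.charAt(i)); // ordinary character, copy as-is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("a<h1>T</h1>b"));
    }
}
```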
Use Apache Commons StringUtils.replaceEach.
String[] searches = new String[]{"<h1>", "</h1>", "<h2>", ...};
String[] replacements = new String[]{"<big><big><big><b>", "</b></big></big></big>", "<big><big>", ...};
string = StringUtils.replaceEach(string, searches, replacements);

How to properly use java Pattern object to match string patterns

I wrote code that does several string operations, including checking whether a given string matches a certain regular expression. It ran just fine with 70,000 inputs, but it started to give me an out-of-memory error when I ran it iteratively for five-fold cross-validation. It might just be that I have to assign more memory, but I have a feeling that I might have written inefficient code, so I wanted to double-check that I didn't make any obvious mistake.
static Pattern numberPattern = Pattern.compile("^[a-zA-Z]*([0-9]+).*");
public static boolean someMethod(String line) {
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++) {
tokens[i] = tokens[i].replace(",", "");
tokens[i] = tokens[i].replace(";", "");
if(numberPattern.matcher(tokens[i]).find()) return true;
}
return false;
}
and I have also many lines like below:
token.matches("[a-z]+[A-Z][a-z]+");
Which way is more memory efficient? Do they look efficient enough? Any advice is appreciated!
Edited:
Sorry, I posted the wrong code; I intended to modify it before posting this question but forgot at the last minute. The point is that I have many similar-looking operations all over the code, and aside from the fact that the example did not make sense, I wanted to know whether the regexp comparison part was efficient.
Thanks for all of your comments, I'll look through and modify the code following the advice!
Well, first of all, take a second look at your code: it will always return true! You never read the 'match' variable, you only assign to it.
Second, String is immutable, so each time you split or replace you create new instances. Why not create a pattern that makes the matches you want while ignoring the commas and semicolons? I'm not sure, but I think it will take less memory.
Yes, this code is indeed inefficient, because you can return immediately once you've found that match = true (there is no point in continuing to loop).
Further, are you sure you need to break the line into tokens? Why not check the regex only once?
And last, if all comparisons fail, you should return false (last line).
Instead of altering the text and splitting it you can put it all in the regex.
// the \\b means it must be the start of the String or a word
static Pattern numberPattern = Pattern.compile("\\b[a-zA-Z,;]*[0-9,;]*[0-9]");
// return true if the string contains
// a number which might have letters in front
public static boolean someMethod(String line) {
return numberPattern.matcher(line).find();
}
Aside from what @alfasin has mentioned in his answer, you should avoid duplicating code. Rewrite the following:
{
tokens[i] = tokens[i].replace(",", "");
tokens[i] = tokens[i].replace(";", "");
}
Into:
tokens[i] = tokens[i].replaceAll(",|;", "");
And please do this once, before calling .split(), so that the operation doesn't have to be repeated within the loop:
String[] tokens = line.replaceAll(",|;", "").split(" ");
                      ^^^^^^^^^^^^^^^^^^^^^^
Edit: After staring at your code for a bit I think I have a better solution, using regex ;)
public static boolean someMethod(String line) {
return Pattern.compile("\\b[a-zA-Z]*\\d")
.matcher(line.replaceAll(",|;", "")).find();
}
\b is a Word Boundary.
It asserts position at the Boundary of a word (Start of line + after spacing)
Code Demo STDOUT:
foo does not match
bar does not match
bar1 does match
foo baz bar bar1 lolz does match
password_01 does not match
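On the efficiency part of the question: String.matches and String.replaceAll compile their regex on every call, so code that runs them many times may benefit from patterns compiled once and reused. The sketch below applies that to the regex from this answer; the class name NumberTokenCheck and its method name are assumptions for illustration, and whether this resolves the out-of-memory error is not something the question lets us confirm.

```java
import java.util.regex.Pattern;

class NumberTokenCheck {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern PUNCTUATION = Pattern.compile("[,;]");
    private static final Pattern NUMBER_TOKEN = Pattern.compile("\\b[a-zA-Z]*\\d");

    // True if the line contains a word that is a digit, or letters followed by a digit,
    // after commas and semicolons have been removed.
    static boolean containsNumberToken(String line) {
        String cleaned = PUNCTUATION.matcher(line).replaceAll("");
        return NUMBER_TOKEN.matcher(cleaned).find();
    }

    public static void main(String[] args) {
        String[] samples = {"foo", "bar", "bar1", "foo baz bar bar1 lolz", "password_01"};
        for (String s : samples) {
            System.out.println(s + (containsNumberToken(s) ? " does match" : " does not match"));
        }
    }
}
```

The same idea applies to calls such as token.matches("[a-z]+[A-Z][a-z]+"): compile the pattern into a static field and call matcher(token).matches() on it.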

Making only the first letter of a word uppercase

I have a method that converts all the first letters of the words in a sentence into uppercase.
public static String toTitleCase(String s)
{
String result = "";
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++)
{
result += words[i].replace(words[i].charAt(0)+"", Character.toUpperCase(words[i].charAt(0))+"") + " ";
}
return result;
}
The problem is that the method also converts every other letter in a word that is the same as the first letter to uppercase. For example, the string "title" comes out as "TiTle".
For the input "this is a title", the output becomes "This Is A TiTle".
I've tried lots of things. A nested loop that checks every letter in each word, and if there is a recurrence, the second is ignored. I used counters, booleans, etc. Nothing works and I keep getting the same result.
What can I do? I only want the first letter in upper case.
Instead of using the replace() method, try replaceFirst().
result += words[i].replaceFirst(words[i].charAt(0)+"", Character.toUpperCase(words[i].charAt(0))+"") + " ";
Will output:
This Is A Title
The problem is that you are using the replace method, which replaces all occurrences of the given character. To solve this problem you can either:
1. use replaceFirst instead,
2. take the first letter, create its uppercase version, and concatenate it with the rest of the string, which can be obtained with the substring method, or
3. switch to a regex-based method such as replaceAll(String, String) (replace itself does not use regex) and add ^ before the character you want to replace, like replaceAll("^a", "A"). ^ means start of input, so it will only replace an a that is placed at the start of the input.
I would probably use the second approach.
Also, in each loop iteration your code currently creates a new StringBuilder with the data stored in result, appends the new word to it, and reassigns result to the output of toString().
This is an inefficient approach. Instead, you should create a StringBuilder before the loop to hold your result, append the words created inside the loop to it, and after the loop ends get its String version with the toString() method.
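A minimal sketch combining the substring approach with a single StringBuilder created before the loop. The class name TitleCase is an assumption; unlike the method in the question, this version does not leave a trailing space, and it skips empty words (which split(" ") produces for consecutive spaces) instead of failing on charAt(0).

```java
class TitleCase {
    static String toTitleCase(String s) {
        StringBuilder result = new StringBuilder(s.length());
        String[] words = s.split(" ");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                result.append(' ');                   // re-insert the separator between words
            }
            String word = words[i];
            if (!word.isEmpty()) {
                result.append(Character.toUpperCase(word.charAt(0))); // first letter only
                result.append(word, 1, word.length());                // rest of the word unchanged
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTitleCase("this is a title")); // prints "This Is A Title"
    }
}
```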
Doing some Regex-Magic can simplify your task:
public static void main(String[] args) {
final String test = "this is a Test";
final StringBuffer buffer = new StringBuffer(test);
final Pattern pattern = Pattern.compile("\\b(\\p{javaLowerCase})");
final Matcher matcher = pattern.matcher(buffer);
while (matcher.find()) {
buffer.replace(matcher.start(), matcher.end(), matcher.group().toUpperCase());
}
System.out.println(buffer);
}
The expression \\b(\\p{javaLowerCase}) matches "the beginning of a word followed by a lower-case letter". Because \\b is zero-width, matcher.group() (the whole match) is the same as what is inside the () here. Example: applying it to "test" matches "t", so start is 0, end is 1 and group is "t". This can easily run through even a huge amount of text and replace all those letters that need replacement.
In addition: it is always a good idea to use a StringBuffer (or similar) for String manipulation, because each String in Java is immutable. That is, if you do something like result += stringPart you actually create a new String (equal to result + stringPart) each time this is called. So if you do this with, say, 10 parts, you will in the end have created at least 10 different Strings, while you only need one, which is the final one.
StringBuffer instead uses something like a char[] internally, so if you change only a single character no extra memory needs to be allocated.
Note that a pattern only needs to be compiled once, so you can keep it as a class variable somewhere.
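A sketch of the same regex idea with the pattern kept as a class variable, as suggested. The class name RegexTitleCase and the method name are assumptions, and it uses appendReplacement/appendTail rather than replacing inside the buffer being matched, which is a swapped-in technique and not the code from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTitleCase {
    // Compiled once: a word boundary followed by a lower-case letter.
    private static final Pattern WORD_START = Pattern.compile("\\b\\p{javaLowerCase}");

    static String capitalizeWords(String text) {
        Matcher m = WORD_START.matcher(text);
        StringBuffer out = new StringBuffer(text.length());
        while (m.find()) {
            // The match is a single letter, so upper-casing the whole match is enough.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group().toUpperCase()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("this is a Test")); // prints "This Is A Test"
    }
}
```

One caveat of relying on \b: an apostrophe counts as a word boundary, so "don't" would come out as "Don'T".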

Regex not finding string

I am having issues with this code:
For some reason, it always fails to match the code.
for (int i = 0; i < pluginList.size(); i++) {
System.out.println("");
String findMe = "plugins".concat(FILE_SEPARATOR).concat(pluginList.get(i));
Pattern pattern = Pattern.compile("("+name.getPath()+")(.*)");
Matcher matcher = pattern.matcher(findMe);
// Check if the current plugin matches the string.
if (matcher.find()) {
return !pluginListMode;
}
}
All you really need is
return ("plugins"+FILE_SEPARATOR+pluginName).indexOf(name.getPath()) != -1;
But your code also makes no sense due to the fact that there's no way for that for-loop to enter a second iteration -- it returns unconditionally. So more probably you need something like this:
for (String pluginName : pluginList)
if (("plugins"+FILE_SEPARATOR+pluginName).indexOf(name.getPath()) != -1)
return false;
return true;
Right now we can only guess since we don't know what name.getPath() might return.
I suspect it fails because that string might contain characters that have special meaning inside regexes. Try it again with
Pattern pattern = Pattern.compile("("+Pattern.quote(name.getPath())+")(.*)");
and see what happens then.
Also the (.*) part (and even the parentheses around your name.getPath() result) don't appear to matter at all since you're not doing anything with the result of the match itself. At which point the question is why you're using a regex in the first place.
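To illustrate the suspicion about special characters, here is a small sketch with a made-up path. The class name PathMatchDemo and the sample strings are assumptions; the real value of name.getPath() is unknown. In the unquoted regex, "c++" is read as a possessive quantifier rather than as literal plus signs, so the literal text is never found.

```java
import java.util.regex.Pattern;

class PathMatchDemo {
    // Uses the path directly as a regex, as in the question.
    static boolean rawRegexFind(String path, String candidate) {
        return Pattern.compile("(" + path + ")(.*)").matcher(candidate).find();
    }

    // Pattern.quote makes every character of the path match literally.
    static boolean quotedRegexFind(String path, String candidate) {
        return Pattern.compile("(" + Pattern.quote(path) + ")(.*)").matcher(candidate).find();
    }

    // No regex at all: a plain substring test.
    static boolean plainContains(String path, String candidate) {
        return candidate.contains(path);
    }

    public static void main(String[] args) {
        String path = "plugins/c++tools";          // hypothetical plugin path
        String candidate = "plugins/c++tools.jar";
        System.out.println(rawRegexFind(path, candidate));    // false
        System.out.println(quotedRegexFind(path, candidate)); // true
        System.out.println(plainContains(path, candidate));   // true
    }
}
```

Since the match result is never used, the plain substring test is probably the simplest of the three.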

Why are the leading whitespaces not being removed when using trim()?

I need to process Java source files, and I am using String's trim() method to remove any leading and trailing whitespace. However, for code that is inside some scope, for example:
if(name.equals("joe")){
System.out.println(name);
}
the whitespace before the print statement is not being removed completely. Is there a way to remove this whitespace as well, please?
Thanks
EDIT: I did use a new variable:
String n = statements.get(i).toString().trim();
System.out.println(n);
however the output still looks like this:
System.out.println("NAME:" + m.getName());
BlockStmt bs = m.getBody();
List<Statement> statements = bs.getStmts();
for (int i = 0; i < statements.size(); i++) {
if ((statements.get(i).toString().trim().contains(needed)) & (statements.get(i).toString().trim().length() == needed.length())) {
System.out.println("HEREEEEEEEEEEEEEEEEEEEEEEEEEE");
}
}
Some of the strings still contain leading spaces.
You are mistaken. The String.trim() method does remove leading and trailing whitespace entirely.
However, I suspect that your real problem is that you don't know what this really means. Java strings are immutable, so trim() obviously doesn't modify the target String object. Instead, it returns a new String instance with the whitespace removed. So you need to use it as follows:
String trimmed = someString.trim();
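A short sketch of both points. The class name TrimDemo and the helper trimEachLine are hypothetical. The second half rests on an assumption the question does not confirm: that the statement text spans several lines, in which case trim() only touches the very start and end of the whole string and the indentation of inner lines survives.

```java
class TrimDemo {
    // Hypothetical helper: trims every line of a multi-line string separately.
    static String trimEachLine(String text) {
        String[] lines = text.split("\n");
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(lines[i].trim());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String padded = "  joe  ";
        padded.trim();                              // result discarded: 'padded' is unchanged
        String trimmed = padded.trim();             // the trimmed copy must be kept
        System.out.println("[" + padded + "] [" + trimmed + "]");

        String block = "  if (x) {\n    print();\n  }  ";
        System.out.println(block.trim());           // inner line keeps its indentation
        System.out.println(trimEachLine(block));    // every line starts at column 0
    }
}
```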
You have to assign the result of trim() back to the variable, because String objects are immutable.
name=name.trim();
if(name.equals("joe")){
System.out.println(name);
}
As @home mentioned:
if(name.equals("joe")){
String newName = name.trim();
System.out.println(newName);
}
Should work
EDIT: I guess that you want to use trim before the condition. My mistake.
String newName = name.trim();
if(newName.equals("joe")){
System.out.println(newName);
}
