What does replace do if no match is found? (under the hood) - java

I have very long strings that need to have a pattern removed if it appears. But it's an incredibly rare edge case for it to appear in the strings.
If I do this:
str = str.replace("pattern", "");
Then it looks like I'm creating a new string (because Java strings are immutable), which would be a waste if the original string was fine. Should I first check for a match, and then only replace if a match is found?

Short answer
Checking the documentation of various implementations, none seems to require the String.replace(CharSequence, CharSequence) method to return the same string if no match is found.
Without the requirement from the documentation, the implementation may or may not optimize the method in the case no match is found. It is best to write your code as if there is no optimization, to make sure that it runs correctly on any implementation or version of JRE.
In particular, when no match is found, Oracle's implementation (version 8-b123) returns the same String object, while GNU Classpath (version 0.95) returns a new String object regardless.
If you can find any clause in any of the documentation requiring String.replace(CharSequence, CharSequence) to return the same String object when no match is found, please leave a comment.
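A minimal sketch of that defensive approach, assuming the pattern is a plain literal. The class and method names (`RareReplace`, `removeIfPresent`) are hypothetical and not part of any library. The helper checks with `contains` first, so the original reference comes back on any implementation when there is nothing to remove:

```java
class RareReplace {
    /** Hypothetical helper: returns the same reference when target is absent. */
    static String removeIfPresent(String str, String target) {
        if (!str.contains(target)) {
            return str; // nothing to remove, so no new String is created here
        }
        return str.replace(target, "");
    }

    public static void main(String[] args) {
        String s = "a very long string";
        System.out.println(removeIfPresent(s, "pattern") == s);
        System.out.println(removeIfPresent("xxpatternyy", "pattern"));
    }
}
```

The trade-off is a second scan of the string in the rare case where the pattern does occur.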
Long answer
The long answer below shows that different implementations may or may not optimize the case where no match is found.
Let us look at Oracle's implementation and GNU Classpath's implementation of String.replace(CharSequence, CharSequence) method.
GNU Classpath
Note: This is correct as of the time of writing. While the link is not likely to change in the future, the content of the link is likely to change to a newer version of GNU Classpath and may go out of sync with the quoted content below. If the change affects the correctness, please leave a comment.
Let us look at GNU Classpath's implementation of String.replace(CharSequence, CharSequence) (version 0.95 quoted).
public String replace (CharSequence target, CharSequence replacement)
{
String targetString = target.toString();
String replaceString = replacement.toString();
int targetLength = target.length();
int replaceLength = replacement.length();
int startPos = this.indexOf(targetString);
StringBuilder result = new StringBuilder(this);
while (startPos != -1)
{
// Replace the target with the replacement
result.replace(startPos, startPos + targetLength, replaceString);
// Search for a new occurrence of the target
startPos = result.indexOf(targetString, startPos + replaceLength);
}
return result.toString();
}
Let us check the source code of StringBuilder.toString(). Since this decides the return value, if StringBuilder.toString() copies the buffer, then we don't need to further check any code above.
/**
* Convert this <code>StringBuilder</code> to a <code>String</code>. The
* String is composed of the characters currently in this StringBuilder. Note
* that the result is a copy, and that future modifications to this buffer
* do not affect the String.
*
* @return the characters in this StringBuilder
*/
public String toString()
{
return new String(this);
}
If the documentation doesn't manage to persuade you, just follow the String constructor. Eventually, the non-public constructor String(char[], int, int, boolean) is called, with the boolean dont_copy set to false, which means that the new String must copy the buffer.
public String(StringBuilder buffer)
{
    this(buffer.value, 0, buffer.count);
}

public String(char[] data, int offset, int count)
{
    this(data, offset, count, false);
}

/**
 * Special constructor which can share an array when safe to do so.
 *
 * @param data the characters to copy
 * @param offset the location to start from
 * @param count the number of characters to use
 * @param dont_copy true if the array is trusted, and need not be copied
 * @throws NullPointerException if chars is null
 * @throws StringIndexOutOfBoundsException if bounds check fails
 */
String(char[] data, int offset, int count, boolean dont_copy)
{
    if (offset < 0)
        throw new StringIndexOutOfBoundsException("offset: " + offset);
    if (count < 0)
        throw new StringIndexOutOfBoundsException("count: " + count);
    // equivalent to: offset + count < 0 || offset + count > data.length
    if (data.length - offset < count)
        throw new StringIndexOutOfBoundsException("offset + count: "
                                                  + (offset + count));
    if (dont_copy)
    {
        value = data;
        this.offset = offset;
    }
    else
    {
        value = new char[count];
        VMSystem.arraycopy(data, offset, value, 0, count);
        this.offset = 0;
    }
    this.count = count;
}
This evidence suggests that GNU Classpath's implementation of String.replace(CharSequence, CharSequence) does not return the same String object, even when no match is found.
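The effect of a copying toString() can be observed directly. The following is only a sketch with a hypothetical class name (`BuilderCopyDemo`): it round-trips a string through a StringBuilder, much as the quoted replace() does when there is no match, and then compares content and identity.

```java
class BuilderCopyDemo {
    /** Round-trips a string through StringBuilder without modifying it. */
    static String roundTrip(String s) {
        return new StringBuilder(s).toString();
    }

    public static void main(String[] args) {
        String a = "hello";
        String b = roundTrip(a);
        System.out.println(b.equals(a)); // content comparison
        System.out.println(b == a);      // identity comparison; false when toString() copies
    }
}
```

The StringBuilder.toString() documentation in the standard JDK also states that a new String object is allocated, so the identity comparison is expected to be false there as well.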
Oracle
In Oracle's implementation of String.replace(CharSequence, CharSequence) (version 8-b123 quoted), the method uses the Pattern class to do the replacement.
public String replace(CharSequence target, CharSequence replacement) {
return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
Matcher.replaceAll(String) calls toString() on the CharSequence and returns the result when no match is found:
public String replaceAll(String replacement) {
reset();
boolean result = find();
if (result) {
StringBuffer sb = new StringBuffer();
do {
appendReplacement(sb, replacement);
result = find();
} while (result);
appendTail(sb);
return sb.toString();
}
return text.toString();
}
String implements the CharSequence interface, and since the String passes itself into the Matcher, let us look at String.toString:
public String toString() {
return this;
}
From this, we can conclude that Oracle's implementation returns the same String when no match is found.

I have not found a definitive answer (from the docs), but I tried this out on Oracle's JRE7 and found that replace returned a reference to the same string.
Here is the code I used for testing:
public class NoReplace {
public static void main(String[]args) {
String a = "hello";
/* Test: replacement with no match */
String b = a.replace("X", "H");
/* a and b are still the same string? */
System.out.println(b == a); // true
/* Sanity: replacement WITH a match */
String c = a.replace("h", "H");
/* a and c are still the same string? */
System.out.println(c == a); // false
}
}
But I'd be interested in seeing some benchmarks of replace vs contains to know for sure if there's any advantage.

OK. In Java 8, this is what happens when you call myString.replace():
public String replace(CharSequence target, CharSequence replacement) {
return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
this)
The target String is compiled as a literal pattern, and matcher() is called on it with the calling String instance as its argument.
The matcher() method returns a new Matcher. Note that the text field of the matcher will be the current object (this), i.e. the String object on which replace() was called.
Next, in replaceAll() we have the following code :
boolean result = find();
i.e.,
public String replaceAll(String replacement) {
reset();
boolean result = find(); // returns false when there is no match
if (result) {
StringBuffer sb = new StringBuffer();
do {
appendReplacement(sb, replacement);
result = find();
} while (result);
appendTail(sb);
return sb.toString();
}
return text.toString(); // same String
}
If find() returns false, then matcher.text is returned, which is the original String.
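Because the target is compiled with Pattern.LITERAL and the replacement goes through Matcher.quoteReplacement, replace() treats both arguments as plain text. A small sketch of the difference from replaceAll(), using a hypothetical class name (`LiteralReplaceDemo`):

```java
class LiteralReplaceDemo {
    /** Literal replacement: "." means a dot. */
    static String literal(String s) {
        return s.replace(".", "-");
    }

    /** Regex replacement: "." means any character. */
    static String regex(String s) {
        return s.replaceAll(".", "-");
    }

    public static void main(String[] args) {
        System.out.println(literal("a.b.c"));         // a-b-c
        System.out.println(regex("a.b.c"));           // -----
        System.out.println("x".replace("x", "$1"));   // $1 (no group reference)
    }
}
```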

Related

Efficient way to match a regex Pattern on an OutputStream greater than max String limit

I am trying to find an efficient way to do a pattern match on a ByteArrayOutputStream whose size exceeds String's max size.
Doing a pattern match on a ByteArrayOutputStream that fits into a single String is trivial:
private boolean doesStreamContainPattern(Pattern pattern, ByteArrayOutputStream baos) throws IOException {
/*
* Append external source String to output stream...
*/
if (pattern != null) {
String out = new String(baos.toByteArray(), "UTF-8");
if (pattern.matcher(out).matches()) {
return true;
}
}
/*
* Some other processing if no pattern match
*/
return false;
}
But if the size of baos exceeds String max size, the problem turns into:
Feeding baos into multiple Strings.
"Sliding" the pattern matching over the concatenation of those multiple Strings (i.e. the original baos content).
Step 2 looks more challenging than Step 1, but I know that utilities like Unix sed do just that on a file.
What is the right way to accomplish that?
You can write a simple wrapper class to implement CharSequence from the stream:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteArrayCharSequence implements CharSequence {
    private final byte[] array;

    public ByteArrayCharSequence(byte[] input) {
        array = input;
    }

    public char charAt(int index) {
        // Treat each byte as one unsigned 8-bit character (ISO-8859-1 style)
        return (char) (array[index] & 0xFF);
    }

    public int length() {
        return array.length;
    }

    public CharSequence subSequence(int start, int end) {
        return new ByteArrayCharSequence(Arrays.copyOfRange(array, start, end));
    }

    public String toString() {
        // maybe test whether we exceeded max String length
        return new String(array, StandardCharsets.ISO_8859_1);
    }
}
and then match by
private boolean doesStreamContainPattern(Pattern pattern, ByteArrayOutputStream baos) throws IOException {
if (pattern != null) {
CharSequence seq = new ByteArrayCharSequence(baos.toByteArray());
if (pattern.matcher(seq).matches()) {
return true;
}
}
/*
* Some other processing if no pattern match
*/
return false;
}
It's obviously rough around the edges with the cast to char, and using copyOfRange, but it should work for most cases and you can adjust for those where it doesn't.
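One way to smooth the byte-to-char edge is to let the JDK decode the bytes. The sketch below assumes the stream holds UTF-8 text, as in the question; the class and method names (`DecodedMatchDemo`, `containsPattern`) are hypothetical. It relies on CharBuffer implementing CharSequence, and it uses find() to search anywhere in the input, whereas matches() requires the whole input to match. It still keeps all decoded characters in memory, so it does not lift any array size limit.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class DecodedMatchDemo {
    /** Decodes UTF-8 bytes and searches the result without building a String. */
    static boolean containsPattern(Pattern pattern, byte[] utf8Bytes) {
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8Bytes));
        return pattern.matcher(chars).find();
    }

    public static void main(String[] args) {
        byte[] data = "hello w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        System.out.println(containsPattern(Pattern.compile("w.rld"), data));
        System.out.println(containsPattern(Pattern.compile("xyz"), data));
    }
}
```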

Deleting A Specified Substring in Java

This is actually an exercise from CodingBat. The definition of the problem is as follows:
Given a string, if the string "del" appears starting at index 1, return a string where that "del" has been deleted. Otherwise, return the string unchanged.
delDel("adelbc") → "abc"
delDel("adelHello") → "aHello"
delDel("adedbc") → "adedbc"
My work is as follows:
public String delDel(String str) {
String del = "del";
if (str.indexOf(del, 1) == 1){
str.replaceFirst("del", null);
}
return str;
}
It works fine for most of the cases, but I get NullPointerException in "adelbc", "adelHello" and "adel" cases. I can't quite understand why.
If you look closely at the OpenJDK sources, you'll note that replaceFirst delegates the work to the regex classes, including this Matcher method for the replacing step:
public String replaceFirst(String replacement) {
if (replacement == null)
throw new NullPointerException("replacement");
reset();
if (!find())
return text.toString();
StringBuffer sb = new StringBuffer();
appendReplacement(sb, replacement);
appendTail(sb);
return sb.toString();
}
Note that replacement can not be null. I assume the behaviour is going to be similar in other implementations of the JRE. Please use "" - empty string - instead of null as the replacement.
Also as mentioned in the comments by cricket_007 you want to save the result of replaceFirst for returning, since the original string will not be affected (all Strings in Java are immutable). The final piece of code:
public String delDel(String str) {
String del = "del";
if (str.indexOf(del, 1) == 1){
return str.replaceFirst("del", "");
}
return str;
}
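Since the exercise only concerns a fixed substring at a fixed position, the regex machinery can also be avoided entirely. This is an alternative sketch rather than the answer's method; it uses String.startsWith(String, int) and substring, and the wrapper class name (`DelDel`) is only there to make it runnable:

```java
class DelDel {
    /** Removes "del" when it starts at index 1; otherwise returns str unchanged. */
    static String delDel(String str) {
        if (str.startsWith("del", 1)) {
            // keep the first character, skip the three characters of "del"
            return str.charAt(0) + str.substring(4);
        }
        return str;
    }

    public static void main(String[] args) {
        System.out.println(delDel("adelbc"));    // abc
        System.out.println(delDel("adelHello")); // aHello
        System.out.println(delDel("adedbc"));    // adedbc
    }
}
```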

String concatenation - Boolean hard-coded Vs Boolean Concatenation with String

I need advice (for both Java and .NET) on the following piece of code.
public void method(bool value)
{
String someString;
//some code
if (value)
{
//some code
...
someString = "one" + value;
}
else
{
//some code
...
someString = "two" + value;
}
}
Which one is advisable, and why: code like the above, or code like this?
someString = "onetrue";
someString = "twofalse";
After compilation and optimization by JDK, method will look like:
public static String method(boolean value) {
String someString;
if (value) {
StringBuilder sb = new StringBuilder();
sb.append("one");
sb.append(value);
someString = sb.toString();
} else {
StringBuilder sb = new StringBuilder();
sb.append("two");
sb.append(value);
someString = sb.toString();
}
return someString;
}
If this code is invoked very frequently, it could have a performance impact compared to the second version. In each branch a new StringBuilder is constructed and three methods are invoked on it (two append calls and toString()), and the boolean has to be converted to its text form at run time. In the second version we just return a constant. Everything depends on how often this code is called.
Neither will make any difference; it's purely a matter of style.
Since you have // some other code I'd just stick with the first. If you only had one line in each branch then either is ok.
At a high level they are both the same, but if you look at the lower levels, I would advise using this form:
someString = "onetrue";
someString = "twofalse";
This is because when you do "one" + value, value is actually a bool, and its toString() method has to be called before it can be added to the string. That is simply an extra step compared with specifying the final string directly.
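For the Java side, the two styles can be compared in a small sketch. The class and method names (`ConcatDemo`, `concatenated`, `hardCoded`) are hypothetical. Both styles produce equal strings; the difference is that the concatenated form builds its result at run time, while the hard-coded form returns a compile-time constant. How the concatenation is compiled depends on the compiler version (a StringBuilder chain in older javac releases, an invokedynamic call site in newer ones).

```java
class ConcatDemo {
    /** Builds the label at run time from a literal and the boolean. */
    static String concatenated(boolean value) {
        return value ? "one" + value : "two" + value;
    }

    /** Returns a compile-time constant for each branch. */
    static String hardCoded(boolean value) {
        return value ? "onetrue" : "twofalse";
    }

    public static void main(String[] args) {
        System.out.println(concatenated(true).equals(hardCoded(true)));   // true
        System.out.println(concatenated(false).equals(hardCoded(false))); // true
    }
}
```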

Front-popping a Java String

Having read the documentation of Java's String class, it doesn't appear to support popping from the front (which does make sense, since it's basically a char array). Is there an easy way to do something like
String firstLetter = someString.popFront();
which would remove the first character from the string and return it?
A String in Java is immutable, so you can't "remove" characters from it.
You can use substring to get parts of the String.
String firstLetter = someString.substring(0, 1);
someString = someString.substring(1);
You can easily implement this by using java.lang.StringBuilder's charAt() and deleteCharAt() methods. StringBuilder also implements a toString() method.
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/StringBuilder.html
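A minimal sketch of that StringBuilder approach; the class and method names (`FrontPop`, `popFront`) are hypothetical. Note that deleteCharAt(0) shifts the remaining characters, so repeated pops on a long buffer cost time proportional to its length.

```java
class FrontPop {
    /** Removes and returns the first character of the builder. */
    static char popFront(StringBuilder sb) {
        char first = sb.charAt(0); // throws IndexOutOfBoundsException if sb is empty
        sb.deleteCharAt(0);
        return first;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("Hello");
        System.out.println(popFront(sb)); // H
        System.out.println(sb);           // ello
    }
}
```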
I don't think there is something like that (not least because strings can't be changed, so a new one needs to be created), but you can use charAt and substring to implement your own.
An example of charAt:
String aString = "is this your homework Larry?";
char aChar = aString.charAt(0);
Then substring:
String anotherString = aString.substring(1, aString.length());
So you basically want to treat the String as a stack that pops from the front? For that you can use a LinkedList, which offers, among others, a pop() method that removes and returns the first element.
To get all characters of a String in a LinkedList, do so:
String string = "Hello World";
LinkedList<Character> chars = new LinkedList<Character>();
for (int i = 0; i < string.length(); i++) chars.add(string.charAt(i));
Then you can pop it as follows:
char c = chars.pop();
// ...
Update: I didn't see the comment that you'd like to be able to get the remaining characters back as a string. Well, your best bet is to create and implement your own StringStack or so. Here's a kickoff example:
public class StringStack {
private String string;
private int i;
public StringStack(String string) {
this.string = string;
}
public char pop() {
if (i >= string.length()) throw new IllegalStateException("Stack is empty");
return string.charAt(i++);
}
public String toString() {
if (i >= string.length()) throw new IllegalStateException("Stack is empty");
return string.substring(i, string.length());
}
}
You can use it as follows:
String string = "Hello World";
StringStack stack = new StringStack(string);
char c = stack.pop();
String remnant = stack.toString();
// ...
To make it more solid, you can eventually compose a LinkedList.
You should look at a StringReader. The read() method returns a single character.
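A sketch of the StringReader idea, with hypothetical names (`ReaderPopDemo`, `popFront`, `remaining`). read() returns an int, with -1 signalling the end of the input, and declares IOException, which is wrapped here to keep the helpers easy to call:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderPopDemo {
    /** Reads and returns the next character, failing if none is left. */
    static char popFront(StringReader reader) {
        try {
            int c = reader.read();
            if (c == -1) {
                throw new IllegalStateException("No characters left");
            }
            return (char) c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains whatever is left in the reader into a String. */
    static String remaining(StringReader reader) {
        try {
            StringBuilder rest = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                rest.append((char) c);
            }
            return rest.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringReader reader = new StringReader("Hello World");
        System.out.println(popFront(reader));  // H
        System.out.println(remaining(reader)); // ello World
    }
}
```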

How do I make my string comparison case-insensitive?

I created a Java program to compare two strings:
String str = "Hello";
if (str.equals("hello")) {
System.out.println("match");
} else {
System.out.println("no match");
}
It's case-sensitive. How can I change it so that it's not?
The best way is to use str.equalsIgnoreCase("foo"). It's optimized specifically for this purpose.
You can also convert both strings to upper- or lowercase before comparing them with equals. This is a trick that's useful to remember for other languages which might not have an equivalent of equalsIgnoreCase.
str.toUpperCase().equals(str2.toUpperCase())
If you are using a non-Roman alphabet, take note of this part of the JavaDoc of equalsIgnoreCase which says
Note that this method does not take locale into account, and will
result in unsatisfactory results for certain locales. The Collator
class provides locale-sensitive comparison.
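A minimal sketch of the Collator route mentioned in that JavaDoc note. The class and method names (`CollatorDemo`, `equalsIgnoringCase`) are hypothetical, and the choice of Collator.SECONDARY strength is an assumption: at that strength case differences are ignored while accent differences still count.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorDemo {
    /** Locale-sensitive comparison that ignores case differences. */
    static boolean equalsIgnoringCase(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.SECONDARY); // case is a tertiary difference
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(equalsIgnoringCase("Hello", "hello", Locale.ENGLISH));
        System.out.println(equalsIgnoringCase("Hello", "Help", Locale.ENGLISH));
    }
}
```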
Use String.equalsIgnoreCase().
Use the Java API reference to find answers like these:
https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#equalsIgnoreCase(java.lang.String)
https://docs.oracle.com/javase/1.5.0/docs/api/
String.equalsIgnoreCase is the most practical choice for naive case-insensitive string comparison.
However, it is good to be aware that this method performs neither full case folding nor decomposition, and so cannot perform caseless matching as specified in the Unicode standard. In fact, the JDK APIs do not provide access to information about case folding character data, so this job is best delegated to a tried and tested third-party library.
That library is ICU, and here is how one could implement a utility for case-insensitive string comparison:
import com.ibm.icu.text.Normalizer2;
// ...
public static boolean equalsIgnoreCase(CharSequence s, CharSequence t) {
Normalizer2 normalizer = Normalizer2.getNFKCCasefoldInstance();
return normalizer.normalize(s).equals(normalizer.normalize(t));
}
String brook = "flu\u0308ßchen";
String BROOK = "FLÜSSCHEN";
assert equalsIgnoreCase(brook, BROOK);
Naive comparison with String.equalsIgnoreCase, or String.equals on upper- or lowercased strings will fail even this simple test.
(Do note though that the predefined case folding flavour getNFKCCasefoldInstance is locale-independent; for Turkish locales a little more work involving UCharacter.foldCase may be necessary.)
You have to use the compareToIgnoreCase method of the String object.
int compareValue = str1.compareToIgnoreCase(str2);
if (compareValue == 0) it means str1 equals str2.
import java.lang.String; //contains equalsIgnoreCase()
/*
*
*/
String s1 = "Hello";
String s2 = "hello";
if (s1.equalsIgnoreCase(s2)) {
System.out.println("hai");
} else {
System.out.println("welcome");
}
Now it will output : hai
In the default Java API you have:
String.CASE_INSENSITIVE_ORDER
So you do not need to rewrite a comparator if you were to use strings with Sorted data structures.
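For instance, the comparator can be passed straight to a TreeMap so that keys differing only in case collide. This is a sketch with hypothetical names (`CaseInsensitiveMapDemo`, `newMap`):

```java
import java.util.Map;
import java.util.TreeMap;

class CaseInsensitiveMapDemo {
    /** Creates a sorted map whose String keys are compared ignoring case. */
    static Map<String, Integer> newMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = newMap();
        map.put("SomeKey", 1);
        map.put("SOMEKEY", 2); // replaces the value stored under "SomeKey"
        System.out.println(map.get("somekey")); // 2
        System.out.println(map.size());         // 1
    }
}
```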
String s = "some text here";
s.equalsIgnoreCase("Some text here");
Is what you want for pure equality checks in your own code.
Some further information on the equality of Strings in Java: the hashCode() function of the java.lang.String class is case-sensitive:
public int hashCode() {
int h = hash;
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h;
}
return h;
}
So if you want to use a Hashtable/HashMap with Strings as keys, and have keys like "SomeKey", "SOMEKEY" and "somekey" be seen as equal, then you will have to wrap your string in another class (you cannot extend String since it is a final class). For example:
private static class HashWrap {
private final String value;
private final int hash;
public String get() {
return value;
}
private HashWrap(String value) {
this.value = value;
String lc = value.toLowerCase();
this.hash = lc.hashCode();
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o instanceof HashWrap) {
HashWrap that = (HashWrap) o;
return value.equalsIgnoreCase(that.value);
} else {
return false;
}
}
@Override
public int hashCode() {
return this.hash;
}
}
and then use it as such:
HashMap<HashWrap, Object> map = new HashMap<HashWrap, Object>();
Note that you may want to do null checks on them as well prior to doing your .equals or .equalsIgnoreCase.
A null String object can not call an equals method.
ie:
public boolean areStringsSame(String str1, String str2)
{
if (str1 == null && str2 == null)
return true;
if (str1 == null || str2 == null)
return false;
return str1.equalsIgnoreCase(str2);
}
Use s1.equalsIgnoreCase(s2): https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#equalsIgnoreCase(java.lang.String).
You can use equalsIgnoreCase
More about string can be found in String Class and String Tutorials
To be nullsafe, you can use
org.apache.commons.lang.StringUtils.equalsIgnoreCase(String, String)
or
org.apache.commons.lang3.StringUtils.equalsIgnoreCase(CharSequence, CharSequence)
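public boolean newEquals(String str1, String str2)
{
    // Strings of different lengths can never be equal
    if (str1.length() != str2.length())
        return false;
    for (int i = 0; i < str1.length(); i++)
    {
        // Compare characters with case differences folded away
        char c1 = Character.toLowerCase(Character.toUpperCase(str1.charAt(i)));
        char c2 = Character.toLowerCase(Character.toUpperCase(str2.charAt(i)));
        if (c1 != c2)
            return false;
    }
    return true;
}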
