Efficiency of Very Long StringBuilders

Efficiency of Very Long StringBuilders - java

I am currently developing an app on android with java.
Obviously there are very many situations that I have to use lots of String manipulation. I prefer using the class SpannableStringBuilder since it has the most functions, very easy to append or add spans to the string.
However I was curious about the internal implementation of the class so I read parts of the code in the java file of SpannableStringBuilder. I was quite surprised when I saw that, when I use append() or replace() functions on the stringbuilder, it just creates a new char array with the resulting length and copies all the characters in the original char array from start to end to the new array with the changed part together.
Now I am worried, since what i am working on is quite similar to a memo pad and I use SpannableStringBuilder and append() every input the user types. But since it's a memopad the user might even type up to more than several tens of thousands of characters. In that case, the append() or replace() function has to copy a whole bunch of characters every time the user types something with a keyboard... I think this should have a huge impact on performance.
Is this something I should be worried about? are there other ways to make continuously appendable stringbuilders?

Related

Is there any scenario where character array is better than Strings in Java

I feel strings can replace character array in all the scenarios. Even considering the immutability characteristic of Strings, declaration of strings in appropriate scope and java's garbage collection feature should help us avoid any memory leaks. I want to know if there is any corner case where character array should be used instead of Strings in Java.

Character arrays have some slight advantage over plain strings when it comes to storing security sensitive data. There's a lot of resources on that, for example this question: Why is char[] preferred over String for passwords? (with an answer by Jon Skeet himself).
In general it boils down to two things:
You have very little influence on how long a String stays in memory. Because of that you might leak sensitive data through a memory dump.
Leaking sensitive data accidentally in application logs as clear text is much more likely with plain strings
More reading:
Why we read password from console in char array instead of String
https://www.codebyamir.com/blog/use-character-arrays-to-store-sensitive-data-java
https://www.geeksforgeeks.org/use-char-array-string-storing-passwords-java/amp/
https://www.baeldung.com/java-storing-passwords
https://javarevisited.blogspot.com/2012/03/why-character-array-is-better-than.html
https://javainsider.wordpress.com/2012/12/10/character-array-is-better-than-string-for-storing-password-in-java/amp/

String is a class, not a build in type. It most likely does what it does by using a char array underneath, but there is no guarantee. "We dont care how it is implemented". It has methods that make sense for strings, like comparing strings. Comparing arrays?? Hmm. Doesn't really make sense to do it. You could check if they are equal sure, but less or greater than...
Back in point. One scenario is you want to operate with chars, not a string. For example you have letters of the alphabet and want to sort them. Or grades in A-F system and you want to sort them. Generally where it makes sense having chars that are not connected to have some meaning together (like in a message string, or a text message). You would not generally need to sort the chars of a text message now, would you? So, you use an array.
To sort, you can take advantage of the Arrays.sort() method for example, while i dont think there is a method that does it for strings. Perhaps 3rd part libraries.
On another note(unrelated to question) , you can use StringBuilder to if you want to modify strings often. Its better at performace.

You don't have to look much further than at methods in the JDK core API that use char[].
Such as this one (java.io.Reader):
public int read(char[] cbuf)
throws IOException
Reads characters into an array. This method will block until some input is available, an I/O error occurs, or the end of the stream is reached.
Parameters:
cbuf - Destination buffer
Returns:
The number of characters read, or -1 if the end of the stream has been reached
Throws:
IOException - If an I/O error occurs
Instead of returning a String they ask you to pass in a char[] to use as a buffer to write the result into. The reason is efficiency.

You might be knowing String is immutable and how Substring can cause memory leak in Java.
Since Strings are immutable in Java if you store password as plain text it will be available in memory until Garbage collector clears it and since String are used in String pool for reusability there is pretty high chance that it will be remain in memory for long duration, which pose a security threat. Since any one who has access to memory dump can find the password in clear text. Since Strings are immutable there is no way contents of Strings can be changed because any change will produce new String, while if you char[] you can still set all his element as blank or zero. So Storing password in character array clearly mitigates security risk of stealing password.

using java to parse a csv then save in 2D array

Okay so i am working on a game based on a Trading card game in java. I Scraped all of the game peices' "information" into a csv file where each row is a game peice and each column is a type of attribute for that peice. I have spent hours upon hours writing code with Buffered reader and etc. trying to extract the information from my csv file into a 2d Array but to no avail. My csv file is linked Here: http://dl.dropbox.com/u/3625527/MonstersFinal.csv I have one year of computer science under my belt but I still cannot figure out how to do this.
So my main question is how do i place this into a 2D array that way i can keep the rows and columns?

Well, as mentioned before, some of your strings contain commas, so initially you're starting from a bad place, but I do have a solution and it's this:
--------- If possible, rescrape the site, but perform a simple encoding operation when you do. You'll want to do something like what you'll notice tends to be done in autogenerated XML files which contain HTML; reserve a 'control character' (a printable character works best, here, for reasons of debugging and... well... sanity) that, once encoded, is never meant to be read directly as an instance of itself. Ampersand is what I like to use because it's uncommon enough but still printable, but really what character you want to use is up to you. What I would do is write the program so that, at every instance of ",", that comma would be replaced by "&c" before being written to the CSV, and at every instance of an actual ampersand on the site, that "&" would be replaced by "&a". That way, you would never have the issue of accidentally separating a single value into two in the CSV, and you could simply decode each value after you've separated them by the method I'm about to outline in...
-------- Assuming you know how many columns will be in each row, you can use the StringTokenizer class (look it up- it's awesome and built into Java. A good place to look for information is, as always, the Java Tutorials) to automatically give you the values you need in the form of an array.
It works by your passing in a string and a delimiter (in this case, the delimiter would be ','), and it spitting out all the substrings which were separated by those commas. If you know how many pieces there are in total from the get-go, you can instantiate a 2D array at the beginning and just plug in each row the StringTokenizer gives them to you. If you don't, it's still okay, because you can use an ArrayList. An ArrayList is nice because it's a higher-level abstraction of an array that automatically asks for more memory such that you can continue adding to it and know that retrieval time will always be constant. However, if you plan on dynamically adding pieces, and doing that more often than retrieving them, you might want to use a LinkedList instead, because it has a linear retrieval time, but a much better relation than an ArrayList for add-remove time. Or, if you're awesome, you could use a SkipList instead. I don't know if they're implemented by default in Java, but they're awesome. Fair warning, though; the cost of speed on retrieval, removal, and placement comes with increased overhead in terms of memory. Skip lists maintain a lot of pointers.
If you know there should be the same number of values in each row, and you want them to be positionally organized, but for whatever reason your scraper doesn't handle the lack of a value for a row, and just doesn't put that value, you've some bad news... it would be easier to rewrite the part of the scraper code that deals with the lack of values than it would be to write a method that interprets varying length arrays and instantiates a Piece object for each array. My suggestion for this would again be to use the control character and fill empty columns with &n (for 'null') to be interpreted later, but then specifics are of course what will individuate your code and coding style so it's not for me to say.
edit: I think the main thing you should focus on is learning the different standard library datatypes available in Java, and maybe learn to implement some of them yourself for practice. I remember implementing a binary search tree- not an AVL tree, but alright. It's fun enough, good coding practice, and, more importantly, necessary if you want to be able to do things quickly and efficiently. I don't know exactly how Java implements arrays, because the definition is "a contiguous section of memory", yet you can allocate memory for them in Java at runtime using variables... but regardless of the specific Java implementation, arrays often aren't the best solution. Also, knowing regular expressions makes everything much easier. For practice, I'd recommend working them into your Java programs, or, if you don't want to have to compile and jar things every time, your bash scripts (if your using *nix) and/or batch scripts (if you're using Windows).

I think the way you've scraped the data makes this problem more difficult than it needs to be. Your scrape seems inconsistent and difficult to work with given that most values are surrounded by quotes inconsistently, some data already has commas in it, and not each card is on its own line.
Try re-scraping the data in a much more consistent format, such as:
R1C1|R1C2|R1C3|R1C4|R1C5|R1C6|R1C7|R1C8
R2C1|R2C2|R2C3|R2C4|R2C5|R2C6|R2C7|R3C8
R3C1|R3C2|R3C3|R3C4|R3C5|R3C6|R3C7|R3C8
R4C1|R4C2|R4C3|R4C4|R4C5|R4C6|R4C7|R4C8
A/D Changer|DREV-EN005|Effect Monster|Light|Warrior|100|100|You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
Where each line is definitely its own card (As opposed to the example CSV you posted with new lines in odd places) and the delimiter is never used in a data field as something other than a delimiter.
Once you've gotten the input into a consistently readable state, it becomes very simple to parse through it:
BufferedReader br = new BufferedReader(new FileReader(new File("MonstersFinal.csv")));
String line = "";
ArrayList<String[]> cardList = new ArrayList<String[]>(); // Use an arraylist because we might not know how many cards we need to parse.
while((line = br.readLine()) != null) { // Read a single line from the file until there are no more lines to read
StringTokenizer st = new StringTokenizer(line, "|"); // "|" is the delimiter of our input file.
String[] card = new String[8]; // Each card has 8 fields, so we need room for the 8 tokens.
for(int i = 0; i < 8; i++) { // For each token in the line that we've read:
String value = st.nextToken(); // Read the token
card[i] = value; // Place the token into the ith "column"
}
cardList.add(card); // Add the card's info to the list of cards.
}
for(int i = 0; i < cardList.size(); i++) {
for(int x = 0; x < cardList.get(i).length; x++) {
System.out.printf("card[%d][%d]: ", i, x);
System.out.println(cardList.get(i)[x]);
}
}
Which would produce the following output for my given example input:
card[0][0]: R1C1
card[0][1]: R1C2
card[0][2]: R1C3
card[0][3]: R1C4
card[0][4]: R1C5
card[0][5]: R1C6
card[0][6]: R1C7
card[0][7]: R1C8
card[1][0]: R2C1
card[1][1]: R2C2
card[1][2]: R2C3
card[1][3]: R2C4
card[1][4]: R2C5
card[1][5]: R2C6
card[1][6]: R2C7
card[1][7]: R3C8
card[2][0]: R3C1
card[2][1]: R3C2
card[2][2]: R3C3
card[2][3]: R3C4
card[2][4]: R3C5
card[2][5]: R3C6
card[2][6]: R3C7
card[2][7]: R4C8
card[3][0]: R4C1
card[3][1]: R4C2
card[3][2]: R4C3
card[3][3]: R4C4
card[3][4]: R4C5
card[3][5]: R4C6
card[3][6]: R4C7
card[3][7]: R4C8
card[4][0]: A/D Changer
card[4][1]: DREV-EN005
card[4][2]: Effect Monster
card[4][3]: Light
card[4][4]: Warrior
card[4][5]: 100
card[4][6]: 100
card[4][7]: You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
I hope re-scraping the information is an option here and I hope I haven't misunderstood anything; Good luck!
On a final note, don't forget to take advantage of OOP once you've gotten things worked out. a Card class could make working with the data even simpler.

I'm working on a similar problem for use in machine learning, so let me share what I've been able to do on the topic.
1) If you know before you start parsing the row - whether it's hard-coded into your program or whether you've got some header in your file that gives you this information (highly recommended) - how many attributes per row there will be, you can reasonably split it by comma, for example the first attribute will be RowString.substring(0, RowString.indexOf(',')), the second attribute will be the substring from the first comma to the next comma (writing a function to find the nth instance of a comma, or simply chopping off bits of the string as you go through it, should be fairly trivial), and the last attribute will be RowString.substring(RowString.lastIndexOf(','), RowString.length()). The String class's methods are your friends here.
2) If you are having trouble distinguishing between commas which are meant to separate values, and commas which are part of a string-formatted attribute, then (if the file is small enough to reformat by hand) do what Java does - represent characters with special meaning that are inside of strings with '\,' rather than just ','. That way you can search for the index of ',' and not '\,' so that you will have some way of distinguishing your characters.
3) As an alternative to 2), CSVs (in my opinion) aren't great for strings, which often include commas. There is no real common format to CSVs, so why not make them colon-separated-values, or dash-separated-values, or even triple-ampersand-separated-values? The point of separating values with commas is to make it easy to tell them apart, and if commas don't do the job there's no reason to keep them. Again, this applies only if your file is small enough to edit by hand.
4) Looking at your file for more than just the format, it becomes apparent that you can't do it by hand. Additionally, it would appear that some strings are surrounded by triple double quotes ("""string""") and some are surrounded by single double quotes ("string"). If I had to guess, I would say that anything included in a quotes is a single attribute - there are, for example, no pairs of quotes that start in one attribute and end in another. So I would say that you could:
Make a class with a method to break a string into each comma-separated fields.
Write that method such that it ignores commas preceded by an odd number of double quotes (this way, if the quote-pair hasn't been closed, it knows that it's inside a string and that the comma is not a value separator). This strategy, however, fails if the creator of your file did something like enclose some strings in double double quotes (""string""), so you may need a more comprehensive approach.

Algorithm to convert a double to a char array in Java (without using objects like Double.toString or StringBuilder)?

Does anyone know where I can find that algorithm? It takes a double and StringBuilder and appends the double to the StringBuilder without creating any objects or garbage. Of course I am not looking for:
sb.append(Double.toString(myDouble));
// or
sb.append(myDouble);
I tried poking around the Java source code (I am sure it does it somehow) but I could not see any block of code/logic clear enough to be re-used.

I have written this for ByteBuffer. You should be able to adapt it. Writing it to a direct ByteBuffer saves you having to convert it to bytes or copy it into "native" space.
See public ByteStringAppender append(double d)
If you are logging this to a file, you might use the whole library as it can write around 20 million doubles per second sustained. It can do this without system calls as it writes to a memory mapped file.

Why char[] performs better than String ?- Java

In reference to the link: File IO Tuning, last section titled "Further Tuning" where the author suggests using char[] to avoid generating String objects for n lines in the file, I need to understand how does
char[] arr = new char{'a','u','t','h', 'o', 'r'}
differ with
String s = "author"
in terms of memory consumption or any other performance factor? Isn't String object internally stored as a character array? I feel silly since I never thought of this before. :-)

In Oracle's JDK a String has four instance-level fields:
A character array
An integral offset
An integral character count
An integral hash value
That means that each String introduces an extra object reference (the String itself), and three integers in addition to the character array itself. (The offset and character count are there to allow sharing of the character array among String instances produced through the String#substring() methods, a design choice that some other Java library implementers have eschewed.) Beyond the extra storage cost, there's also one more level of access indirection, not to mention the bounds checking with which the String guards its character array.
If you can get away with allocating and consuming just the basic character array, there's space to be saved there. It's certainly not idiomatic to do so in Java though; judicious comments would be warranted to justify the choice, preferably with mention of evidence from having profiled the difference.

In the example you've referred to, it's because there's only a single character array being allocated for the whole loop. It's repeatedly reading into that same array, and processing it in place.
Compare that with using readLine which needs to create a new String instance on each iteration. Each String instance will contain a few int fields and a reference to a char[] containing the actual data - so it would need two new instances per iteration.
I'd usually expect the differences to be insignificant (with a decent GC throwing away unused "young" objects very efficiently) compared with the IO involved in reading the data - assuming it's from disk - but I believe that's the point the author was trying to make.

The author didn't get the reason right. The real overhead in in.readLine() is the copying a char[] buffer when making a String out of it. The additional copying is the most damning cost when dealing with large data.
It is possible to optimize this within JDK so that the additional copying is not needed.

Here are few reasons which makes sense to believe that character array is better choice in Java than String:
Say for Storing the Password
1) Since Strings are immutable in Java, if you store password as plain text it will be available in memory until Garbage collector clears it and since String are used in String pool for reusability there is pretty high chance that it will be remain in memory for long duration, which pose a security threat.
Since any one who has access to memory dump can find the password in clear text and that's another reason you should always used an encrypted password than plain text.
Since Strings are immutable there is no way contents of Strings can be changed because any change will produce new String, while if you char[] you can still set all his element as blank or zero. So Storing password in character array clearly mitigates security risk of stealing password.
2) Java itself recommends using getPassword() method of JPasswordField which returns a char[] and deprecated getText() method which returns password in clear text stating security reason. Its good to follow advice from Java team and adhering to standard rather than going against it.
3) With String there is always a risk of printing plain text in log file or console but if use Array you won't print contents of array instead its memory location get printed. though not a real reason but still make sense.
For this simple program
String strPassword="Unknown";
char[] charPassword= new char[]{'U','n','k','n','o','w','n'};
System.out.println("String password: " + strPassword);
System.out.println("Character password: " + charPassword);
Output:
String password: Unknown
Character password: [C#110b053
That's all on why character array is better choice than String for storing passwords in Java. Though using char[] is not just enough you need to erase content to be more secure.
Hope this will help.

My answer is going to focus on other stack questions along this similar line, others have already posted more direct answers.
There have been other questions similar to this, advice seems to go along the lines of using StringBuilder.
If you're concerned with string concentenation this have a look at the performance as described here between three different implementations. With another stack post which can give you some additional pointers and examples you could try yourself to see the performance.

Java's String.replace() vs. String.replaceFirst() vs. homebrew

I have a class that is doing a lot of text processing. For each string, which is anywhere from 100->2000 characters long, I am performing 30 different string replacements.
Example:
string modified;
for(int i = 0; i < num_strings; i++){
modified = runReplacements(strs[i]);
//do stuff
}
public runReplacements(String str){
str = str.replace("foo","bar");
str = str.replace("baz","beef");
....
return str;
}
'foo', 'baz', and all other "targets" are only expected to appear once and are string literals (no need for an actual regex).
As you can imagine, I am concerned about performance :)
Given this,
replaceFirst() seems a bad choice because it won't use Pattern.LITERAL and will do extra processing that isn't required.
replace() seems a bad choice because it will traverse the entire string looking for multiple instances to be replaced.
Additionally, since my replacement texts are the same everytime, it seems to make sense for me to write my own code otherwise String.replaceFirst() or String.replace() will be doing a Pattern.compile every single time in the background. Thinking that I should write my own code, this is my thought:
Perform a Pattern.compile() only once for each literal replacement desired (no need to recompile every single time) (i.e. p1 - p30)
Then do the following for each pX: p1.matcher(str).replaceFirst(Matcher.quoteReplacement("desiredReplacement"));
This way I abandon ship on the first replacement (instead of traversing the entire string), and I am using literal vs. regex, and I am not doing a re-compile every single iteration.
So, which is the best for performance?

So, which is the best for performance?
Measure it! ;-)
ETA: Since a two word answer sounds irretrievably snarky, I'll elaborate slightly. "Measure it and tell us..." since there may be some general rule of thumb about the performance of the various approaches you cite (good ones, all) but I'm not aware of it. And as a couple of the comments on this answer have mentioned, even so, the different approaches have a high likelihood of being swamped by the application environment. So, measure it in vivo and focus on this if it's a real issue. (And let us know how it goes...)

First, run and profile your entire application with a simple match/replace. This may show you that:
your application already runs fast enough, or
your application is spending most of its time doing something else, so optimizing the match/replace code is not worthwhile.
Assuming that you've determined that match/replace is a bottleneck, write yourself a little benchmarking application that allows you to test the performance and correctness of your candidate algorithms on representative input data. It's also a good idea to include "edge case" input data that is likely to cause problems; e.g. for the substitutions in your example, input data containing the sequence "bazoo" could be an edge case. On the performance side, make sure that you avoid the traps of Java micro-benchmarking; e.g. JVM warmup effects.
Next implement some simple alternatives and try them out. Is one of them good enough? Done!
In addition to your ideas, you could try concatenating the search terms into a single regex (e.g. "(foo|baz)" ), use Matcher.find(int) to find each occurrence, use a HashMap to lookup the replacement strings and a StringBuilder to build the output String from input string substrings and replacements. (OK, this is not entirely trivial, and it depends on Pattern/Matcher handling alternates efficiently ... which I'm not sure is the case. But that's why you should compare the candidates carefully.)
In the (IMO unlikely) event that a simple alternative doesn't cut it, this wikipedia page has some leads which may help you to implement your own efficient match/replacer.

Isn't if frustrating when you ask a question and get a bunch of advice telling you to do a whole lot of work and figure it out for yourself?!
I say use replaceAll();
(I have no idea if it is, indeed, the most efficient, I just don't want you to feel like you wasted your money on this question and got nothing.)
[edit]
PS. After that, you might want to measure it.
[edit 2]
PPS. (and tell us what you found)

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.