Can Java skip .toUpperCase() on literal string constants already in upper case?

Can Java skip .toUpperCase() on literal string constants already in upper case? - java

I have a .toUpperCase() happening in a tight loop and have profiled and shown it is impacting application performance. Annoying thing is it's being called on strings already in capital letters. I'm considering just dropping the call to .toUpperCase() but this makes my code less safe for future use.
This level of Java performance optimization is past my experience thus far. Is there any way to do a pre-compilation, set an annotation, etc. to skip the call to toUpperCase on already upper case strings?

What you need to do if you can is call .toUpperCase() on the string once, and store it so that when you go through the loop you won't have to do it each time.
I don't believe there is a pre-compilation situation - you can't know in advance what data the code will be handling. If anyone can correct me on this, it's be pretty awesome.
If you post some example code, I'd be able to help a lot more - it really depends on what kind of access you have to the data before you get to the loop. If your loop is actually doing the data access (e.g., reading from a file) and you don't have control over where those files come from, your hands are a lot more tied than if the data is hardcoded.
Any many cases there's an easy answer, but in some, there's not much you can do.
You can try equalsIgnoreCase, too. It doesn't make a new string.

No you cannot do this using an annotation or pre-compilation because your input is given during the runtime and the annotation and pre-compilation are compile time constructions.
If you would have known the input in advance then you could simply convert it to uppercase before running the application, i.e. before you compile your application.
Note that there are many ways to optimize string handling but without more information we cannot give you any tailor made solution.

You can write a simple function isUpperCase(String) and call it before calling toUpperCase():
if (!isUpperCase(s)) {
s = s.toUpperCase()
}
It might be not significantly faster but at least this way less garbage will be created. If a majority of the strings in your input are already upper case this is very valid optimization.
isUpperCase function will look roughly like this:
boolean isUpperCase(String s) {
for (int i = 0; i < s.length; i++) {
if (Character.isLowerCase(s.charAt(i)) {
return false;
}
}
return true;
}

you need to do an if statement that conditions those letters out of it. the ideas good just have a condition. Then work with ascii codes so convert it using (int) then find the ascii numbers for uppercase which i have no idea what it is, and then continue saying if ascii whatever is true then ignore this section or if its for specific letters in a line then ignore it for charAt(i)
sorry its a rough explanation

Related

Count how many list entries have a string property that ends with a particular char

I have an array list with some names inside it (first and last names). What I have to do is go through each "first name" and see how many times a character (which the user specifies) shows up at the end of every first name in the array list, and then print out the number of times that character showed up.
public int countFirstName(char c) {
int i = 0;
for (Name n : list) {
if (n.getFirstName().length() - 1 == c) {
i++;
}
}
return i;
}
That is the code I have. The problem is that the counter (i) doesn't add 1 even if there is a character that matches the end of the first name.

You're comparing the index of last character in the string to the required character, instead of the last character itself, which you can access with charAt:
String firstName = n.getFirstName()
if (firstName.charAt(firstName.length() - 1) == c) {
i++;
}

When you're setting out learning to code, there is a great value in using pencil and paper, or describing your algorithm ahead of time, in the language you think in. Most people that learn a foreign language start out by assembling a sentence in their native language, translating it to foreign, then speaking the foreign. Few, if any, learners of a foreign language are able to think in it natively
Coding is no different; all your life you've been speaking English and thinking in it. Now you're aiming to learn a different pattern of thinking, syntax, key words. This task will go a lot easier if you:
work out in high level natural language what you want to do first
write down the steps in clear and simple language, like a recipe
don't try to do too much at once
Had I been a tutor marking your program, id have been looking for something like this:
//method to count the number of list entries ending with a particular character
public int countFirstNamesEndingWith(char lookFor) {
//declare a variable to hold the count
int cnt = 0;
//iterate the list
for (Name n : list) {
//get the first name
String fn = n.getFirstName();
//get the last char of it
char lc = fn.charAt(fn.length() - 1);
//compare
if (lc == lookFor) {
cnt++;
}
}
return cnt;
}
Taking the bullet points in turn:
The comments serve as a high level description of what must be done. We write them aLL first, before even writing a single line of code. My course penalised uncommented code, and writing them first was a handy way of getting the requirement out of the way (they're a chore, right? Not always, but..) but also it is really easy to write a logic algorithm in high level language, then translate the steps into the language learning. I definitely think if you'd taken this approach you wouldn't have made the error you did, as it would have been clear that the code you wrote didn't implement the algorithm you'd have described earlier
Don't try to do too much in one line. Yes, I'm sure plenty of coders think it looks cool, or trick, or shows off what impressive coding smarts they have to pack a good 10 line algorithm into a single line of code that uses some obscure language features but one day it's highly likely that someone else is going to have to come along to maintain that code, improve it or change part of what it does - at that moment it's no longer cool, and it was never really a smart thing to do
Aominee, in their comment, actually gives us something like an example of this:
return (int)list.stream().filter(e -> e.charAt.length()-1)==c).count();
It's a one line implementation of a solution to your problem. Cool huh? Well, it has a bug* (for a start) but it's not the main thrust of my argument. At a more basic level: have you got any idea what it's doing? can you look at it and in 2 seconds tell me how it works?
It's quite an advanced language feature, it's trick for sure, but it might be a very poor solution because it's hard to understand, hard to maintain as a result, and does a lot while looking like a little- it only really makes sense if you're well versed in the language. This one line bundles up a facility that loops over your list, a feature that effectively has a tiny sub method that is called for every item in the list, and whose job is to calculate if the name ends with the sought char
It p's a brilliant feature, a cute example and it surely has its place in production java, but it's place is probably not here, in your learning exercise
Similarly, I'd go as far to say that this line of yours:
if (n.getFirstName().length() - 1 == c) {
Is approaching "doing too much" - I say this because it's where your logic broke down; you didn't write enough code to effectively implement the algorithm. You'd actually have to write even more code to implement this way:
if (n.getFirstName().charAt(n.getFirstName().length() - 1) == c) {
This is a right eyeful to load into your brain and understand. The accepted answer broke it down a bit by first getting the name into a temporary variable. That's a sensible optimisation. I broke it out another step by getting the last char into a temp variable. In a production system I probably wouldn't go that far, but this is your learning phase - try to minimise the number of operations each of your lines does. It will aid your understanding of your own code a great deal
If you do ever get a penchant for writing as much code as possible in as few chars, look at some code golf games here on the stack exchange network; the game is to abuse as many language features as possible to make really short, trick code.. pretty much every winner stands as a testament to condense that should never, ever be put into a production system maintained by normal coders who value their sanity
*the bug is it doesn't get the first name out of the Name object

checking a value for reset value before resetting it - performance impact?

I have a variable that gets read and updated thousands of times a second. It needs to be reset regularly. But "half" the time, the value is already the reset value. Is it a good idea to check the value first (to see if it needs resetting) before resetting (a write operaion), or I should just reset it regardless? The main goal is to optimize the code for performance.
To illustrate:
Random r = new Random();
int val = Integer.MAX_VALUE;
for (int i=0; i<100000000; i++) {
if (i % 2 == 0)
val = Integer.MAX_VALUE;
else
val = r.nextInt();
if (val != Integer.MAX_VALUE) //skip check?
val = Integer.MAX_VALUE;
}
I tried to use the above program to test the 2 scenarios (by un/commenting the 2nd "if" line), but any difference is masked by the natural variance of the run duration time.
Thanks.

Don't check it.
It's more execution steps = more cycles = more time.
As an aside, you are breaking one of the basic software golden rules: "Don't optimise early". Unless you have hard evidence that this piece if code is a performance problem, you shouldn't be looking at it. (Note that doesn't mean you code without performance in mind, you still follow normal best practice, but you don't add any special code whose only purpose is "performance related")

The check has no actual performance impact. We'd be talking about a single clock cycle or something, which is usually not relevant in a Java program (as hard-core number crunching usually isn't done in Java).
Instead, base the decision on readability. Think of the maintainer who's going to change this piece of code five years on.
In the case of your example, using my rationale, I would skip the check.

Most likely the JIT will optimise the code away because it doesn't do anything.
Rather than worrying about performance, it is usually better to worry about what it
simpler to understand
cleaner to implement
In both cases, you might remove the code as it doesn't do anything useful and it could make the code faster as well.
Even if it did make the code a little slower it would be very small compared to the cost of calling r.nextInt() which is not cheap.

using java to parse a csv then save in 2D array

Okay so i am working on a game based on a Trading card game in java. I Scraped all of the game peices' "information" into a csv file where each row is a game peice and each column is a type of attribute for that peice. I have spent hours upon hours writing code with Buffered reader and etc. trying to extract the information from my csv file into a 2d Array but to no avail. My csv file is linked Here: http://dl.dropbox.com/u/3625527/MonstersFinal.csv I have one year of computer science under my belt but I still cannot figure out how to do this.
So my main question is how do i place this into a 2D array that way i can keep the rows and columns?

Well, as mentioned before, some of your strings contain commas, so initially you're starting from a bad place, but I do have a solution and it's this:
--------- If possible, rescrape the site, but perform a simple encoding operation when you do. You'll want to do something like what you'll notice tends to be done in autogenerated XML files which contain HTML; reserve a 'control character' (a printable character works best, here, for reasons of debugging and... well... sanity) that, once encoded, is never meant to be read directly as an instance of itself. Ampersand is what I like to use because it's uncommon enough but still printable, but really what character you want to use is up to you. What I would do is write the program so that, at every instance of ",", that comma would be replaced by "&c" before being written to the CSV, and at every instance of an actual ampersand on the site, that "&" would be replaced by "&a". That way, you would never have the issue of accidentally separating a single value into two in the CSV, and you could simply decode each value after you've separated them by the method I'm about to outline in...
-------- Assuming you know how many columns will be in each row, you can use the StringTokenizer class (look it up- it's awesome and built into Java. A good place to look for information is, as always, the Java Tutorials) to automatically give you the values you need in the form of an array.
It works by your passing in a string and a delimiter (in this case, the delimiter would be ','), and it spitting out all the substrings which were separated by those commas. If you know how many pieces there are in total from the get-go, you can instantiate a 2D array at the beginning and just plug in each row the StringTokenizer gives them to you. If you don't, it's still okay, because you can use an ArrayList. An ArrayList is nice because it's a higher-level abstraction of an array that automatically asks for more memory such that you can continue adding to it and know that retrieval time will always be constant. However, if you plan on dynamically adding pieces, and doing that more often than retrieving them, you might want to use a LinkedList instead, because it has a linear retrieval time, but a much better relation than an ArrayList for add-remove time. Or, if you're awesome, you could use a SkipList instead. I don't know if they're implemented by default in Java, but they're awesome. Fair warning, though; the cost of speed on retrieval, removal, and placement comes with increased overhead in terms of memory. Skip lists maintain a lot of pointers.
If you know there should be the same number of values in each row, and you want them to be positionally organized, but for whatever reason your scraper doesn't handle the lack of a value for a row, and just doesn't put that value, you've some bad news... it would be easier to rewrite the part of the scraper code that deals with the lack of values than it would be to write a method that interprets varying length arrays and instantiates a Piece object for each array. My suggestion for this would again be to use the control character and fill empty columns with &n (for 'null') to be interpreted later, but then specifics are of course what will individuate your code and coding style so it's not for me to say.
edit: I think the main thing you should focus on is learning the different standard library datatypes available in Java, and maybe learn to implement some of them yourself for practice. I remember implementing a binary search tree- not an AVL tree, but alright. It's fun enough, good coding practice, and, more importantly, necessary if you want to be able to do things quickly and efficiently. I don't know exactly how Java implements arrays, because the definition is "a contiguous section of memory", yet you can allocate memory for them in Java at runtime using variables... but regardless of the specific Java implementation, arrays often aren't the best solution. Also, knowing regular expressions makes everything much easier. For practice, I'd recommend working them into your Java programs, or, if you don't want to have to compile and jar things every time, your bash scripts (if your using *nix) and/or batch scripts (if you're using Windows).

I think the way you've scraped the data makes this problem more difficult than it needs to be. Your scrape seems inconsistent and difficult to work with given that most values are surrounded by quotes inconsistently, some data already has commas in it, and not each card is on its own line.
Try re-scraping the data in a much more consistent format, such as:
R1C1|R1C2|R1C3|R1C4|R1C5|R1C6|R1C7|R1C8
R2C1|R2C2|R2C3|R2C4|R2C5|R2C6|R2C7|R3C8
R3C1|R3C2|R3C3|R3C4|R3C5|R3C6|R3C7|R3C8
R4C1|R4C2|R4C3|R4C4|R4C5|R4C6|R4C7|R4C8
A/D Changer|DREV-EN005|Effect Monster|Light|Warrior|100|100|You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
Where each line is definitely its own card (As opposed to the example CSV you posted with new lines in odd places) and the delimiter is never used in a data field as something other than a delimiter.
Once you've gotten the input into a consistently readable state, it becomes very simple to parse through it:
BufferedReader br = new BufferedReader(new FileReader(new File("MonstersFinal.csv")));
String line = "";
ArrayList<String[]> cardList = new ArrayList<String[]>(); // Use an arraylist because we might not know how many cards we need to parse.
while((line = br.readLine()) != null) { // Read a single line from the file until there are no more lines to read
StringTokenizer st = new StringTokenizer(line, "|"); // "|" is the delimiter of our input file.
String[] card = new String[8]; // Each card has 8 fields, so we need room for the 8 tokens.
for(int i = 0; i < 8; i++) { // For each token in the line that we've read:
String value = st.nextToken(); // Read the token
card[i] = value; // Place the token into the ith "column"
}
cardList.add(card); // Add the card's info to the list of cards.
}
for(int i = 0; i < cardList.size(); i++) {
for(int x = 0; x < cardList.get(i).length; x++) {
System.out.printf("card[%d][%d]: ", i, x);
System.out.println(cardList.get(i)[x]);
}
}
Which would produce the following output for my given example input:
card[0][0]: R1C1
card[0][1]: R1C2
card[0][2]: R1C3
card[0][3]: R1C4
card[0][4]: R1C5
card[0][5]: R1C6
card[0][6]: R1C7
card[0][7]: R1C8
card[1][0]: R2C1
card[1][1]: R2C2
card[1][2]: R2C3
card[1][3]: R2C4
card[1][4]: R2C5
card[1][5]: R2C6
card[1][6]: R2C7
card[1][7]: R3C8
card[2][0]: R3C1
card[2][1]: R3C2
card[2][2]: R3C3
card[2][3]: R3C4
card[2][4]: R3C5
card[2][5]: R3C6
card[2][6]: R3C7
card[2][7]: R4C8
card[3][0]: R4C1
card[3][1]: R4C2
card[3][2]: R4C3
card[3][3]: R4C4
card[3][4]: R4C5
card[3][5]: R4C6
card[3][6]: R4C7
card[3][7]: R4C8
card[4][0]: A/D Changer
card[4][1]: DREV-EN005
card[4][2]: Effect Monster
card[4][3]: Light
card[4][4]: Warrior
card[4][5]: 100
card[4][6]: 100
card[4][7]: You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
I hope re-scraping the information is an option here and I hope I haven't misunderstood anything; Good luck!
On a final note, don't forget to take advantage of OOP once you've gotten things worked out. a Card class could make working with the data even simpler.

I'm working on a similar problem for use in machine learning, so let me share what I've been able to do on the topic.
1) If you know before you start parsing the row - whether it's hard-coded into your program or whether you've got some header in your file that gives you this information (highly recommended) - how many attributes per row there will be, you can reasonably split it by comma, for example the first attribute will be RowString.substring(0, RowString.indexOf(',')), the second attribute will be the substring from the first comma to the next comma (writing a function to find the nth instance of a comma, or simply chopping off bits of the string as you go through it, should be fairly trivial), and the last attribute will be RowString.substring(RowString.lastIndexOf(','), RowString.length()). The String class's methods are your friends here.
2) If you are having trouble distinguishing between commas which are meant to separate values, and commas which are part of a string-formatted attribute, then (if the file is small enough to reformat by hand) do what Java does - represent characters with special meaning that are inside of strings with '\,' rather than just ','. That way you can search for the index of ',' and not '\,' so that you will have some way of distinguishing your characters.
3) As an alternative to 2), CSVs (in my opinion) aren't great for strings, which often include commas. There is no real common format to CSVs, so why not make them colon-separated-values, or dash-separated-values, or even triple-ampersand-separated-values? The point of separating values with commas is to make it easy to tell them apart, and if commas don't do the job there's no reason to keep them. Again, this applies only if your file is small enough to edit by hand.
4) Looking at your file for more than just the format, it becomes apparent that you can't do it by hand. Additionally, it would appear that some strings are surrounded by triple double quotes ("""string""") and some are surrounded by single double quotes ("string"). If I had to guess, I would say that anything included in a quotes is a single attribute - there are, for example, no pairs of quotes that start in one attribute and end in another. So I would say that you could:
Make a class with a method to break a string into each comma-separated fields.
Write that method such that it ignores commas preceded by an odd number of double quotes (this way, if the quote-pair hasn't been closed, it knows that it's inside a string and that the comma is not a value separator). This strategy, however, fails if the creator of your file did something like enclose some strings in double double quotes (""string""), so you may need a more comprehensive approach.

Java's String.replace() vs. String.replaceFirst() vs. homebrew

I have a class that is doing a lot of text processing. For each string, which is anywhere from 100->2000 characters long, I am performing 30 different string replacements.
Example:
string modified;
for(int i = 0; i < num_strings; i++){
modified = runReplacements(strs[i]);
//do stuff
}
public runReplacements(String str){
str = str.replace("foo","bar");
str = str.replace("baz","beef");
....
return str;
}
'foo', 'baz', and all other "targets" are only expected to appear once and are string literals (no need for an actual regex).
As you can imagine, I am concerned about performance :)
Given this,
replaceFirst() seems a bad choice because it won't use Pattern.LITERAL and will do extra processing that isn't required.
replace() seems a bad choice because it will traverse the entire string looking for multiple instances to be replaced.
Additionally, since my replacement texts are the same everytime, it seems to make sense for me to write my own code otherwise String.replaceFirst() or String.replace() will be doing a Pattern.compile every single time in the background. Thinking that I should write my own code, this is my thought:
Perform a Pattern.compile() only once for each literal replacement desired (no need to recompile every single time) (i.e. p1 - p30)
Then do the following for each pX: p1.matcher(str).replaceFirst(Matcher.quoteReplacement("desiredReplacement"));
This way I abandon ship on the first replacement (instead of traversing the entire string), and I am using literal vs. regex, and I am not doing a re-compile every single iteration.
So, which is the best for performance?

So, which is the best for performance?
Measure it! ;-)
ETA: Since a two word answer sounds irretrievably snarky, I'll elaborate slightly. "Measure it and tell us..." since there may be some general rule of thumb about the performance of the various approaches you cite (good ones, all) but I'm not aware of it. And as a couple of the comments on this answer have mentioned, even so, the different approaches have a high likelihood of being swamped by the application environment. So, measure it in vivo and focus on this if it's a real issue. (And let us know how it goes...)

First, run and profile your entire application with a simple match/replace. This may show you that:
your application already runs fast enough, or
your application is spending most of its time doing something else, so optimizing the match/replace code is not worthwhile.
Assuming that you've determined that match/replace is a bottleneck, write yourself a little benchmarking application that allows you to test the performance and correctness of your candidate algorithms on representative input data. It's also a good idea to include "edge case" input data that is likely to cause problems; e.g. for the substitutions in your example, input data containing the sequence "bazoo" could be an edge case. On the performance side, make sure that you avoid the traps of Java micro-benchmarking; e.g. JVM warmup effects.
Next implement some simple alternatives and try them out. Is one of them good enough? Done!
In addition to your ideas, you could try concatenating the search terms into a single regex (e.g. "(foo|baz)" ), use Matcher.find(int) to find each occurrence, use a HashMap to lookup the replacement strings and a StringBuilder to build the output String from input string substrings and replacements. (OK, this is not entirely trivial, and it depends on Pattern/Matcher handling alternates efficiently ... which I'm not sure is the case. But that's why you should compare the candidates carefully.)
In the (IMO unlikely) event that a simple alternative doesn't cut it, this wikipedia page has some leads which may help you to implement your own efficient match/replacer.

Isn't if frustrating when you ask a question and get a bunch of advice telling you to do a whole lot of work and figure it out for yourself?!
I say use replaceAll();
(I have no idea if it is, indeed, the most efficient, I just don't want you to feel like you wasted your money on this question and got nothing.)
[edit]
PS. After that, you might want to measure it.
[edit 2]
PPS. (and tell us what you found)

Is "startsWith" faster than "indexOf"?

I am writing code in Java where I branch off based on whether a string starts with certain characters while looping through a dataset and my dataset is expected to be large.
I was wondering whether startsWith is faster than indexOf. I did experiment with 2000 records but not found any difference.

startsWith only needs to check for the presence at the very start of the string - it's doing less work, so it should be faster.
My guess is that your 2000 records finished in a few milliseconds (if that). Whenever you want to benchmark one approach against another, try to do it for enough time that differences in timing will be significant. I find that 10-30 seconds is long enough to show significant improvements, but short enough to make it bearable to run the tests multiple times. (If this were a serious investigation I'd probably try for longer times. Most of my benchmarking is for fun.)
Also make sure you've got varied data - indexOf and startsWith should have roughly the same running time in the case where indexOf returns 0. So if all your records match the pattern, you're not really testing correctly. (I don't know whether that was the case in your tests of course - it's just something to watch out for.)

In general, the golden rule of micro-optimization applies here:
"Measure, don't guess".
As with all optimizations of this type, the difference between the two calls almost certainly won't matter unless you are checking millions of strings that are each tens of thousands of characters long.
Run a profiler over your code, and only optimize this call when you can measure that it's slowing you down. Till then, go with the more readable options (startsWith, in this case). Once you know that this block is slowing you down, try both and use whichever is faster. Rinse. Repeat ;-)
Academically, my guess is that startsWith will likely be implemented using indexOf. Check the source code and see if you're interested. (Turns out that startsWith does not call indexOf)

Even without looking into the sources, it should be clear that startsWith() is faster at least for large strings and short pattern.
The running time of a.startsWith(b) is bound be the length of b. After at most the first b characters are checked, the search finished.
The running time of a.indexOf(b) is larger (depending on the actual algorithm). Every algorithm has at least a running time depending on the length of a. Roughly, you can say, that you have to look at each character once to check if the pattern starts at that position.
However, as always, it depends on the actual use case if you really see a difference in practice. Measuring the difference in real life is never bad.

Probably, if it doesn't match it can stop looking whereas indexOf needs to look for occurrences later in the string.

startsWith is clearer than indexOf == 0.
Have you identified the test as a performance bottleneck for which you need to sacrifice readability?

public class Test
{
public static void main(String args[]) {
long value1 = System.currentTimeMillis();
for(long i=0;i<100000000;i++)
{
"abcd".indexOf("a");
}
long value2 = System.currentTimeMillis();
System.out.println(value2-value1);
value1 = System.currentTimeMillis();
for(long i=0;i<100000000;i++)
{
"abcd".startsWith("a");
}
value2 = System.currentTimeMillis();
System.out.println(value2-value1);
}
}
Tested it with this piece of code and perf for startsWith seems to be better, for obvious reason that it doesn't have to traverse through string. But in best case scenario both should perform close while in a worst case scenario startsWith will always perform better than indexOf

You mentioned the dataset is expected to be large. So i will bet a lot of performanve will go into access this dataset and handle it in memory. That means use one or the other will not change the perfomance significant. But if this is important to you you may write your own startWith method that could be significant faster than standard library methods or at least you know exactly what is done.

Unfortunate, statsWith is not working as supposed to! it uses "indexOf" behind the sene (lazy developpers :D) so indexOf is 10x faster than implemented statsWith

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.