General methods for optimizing program for speed - java

What are some generic methods for optimizing a program in Java, in terms of speed. I am using a DOM Parser to parse an XML file and then store certain words in an ArrayList, remove any duplicates then spell check those words by creating Google search URL's for each word, get the html document, locate the corrected word and save it to another ArrayList.
Any help would be appreciated! Thanks.

Why do you need to improve performance? From your explanation, it is pretty obvious that the big bottleneck here (or performance hit) is going to be the IO resulting from the fact that you are accessing a URL.
This will surely dwarf by orders of magnitude any minor improvements you make in data structures or XML frameworks.
It is a good general rule of thumb that your big performance problems will involve IO. Humorously enough, I am at this very moment waiting for a database query to return in a batch process. It has been running for almost an hour. But I welcome any suggested improvements to my XML parsing library nevertheless!
Here are my general methods:
Does your program perform any obviously expensive task from the perspective of latency (IO)? Do you have enough logging to see that this is where the delay is (if significant)?
Is your program prone to lock-contention (i.e. can it wait around, doing nothing, waiting for some resource to be "free")? Perhaps you are locking an entire Map whilst you make an expensive calculation for a value to store, blocking other threads from accessing the map
Is there some obvious algorithm (perhaps for data-matching, or sorting) that might have poor characteristics?
Run up a profiler (e.g. jvisualvm, which ships with the JDK itself) and look at your code hotspots. Where is the JVM spending its time?

SAX is faster than DOM. If you don't want to go through the ArrayList searching for duplicates, put everything in a LinkedHashMap -- no duplicates, and you still get the order-of-insertion that ArrayList gives you.
But the real bottleneck is going to be sending the HTTP request to Google, waiting for the response, then parsing the response. Use a spellcheck library, instead.
Edit: But take my educated guesses with a grain of salt. Use a code profiler to see what's really slowing down your program.

Generally the best method is to figure out where your bottleneck is, and fix it. You'll usually find that you spend 90% of your time in a small portion of your code, and that's where you want to focus your efforts.
Once you've figured out what's taking a lot of time, focus on improving your algorithms. For example, removing duplicates from an ArrayList can be O(n²) complexity if you're using the most obvious algorithm, but that can be reduced to O(n) if you leverage the correct data structures.
Once you've figured out which portions of your code are taking the most time, and you can't figure out how best to fix it, I'd suggest narrowing down your question and posting another question here on StackOverflow.
Edit
As #oxbow_lakes so snidely put it, not all performance bottlenecks are to be found in the code's big-O characteristics. I certainly had no intention to imply that they were. Since the question was about "general methods" for optimizing, I tried to stick to general ideas rather than talking about this specific program. But here's how you can apply my advice to this specific program:
See where your bottleneck is. There are a number of ways to profile your code, ranging from high-end, expensive profiling software to really hacky. Chances are, any of these methods will indicate that your program spends the 99% of its time waiting for a response from Google.
Focus on algorithms. Right now your algorithm is (roughly):
Parse the XML
Create a list of words
For each word
Ping Google for a spell check.
Return results
Since most of your time is spent in the "ping Google" phase, an obvious way to fix this would be to avoid doing that step more times than necessary. For example:
Parse the XML
Create a list of words
Send list of words to spelling service.
Parse results from spelling service.
Return results
Of course, in this case, the biggest speed boost would probably be by using spell checker that runs on the same machine, but that isn't always an option. For example, TinyMCE runs as a javascript program within the browser, and it can't afford to download the entire dictionary as part of the web page. So it packages up all the words into a distinct list and performs a single AJAX request to get a list of those words that aren't in the dictionary.

These folks are probably right, but a few random pauses will turn *probably" into "definitely, and here's why".

Related

ArrayList or Multiple LinkedHashMap

I have an ArrayList of a custom object A. I need to retrieve 2 variables from A based on certain conditions. Should I simply use for loop to retrieve data from the list each time or create 2 LinkedHashMap and store the required variable in it as key/value pair for faster access later? Which is more efficient? Does creating 2 additional map objects justify the efficiency during search?
List will contain about 100-150 objects so does the two maps.
It will be used by concurrent users on daily basis.
Asking about "efficiency" is like asking about "beauty". What is "efficiency"? I argue that efficiency is what gets the code out soonest without bugs or other misbehavior. What's most efficient in terms of software costs is what saves programmer time, both for initial development and maintenance. In the time it took you to find "answers" on SO, you could have had a correct implementation coded and correct, and still had time to test your alternatives rigorously under controlled conditions to see which made any difference in the program's operation.
If you save 10 ms of program run time at the cost of horridly complex, over-engineered code that is rife with bugs and stupidly difficult to refactor or fix, is that "efficient"?
Furthermore, as phrased, the question is useless on SO. You provided no definition of "efficient" from your context. You provided no information on how the structures in question fit into your project architecture, or the extent of their use, or the size of the problem, or anything else relevant to any definition of "efficiency".
Even if you had, we'd have no more ability to answer such a question than if you asked a roomful of lawyers, "Should I sue so-and-so for what they did?" It all depends. You need advice, if you need advice at all, that is very specific to your situation and the exact circumstances of your development environment and process, your runtime environment, your team, the project goals, budget, and other relevant data.
If you are interested in runtime "efficiency", do the following. Precisely define what exactly you mean by "efficient", including an answer to "how 'efficient' is 'efficient' enough?", and including criteria to measure such "efficiency". Once you have such a precise and (dis)provable definition, then set up a rigorous test protocol to compare the alternatives in your context, and actually measure "efficiency".
When defining "efficiency", make sure that what you define matters. It makes no difference to be "efficient" in an area that has very low project cost or impact, and ignore an area that has huge cost or impact.
Don't expect any meaningful answer for your situation here on SO.
Use LinkedHashMap because it made for key value pair (according to your requirement).because data will increase in production environment.

Can you/How do you save CPU and memory by choosing wisely [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I understand the JVM optimizes some things for you (not clear on which things yet), but lets say I were to do this:
while(true) {
int var = 0;
}
would doing:
int var;
while(true) {
var = 0;
}
take less space? Since you aren't declaring a new reference every time, you don't have to specify the type every time.
I understand you really would only need to put var outside of while if I wanted to use it outside of that loop (instead of only being able to use it locally like in the first example). Also, what about objects, would it be different that primitive types in that situation? I understand it's a small situation, but build-up of this kind of stuff can cause my application to take a lot of memory/cpu. I'm trying to use the least amount of operations possible, but I don't completely understand whats going on behind the scenes.
If someone could help me out, even maybe link me to somewhere I can learn about saving cpu by decreasing amount of operations, it would be highly appreciated. Please no books (unless they're free! :D), no way of getting one right now /:
Don't. Premature optimization is the root of all evil.
Instead, write your code as it makes most sense conceptually. Write it thoughtfully, yes. But don't think you can be a 'human compiler' and optimize and still write good code.
Once you have written your code (more or less naively, depending on your level of experience) you write performance tests for it. Try to think of different ways in which the code may be used (many times in a row, from front to back or reversed, many concurrent invocations etc) and try to cover these in test cases. Then benchmark your code.
If you find that some test cases are not performing well, investigate why. Measure parts of the test case to see where the time is going. Zoom into the parts where most time is spent.
Mostly, you will find weird loops where, upon reading the code again, you will think 'that was silly to write it that way. Of course this is slow' and easily fix it. In my experience most performance problems can be solved this way and 'hardcore optimization' is hardly ever needed.
In the end you will find that 99* percent of all performance problems can be solved by touching only 1 percent of the code. The other code never comes into play. This is why you should not 'prematurely' optimize. You will be spending valuable time optimizing code that had no performance issues in the first place. And making it less readable in the process.
Numbers made up of course but you know what I mean :)
Hot Licks points out the fact that this isn't much of an answer, so let me expand on this with some good ol' perfomance tips:
Keep an eye out for I/O
Most performance problems are not in pure Java. Instead they are in interfacing with other systems. In particular disk access is notoriously slow. So is the network. So minimize it's use.
Optimize SQL queries
SQL queries will add seconds, even minutes, to your program's execution time if you don't watch out. So think about those very carefully. Again, benchmark them. You can write very optimized Java code, but if it first spends ten seconds waiting for the database to run some monster SQL query than it will never be fast.
Use the right kind of collections
Most performance problems are related to doing things lots of times. Usually when working with big sets of data. Putting your data in a Map instead of in a List can make a huge difference. Also there are specialized collection types for all sorts of performance requirements. Study them and pick wisely.
Don't write code
When performance really matters, squeezing the last 'drops' out of some piece of code becomes a science all in itself. Unless you are writing some very exotic code, chances are great there will be some library or toolkit to solve your kind of problems. It will be used by many in the real world. Tried and tested. Don't try to beat that code. Use it.
We humble Java developers are end-users of code. We take the building blocks that the language and it's ecosystem provides and tie it together to form an application. For the most part, performance problems are caused by us not using the provided tools correctly, or not using any tools at all for that matter. But we really need specifics to be able to discuss those. Benchmarking gives you that specifity. And when the slow code is identified it is usually just a matter of changing a collection from list to map, or sorting it beforehand, or dropping a join from some query etc.
Attempting to optimise code which doesn't need to be optimised increases complexity and decreases readability.
However, there are cases were improving readability also comes with improved performance.
For example,
if a numeric value cannot be null, use a primitive instead of a wrapper. This makes it clearer that the value cannot be null but also uses less memory and reduces pressure on the GC.
use a Set when you have a collection which cannot have duplicates. Often a List is used when in fact a Set would be more appropriate, depending on the operations you perform, this can also be faster by reducing time complexity.
consider using an enum with one instance for a singleton (if you have to use singletons at all) This is much simpler as well as faster than double check locking. Hint: try to only have stateless singletons.
writing simpler, well structured code is also easier for the JIT to optimise. This is where trying to out smart the JIT with more complex solutions will back fire because you end up confusing the JIT and what you think should be faster is actually slower. (And it's more complicated as well)
try to reduce how much you write to the console (and IO in general) in critical sections. Writing to the console is so expensive, both for the program and the poor human having to read it that is it worth spending more time producing concise console output.
try to use a StringBuilder when you have a loop of elements to add. Note: Avoid using StringBuilder for one liners, just series of append() as this can actually be slower and harder to read.
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. --
Antoine de Saint-Exupery,
French writer (1900 - 1944)
Developers like to solve hard problems and there is a very strong temptation to solve problems which don't need to be solved. This is a very common behaviour for developers of up to 10 years experience (it was for me anyway ;), after about this point you have already solved most common problem before and you start selecting the best/minimum set of solutions which will solve a problem. This is the point you want to get to in your career and you will be able to develop quality software in far less time than you could before.
If you dream up an interesting problem to solve, go ahead and solve it in your own time, see what difference it makes, but don't include it in your working code unless you know (because you measured) that it really makes a difference.
However, if you find a simpler, elegant solution to a problem, this is worth including not because it might be faster (thought it might be), but because it should make the code easier to understand and maintain and this is usually far more valuable use of your time. Successfully used software usually costs three times as much to maintain as it cost to develop. Do what will make the life of the poor person who has to understand why you did something easier (which is harder if you didn't do it for any good reason in the first place) as this might be you one day ;)
A good example on when you might make an application slower to improve reasoning, is in the use of immutable values and concurrency. Immutable values are usually slower than mutable ones, sometimes much slower, however when used with concurrency, mutable state is very hard to get provably right, and you need this because testing it is good but not reliable. Using concurrency you have much more CPU to burn so a bit more cost in using immutable objects is a very sensible trade off. In some cases using immutable objects can allow you to avoid using locks and actually improve throughput. e.g. CopyOnWriteArrayList, if you have a high read to write ration.

Sort a list with SQL or as a collection?

I have some entries with dates in my database. What is best?:
Fetch them with a sql statement and also apply order by.
Get the list with sql, and order them within the application with collection.sort or so?
Thanks
This a very broad question that is very difficult to answer, and it depends a lot on what you mean by best?
From a performance perspective, you will simply have to measure to determine what part of your system is the bottleneck. Databases are usually very efficient, but it could still be relevant to off-load that work to the client.
From a separation of concern perspective, it depends on how the sorting matters in the application and how the application is layered.
Ask your self: "where does the knowledge that the data is sorted belong?" and "What would happen if I where to change from a relational database storage to something different".
To some extent, it depends on how many values are in the complete collection. If it is, say, 20-30 values then you can sort anywhere — even a relatively poor sorting algorithm can do that quickly (avoid Stooge Sort though; that's terrible) — as that is the sort of size of data chunk which you might expect to actually fetch in one service response.
But once you get into larger datasets you need to plan much more carefully. In particular, you want to avoid moving data around if you don't have to. If the data is currently only present in the database, you really don't want to fetch it all into the client just to sort it (a relatively expensive operation) and then throw virtually all of it away. It's far better to actually keep the data sorted in the database to start with, so that picking it up in order is trivial; in relational database terms, keeping the data sorted is functionally identical to maintaining an index on the data. Indeed, you can have multiple indices on the data, which can make even rather complex queries quick. (NoSQL DBs are more varied; some even don't support the concept of keeping data sorted.) The downside of maintaining indices is that they take up more space and they take time to maintain, particularly when the data is being created in the first place.
So… to return to your question, you probably want to try to not sort the data in the application: for most data, an appropriate index can be much more efficient as it lets your code not even look at unwanted data. But if you have to fetch it all into your application for some other reason and you can't bring it in pre-sorted, there's no reason to avoid sorting it yourself: Java's sorting algorithms are efficient and stable. But you should measure whether fetching it from the DB in the new order is faster. (The question is whether the DB overheads exceed the super-linear costs of re-sorting; lots of problems are in the domain where “maybe; hard to tell” is the answer.)
The other thing to balance is whether it is simpler for your code to not do sorting itself and instead always delegate that to the DB. Keeping your code simpler (and more bug-free) is a good goal to have…
Database management systems (DMBS) are optimized for these tasks, so I think you should stick with them. Especially if you are accessing the database from a script written in PHP or (other scripting language), it might be slower to perform that task using a script. You might also reach a memory limit allowed to be used by PHP if you sort the array using a script.
I don't mean to raise a question of performance of different programming languages, just want to point out that it is a very good practice to rely on the DMBS whenever you can.
This is a very interesting question to me, and I want to present the other side of the accepted answer, which BTW is a very good answer with which I don't necessarily *dis*agree. Just want to present the other side.
When I started in my career, I was working on mainframe DB2, and the old-timers that taught me were VERY INSISTENT that sorting be done OUTSIDE of the db. Their rational for this is that it's work that CAN be offloaded, and this leaves the DB free to service other requests.
Of course, it's far more nuanced than this. In general, I'd say the factors you're weighing are:
A) How busy, or central to your system, is your database? If your db is very busy, if you have a lot of OLTP processing on clients or app servers, and your client or application servers have lots of excess capacity, why not sort on the app server or client? Even if it's less efficient, it spreads the work through the system and gets you more throughput from a whole-systems perspective.
B) How big is the sort? It would be silly to, say, blow your call stack or java heap because you sorted a gazillion MB of data.
C) Will sorting in your app or app server cause pauses, latency, etc? In other words, if your particular programming language has REALLY bad sorting libraries, and you don't want to write your own, maybe letting the DB take 0.5 seconds is better than making your application take 5.0 seconds.
So, as with all things, "it depends" ;-). But, I think these are the things upon which it depends.

How does a long name effect the response size?

I am using play framework for web development. I was wondering what is the effect of the longer variable on user's query processing.In other words how the name of the variable affects the query size? Does a longer name means longer query to be sent and thus longer time for a user request to process? Does a shorter JSON means shorter response time to user? Shorter variable names will significantly reduce the readability of the code.
What I have found
If I rename my variable to be send in a json from autoEngineId to aeId there is not much performance gain but thats maybe because I dont have significant amount of user requests to process. The site is in dev mode.
Can somebody please tell me what is the advantage/disadvantage of smaller varibale names in JSON?
Unless your variable names are the length of college essays, there's not going to be a noticeable difference in response times over the network. Spend your time optimizing where it counts. That will be where profiling tells you it will count.
The primary advantage of shorter names is less typing for the programmer. Make the names long enough so that when you come back to your code after a month, the name will provide at least a clue as to what it's being used for. Because you will have forgotten.
I suggest you profile your application to determine where you can improve performance. Even after 20 years experience of tuning performance systems I still find that measuring is the only way to be sure you are a) working on the most significant performance improvements b) you don't make matters worse.
In your case, I assume you have particular reason to believe the length of attribute names is a performance issue, it just something you change. This approach to performance tuning is more likely to make matter worse than better.
Smaller variable names may save you a little bandwidth, but the performance impact (unless you have hundreds of thousands of entries an object array), is going to be so small, you are unlikely to notice anything.
The impact of not having meaning variable names will cost you far more in the long run. Performance and Scalability are cheap, man-power is not, and well designed code will always pay for itself many times over in the long term.

The Efficiency of Hard-Coding vs. File Input

I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are, will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code makes a method that is roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.
Instead of writing to a string or to a java file, you might consider creating a compact binary format for you data.
Will storing and compiling the model in Java be significantly faster
than reading it from the file ?
That depends on the way you fashion your custom datastructure to contain your model.
The question IMHO is if the reading of the file takes long because of IO or because of computing time (=> CPU). If the later is the case then tough luck. If your IO (e.g. hard disc) is the cause then you can compress the file and extract it after/while reading. There is (of course) ZIP-support in Java (even for Streams).
I agree with the answer given above to use a binary input format. Let's try optimising that first. Can you provide some information? ...or have you googled working with binary data? ...buffering it? etc.?
Writing a .java file and compiling it will be quiet interesting... but it is bound to give your issues at some point. However, I think you will find that it will be slightly slower than an optimised binary format, but faster than text-based input.
Also, be very careful for early optimisation. Usually, "highly-configurable" and "blinding fast" is mutual exclusive. Rather, get everything to work first and then use a profiler to optimise the really slow sections of the application.

Categories

Resources