I want to benchmark two libraries against each other, using nearly identical code for both.
The issue is that Java needs some time to warm up.
Do you have any idea how to properly set up a benchmark, especially for read/write operations, in Java?
It's hard for me to grasp how Java "warms up" or "caches" input data read via streams.
My use case:
I read a template file and fill it.
What I did:
I measured the time each library takes to read a template and fill a document.
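Roughly along these lines (simplified; the fill call is just a placeholder for whichever library is being measured):

import java.nio.file.Files;
import java.nio.file.Paths;

public class NaiveTiming {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        byte[] template = Files.readAllBytes(Paths.get("template.docx"));  // read the template
        // fillDocument(template);  // placeholder for the actual fill call of the library under test
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("read + fill took " + elapsedMs + " ms");
    }
}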
The issue I stumbled upon:
In the first iteration the first library is significantly faster than the second one, but after 1000+ iterations they come out very close.
I read the same template file multiple times, which could also be an issue.
Do you have any suggestions for creating a realistic benchmark?
There is no use case in which 1000 documents are generated at once; it is one document per "workflow". But I need to take into account that the JVM, once it "is warmed up", optimizes not only initialization but also how it reads data at runtime.
I also need to know how long it stays warmed up, because the use case never involves many files at once, so a "1000 iterations" scenario is not realistic.
Related
One is sometimes faced with the task of parsing data stored in files on the local system. A significant dilemma is whether to load and parse all of the file data at the beginning of the program run, or to access the file throughout the run and read data on demand (assuming the file is sorted, so a lookup can be done efficiently, e.g. with a binary search).
When it comes to small data sets, the first approach seems favorable, but with larger ones the threat of clogging up the heap increases.
What are some general guidelines one can use in such scenarios?
That's the standard tradeoff in programming: memory vs. performance, the space-time tradeoff, etc. There is no "right" answer to that question. It depends on how much memory you have, the speed you need, the size of the files, how often you query them, and so on.
In your specific case, since it seems to be a one-time job (if you are able to read everything at the beginning), it probably won't matter that much ;)
That depends entirely on what your program needs to do. The general advice is to keep only as much data in memory as is necessary. For example, consider a simple program that reads each record from a file of transactions, and then reports the total number of transactions and the total dollar amount:
count = 0
dollars = 0
while not end of file
    read record
    parse record
    increment count
    add transaction amount to dollars
end
output count and dollars
Here, you clearly need to have only one transaction record in memory at a time. So you read a record, process it, and discard it. It makes no sense to load all of the records into a list or other data structure, and then iterate over the list to get the count and total dollar amount.
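In Java, that streaming loop might look something like this (just a sketch: it assumes whitespace-separated records whose last field is the dollar amount, and a made-up file name):

import java.io.BufferedReader;
import java.io.FileReader;

public class TransactionTotals {
    public static void main(String[] args) throws Exception {
        int count = 0;
        double dollars = 0;
        try (BufferedReader in = new BufferedReader(new FileReader("transactions.txt"))) {
            String record;
            while ((record = in.readLine()) != null) {                    // read record
                String[] fields = record.split("\\s+");                   // parse record
                count++;                                                  // increment count
                dollars += Double.parseDouble(fields[fields.length - 1]); // add amount to total
            }
        }
        System.out.println(count + " transactions, total $" + dollars);
    }
}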
In some cases you do need multiple records, perhaps all of them, in memory. In those cases, all you do is re-structure the program a little bit. You keep the reading loop, but have it add records to a list. Then afterwards you can process the list:
list = []
while not end of file
    read record
    parse record
    add record to list
end
process list
output results
It makes no sense to load the entire file into a list and then scan the list sequentially just to obtain the count and dollar amount. That wastes memory to no gain, makes the program more complex, will be slower, and will fail with large data sets. The "memory vs. performance" tradeoff doesn't always apply: often, as in this case, using more memory makes the program slower.
I generally find it a good practice to structure my solutions so that I keep as little data in memory as is practical. If the solution is simpler with sorted data, for example, I'll make sure that the input is sorted before I run the program.
That's the general advice. Without specific examples from you, it's hard to say what approach would be preferred.
I've written a program for a class assignment that takes in data from a URL, parses it for key phrases, and then writes the phrase, line number, and column number to a text file.
Currently I do this as a single operation: the URL is fed to a BufferedReader for reading, then to a Scanner for parsing, and then into a loop where each line is combed through and a series of conditional statements check for the key phrases. When a match is found, I write it to the file.
The file is about 60K lines of text, and the full operation takes about 4000 ms on average from start to finish. Would it be more efficient to break the tasks apart, first reading the file into a data structure and then writing the results to the output file, instead of doing both at the same time?
Also, how big an impact would pulling the data from the URL have vs. reading it locally? I have the option to do either, but I figure this would depend on my broadband speed.
EDIT: Somewhat of a nice test case. Over the past week we changed our ISP and upgraded our broadband from 6 Mb/s to 30 Mb/s. This brought my average read/parse/write time down to 1500 ms. Interesting to see how much such a change can affect performance.
This depends on how you implement parallelism in the data-crunching part.
At the moment you sequentially read everything, then crunch the data, then write it out. So even if you broke it into three threads, each one would depend on the result of the previous.
Unless you start processing the data before it is fully received, splitting things up would not make a difference; it would only add overhead.
You would have to model a producer/consumer-style flow where, for example, lines are read individually and then put on a work queue for processing. The same goes for processed lines, which are then put on a queue to be written to the file.
This would allow the read, process, and write steps to run in parallel.
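A rough sketch of that queue flow, with a single reader thread feeding a work queue (skeleton only; the processing step and the second queue for the writer are reduced to comments):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    private static final String EOF = new String("EOF");   // poison pill marking end of input

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> workQueue = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {                  // producer: reads lines
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    workQueue.put(line);
                }
                workQueue.put(EOF);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        String line;
        while ((line = workQueue.take()) != EOF) {           // consumer: processes lines
            // comb the line for key phrases here; matches would go on a
            // second queue that a writer thread drains into the output file
        }
        reader.join();
    }
}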
By the way, you are probably mostly limited by the speed of reading the file from the URL, since all the other steps happen locally and are orders of magnitude faster.
I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are, will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code produces a method roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.
Instead of writing to a string or to a .java file, you might consider creating a compact binary format for your data.
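For example, if the bulk of the model really is a 2D array of small numeric values, a length-prefixed layout via DataOutputStream/DataInputStream is usually far faster to read back than parsing text. A sketch (the double[][] representation and the file layout are assumptions):

import java.io.*;

public class ModelIO {
    // writes a 2D double array in a simple length-prefixed binary layout
    static void write(double[][] model, File file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(model.length);
            out.writeInt(model[0].length);
            for (double[] row : model)
                for (double v : row) out.writeDouble(v);
        }
    }

    // reads it back; the dimensions come first, then the values in row order
    static double[][] read(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            double[][] model = new double[in.readInt()][in.readInt()];
            for (double[] row : model)
                for (int i = 0; i < row.length; i++) row[i] = in.readDouble();
            return model;
        }
    }
}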
Will storing and compiling the model in Java be significantly faster than reading it from the file?
That depends on how you fashion the custom data structure that contains your model.
The question, IMHO, is whether reading the file takes long because of IO or because of computing time (i.e. CPU). If the latter is the case, then tough luck. If your IO (e.g. the hard disk) is the cause, then you can compress the file and decompress it while or after reading. There is (of course) ZIP support in Java, even for streams.
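Wrapping the existing streams is a one-liner on each side; a sketch using GZIP (java.util.zip also has ZipInputStream/ZipOutputStream for actual .zip archives):

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedModelStreams {
    // compresses while writing
    static OutputStream openForWriting(File f) throws IOException {
        return new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(f)));
    }

    // decompresses while reading
    static InputStream openForReading(File f) throws IOException {
        return new GZIPInputStream(new BufferedInputStream(new FileInputStream(f)));
    }
}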
I agree with the answer above: use a binary input format, and try optimising that first. Can you provide some more information? Have you looked into working with binary data, buffering it, and so on?
Writing a .java file and compiling it would be quite interesting... but it is bound to give you issues at some point. I also think you will find it slightly slower than an optimised binary format, though faster than text-based input.
Also, be very careful about early optimisation. Usually, "highly configurable" and "blindingly fast" are mutually exclusive. Get everything working first, then use a profiler to optimise the really slow sections of the application.
What are some general methods for optimizing a Java program for speed? I am using a DOM parser to parse an XML file, storing certain words in an ArrayList, removing any duplicates, then spell-checking those words by creating a Google search URL for each word, fetching the HTML document, locating the corrected word, and saving it to another ArrayList.
Any help would be appreciated! Thanks.
Why do you need to improve performance? From your explanation, it is pretty obvious that the big bottleneck here (or performance hit) is going to be the IO resulting from the fact that you are accessing a URL.
This will surely dwarf by orders of magnitude any minor improvements you make in data structures or XML frameworks.
It is a good general rule of thumb that your big performance problems will involve IO. Humorously enough, I am at this very moment waiting for a database query to return in a batch process. It has been running for almost an hour. But I welcome any suggested improvements to my XML parsing library nevertheless!
Here are my general methods:
Does your program perform any obviously expensive task from the perspective of latency (IO)? Do you have enough logging to see that this is where the delay is (if significant)?
Is your program prone to lock contention (i.e. can it sit around doing nothing, waiting for some resource to become free)? Perhaps you are locking an entire Map while you make an expensive calculation for a value to store, blocking other threads from accessing the map.
Is there some obvious algorithm (perhaps for data-matching, or sorting) that might have poor characteristics?
Run up a profiler (e.g. jvisualvm, which ships with the JDK itself) and look at your code hotspots. Where is the JVM spending its time?
SAX is faster than DOM. If you don't want to go through the ArrayList searching for duplicates, put everything in a LinkedHashSet instead: no duplicates, and you still keep the insertion order that an ArrayList gives you.
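A sketch of that de-duplication step:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class Dedup {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("foo", "bar", "foo", "baz");
        // drops duplicates while keeping the first-seen order
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(words));
        System.out.println(unique);   // [foo, bar, baz]
    }
}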
But the real bottleneck is going to be sending the HTTP request to Google, waiting for the response, and then parsing the response. Use a spell-check library instead.
Edit: But take my educated guesses with a grain of salt. Use a code profiler to see what's really slowing down your program.
Generally the best method is to figure out where your bottleneck is, and fix it. You'll usually find that you spend 90% of your time in a small portion of your code, and that's where you want to focus your efforts.
Once you've figured out what's taking a lot of time, focus on improving your algorithms. For example, removing duplicates from an ArrayList can be O(n²) complexity if you're using the most obvious algorithm, but that can be reduced to O(n) if you leverage the correct data structures.
Once you've figured out which portions of your code are taking the most time, and you can't figure out how best to fix it, I'd suggest narrowing down your question and posting another question here on StackOverflow.
Edit
As #oxbow_lakes so snidely put it, not all performance bottlenecks are to be found in the code's big-O characteristics. I certainly had no intention to imply that they were. Since the question was about "general methods" for optimizing, I tried to stick to general ideas rather than talking about this specific program. But here's how you can apply my advice to this specific program:
See where your bottleneck is. There are a number of ways to profile your code, ranging from high-end, expensive profiling software to really hacky approaches. Chances are, any of these methods will show that your program spends 99% of its time waiting for a response from Google.
Focus on algorithms. Right now your algorithm is (roughly):
Parse the XML
Create a list of words
For each word
    Ping Google for a spell check
Return results
Since most of your time is spent in the "ping Google" phase, an obvious way to fix this would be to avoid doing that step more times than necessary. For example:
Parse the XML
Create a list of words
Send list of words to spelling service.
Parse results from spelling service.
Return results
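In code, the batched shape is roughly this (SpellingService is hypothetical; the point is a single round trip for the whole list instead of one per word):

import java.util.List;
import java.util.Map;

public class BatchedSpellCheck {
    // hypothetical service: one request carries the whole batch,
    // and the result maps each misspelled word to its correction
    interface SpellingService {
        Map<String, String> correct(List<String> words);
    }

    static Map<String, String> check(List<String> words, SpellingService service) {
        return service.correct(words);   // one round trip instead of words.size() round trips
    }
}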
Of course, in this case, the biggest speed boost would probably come from using a spell checker that runs on the same machine, but that isn't always an option. For example, TinyMCE runs as a JavaScript program in the browser and can't afford to download the entire dictionary as part of the web page. So it packages all the words into a single list and performs one AJAX request to get back the list of words that aren't in the dictionary.
These folks are probably right, but a few random pauses will turn "probably" into "definitely, and here's why".
I'm writing a small agent in Java that will play a game against other agents. I want to keep a small amount of state (probably 1 KB at most) around between runs of the program so that I can tweak the agent's performance based on past successes. Essentially, I will read a small amount of data at the beginning of each game and write a small amount at the end. It seems like I have two options: file I/O or Derby. Is there a speed advantage to either? Or does it not really matter for such a small amount of data?
With 1 KB of data, you are better off using standard file IO. Most likely, you can serialize the entire object tree to disk and simply deserialize it when you start up again. If you want to get fancy, you could use JAXB to serialize to XML instead of a binary file.
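A sketch of the plain-serialization route (it assumes your state object implements Serializable; the file name and helper names are made up):

import java.io.*;

public class AgentStateStore {
    // writes the whole object tree in one go
    static void save(Serializable state, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(state);
        }
    }

    // reads it back at startup
    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }
}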
As much as I love to fit every problem to the database solution, I don't think that's very practical here. Unless you have some special need for database-specific capabilities, a database introduces a lot of overhead, complexity, and maintenance problems.
The only case where you might really want a database is if you have a lot of small objects/rows and you frequently sort and filter the data. But even then, you could probably keep a dozen in-memory ordered lists and get better performance with fewer resources and without the headache of a database.
If you really think you need a database in this scenario, consider HSQL. I don't consider it a real database, but it's an in-memory database that can persist to a file. Low overhead, low complexity, and relatively few points of failure. Plus, if you need to edit the persisted data, you can do so with a text editor. You can't say that about Derby.
Considering that the size of these objects can vary from file to file, and that your computer's specs (bus speed, disk speed) also play a role, the only way to be sure is to write your own benchmark. Just create a simple for loop that counts from 1 to 1000 and reads the file inside the loop over and over (but don't create and destroy the objects inside the loop; just focus on the reading part).
Of course this whole exercise reeks of premature optimization, which can lead to bad coding habits. Just write your code in the most readable, simple fashion, and if there is a speed problem, refactor as needed.
But since it's such a small amount of data, I'd say it won't matter.