I am implementing a load testing tool in Java. I want to store the load test results temporarily and match them against the expected results at the end. I cannot match responses while the load test is running, as that would affect the load test speed.
So I want to store the results with minimal write time. What is the best possible way to store the data: write to a local database, or write to the local HDD as files?
Note: the results cannot be kept in RAM, as they may be large (several GB).
A file is more efficient than a DB. A database is software built on top of the file system, so if you implement your solution with files effectively, it will be faster and will not carry the overhead of a DB.
Also, with a DB, using an ORM (JPA) makes the code slower than pure JDBC. It's a general rule that more convenience in data handling (file -> JDBC -> JPA) costs more time.
I suggest you use file manipulation for this purpose, with a faster API such as NIO (New I/O) in Java.
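For example, here is a minimal sketch of appending fixed-size binary result records with NIO; the record layout (request id, timestamp, status, latency) is hypothetical, not something from your tool:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ResultWriter implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024); // batch small records

    public ResultWriter(String path) throws IOException {
        this.channel = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Hypothetical record: requestId, timestamp, status code, latency in microseconds (28 bytes).
    public void append(long requestId, long timestampMillis, int status, long latencyMicros)
            throws IOException {
        if (buffer.remaining() < 28) {
            flush();
        }
        buffer.putLong(requestId).putLong(timestampMillis).putInt(status).putLong(latencyMicros);
    }

    public void flush() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }

    @Override
    public void close() throws IOException {
        flush();
        channel.close();
    }
}
```

Buffering the records and writing them in 64 KB chunks keeps the number of system calls low, which is usually what dominates the cost of writing many small records.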
With load testing, the usual approach is to have multiple agents putting load on the system. There then needs to be some benchmarking on the system and on the agents. Usually the load is split so that each agent keeps its own statistics. When the test is finished you can aggregate the agents' statistics "offline" and cross-compare them with the benchmark on the system side.
Answering your question about I/O write speed: it depends on many different factors, so you would need to benchmark both scenarios. However, considering that a database has to support transactions and indexing on top of storing the data, my blind guess is that in your use case files and raw data will be faster.
I'm working for a well-known company on a project that has to integrate with another system that produces one 27 GB CSV file per hour. The target is to query these files without importing them (the main problem is bureaucracy: nobody wants responsibility if some data changes).
The main filters on these files are by date: the end user must enter a start and end date. After that, the results can be filtered by a few strings.
Context: Spring Boot microservices
Server: Xeon processor, 24 cores, 256 GB RAM
Filesystem: NFS mounted from an external server
Test data: 1000 files, each one 1 GB
To improve performance I'm indexing the files by date, writing the date range each file contains into its name and using a folder structure like yyyy/mm/dd. For each of the following tests, the first step was to build a raw list of the file paths to read (see the sketch below).
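Here is a minimal sketch of that first step, assuming the yyyy/mm/dd layout described above; the root path in main is a placeholder:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class FileIndex {
    // Collect the CSV paths whose yyyy/mm/dd folder falls inside the requested range.
    public static List<Path> pathsInRange(Path root, LocalDate start, LocalDate end) throws IOException {
        List<Path> result = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            Path day = root.resolve(String.format("%04d/%02d/%02d",
                    d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
            if (Files.isDirectory(day)) {
                try (Stream<Path> files = Files.list(day)) {
                    files.filter(p -> p.toString().endsWith(".csv")).forEach(result::add);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get("/mnt/nfs/csv"); // placeholder NFS mount
        List<Path> paths = pathsInRange(root, LocalDate.of(2021, 1, 1), LocalDate.of(2021, 1, 21));
        System.out.println(paths.size() + " files to read");
    }
}
```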
Each test reads all the files:
Spring batch - buffered reader and parse into object: 12,097 sec
Plain java - threadpool, buffered reader and parse into object: 10,882 sec
Linux egrep with regex and parallel ran from java and parse into object: 7,701 sec
The dirtiest option is also the fastest. I want to avoid it because the security department warned me about all the checks that have to be made on input data to prevent shell injection.
Googling, I found the MariaDB CONNECT engine, which can also point at huge CSVs, so now I'm going in that direction, creating a temporary table over the files the search is interested in. The bad part is that I have to create one table per query, since the date ranges can differ.
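For reference, a hedged sketch of creating such a temporary CONNECT table from Java via JDBC; the column definitions, separator, and file path are placeholders to be adapted to the actual CSV layout:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ConnectTableSetup {
    public static void main(String[] args) throws SQLException {
        // Placeholder columns and file path; HEADER and SEP_CHAR must match the real CSV format.
        String ddl =
                "CREATE TABLE research_tmp (" +
                "  event_ts DATETIME," +
                "  payload  VARCHAR(255)" +
                ") ENGINE=CONNECT TABLE_TYPE=CSV" +
                "  FILE_NAME='/mnt/nfs/csv/2021/01/15/data_0800.csv'" +
                "  HEADER=1 SEP_CHAR=';'";
        try (Connection con = DriverManager.getConnection(
                    "jdbc:mariadb://localhost:3306/research", "user", "pass");
             Statement st = con.createStatement()) {
            st.execute(ddl);
            // The table can now be queried with ordinary SQL, then dropped when the research completes.
        }
    }
}
```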
For the first year we're expecting no more than 5 parallel searches at the same time, with an average range of 3 weeks. These queries will be run asynchronously.
Do you know anything that could help me with this? Not only for speed, but also good practices to apply.
Thanks a lot, folks.
To answer your question:
No. There are no best practices. And, AFAIK there are no generally applicable "good" practices.
But I do have some general advice. If you allow considerations such as bureaucracy and (to a lesser extent) security edicts to dictate your technical solutions, then you are liable to end up with substandard solutions; i.e. solutions that are slow or costly to run and keep running. (If "they" want it to be fast, then "they" shouldn't put impediments in your way.)
I don't think we can give you an easy solution to your problem, but I can say some things about your analysis.
You said, about the grep solution:
"I want to avoid it because the security department warned me about all the checks that have to be made on input data to prevent shell injection."
The solution to that concern is simple: don't use an intermediate shell. The dangerous injection attacks go via shell trickery rather than grep itself. Java's ProcessBuilder doesn't use a shell unless you explicitly ask for one. The grep program itself can only read the files specified in its arguments and write to standard output and standard error.
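A minimal sketch of that approach, assuming the pattern and file list come from your existing code (no shell is involved, so there is nothing for an injection to escape into):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class GrepRunner {
    public static List<String> grep(String userPattern, List<String> filePaths)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        cmd.add("grep");
        cmd.add("-E");            // extended regex, like egrep
        cmd.add(userPattern);     // passed as a single argv element, never interpreted by a shell
        cmd.addAll(filePaths);

        Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        List<String> matches = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                matches.add(line);            // parse into your object here
            }
        }
        p.waitFor();                          // grep exits 1 when nothing matched, 2 on errors
        return matches;
    }
}
```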
You said about the general architecture:
"The target is query these files without import them (the main problem is bureaucracy, nobody want responsibility if some data change)."
I don't understand the objection here. We know that the CSV files are going to change. You are getting a new 27GB CSV file every hour!
If the objection is that the format of the CSV files is going to change, well, that affects your ability to query them effectively. But with a little ingenuity, you could detect the change in format and adjust the ingestion process on the fly.
"We're expecting not more than 5 parallel researches in same time, with an average of 3 weeks of range."
If you haven't done this already, you need to do some analysis to see whether your proposed solution is going to be viable. Estimate how much CSV data needs to be scanned to satisfy a typical query. Multiply that by the number of queries to be performed in (say) 24 hours. Then compare that against your NFS server's ability to satisfy bulk reads. Then redo the calculation assuming a given number of queries running in parallel.
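As a hedged back-of-the-envelope illustration of that calculation, using the figures from the question (27 GB per hour, a 3-week average range, up to 5 parallel queries) and an assumed NFS throughput that you would replace with a measured value:

```java
public class ScanBudgetEstimate {
    public static void main(String[] args) {
        // Figures from the question; the throughput is an assumption to be replaced by a measurement.
        double gbPerHour = 27.0;
        int rangeDays = 21;                        // average query range: 3 weeks
        int parallelQueries = 5;
        double assumedNfsGbPerSec = 1.0;           // measure this on your setup

        double gbPerQuery = gbPerHour * 24 * rangeDays;        // ~13,608 GB scanned per query
        double secondsPerQueryAlone = gbPerQuery / assumedNfsGbPerSec;
        double secondsWithContention = secondsPerQueryAlone * parallelQueries; // queries share the link

        System.out.printf("Data scanned per query: %.0f GB%n", gbPerQuery);
        System.out.printf("Wall time per query (alone): %.1f hours%n", secondsPerQueryAlone / 3600);
        System.out.printf("Wall time if %d queries share the link: %.1f hours%n",
                parallelQueries, secondsWithContention / 3600);
    }
}
```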
Consider what happens if your (above) expectations are wrong. You only need a couple of "idiot" users doing unreasonable things ...
Having a 24 core server for doing the queries is one thing, but the NFS server also needs to be able to supply the data fast enough. You can improve things with NFS tuning (e.g. by tuning block sizes, the number of NFS daemons, using FS-Cache), but the ultimate bottlenecks will be getting the data off the NFS server's disks and across the network to your server. Bear in mind that there could be other servers "hammering" the NFS server while your application is doing its thing.
I have a Java NLP project that I am working on which uses Stanford's CoreNLP package. I have several unit tests for the project and I like to run them frequently in order to see how minor tweaks impact the system's output. Unfortunately, the CoreNLP package needs to load a model of the English language in order to perform its classification and tagging, and this file is so large that it takes several seconds to load into memory. This may not seem like much wait time but it seems a shame that the unit tests themselves take milliseconds to run and each time I start a new test run I have to wait for the model file to load.
Is there any way to have the model file loaded once, and have subsequent unit test runs use that model, which is already in memory? Perhaps something like a test "server" that stores the model and can be called from the unit tests? I have never dealt with something like this before, so I really have no idea where to start.
In unit testing, the typical solution for such a scenario is to isolate your code from the 'disturbing' libraries (that is, eliminate the dependency) or use doubles (like stubs or mocks). Unit testing against actual databases is considered a 'test smell'.
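One hedged way to apply that advice (the names below are hypothetical, not part of CoreNLP): hide the pipeline behind a small interface that your code depends on, and let the unit tests supply a canned stub, so the real model is only loaded in integration tests.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical seam: your code depends on this interface, not on CoreNLP directly.
interface PosTagger {
    List<String> tag(String sentence);
}

// The production implementation would wrap the real CoreNLP pipeline (omitted here).

// Unit tests use a stub with canned output, so no model file is ever loaded.
class StubPosTagger implements PosTagger {
    @Override
    public List<String> tag(String sentence) {
        return Arrays.asList("DT", "NN", "VBZ"); // canned tags for the test scenario
    }
}
```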
In general, if you are on a modern operating system such as Linux, subsequent reads of the same file within a short amount of time will be cached by the buffer cache - unless the file is very large or you are short on free memory. This is not just theoretical - you can easily run a JUnit test with some profiling that shows that loading a file multiple times will result in near memcpy speeds for all but the first load, as long as the file approximately fits in free RAM.
That is, the file will generally load at 5 GB/s or faster on modern desktop or server hardware as long as it is in the cache. If the file is too big to keep in the cache, then a lot of the other solutions are already excluded anyway, since alternatives such as a daemon keeping the file in shared memory would require the same amount of RAM.
That's all talking about the raw cost of reading the file (e.g., using Java's InputStream or other classes which read the raw file). It's entirely likely that the true cost of "loading" the file lies in the application-specific parsing needed to bring the file into the expected in-memory format. In that case, you could certainly consider some kind of long-lived cache process that keeps the parsed data in memory across Java invocations. You could use something off the shelf like redis or memcached, but you'd have to make sure that your deserialization scheme was much faster than your parsing scheme.
Ultimately you need to profile the library's load of the problematic file. Is it I/O limited (i.e., most time spent blocking in I/O functions), or is it CPU limited (i.e., most time spent in parsing or other processing)? Only then can you determine at what level caching would be useful.
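A minimal sketch of that kind of profiling, assuming the model is loaded through the usual StanfordCoreNLP pipeline construction; the model path and annotator list below are placeholders you would adjust to your setup. Time the raw read of the model file separately from the full pipeline load and compare the two:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class LoadProfile {
    public static void main(String[] args) throws Exception {
        // 1. Raw I/O cost: stream the model file and discard the bytes.
        Path model = Paths.get("/path/to/english-model.ser.gz"); // placeholder path
        long t0 = System.nanoTime();
        try (InputStream in = Files.newInputStream(model)) {
            byte[] buf = new byte[1 << 20];
            while (in.read(buf) != -1) { /* discard */ }
        }
        System.out.printf("Raw read: %.2f s%n", (System.nanoTime() - t0) / 1e9);

        // 2. Full load cost: constructing the pipeline loads and deserializes the models.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos"); // adjust to what your tests use
        long t1 = System.nanoTime();
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.printf("Pipeline load: %.2f s%n", (System.nanoTime() - t1) / 1e9);
    }
}
```

If the raw read is fast but the pipeline load is slow, caching the file bytes (OS buffer cache, memcached, etc.) will not help; you would need to keep the parsed objects alive, e.g. in a long-running test JVM.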
I have millions of rows to read from a database, and multiple users come in during the day to read the same data, so I want to create a cache so that I don't have to go to the database again for the same data.
I have seen many options but couldn't figure out which approach to use.
Should I create my own cache (I am thinking of saving the query result data and writing it to a file), or
use some third-party in-memory cache?
Guava CacheBuilder, LRUMap caching, whirlycache, cache4j.
You are not the first person to have requirements like this, which is why there are dozens of cache implementations available as open source projects, and even a standard set of Java APIs for caching (JCache). If your needs go beyond those solutions, there are even commercial solutions that handle tens of terabytes of data transparently across RAM, flash, database, etc. If none of those are sufficient, then you should definitely write your own.
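For illustration, here is a minimal sketch with Guava's CacheBuilder (one of the libraries listed in the question); the DAO and the key/value types are hypothetical:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class QueryCache {
    // Hypothetical DAO that actually hits the database.
    interface ResultDao {
        List<String> runQuery(String queryKey);
    }

    private final LoadingCache<String, List<String>> cache;

    public QueryCache(ResultDao dao) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                       // bound memory use
                .expireAfterWrite(30, TimeUnit.MINUTES)    // refresh stale results
                .build(new CacheLoader<String, List<String>>() {
                    @Override
                    public List<String> load(String queryKey) {
                        return dao.runQuery(queryKey);     // only called on a cache miss
                    }
                });
    }

    public List<String> results(String queryKey) {
        return cache.getUnchecked(queryKey);
    }
}
```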
It totally depends on multiple factors, and I think the answer will be based on the environment, size of data, etc. Here are the main points:
You want to keep the cache in RAM as much as possible, because it is faster to access than the file system.
You can also use OS memory-mapped files, which balance access speed against memory utilization (see the sketch after these points). I suggest any proven solution over creating your own.
If you are running low on memory, then you might need to ask what is more important, e.g. caching the most frequently accessed data, as it is the most likely to be requested by clients.
So there is no sure or definite answer; you have to decide based on your constraints. Hope this helps.
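As a hedged illustration of the memory-mapped option mentioned above (the file name and layout are hypothetical): map the file read-only and let the OS page it in and out as memory allows.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedCacheFile {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("cache.dat"); // hypothetical cache file
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The OS keeps hot pages in RAM and evicts cold ones under memory pressure.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Example: read a 4-byte record count assumed to be stored at the start of the file.
            int recordCount = map.getInt(0);
            System.out.println("records: " + recordCount);
        }
    }
}
```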
I think you are overengineering the problem; it isn't trivial to write a performant, transparent cache unless you only need a simple HashMap to hold some values. You should focus on writing code that solves your domain problem rather than writing framework code.
Stop reinventing the wheel: use either an in-memory cache (e.g. Infinispan or Redis) or a database (e.g. Postgres). You will have less pain and better performance.
Is it better to make two database calls, or one database call plus some Java processing?
One database call gets only the relevant data, which then has to be separated into two different lists, requiring a few lines of Java.
Database round trips are always an expensive operation. If you can manage with one DB fetch and some Java processing, that should be the better and faster choice for you.
But you may have to analyze which one turns out to be more efficient in your scenario. I assume a single DB fetch plus Java processing will be better.
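Here is a minimal JDBC sketch of the single-fetch approach (the table, columns, and connection details are hypothetical): run one query and split the rows into two lists in Java.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class SingleFetchSplit {
    public static void main(String[] args) throws SQLException {
        List<String> activeNames = new ArrayList<>();
        List<String> inactiveNames = new ArrayList<>();

        String sql = "SELECT name, active FROM customers WHERE region = ?"; // hypothetical schema
        try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/app", "user", "pass");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, "EU");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // One round trip; the split into two lists is just a few lines of Java.
                    if (rs.getBoolean("active")) {
                        activeNames.add(rs.getString("name"));
                    } else {
                        inactiveNames.add(rs.getString("name"));
                    }
                }
            }
        }
        System.out.println(activeNames.size() + " active, " + inactiveNames.size() + " inactive");
    }
}
```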
Testing is key. Some questions you may want to ask yourself:
How big is each database call?
How much bigger/smaller would the calls be if I combined them?
Should I push the processing to the client?
Timing
How time critical is processing?
Do you need to swarm the DB or is it okay to piggy back on the client?
Is the difference negligible?
Java processing is much faster than an extra SQL fetch. I had the same problem, so I recommend fetching the data once and doing some processing: the difference between the two options may be minor, but some machines take a long time to fetch data from the DB, so I suggest a single fetch with some Java processing.
Generally, Java processing is better if it's not just some simple DB query that you are doing.
I would recommend trying them both, measuring time and load, and seeing what fits your application best.
It all depends on how intensive your processing is and how your database is set up. For instance, Oracle running on a native file system will most likely be more performant than your own Java processing code for complex operations. Note that most built-in operations in well-known databases are highly optimized and usually very performant.
Does it help to use Redis with Java to develop data-intensive applications (e.g. data mining) in Java?
Does it work faster or consume less memory compared to plain Java for similar operations on high volumes of data?
Edit: My question is mostly about running on a single machine, for example working with a large number of lists/sets/maps and querying and sorting them.
Redis will definitely not be faster than native Java on a single machine. It would allow you to distribute processing, but if the chunks of data really are large, they're not likely to fit into memory anyway. Without knowing more about what you're doing, I would suggest storing the data on disk. When you get multiple machines, you can network-mount the partition and share the data that way. Alternatively, Hadoop with MapReduce sounds like the right sort of thing for what you're doing.