How to Fetch 1.7 Million records in Java?

I am using a MySQL database in which one table has 1.7 million records. Through the Restlet framework in Java I want to fetch these records and return them to the client. The application runs on a remote Linux CentOS server; I built a WAR file and deployed it there. When I call the service it takes a very long time to respond: I waited 40 minutes and got no output.
Can anybody please help me resolve this problem?

That's probably not going to work: holding that many rows of data in memory will probably cause an out of memory exception (can you look at the logs on the server and see what exactly is happening?).
To do something like this you'll either need to abandon that plan and do pagination of some sort, or you'll need a solution that lets you stream the records to the client without holding them in memory. I'm unsure that the Restlet framework lets you do that: you'll probably need to implement that using servlets yourself.

When I have a very large number of rows, I have used memory-mapped files. For example, I have one database where I have to retrieve and process 1.1 billion rows in around a minute (it's over 200 GB).
This is a very specialist approach, and I suspect there is a way to tune your SQL database, or use a NoSQL database, to do what you want. I would have thought you could retrieve 1.7 million rows in under a minute, depending on what you are doing (e.g. if you are selecting this many from among a few TB, it's going to take a while).
But, if there really is no other option, you could write a custom data store.
BTW: Only a summary is produced. No one should be expected to read that many rows, certainly not display them in a browser. Perhaps there is something you can do to produce a report or summary so there is less to send the client.

I have successfully done just this kind of work in my apps. If your client is ready to accept a big response, there is nothing wrong with the approach. The main point is that you'll need to stream the response, which means you can't build the entire response as a string. Get the output stream of the HTTP response and write records into it one by one. On the DB end you need to set up a scrollable result set (easy to do at the JDBC level, as well as at the Hibernate level).
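A minimal sketch of that streaming idea at the plain JDBC level. The class and method names (RecordStreamer, export, stream, escape) are made up for illustration, and CSV is just an assumed output format. With MySQL Connector/J, row-by-row streaming needs a forward-only, read-only statement with the fetch size set to Integer.MIN_VALUE; other drivers generally expect a positive fetch size instead.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RecordStreamer {

    /** Runs the query and streams every row to out as CSV; returns the row count. */
    public static long export(Connection conn, String sql, OutputStream out)
            throws SQLException, IOException {
        try (Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // MySQL Connector/J only streams row by row with this fetch-size hint;
            // for other drivers use a positive value such as 1000 (assumption: MySQL here).
            st.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = st.executeQuery(sql)) {
                return stream(rs, out);
            }
        }
    }

    /** Writes one CSV line per row, so only the current row is held in memory. */
    public static long stream(ResultSet rs, OutputStream out)
            throws SQLException, IOException {
        int cols = rs.getMetaData().getColumnCount();
        Writer w = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        long rows = 0;
        while (rs.next()) {
            for (int i = 1; i <= cols; i++) {
                if (i > 1) w.write(',');
                String v = rs.getString(i);
                if (v != null) w.write(escape(v)); // NULL becomes an empty field
            }
            w.write('\n');
            rows++;
        }
        w.flush(); // flush but do not close: the container owns the response stream
        return rows;
    }

    /** Quotes a field if it contains a comma, quote, or line break. */
    public static String escape(String v) {
        if (v.contains(",") || v.contains("\"") || v.contains("\n") || v.contains("\r")) {
            return '"' + v.replace("\"", "\"\"") + '"';
        }
        return v;
    }
}
```

In a servlet you would pass response.getOutputStream() as out; whether Restlet exposes an equivalent stream is something to check in its documentation.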

Related

direct logging on elasticsearch vs using logstash and filebeat

I'm using a Spring Boot back-end to provide some restful API and need to log all of my request-response logs into ElasticSearch.
Which of the following two methods has better performance?
Using Spring Boot ResponseBodyAdvice to log every request and response that is sent to the client directly to ElasticSearch.
Log every request and response into a log file and using filebeat and/or logstash to send them to ElasticSearch.

First off, I assume that you have a distributed application; otherwise just write your stuff to a log file and that's it.
I also assume that you have quite a lot of logs to manage; otherwise, if you're planning to log a couple of messages an hour, it doesn't really matter which way you go: both will do the job.
Technically both ways can be implemented, although for the first path I would suggest a different approach. At least, I did something similar about 5 years ago in one of my projects:
Create a custom log appender that throws everything into some queue (for async processing), and from there use Apache Flume, which can write to the DB of your choice in a transactional manner with batch support, "all-or-nothing" semantics, etc.
This approach solves issues that might appear in the "first" option that you've presented, while some other issues will be left unsolved.
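To make the queue idea concrete, here is a rough sketch of the in-process half only: request threads drop events into a bounded queue and a background thread hands them on in batches. This is not the Flume setup described above; BatchingLogShipper and the sink callback are hypothetical names, and in practice the sink would be whatever actually ships a batch (a Flume client, a bulk request, and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class BatchingLogShipper implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final Consumer<List<String>> sink; // hypothetical: ships one batch downstream
    private final int batchSize;
    private final Thread worker;
    private volatile boolean running = true;

    public BatchingLogShipper(int capacity, int batchSize, Consumer<List<String>> sink) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.batchSize = batchSize;
        this.sink = sink;
        this.worker = new Thread(this::drainLoop, "log-shipper");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Called from request threads; never blocks. Returns false if the queue is full. */
    public boolean log(String event) {
        return queue.offer(event);
    }

    private void drainLoop() {
        List<String> batch = new ArrayList<>(batchSize);
        // Keep going after close() until everything already queued has been shipped.
        while (running || !queue.isEmpty()) {
            try {
                String first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                queue.drainTo(batch, batchSize - 1); // top up to at most batchSize events
                sink.accept(new ArrayList<>(batch));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (RuntimeException e) {
                // The sink failed; a real shipper would retry or spill the batch to disk.
            } finally {
                batch.clear();
            }
        }
    }

    /** Stops accepting work and waits for the queued events to be flushed. */
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that events still queued in memory are lost if the JVM dies, which is one of the issues this kind of design leaves unsolved.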
If I compare the first and the second option that you've presented, I think you're better off with filebeat / logstash, or even both, to write to ES. Here is why:
When you log in the advice, you will "eat" the resources of your JVM: memory and CPU to maintain an ES connection pool, and a thread pool for doing the actual logging (otherwise the business flow might slow down because of logging the requests to ES).
In addition, you won't be able to write "in batch" to Elasticsearch without custom code, and will instead have to issue an "insert" per log message, which might be wasteful.
One more "technicality": what happens if the application gets restarted for some reason? Will you be able to write all the logs from before the restart if everything gets logged in the advice?
Yet another issue: what happens if you want to "rotate" the indexes in ES, namely create an index with a TTL and produce a new index every day?
filebeat/logstash potentially can solve all these issues, however they might require a more complicated setup.
Besides, obviously you'll have more services to deploy and maintain:
logstash is way heavier than filebeat from the resource consumption standpoint, and usually you should parse the log message (usually with grok filter) in logstash.
filebeat is much more "humble" when it comes to resource consumption. If you have many instances to log (really distributed logging, which I've assumed you have anyway), consider putting a filebeat service (a DaemonSet if you have k8s) on each node from which you'll gather the logs, so that a single filebeat process can handle different instances. Then deploy a cluster of logstash instances on separate machines, so that they do the heavy log-crunching all the time and stream the data to ES.
How does logstash/filebeat help?
Off the top of my head:
It will run at its own pace, so even if the process goes down, the messages produced by this process will still be written to ES.
It can even survive short outages of ES itself, I think (should check that).
It can handle different processes written in different technologies. What if tomorrow you want to gather logs from the database server, for example, which doesn't use Spring or isn't written in Java at all?
It can handle index rotation and batch writing internally, so you'll end up with effective ES management that you would otherwise have to write yourself.
What are the drawbacks of the logstash/filebeat approach?
Again, off the top of my head, not a full list:
Well, much more data will go through the network all in all.
If you use "LogEvent" you don't need to parse the string, so this conversion is redundant.
As for performance implications: it basically depends on what you measure, what exactly your application looks like, and what hardware you have, so I'm afraid I won't be able to give you a clear answer on that. You should measure in your concrete case and come up with the way that works better for you.

Not sure if you can expect a clear answer to that. It really depends on your infrastructure and used hardware.
And do you mean by performance the performance of your spring boot backend application or performance in terms of how long it takes for your logs to arrive at ElasticSearch?
I just assume the first one.
When sending the logs directly to ElasticSearch, your bottleneck will be the network; when logging requests and responses into a log file first, your bottleneck will probably be the hard disk and its maximum I/O operations.
Normally I would say that sending the logs directly to ElasticSearch via the network should be the faster option when you are operating inside your company network, because writing to a disk is quite slow in comparison. But if you are using fast SSDs the effect should be negligible. And if you need to send your network packets to a different location or country, this can also change fast.
So in summary:
If you have a fast network connection to your ElasticSearch and HDDs/slower SSDs the performance might be better using the network.
If your ElasticSearch is not at your location and you can use fast SSD, writing the logs into a file first might be the faster option.
But in the end you may have to try out both approaches, implement some timers and check for yourself.
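A very small timing helper of the kind meant here, as a sketch only. LogTiming and averageMicros are made-up names; for serious numbers a benchmark harness or production metrics would be more trustworthy than a hand-rolled loop.

```java
public class LogTiming {

    /** Returns the average wall-clock time per call, in microseconds. */
    public static double averageMicros(Runnable action, int warmup, int iterations) {
        // Warm-up runs give the JIT a chance to compile the code path first.
        for (int i = 0; i < warmup; i++) {
            action.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            action.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000.0 / iterations;
    }
}
```

You would call it once with a Runnable that logs to the file and once with a Runnable that sends to ElasticSearch, on the actual hardware and network, and compare the two averages.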

We are using both solutions. The first approach has less complexity.
We choose the second approach when we don't want to touch the code and have too many instances of the app.
About performance: writing directly to Elasticsearch gives better performance because you are not occupying disk I/O. But suppose the connection between your app and the Elasticsearch server is dropped: you would lose logs after some retry attempts.
Using rsyslog and logstash is more reliable for big clusters.

Caching and uploading to server in a periodic interval

I need to upload data at a periodic interval to the server, where it gets stored in SQL (after some processing in the BL).
Let's say every 15 minutes I have 20,000 JSON objects of 1 KB each. What's the best way to implement this?
I thought of writing it to a text file, then zipping it and uploading it to the server. But now there might be better technologies like Ehcache. How should I decide? Is it better to use one of these open-source caching tools?
There might be tens to hundreds of clients, each sending messages to the server as described above.

MongoDB sounds suitable for your task, because you are talking about caching JSON documents.
You can cache the documents in a certain collection as they come in, and later sync them with a persistent store.
Compared to other persistent caches like Ehcache, it would offer greater scalability and reliability outside your typical application container, yet would let you dump records quickly and easily.
http://www.tutorialspoint.com/mongodb/mongodb_advantages.htm
http://www.mongodb.com/blog/post/why-mongodb-popular

How to return a lot of data from the database to a web client?

I have the following problem:
I have a web application that stores data in the database. I would like the clients to be able to extract the data, e.g. of 2 tables, to a file (local to the client).
The database could be arbitrarily big (meaning I have no idea how much data could potentially be in the database; it could be huge).
What is the best approach for this?
Should all the data be SELECTed out of the tables and returned to the client as a single structure to be stored in a file?
Or should the data be retrieved in parts e.g. first 100 then next 100 entries etc and create the single structure in the client?
Are there any pros-cons to consider here?

I've built something similar - there are some really awkward problems here, especially as the filesize can grow beyond what you can comfortably handle in a browser. As the amount of data grows, the time to generate the file increases; this in turn is not what a web application is good at, so you run the risk of your web server getting unhappy with even a smallish number of visitors all requesting a large file.
What we did is split the application into 3 parts.
The "file request" was a simple web page, in which authenticated users can request their file. This kicks off the second part outside the context of the web page request:
File generator.
In our case, this was a windows service which looked at a database table with file requests, picked the latest one, ran the appropriate SQL query, wrote the output to a CSV file, and ZIPped that file, before moving it to the output directory and mailing the user with a link. It set the state of the record in the database to make sure only one process happened at any one point in time.
FTP/WebDAV site:
The ZIP files were written to a folder which was accessible via FTP and WebDAV - these protocols tend to do better with huge files than a standard HTTP download.
This worked pretty well - users didn't like to wait for their files, but the delay was rarely more than a few minutes.

We have a similar use case with an Oracle cluster containing approx. 40 GB of data. What works best for us is fetching as much data as possible per SELECT statement, as it reduces DB overhead significantly.
That being said, there are three optimizations which worked very well for us:
1.) We partition the data into 10 roughly same-sized sets and select them from the database in parallel. For our cluster we found that 8 connections in parallel work approx. 8 times faster than a single connection. There is some additional speedup up to 12 connections but that depends on your database and your dba.
2.) Keep away from Hibernate or other ORMs and use hand-written JDBC once you talk about large amounts of data. Use all the optimizations you can get there (e.g. ResultSet.setFetchSize()).
3.) Our data compresses very well, and putting the data through gzip saves lots of I/O time. In our case it eliminated I/O from the critical path. By the way, this is also true for storing the data in a file.
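Two of these points can be sketched in a few lines of plain Java: splitting a key range into roughly equal partitions (each of which would go to its own connection and worker thread), and pushing the output through gzip. ExportHelpers and its method names are invented for this example, and partitioning on a numeric id range is an assumption about the schema.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class ExportHelpers {

    /**
     * Splits the inclusive id range [min, max] into at most `parts` contiguous,
     * near-equal ranges. Each long[] is {firstId, lastId}, suitable for a
     * "WHERE id BETWEEN ? AND ?" query run on its own connection.
     */
    public static List<long[]> partition(long min, long max, int parts) {
        List<long[]> ranges = new ArrayList<>();
        long total = max - min + 1;
        long base = total / parts;
        long extra = total % parts; // the first `extra` ranges get one more id
        long start = min;
        for (int i = 0; i < parts && start <= max; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new long[] {start, start + size - 1});
            start += size;
        }
        return ranges;
    }

    /** Writes the lines gzip-compressed to out and closes it when done. */
    public static void gzipLines(Iterable<String> lines, OutputStream out) throws IOException {
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(out), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                w.write(line);
                w.write('\n');
            }
        }
    }
}
```

How many partitions and parallel connections pay off depends on the database, as noted in point 1, so the count is something to measure rather than assume.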

Is there a Java local queue library I can use that keeps memory usage low by dumping to the hard drive?

This may not be possible, but I thought I might just give it a try. I have some work that processes data, and it makes one of 3 decisions for each item it processes: keep, discard, or modify/reprocess (because it's unsure whether to keep or discard). This generates a very large amount of data, because the reprocessing may break the data into many different parts.
My initial method was to send it to the ExecutorService that was processing the data, but because the number of items to process was large I would run out of memory very quickly. Then I decided to offload the queue to a messaging server (RabbitMQ), which works fine, but now I'm bound by network I/O. What I like about RabbitMQ is that it keeps messages in memory up to a certain level and then dumps old messages to the local drive, so if I have 8 GB of memory on my server I can still have a 100 GB message queue.
So my question is, is there any library that has a similar feature in Java? Something that I can use as a nonblocking queue that keeps only X items in queue(either by number of items or size) and writes the rest to the local drive.
note: Right now I'm only asking for this to be used on one server. In the future I might add more servers but because each server is self-generating data I would try to take messages from one queue and push them to another if one server's queue is empty. The library would not need to have network access but I would need to access the queue from another Java process. I know this is a long shot but thought if anyone knew it would be SO.

Not sure if it is the approach you are looking for, but why not use a lightweight database like HSQLDB and a persistence layer like Hibernate? You can have your messages in memory, then commit to the DB to save them on disk, and later query them with a convenient SQL query.

Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached table" it provides, you can specify the maximum amount of memory used, and any excess data will be written to the hard drive.

Use the filesystem. It's old-school, yet so many engineers get bitten with libraries because they are lazy. True that HSQLDB provides lots of value-add features, but in the context of being light weight....
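For what it's worth, a bare-bones version of the filesystem approach might look like the sketch below: a FIFO queue that keeps up to a fixed number of items in memory and appends the overflow to a local file. SpillQueue is a made-up class, not an existing library; it is non-blocking, single-process, and does not survive a restart, so it only covers part of what the question asks for.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;

public class SpillQueue implements Closeable {
    private final int memLimit;                      // max items held in memory
    private final Deque<String> head = new ArrayDeque<>(); // oldest items
    private final RandomAccessFile file;             // newer items, in arrival order
    private long readPos = 0;
    private long writePos = 0;

    public SpillQueue(int memLimit, File spillFile) throws IOException {
        this.memLimit = memLimit;
        this.file = new RandomAccessFile(spillFile, "rw");
        this.file.setLength(0);
    }

    /** Adds an item; it goes to disk once memory is full or older items are already on disk. */
    public synchronized void offer(String item) throws IOException {
        if (readPos == writePos && head.size() < memLimit) {
            head.addLast(item);
            return;
        }
        file.seek(writePos);
        file.writeUTF(item); // note: writeUTF is limited to 65535 encoded bytes per item
        writePos = file.getFilePointer();
    }

    /** Returns the oldest item, or null if the queue is empty. */
    public synchronized String poll() throws IOException {
        if (head.isEmpty()) {
            refill();
        }
        return head.pollFirst();
    }

    // Loads the next items from disk into memory, preserving FIFO order.
    private void refill() throws IOException {
        while (head.size() < memLimit && readPos < writePos) {
            file.seek(readPos);
            head.addLast(file.readUTF());
            readPos = file.getFilePointer();
        }
        if (writePos > 0 && readPos == writePos) {
            file.setLength(0); // everything on disk was consumed; reclaim the space
            readPos = 0;
            writePos = 0;
        }
    }

    @Override
    public synchronized void close() throws IOException {
        file.close();
    }
}
```

Items would need to be serialized to strings (or the class adapted to byte arrays), and sharing the queue with another Java process would need something on top, such as file locking or a small socket server.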

How to improve my software project's speed?

I'm doing a school software project with my class mates in Java.
We store the info on a remote db.
When we start the application we pull all the information from the database and transform it into objects to use in our application (using Java SQL statements).
In the application we edit some of these objects, and then when we exit the application
we save or update the information in the database using Hibernate.
As you see, we don't use Hibernate for pulling in information; we use it just for saving and updating.
We have 2 very similar problems.
The loading of objects (when we start the app) and the saving of objects in the DB with Hibernate (when closing the app) are taking too much time.
And our project is not a huge enterprise application; it's quite a small app. We just manage some students, teachers, homework and tests, so our DB is also very, very small.
How could we increase performance?
Later edit: if we use a local database it runs very quickly; it only runs slowly on remote databases.

Are you saying you are loading the entire database into memory and then manipulating it? If that is the case, why don't you instead simply use the database as a storage device, and do lookups and manipulation as necessary (using Hibernate if you like, or something else if you don't)? The key there is to make sure that you are using connection pooling, as that will reduce the connection time.
If this is what you are doing, then you could be running into memory issues as well - first, by not caching the entire database in memory, you will reduce memory and will spread out the network load from the beginning/end to the times when it needs to happen.

These 2 sentences are red flags for me :
"When we start the application we pull all the information from the database and transform it into objects to use in our application (using Java SQL statements). In the application we edit some of these objects and then when we exit the application we save or update information in the database using Hibernate."
Is there a requirements reason that you are loading all the information from the database into memory at startup, or why you're waiting until shutdown to save changes back in the database?
If not, I'd suggest a design change. If you've already got Hibernate mappings for the tables in the DB, I'd use Hibernate for all of your CRUD (create, read, update, delete) operations. And I'd only load the data that each page in your app needs, as it needs it.
If you can't make that kind of design change at this point, I think you've got to look closely at how you're managing the database connections. Are you using connection pools? Are you opening up multiple connections? Forgetting to release them?
Something else to look at. How are you using Hibernate to save the entities to the db? Are you doing a getHibernateTemplate().get on each one and then doing an entity.save or entity.update on each one? If so, that means you are also causing Hibernate to run a select query for each database object before it does a save or update. So, essentially, you'd be loading each database object twice (once at the beginning of the program, once before saving). To see if that's what's happening, you can turn on the show_sql property or use P6Spy to see exactly what queries Hibernate is running.

For what you are doing, you may very well be better off serializing your objects and writing them out to a flat file.
But, much more likely, you should just read / update objects directly from your database as needed instead of all at once, for all the reasons aperkins gives.
Also, consider what happens if your application crashes. If all of your updates are held only in memory until the application is closed, everything would be lost if the app closes unexpectedly.

The difference in loading everything from a remote DB server versus loading everything from a local DB server is the network latency / pipe size. The network is a much smaller pipe than anything else. Two questions: first, how much data are we really talking about? Second, what is your network speed? 10/100/1000? Figure between 10 and 20% of your pipe size is going to be overhead due to everything from networking protocols to the actual queries themselves.
As others have stated, the way you've architected this is usually high on the list of "don't do". When starting, pull only enough data to initialize the app. As the user works through it, pull what you need for that task.
The ONLY time you pull everything is when they are working in a disconnected state. In that case, you still don't load everything as objects in the application, you just work from a local data store which gets sync'ed with the remote server every so often.

The project is pretty much complete; we can't do a large refactoring on it now.
I tried to use a second-level cache for Hibernate when saving: EhCacheProvider.
In hibernate.xml:
net.sf.ehcache.hibernate.EhCacheProvider
I have written a config for the cache, ehcache.xml:
I have put the cache.jar in the project build path,
and I have set the Hibernate property for every class and set it in the mapping.
But this cache doesn't seem to have an effect. I don't know if it works (if it is used).

Try minimising the number of SQL queries, since every query has its own overhead.
You can enable database compression, which should speed things up when there is a lot of data.
Maybe you are connecting to the database many times?
Check the ping time of remote database server - it might be the problem.

As your application is just slow when running on a remote database server, I'd assume that the performance loss is due to:
Connecting to the server: try to reuse connections (pass the instance around) or use connection pooling
Query round-trip time: use as few queries as possible, see here in case of a hand-written DAL:
Preferred way of retrieving row with multiple relating rows
For Hibernate you may use its batch functionality and adjust hibernate.jdbc.batch_size.
In all cases, especially when you can't refactor larger parts of the codebase, use a profiler (method time or SQL queries) to find the bottleneck. I bet you'll find thousands of queries (each taking 10 ms RTT) which could be merged into one.

Some other things you can look into:
You can allocate more memory to the JVM
Use the jconsole tool to investigate what the bottlenecks are.

Why don't you have two separate threads?
Thread 1 will load your objects one by one.
Thread 2 will process objects as they are loaded.
Your app will seem more interactive at startup.
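A rough sketch of that two-thread idea, using a bounded BlockingQueue as the hand-off between the loader and the processor. LoadAndProcess and run are invented names, the source iterable stands in for whatever reads rows from the database, and error handling is left out (if the loader fails, the processor would wait forever).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

public class LoadAndProcess {
    private static final Object DONE = new Object(); // marker: the loader has finished

    /** Loads items on one thread and processes them on another as they arrive. */
    public static <T, R> List<R> run(Iterable<T> source, Function<T, R> process)
            throws InterruptedException {
        // Bounded queue so the loader cannot run far ahead of the processor.
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>(100);
        List<R> results = new ArrayList<>();

        Thread loader = new Thread(() -> {
            try {
                for (T item : source) {
                    queue.put(item); // blocks while the queue is full
                }
                queue.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "loader");

        Thread processor = new Thread(() -> {
            try {
                for (Object o = queue.take(); o != DONE; o = queue.take()) {
                    @SuppressWarnings("unchecked")
                    T item = (T) o;
                    results.add(process.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "processor");

        loader.start();
        processor.start();
        loader.join();
        processor.join(); // join makes the processor's writes to results visible here
        return results;
    }
}
```

In a UI application you would not join on the calling thread; the processor would instead publish each object to the UI as it becomes available.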

It never hurts to review the basics:
Improving speed means reducing time (obviously), and to do that, you find activities that take significant time but can be eliminated or replaced with something that uses less time. What I mean by activity is almost always a function call, method call, or property call, performed on a specific line of code for a specific purpose. It may invoke I/O or it may invoke computation, or both. If its purpose is not essential, then it can be optimized.
Many people use profilers to try to find these time-wasting lines of code, but most profilers miss the target because they look at functions, not lines, they go to sleep during I/O, and they worry about "self time".
Many more people try to guess what could be the problem, or they ask others to guess, such as by asking on SO. Such guesses, in the nature of guesses, are sometimes right - more often not, but people still invest time and resources in them.
There's a very simple way to find out for sure, without guessing, what could fruitfully be optimized, and here is one way to do it in Java.

Thanks for your answers. They were more than helpful.
We completely solved this problem like so:
Refactored the LOAD code. Now it uses Hibernate with lazy fetching.
Refactored the SAVE code. Now it saves just the data that was modified, right after it was modified. This way we don't have a HUGE save at the end.
I'm amazed at how well it all went. The amount of new code we had to write was very, very small.
