We have a background job that fetches records from a particular table in batches of 1000 records at a time.
This job runs at an interval of 30 minutes.
These records have an email (key) and a reason (value).
The issue is that we have to look these records up against the data warehouse (a kind of filtering step that fetches the last 180 days of data from the warehouse).
A call to the data warehouse is very costly in terms of time (approximately 45 minutes).
So the existing scenario is like this:
We check for a flat file on disk. If it does not exist, we make a call to the data warehouse, fetch the records (the size ranges from 0.2 million to 0.25 million)
and write these records to disk.
On subsequent calls we do the lookup from the flat file only,
loading the entire file into memory and doing an in-memory search and filtering.
This has caused OutOfMemory errors many times.
So we modified the logic like this:
We split the data warehouse call into 4 equal intervals and stored each result in a file, again with ObjectOutputStream.
Now, on subsequent calls, we load the data into memory one interval at a time, i.e. 0-45 days, 46-90 days and so on.
But this is not helping either. Can experts please suggest the ideal way to tackle this issue? Someone on the senior team suggested using CouchDB for storing and querying the data. At first glance, though, I would prefer a good solution with the existing infrastructure, if one is available; if not, we can think of using other tools.
The filtering code as of now:
private void filterRecords(List<Record> filterRecords) {
    long start = System.currentTimeMillis();
    logger.error("In filter Records - start (ms) : " + start);
    Map<String, String> invalidDataSet = new HashMap<String, String>(5000);
    try {
        boolean isValidTempResultFile = isValidTempResultFile();
        // 1. Fetch invalid leads data from DWHS only if the existing file is more than 7 days old.
        String[] intervals = {"0", "45", "90", "135"};
        if (!isValidTempResultFile) {
            logger.error("#CCBLDM isValidTempResultFile false");
            for (String interval : intervals) {
                invalidDataSet.clear();
                // This call takes approx. 45 minutes to fetch the data
                getInvalidLeadsFromDWHS(invalidDataSet, interval, start);
                filterDataSet(invalidDataSet, filterRecords);
            }
        } else {
            // Load data from the temporary file one interval at a time to avoid heap space issues
            logger.error("#CCBLDM isValidTempResultFile true");
            intervals = new String[]{"1", "2", "3", "4"};
            for (String interval : intervals) {
                // GC does not happen immediately here, which causes the OOM issue.
                invalidDataSet.clear();
                // Keeps 45 days of data in memory at a time
                setInvalidDataSetFromTempFile(invalidDataSet, interval);
                // 2. Mark the current record as incorrect if it belongs to the invalid email list
                filterDataSet(invalidDataSet, filterRecords);
            }
        }
    } catch (Exception filterExc) {
        Scheduler.log.error("Exception occurred while Filtering Records", filterExc);
    } finally {
        invalidDataSet.clear();
        invalidDataSet = null;
    }
    long end = System.currentTimeMillis();
    logger.error("Total time taken to filter all records ::: [" + (end - start) + "] ms.");
}
I'd strongly suggest slightly changing your infrastructure. IIUC, you're looking up something in a file and something in a map. Working with a file is a pain, and loading everything into memory causes an OOME. You can do better using a memory-mapped file, which allows fast and simple access.
There's Chronicle-Map, which offers a Map interface to data stored off-heap. Actually, the data resides on disk and takes up main memory as needed. You need to make your keys and values Serializable (which AFAIK you already did) or use an alternative way (which may be faster).
It's not a database, it's just a ConcurrentMap, which makes working with it very simple. The whole installation is just adding a line like compile "net.openhft:chronicle-map:3.14.5" to build.gradle (or a few Maven lines).
There are alternatives, but Chronicle-Map is what I've tried (just started with it, but so far everything works perfectly).
There's also Chronicle-Queue, in case you need batch processing, but I strongly doubt you'll need it as you're limited by your disk rather than main memory.
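To make the memory-mapped file idea above concrete, here is a minimal sketch that uses only the JDK (it does not use Chronicle-Map). It assumes, purely for illustration, that the flat file is rewritten as plain text with one "email<TAB>reason" entry per line instead of an ObjectOutputStream dump; the class name, the method names and the 4096-byte line limit are all hypothetical. The file is mapped outside the Java heap and scanned once, and only the entries matching the current batch of emails are kept on the heap.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MappedLookupSketch {

    // Assumption: no line in the flat file is longer than this many bytes.
    private static final int MAX_LINE_BYTES = 4096;

    /** Scans a memory-mapped "email<TAB>reason" file and returns the entries whose email is in batchEmails. */
    static Map<String, String> findInvalid(Path flatFile, Set<String> batchEmails) throws IOException {
        Map<String, String> hits = new HashMap<>();
        try (FileChannel channel = FileChannel.open(flatFile, StandardOpenOption.READ)) {
            // The mapping lives outside the Java heap. A single mapping is limited to 2 GB.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] line = new byte[MAX_LINE_BYTES];
            int length = 0;
            while (buffer.hasRemaining()) {
                byte b = buffer.get();
                if (b == '\n') {
                    collect(line, length, batchEmails, hits);
                    length = 0;
                } else if (length < line.length) {
                    line[length++] = b;
                }
            }
            collect(line, length, batchEmails, hits); // last line without a trailing newline
        }
        return hits;
    }

    // Parses one line and keeps it only if its email belongs to the current batch.
    private static void collect(byte[] line, int length, Set<String> batchEmails, Map<String, String> hits) {
        String text = new String(line, 0, length, StandardCharsets.UTF_8).trim();
        int tab = text.indexOf('\t');
        if (tab <= 0) {
            return; // empty or malformed line
        }
        String email = text.substring(0, tab);
        if (batchEmails.contains(email)) {
            hits.put(email, text.substring(tab + 1));
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("invalid-leads", ".tsv");
        file.toFile().deleteOnExit();
        Files.write(file, List.of("a@example.com\tbounced", "b@example.com\tunsubscribed"));
        System.out.println(findInvalid(file, Set.of("b@example.com", "c@example.com")));
    }
}
```

With a batch of 1000 emails, heap usage is bounded by the batch size rather than by the 0.2 to 0.25 million warehouse records. The price is one sequential scan of the file per batch.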
This is a typical use case for batch-processing code. You could introduce a new column, say 'isProcessed', with the value 'false'. Read, say, rows 1-100 (where id >= 1 && id <= 100), process them and mark that flag as true. Then read the next 100, and so on. At the end of the job, mark all flags as false again (reset). But over time, it might become difficult to maintain and develop features on such a custom framework. There are open source alternatives.
Spring Batch is a very good framework for these use cases and could be considered: https://docs.spring.io/spring-batch/trunk/reference/html/spring-batch-intro.html.
There is also Easy Batch: https://github.com/j-easy/easy-batch, which doesn't require knowledge of the Spring framework.
But if it is a huge data set that will keep growing in the future, consider moving to a 'big data' tech stack.
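The hand-rolled variant described above (read an id range, process it, set the flag, reset at the end) could look roughly like the sketch below. Everything in it is hypothetical: the Row class, the method names, and above all the in-memory list that stands in for the table so the example runs without a database. In a real job, readChunk would be a query such as "WHERE id BETWEEN ? AND ? AND isProcessed = false".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ChunkedJobSketch {

    /** Hypothetical row carrying the 'isProcessed' flag column. */
    static final class Row {
        final long id;
        final String email;
        boolean processed;

        Row(long id, String email) {
            this.id = id;
            this.email = email;
        }
    }

    /** In-memory stand-in for a query selecting unprocessed rows with fromId <= id <= toId. */
    static List<Row> readChunk(List<Row> table, long fromId, long toId) {
        List<Row> chunk = new ArrayList<>();
        for (Row row : table) {
            if (!row.processed && row.id >= fromId && row.id <= toId) {
                chunk.add(row);
            }
        }
        return chunk;
    }

    /** Processes ids 1..maxId in chunks of chunkSize and returns the number of chunks read. */
    static int run(List<Row> table, long maxId, int chunkSize, Consumer<Row> work) {
        int chunks = 0;
        for (long from = 1; from <= maxId; from += chunkSize) {
            long to = Math.min(from + chunkSize - 1, maxId);
            for (Row row : readChunk(table, from, to)) {
                work.accept(row);
                row.processed = true; // mark the flag once the row is done
            }
            chunks++;
        }
        for (Row row : table) {
            row.processed = false; // reset at the end of the job
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        for (long id = 1; id <= 250; id++) {
            table.add(new Row(id, "user" + id + "@example.com"));
        }
        int chunks = run(table, 250, 100, row -> { /* filter or transform the row here */ });
        System.out.println("chunks read: " + chunks);
    }
}
```

Only one chunk is held in memory at a time, which is the point of the approach; frameworks such as Spring Batch add restartability and bookkeeping on top of the same idea.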
I am facing an optimization problem in Java. I have to process a table that has 5 attributes and contains about 5 million records. To simplify the problem, let's say I have to read every record one by one and then process each record. From each record I have to generate a mathematical lattice structure that has 500 nodes. In other words, each record generates 500 new records, which can be referred to as parents of the original record. So in total there are 500 x 5 million (about 2.5 billion) records, including the original and parent records. The job is to find the distinct records among all 500 x 5 million records, together with their frequencies. Currently I have solved this problem as follows. I convert every record to a string, with the value of each attribute separated by "-", and I count them in a Java HashMap. Since these records involve intermediate processing, a record is converted to a string and then back to a record during intermediate steps. The code is tested; it works fine and produces accurate results for a small number of records, but it cannot process 500 x 5 million records.
For a large number of records it produces the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I understand that the number of distinct records is certainly not more than 50 thousand, which means that the data should not cause memory or heap overflow. Can anyone suggest any options? I will be very thankful.
Most likely, you have some data structure somewhere that is keeping references to the processed records, also known as a "memory leak". It sounds like you intend to process each record in turn and then throw away all the intermediate data, but in fact the intermediate data is being kept around. The garbage collector can't throw away this data if you have some collection or something else still pointing to it.
Note also that there is the very important Java runtime parameter "-Xmx". Without any further detail than what you've provided, I would have thought that 50,000 records would fit easily into the default values, but maybe not. Try doubling -Xmx (hopefully your computer has enough RAM). If this solves the problem, then great. If it just gets you twice as far before it fails, then you know it's an algorithm problem.
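As an illustration of counting without retaining intermediate data, the sketch below keeps only one counter per distinct key and builds every derived record in a reused buffer. The lattice generation is a hypothetical stand-in, not the asker's algorithm: it replaces every subset of attributes with "*", which yields 2^n keys per record rather than 500. Class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencyCountSketch {

    /**
     * Hypothetical stand-in for the lattice generation: every subset of
     * attributes is replaced by "*", and each resulting key is counted.
     */
    static void countGeneralizations(String[] record, Map<String, Long> counts) {
        int n = record.length;
        StringBuilder key = new StringBuilder(); // reused for every derived record
        for (int mask = 0; mask < (1 << n); mask++) {
            key.setLength(0);
            for (int i = 0; i < n; i++) {
                if (i > 0) {
                    key.append('-');
                }
                key.append((mask & (1 << i)) != 0 ? "*" : record[i]);
            }
            // Only the counter survives; the derived record itself is not stored anywhere.
            counts.merge(key.toString(), 1L, Long::sum);
        }
    }

    /** Streams over the records once; memory grows with the number of distinct keys only. */
    static Map<String, Long> countAll(Iterable<String[]> records) {
        Map<String, Long> counts = new HashMap<>();
        for (String[] record : records) {
            countGeneralizations(record, counts);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(new String[] {"a", "b"}, new String[] {"a", "c"});
        System.out.println(countAll(records));
    }
}
```

If a program shaped like this still runs out of memory, the records are most likely being collected somewhere else (for example in a list built while reading the table), which is the leak described above.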
A SQLite database can be used to store the data (1.3 TB?). With queries you can quickly find information again. The data also stays saved when your program ends.
You probably need to adopt a different approach to calculating the frequencies of occurrence. Brute force is great when you only have a few million :)
For instance, after your calculation of the 'lattice structure' you could combine that with the original data and take either the MD5 or the SHA1 hash of it. This should be unique except when the data is not "distinct", which should then reduce your total data back down below 5 million.
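A small sketch of the hashing step, using the JDK's MessageDigest. The class and method names are hypothetical, and the "-"-separated record string is borrowed from the question. The fixed-length digest can serve as the map key instead of the full record string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestKeySketch {

    /** Returns the SHA-1 digest of the record string as 40 lowercase hex characters. */
    static String sha1Hex(String recordKey) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(recordKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide SHA-1.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Equal record strings always map to the same key.
        System.out.println(sha1Hex("1-2-3-4-5").equals(sha1Hex("1-2-3-4-5")));
    }
}
```

Note that a 40-character hex string is not necessarily smaller than a short record string, so this only pays off when the records are long; storing the raw 20 digest bytes would be more compact.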
In my project I am generating a report. This involves a huge data transfer from the DB.
The logic is as follows: the user gives certain criteria, based on which we first fetch parent items from the DB. There may be 100,000 parent items. Not only that, after getting these items we gather the child items of these parent items and their details. We put all of this parent and child information together in one response XML.
This is fine for a small number of records, but for a huge number of records it takes much longer. We are using a tool as the back-end system that stores the records. It has its own query set, so query optimization did not work. We have to do all of it with Java.
Can anyone give some ideas on how to optimize this?
Not really a true answer, but too long for a comment
You must benchmark the different steps:
database - time a select extracting all the records (parents + children) directly on the database (assuming it is a simple database)
network - time a transfer of the approximate size of the whole set of records
processing - store the result in a local file and time the processing when reading from the local file (you must also time a copy of the file to know the time used to read from disk)
Multithreading will only help if the bottleneck is processing.
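A minimal way to take the per-step measurements listed above is to wrap each phase in a timer. The helper below is a hypothetical sketch; the step names and the sleeps in main are placeholders for the real database call, transfer and processing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

class StepTimerSketch {

    // Keeps the steps in the order they were first timed.
    private final Map<String, Long> millisByStep = new LinkedHashMap<>();

    /** Runs the work and adds its wall-clock duration to the named step. */
    void time(String step, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            millisByStep.merge(step, elapsed, Long::sum);
        }
    }

    Map<String, Long> results() {
        return millisByStep;
    }

    // Placeholder for real work.
    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StepTimerSketch timer = new StepTimerSketch();
        timer.time("database", () -> pause(30));   // replace with the select
        timer.time("network", () -> pause(20));    // replace with the transfer
        timer.time("processing", () -> pause(10)); // replace with building the XML
        timer.results().forEach((step, ms) -> System.out.println(step + ": " + ms + " ms"));
    }
}
```

Comparing the three numbers shows where the time goes before any threads are added.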
We have a part of our application that needs to load a large set of data (>2000 entities) and perform computation on this set. The size of each entity is approximately 5 KB.
In our initial, naïve implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We have tried several strategies to speed up the entity retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The computation needs to be dynamically computed, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200,000 entities of the same kind in the datastore and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that needs to be analyzed to know whether it's a good investment or not.
The analysis performed requires complex computations based on many factors both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
The user could request the derivatives to be sorted based on its investment score and ask to be presented with N-number of highest-scored derivatives.
200,000 × 5 KB is about 1 GB. You could keep all of this in memory on the largest backend instance, or spread it over multiple instances. This would be the fastest solution - nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.
This is very interesting, but yes, it's possible, and I've seen some mind-boggling results.
I would have done the same: the map-reduce concept.
It would be great if you could provide more metrics on how many parallel instances you use and what the results of each instance are.
Also, does your process include retrieval alone, or retrieval and storing?
How many elements do you have in your data store? 4000? 10000? The reason I ask is that you could cache them from the previous request.
regards
In the end, it does not appear that we can retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we find a better strategy/implementation for this problem, I will change or update the accepted answer.
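Once the entities are held in the in-memory cache, the "N highest-scored" request from the question can be served in a single pass with a bounded min-heap, so the full set never has to be sorted. This is only a sketch under that assumption: the class name is made up, and the score function passed in stands for the real "investment score" computation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.ToDoubleFunction;

class TopNSketch {

    /** Pairs an entity with its score so the score is computed only once. */
    private static final class Scored<T> {
        final T entity;
        final double score;

        Scored(T entity, double score) {
            this.entity = entity;
            this.score = score;
        }
    }

    /** Returns the n highest-scored entities, best first, using O(n) extra memory. */
    static <T> List<T> topN(Iterable<T> entities, int n, ToDoubleFunction<? super T> score) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        Comparator<Scored<T>> byScore = Comparator.comparingDouble(s -> s.score);
        PriorityQueue<Scored<T>> heap = new PriorityQueue<>(byScore); // weakest kept entity on top
        for (T entity : entities) {
            double value = score.applyAsDouble(entity);
            if (heap.size() < n) {
                heap.add(new Scored<>(entity, value));
            } else if (value > heap.peek().score) {
                heap.poll(); // drop the current weakest
                heap.add(new Scored<>(entity, value));
            }
        }
        List<Scored<T>> sorted = new ArrayList<>(heap);
        sorted.sort(byScore.reversed());
        for (Scored<T> scored : sorted) {
            result.add(scored.entity);
        }
        return result;
    }

    public static void main(String[] args) {
        // Integers stand in for cached entities; the value itself is the score.
        System.out.println(topN(List.of(3, 9, 1, 7, 5), 2, Integer::doubleValue));
    }
}
```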
Our solution involves periodically reading entities in a background task and storing the result in a json blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in javascript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me, so we're using a non-optimal task to do all the reading, but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1MB so at the moment we're loading the content from a compressed entity and expanding it into memory, that way the final payload gets gzipped. I don't like that, it introduces latency, I hope they fix issues with GZIP soon!
Here is the scenario I am researching a solution for at work. We have a table in Postgres which stores events happening on the network. Currently, rows get inserted as network events come in, and at the same time older records that match a specific timestamp get deleted in order to keep the table size limited to some 10,000 records. Basically, it is a similar idea to log rotation. Network events come in bursts of thousands at a time, so the transaction rate is too high, which causes performance degradation; after some time the server either crashes or becomes very slow. On top of that, the customer is asking to keep the table size at up to a million records, which is going to accelerate the performance degradation (since we have to keep deleting records matching a specific timestamp) and cause space management issues. We are using simple JDBC to read from and write to the table. Can the tech community out there suggest a better-performing way to handle inserts and deletes in this table?
I think I would use partitioned tables, perhaps 10 x total desired size, inserting into the newest, and dropping the oldest partition.
http://www.postgresql.org/docs/9.0/static/ddl-partitioning.html
This makes load on "dropping oldest" much smaller than query and delete.
Update: I agree with nos' comment though, the inserts/deletes may not be your bottleneck. Maybe some investigation first.
Some things you could try -
Write to a log, have a separate batch proc. write to the table.
Keep the writes as they are, do the deletes periodically or at times of lower traffic.
Do the writes to a buffer/cache, have the actual db writes happen from the buffer.
A few general suggestions -
Since you're deleting based on timestamp, make sure the timestamp is indexed. You could also do this with a counter / auto-incremented rowId (e.g. delete where id< currentId -1000000).
Also, JDBC batch write is much faster than individual row writes (order of magnitude speedup, easily). Batch writing 100 rows at a time will help tremendously, if you can buffer the writes.
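The buffering and batch-write suggestions above can be combined roughly as follows. This is a sketch, not a drop-in implementation: the Event class, the network_events table and its columns are hypothetical, and the buffer is not thread-safe. The JDBC calls themselves (prepareStatement, addBatch, executeBatch) are standard java.sql API.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchInsertSketch {

    /** Hypothetical network event. */
    static final class Event {
        final Timestamp occurredAt;
        final String payload;

        Event(Timestamp occurredAt, String payload) {
            this.occurredAt = occurredAt;
            this.payload = payload;
        }
    }

    /** Collects events and hands them to the sink in chunks of batchSize. Not thread-safe. */
    static final class EventBuffer {
        private final int batchSize;
        private final Consumer<List<Event>> sink;
        private final List<Event> pending = new ArrayList<>();

        EventBuffer(int batchSize, Consumer<List<Event>> sink) {
            this.batchSize = batchSize;
            this.sink = sink;
        }

        void add(Event event) {
            pending.add(event);
            if (pending.size() >= batchSize) {
                flush();
            }
        }

        /** Call this at the end of a burst as well, so a partial chunk is not left behind. */
        void flush() {
            if (pending.isEmpty()) {
                return;
            }
            sink.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }

    /** Writes one chunk with a single executeBatch call; table and column names are assumptions. */
    static void insertBatch(Connection connection, List<Event> events) throws SQLException {
        String sql = "INSERT INTO network_events (occurred_at, payload) VALUES (?, ?)";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            for (Event event : events) {
                statement.setTimestamp(1, event.occurredAt);
                statement.setString(2, event.payload);
                statement.addBatch();
            }
            statement.executeBatch();
        }
    }
}
```

In use, the sink passed to EventBuffer would call insertBatch with an open connection and deal with the SQLException; running each chunk inside one transaction instead of auto-commit usually reduces the per-row overhead further.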
I have a requirement where I have to select 60 million plus records from a database. Once I have all the records in a ResultSet, I have to format some columns as per the client's requirements (date format and number format) and then write all the records to a file (secondary storage).
Currently I am selecting records on a per-day basis (7 selects for 7 days) from the DB and putting them in a HashMap. Then I read from the HashMap, format some columns and finally write to a file (a separate file for each of the 7 days).
Finally I merge all 7 files into a single file.
But this whole process takes 6 hours to complete. To improve it, I created 7 threads for the 7 days, and all threads write separate files.
Finally I merge all 7 files into a single file. This process takes 2 hours, but my program runs into OutOfMemory after an hour or so.
Please suggest the best design for this scenario. Should I use some caching mechanism? If yes, which one and how?
Note: The client doesn't want to change anything in the database, such as creating indexes or stored procedures; they don't want to touch the database.
Thanks in advance.
Do you need to have all the records in memory to format them? You could try to stream the records through a process and straight to the file. If you're able to break the query up even further, you might be able to start processing the results while you're still retrieving them.
Depending on your DB backend, there might be tools to help with this, such as SSIS for SQL Server 2005+.
Edit
I'm a .NET developer, so let me suggest what I would do in .NET, and hopefully you can convert it into comparable technologies on the Java side.
ADO.Net has a DataReader which is a forward only, read only (Firehose) cursor of a resultset. It returns data as the query is executing. This is very important. Essentially, my logic would be:
IDataReader reader=GetTheDataReader(dayOfWeek);
while (reader.Read())
{
file.Write(formatRow(reader));
}
Since this executes while rows are being returned, you're not going to block on network access, which I am guessing is a huge bottleneck for you. The key here is that we are not storing any of this in memory for long: as we cycle, the reader discards the results and the file writes the row to disk.
I think what Josh is suggesting is this:
You have loops, where you currently go through all the result records of your query (just using pseudo code here):
while (rec = getNextRec() )
{
put in hash ...
}
for each rec in (hash)
{
format and save back in hash ...
}
for each rec in (hash)
{
write to a file ...
}
instead, do it like this:
while (rec = getNextRec() )
{
format fields ...
write to the file ...
}
then you never have more than 1 record in memory at a time ... and you can process an unlimited number of records.
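On the Java side, the single read-format-write loop above might be sketched with a forward-only, read-only JDBC ResultSet as follows. The column positions, the date pattern and the two-decimal number format are assumptions made for the example, as are the class and method names; how much the driver actually streams depends on the driver and on how it honours the fetch-size hint.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class StreamingExportSketch {

    // Assumed client formats: dd/MM/yyyy dates and amounts with two decimals.
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    /** Formats one row as a line of text. */
    static String formatRow(LocalDate date, BigDecimal amount) {
        return DATE_FORMAT.format(date) + "," + amount.setScale(2, RoundingMode.HALF_UP).toPlainString();
    }

    /** Reads the query result row by row and writes each row straight to the target file. */
    static long export(Connection connection, String sql, Path target) throws SQLException, IOException {
        long rows = 0;
        try (Statement statement = connection.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            statement.setFetchSize(1000); // a hint only; drivers differ in how they honour it
            try (ResultSet resultSet = statement.executeQuery(sql);
                 BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                while (resultSet.next()) {
                    // Assumes column 1 is a non-null DATE and column 2 a non-null NUMERIC.
                    out.write(formatRow(resultSet.getDate(1).toLocalDate(), resultSet.getBigDecimal(2)));
                    out.newLine();
                    rows++;
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(formatRow(LocalDate.of(2020, 1, 31), new BigDecimal("1234.5")));
        // prints 31/01/2020,1234.50
    }
}
```

No HashMap is involved: each row is formatted and written before the next one is fetched, so heap usage stays flat regardless of the row count.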
Obviously reading 60 million records at once is using up all your memory, so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time is using up all your time, so you can't do that either (i.e. your initial read-to-file model).
So.... you're going to have to compromise and do a bit of both.
Josh has it right - open a cursor to your DB that simply reads the next record, one after the other in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want here as it imposes the least load on the database. The DB isn't going to let you update the records, or go backwards in the recordset, which you don't want anyway, so it won't need to handle memory for the records.
Now you have this cursor, you're being given 1 record at a time by the DB - read it, and write it to a file (or several files), this should finish quite quickly. Your task then is to merge the files into 1 with the correct order, which is relatively easy.
Given the quantity of records you have to process, I think this is the optimum solution for you.
But... seeing as you're doing quite well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in many companies; this just seems to be another one of those processes.
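For the final merge step mentioned above, the per-day files can be appended to the target in order without loading any of them into memory. A minimal sketch with hypothetical class and method names; it assumes each part already ends with a line break, since the bytes are copied as they are.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class MergeFilesSketch {

    /** Appends the parts to the target in list order, streaming the bytes through a small buffer. */
    static void merge(List<Path> parts, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
            for (Path part : parts) {
                Files.copy(part, out); // copies the whole file to the stream
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path day1 = Files.createTempFile("day1", ".csv");
        Path day2 = Files.createTempFile("day2", ".csv");
        Path all = Files.createTempFile("week", ".csv");
        Files.write(day1, "row-1\n".getBytes(StandardCharsets.UTF_8));
        Files.write(day2, "row-2\n".getBytes(StandardCharsets.UTF_8));
        merge(List.of(day1, day2), all);
        System.out.print(new String(Files.readAllBytes(all), StandardCharsets.UTF_8));
    }
}
```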
Depends on the database you are using, but if it was SQL Server, I would recommend using something like SSIS to do this rather than writing a program.