I am trying to process a large number of rows imported from a Hive table (hundreds of millions of rows), and the output will be much larger still.
I need to generate new rows when certain conditions hold, but that is not the problem. The problem is how to store these Hive rows.
At the moment I use an ArrayList of objects, because ordering is very important for my algorithm that inserts the new rows, but I get a "GC overhead limit exceeded" error.
You should be retrieving one page of results at a time into a new ArrayList, processing the rows in that page, writing new rows as necessary, and then loading the next result page into a new ArrayList. Garbage collection (GC) will clean up the older ArrayLists.
Ordering of the results should be done with an "ORDER BY" clause on your database query.
If you are inserting into the same table and want to avoid reprocessing the rows you are adding, then you'll need a column on the table to differentiate the new rows from the existing ones, such as an auto-incremented "id" or a "date_created".
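A minimal sketch of that paging loop over JDBC, assuming the source supports this style of query; the table name, the auto-incremented "id" column, and the processPage(...) callback are placeholders to adapt to your own schema and logic:

import java.sql.*;
import java.util.*;

// Keyset paging: fetch one page at a time, ordered by an auto-incremented id,
// so only the current page's ArrayList is ever reachable.
static void pageThroughRows(String jdbcUrl, int pageSize) throws SQLException {
    String sql = "SELECT * FROM source_rows WHERE id > ? ORDER BY id LIMIT ?";
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         PreparedStatement ps = conn.prepareStatement(sql)) {
        long lastId = 0;
        while (true) {
            ps.setLong(1, lastId);
            ps.setInt(2, pageSize);
            List<String[]> page = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                ResultSetMetaData md = rs.getMetaData();
                while (rs.next()) {
                    lastId = rs.getLong("id");
                    String[] row = new String[md.getColumnCount()];
                    for (int c = 1; c <= row.length; c++) row[c - 1] = rs.getString(c);
                    page.add(row);
                }
            }
            if (page.isEmpty()) break;   // no more pages
            processPage(page);           // placeholder: apply your conditions, insert new rows
            // the previous page's ArrayList is now unreferenced and eligible for GC
        }
    }
}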
I am trying to process a large number of rows imported from a Hive table... but I get a "GC overhead limit exceeded" error.
You have a couple of options here. As @DarrenKennedy mentions, if there were some way to process only a page at a time, that would be optimal; however, it sounds like that isn't an option because of your custom sorting.
So you have two possibilities:
Increase the size of your memory to fit all of the rows. I suspect that this also isn't an option but I thought I'd state it.
Process a batch in memory, write it to disk, then process the batch files at the end. See below.
To process in batches you will need to read a chunk of your input rows that fits comfortably into memory, sort and filter it, and then write each chunk out to its own local temporary file. If you are running on a cloud system, ephemeral storage works well for this.
Once you have processed all of the Hive input rows and sorted and written them into a set of temp files, you will need to go back, read the sorted rows, and merge them back together. Here's some pseudocode:
Open all of your temp files using a BufferedReader or some such.
Read in the first line of all of the files.
Output the lowest line from all of the temp files. Read the next line from that file.
Repeat until all of the temporary files have been processed.
Close and delete the temp files.
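A Java sketch of that merge step, assuming the temp files are already sorted line by line and that plain String ordering matches your sort key (swap in your own comparator if not):

import java.io.*;
import java.nio.file.*;
import java.util.*;

// K-way merge of sorted temp files using a priority queue of "current line per file".
static void mergeSortedTempFiles(List<Path> tempFiles, Path output) throws IOException {
    PriorityQueue<Map.Entry<String, BufferedReader>> heap =
            new PriorityQueue<>(Map.Entry.<String, BufferedReader>comparingByKey());
    List<BufferedReader> readers = new ArrayList<>();
    try (BufferedWriter out = Files.newBufferedWriter(output)) {
        for (Path p : tempFiles) {
            BufferedReader r = Files.newBufferedReader(p);
            readers.add(r);
            String first = r.readLine();                      // read the first line of each file
            if (first != null) heap.add(new AbstractMap.SimpleEntry<>(first, r));
        }
        while (!heap.isEmpty()) {
            Map.Entry<String, BufferedReader> smallest = heap.poll();
            out.write(smallest.getKey());                     // output the lowest line
            out.newLine();
            String next = smallest.getValue().readLine();     // read the next line from that file
            if (next != null) heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
        }
    } finally {
        for (BufferedReader r : readers) r.close();
        for (Path p : tempFiles) Files.deleteIfExists(p);     // close and delete the temp files
    }
}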
Hope this helps.
Related
Background: I am using SQLite to store around 10M entries, where the size of each entry is around 1 KB. I am reading this data back in chunks of around 100K entries at a time, using multiple parallel threads. Reads and writes do not run in parallel, and all the writes are done before the reads start.
Problem: I am experiencing too many disk reads. Around 3k reads happen each second, and I am reading only 30 KB of data in those 3k reads (hence around 100 bytes per disk read). As a result, I am seeing really horrible performance (it takes around 30 minutes to read the data).
Questions:
Are there any SQLite settings/pragmas that I can use to avoid such small disk reads?
Are there any best practices for batched parallel reads in SQLite?
Does SQLite read all the results of a query in one go, or does it read the results in smaller chunks? If the latter is the case, where does it store the partial results of a query?
Implementation details: I am using SQLite with Java, and my application runs on Linux. The JDBC library is https://github.com/xerial/sqlite-jdbc (version 3.20.1).
P.S. I have already built the necessary indexes and verified that no table scans are happening (using EXPLAIN QUERY PLAN).
When you are searching for data with an index, the database first looks up the value in the index, and then goes to the corresponding table row to read all the other columns.
Unless the table rows happen to be stored in the same order as the values in the index, each such table read must go to a different page.
Indexes speed up searches only if the search reduces the number of rows. If you're going to read all (or most of) the rows anyway, a table scan will be much faster.
Parallel reads will be more efficient only if the disk can actually handle the additional I/O. On rotating disks, the additional seeks will just make things worse.
(SQLite tries to avoid storing temporary results. Result rows are computed on the fly (as much as possible) while you're stepping through the cursor.)
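To illustrate that last point with the xerial driver, here is a hedged sketch of stepping through a result set one row at a time; the table and column names are placeholders, and each call to next() advances the underlying statement rather than materialising the whole result:

import java.sql.*;

// Stream a large SQLite result set row by row; nothing is retained between steps.
// Table "entries" and its columns are placeholders for your schema.
static void scanEntries(String dbPath) throws SQLException {
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT key, payload FROM entries")) {
        while (rs.next()) {                      // rows are produced as you step
            String key = rs.getString("key");
            byte[] payload = rs.getBytes("payload");
            // process the row here, then move on; nothing else is kept in memory
        }
    }
}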
I am working on a MySQL and Java application.
I have a requirement to save as many files as possible, along with some other records, in the DB. It should support at least 5 million records.
My question: saving this many large records with files is quite difficult, and I can hit a "Packet for query is too large" exception.
I have two options:
Increase the size of MAX_ALLOWED_PACKET.
Insert the records in limited batches via some job trigger (for example, above 300k records, split the records into batches and save them with separate queries, but I'm not exactly sure how to split the records; a rough sketch appears at the end of this question).
I do not want to increase MAX_ALLOWED_PACKET, as a user can enter any number of records. Say we increase the size to accept 500k records; if a user later sends a million records to insert, we will hit the same exception again.
Is there any better solution for this?
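A minimal sketch of option 2, splitting the inserts into JDBC batches so no single packet grows too large; the table, columns, and batch size here are placeholders rather than my real schema:

import java.sql.*;
import java.util.*;

// Insert rows in fixed-size JDBC batches so no single statement/packet gets huge.
// Table "records" and its two columns are placeholders.
static void insertInBatches(Connection conn, List<String[]> rows) throws SQLException {
    final int batchSize = 10_000;  // tune so each batch stays well under max_allowed_packet
    String sql = "INSERT INTO records (name, payload) VALUES (?, ?)";
    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        int count = 0;
        for (String[] row : rows) {
            ps.setString(1, row[0]);
            ps.setString(2, row[1]);
            ps.addBatch();
            if (++count % batchSize == 0) {
                ps.executeBatch();   // send this batch on its own
                conn.commit();
            }
        }
        ps.executeBatch();           // flush the final partial batch
        conn.commit();
    }
}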
I found a solution: rather than processing the txt file directly, I zip it first and then send it for insertion, and it is able to process 5 million records.
Posting it here so that anyone facing the same issue can try this solution.
I am working with two million records spread across multiple input files (fixed-width, space-separated, with 45 columns). I have to sort them and then consolidate them together. Previously I was working with ArrayLists: generating beans, storing them in the ArrayLists, then sorting and consolidating. This worked fine when there were fewer records, but when I combined all the input files it threw a heap space (out of memory) exception.
To counter this I started using an MS Access database, reading all my input files and putting them into Access tables over a JDBC-ODBC connection. Now this step alone is taking 5 hours to read the files and store them in the DB.
I still have to combine and sort these files as well.
Please point me in the right direction.
The goal: to combine, sort, and consolidate multiple input files with more than 2 million records and generate the output file based on the specification.
For starters you could look into a more robust database. MySQL (or a fork of it, MariaDB) should be better suited to work with the volume of data you are after.
Reading the files could also be done asynchronously, which should further speed up the process.
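A hedged sketch of that, reading the input files concurrently with an ExecutorService; the whitespace split stands in for your real 45-column fixed-width parsing, and rowHandler (which must be thread-safe) is a placeholder for whatever stores the rows, e.g. batched inserts into MySQL:

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Consumer;

// Read the input files concurrently; each task streams its file line by line
// and hands parsed rows to rowHandler (a thread-safe placeholder callback).
static void readFilesConcurrently(List<Path> files, Consumer<String[]> rowHandler)
        throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(
            Math.max(1, Math.min(files.size(), Runtime.getRuntime().availableProcessors())));
    List<Future<?>> futures = new ArrayList<>();
    for (Path file : files) {
        futures.add(pool.submit(() -> {
            try (BufferedReader reader = Files.newBufferedReader(file)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    rowHandler.accept(line.trim().split("\\s+"));  // placeholder parsing
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }));
    }
    pool.shutdown();
    for (Future<?> f : futures) f.get();   // surface any read or parse failure
}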
As an extra note, depending on what type of consolidation you are after, you could also look into External Sorting algorithms, which are explicitly designed to sort data which does not all fit in memory.
I am facing an optimization problem in Java. I have to process a table with 5 attributes containing about 5 million records. To simplify the problem, let's say I have to read every record one by one and then process it. From each record I have to generate a mathematical lattice structure with 500 nodes; in other words, each record generates 500 new records, which can be thought of as parents of the original record. So in total there are 500 x 5 million records, including the original and parent records. The job is to find the number of distinct records among all 500 x 5 million records, along with their frequencies.
Currently I solve this as follows: I convert every record to a string, with the value of each attribute separated by "-", and count the strings in a Java HashMap. Since the records go through intermediate processing, a record is converted to a string and back to a record during the intermediate steps. The code is tested and works fine, producing accurate results for a small number of records, but it cannot process 500 x 5 million records.
For a large number of records it produces the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I know for sure that the number of distinct records is no more than 50 thousand, which means the data itself should not cause a memory or heap overflow. Can anyone suggest any options? I would be very thankful.
Most likely, you have a data structure somewhere that is keeping references to the processed records, also known as a "memory leak". It sounds like you intend to process each record in turn and then throw away all the intermediate data, but in fact the intermediate data is being kept around. The garbage collector can't throw away this data if some collection or other object is still pointing to it.
Note also that there is a very important Java runtime parameter, "-Xmx". Without any further detail than what you've provided, I would have thought that 50,000 records would fit easily into the default heap, but maybe not. Try doubling -Xmx (hopefully your computer has enough RAM). If this solves the problem then great. If it just gets you twice as far before it fails, then you know it's an algorithm problem.
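For example, if the application is launched as a runnable JAR (the jar name and heap size here are just placeholders), doubling the maximum heap might look like: java -Xmx4g -jar your-app.jar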
An SQLite database could be used to save that much (1.3 TB?) data. With queries you can quickly find the information again, and the data is also kept when your program ends.
You probably need to adopt a different approach to calculating the frequencies of occurrence. Brute force is great when you only have a few million :)
For instance, after your calculation of the 'lattice structure' you could combine that with the original data and take either the MD5 or SHA-1 hash of it. This should be unique except when the data is not "distinct", which should reduce your total data back down below 5 million.
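A hedged sketch of that idea, hashing each generated record and counting the digests in a HashMap, so only the roughly 50k distinct digests and their counts are ever held in memory; the "-"-joined string is the representation described in the question, and the record layout is a placeholder:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Count distinct generated records by hashing them, keeping only digests and counts
// in memory instead of 500 x 5M full strings.
static Map<String, Long> countDistinct(Iterable<String[]> generatedRecords)
        throws NoSuchAlgorithmException {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    Map<String, Long> counts = new HashMap<>();
    for (String[] record : generatedRecords) {
        String key = String.join("-", record);              // the "-"-separated form
        byte[] digest = sha1.digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        counts.merge(hex.toString(), 1L, Long::sum);        // frequency per distinct record
    }
    return counts;
}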
I have a requirement where I have to select around 60 million plus records from a database. Once I have all the records in a ResultSet, I have to format some columns as per the client's requirements (date format and number format) and then write all the records to a file (secondary storage).
Currently I am selecting records on a per-day basis (7 selects for 7 days) from the DB and putting them in a HashMap, then reading from the HashMap, formatting some columns, and finally writing to a file (a separate file for each of the 7 days).
Finally I merge all 7 files into a single file.
But this whole process takes 6 hours to complete. To improve it, I created 7 threads for the 7 days, with each thread writing its own file.
Finally I merge all 7 files into a single file. This process takes 2 hours, but my program goes OutOfMemory after an hour or so.
Please suggest the best design for this scenario. Should I use some caching mechanism? If yes, which one, and how?
Note: the client doesn't want to change anything at the database, like creating indexes or stored procedures; they don't want to touch the database.
Thanks in advance.
Do you need to have all the records in memory to format them? You could try to stream the records through the process and straight to the file. If you're able to break the query up even further, you might be able to start processing the results while you're still retrieving them.
Depending on your DB backend, there might be tools to help with this, such as SSIS for SQL Server 2005+.
Edit
I'm a .NET developer, so let me suggest what I would do in .NET, and hopefully you can convert it into comparable technologies on the Java side.
ADO.NET has a DataReader, which is a forward-only, read-only ("firehose") cursor over a result set. It returns data as the query is executing, which is very important. Essentially, my logic would be:
IDataReader reader = GetTheDataReader(dayOfWeek);

while (reader.Read())
{
    file.Write(formatRow(reader));
}
Since this executes while we are returning rows, you're not going to block on network access, which I'm guessing is a huge bottleneck for you. The key here is that we don't store any of this in memory for long; as we cycle, the reader discards each result and the file writes the row to disk.
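A hedged Java equivalent of that loop, using a plain JDBC forward-only, read-only ResultSet with a streaming fetch size; the query, formatRow(...) logic, and fetch-size value are placeholders, and some drivers need extra settings to truly stream (MySQL's Connector/J, for example, wants a fetch size of Integer.MIN_VALUE or useCursorFetch=true):

import java.io.*;
import java.nio.file.*;
import java.sql.*;

// Stream one day's rows straight from the DB to a file, never holding them all in memory.
// The query and formatRow(...) are placeholders for the real schema and formatting rules.
static void exportDay(Connection conn, java.sql.Date day, Path outFile)
        throws SQLException, IOException {
    String sql = "SELECT * FROM big_table WHERE record_date = ?";   // placeholder query
    try (PreparedStatement ps = conn.prepareStatement(
                 sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
         BufferedWriter out = Files.newBufferedWriter(outFile)) {
        ps.setFetchSize(1_000);        // hint to the driver to fetch in small chunks
        ps.setDate(1, day);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                out.write(formatRow(rs));   // date/number formatting goes here
                out.newLine();
            }
        }
    }
}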
I think what Josh is suggesting is this:
You have loops, where you currently go through all the result records of your query (just using pseudo code here):
while (rec = getNextRec() )
{
put in hash ...
}
for each rec in (hash)
{
format and save back in hash ...
}
for each rec in (hash)
{
write to a file ...
}
instead, do it like this:
while (rec = getNextRec() )
{
format fields ...
write to the file ...
}
Then you never have more than one record in memory at a time, and you can process an unlimited number of records.
Obviously, reading 60 million records at once uses up all your memory, so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time uses up all your time, so you can't do that either (i.e. your initial read-to-file model).
So.... you're going to have to compromise and do a bit of both.
Josh has it right - open a cursor to your DB that simply reads the next record, one after the other, in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want here, as it imposes the least load on the database. The DB isn't going to let you update the records or go backwards in the recordset, which you don't want anyway, so it won't need to hold on to memory for the records.
Now that you have this cursor, the DB gives you one record at a time: read it and write it to a file (or several files). This should finish quite quickly. Your task is then to merge the files into one in the correct order, which is relatively easy.
Given the quantity of records you have to process, I think this is the optimum solution for you.
But... seeing as you're doing quite well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in many companies; this just seems to be another one of those processes.
It depends on the database you are using, but if it is SQL Server, I would recommend using something like SSIS to do this rather than writing a program.