Let's consider a scenario
Accounts.csv
Transaction.csv
We have a mapping of each account number to its transaction details, so one account number can have multiple transactions. Using these details we have to generate a PDF for each account.
Suppose the transaction CSV file is very large (>1 GB); then loading and parsing all the details could cause a memory issue. So what would be the best approach to parse the transaction file? Reading it chunk by chunk also leads to memory consumption. Please advise.
As others have said, a database would be a good solution.
Alternatively you could sort the two files on the account number. Most operating systems provide efficient file-sorting programs, e.g. for Linux (sorting on the 5th column):
LC_ALL=C sort -t, -k5 file.csv > sorted.csv
taken from Sorting csv file by 5th column using bash
You can then read the two files sequentially.
Your programming logic is:
if (Accounts.accountNumber < Transaction.accountNumber) {
    read next Accounts record
} else if (Accounts.accountNumber == Transaction.accountNumber) {
    process transaction
    read next Transaction record
} else {
    read next Transaction record
}
The memory requirements will be tiny; you only need to hold one record from each file in memory.
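A minimal Java sketch of this merge join over the two sorted files, assuming the account number is the first column of the sorted Accounts file and the fifth column of the sorted Transaction file (file names and column positions are illustrative, not taken from the question):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MergeJoin {
    public static void main(String[] args) throws IOException {
        try (BufferedReader accounts = Files.newBufferedReader(Paths.get("sorted-accounts.csv"));
             BufferedReader transactions = Files.newBufferedReader(Paths.get("sorted-transactions.csv"))) {

            String acc = accounts.readLine();
            String txn = transactions.readLine();

            while (acc != null && txn != null) {
                String accKey = acc.split(",")[0];   // account number: 1st column (assumed)
                String txnKey = txn.split(",")[4];   // account number: 5th column (assumed)
                int cmp = accKey.compareTo(txnKey);
                if (cmp < 0) {
                    acc = accounts.readLine();       // no more transactions for this account
                } else if (cmp == 0) {
                    process(acc, txn);               // e.g. add the transaction to this account's PDF
                    txn = transactions.readLine();
                } else {
                    txn = transactions.readLine();   // transaction without a matching account; skip it
                }
            }
        }
    }

    private static void process(String account, String transaction) {
        System.out.println(account + " -> " + transaction);
    }
}

Plain string comparison matches the byte order produced by LC_ALL=C sort, so both readers advance in the same order.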
Let's say you are using Oracle as the database.
You could load the data into the corresponding tables using the Oracle SQL*Loader tool.
Once the data is loaded you could use simple SQL queries to join and query data from the loaded tables.
This will work with any type of database, but you will need to find the appropriate tool for loading the data.
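Once the tables are loaded, a sketch of the kind of join you would drive the PDF generation from; the table and column names here are assumptions, not part of the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AccountTransactionJoin {
    public static void main(String[] args) throws Exception {
        // connection details are placeholders
        try (Connection con = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT a.account_number, a.account_name, t.txn_date, t.amount " +
                 "FROM accounts a JOIN transactions t ON t.account_number = a.account_number " +
                 "ORDER BY a.account_number");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // rows arrive grouped by account, so each account's PDF can be
                // built and closed before moving on to the next account
                System.out.printf("%s %s %s %s%n",
                    rs.getString(1), rs.getString(2), rs.getDate(3), rs.getBigDecimal(4));
            }
        }
    }
}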
Of course, importing the data into a database first would be the most elegant way.
Besides that, your question leaves the impression that this isn't an option.
So I recommend you read the transactions CSV line by line (for instance using a BufferedReader). Because in CSV format each line is a record, you can then (while reading) filter out and forget each record that is not for your current account.
After one file traversal you have all transactions for one account, and that should usually fit into memory.
A drawback of this approach is that you end up reading the transactions file multiple times, once for each account's PDF generation. But if your application needed to be highly optimized, I suspect you would already have used a database.
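A minimal sketch of one such filtering pass, assuming the account number is the fifth comma-separated column of the transaction file (the column position is an assumption, not stated in the question):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TransactionFilter {

    /** Collects all transaction lines belonging to one account in a single pass over the file. */
    static List<String> transactionsFor(String accountNumber) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("Transaction.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                if (cols.length > 4 && cols[4].equals(accountNumber)) {
                    result.add(line);   // keep only the current account's records
                }
            }
        }
        return result;                  // everything else was read and forgotten
    }
}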
Related
I have a query regarding the best way of handling huge files in Java.
Should we use a NoSQL database like Cassandra, or try to use our existing Oracle database (to dump the content of the file into)?
My file contains at most 1 or 2 fields, and mostly all I need to do with the file content is search for an ID and return a boolean.
The file can contain tens of millions of records, or as few as thousands.
Also, this file can get refreshed on a daily basis. Whenever it is refreshed I need to clear all previous values.
Any suggestions would be helpful!!
Regards,
Vicky
As per your requirements:
Oracle
Is good at indexing and fits your requirements even if the daily data runs to tens of millions of records.
The index will be held in memory and searches will be fast for a data set this small. If the table is also small, you can ask for the table to be kept in memory as well, which will be even faster if any other column is also required.
You can drop the table every day and import the file again as a new table (a sketch of this daily reload follows after the Cassandra notes below). This should work.
Cassandra
Is also good at indexing. All your searches will also be fast (similar to Oracle for such a small data set).
Cassandra is a NoSQL database designed to provide scalability, high write throughput, and availability for high-volume data and queries.
Cassandra generally runs in clustered environments to achieve these properties.
I would suggest checking your requirements: if you just want to keep data in a DB and query it once in a while, or at maybe 100 requests per second, then using Cassandra is like driving a nail into a wall with a sledgehammer when a small hammer or mallet would do.
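For the Oracle route, a minimal sketch of the daily reload and the ID-existence check. The table name, column type and the plain DROP/CREATE are illustrative assumptions; the file itself would be bulk-loaded with SQL*Loader or a similar tool:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class IdLookup {

    /** Drops and recreates the lookup table before the daily bulk load. */
    static void resetTable(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            try {
                st.execute("DROP TABLE id_lookup");
            } catch (SQLException ignored) {
                // table did not exist yet
            }
            // the primary key provides the index used by the existence check below
            st.execute("CREATE TABLE id_lookup (id VARCHAR2(64) PRIMARY KEY)");
        }
    }

    /** Returns true if the ID is present in today's data. */
    static boolean exists(Connection con, String id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement("SELECT 1 FROM id_lookup WHERE id = ?")) {
            ps.setString(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}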
I have a 36 GB file which has about 600 million lines of data with this structure:
LogID,SensorID,ValueNumeric,ValueString,DateAdded,VariableName,Name
1215220724,1182,1,0000,,2016-01-04 12:56:57.7770000,Scan_Frequency,MIC_2
I am only interested in the date, value, variable name and stream (name). The problem is that there are a lot of duplicate entries and the data is not ordered by date.
My current solution is that I go through the first 100,000 lines, read the names of the variables (there are 833 of them) and create a table in the DB for each of them. As primary key I use the date (with seconds and milliseconds cut off), so the DB stays free of duplicates. I know it is not ideal to use a string as the primary key.
Then I read the file again and insert the data into the tables, but it's too slow. My estimate is that I should end up with at least 10x fewer lines.
Does anyone have a better idea how to read such a big file, sort it by date and remove duplicates? It would be enough to save data at 5-minute intervals.
I'd use an Elasticsearch + Logstash based solution (they are free and work very well with their installation defaults). Logstash is designed to ingest data from several sources (including CSV files), and Elasticsearch is a NoSQL database that does an amazing job both at indexing documents and at querying them.
See this question and answer for a starting point, and here is the documentation.
Your database will offer a tool to import CSV files directly. This is most likely much faster than using JDBC. Furthermore, chances are high it also offers a way to remove the duplicates you mention during the import. Once you have the data in the database, it will take care of sorting it for you.
Just to give you an example: if you were using MySQL, there is the import utility mysqlimport, which also offers an option to remove duplicates during the import using --replace.
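mysqlimport is a command-line front end to MySQL's LOAD DATA statement, so the same import (including duplicate removal via REPLACE) can also be driven from Java if that is more convenient. A hedged sketch, assuming a table whose primary key covers the de-duplication columns; the table name and the allowLoadLocalInfile setting are details of your particular setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvLoad {
    public static void main(String[] args) throws Exception {
        // allowLoadLocalInfile lets the driver send the local file to the server;
        // the server's local_infile setting must also permit this
        String url = "jdbc:mysql://localhost:3306/sensors?allowLoadLocalInfile=true";
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             Statement st = con.createStatement()) {
            // REPLACE overwrites rows whose primary key already exists,
            // which is how duplicates are dropped during the import
            st.execute(
                "LOAD DATA LOCAL INFILE 'measurements.csv' " +
                "REPLACE INTO TABLE measurements " +
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' " +
                "(log_id, sensor_id, value_numeric, value_string, date_added, variable_name, name)");
        }
    }
}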
I have a program that creates multiple text files of RDF triples. I need to compare the triples, and do it fast; what is the best way to do this? I thought of putting the triples into an array and comparing them, but there could potentially be hundreds of thousands of triples per file and that would take forever. I need it to be as close to real time as possible, since the triples will be generated constantly among the files. Any help would be great. The files are also in AllegroGraph repositories, if it's easier to compare them there somehow.
A thought: if I stored the triples in Excel (one triple per row), with one sheet per repository:
A: How could I find the duplicates among the sheets?
B: Would it be fast?
C: How could I automate that from Java?
You need to build a master index that stores each triple, how many files it appears in, and the exact file name and location of the triple within each file. You can then search the master index to answer the queries in real time.
As you update, delete or create new rdf files, you need to update the master index.
You need to store the master index so that it can be updated and searched efficiently.
A simple choice could be to use a relational database (like MySQL) to store the master index. It can answer queries like finding common triples with a simple select statement: select * from rdfindex where triplecount > 2.
EDIT: You cannot store hundreds of thousands of triples in memory using a HashMap or similar data structure. That's why I suggested using a database, which can store the data and respond to your queries efficiently. You can look at an embedded database like SQLite to store the data.
Read up on these topics:
How to create an SQLite database, create tables, access tables, etc. Create a simple table to store the triple, triplecount and filenames.
Convert all your Excel files to CSV files. You can use opencsv to parse the files in Java (check out the samples that come with opencsv).
Parse the CSV files and load the data into SQLite. If the triple is already in the database, just update the count; if not, insert the triple (see the sketch after this list).
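A minimal sketch of that load step, assuming the sqlite-jdbc driver and opencsv are on the classpath and that each CSV row holds subject, predicate and object in its first three columns (the table layout and file name are illustrative):

import com.opencsv.CSVReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class TripleIndexLoader {
    public static void main(String[] args) throws Exception {
        String file = "repository1.csv";
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:rdfindex.db")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS rdfindex (" +
                           "triple TEXT PRIMARY KEY, triplecount INTEGER, filenames TEXT)");
            }
            try (CSVReader csv = new CSVReader(new FileReader(file));
                 PreparedStatement update = con.prepareStatement(
                     "UPDATE rdfindex SET triplecount = triplecount + 1, " +
                     "filenames = filenames || ',' || ? WHERE triple = ?");
                 PreparedStatement insert = con.prepareStatement(
                     "INSERT INTO rdfindex (triple, triplecount, filenames) VALUES (?, 1, ?)")) {
                String[] row;
                while ((row = csv.readNext()) != null) {
                    String triple = row[0] + " " + row[1] + " " + row[2];  // subject predicate object
                    update.setString(1, file);
                    update.setString(2, triple);
                    if (update.executeUpdate() == 0) {   // triple not seen before: insert it
                        insert.setString(1, triple);
                        insert.setString(2, file);
                        insert.executeUpdate();
                    }
                }
            }
        }
    }
}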
As far as I know there is a function to delete duplicate entries in AllegroGraph; this may be an option if all the triples come from there.
We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.
The main reason it is slow is that the data in the file has references to other JPA objects, meaning the other JPA objects need to be read back for each row in the file, i.e. imagine we have 100,000 people and 1,000,000 asset objects:
Person object --> Asset list
Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies, or am I going to need to revert to pure SQL and vendor-specific commands?
Why not use the age-old technique of divide and conquer? Split the file into small chunks and then have parallel processes work on these small files concurrently.
And use the batch inserts/updates that are offered by JPA and Hibernate; more details here.
In my opinion, though, the ideal way is to use the batch support provided by plain JDBC and commit at regular intervals.
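A minimal sketch of that plain-JDBC batching, committing every few thousand rows; the connection URL, table, column layout and batch size are all illustrative assumptions:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchMerge {
    public static void main(String[] args) throws Exception {
        int batchSize = 5_000;
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
             BufferedReader reader = Files.newBufferedReader(Paths.get("assets.txt"))) {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO asset (person_id, asset_name) VALUES (?, ?)")) {
                String line;
                int count = 0;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(",");
                    ps.setLong(1, Long.parseLong(cols[0]));   // reference the person by foreign key,
                    ps.setString(2, cols[1]);                 // no need to load the Person entity
                    ps.addBatch();
                    if (++count % batchSize == 0) {
                        ps.executeBatch();                    // send the batch to the database
                        con.commit();                         // commit at regular intervals
                    }
                }
                ps.executeBatch();                            // flush the remainder
                con.commit();
            }
        }
    }
}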
You might also want to look at Spring Batch, as it provides splitting/parallelization/iterating through files etc. out of the box. I have used all of these successfully for an application of considerable size.
One possible approach, which is painfully slow, is the following:
For each line in the file:
Read data line
fetch reference object
check if data is attached to reference object
if not add data to reference object and persist
So slow it is not worth considering.
I have a requirement where I have to select around 60 million plus records from a database. Once I have all the records in a ResultSet, I have to format some columns as per the client requirement (date format and number format) and then write all the records to a file (secondary storage).
Currently I am selecting records on a per-day basis (7 selects for 7 days) from the DB and putting them in a HashMap, then reading from the HashMap, formatting some columns and finally writing to a file (a separate file for each of the 7 days).
Finally I merge all 7 files into a single file.
But this whole process takes 6 hours to complete. To improve it, I created 7 threads for the 7 days, with each thread writing a separate file.
Again I merge all 7 files into a single file. This version takes 2 hours, but my program runs out of memory (OutOfMemoryError) after an hour or so.
Please suggest the best design for this scenario. Should I use some caching mechanism? If yes, which one and how?
Note: the client doesn't want to change anything on the database side, like creating indexes or stored procedures; they don't want to touch the database.
Thanks in advance.
Do you need to have all the records in memory to format them? You could try to stream the records through a process and straight to the file. If you're able to break the query up further, you might even be able to start processing the results while you're still retrieving them.
Depending on your DB backend, there might be tools to help with this, such as SSIS for SQL Server 2005+.
Edit
I'm a .NET developer, so let me suggest what I would do in .NET; hopefully you can convert it into comparable technologies on the Java side.
ADO.NET has a DataReader, which is a forward-only, read-only ("firehose") cursor over a result set. It returns data as the query is executing. This is very important. Essentially, my logic would be:
IDataReader reader = GetTheDataReader(dayOfWeek);

while (reader.Read())
{
    file.Write(formatRow(reader));
}
Since this executes while the rows are being returned, you're not going to block on the network access, which I am guessing is a huge bottleneck for you. The key here is that we are not storing any of this in memory for long: as we cycle, the reader discards the results and the file writes each row to disk.
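On the Java side, the closest equivalent is a plain JDBC forward-only, read-only ResultSet with a fetch-size hint, so rows are formatted and written out as they arrive instead of being collected in a HashMap. A minimal sketch; the connection URL, query and formatting are placeholders:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamDayToFile {
    public static void main(String[] args) throws Exception {
        String day = args[0];   // day to export (yyyy-MM-dd), passed as a program argument
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
             BufferedWriter out = Files.newBufferedWriter(Paths.get("day-" + day + ".txt"));
             PreparedStatement ps = con.prepareStatement(
                 "SELECT * FROM records WHERE record_date = ?",
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(10_000);   // hint: stream in chunks rather than buffering everything
            ps.setDate(1, java.sql.Date.valueOf(day));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.write(formatRow(rs));   // format one row and write it straight to disk
                    out.newLine();
                }
            }
        }
    }

    static String formatRow(ResultSet rs) throws Exception {
        // apply the client's date and number formatting here
        return rs.getString(1);
    }
}

How aggressively the driver actually streams depends on the database; some JDBC drivers need additional settings before they stop buffering the whole result set.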
I think what Josh is suggesting is this:
You currently have loops where you go through all the result records of your query (just using pseudocode here):

while (rec = getNextRec())
{
    put in hash ...
}

for each rec in (hash)
{
    format and save back in hash ...
}

for each rec in (hash)
{
    write to a file ...
}

Instead, do it like this:

while (rec = getNextRec())
{
    format fields ...
    write to the file ...
}

Then you never have more than one record in memory at a time ... and you can process an unlimited number of records.
Obviously reading 60 million records at once uses up all your memory, so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time uses up all your time, so you can't do that either (i.e. your initial read-to-file model).
So.... you're going to have to compromise and do a bit of both.
Josh has it right - open a cursor to your DB that simply reads the next record, one after the other in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want here as it imposes the least load on the database. The DB isn't going to let you update the records, or go backwards in the recordset, which you don't want anyway, so it won't need to handle memory for the records.
Now that you have this cursor, you're being given one record at a time by the DB. Read it and write it to a file (or several files); this should finish quite quickly. Your task then is to merge the files into one in the correct order, which is relatively easy.
Given the quantity of records you have to process, I think this is the optimum solution for you.
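Merging the per-day files is then just concatenation in day order, since each file already covers a whole day. A small sketch, assuming the files are named day-1.txt through day-7.txt (names are illustrative):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MergeDayFiles {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = Files.newOutputStream(Paths.get("all-days.txt"),
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int day = 1; day <= 7; day++) {
                // appending the files in day order keeps the overall ordering correct
                Files.copy(Paths.get("day-" + day + ".txt"), out);
            }
        }
    }
}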
But... seeing as you're doing quite well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in many companies; this just seems to be another one of those processes.
It depends on the database you are using, but if it were SQL Server, I would recommend using something like SSIS to do this rather than writing a program.