Best way to process lots of POJOs - java

I have an ever-growing data set (stored in a Google spreadsheet from day one) which I now want to do some analysis on. I have some basic spreadsheet processing set up which worked fine when the data set was under 10,000 rows, but now that I have over 30,000 rows it takes a painfully long time to refresh the sheet when I make any changes.
So basically each data entry contains the following fields (among other things):
Name, time, score, initial value, final value
My spreadsheet was fine as a data analysis solution for things like giving me all rows where Name contained the string "abc" and score was < 100.
However, as the number of rows increases it takes google sheets longer and longer to generate a result.
So I want to load all my data into a Java program (Java because it is the language I am most familiar with, and I want to use this as a meaningful way to refresh my Java skills as well).
I also have an input variable which my spreadsheet uses when processing the data, and which I adjust in incremental steps to see how the output is affected. But getting a result for each incremental change to this input variable takes far too long. This is something I want to automate, so I can set the range of the input value and the increment step, and then have the system generate the output for each incremental value.
My question is: what is the best way to load this data into a Java program? I have the data in a txt file, so I figured I could read each line into its own POJO and, once all 30,000 rows are loaded into an ArrayList, start crunching through it. Is there a more efficient data container or method I could be using?

If you have a bunch of arbitrary (unspecified, probably ad-hoc) data processing to do, and using a spreadsheet is proving too slow, you would be better off looking for a better tool or a more applicable language.
Here are some of the many possibilities:
Load the data into an SQL database and perform your analysis using SQL queries. There are many interactive database tools out there.
OpenRefine. Never used it, but I am told it is powerful and easy to use.
Learn Python or R and their associated data analysis libraries.
It would be possible to implement this all in Java and make it go really fast, but for a dataset of 30,000 records it is (IMO) not worth the development effort.
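If you do go the Java route, a minimal sketch of the POJO-plus-ArrayList approach from the question might look like the following. The field names, the tab-separated file layout, and the Java 16+ record syntax are assumptions for illustration, not the asker's actual format:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ScoreAnalysis {

    // Simple POJO holding one row of the exported sheet (assumed field names).
    record Entry(String name, String time, double score, double initialValue, double finalValue) {

        // Parse one tab-separated line into an Entry (column order is an assumption).
        static Entry fromLine(String line) {
            String[] f = line.split("\t");
            return new Entry(f[0], f[1],
                    Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]),
                    Double.parseDouble(f[4]));
        }
    }

    public static void main(String[] args) throws IOException {
        // Load all 30,000+ rows into memory; this is small by JVM standards.
        List<Entry> entries;
        try (Stream<String> lines = Files.lines(Paths.get("data.txt"))) {
            entries = lines.map(Entry::fromLine).collect(Collectors.toList());
        }

        // Example query: all rows where the name contains "abc" and the score is below 100.
        List<Entry> matches = entries.stream()
                .filter(e -> e.name().contains("abc") && e.score() < 100)
                .collect(Collectors.toList());

        System.out.println(matches.size() + " matching rows");
    }
}

Sweeping the input variable over a range then becomes an ordinary loop over the in-memory list instead of a spreadsheet refresh.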

Related

Neo4j - Reading large amounts of data with Java

I'm currently trying to read large amounts of data into my Java application using the official Bolt driver. I'm having issues because the graph is fairly large (~17k nodes, ~500k relationships) and of course I'd like to read this in chunks for memory efficiency. What I'm trying to get is a mix of fields between the origin and destination nodes, as well as the relationship itself. I tried writing a pagination query:
MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel)
WITH r.some_date AS some_date, r.arrival_times AS arrival_times,
r.departure_times AS departure_times, r.path_ids AS path_ids,
n.node_id AS origin_node_id, m.node_id AS dest_node_id
ORDER BY id(r)
RETURN some_date, arrival_times, departure_times, path_ids,
origin_node_id, dest_node_id
LIMIT 5000
(I changed some of the label and field naming so it's not obvious what the query is for)
The idea was that I'd use SKIP on subsequent queries to read more data. However, at 5,000 rows per read this is taking roughly 7 seconds per read, presumably because of the full scan that ORDER BY forces, and if I add SKIP the execution time and memory usage go up significantly. This is far too slow to read the whole thing. Is there any way I can speed up the query, or stream the results in chunks into my app? In general, what is the best approach to reading large amounts of data?
Thanks in advance.
Instead of SKIP, from the second call onward you can add a filter like id(r) > "last received id(r)". It should actually reduce the processing time as you go.
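A sketch of that keyset-pagination idea, assuming the 4.x neo4j-java-driver and the anonymised labels from the question (connection details and credentials are placeholders):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class ChunkedReader {

    // Each chunk returns rows with id(r) greater than the last id seen, so no SKIP is needed.
    private static final String QUERY =
        "MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel) " +
        "WHERE id(r) > $lastId " +
        "RETURN id(r) AS rel_id, r.some_date AS some_date, " +
        "       r.arrival_times AS arrival_times, r.departure_times AS departure_times, " +
        "       r.path_ids AS path_ids, n.node_id AS origin_node_id, m.node_id AS dest_node_id " +
        "ORDER BY id(r) LIMIT 5000";

    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                     AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            long lastId = -1;          // internal relationship ids are non-negative, so start below them
            boolean more = true;
            while (more) {
                Result result = session.run(QUERY, Values.parameters("lastId", lastId));
                more = false;
                while (result.hasNext()) {
                    Record row = result.next();
                    lastId = row.get("rel_id").asLong();   // remember the last id seen in this chunk
                    more = true;
                    // ... process the row here ...
                }
            }
        }
    }
}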

How to Handle Large Data in Java?

My application needs to use data in a text file which is up to 5 GB in size. I cannot load all of this data into RAM as it is far too large.
The data is stored like a table: 5 million records (rows) and 40 columns, each containing text that will be converted in memory to either a String, an int, or a double.
I've tried caching only 10-100 MB of the data in memory and reloading from the file when I need data outside the cache, but it is far too slow. Because my calculations can jump randomly to any row within the table, the application constantly needs to open the file, read, and close it.
I need something fast, and I was thinking of using some sort of DB. I know calculations on data this large may take a while, which is fine. If I do use a DB, it needs to be set up on launch of the desktop application and must not require some separate server component to be installed beforehand.
Any tips? Thanks
I think you need to clarify some things:
Is this a desktop application (I assume yes)? What is the memory limit for it?
Do you use your file in read-only mode?
What kind of calculations are you trying to do? (How often are random rows accessed, how often are consecutive rows read, and do you need to modify the data?)
Currently I see two ways for further investigation:
Use SQLite. This is a small single-file DB, oriented mainly at desktop applications and single-user use. It doesn't require any server; all you need is the appropriate JDBC library.
Create some kind of index, using, for example, a binary tree. The first time you read your file, index the start position of each row within the file. In conjunction with a permanently open random-access file, this will let you seek to and read the desired row quickly. For a binary tree, your index may be approximately 120 MB (it's RowsCount * 2 * IndexValueSize for a binary tree).
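A rough sketch of that second option, with a plain array of row offsets standing in for the binary tree (for 5 million rows a long[] is about 40 MB) and assuming newline-terminated rows:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RowIndex {

    private final RandomAccessFile file;   // kept permanently open for random access
    private final long[] offsets;          // offsets[i] = byte position where row i starts

    public RowIndex(String path, int rowCount) throws IOException {
        offsets = new long[rowCount];

        // One buffered sequential pass over the file to record where each row starts.
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;
            int row = 0;
            offsets[row++] = 0;
            int b;
            while ((b = in.read()) != -1 && row < rowCount) {
                pos++;
                if (b == '\n') {
                    offsets[row++] = pos;
                }
            }
        }

        file = new RandomAccessFile(path, "r");
    }

    // Random access to any row: seek to its recorded offset and read one line.
    // Note: readLine() treats bytes as single-byte characters, which is fine for ASCII data.
    public String readRow(int row) throws IOException {
        file.seek(offsets[row]);
        return file.readLine();
    }
}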
You can use an embedded database; you can find a comparison here: Java Embedded Databases Comparison.
Or, depending on your use case, you may even try Lucene, which is a full-text search engine.
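For the SQLite route mentioned above, a minimal sketch assuming the Xerial sqlite-jdbc driver is on the classpath (the table and column names are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SqliteDemo {
    public static void main(String[] args) throws SQLException {
        // A single-file database created next to the application; no server needed.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db")) {

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS rows (id INTEGER PRIMARY KEY, col1 TEXT, col2 REAL)");
            }

            // Insert rows parsed from the large text file (parsing omitted here).
            try (PreparedStatement ins = conn.prepareStatement("INSERT INTO rows VALUES (?, ?, ?)")) {
                ins.setInt(1, 1);
                ins.setString(2, "example");
                ins.setDouble(3, 42.0);
                ins.executeUpdate();
            }

            // Random access later is a simple indexed lookup instead of a file scan.
            try (PreparedStatement q = conn.prepareStatement("SELECT col1, col2 FROM rows WHERE id = ?")) {
                q.setInt(1, 1);
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("col1") + " / " + rs.getDouble("col2"));
                    }
                }
            }
        }
    }
}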

Which data structure should I use for a trading system?

I have finished studying the structure of a basic trading system and am going to build one now. I am going to put a table of historical data (with columns such as day high, day low, etc., covering many years of days) into an array or some other kind of data structure, then use the data to do analysis, and some of the analysis would need to put its results into another array (or another data structure).
So basically it would be a few "tables", each having around 6 columns and around 2,000-5,000 rows. I would do calculations within these tables and then store the results in another table of similar size.
Is an array good enough, or should I choose another data structure such as a linked list?
It really depends on your requirements.
If you are just doing simple analysis on relatively small amounts of data (sounds like this is the case?) then there's no need to be too fancy. Probably a simple ArrayList of rows would work well. Keep it simple.
If performance is a concern then you need to understand the usage pattern a lot more. For example, if you are doing a lot of read-only lookups then you may want to create indexes into the data (e.g. using HashMaps). But that's getting quite advanced.
So for now, I suggest sticking with an ArrayList.
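As a concrete illustration, an ArrayList of daily-bar POJOs with an optional HashMap index keyed by date might look like this. The DailyBar fields are assumptions about which columns the table holds:

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PriceTable {

    // One row of the historical table: a single day's bar (assumed columns).
    record DailyBar(LocalDate date, double open, double high, double low, double close, long volume) {}

    private final List<DailyBar> rows = new ArrayList<>();           // keeps insertion (chronological) order
    private final Map<LocalDate, DailyBar> byDate = new HashMap<>(); // optional index for fast lookups by date

    public void add(DailyBar bar) {
        rows.add(bar);
        byDate.put(bar.date(), bar);
    }

    public DailyBar get(LocalDate date) {
        return byDate.get(date);
    }

    // Example calculation over the rows: average daily range (high minus low).
    public double averageRange() {
        return rows.stream()
                .mapToDouble(b -> b.high() - b.low())
                .average()
                .orElse(0.0);
    }
}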

Large-scale processing of serialized Integer objects

I have a large data set in the following format:
In total, there are 3,687 object files, each of which contains 2,000,000 records. Each file is 42 MB in size.
Each record contains the following:
An id (Integer value)
Value1 (Integer)
Value2 (Integer)
Value3 (Integer)
The content of each file is not sorted or ordered in any way, as the records are written in the order they are observed during a data collection process.
Ideally, I want to build an index for this data (indexed by the id), which would mean the following:
1. Divide the set of ids into manageable chunks.
2. Scan the files to get the data related to the current working set of ids.
3. Build the index.
4. Go over the next chunk and repeat steps 1-3.
To me this sounds fine, but loading 152 GB back and forth is time-consuming, and I wonder about the best possible approach, or even whether Java is actually the right language to use for such a process.
I have 256 GB of RAM and 32 cores on my machine.
Update:
Let me modify this. Putting aside I/O, and assuming the file is already in memory in a byte array:
what would be the fastest possible way to decode a 42 MB object file that has 2,000,000 records, where each record contains 4 serialized Integers?
You've made a very poor choice of file format. I would convert the lot from serialized Integers to binary ints written with DataOutputStream.writeInt(), and read them with DataInputStream.readInt(). With buffered streams underneath in both cases. You will save masses of disk space, which will therefore save you I/O time as well, and you also save all the serialization overhead time. And change your collection software to use this format in future. The conversion will take a while, but it only happens once.
Or else use a database as suggested, again with native ints rather than serialized objects.
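A sketch of that one-off conversion and of reading the converted file back. The readObject() pattern assumes the old files were written one Integer at a time, which may not match the actual collection software; the converted format is a fixed 16 bytes per record either way:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

public class Convert {

    // One-off conversion: serialized Integer objects -> plain binary ints.
    static void convert(String objFile, String binFile) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                     new BufferedInputStream(new FileInputStream(objFile)));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(binFile)))) {
            try {
                while (true) {
                    // Assumption: the old files contain one serialized Integer per value.
                    out.writeInt((Integer) in.readObject());  // id
                    out.writeInt((Integer) in.readObject());  // value1
                    out.writeInt((Integer) in.readObject());  // value2
                    out.writeInt((Integer) in.readObject());  // value3
                }
            } catch (EOFException endOfFile) {
                // normal end of the old file
            }
        }
    }

    // Reading the converted file back is just four readInt() calls per record.
    static void read(String binFile) throws IOException {
        try (DataInputStream in = new DataInputStream(
                     new BufferedInputStream(new FileInputStream(binFile)))) {
            try {
                while (true) {
                    int id = in.readInt();
                    int v1 = in.readInt();
                    int v2 = in.readInt();
                    int v3 = in.readInt();
                    // ... index or process the record here ...
                }
            } catch (EOFException endOfFile) {
                // done
            }
        }
    }
}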
So what I would do is just load up each file and store the ids in some sort of sorted structure - std::map perhaps [or Java's equivalent; but given that it's probably only about 10-20 lines of code to read in the filename, read the contents of the file into a map, close the file, and ask for the next file, I'd probably just write the C++ to do that].
I don't really see what else you can or should do, unless you actually want to load it into a DBMS - which I don't think is an unreasonable suggestion at all.
Hmm... it seems the better way of doing it is to use some kind of DBMS. Load all your data into the database, and you can leverage its indexing, storage, and querying facilities. Of course this depends on what your requirements are, and whether or not a DBMS solution suits them.
Given that your available memory is larger than your dataset and you want very high performance, have you considered Redis? It's well suited to operations on simple data structures, and its performance is very fast.
Just be a bit careful about letting Java do default serialization when storing values. I've previously run into issues with my primitives getting autoboxed prior to serialization.

CSV file, faster in binary format? Fastest search?

If I have a CSV file, is it faster to keep the file as plain text or to convert it to some other format? (For searching.)
In terms of searching a CSV file, what is the fastest method of retrieving a particular row (by key)? I'm not referring to sorting the file, sorry; what I mean is looking up an arbitrary key in the file.
Some updates:
the file will be read-only
the file can be read and kept in memory
There are several things to consider for this:
What kind of data do you store? Does it actually make sense to convert it to a binary format? Will the binary format take up less space (the time it takes to read the file depends on its size)?
Do you have multiple queries against the same file while the system is running, or do you have to load the file each time someone does a query?
Do you need to efficiently transfer the file between different systems?
All these factors are very important for a decision. The common case is that you only need to load the file once and then do many queries. In that case it hardly matters what format you store the data in, because it will be stored in memory afterwards anyway. Spend more time thinking about good data structures to handle the queries.
Another common case is that you cannot keep the main application running and hence cannot keep the file in memory. In that case, get rid of the file and use a database. Any database you can use will most likely be faster than anything you could come up with yourself. However, it is not as easy to transfer a database between systems.
Most likely though, the file format will not be a real issue to consider. I've read quite a few very long CSV files and most often the time it took to read the file was negligible compared to what I needed to do with the data afterwards.
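For the load-once, query-many case, a minimal sketch of reading the CSV into a HashMap keyed on the lookup column (assumed here to be the first column, with no quoted commas in the data):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CsvLookup {

    public static void main(String[] args) throws IOException {
        Map<String, String[]> byKey = new HashMap<>();

        // Read the whole file once and index every row by its key column.
        List<String> lines = Files.readAllLines(Paths.get("data.csv"));
        for (String line : lines) {
            String[] cols = line.split(",");   // naive split; fine as long as fields contain no quoted commas
            byKey.put(cols[0], cols);
        }

        // Every subsequent lookup is an O(1) hash lookup, regardless of file size.
        String[] row = byKey.get("someKey");
        if (row != null) {
            System.out.println(String.join(" | ", row));
        }
    }
}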
If you have too much data and this is at production level, then use Apache Lucene.
If it's a small dataset, or this is about learning, then read up on suffix trees and tries.
"Convert" it (i.e. import it) into a database table (or preferably normalised tables) with indexes on the searchable columns and a primary key on the column that has the highest cardinality - no need to reinvent the wheel. You'll save yourself a lot of issues - transaction management, concurrency, and so on. Really, if it will be in production, the chance that you will want to keep it in CSV format is slim to zero.
If the file is too large to keep in memory, then just keep the keys in memory. Some number of rows can also be kept in memory, with the least-recently-accessed rows paged out as additional rows are needed. Use seeks (directed by the keys) into the file to find the row in the file itself, then load that row into memory in case other entries on that row are needed.
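A sketch of that seek-by-key idea in Java, with RandomAccessFile standing in for C's fseek and only a key-to-offset map held in memory (the key is again assumed to be the first CSV column; the paging of recently used rows is left out):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class KeyedFileIndex {

    private final RandomAccessFile file;
    private final Map<String, Long> offsetByKey = new HashMap<>();  // only keys and offsets live in memory

    public KeyedFileIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "r");

        // Single pass: remember where each row starts, keyed by its first column.
        // (A buffered reader with manual position tracking would make this initial pass faster.)
        long offset = file.getFilePointer();
        String line;
        while ((line = file.readLine()) != null) {
            int comma = line.indexOf(',');
            String key = comma >= 0 ? line.substring(0, comma) : line;
            offsetByKey.put(key, offset);
            offset = file.getFilePointer();
        }
    }

    // Look up a row by key: one seek plus one line read.
    public String readRow(String key) throws IOException {
        Long offset = offsetByKey.get(key);
        if (offset == null) {
            return null;
        }
        file.seek(offset);
        return file.readLine();
    }
}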
