I'm currently trying to read large amounts of data into my Java application using the official Bolt driver. I'm having issues because the graph is fairly large (~17k nodes, ~500k relationships) and of course I'd like to read this in chunks for memory efficiency. What I'm trying to get is a mix of fields between the origin and destination nodes, as well as the relationship itself. I tried writing a pagination query:
MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel)
WITH r.some_date AS some_date, r.arrival_times AS arrival_times,
r.departure_times AS departure_times, r.path_ids AS path_ids,
n.node_id AS origin_node_id, m.node_id AS dest_node_id
ORDER BY id(r)
RETURN some_date, arrival_times, departure_times, path_ids,
origin_node_id, dest_node_id
LIMIT 5000
(I changed some of the label and field naming so it's not obvious what the query is for)
The idea was that I'd use SKIP on subsequent queries to read more data. However, at 5000 rows per read this takes roughly 7 seconds per query, presumably because of the full scan the ORDER BY forces, and once I add SKIP both the execution time and the memory usage climb significantly. This is far too slow to read the whole graph. Is there any way I can speed up the query, or stream the results in chunks into my app? In general, what is the best approach to reading large amounts of data?
Thanks in advance.
Instead of SKIP, from the second call onwards you can filter with id(r) > "last received id(r)". It should actually reduce the processing time as you go.
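A rough sketch of that keyset approach with the 4.x Java driver; the bolt URI and credentials are placeholders, and the labels/properties are the renamed ones from the question:

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class KeysetPagination {

    // Instead of SKIP, remember the last id(r) seen and ask only for
    // relationships with a larger id on the next round trip.
    private static final String QUERY =
        "MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel) " +
        "WHERE id(r) > $lastId " +
        "RETURN id(r) AS rel_id, r.some_date AS some_date, " +
        "       r.arrival_times AS arrival_times, r.departure_times AS departure_times, " +
        "       r.path_ids AS path_ids, " +
        "       n.node_id AS origin_node_id, m.node_id AS dest_node_id " +
        "ORDER BY rel_id LIMIT $batchSize";

    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            long lastId = -1L;                 // relationship ids start at 0
            boolean more = true;

            while (more) {
                Result batch = session.run(QUERY,
                        Values.parameters("lastId", lastId, "batchSize", 5000));
                more = false;
                while (batch.hasNext()) {
                    Record row = batch.next();
                    lastId = row.get("rel_id").asLong();   // advance the cursor
                    // ... hand the row off to your own processing here ...
                    more = true;
                }
            }
        }
    }
}

Each batch then only has to sort the relationships whose id comes after the last one seen, so the cost per batch stays roughly constant instead of growing the way SKIP does.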
I have a scenario where a user uploads a zip file. The zip can contain up to 4999 JSON files, and each JSON file can contain up to 4999 nodes, which I parse into objects and eventually insert into the db. When I tested this scenario it took 30-50 minutes just to parse.
I am looking for suggestions where:
I want to read the JSON files in parallel: say, with a batch of 100 JSON files I could have 50 threads running in parallel
Each thread would be responsible for parsing its JSON files, which might become another performance bottleneck since there are 4999 nodes per file to parse. So I was thinking of another batch of 100 node reads at a time, which would spawn 50 child threads again
So in total there would be about 2500 threads in the system, which should help parallelise roughly 25,000,000 otherwise sequential operations.
Does this approach sound fine or not?
What you describe should not take that much time (30-50 minutes to parse); a JSON file with ~5k nodes is relatively small.
The bottleneck will be the database during the mass insert, especially if you have indexes on fields.
So I suggest:
Don't waste time on threading. Unpacking and parsing the JSONs should be fast in your case; focus on batch inserts and do them properly: queue 1000+ rows per batch and commit manually afterwards (see the sketch after this list).
Disable indexes before importing, especially full-text ones, and re-enable (+ reindex) them afterwards.
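To illustrate the batch-plus-manual-commit idea with plain JDBC (the connection string, table and columns are made up; adapt to whatever database you are inserting into):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {

    // Hypothetical holder for one parsed JSON node.
    record Node(String id, String payload) {}

    static void insert(List<Node> nodes) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret")) {
            con.setAutoCommit(false);                       // manual commit
            String sql = "INSERT INTO nodes (id, payload) VALUES (?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                int queued = 0;
                for (Node n : nodes) {
                    ps.setString(1, n.id());
                    ps.setString(2, n.payload());
                    ps.addBatch();
                    if (++queued % 1000 == 0) {             // flush every 1000 rows
                        ps.executeBatch();
                        con.commit();
                    }
                }
                ps.executeBatch();                          // flush the remainder
                con.commit();
            }
        }
    }
}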
I think the performance problem may come from:
JSON parsing & object creation
Inserting data into the DB: if you insert row by row many times, performance drops a lot
Running 2500 threads may not be effective if you don't have many CPU cores, since the overhead increases; choose the number of threads based on your hardware configuration.
And for inserting the data into the DB, I suggest the following (sketched below):
In each thread, after parsing the JSON and creating the objects, write the objects out to a CSV file
When everything is finished, import the CSV files into the DB
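A minimal sketch of the CSV step, assuming a flat hypothetical Node object; the actual bulk loader (LOAD DATA INFILE, COPY, or whatever import tool your database ships with) depends on the DB:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class CsvExport {

    record Node(String id, String payload) {}      // hypothetical parsed object

    // Each worker writes its parsed objects to its own CSV file; the files are
    // bulk-loaded afterwards in one pass instead of inserting row by row.
    static void writeCsv(List<Node> nodes, Path target) throws IOException {
        List<String> lines = nodes.stream()
                .map(n -> n.id() + "," + n.payload().replace(",", "\\,"))  // naive escaping
                .collect(Collectors.toList());
        Files.write(target, lines);
    }
}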
I would suggest using the DSM library. With DSM, you can easily parse very complex JSON files and process them during parsing.
You don't need to wait until all the JSON files have been processed. I guess this is your main problem.
BTW:
It uses the Jackson streaming API to read JSON, so it consumes very little memory.
Example usage can be found in this answer:
JAVA - Best approach to parse huge (extra large) JSON file
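For reference, a minimal sketch of the underlying Jackson streaming read (this is plain Jackson, not DSM itself); it assumes each file is a top-level JSON array of flat node objects:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;

public class StreamingParse {

    public static void main(String[] args) throws IOException {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(new File("nodes.json"))) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IOException("expected a top-level JSON array");
            }
            // Only one node object is held in memory at a time.
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                while (parser.nextToken() != JsonToken.END_OBJECT) {
                    String field = parser.getCurrentName();
                    JsonToken value = parser.nextToken();           // move to the value
                    if (value == JsonToken.START_OBJECT || value == JsonToken.START_ARRAY) {
                        parser.skipChildren();                      // this sketch ignores nested structures
                    } else {
                        // ... use field + parser.getValueAsString() to build your object ...
                    }
                }
                // ... queue the finished object for a batch insert here ...
            }
        }
    }
}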
I'm trying to write a Dataset object as a Parquet file using java.
I followed this example to do so but it is absurdly slow.
It takes ~1.5 minutes to write ~10mb of data, so it isn't going to scale well when I want to write hundreds of mb of data.
I did some cpu profiling and found that 99% of the time came from the ParquetWriter.write() method.
I tried increasing the page size and block size of the ParquetWriter but it doesn't seem to have any effect on the performance. Is there any way to make this process faster or is it just a limitation of the Parquet library?
I've had reasonable luck using org.apache.parquet.hadoop.ParquetWriter to write org.apache.parquet.example.data.Group instances created by org.apache.parquet.example.data.simple.SimpleGroupFactory.
https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java
I'd love to know of a faster way (more columns x rows per second per thread).
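For what it's worth, a rough sketch of that Group-based approach (the schema, file name and codec are placeholders, and the builder details vary a bit between parquet-mr versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GroupWriterSketch {

    public static void main(String[] args) throws Exception {
        // Toy two-column schema; replace with your own.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message row { required int64 id; required binary name (UTF8); }");
        SimpleGroupFactory groups = new SimpleGroupFactory(schema);

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("out.parquet"))
                .withConf(new Configuration())
                .withType(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {

            for (long i = 0; i < 1_000_000; i++) {
                Group row = groups.newGroup()
                        .append("id", i)
                        .append("name", "row-" + i);
                writer.write(row);      // rows are buffered and flushed per row group
            }
        }
    }
}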
I have an ever-growing data set (stored in a Google spreadsheet from day one) which I now want to do some analysis on. I have some basic spreadsheet processing set up, which worked fine while the data set was < 10,000 rows, but now that I have over 30,000 rows it takes a painfully long time to refresh the sheet whenever I make any changes.
So basically each data entry contains the following fields (among other things):
Name, time, score, initial value, final value
My spreadsheet was ok as a data analysis solution for stuff like giving me all rows where Name contained string "abc" and score was < 100.
However, as the number of rows increases it takes google sheets longer and longer to generate a result.
So I want to load all my data into a Java program (Java because it is the language I am most familiar with, and I want to use this as a meaningful way to refresh my Java skills too).
I also have an input variable which my spreadsheet uses when processing the data, which I adjust in incremental steps to see how the output is affected. Getting a result for each incremental change to this input variable takes far too long. This is something I want to automate: set the range of the input value and the increment step, then have the system generate the output for each incremental value.
My question is: what is the best way to load this data into a Java program? I have the data in a txt file, so I figured I could read each line into its own POJO and, once all 30,000 rows are loaded into an ArrayList, start crunching through them. Is there a more efficient data container or method I could be using?
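For reference, the in-memory approach described above is only a few lines of Java; a minimal sketch, assuming Java 16+ records and a tab-separated file (both assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class LoadRows {

    // Matches the fields named in the question; column order and separator are assumptions.
    record Entry(String name, String time, double score, double initialValue, double finalValue) {

        static Entry parse(String line) {
            String[] f = line.split("\t");
            return new Entry(f[0], f[1],
                    Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]),
                    Double.parseDouble(f[4]));
        }
    }

    public static void main(String[] args) throws IOException {
        List<Entry> entries;
        try (var lines = Files.lines(Path.of("data.txt"))) {
            entries = lines.map(Entry::parse).collect(Collectors.toList());
        }

        // Example query from the question: rows where name contains "abc" and score < 100.
        long matches = entries.stream()
                .filter(e -> e.name().contains("abc") && e.score() < 100)
                .count();
        System.out.println(matches + " matching rows out of " + entries.size());
    }
}

Thirty thousand such records fit comfortably in memory, so the cost is really in how many different queries you want to run over them.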
If you have a bunch of arbitrary (unspecified, probably ad-hoc) data processing to do, and using a spread-sheet is proving too slow, you would be better off looking for a better tool or more applicable language.
Here are some of the many possibilities:
Load the data into an SQL database and perform your analysis using SQL queries. There are many interactive database tools out there.
OpenRefine. Never used it, but I am told it is powerful and easy to use.
Learn Python or R and their associated data analysis libraries.
It would be possible to implement this all in Java and make it go really fast, but for a dataset of 30,000 records it is (IMO) not worth the development effort.
One is sometimes faced with the task of parsing data stored in files on the local system. A significant dilemma is whether to load and parse all of the file data at the beginning of the program run, or to access the file throughout the run and read data on demand (assuming the file is sorted, so a lookup can be done with a binary search rather than a full scan).
When it comes to small data sets, the first approach seems favorable, but with larger ones the threat of clogging up the heap increases.
What are some general guidelines one can use in such scenarios?
That's the standard tradeoff in programming: memory vs performance, the space-time tradeoff, etc. There is no "right" answer to that question. It depends on how much memory you have, the speed you need, the size of the files, how often you query them, and so on.
In your specific case, since it seems like a one-time job (if you are able to read it all at the beginning), it probably won't matter that much ;)
That depends entirely on what your program needs to do. The general advice is to keep only as much data in memory as is necessary. For example, consider a simple program that reads each record from a file of transactions, and then reports the total number of transactions and the total dollar amount:
count = 0
dollars = 0
while not end of file
    read record
    parse record
    increment count
    add transaction amount to dollars
end
output count and dollars
Here, you clearly need to have only one transaction record in memory at a time. So you read a record, process it, and discard it. It makes no sense to load all of the records into a list or other data structure, and then iterate over the list to get the count and total dollar amount.
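In Java that streaming version might look something like this; it assumes one comma-separated transaction per line with the amount in the third column (both assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.math.BigDecimal;
import java.nio.file.Files;
import java.nio.file.Path;

public class TransactionTotals {

    public static void main(String[] args) throws IOException {
        long count = 0;
        BigDecimal dollars = BigDecimal.ZERO;

        try (BufferedReader in = Files.newBufferedReader(Path.of("transactions.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {   // only one record in memory at a time
                String[] fields = line.split(",");
                count++;
                dollars = dollars.add(new BigDecimal(fields[2]));
            }
        }
        System.out.println(count + " transactions, total $" + dollars);
    }
}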
In some cases you do need multiple records, perhaps all of them, in memory. In those cases, all you do is re-structure the program a little bit. You keep the reading loop, but have it add records to a list. Then afterwards you can process the list:
list = []
while not end of file
    read record
    parse record
    add record to list
end
process list
output results
For the count-and-total example, though, it makes no sense to load the entire file into a list and then scan the list sequentially. Not only is that a waste of memory, it also makes the program more complex, will be slower, and will fail with large data sets. The "memory vs performance" tradeoff doesn't always apply: often, as in this case, using more memory makes the program slower.
I generally find it a good practice to structure my solutions so that I keep as little data in memory as is practical. If the solution is simpler with sorted data, for example, I'll make sure that the input is sorted before I run the program.
That's the general advice. Without specific examples from you, it's hard to say what approach would be preferred.
I am working with 2 large input files, on the order of 5 GB each.
They are the output of a Hadoop map reduce job, but as I am not able to do the dependency calculations in map reduce, I am switching to an optimized for loop for the final calculations (see my previous question on the map reduce design: Recursive calculations using Mapreduce).
I would like suggestions on reading such huge files in Java and doing some basic operations on them; finally I will be writing out data which will be on the order of around 5 GB.
I appreciate your help
If the files have the properties you described, i.e. 100 integer values per key and 10GB each, you are talking about a very large number of keys, far more than you can feasibly fit into memory. If you can sort the files before processing, for example with the OS sort utility or a MapReduce job with a single reducer, you can read the two files simultaneously, do your processing, and output the result without keeping too much data in memory.
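A minimal sketch of that sorted-merge read, assuming both files are already sorted by key and each line is "key<TAB>values" (the file names and line format are assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SortedMerge {

    public static void main(String[] args) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(Path.of("part-a.txt"));
             BufferedReader b = Files.newBufferedReader(Path.of("part-b.txt"))) {

            // Only the current line of each file is held in memory.
            String lineA = a.readLine();
            String lineB = b.readLine();

            while (lineA != null && lineB != null) {
                String keyA = lineA.split("\t", 2)[0];
                String keyB = lineB.split("\t", 2)[0];
                int cmp = keyA.compareTo(keyB);
                if (cmp == 0) {
                    // ... combine the two records for this key and write the result out ...
                    lineA = a.readLine();
                    lineB = b.readLine();
                } else if (cmp < 0) {
                    lineA = a.readLine();   // key only present in file A
                } else {
                    lineB = b.readLine();   // key only present in file B
                }
            }
        }
    }
}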
It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file, then, in a loop:
Read in one piece of your data
Process the piece of data
Store the result: in memory if you'll have room for the complete dataset, in a database of some sort if not
If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
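As an illustration of the H2 option, a small sketch with a file-backed database (the file name and table layout are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class H2Spill {

    public static void main(String[] args) throws Exception {
        // "jdbc:h2:./results" stores the database in local files next to the program.
        try (Connection con = DriverManager.getConnection("jdbc:h2:./results");
             Statement stmt = con.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS results (k VARCHAR PRIMARY KEY, v BIGINT)");

            try (PreparedStatement upsert =
                     con.prepareStatement("MERGE INTO results KEY(k) VALUES (?, ?)")) {
                // In the real program this runs inside the read-one-piece / process loop.
                upsert.setString(1, "someKey");
                upsert.setLong(2, 42L);
                upsert.executeUpdate();
            }
        }
    }
}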
My approach:
I configured the map reduce program to use 16 reducers, so the final output consisted of 16 files (part-00000 to part-00015) of 300+ MB each, and the keys were sorted in the same order in both input files.
Then at every stage I read 2 input files (around 600 MB in total) and did the processing, so at every stage I only had to hold about 600 MB in memory, which the system could manage pretty well.
The program was pretty quick, taking around 20 minutes for the complete processing.
Thanks for all the suggestions! I appreciate your help.