I'm trying to write a Dataset object as a Parquet file using java.
I followed this example to do so but it is absurdly slow.
It takes ~1.5 minutes to write ~10mb of data, so it isn't going to scale well when I want to write hundreds of mb of data.
I did some cpu profiling and found that 99% of the time came from the ParquetWriter.write() method.
I tried increasing the page size and block size of the ParquetWriter but it doesn't seem to have any effect on the performance. Is there any way to make this process faster or is it just a limitation of the Parquet library?
I've had reasonable luck using org.apache.parquet.hadoop.ParquetWriter to write org.apache.parquet.example.data.Group made by the org.apache.parquet.example.data.simple.SimpleGroupFactory.
https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java
I'd love to know of a faster way (more columns x rows per second per thread).
Related
I'm currently trying to read large amounts of data into my Java application using the official Bolt driver. I'm having issues because the graph is fairly large (~17k nodes, ~500k relationships) and of course I'd like to read this in chunks for memory efficiency. What I'm trying to get is a mix of fields between the origin and destination nodes, as well as the relationship itself. I tried writing a pagination query:
MATCH (n:NodeLabel)-[r:RelationshipLabel]->(n:NodeLabel)
WITH r.some_date AS some_date, r.arrival_times AS arrival_times,
r.departure_times AS departure_times, r.path_ids AS path_ids,
n.node_id AS origin_node_id, m.node_id AS dest_node_id
ORDER BY id(r)
RETURN some_date, arrival_times, departure_times, path_ids,
origin_node_id, dest_node_id
LIMIT 5000
(I changed some of the label and field naming so it's not obvious what the query is for)
The idea was I'd use SKIP on subsequent queries to read more data. However, at 5000 rows/read this is taking roughly 7 seconds per read, presumably because of the full scan ORDER BY forces, and if I SKIP it goes up in execution time and memory usage significantly. This is way too long to read the whole thing, is there any way I can speed up the query? Or stream the results in chunks into my app? In general, what is the best approach to reading large amounts of data?
Thanks in advance.
Instead of skip. From the second call you can do id(r) > "last received id(r)" it should actually reduce the process time as you go.
I am working with a 2 large input files of the order of 5gb each..
It is the output of Hadoop map reduce, but as i am not able to do dependency calculations in Map reduce, i am switching to an optimized for loop for final calculations( see my previous question on map reduce design Recursive calculations using Mapreduce
I would like to have suggestion on reading such huge files in java and doing some basic operations, finally i will be writing out the data which will of the order of around 5gb..
I appreciate your help
If the files have properties as you described, i.e. 100 integer values per key and are 10GB each, you are talking about a very large number of keys, much more than you can feasibly fit into memory. If you can order files before processing, for example using OS sort utility or a MapReduce job with a single reducer, you can read two files simultaneously, do your processing and output result without keeping too much data in memory.
It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file, then, in a loop:
Read in one piece of your data
Process the piece of data
Store the result: in memory if you'll have room for the complete dataset, in a database of some sort if not
If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
My approach,
Configured the map reduce program to use 16 reducers, so the final output consisted of 16 files(part-00000 to part-00015) of 300+ MB, and the keys were sorted in the same order for both the input files.
Now in every stage i read 2 input files(around 600 MB) and did the processing.. So at every stage i had to hold to 600 MB in memory, which the system could manage pretty well.
The program was pretty quick took around 20mins for the complete processing.
Thanks for all the suggestions!, I appreciate your help
I need simple opinion from all Guru!
I developed a program to do some matrix calculations. It work all fine with
small matrix. However when I start calculating BIG thousands column row matrix. It
kills the speed.
I was thinking to do processing on each row and write the result in a file then free the
memory and start processing 2nd row and write in a file, so and so forth.
Will it help in improving speed? I have to make big changes to implement this change. Thats
why I need your opinion. What do you think?
Thanks
P.S: I know about colt and Jama matrix. I can not use these packages due to company
rules.
Edited
In my program I am storing all the matrix in 2 dimensional array and if matrix is small it is fine. However, when it has thousands column and rows. Then storing all this matrix in memory for calculation cause performance issues. Matrix contains floating values. For processing I read all the matrix store in memory then start calculation. After calculating I write the result in a file.
Is memory really your bottleneck? Because if it isn't, then stopping to write things out to a file is always going to be much, much slower than the alternative. It sounds like you are probably experiencing some limitation of your algorithm.
Perhaps you should consider optimising the algorithm first.
And as I always say for all performance issue - asking people is one thing, but there is no substitute for trying it! Opinions don't matter if the real-world performance is measurable.
I suggest using profiling tools and timing statements in your code to work out exactly where your performance problem is before your start making changes.
You could spend a long time 'fixing' something that isn't the problem. I suspect that the file IO you suggest would actually slow your code down.
If your code effectively has a loop nested within another loop to process each element then you will see your speed drop away quickly as you increase the size of the matrix. If so, an area to look at would be processing your data in parallel, allowing your code to take advantage of multiple CPUs/cores.
Consider a more efficient implementation of a sparse matrix data structure and not a multidimensional array (if you are using one now)
You need to remember that perfoming an NxN multipled by an NxN takes 2xN^3 calculations. Even so it shouldn't take hours. You should get an inprovement by transposing the second matrix (about 30%) but it really shouldn't be taking hours.
So as you 2x N you increase the time by 8x. Worse than that a matrix which fit into your cache is very fast but mroe than a few MB and they have to come from main memory which slows down your operations by another 2-5x.
Putting the data on disk will really slow down your calaculation, I only suggest you do this if you martix doesn't fit in memory, but it will make it 10x - 100x slower so buying a little more memory is a good idea. (In your case your matrixies should be small enough to fit into memory)
I tried Jama, which is a very basic library which use two dimensional arrays instead of one and on 4 year old labtop took 7 minutes. You should be able to get half this time by just using the latest hardware and with multiple threads cut this below one minute.
EDIT: Using a Xeon X5570, Jama multiplied two 5000x5000 matrices in 156 seconds. Using a parallel implementation I wrote, cut this time to 27 seconds.
Use the profiler in jvisualvm in the JDK to identify where the time is spent.
I would do some simple experiments to identify how your algoritm scales, because it sounds like you might use one that has a higher runtime complexity than you think. If it runs in N^3 (which is common if you want to multiply a list with an array) then doubling the input size will eight-double the run time.
I have a file of size 2GB which has student records in it. I need to find students based on certain attributes in each record and create a new file with results. The order of the filtered students should be same as in the original file. What's the efficient & fastest way of doing this using Java IO API and threads without having memory issues? The maxheap size for JVM is set to 512MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: Read the file line by line, parse the line, check your filter criterion, if matched, output a result line, then go to the next line, until the file is done. This is very memory efficient, as you only have the current line (or a buffer a little larger) loaded at the same time. Your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
On one of my test systems with extremely modest hardware, BufferedInputStream around a FileInputStream out of the box read about 500 MB in 25 seconds, i.e. probably under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state of the art hardware it's quite possible the time would be halved.
Whether you need to go to a lot of effort to reduce the 2/3 minutes or just go for a wee while you're waiting for it to run is a decision that you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to this that don't automatically mean database).
2GB for a file is huge, you SHOULD go for a db.
If you really want to use Java I/O API, then try out this: Handling large data files efficiently with Java and this: Tuning Java I/O Performance
I think you should use memory mapped files.This will help you to map the bigger file to a
smaller memory.This will act like virtual memory and as far as performance is concerned mapped files are the faster than stream write/read.
I'm writing a small agent in java that will play a game against other agents. I want to keep a small amount of state (probably approx. 1kb at most) around between runs of the program so that I can try to tweak the performance of the agent based upon past successes. Essentially, I will be reading a small amount of data at the beginning of each game and writing a small amount at the end. It seems like I have 2 options, file I/O or derby. Is there a speed advantage to either? Or does it not really matter for such a small amount of data?
With 1kb of data, you are better off using standard file IO. Most likely, you could serialize the entire object tree to disk and dismply deserialize when you startup again. If you wanted to get fancy, you could use JAXB to serialize to XML instead of binary files.
As much as I love to fit every problem to the database solution, I don't think that's very practical here. Unless you have some special need of database specific capabilities, you are introducting a lot of overhead, complexity, maintenance problems by using a database.
The only areas where you might really want to use the database is if you have a lot of small objects/rows and you frequently perform sorts and filters on the data. But even then, you could probably keep a dozen in-memory ordered lists and get better performance with less resources and without the headache of a database.
If you really think you need a database in this scenario, consider HSQL. I don't consider it a real database, but it's a in-memory database that can persist to a file. Low overhead, low complexity, and relatively few points of failure. Plus, if you need to edit the persisted data, you can do so with a text editor. Can't say that about Derby.
Considering that these objects can vary per file size, and your computer's specs (bus speed, HD speed) affect this, the only way to be sure is to write your own benchmark. Just create a simple for loop, count from 1 to 1000, and read the file inside the loop over and over (but do not create and destroy the objects inside the loop, just focus on the reading part).
Of course this whole exercise reeks of pre-optimization, which can lead to bad coding habit. Just write your code in the most readable, simple fashion, and if there is a speed problem, refactor as needed.
But since it's a small amount of data, I would say it won't matter.