I have a scenario where a user uploads a zip file. This zip file can contain up to 4999 JSON files, and each JSON file can have up to 4999 nodes, which I parse into objects and eventually insert into a database. When I tested this scenario, parsing took 30-50 minutes.
I am looking for suggestions on the following approach:
I want to read the JSON files in parallel: say, for a batch of 100 JSON files, I could have 50 threads running in parallel.
Each thread would be responsible for parsing its JSON file, which might become another performance bottleneck since there are 4999 nodes to parse. So I was thinking of reading another batch of 100 nodes at a time, which would spawn 50 child threads again.
In total there would be around 2500 threads in the system, but this should help parallelize roughly 25,000,000 otherwise sequential operations.
Does this approach sound reasonable?
What you describe should not take that long (30-50 minutes to parse); a JSON file with ~5k nodes is relatively small.
The bottleneck will be the database during the mass insert, especially if you have indexes on the fields.
So I suggest the following:
Don't waste time on threading. Unpacking and parsing the JSON should be fast in your case; focus on batch inserts and do them properly: queue 1000+ rows per batch and commit manually afterwards (see the sketch after this list).
Disable indexes before importing, especially full-text indexes, and re-enable (and reindex) them afterwards.
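As a rough illustration of the batch-insert advice, here is a minimal JDBC sketch. The connection URL, table name, columns, batch size of 1000, and the MyNode class are all assumptions for illustration, not part of the original question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {
    public static void insert(List<MyNode> nodes) throws Exception {
        // Connection details are placeholders; use your own datasource.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")) {
            conn.setAutoCommit(false);                       // manual commit, as suggested above
            String sql = "INSERT INTO nodes (id, payload) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int count = 0;
                for (MyNode n : nodes) {                     // MyNode is a hypothetical parsed-object class
                    ps.setLong(1, n.getId());
                    ps.setString(2, n.getPayload());
                    ps.addBatch();
                    if (++count % 1000 == 0) {               // flush every 1000 rows
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();                           // flush the remainder
                conn.commit();
            }
        }
    }
}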
I think the performance problem may come from:
JSON parsing and object creation
Inserting data into the DB: if you insert row by row, performance drops a lot
Running 2500 threads may not be effective if you don't have that many CPU cores, since the overhead increases. Define the number of threads based on your hardware configuration.
To insert the data into the DB, I suggest the following (a sketch follows this list):
In each thread, after parsing the JSON and creating the objects, write the objects to a CSV file
When all threads finish, bulk-import the CSV into the DB
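A minimal sketch of that idea, assuming Jackson for parsing, a pool sized to the machine's cores rather than 2500 threads, and made-up field names ("id", "value") and array-shaped JSON files:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.PrintWriter;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JsonToCsv {
    public static void convert(List<File> jsonFiles, File csvOut) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Size the pool by available cores instead of a fixed 2500 threads.
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try (PrintWriter csv = new PrintWriter(csvOut)) {
            for (File f : jsonFiles) {
                pool.submit(() -> {
                    try {
                        JsonNode root = mapper.readTree(f);
                        for (JsonNode node : root) {          // assumes each file is a JSON array of nodes
                            String line = node.get("id").asText() + "," + node.get("value").asText();
                            synchronized (csv) {              // PrintWriter is not thread-safe
                                csv.println(line);
                            }
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
        // Then bulk-load the CSV, e.g. with MySQL's LOAD DATA INFILE or PostgreSQL's COPY.
    }
}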
I would suggest using the DSM library. With DSM, you can easily parse very complex JSON files and process them during parsing.
You don't need to wait until all the JSON files have been processed; I guess this is your main problem.
BTW:
It uses the Jackson streaming API to read JSON, so it consumes very little memory.
Example usage can be found in this answer:
JAVA - Best approach to parse huge (extra large) JSON file
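For reference, since DSM sits on top of the Jackson streaming API, a bare Jackson streaming read (without DSM) looks roughly like the sketch below; the field name "nodeName" is made up for illustration:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;

public class StreamingRead {
    public static void read(File jsonFile) throws Exception {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(jsonFile)) {
            while (parser.nextToken() != null) {
                // React to field names as they stream past; "nodeName" is a made-up field.
                if (parser.getCurrentToken() == JsonToken.FIELD_NAME
                        && "nodeName".equals(parser.getCurrentName())) {
                    parser.nextToken();                      // move to the value
                    String value = parser.getText();
                    // build your object / enqueue the DB insert here
                }
            }
        }
    }
}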
I am working on a Google Dataflow pipeline which pulls data from Pub/Sub, converts it to Avro, and writes it to GCS.
According to the monitoring page, the bottleneck is writing the Avro files to GCS (70-80% of total execution time).
I use
10 workers of n1-standard-8
10 numShards
5-second fixed windows
The GCS bucket and the Dataflow endpoint are in the same region.
With this setup, throughput is around 200,000 elements per second.
Is that fast for this setup, or is there anything I can do to make it faster? (I really want to!)
Thanks
Have you considered naming your files following a specific convention in order to optimize read and write access?
In order to maintain a high request rate, avoid using sequential names. Using completely random object names will give you the best load distribution. If you want to use sequential numbers or timestamps as part of your object names, introduce randomness to the object names by adding a hash value before the sequence number or timestamp.
Basically you need to follow the same rules as when choosing a row key in Bigtable.
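To illustrate, a minimal sketch of prepending a short hash to otherwise sequential object names; the hashing scheme and prefix length are just one possible choice, not something prescribed by GCS:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ObjectNames {
    // Prepend a short hash so names are not lexicographically sequential.
    public static String randomizedName(String sequentialName) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(sequentialName.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 3; i++) {                        // 6 hex chars give plenty of spread
            hex.append(String.format("%02x", digest[i]));
        }
        return hex.toString() + "-" + sequentialName;
    }
}

With something like this, timestamped shard names written in consecutive windows no longer share a common prefix. In a Dataflow pipeline this logic would go into whatever produces the output file names (e.g. a custom filename policy), which is outside the scope of this sketch.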
I'm currently trying to read large amounts of data into my Java application using the official Bolt driver. I'm having issues because the graph is fairly large (~17k nodes, ~500k relationships) and of course I'd like to read this in chunks for memory efficiency. What I'm trying to get is a mix of fields between the origin and destination nodes, as well as the relationship itself. I tried writing a pagination query:
MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel)
WITH r.some_date AS some_date, r.arrival_times AS arrival_times,
r.departure_times AS departure_times, r.path_ids AS path_ids,
n.node_id AS origin_node_id, m.node_id AS dest_node_id
ORDER BY id(r)
RETURN some_date, arrival_times, departure_times, path_ids,
origin_node_id, dest_node_id
LIMIT 5000
(I changed some of the label and field naming so it's not obvious what the query is for)
The idea was that I'd use SKIP on subsequent queries to read more data. However, at 5000 rows per read this is taking roughly 7 seconds per read, presumably because of the full scan that ORDER BY forces, and if I SKIP, the execution time and memory usage go up significantly. This is way too long to read the whole graph. Is there any way I can speed up the query, or stream the results in chunks into my app? In general, what is the best approach to reading large amounts of data?
Thanks in advance.
Instead of SKIP, from the second call onward you can filter with id(r) > "last received id(r)". It should actually reduce the processing time as you go.
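A minimal sketch of that keyset-style pagination with the Bolt Java driver (4.x API assumed; the connection details follow the usual defaults and the labels/properties follow the renamed query from the question):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class PagedRead {
    public static void readAll() {
        String cypher =
            "MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel) " +
            "WHERE id(r) > $lastId " +
            "RETURN id(r) AS rel_id, r.some_date AS some_date, r.arrival_times AS arrival_times, " +
            "       r.departure_times AS departure_times, r.path_ids AS path_ids, " +
            "       n.node_id AS origin_node_id, m.node_id AS dest_node_id " +
            "ORDER BY rel_id LIMIT 5000";
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            long lastId = -1;
            while (true) {
                Result page = session.run(cypher, Values.parameters("lastId", lastId));
                int rows = 0;
                while (page.hasNext()) {
                    Record rec = page.next();
                    lastId = rec.get("rel_id").asLong();     // remember the cursor for the next page
                    rows++;
                    // consume the other fields here
                }
                if (rows < 5000) break;                      // last page reached
            }
        }
    }
}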
I have a huge (3GB+) XML file. Currently, I read the XML in my Java code, parse it, and store it in a HashMap, which is then used as a lookup table.
This process is done about 1000 times in 1000 different JVMs for each run of this code.
I was wondering whether, as a one-time activity, I could serialize the HashMap and store the output, and then in the Java program just deserialize the HashMap and avoid parsing the XML file 1000 times.
Will this speed up the code a lot, or is the serialization overhead going to nullify any gains?
EDIT:
1. The 1000 different JVMs operate on 1000 partitions of the input data, hence this process has to occur 1000 times.
You might consider using Chronicle Map. It can be loaded once into off-heap memory and shared across multiple JVMs without having to deserialize it, i.e. it uses very little heap and you only read the entries you actually map.get(key).
It works by memory-mapping the file, so you don't pay the price of loading it multiple times; once the first program brings it into memory, it can stay in the OS page cache even if no program is using it.
Disclaimer: I helped write it.
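A minimal sketch of what that might look like with the Chronicle Map 3.x builder API; the key/value types, entry count, average sizes, and file path are assumptions:

import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;

public class SharedLookup {
    public static ChronicleMap<CharSequence, CharSequence> open() throws Exception {
        // Persisted, memory-mapped map: each JVM maps the same file instead of re-parsing 3GB of XML.
        return ChronicleMap
                .of(CharSequence.class, CharSequence.class)
                .entries(10_000_000)                         // expected number of entries (assumed)
                .averageKeySize(32)                          // average key length in bytes (assumed)
                .averageValueSize(256)                       // average value length in bytes (assumed)
                .createPersistedTo(new File("/data/lookup.dat"));
    }
}

The file would be built once from the XML, and every subsequent JVM just calls open() and does map.get(key).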
Why are you loading and parsing the same map 1000 times? If nothing else, you could just make a copy of the first one you load to avoid reading another 3GB+ from disk.
It is likely that the serialized file will be faster, but there are no guarantees. The only way to be sure is to try it on your machine and benchmark it to measure the difference. Just be aware of all the issues, like JIT warm-up, that you need to handle to get a good benchmark result.
The best way to get good performance will be to read the file once and keep it in memory. There are overheads to doing that but if you are calling it often enough that would be worthwhile. You should really think about using a database for something like this as well, you could always use a lightweight database running locally.
I would say from my experience that the best format for serializing XML is as XML. The XML representation will generally be smaller than the output of Java serialization and therefore faster to load. But try it and see.
What isn't clear to me is why you need to serialize the partitions at all, unless your processing is highly distributed (e.g. on a cluster without shared memory).
With Saxon-EE you can do the processing like this:
<xsl:template name="main">
<xsl:stream href="big-input.xml">
<xsl:for-each select="/*/partition" saxon:threads="50">
<xsl:sequence select="f:process-one-partition(copy-of(.))"/>
</xsl:for-each>
</xsl:stream>
</xsl:template>
The function f:process-one-partition can be written either in Java or in XSLT.
The memory needed for this will be of the order of number-of-threads * size-of-one-partition.
Our application is required to take client data presented in XML format (several files) and parse this into our common XML format (a single file with a schema). For this purpose we are using Apache's XMLBeans data binding framework. The steps of this process are briefly described below.
First, we take raw java.io.File objects pointing to the client XML files on-disk and load these into a collection. We then iterate over this collection creating a single apache.xmlbeans.XmlObject per file. After all files have been parsed into XmlObjects, we create 4 collections holding the individual objects from the XML documents that we are interested in (to be clear, these are not hand-crafted objects but what I can only describe as 'proxy' objects created by apache's XMLBeans framework). As a final step, we then iterate over these collections to produce our XML document (in memory) and then save this to disk.
For the majority of use cases, this process works fine and can easily run in the JVM when given the '-Xmx1500m' command-line argument. However, issues arise when we are given 'large datasets' by the client. Large in this instance is 123Mb of client XML spread over 7 files. Such datasets result in our in-code collections being populated with almost 40,000 of the aforementioned 'proxy objects'. In these cases the memory usage just goes through the roof. I do not get any out-of-memory exceptions; the program just hangs until garbage collection occurs, freeing up a small amount of memory, then the program continues, uses up this new space, and the cycle repeats. These parsing sessions currently take 4-5 hours. We are aiming to bring this down to within an hour.
It's important to note that the calculations required to transform the client XML into our XML need all of the XML data for cross-referencing. Therefore we cannot implement a sequential parsing model or batch this process into smaller blocks.
What I've tried so far
Instead of holding all 123Mb of client xml in memory, on each request for data, load the files, find the data and release the references to these objects. This does seem to reduce the amount of memory consumed during the process but as you can imagine, the amount of time the constant I/O takes removes the benefit of the reduced memory footprint.
I suspected one issue was that we are holding an XmlObject[] for 123Mb worth of XML files as well as the collections of objects taken from these documents (using XPath queries). To remedy this, I altered the logic so that instead of querying these collections, the documents were queried directly. The idea here is that at no point do there exist 4 massive Lists with tens of thousands of objects in them, just the large collection of XmlObjects. This did not seem to make a difference at all and in some cases increased the memory footprint even more.
Clutching at straws now, I considered that the XmlObject we use to build our xml in-memory before writing to disk was growing too large to maintain alongside all the client data. However, doing some sizeOf queries on this object revealed that at its largest, this object is less than 10Kb. After reading into how XmlBeans manages large DOM objects, it seems to use some form of buffered writer and as a result, is managing this object quite well.
So now I am out of ideas. I can't use SAX approaches instead of memory-intensive DOM approaches because we need 100% of the client data in our app at any one time; I cannot hold off requesting this data until we absolutely need it, because the conversion process requires a lot of looping and the disk I/O time is not worth the saved memory; and I cannot seem to structure our logic in a way that reduces the amount of space the internal Java collections occupy. Am I out of luck here? Must I just accept that if I want to parse 123Mb worth of XML data into our XML format, I cannot do it with the 1500m memory allocation? While 123Mb is a large dataset in our domain, I cannot imagine others have never had to do something similar with GBs of data at a time.
Other information that may be important
I have used JProbe to try and see if that can tell me anything useful. While I am a profiling noob, I ran through their tutorials for memory leaks and thread locks, understood them and there doesn't appear to be any leaks or bottlenecks in our code. After running the application with a large dataset, we quickly see a 'sawblade' type shape on the memory analysis screen (see attached image) with PS Eden space being taken over with a massive green block of PS Old Gen. This leads me to believe that the issue here is simply sheer amount of space taken up by object collections rather than a leak holding onto unused memory.
I am running on a 64-Bit Windows 7 platform but this will need to run on a 32 Bit environment.
The approach I'd take would be to make two passes over the files, using SAX in both cases.
The first pass would parse the 'cross-reference' data needed in the calculations into custom objects and store them in Maps. If the 'cross-reference' data is large, then look at using a distributed cache (Coherence is the natural fit if you've started with Maps).
The second pass would parse the files, retrieve the 'cross-reference' data to perform calculations as needed, and then write the output XML using the javax.xml.stream APIs.
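As a rough sketch of the output side of that second pass (element and attribute names are invented for illustration, and the loop stands in for the real SAX-driven processing), the javax.xml.stream usage might look like:

import java.io.FileOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class CommonFormatWriter {
    public static void write(String path) throws Exception {
        XMLOutputFactory factory = XMLOutputFactory.newInstance();
        try (FileOutputStream out = new FileOutputStream(path)) {
            XMLStreamWriter writer = factory.createXMLStreamWriter(out, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("commonFormat");        // root element name is made up
            // In the real second pass this loop would be driven by the SAX callbacks,
            // looking up cross-reference data from the Maps built in the first pass.
            for (int i = 0; i < 3; i++) {
                writer.writeStartElement("record");
                writer.writeAttribute("id", String.valueOf(i));
                writer.writeCharacters("value-" + i);
                writer.writeEndElement();
            }
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        }
    }
}

Because the writer streams straight to disk, the output document never has to be held in memory as a DOM or XmlObject tree.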
I am working with two large input files of the order of 5GB each.
They are the output of a Hadoop MapReduce job, but as I am not able to do the dependency calculations in MapReduce, I am switching to an optimized for loop for the final calculations (see my previous question on MapReduce design: Recursive calculations using Mapreduce).
I would like suggestions on reading such huge files in Java and doing some basic operations on them; finally I will be writing out data which will be of the order of around 5GB.
I appreciate your help
If the files have the properties you described, i.e. 100 integer values per key and are 10GB each, you are talking about a very large number of keys, far more than you can feasibly fit into memory. If you can order the files before processing, for example using the OS sort utility or a MapReduce job with a single reducer, you can read the two files simultaneously, do your processing, and output the result without keeping too much data in memory.
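A minimal sketch of that simultaneous read over two key-sorted files (the tab-separated "key<TAB>values" line format is assumed, as is typical for MapReduce text output; adjust the parsing and the combine step to your data):

import java.io.BufferedReader;
import java.io.FileReader;

public class SortedMerge {
    public static void merge(String fileA, String fileB) throws Exception {
        try (BufferedReader a = new BufferedReader(new FileReader(fileA));
             BufferedReader b = new BufferedReader(new FileReader(fileB))) {
            String lineA = a.readLine();
            String lineB = b.readLine();
            while (lineA != null && lineB != null) {
                String keyA = lineA.split("\t", 2)[0];
                String keyB = lineB.split("\t", 2)[0];
                int cmp = keyA.compareTo(keyB);
                if (cmp == 0) {
                    // Keys match: combine the two records and write the result out.
                    lineA = a.readLine();
                    lineB = b.readLine();
                } else if (cmp < 0) {
                    lineA = a.readLine();                    // key only present in file A
                } else {
                    lineB = b.readLine();                    // key only present in file B
                }
            }
        }
    }
}

Because both readers only ever hold the current line, memory use stays constant regardless of file size.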
It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file, then, in a loop:
Read in one piece of your data
Process the piece of data
Store the result: in memory if you'll have room for the complete dataset, in a database of some sort if not
If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
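A minimal sketch of that read-process-store loop backed by a file-based H2 database; the database path, table schema, and the toUpperCase() stand-in for the real processing are all assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class StreamToH2 {
    public static void run(String inputPath) throws Exception {
        // File-backed H2 database in the working directory (path and schema are assumptions).
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./results", "sa", "");
             BufferedReader in = new BufferedReader(new FileReader(inputPath))) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS results (k VARCHAR(255), v VARCHAR(1024))");
            }
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO results VALUES (?, ?)")) {
                String line;
                long count = 0;
                while ((line = in.readLine()) != null) {     // read one piece of data
                    String processed = line.toUpperCase();   // stand-in for the real processing
                    ps.setString(1, String.valueOf(count));
                    ps.setString(2, processed);
                    ps.addBatch();
                    if (++count % 10_000 == 0) {             // store results in batches
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();
                conn.commit();
            }
        }
    }
}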
My approach:
I configured the MapReduce program to use 16 reducers, so the final output consisted of 16 files (part-00000 to part-00015) of 300+ MB each, and the keys were sorted in the same order for both input files.
Then, at every stage, I read 2 input files (around 600 MB total) and did the processing, so at every stage I had to hold about 600 MB in memory, which the system could manage pretty well.
The program was pretty quick and took around 20 minutes for the complete processing.
Thanks for all the suggestions! I appreciate your help.