I am working on Google Dataflow which pulls data from pubsub, converts to avro, and writes them to GCS.
According to the monitoring page, the bottleneck is writing avro file to GCS (spending 70-80 % of total execution time).
I use
10 workers of n1-standard-8
10 numShards
5sec fixedwindow
The region of GCS and Dataflow endpoint is same.
Then the performance is around 200,000 elements per second.
Is it fast on this situation or is there anything I can do to make it faster? (I really want to!)
Thanks
Have you considered naming your files following a specific convention in order to optimize access read and write ?
In order to maintain a high request rate, avoid using sequential names. Using completely random object names will give you the best load distribution. If you want to use sequential numbers or timestamps as part of your object names, introduce randomness to the object names by adding a hash value before the sequence number or timestamp.
Basically you need to follow the same rules as choosing a RowKey in BigTable.
Related
I have a scenario where user is going to upload a zip file. This zip file can have 4999 json files, each json file can have 4999 nodes which I am parsing and creating objects. Eventually I am inserting them in db. When I tested this scenario it took me 30-50 min to parse.
I am looking for suggestions where
I want to read JSON files in parallel: let's say if I have a batch of 100 jsonfiles then I can have 50 threads running in parallel
Each thread will be responsible for parsing the JSON files, which might result in another perf bottleneck as we have 4999 nodes to parse. So I was thinking another batch of 100 node reads at a time which will cause 50 child threads again
So in total there will be 2500 threads in the system but should help parallel execution of around 25,000,000 sequential operations.
Let me know if this approach sounds fine or not?
What you described should not take so much time (30-50 min to parse), also a json file with ~5k nodes is relatively small.
The bottleneck will be in database, during mass insert, especially if you have indexes on fields.
So i suggest to:
Don't waste time on threading - unpacking and parsing jsons should be fast in your case, focus on bath inserts and do it properly: 1000+ batch queue and manual commit after.
Disable indexes before importing, especially full-text and enable (+reindex) after
I think, the performance problem may come from:
JSON parsing & create objects
Inserting data to DB: if you insert many times, performance reduce a lot
If you run 2500 threads, it's may not effective if you don't have much CPU, since the overhead may increase. Depend on your HW configuration, you can define number of thread.
And to insert data to DB, I suggest to do as bellow:
Each thread, after JSON parsing and create objects, you put the objects into CSV file
After finish, try to import CSV to DB
I would suggest you using DSM library. With DSM, you can easily parse very complex JSON files and process them during parsing.
You don't need to wait until all JSON files being processing. I guess this is your main problem.
BTW:
It uses Jackson stream API to read JSON so, it consumes very low memory.
Example usage can be found in this answer:
JAVA - Best approach to parse huge (extra large) JSON file
I'm currently trying to read large amounts of data into my Java application using the official Bolt driver. I'm having issues because the graph is fairly large (~17k nodes, ~500k relationships) and of course I'd like to read this in chunks for memory efficiency. What I'm trying to get is a mix of fields between the origin and destination nodes, as well as the relationship itself. I tried writing a pagination query:
MATCH (n:NodeLabel)-[r:RelationshipLabel]->(n:NodeLabel)
WITH r.some_date AS some_date, r.arrival_times AS arrival_times,
r.departure_times AS departure_times, r.path_ids AS path_ids,
n.node_id AS origin_node_id, m.node_id AS dest_node_id
ORDER BY id(r)
RETURN some_date, arrival_times, departure_times, path_ids,
origin_node_id, dest_node_id
LIMIT 5000
(I changed some of the label and field naming so it's not obvious what the query is for)
The idea was I'd use SKIP on subsequent queries to read more data. However, at 5000 rows/read this is taking roughly 7 seconds per read, presumably because of the full scan ORDER BY forces, and if I SKIP it goes up in execution time and memory usage significantly. This is way too long to read the whole thing, is there any way I can speed up the query? Or stream the results in chunks into my app? In general, what is the best approach to reading large amounts of data?
Thanks in advance.
Instead of skip. From the second call you can do id(r) > "last received id(r)" it should actually reduce the process time as you go.
In our project we're using an ELK stack for storing logs in a centralized place. However I've noticed that recent versions of ElasticSearch support various aggregations. In addition Kibana 4 supports nice graphical ways to build graphs. Even recent versions of grafana can now work with Elastic Search 2 datasource.
So, does all this mean that ELK stack can now be used for storing metering information gathered inside the system or it still cannot be considered as a serious competitor to existing solutions: graphite, influx db and so forth.
If so, does anyone use ELK for metering in production? Could you please share your experience?
Just to clarify the notions, I consider metering data as something that can be aggregated and and show in a graph 'over time' as opposed to regular log message where the main use case is searching.
Thanks a lot in advance
Yes you can use Elasticsearch to store and analyze time-series data.
To be more precise - it depends on your use case. For example in my use case (financial instrument price tick history data, in development) I am able to get 40.000 documents inserted / sec (~125 byte documents with 11 fields each - 1 timestamp, strings and decimals, meaning 5MB/s of useful data) for 14 hrs/day, on a single node (big modern server with 192GB ram) backed by corporate SAN (which is backed by spinning disks, not SSD!). I went to store up to 1TB of data, but I predict having 2-4TB could also work on a single node.
All this is with default config file settings, except for the ES_HEAP_SIZE of 30GB. I am suspecting it would be possible to get significantly better write performance on that hardware with some tuning (eg. I find it strange that iostat reports device util at 25-30% as if Elastic was capping it / conserving i/o bandwith for reads or merges... but it could also be that the %util is an unrealiable metric for SAN devices..)
Query performance is also fine - queries / Kibana graphs return quick as long as you restrict the result dataset with time and/or other fields.
In this case you would not be using Logstash to load your data, but bulk inserts of big batches directly into the Elasticsearch. https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
You also need to define a mapping https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html to make sure elastic parses your data as you want it (numbers, dates, etc..) creates the wanted level of indexing, etc..
Other recommended practices for this use case are to use a separate index for each day (or month/week depending on your insert rate), and make sure that index is created with just enough shards to hold 1 day of data (by default new indexes get created with 5 shards, and performance of shards starts degrading after a shard grows over a certain size - usually few tens of GB, but it might differ for your use case - you need to measure/experiment).
Using Elasticsearch aliases https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html helps with dealing with multiple indexes, and is a generally recommended best practice.
Are there any ways to improve the MapReduce performance by changing the number of map tasks or changing the split sizes of each mapper?
For example, I have a 100GB text file and 20 nodes. I want to run a WordCount job on the text file, what is the ideal number of mappers or the ideal split size so that it can be done faster?
Would it be faster with more mappers?
Would it be faster with a smaller split size?
EDIT
I am using hadoop 2.7.1, just so you know there is YARN.
It is not necessarily faster when you use more mappers. Each mapper has a start up and setup time. In the early days of hadoop when mapreduce was the de facto standard it was said that a mapper should run ~10 minutes. Today the documentations recommends 1 minute. You can vary the number of map tasks by using setNumMapTasks(int) which you can define within the JobConf. IN the documentation of the method are very good information about the mapper count:
How many maps?
The number of maps is usually driven by the total size
of the inputs i.e. total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps
per-node, although it has been set up to 300 or so for very cpu-light
map tasks. Task setup takes awhile, so it is best if the maps take at
least a minute to execute.
The default behavior of file-based InputFormats is to split the input
into logical InputSplits based on the total size, in bytes, of input
files. However, the FileSystem blocksize of the input files is treated
as an upper bound for input splits. A lower bound on the split size
can be set via mapreduce.input.fileinputformat.split.minsize.
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
you'll end up with 82,000 maps, unless setNumMapTasks(int) is used to
set it even higher.
Your question is probably related to this SO question.
To be honest, try to have a look at modern frameworks as well, like Apache Spark and Apache Flink.
I have a large data set in the following format:
In total, there are 3687 object files. Each of which contains 2,000,000 records. Each file is 42MB in size.
Each record contains the following:
An id (Integer value)
Value1 (Integer)
Value2 (Integer)
Value3 (Integer)
The content of each file is not sorted or ordered in any way as they are observed during a data collection process.
Ideally, I want to build an index for this data. (Indexed by the id) which would mean the following:
Dividing the set of ids into manageable chunks.
Scanning the files to get data related to the current working set of ids.
Build the index.
Go over the next chunk and repeat 1,2,3.
To me this sounds fine but loading 152GB back and forth is time-consuming and wonder about the best possible approach or even whether Java is actually the right language to use for such a process.
I've 256GB of ram and 32 cores on my machine.
Update:
Let me modify this, putting aside I/O, and assuming the file is in-memory in a byte array.
What would be the fastest possible way to decode a 42MB Object file that have 2,000,000 records and each record contains 4 Integers serialized.
You've made a very poor choice of file format. I would convert the lot from serialized Integers to binary ints written with DataOutputStream.writeInt(), and read them with DataInputStream.readInt(). With buffered streams underneath in both cases. You will save masses of disk space, which will therefore save you I/O time as well, and you also save all the serialization overhead time. And change your collection software to use this format in future. The conversion will take a while, but it only happens once.
Or else use a database as suggested, again with native ints rather than serialized objects.
So, what I would do is just load up each file and store the id into some sort of sorted structure - std::map perhaps [or Java's equivalent, but given that it's probably about 10-20 lines of code to read in the filename and then read the contents of the file into a map, close the file and ask for the next file, I'd probably just write the C++ to do that].
I don't really see what else you can/should do, unless you actually want to load it into a dbms - which I don't think is at all unreasonable of a suggestion.
Hmm.. it seems the better way of doing it is to use some kind of DBMS. Load all your data into database, and you can leverage its indexing, storage and querying facility. Ofcourse this depends on what is your requirement -- and whether or now a DBMS solution suits this
Given that your available memory is > than your dataset and you want very high performance, have you considered Redis? It's well suited to operations on simple data structures and the performance is very fast.
Just be a bit careful about letting java do default serialization when storing values. I've previously run into issues with my primitives getting autoboxed prior to serialization.