collect takes more time than first for small dataset - java

I have data saved as a single partition on HDFS (in bytes). When I read its content with the code below, collect takes more time than first on that single partition.
JavaRDD<String> mytext = sc.textFile("...");
List<String> lines = mytext.collect();
I was expecting collect and first to take the same time. Yet collect is slower than first for data in a single partition of HDFS.
What might be the reason behind this?

rdd.first() doesn't have to scan the whole partition; it fetches only the first item and returns it.
rdd.collect() has to scan the whole partition, collect all of it, and send all of it back to the driver (serialization + deserialization costs, etc.).
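For illustration, a minimal sketch (assuming the same sc and HDFS path as in the question, with the path left as a placeholder) that times both actions:
JavaRDD<String> mytext = sc.textFile("...");

long t0 = System.nanoTime();
String firstLine = mytext.first();          // reads only the first element of the partition
long t1 = System.nanoTime();
List<String> lines = mytext.collect();      // scans, serializes and ships the whole partition to the driver
long t2 = System.nanoTime();

System.out.println("first():   " + (t1 - t0) / 1_000_000 + " ms");
System.out.println("collect(): " + (t2 - t1) / 1_000_000 + " ms");
Note that whichever action runs first also pays the JVM warm-up cost described in the next answer, so run each action a few times before comparing the numbers.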
The reason (see the apache-spark-developers forum) is likely that first() is executed entirely on the driver node in the same process, while collect() needs to connect to the worker nodes.
Usually the first time you run an action, most of the JVM code is not yet optimized, and the classloader also has to load a lot of classes on the fly. Having to connect to other processes via RPC can therefore slow down the first execution of collect.
That said, if you run this a few times (in the same driver program) and collect is still much slower, you should look into other factors such as network congestion or CPU/memory load on the workers.

Related

Reading records from a database in parallel in Java

I have a database with millions of records. We read the records one by one in Java and insert them into another system on a daily basis, after end of day. We have been told to make this faster.
I suggested creating multiple threads with a thread pool so that the threads read the data in parallel and push it into the other system, but I don't know how to prevent the threads from reading the same data twice. How can we make this faster while keeping the data consistent? Is multithreading in Java the right approach, or is there another way to achieve it?
One possible solution for your task would be to take the ids of the records in your database, split them into chunks (e.g. of 1000 each), and call JpaRepository.findAllById(Iterable<ID>) from Runnables passed to ExecutorService.submit().
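A minimal sketch of that idea, assuming a Spring Data repository for a RecordEntity with Long ids; recordRepository, fetchAllIds() and pushToOtherSystem() are illustrative names, not an existing API:
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newFixedThreadPool(8);
List<Long> ids = fetchAllIds();            // e.g. a custom query: select r.id from RecordEntity r
int chunkSize = 1000;

for (int i = 0; i < ids.size(); i += chunkSize) {
    List<Long> chunk = ids.subList(i, Math.min(i + chunkSize, ids.size()));
    pool.submit(() -> {
        // each task loads only its own ids, so no two threads read the same records
        List<RecordEntity> records = recordRepository.findAllById(chunk);
        pushToOtherSystem(records);
    });
}

pool.shutdown();   // then await termination before exiting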
If you don't want to do it manually then you could have a look into Spring Batch. It is designed particularly for bulk transformation of large amounts of data.
I think you should identify the slowest part in this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of round trips between the Java application and the database: stop reading records one by one and move to bulk reading. Namely, read, say, 2000 records at once from the DB into memory and process the whole bulk (a sketch follows this list). Consider even larger numbers (like 5000), but you should really measure this; it depends on the memory of the Java application and other factors. In any case, if there is an issue, discard the bulk.
The data itself might not be organized correctly: when you read a bulk of data you might need to order it by some criteria, so make sure the query doesn't cause a full table scan, define indices properly, etc.
If applicable, talk to your DBA; he/she might provide additional insights about the data management itself: partitioning, storage-related optimizations, etc.
If all this fails and reading from the DB is still the bottleneck, consider redesigning the flow (for instance, publish messages to Kafka if you have it); these can be naturally partitioned so you can scale out the whole process, but this might be beyond the scope of this question.
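To make the bulk-reading point concrete, a minimal sketch with plain JDBC, assuming a records table with a numeric id primary key; the table/column names, the MyRecord class, pushToOtherSystem() and the bulk size of 2000 are illustrative, the LIMIT syntax depends on your database, and exception handling is omitted:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

String sql = "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT 2000";
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement(sql)) {
    long lastId = 0;
    while (true) {
        ps.setLong(1, lastId);
        List<MyRecord> bulk = new ArrayList<>();
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                lastId = rs.getLong("id");
                bulk.add(new MyRecord(lastId, rs.getString("payload")));
            }
        }
        if (bulk.isEmpty()) {
            break;                      // no more rows
        }
        pushToOtherSystem(bulk);        // one round trip per 2000 rows instead of per row
    }
}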

Is there a Spark like Accumulator for Kafka Streams?

Spark has a useful API for accumulating data in a thread-safe way, https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.util.AccumulatorV2, and comes with some useful out-of-the-box accumulators, e.g. for Longs: https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.util.LongAccumulator
I usually use accumulators for wiring debugging, profiling, monitoring and diagnostics into Spark jobs. I usually fire off a Future before running a Spark job to periodically print the stats (e.g. TPS, histograms, counts, timings, etc.).
So far I cannot find anything similar for Kafka Streams. Does anything exist? I imagine this is possible at least per instance of a Kafka Streams app, but making it work across several instances would require creating an intermediate topic.
Kafka Streams avoids concurrency by design -- if the accumulated data does not need to be fault-tolerant, you can keep it in memory and flush it out via a wall-clock-time punctuation.
If it needs to be fault-tolerant, you can use a state store and scan the whole store in a punctuation to flush it out.
This will give you task-level accumulation. I'm not sure how Spark's accumulator works in detail, but if it gives you a "global" view, I assume it needs to send data over the network, and then only a single instance has access to the data (or maybe it uses a broadcast instead -- not sure how consistency would be guaranteed in the broadcast case). Similarly, you could send the data to a topic with a single partition to collect all the data globally in one place.
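A minimal sketch of the in-memory variant, assuming a recent Kafka Streams version (the typed Processor API and the Duration-based schedule()); the record types and the 30-second interval are illustrative:
import java.time.Duration;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class CountingProcessor implements Processor<String, Long, Void, Void> {

    private long count = 0;   // task-local accumulator, kept in memory (not fault-tolerant)

    @Override
    public void init(ProcessorContext<Void, Void> context) {
        // flush the accumulated value every 30 seconds of wall-clock time
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            System.out.println("records seen by this task: " + count);
            // for a fault-tolerant variant, replace the field with a state store
            // and scan the store here instead
        });
    }

    @Override
    public void process(Record<String, Long> record) {
        count++;   // accumulate per record
    }
}
You would then wire this processor into the topology, e.g. via Topology#addProcessor or KStream#process.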

How to store a big amount of data

I have a program that at the start generates a big amount of data (several GB, possibly more than 10 GB) and then several times processes all the data, does something, processes all the data, does something... That much data doesn't fit into my RAM, and when it starts paging it's really painful. What is the optimal way to store my data and, in general, how should I solve this problem?
Should I use a DB even though I don't need to keep the data after my program ends?
Should I split my data somehow and just save it into files and load them when I need them? Or just keep using RAM and live with the paging?
With a DB and files there is a problem: I have to process the data in pieces. So I load a chunk of data (let's say 500 MB), calculate, load the next chunk, and after I have loaded and calculated everything, I can do something and repeat the cycle. That means I would read from the HDD the same chunks of data I read in the previous cycle.
Try to reduce the amount of data.
Try to modify the algorithm to extract the relevant data at an early stage.
Try to divide and/or parallelize the problem, and execute it over several clients in a cluster of computing nodes.
A file-based approach will be enough for your task; a couple of samples:
Use BufferedReader's skip() method
RandomAccessFile
Read up on these two, and the problem of re-reading duplicate chunks should go away.
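For the RandomAccessFile option, a minimal sketch that jumps straight to a chunk instead of re-reading everything before it; the file name and the 500 MB chunk size are illustrative:
import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkReader {
    private static final long CHUNK_SIZE = 500L * 1024 * 1024;   // 500 MB per chunk

    public static byte[] readChunk(String file, int chunkIndex) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(chunkIndex * CHUNK_SIZE);                    // jump directly to the chunk
            long remaining = raf.length() - raf.getFilePointer();
            byte[] buffer = new byte[(int) Math.min(CHUNK_SIZE, remaining)];
            raf.readFully(buffer);                                // read only this chunk into memory
            return buffer;
        }
    }
}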
You should definitely try to reduce the amount of data and use multiple threads to handle it.
FutureTask could help you:
import java.math.BigDecimal;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

ExecutorService exec = Executors.newFixedThreadPool(5);
FutureTask<BigDecimal> task1 = new FutureTask<>(new Callable<BigDecimal>() {
    @Override
    public BigDecimal call() throws Exception {
        return doBigProcessing();   // your long-running computation
    }
});
// start the future task asynchronously
exec.execute(task1);
// do other stuff
// block until the processing is over
BigDecimal result = task1.get();
In the same way, you could consider caching the future task to speed up your application if possible.
If that is not enough, you could use the Apache Spark framework to process large datasets.
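If you go that route, a minimal sketch assuming a local Spark setup and an input file data.txt; Spark streams partitions through memory rather than loading the whole dataset at once:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("big-data-pass").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> data = sc.textFile("data.txt");                       // partitions are processed lazily
long interestingLines = data.filter(line -> line.contains("ERROR")).count();
sc.close();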
Before you think about performance, you must consider the following:
Find a good data structure for the data.
Find good algorithms to process the data.
If you do not have enough memory, use a memory-mapped file to work on the data (see the sketch below).
If you have a chance to process the data without loading all of it, divide and conquer.
And please give us more details.
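For the memory-mapped file suggestion, a minimal sketch using Java NIO; the file name data.bin and the 256 MB window are illustrative. The OS pages the mapped region in and out on demand, so the whole file never has to fit on the heap:
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");
             FileChannel channel = raf.getChannel()) {
            long windowSize = 256L * 1024 * 1024;                 // map 256 MB at a time
            for (long pos = 0; pos < channel.size(); pos += windowSize) {
                long len = Math.min(windowSize, channel.size() - pos);
                MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (window.hasRemaining()) {
                    byte b = window.get();                        // process bytes without loading all data
                    // ... accumulate whatever the algorithm needs ...
                }
            }
        }
    }
}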

Getting a huge amount of data from a database in the most efficient way

In my application I have to read a huge amount of data. After I have got all of my data, I put it in a list, process it, and work accordingly.
Now I was wondering if I can do anything, anything at all, to speed up getting the data from the database. My database sits on a different server and I am working with Java to interact with it.
I don't have a definite size of the data, i.e. a specific number of rows that I need to process. I also hear I can go for multithreading, but then how do I go about it, since I won't know how to partition my data given that its size is indefinite? I.e. if the following pseudo code is to be applied:
for(i=0 to number of partition) // Not certain on the number of partitions
create new thread and get data.
Or maybe I can hash the data on the basis of some attribute and later tell each thread to fetch a particular index of the map, but then how do I build that map before even fetching the data?
What possible solutions can I look into, and how do I go about them? Let me know if you need any more info.
Thanks.
I hear I can go for multithreading, but then how do I go about it?
This is definitely a good choice for speeding up querying information from a remote server.
Usually in these tasks the IO with the server is the main bottleneck, and with multithreading one can ask for multiple rows concurrently, effectively reducing the IO wait times.
but then how do I go about it?
The idea is to split the work into smaller tasks. Have a look at the Java high-level concurrency API for more details.
One solution is to let each thread read a chunk of size M from the server, and repeat the process in each thread while the server still has data. Something like this (for each thread):
data = "start";
int chunk = threadNumber;
while (data != null) {
requestChunk(chunk);
chunk += numberOfThreads;
}
I assume here that once you are "out of bound" the server returns null (or requestChunk() processes it and returns null).
Or maybe I can hash data on the basis of some attribute and later tell each thread to fetch a particular index of the map
If you need to iterate over the data and retrieve all of it, hashing is usually a bad solution. It is very cache-inefficient and the overhead is just too big for these cases.

HBase Multithreaded Scan is really slow

I'm using HBase to store some time series data. Using the suggestion in the O'Reilly HBase book I am using a row key that is the timestamp of the data with a salted prefix. To query this data I am spawning multiple threads which implement a scan over a range of timestamps with each thread handling a particular prefix. The results are then placed into a concurrent hashmap.
Trouble occurs when the threads attempt to perform their scans. A query that normally takes approximately 5600 ms when done serially takes between 40000 and 80000 ms when 6 threads are spawned (corresponding to 6 salts/region servers).
I've tried to use HTablePools to get around what I thought was an issue with HTable being not thread-safe, but this did not result in any better performance.
In particular, I am noticing a significant slowdown when I hit this portion of my code:
for (Result res : rowScanner) {
    // add Result to HashMap
}
Through logging I noticed that every time through the loop condition I experienced delays of many seconds. These delays do not occur if I force the threads to execute serially.
I assume that there is some kind of issue with resource locking but I just can't see it.
Make sure that you are setting the BatchSize and Caching on your Scan objects (the objects you use to create the Scanners). These control how many rows are transferred over the network at once, and how many are kept in memory for fast retrieval on the RegionServer itself. By default they are both way too low to be efficient; BatchSize in particular will dramatically increase your performance. For example:
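A sketch assuming the HBase 1.x client API; the table handle, row-key range, column family and the numbers are illustrative and should be tuned against your own data:
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan(Bytes.toBytes(saltedStartKey), Bytes.toBytes(saltedStopKey));
scan.setCaching(1000);                       // rows fetched per RPC (the default is far too small)
scan.setBatch(100);                          // columns returned per Result, useful for very wide rows
scan.addFamily(Bytes.toBytes("cf"));

try (ResultScanner rowScanner = table.getScanner(scan)) {
    for (Result res : rowScanner) {
        // add Result to the concurrent map
    }
}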
EDIT: Based on the comments, it sounds like you might be swapping either on the server or on the client, or that the RegionServer may not have enough space in the BlockCache to satisfy your scanners. How much heap have you given to the RegionServer? Have you checked to see whether it is swapping? See How to find out which processes are swapping in linux?.
Also, you may want to reduce the number of parallel scans, and make each scanner read more rows. I have found that on my cluster, parallel scanning gives me almost no improvement over serial scanning, because I am network-bound. If you are maxing out your network, parallel scanning will actually make things worse.
Have you considered using MapReduce, with perhaps just a mapper, to easily split your scan across the region servers? It's easier than worrying about threading and synchronization in the HBase client libs. The Result class is not thread-safe. TableMapReduceUtil makes it easy to set up jobs:
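A minimal sketch of that approach, assuming the HBase MapReduce utilities are on the classpath; the table name timeseries and the mapper are illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class ScanJob {
    // each mapper scans one region, so the work is split across region servers for you
    static class MyMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result value, Context context) {
            // process one row here
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "timeseries-scan");
        job.setJarByClass(ScanJob.class);
        Scan scan = new Scan();
        scan.setCaching(1000);
        TableMapReduceUtil.initTableMapperJob(
                "timeseries", scan, MyMapper.class,
                ImmutableBytesWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0);   // mapper-only job
        job.waitForCompletion(true);
    }
}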
