I have a Spark application that reads DB records (let's say 1000 records), processes them, and writes a CSV file (with 1000 lines) to cloud object storage. I have three questions:
Is the DB read request sent to the executors? If so, with 1000 DB records, would each executor read part of the data (for example, 500 records each) and send the records back to the driver? Or does it write to a central cache that the driver then reads from?
The next step, processing the DB records (a fold job), is sent to 2 executors. Let's say each executor gets about 500 records. Once an executor finishes processing its partition, does it send all 500 processed (formatted) rows back to the driver? Or does it write to some central cache that the driver reads from? How is data exchanged between the driver and the executors?
The last step is the .save call for the CSV file in my main() function. Here I am doing a repartition(1) with the idea that the file will be saved from only one executor. If so, how is the data collected into this one executor? Remember that earlier two executors processed 500 records each. How are all 1000 records sent to the one executor that runs the .save and writes to object storage?
dataset.repartition(1)
.write()
.format("csv")
.option("header", "true")
.save(filepath);
If I don't do repartition(1), will the save happen from multiple executors, and would they overwrite each other? I don't think there is a way to specify a unique filename using Spark. Do I have to save the file to a temp location and rename it later?
Are there any articles or YouTube videos that explain how data is distributed, collected, or shared across executors? I understand how .count() works, but how does .save work, and how are large results, like millions of DB records or rows, shared across executors? I have been looking for resources but can't seem to find one that answers my questions. I am very new to Spark, like 3 weeks new.
I've been reading a bit about the Kafka concurrency model, but I still struggle to understand whether I can have local state in a Kafka Processor, or whether that will fail in bad ways?
My use case is: I have a topic of updates, I want to insert these updates into a database, but I want to batch them up first. I batch them inside a Java ArrayList inside the Processor, and send them and commit them in the punctuate call.
Will this fail in bad ways? Am I guaranteed that the ArrayList will not be accessed concurrently?
I realize that there will be multiple Processors and multiple ArrayLists, depending on the number of threads and partitions, but I don't really care about that.
I also realize I will lose the ArrayList if the application crashes, but I don't care if some events are inserted twice into the database.
This works fine in my simple tests, but is it correct? If not, why?
What you use for local state in your Kafka consumer application is up to you, so you can guarantee that only the current thread/consumer is able to access the local state in your ArrayList. If you have multiple threads, one per Kafka consumer, each thread can have its own private ArrayList or HashMap to store state in. You could also use something like a local RocksDB database for persistent local state.
A few things to look out for:
If you're batching updates together to send to the DB, are those updates in any way related, say, because they're part of a transaction? If not, you might run into problems. An easy way to ensure this is the case is to set a key for your messages with a transaction ID, or some other unique identifier for the transaction, and that way all the updates with that transaction ID will end up in one specific partition, so whoever consumes them is sure to always have the
How are you validating that you got ALL the transactions before your batch update? Again, this is important if you're dealing with database updates inside transactions. You could simply wait for a pre-determined amount of time to ensure you have all the updates (say, maybe 30 seconds is enough in your case). Or maybe you send an "EndOfTransaction" message that details how many messages you should have gotten, as well as maybe a CRC or hash of the messages themselves. That way, when you get it, you can either use it to validate you have all the messages already, or you can keep waiting for the ones that you haven't gotten yet.
Make sure you're not committing to Kafka the messages you're keeping in memory until after you've batched and sent them to the database, and you have confirmed that the updates went through successfully. This way, if your application dies, the next time it comes back up, it will get again the messages you haven't committed in Kafka yet.
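The last point can be sketched in plain Java without touching any Kafka API. This is only an illustration under assumptions: `BatchSink` and `OffsetCommitter` are hypothetical stand-ins for your database insert and for whatever commit call your consumer uses, and `BatchingBuffer` is a made-up name. The buffer is meant to be used from a single thread, and it commits only after the batch has been written successfully.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the database batch insert; throws on failure.
interface BatchSink {
    void writeAll(List<String> updates);
}

// Hypothetical stand-in for committing the consumed position back to Kafka.
interface OffsetCommitter {
    void commitUpTo(long offset);
}

// Buffers updates in a thread-confined list and commits only after a successful flush.
class BatchingBuffer {
    private final List<String> pending = new ArrayList<>();
    private final BatchSink sink;
    private final OffsetCommitter committer;
    private long lastBufferedOffset = -1;

    BatchingBuffer(BatchSink sink, OffsetCommitter committer) {
        this.sink = sink;
        this.committer = committer;
    }

    // Called once per consumed record, always from the same thread.
    void add(long offset, String update) {
        pending.add(update);
        lastBufferedOffset = offset;
    }

    // Called periodically, for example from a punctuate-style callback.
    // Returns true if there was nothing to do or the batch was written and committed.
    boolean flush() {
        if (pending.isEmpty()) {
            return true;
        }
        try {
            sink.writeAll(new ArrayList<>(pending));
        } catch (RuntimeException e) {
            // Keep the batch and do NOT commit: the records can be retried,
            // or will be re-read after a restart.
            return false;
        }
        committer.commitUpTo(lastBufferedOffset);
        pending.clear();
        return true;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

If the process dies between the database write and the commit, the same records are delivered again, so the inserts should tolerate duplicates.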
I have a system where I have multiple sensors and I need to collect data from each sensor every minute. I am using
final Runnable collector = new Runnable() { public void run() { ... } };
scheduler.scheduleAtFixedRate(collector, 0, 1, TimeUnit.MINUTES);
to start the process every minute; it starts an individual thread for each sensor. Each thread opens a MySQL connection, gets the sensor's details from the database, opens a socket to collect data, stores the data in the database, and closes the socket and the DB connection. (I make sure all the connections are closed.)
Now there are other applications which I use to generate alerts and reports from that data.
Now, as the number of sensors increases, the server is getting overloaded and the applications are getting slow.
I need some expert advice on how to optimise my system and on the best way to implement this type of system. Should I use only one application for everything (collect data, generate alarms, generate reports, generate chart images, etc.)?
Thanks in advance.
Here is the basic code for data collector application
public class OnlineSampling
{
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
public void startProcess(int start)
{
try
{
final Runnable collector = new Runnable()
{
@SuppressWarnings("rawtypes")
public void run()
{
DataBase db = new DataBase();
db.connect("localhost");
try
{
ArrayList instruments = new ArrayList();
//Check if the database is connected
if(db.isConnected())
{
String query="SELECT instrumentID,type,ip,port,userID FROM onlinesampling WHERE status = 'free'";
instruments = db.getData(query,5);
for(int i=0;i<instruments.size();i++)
{
...
OnlineSamplingThread comThread = new OnlineSamplingThread(userID,id,type,ip,port,gps,unitID,parameterID,timeZone,units,parameters,scaleFactors,offsets,storageInterval);
comThread.start();
//This onlineSamplingThread opens the socket and collects the data and does few more things
}
}
} catch (Exception e)
{
e.printStackTrace();
}
finally
{
//Disconnect from the database
db.disconnect();
}
}
};
scheduler.scheduleAtFixedRate(collector, 0, 60 , TimeUnit.SECONDS);
} catch (Exception e) {}
}
}
UPDATED:
How many sensors do you have? We have around 400 sensors (increasing).
How long is data-gathering session with each sensor?
Each sensor has a small web server with a SIM card in it to connect to the internet. It depends on the 3G network; in normal conditions it does not take more than 3.5 seconds.
Are you closing the network connections properly after you're done with a single sensor? I make sure I close the socket everytime, I have also set the timeout duration for each socket which is 3.5 seconds.
What OS are you using to collect sensor data? We have our own protocol to communicate with the sensors using socket programming.
Is it configured as a server or a desktop? Each sensor is a server.
What you probably need is connection pooling - instead of opening one DB connection per sensor, have a shared pool of opened connections that each thread uses when it needs to access the DB. That way, the number of connections can be much smaller than the number of your sensors (assuming that most of the time, the program will do other things than read/write into the DB, like communicate with the sensor or wait for sensor response).
If you don't use a framework that has connection pooling feature, you can try Apache Commons DBCP.
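To make the idea concrete, here is a minimal sketch of what a pool does, written with only the standard library. It is not DBCP's API: `SimplePool` and `withResource` are hypothetical names, and a real pool also validates, times out, and replaces broken connections. A fixed number of resources is created once, and each thread borrows one only for the short time it actually needs it.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Minimal fixed-size pool: resources are created once and reused by many threads.
class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Borrows a resource, runs the work, and always returns the resource to the pool.
    <R> R withResource(Function<T, R> work) {
        T resource;
        try {
            resource = idle.take(); // blocks while all resources are in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled resource", e);
        }
        try {
            return work.apply(resource);
        } finally {
            idle.add(resource); // hand the resource back even if the work failed
        }
    }

    int idleCount() {
        return idle.size();
    }
}
```

With 400 sensors that each touch the database for a fraction of a second, a pool of a handful of connections would typically be enough; the right size is something to measure rather than guess.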
Reuse any open files or sockets whenever you can. DBCP is a good start.
Reuse any threads if you can. That "comThread" is very suspect in that regard.
Consider adding queues to your worker threads. This will allow you to have threads that process tasks/jobs serially.
Profile, Profile, Profile!! You really have no idea what to optimize until you profile. JProfiler and YourKit are both very popular, but there are some free tools such as Netbeans and VisualVM.
Use Database caching such as Redis or Memcache
Consider using Stored Procedures versus inline queries
Consider using a Service Oriented Architecture or Micro-Services. Splitting each application function into a separate service which can be tightly optimized for that function.
These suggestions are based on the small amount of code you posted; profiling should give you a much better idea.
Databases are made to handle far more than "hundreds of inserts" per minute. In fact, a MySQL database can easily handle hundreds of inserts per second. So your problem is probably not related to the load.
The first goal is to find out what is slow or what is collapsing. Run all the queries that your application runs and see whether any of them is abnormally slow compared to the others. Alternatively, configure the Slow Query Log (https://dev.mysql.com/doc/refman/5.0/en/slow-query-log.html) with parameters that fit your problem, and then analyze the output.
Once you find what the problem is, you can ask for help here with more information. We have no way to help you with the information provided.
However, just as a hunch, what is the max_connections parameter value for your database? The default value is 100 or 151, I think, so if more than 151 sensors are connected to the database at the same time, new incoming connections will be refused (MySQL reports "Too many connections"). If that's your issue, you just have to minimise the time sensors stay connected to your database.
Your system is (almost certainly) slowing down because of the enormous overhead of starting threads, opening database connections, and then closing them. 300 sensors means five of these operations per second, continuously. That's too many.
Here's what you need to do to make this scalable.
First step
Make your sampling program long-running, rather than starting it over frequently.
Have it start a sensor thread for each 20 sensors (approximately).
Each thread will query its sensors one by one and insert the results into some sort of thread-safe data structure. A Bag or a Queue would be suitable.
When your sensor threads come to the end of each minute's work, make each of them sleep for the remaining time before the next minute starts, then start over.
Have your program start a single database-writing thread. That thread will open a database connection and hold it open. It will then take results from the queue and write them to the database, waiting when no results are available.
The database-writing thread should start a MySQL transaction, then INSERT some number of rows (ten to 100), then Commit the transaction and start another one, rather than using the default autocommit behavior. (If you're using MyISAM tables, you don't need to do this.)
This will drastically improve your throughput and reduce your MySQL overhead.
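A rough sketch of the database-writing thread from the first step, using only the standard library. The names are assumptions for illustration: `BatchingDbWriter` is hypothetical, and `insertInOneTransaction` stands in for your real code that would insert the rows and commit once (with JDBC, roughly `setAutoCommit(false)`, a batch of INSERTs, then `commit()`). Sensor threads put readings on the shared queue; the writer waits when the queue is empty and otherwise writes up to `maxBatch` rows per transaction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Single long-lived writer: takes readings off a shared queue and writes them in batches.
class BatchingDbWriter implements Runnable {
    // Sentinel enqueued last to stop the writer; compared by identity, never by equals().
    static final String STOP = new String("STOP");

    private final BlockingQueue<String> readings;
    private final int maxBatch;
    // Hypothetical hook: inserts all rows inside ONE transaction.
    private final Consumer<List<String>> insertInOneTransaction;

    BatchingDbWriter(BlockingQueue<String> readings, int maxBatch,
                     Consumer<List<String>> insertInOneTransaction) {
        this.readings = readings;
        this.maxBatch = maxBatch;
        this.insertInOneTransaction = insertInOneTransaction;
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>(maxBatch);
        try {
            while (true) {
                batch.add(readings.take());            // waits when no results are available
                readings.drainTo(batch, maxBatch - 1); // adds whatever else is already queued
                int last = batch.size() - 1;
                boolean stopping = batch.get(last) == STOP;
                if (stopping) {
                    batch.remove(last);                // drop the sentinel itself
                }
                if (!batch.isEmpty()) {
                    insertInOneTransaction.accept(new ArrayList<>(batch));
                }
                batch.clear();
                if (stopping) {
                    return;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();        // treat interruption as a shutdown request
        }
    }
}
```

In the real program the writer would run on its own thread (`new Thread(writer).start()`), and the sensor threads would call `readings.put(...)` as results arrive.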
Second step
When your workload gets too big for a single program instance with multiple sensor threads to handle, run multiple instances of the program, each with its own list of sensors.
Third step
When the workload gets too big for a single machine, add another one and run new instances of your program on that new machine.
Collecting data from hundreds of sensors should not pose a performance problem if done correctly. To scale this process you should carefully manage your database connections as well as your sensor connections and you should leverage queues for the sensor-sampling and sensor-data writing processes. If your sensor count is stable, you can cache the sensor connection data, possibly with periodic updates to your sensor connection cache.
Use a connection pool to talk to your database. Query your database for the sensor connection information, then release that connection back to the pool as soon as possible -- do not keep the database connection open while talking to the sensor. It's likely reading sensor connection data (which talks to your database) can be done in a single thread, and that thread creates sensor sampling jobs for your executor.
Within each sensor sampling job, open the HTTP sensor connection, collect sensor data, close HTTP sensor connection, and then create a sensor data write job to write the sampling data to the database. Assuming your sensors are distinct nodes, an HTTP sensor connection pool is not likely to help much because HTTP client and server connections are relatively light (unlike database connections).
Writing sensor-sampling data back to the database should also be made in a queue and these database write jobs should use your database connection pool.
With this design, you should be able to easily handle hundreds of sensors and likely thousands of sensors with modest hardware running a Linux server OS as the collector and a properly configured database.
I suggest you test these processes independently, so you know the sustainable rates for each step:
reading and caching sensor connection data and create sampling jobs;
execute sampling jobs and create writing jobs; and,
execute sample data writing jobs.
Let me know if you'd like code as well.
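A minimal sketch of the job pipeline described above, assuming two separate thread pools. `SamplingPipeline`, `sampleSensor`, and `writeSample` are hypothetical names: `sampleSensor` stands in for opening the sensor connection and reading one sample, and `writeSample` stands in for the database write that would borrow a pooled connection. Each sampling job creates a write job when it finishes, so no database connection is held while talking to a sensor.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

class SamplingPipeline {
    private final ExecutorService samplers;
    private final ExecutorService writers;
    private final Function<String, String> sampleSensor; // hypothetical: address -> sample
    private final Consumer<String> writeSample;          // hypothetical: store one sample

    SamplingPipeline(int samplerThreads, int writerThreads,
                     Function<String, String> sampleSensor, Consumer<String> writeSample) {
        this.samplers = Executors.newFixedThreadPool(samplerThreads);
        this.writers = Executors.newFixedThreadPool(writerThreads);
        this.sampleSensor = sampleSensor;
        this.writeSample = writeSample;
    }

    // Creates one sampling job per sensor; each job queues a write job for its result.
    void submitAll(List<String> sensorAddresses) {
        for (String address : sensorAddresses) {
            samplers.execute(() -> {
                String sample = sampleSensor.apply(address);
                writers.execute(() -> writeSample.accept(sample));
            });
        }
    }

    // Waits for all sampling jobs, then for all write jobs.
    // Returns false on timeout or interruption.
    boolean shutdownAndWait(long timeoutSeconds) {
        try {
            samplers.shutdown();
            if (!samplers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                return false;
            }
            writers.shutdown(); // safe now: no sampler can submit further write jobs
            return writers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Keeping the two pools separate lets you size and measure each stage independently, which matches the suggestion to test the steps on their own.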
I have the following scenario:
The server sends a lot of information over a socket, so I need to read this information and validate it. The idea is to use 20 threads and batches: each time the batch size reaches 20, the thread must send the information to the database and keep reading from the socket, waiting for more.
I don't know what the best way to do this would be. I was thinking:
Create a Socket that will read the information.
Create an Executor (Executors.newFixedThreadPool(20)), validate the information, and add each line to a list; when the size is 20, execute the Runnable class that will send the information to the database.
Thanks in advance for your help.
You don't want to do this with a whole bunch of threads. You're better off using a producer-consumer model with just two threads.
The producer thread reads records from the socket and places them on a queue. That's all it does: read record, add to queue, read next record. Lather, rinse, repeat.
The consumer thread reads a record from the queue, validates it, and writes it to the database. If you want to batch the items so that you write 20 at a time to the database, then you can have the consumer add the record to a list and when the list gets to 20, do the database update.
You probably want to look up information on using the Java BlockingQueue in producer-consumer programs.
You said that you might get a million records a day from the socket. That's only about 12 records per second (1,000,000 / 86,400 seconds). Unless your validation is hugely processor intensive, a single thread could probably handle 1,200 records per second with no problem.
In any case, your major bottleneck is going to be the database updates, which probably won't benefit from multiple threads.
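Here is a minimal sketch of that two-thread design with a BlockingQueue. The names are assumptions: `ProducerConsumerSketch` and `run` are hypothetical, `isValid` stands in for your validation, and `writeBatch` stands in for the database update. In a real program `source` would wrap the socket's input stream; here it is any BufferedReader. The consumer runs on the calling thread and flushes a final partial batch once the producer has finished.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Predicate;

class ProducerConsumerSketch {

    static void run(BufferedReader source, int batchSize,
                    Predicate<String> isValid, Consumer<List<String>> writeBatch) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

        // Producer: read a record, add it to the queue, read the next record.
        Thread producer = new Thread(() -> {
            try {
                String line;
                while ((line = source.readLine()) != null) {
                    queue.put(line); // blocks if the consumer falls far behind
                }
            } catch (IOException e) {
                System.err.println("read failed: " + e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "socket-reader");
        producer.start();

        // Consumer: validate each record and write to the database in batches.
        List<String> batch = new ArrayList<>(batchSize);
        try {
            while (true) {
                String item = queue.poll(100, TimeUnit.MILLISECONDS);
                if (item == null) {
                    // Nothing queued right now: stop only once the producer is done.
                    if (!producer.isAlive() && queue.isEmpty()) {
                        break;
                    }
                    continue;
                }
                if (!isValid.test(item)) {
                    continue; // skip records that fail validation
                }
                batch.add(item);
                if (batch.size() == batchSize) {
                    writeBatch.accept(new ArrayList<>(batch));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                writeBatch.accept(new ArrayList<>(batch)); // final partial batch
            }
        } catch (InterruptedException e) {
            producer.interrupt();
            Thread.currentThread().interrupt();
        }
    }
}
```

The bounded queue gives back-pressure: if the database stalls, the reader eventually blocks instead of filling memory.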
I have a web service that receives multiple XML files at a time, each containing student data.
I need to process each file and store its values in the database.
For that I have used a JMS queue: I create an object message and push it to the queue.
But while the queue is processing one message, other messages become available for processing, and because of that my database table gets locked.
Consider that I have a list containing 5000 values, and I iterate over it in a for loop, processing JMS messages.
That is exactly my scenario. The problem is that while one message is being processed my table gets locked, and the rest of the files remain in the queue.
Please suggest a solution.
Make sure you use the right lock strategy (see table-level locking and row-level locking).
See if you can process your messages one at a time (via the JMS consumer configuration); this way, the first message will release the lock for the second one, and so on.
If I understand you correctly, the database handling is in the listener that's taking messages off the queue.
You have to worry about database isolation and table/row locking, because each listener runs in its own thread.
You'll either have to lock rows or set the ISOLATION level on your database to SERIALIZABLE to guarantee that only one thread at a time will INSERT or UPDATE the table.