I am new to Hadoop and Kafka. I inherited code for a Kafka consumer that runs on a desktop Windows machine, receives the HDFS location of new XML data available on a remote cluster, downloads the data for processing, and writes the result back out to the HDFS cluster.
It seems to me that the consumer should run on the cluster, because that's where the data is, but all the sample Kafka consumer code I see suggests that producers/consumers run on regular desktop machines. What is the typical target platform for a Kafka consumer?
Producers and consumers can run anywhere. The examples you see imply desktop execution because that code is much simpler than, say, code running within a Storm topology, and examples tend to be overly simple. The only real reason for a desktop environment would be the presence of a UI for the application.
If the application is headless, then it does make a lot of sense to move the execution as close to the data (both Kafka and HDFS) as possible.
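To make the point concrete, here is a minimal sketch of a plain Java consumer modeled on your HDFS-path scenario; broker address, group id, and topic name are placeholders, and a recent kafka-clients library (2.x) is assumed. Nothing in it ties the process to a desktop machine: it only needs network access to the brokers (and to HDFS for the actual download), so the same code can run on a cluster edge node.
// Hypothetical sketch: a consumer that receives HDFS paths as string messages.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class HdfsPathConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
        props.put("group.id", "xml-processor");            // placeholder consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("new-xml-paths")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // record.value() would be the HDFS path to fetch and process
                    System.out.println("Got HDFS path: " + record.value());
                }
            }
        }
    }
}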
How do I create a Kafka stream that runs at a specific time every day, reads messages from a topic, does some transformations, and writes the messages back to a different topic?
For instance, a stream that runs at 9 pm every day, fetches all the messages pushed to a topic, and writes them to another topic.
I tried windowing, but all the examples pertained to aggregation only; I don't need to do aggregation.
I am using the Java DSL.
Write plain Java code to do what you want and configure crontab to run it at the time you want.
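A minimal sketch of that approach follows; topic names, broker address, and group id are placeholders. It is a run-once program that drains whatever is currently on the source topic, republishes it to the target topic, and exits, so cron can start it again the next day.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NightlyTopicCopy {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "broker1:9092");   // placeholder
        consumerProps.put("group.id", "nightly-copy");            // placeholder
        consumerProps.put("auto.offset.reset", "earliest");       // read from the start on the first run
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "broker1:9092");   // placeholder
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("source-topic"));
            ConsumerRecords<String, String> records;
            // keep polling until no new records arrive, then exit so cron can rerun tomorrow
            while (!(records = consumer.poll(Duration.ofSeconds(5))).isEmpty()) {
                for (ConsumerRecord<String, String> record : records) {
                    // apply any transformation to record.value() here before republishing
                    producer.send(new ProducerRecord<>("target-topic", record.key(), record.value()));
                }
                consumer.commitSync();
            }
        }
    }
}
The 9 pm schedule would then come from a crontab entry along the lines of 0 21 * * * java -jar nightly-copy.jar (the jar name is hypothetical).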
I am testing my first Spark Streaming pipeline, which processes messages from Kafka. However, after several test runs, I got the following error message:
There is insufficient memory for the Java Runtime Environment to continue.
My test data is really small, so this should not happen. After looking into the processes, I realized that maybe previously submitted Spark jobs were not removed completely?
I usually submit jobs like below, and I am using Spark 2.2.1
/usr/local/spark/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 ~/script/to/spark_streaming.py
And stop it using Ctrl+C.
The last few lines of the script look like:
ssc.start()
ssc.awaitTermination()
Update
After changing the way I submit the Spark Streaming job (command below), I still ran into the same issue: after killing the job, the memory is not released. I only started Hadoop and Spark on those 4 EC2 nodes.
/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 --py-files ~/config.py --master spark://<master_IP>:7077 --deploy-mode client ~/spark_kafka.py
When you press Ctrl-C, only the submitter process is interrupted; the job itself continues to run. Eventually your system runs out of memory, so no new JVM can be started.
Furthermore, even if you restart the cluster, all previously running jobs will be restarted.
Read how to stop a running Spark application properly.
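For illustration, here is a minimal sketch using Spark's Java streaming API (your script is PySpark, but the same configuration property applies). With spark.streaming.stopGracefullyOnShutdown set, sending a normal SIGTERM to the driver JVM (a plain kill of the driver process, not kill -9) lets in-flight batches finish and the streaming context shut down cleanly, instead of relying on Ctrl-C in the submitting shell. The socket source below is only a stand-in for the real Kafka stream.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StoppableStreamingJob {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("stoppable-streaming-job")
                // ask Spark to stop the streaming context gracefully when the JVM shuts down
                .set("spark.streaming.stopGracefullyOnShutdown", "true");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // stand-in source; in the real job this would be the Kafka direct stream
        ssc.socketTextStream("localhost", 9999).print();

        ssc.start();
        ssc.awaitTermination();
    }
}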
The problem might be a bunch of driver processes (spark-app-driver) still running on the host you use to submit the Spark job. Try something like
ps aux --forest
or similar, depending on your platform, to understand what processes are running at the moment. Or you can have a look at the answers to the Stack Overflow question Spark Streaming with Actor Never Terminates; it might give you a clue about what is happening.
I have a Java application that does Flink batch processing of a batch obtained by querying tables from the database and feeds the result into a Kafka topic. How would I make this run periodically? Is there a Flink scheduler? For example, my Java application should keep running in the background, and the Flink scheduler should periodically query the tables from the database, batch-process them with Flink, and feed the result into Kafka (the Flink batch processing and feeding into Kafka are already done as part of my application). Please help if anyone has pointers on this.
Flink does not provide a job scheduler.
Have you considered implementing the use case with a continuously running Flink DataStream application? You could implement a SourceFunction that periodically queries the database.
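A rough sketch of such a source, assuming the DataStream API; the class name, the query helper, and the hourly interval below are made up for illustration:
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class PeriodicJdbcSource implements SourceFunction<String> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // query the database here (e.g. via plain JDBC) and emit each row downstream
            for (String row : queryDatabase()) {
                ctx.collect(row);
            }
            Thread.sleep(60 * 60 * 1000L); // wait before the next query; interval is a placeholder
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    // placeholder for the actual JDBC query
    private java.util.List<String> queryDatabase() {
        return java.util.Collections.emptyList();
    }
}
You would then wire it in with env.addSource(new PeriodicJdbcSource()) and keep the existing Kafka sink for the output.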
Continuous streaming applications have the benefits of fewer moving parts (no scheduler, no failure handling if something goes wrong) and a consistent view across the boundaries of "batches". The downside is that the job is always consuming resources (Flink cannot automatically scale down at low load).
I am confused about how the DataNodes in a Hadoop cluster run the Java code for the reduce function of a job. How does Hadoop send Java code to another computer to execute?
Does Hadoop inject the Java code into the nodes? If so, where is the Java code located in Hadoop?
Or are the reduce functions run on the master node rather than the DataNodes?
Help me trace how the master node sends the Java code for the reduce function to a DataNode.
Here is what happens, step by step:
You run the job on the client using the hadoop jar command, passing the jar file name, the class name, and other parameters such as the input and output paths.
The client gets a new application id and then copies the jar file and other job resources to HDFS with a high replication factor (by default 10 on large clusters).
The client then actually submits the application through the ResourceManager.
The ResourceManager keeps track of cluster utilization and launches an ApplicationMaster (which coordinates the job execution).
The ApplicationMaster talks to the NameNode to determine where the input blocks are located and then works with the NodeManagers to launch the tasks (in the form of containers).
Containers are essentially JVMs, and they run the map and reduce tasks (the mapper and reducer classes). When a JVM is bootstrapped, the job resources stored on HDFS are copied to it. For mappers, these JVMs are created on the same nodes where the data blocks exist, so the jar is typically executed against data that is local to that machine.
To answer your question: the reducers run on one or more data nodes as part of the containers. The Java code is copied as part of the bootstrap process (when the JVM is created), and the data is fetched from the mappers over the network.
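As a concrete illustration, here is a minimal word-count style driver (a generic sketch, not your job; class names and paths are placeholders). The job.setJarByClass(...) call is what tells Hadoop which jar to copy to HDFS and then ship to each container, and the mapper/reducer classes inside that jar are what the containers execute on the data nodes. You would submit it with something like hadoop jar wordcount.jar WordCountDriver /input /output.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Mapper: runs in containers on the nodes that hold the input blocks
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: also runs in containers on data nodes, pulling map output over the network
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        // tells Hadoop which jar to copy to HDFS and distribute to the containers
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}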
No. Reduce functions are executed on data nodes.
Hadoop transfers the packaged code (jar files) to the data nodes that are going to process the data. At run time the data nodes download this code and run the task.
We are running our calculations in a standalone Spark cluster, ver 1.0.2 - the previous major release. We do not have any HA or recovery logic configured.
A piece of functionality on the driver side consumes incoming JMS messages and submits respective jobs to spark.
When we bring the single and only Spark master down (for tests), it seems the driver program is unable to figure out that the cluster is no longer usable. This results in two major problems:
The driver tries endlessly to reconnect to the master, or at least we couldn't wait long enough for it to give up.
Because of the previous point, submission of new jobs blocks (in org.apache.spark.scheduler.JobWaiter#awaitResult). I presume this is because the cluster is not reported unreachable/down, and the submission logic simply waits until the cluster comes back. For us this means that we run out of JMS listener threads very fast, since they all get blocked.
There are a couple of Akka failure-detection-related properties that you can configure in Spark, but:
The official documentation strongly recommends against enabling Akka's built-in failure detection.
I really want to understand how this is supposed to work by default.
So, can anyone please explain what the designed behavior is if the single Spark master in a standalone deployment fails/stops/shuts down? I wasn't able to find any proper documentation about this.
By default, Spark can handle Worker failures but not a Master failure. If the Master crashes, no new applications can be created. Therefore, two high-availability schemes are provided here: https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
Hope this helps,
Le Quoc Do