I have an application where multiple users can specify Spark workflows, which are then sent to the driver and executed on the cluster.
The workflows should now be extended to also support streaming data sources. A possible workflow could involve:
Stream tweets with a specific hashtag
Transform each tweet
Do analysis and visualization on a windowed frame
This works if only a single stream is started at a time, but starting more than one gives the "Only one StreamingContext may be started in this JVM." error.
I tried the various known approaches, but none of them worked for me (setting "spark.driver.allowMultipleContexts = true", increasing "spark.streaming.concurrentJobs", trying to run each streaming context in a different pool, etc.).
Can anybody tell me what the current best practice regarding parallel streams with Spark Streaming is?
Thanks in advance!
I assume you're starting your Spark Streaming jobs programmatically within an existing application, hence the error about the JVM. Spark is specifically not designed to run in the scope of a different application, even though this is feasible in standalone mode. If you want to start Spark Streaming jobs programmatically on a cluster, you will want to use the Launcher, which looks like this:
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  val spark = new SparkLauncher()
    .setSparkHome("...")
    .setAppResource("..path to your jar...")
    .setMainClass("..your app...")
    .setMaster("yarn")
    .launch()

  spark.waitFor()
}
There's a blog post with some examples:
https://blog.knoldus.com/2015/06/26/startdeploy-apache-spark-application-programmatically-using-spark-launcher/
The API docs are here:
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/launcher/SparkLauncher.html
I'm new to the streaming community.
I'm trying to create a continuous query using Kafka topics and Flink, but I haven't found any examples, so I can't get an idea of how to get started.
Can you help me with some examples?
Thank you.
For your use case, I'm guessing you want to use Kafka as the source of continuous data. In that case you can use the Kafka source connector (linked below), and if you want to slice the stream by time you can use Flink's window processing functions. These will group the Kafka messages streamed within a particular timeframe, like a list/map; a rough sketch follows the links below.
Flink Kafka source connector
Flink Window Processing Function
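A minimal sketch of that combination, assuming a Flink version with the universal Kafka connector and the Scala streaming API; the topic name, bootstrap servers and window size are placeholders:

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object KafkaWindowedQuery extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val props = new Properties()
  props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker
  props.setProperty("group.id", "continuous-query")        // placeholder consumer group

  // Continuously consume raw strings from a Kafka topic (topic name is a placeholder)
  val messages: DataStream[String] =
    env.addSource(new FlinkKafkaConsumer[String]("events", new SimpleStringSchema(), props))

  // Count messages per key in 1-minute tumbling windows
  messages
    .map(msg => (msg, 1))
    .keyBy(_._1)
    .timeWindow(Time.minutes(1))
    .sum(1)
    .print()

  env.execute("kafka-windowed-query")
}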
I have a scenario where I need to process incoming requests from users in a Spark job on a 20-node cluster. The Spark application uses deep learning and does predictions on user data stored on HDFS. The idea is to provide an environment like a REST web service, to which users can send requests that are then processed by Spark in distributed mode on YARN. Here are the issues:
When I build the jar file with dependencies, its size is more than 1 GB. The deep CNN models are not embedded in the jar file.
Running the application via spark-submit for every incoming request seems impractical because:
spark-submit has its own overhead: resource allocation, JVM application container assignment, etc. take time
The application loads the deep CNN trained models on startup; one model is ~700 MB and also takes time to load
My idea is to submit the application once using spark-submit as an infinitely running job, keep the Spark context and the models in memory, and expose a REST endpoint to which users can send requests. Upon receiving a request, a map operation would be triggered from within the running job, the result collected, and returned to the user in JSON format. This way, requests would be processed almost instantly. Is this possible? Roughly what I have in mind is sketched below.
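(This is only a sketch of the idea, not something I have working: the embedded HTTP server comes from the JDK, and loadModel / the prediction call are hypothetical placeholders.)

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.sql.SparkSession

object PredictionService extends App {
  // Created once when the job is submitted via spark-submit and kept alive afterwards
  val spark = SparkSession.builder().appName("prediction-service").getOrCreate()
  val sc = spark.sparkContext

  // Hypothetical: load the ~700 MB CNN models once and broadcast them to the executors
  // val model = sc.broadcast(loadModel("hdfs:///models/cnn"))

  val server = HttpServer.create(new InetSocketAddress(8080), 0)
  server.createContext("/predict", new HttpHandler {
    override def handle(exchange: HttpExchange): Unit = {
      val query = exchange.getRequestURI.getQuery
      // Trigger a distributed map per request; the prediction itself is a placeholder
      val result = sc.parallelize(Seq(query))
        .map(q => s"""{"input": "$q"}""") // would be model.value.predict(q) in the real job
        .collect()
        .mkString("[", ",", "]")
      val bytes = result.getBytes("UTF-8")
      exchange.getResponseHeaders.add("Content-Type", "application/json")
      exchange.sendResponseHeaders(200, bytes.length.toLong)
      exchange.getResponseBody.write(bytes)
      exchange.getResponseBody.close()
    }
  })
  server.start() // keeps the driver JVM, and with it the SparkContext, alive
}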
I have studied many articles, and also Stack Overflow questions such as Using Spark to process requests, Best Practice to launch Spark Applications via Web Application?, run spark as java web application, how to deploy war file in spark-submit command (spark), and Creating a standalone on-demand Apache Spark web service; however, none of these fit the scenario I described.
From the articles and Stack Overflow questions, I learned that the Spark REST API as well as Apache Livy can be used to submit Spark jobs; however, in both cases a Spark job is submitted for every request, which suffers from the same problems I described above (1+ GB jar file size plus loading the models on startup). Also, what happens with multiple concurrent incoming requests? Am I right?
I read that Uber uses Spark for route calculation (article, article, article), but it's closed source and I have no idea how they do it on the fly for every incoming user request.
In a nutshell, is it possible to embed a REST microservice within the Spark job using a lightweight framework such as Spark Java? Spark Streaming is not applicable in this scenario either, because there is no streaming data source.
I have searched for this for a long time and never found a practical solution. If my understanding of the Spark REST API and Livy is wrong, can I be corrected please? And if my idea is wrong, can you guide me to another approach to get the job done? Any help or suggestions will be highly appreciated.
I'm not very clear about the whole picture of Spark. Let's say I create a regular Java jar without involving anything Spark-related: no SparkSession, no SparkContext, no RDD, no Dataset. What would happen if I submit it to a Spark cluster via spark-submit with deploy-mode=cluster?
I wrote a simple jar which only prints some lines, and it seems to work fine on my toy Spark setup. I had thought it would result in some error since it's not a Spark application...
I wonder whether I can expect the same result when submitting to a real-world Spark cluster with many nodes?
That can depend on the cluster manager and deploy mode, but in general nothing strange happens. A Spark application is a plain JVM application with a normal main function; it doesn't have to implement any particular interface, and the lack of an active session is not an issue.
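For illustration, a minimal sketch of such a "non-Spark" jar and how it might be submitted (the class and jar names are made up):

object PlainMain {
  def main(args: Array[String]): Unit = {
    // No SparkSession, SparkContext, RDD or Dataset anywhere;
    // spark-submit simply invokes this main method.
    println("Hello from a plain JVM main, launched via spark-submit")
  }
}

// submitted e.g. with:
//   spark-submit --class PlainMain --master yarn --deploy-mode cluster plain-main.jar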
I have a cluster of servers. Common tasks I manually code are:
collect various stats (failures, successes, times) with the metrics library.
aggregate and combine those across the cluster.
depending on conditions, check the aggregated stats across the cluster and, based on that, send alerts (instead of having each server send an alert, increase global metrics which are then polled into Graphite).
if a specific node sends an alert, it is first accumulated and, based on alerts from other nodes (again a cross-cluster scenario), I would decide which alert to send (so if I have 100 servers, they don't each send a separate alert; a single one is sent).
I looked into a few frameworks, but none of those I have seen achieve this: metrics, javamelody, Netflix Servo, Netflix Zuul.
None of them support, for example, my cross-cluster scenario where I want to aggregate stats over time and only send an alert if certain conditions apply (as a way to avoid duplicating alerts across servers). Do I need to build my own framework for that, or does something already exist?
(In case my use case sounds so specific that I should just code it: I have many more similar use cases, which makes me wonder why there isn't such a framework. Before I start coding something, I don't want to find out I have just duplicated some other framework.)
Have you looked at using a combination of either Graphite or OpenTSDB with Riemann? You can aggregate your information in Graphite (with or without statsd), or dump everything into OpenTSDB, and use Riemann for event processing. Riemann's config is in Clojure, but I believe you can use client libraries in multiple languages (unless you want to do the event processing yourself using Esper/Siddhi). Another option could be to look at Rocksteady (which uses Graphite/Esper). Graphite is a Python/Django application (there are multiple forks of statsd, not just the one in NodeJS, and besides, you can simply use metrics in place of that). OpenTSDB is Java on HBase (if you're looking to store time-series information). For event processing, you could also choose to look into Storm (and use Esper/Siddhi as a bolt in Storm).
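Since you're already using the metrics library on each node, here is a rough sketch of shipping per-node metrics to Graphite via the metrics-graphite module, so that the cross-cluster aggregation and alerting happen centrally (host names, ports and metric names are placeholders):

import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{MetricFilter, MetricRegistry}
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

object NodeMetrics {
  val registry = new MetricRegistry()
  val requests = registry.meter("requests") // placeholder metric names
  val failures = registry.meter("failures")

  // Each node reports under its own prefix; Graphite/Riemann does the cross-cluster aggregation
  def startReporting(hostname: String): Unit = {
    val graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003)) // placeholder host
    val reporter = GraphiteReporter.forRegistry(registry)
      .prefixedWith(s"cluster.$hostname")
      .convertRatesTo(TimeUnit.SECONDS)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .filter(MetricFilter.ALL)
      .build(graphite)
    reporter.start(1, TimeUnit.MINUTES)
  }
}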
I am looking for a pattern and/or framework which can model the following problem in an easily configurable way.
Every, say, 3 minutes, I need a set of jobs to kick off in a web application context that concurrently hit web services to obtain the latest version of the data and push it to a database. The problem is that the database will be heavily read from in order to run tons of complex calculations on the data. We are currently using Spring, so I have been looking at Spring Batch to run this process. Does anyone have any suggestions/patterns/examples of using Spring or other technologies for a similar system?
We have used ServletContextListeners to kick off TimerTasks in our web applications when we needed processes to run repeatedly. The ServletContextListener kicks in when the app server starts the application or when the application is restarted. The timer task then acts like a separate thread that repeats your code at the specified interval; a small sketch follows the links below.
ServletContextListener
http://www.javabeat.net/examples/2009/02/26/servletcontextlistener-example/
TimerTask
http://enos.itcollege.ee/~jpoial/docs/tutorial/essential/threads/timer.html
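A minimal sketch of that combination (the listener name and the 3-minute period are just illustrative):

import java.util.{Timer, TimerTask}
import javax.servlet.{ServletContextEvent, ServletContextListener}

// Register this listener in web.xml; it starts when the application is deployed
class PollingJobListener extends ServletContextListener {
  private val timer = new Timer(true) // daemon thread, so it won't block server shutdown

  override def contextInitialized(sce: ServletContextEvent): Unit = {
    timer.scheduleAtFixedRate(new TimerTask {
      override def run(): Unit = {
        // call the web services and push the latest data to the database here
      }
    }, 0L, 3 * 60 * 1000L) // run immediately, then every 3 minutes
  }

  override def contextDestroyed(sce: ServletContextEvent): Unit = timer.cancel()
}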
Is refactoring the job out of the web application and into a standalone app a possibility?
That way you could stick the batch job onto a separate batch server (so that the extra load of the batch job wouldn't impact your web application), which then calls the web services and updates the database. The job can then be kicked off using something like cron or Autosys.
We're using Spring-Batch for exactly this purpose.
The database design would also depend on what the batched data is used for. If it is for reporting purposes, I would recommend separating the operational database from the reporting database, using a database link to obtain the required data from the operational database into the reporting database and then running the complex queries on the reporting database. That way the load is shifted off the operational database.
I think it's also worth looking into frameworks like Apache Camel. Also take a look at the so-called Enterprise Integration Patterns; check the catalog, as it might provide you with some useful vocabulary for thinking about the scheduling/scaling problem at hand.
The framework itself integrates really well with Spring; a small route sketch is below.
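A rough sketch of a Camel route for the 3-minute polling part, assuming a standalone Camel context (the endpoint URL is a placeholder, and in a Spring application the context would be managed for you):

import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext

object PollingRoute extends App {
  val context = new DefaultCamelContext()

  context.addRoutes(new RouteBuilder {
    override def configure(): Unit = {
      // Fire every 3 minutes, fetch the latest data, then hand it off for the database update
      from("timer:poll?period=180000")
        .to("http://example.com/latest-data")   // placeholder: the web service to poll
        .log("fetched latest data: ${body}")    // placeholder for transforming and persisting it
    }
  })

  context.start()
  Thread.sleep(Long.MaxValue) // keep the standalone JVM alive; not needed inside a web/Spring app
}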