I have a scenario where I need to process incoming requests from users in a Spark job on a 20-node cluster. The Spark application uses deep learning and makes predictions on user data stored on HDFS. The idea is to provide an environment like a REST web service to which users can send requests, with the processing done by Spark in distributed mode on YARN. Here are the issues:
When I build the jar file with dependencies, its size is more than 1 GB. The deep CNN models are not embedded in the jar file.
Running the application via spark-submit for every incoming request seems impractical because:
spark-submit has its own overhead: resource allocation, JVM container assignment, etc. take time
The application loads the trained deep CNN models on startup; each model is ~700 MB, so loading also takes time
My idea is to submit the application once using spark-submit as an infinitely running job, keep the SparkContext and the models in memory, and expose a REST endpoint to which users can send requests. Upon receiving a request, trigger a map operation from within the running job, get the result, and return it to the user in JSON format. This way, requests would be processed immediately, without any startup delay. Is this possible?
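A rough, untested sketch of what I have in mind is below. The REST part would use the Spark Java web framework; the class name, endpoint, HDFS paths, and the Model type are just placeholders, not a working implementation:

// Untested sketch: one long-running driver that keeps the SparkContext and
// the models in memory and serves requests over HTTP.
import static spark.Spark.get;   // Spark Java web framework, not Apache Spark
import static spark.Spark.port;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PredictionServer {
    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("prediction-service"));

        // Load the ~700 MB CNN model once at startup and broadcast it so each
        // executor deserializes it once, not once per request (Model is a
        // placeholder type; the actual loading code is omitted).
        // Broadcast<Model> model = sc.broadcast(Model.load("hdfs:///models/cnn"));

        port(8080);
        get("/predict/:userId", (req, res) -> {
            String userId = req.params(":userId");
            // Trigger a map over the user's data from within the running job;
            // the driver blocks on collect() and returns the result as JSON.
            // List<String> predictions = sc.textFile("hdfs:///user-data/" + userId)
            //         .map(line -> model.value().predict(line))
            //         .collect();
            res.type("application/json");
            return "{\"userId\": \"" + userId + "\", \"predictions\": []}";
        });
    }
}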
I have studied many articles and Stack Overflow questions such as Using Spark to process requests, Best Practice to launch Spark Applications via Web Application?, run spark as java web application, how to deploy war file in spark-submit command (spark), and Creating a standalone on-demand Apache Spark web service; however, none of these fits the scenario I described.
From those articles and questions, I learned that the Spark REST API as well as Apache Livy can be used to submit Spark jobs, but in both cases a Spark job is submitted for every request, which suffers from the same problems I described above (the 1+ GB jar plus loading the models on startup). Also, what happens with multiple concurrent incoming requests? Am I right?
I read that Uber uses Spark for route calculation (article, article, article), but it's closed source and I have no idea how they do it on the fly for every incoming user request.
In a nutshell, is it possible to embed a REST microservice within the Spark job using a lightweight framework such as Spark Java? Spark Streaming is also not applicable in this scenario because there is no streaming data source.
I have searched for this for a long time and never found a practical solution. If my understanding of the Spark REST API and Livy is wrong, please correct me. And if my idea is wrong, can you guide me to another approach that would get the job done? Any help or suggestions will be highly appreciated.
Related
I have a gRPC Java service which uses Hibernate for making database updates. The data ingestion part takes a long time, and I need to ingest even more data, which would make it impossible to do sequentially. I'd like to make the service run in a distributed environment so I can parallelize the data processing.
However, I'm not an expert in Hibernate, and I'm wondering whether it's possible to make the service distributed as it is, and how to do that. I have used Spark before, but I suspect that using Spark for this would essentially mean rewriting the entire service. Is there an easier way to do it?
What my service does:
supports backend API endpoints via gRPC
ingests data by reading inputs from AWS S3, comparing the incoming data with the current database, and making updates where necessary, using Hibernate
I'm not very clear on the whole picture of Spark. Let's say I create a regular Java jar without involving anything Spark-related: no SparkSession, no SparkContext, no RDD, no Dataset. What would happen if I submit it to a Spark cluster via spark-submit with deploy-mode=cluster?
I wrote a simple jar which only prints some lines, and it seems to work fine on my toy Spark setup. I had thought that would result in some error, since it's not a Spark application...
I'd like to know whether I can expect the same result when submitting to a real-world Spark cluster with many nodes.
That can depend on the cluster manager and deploy mode, but in general nothing strange happens. A Spark application is a plain JVM application with a normal main function; it doesn't have to implement any particular interface, and the lack of an active session is not an issue.
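For example, something as minimal as this (the class name and jar name are only illustrative) runs the same way on a real multi-node cluster; the main method is simply executed in the driver container:

// A jar with no Spark classes at all; spark-submit still just runs main().
// Submit with something like:
//   spark-submit --class PlainMain --master yarn --deploy-mode cluster plain.jar
public class PlainMain {
    public static void main(String[] args) {
        System.out.println("Hello from a non-Spark jar, args: " + String.join(", ", args));
    }
}

Note that with deploy-mode=cluster the printed output ends up in the driver container's logs on the cluster, not on your local console.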
I am developing a product using microservices and am running into a bit of an issue. In order to do any work, I need to have all 9 services running on my local development environment. I am using Cloud Foundry to run the applications, but when running locally I am just running the Spring Boot jars themselves. Is there any way to set up a more lightweight environment so that I don't need everything running? Ideally, I would only like the service I am currently working on to have to be real.
I believe this is a matter of your testing strategy. If you have a lot of microservices in your system, it is not wise to always perform end-to-end testing at development time; it costs you productivity, and the setup is usually complex (as you have observed).
You should really think about what it is you want to test. Within one service, it is usually good to decouple the core logic from the integration points with other services. Ideally, you should be able to write simple unit tests for your core logic. If you want to test the integration points with other services, use a mocking library (a quick Google search turns up this promising article: http://spring.io/blog/2007/01/15/unit-testing-with-stubs-and-mocks/).
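As a minimal sketch of that decoupling (the class and method names here are made up, assuming Mockito and JUnit 5 on the classpath):

// Core logic is tested with the integration point mocked out, so no other
// microservice needs to be running.
import static org.junit.jupiter.api.Assertions.assertTrue;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;

class OrderServiceTest {

    interface PaymentClient {                    // integration point with another service
        boolean charge(String userId, int cents);
    }

    static class OrderService {                  // core logic under test
        private final PaymentClient payments;
        OrderService(PaymentClient payments) { this.payments = payments; }
        boolean placeOrder(String userId, int cents) {
            return payments.charge(userId, cents);
        }
    }

    @Test
    void placesOrderWhenPaymentSucceeds() {
        PaymentClient payments = mock(PaymentClient.class);
        when(payments.charge("alice", 500)).thenReturn(true);

        assertTrue(new OrderService(payments).placeOrder("alice", 500));
    }
}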
If you don't have one already, I would highly recommend setting up a separate staging area with all microservices running. You should perform all your end-to-end testing there before deploying to production.
This post from Martin Fowler has a more comprehensive take on microservice testing strategy:
https://martinfowler.com/articles/microservice-testing
It boils down to the testing technique that you use. Here is my recent answer on another topic that you might find useful: https://stackoverflow.com/a/44486519/2328781.
In general, I think that WireMock is a good choice for the following reasons:
It has out-of-the-box support in Spring Boot.
It has out-of-the-box support in Spring Cloud Contract, which makes it possible to use a very powerful technique called Consumer Driven Contracts.
It has a recording feature. Set up WireMock as a proxy and make requests through it; this will generate stubs for you automatically based on your requests and responses. A minimal stub sketch follows this list.
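For illustration, a hedged sketch of a standalone WireMock stub standing in for a downstream microservice (the port, path, and payload are made up):

// Starts a fake "inventory" service so only the service under development
// needs to run for real.
import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
import static com.github.tomakehurst.wiremock.client.WireMock.get;
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;

import com.github.tomakehurst.wiremock.WireMockServer;

public class FakeInventoryService {
    public static void main(String[] args) {
        WireMockServer server = new WireMockServer(9091);
        server.start();

        server.stubFor(get(urlEqualTo("/inventory/42"))
                .willReturn(aResponse()
                        .withHeader("Content-Type", "application/json")
                        .withBody("{\"id\": 42, \"inStock\": true}")));

        // Point the service you are developing at http://localhost:9091
        // instead of the real inventory microservice.
    }
}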
There are multiple tools out there that let you create mocked versions of your microservices.
When I encountered this exact problem myself, I decided to create my own tool tailored for microservice testing. The goal is to never have to run all microservices at once, only the one you are working on.
You can read more about the tool and how to use it to mock microservices here: https://mocki.io/mock-api-microservices. If you only want to run them locally, that is possible using the open-source CLI tool.
This can be solved if your microservices allow passing metadata along with requests.
A good microservice architecture should use central service discovery, and every service should be able to accept a metadata map along with the request payload. Known fields of this map can be interpreted and modified by a service and then passed on to the next service.
The most popular use of per-request metadata is request tracing (i.e., collecting the tree of nodes used to process the request and the timings for every node), but it can also be used to tell the entire system which nodes to use.
The plan is thus (a sketch follows the list):
register your local node in the dev environment's service discovery
send a request to the entry node of your system along with metadata telling everyone to use your local service instance instead of the default one
the metadata will propagate, your local node will be called by the dev environment, and it will then pass the processed results back to the dev environment
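A hedged sketch of how the override metadata could be propagated over HTTP (the header name and helper are entirely made up; gRPC metadata would work analogously):

// Each service copies the override header onto its outgoing calls and, when the
// header names the service it is about to call, routes to the developer's
// local instance instead of the one found via service discovery.
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

public class RoutingOverride {
    // e.g. "orders=http://dev-laptop:8080"
    static final String HEADER = "X-Route-Override";

    static HttpResponse<String> callDownstream(String service, String discoveredUrl,
                                               Optional<String> incomingOverride)
            throws IOException, InterruptedException {
        String target = incomingOverride
                .filter(o -> o.startsWith(service + "="))
                .map(o -> o.substring(service.length() + 1))
                .orElse(discoveredUrl);

        HttpRequest.Builder request = HttpRequest.newBuilder(URI.create(target));
        incomingOverride.ifPresent(o -> request.header(HEADER, o)); // propagate downstream
        return HttpClient.newHttpClient().send(request.build(), HttpResponse.BodyHandlers.ofString());
    }
}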
Alternatively:
use code generation for inter-service communication to reduce the risk of failures caused by mistakes in hand-written RPC code
resort to integration tests, mocking all client APIs for the microservice under development
fully automate deployment of your system to your local machine. You may need to run nodes with reduced memory (which is generally OK, as memory is usually consumed only under load) or buy more RAM.
One approach would be to deploy an app which maps paths/URLs to JSON response files. I personally haven't used it, but I believe http://wiremock.org/ might help you.
For Java microservices, you should try Stubby4j. This will mock the JSON responses of other microservices using the Stubby server. If you feel that mocking is not enough to cover all the features of your microservices, you should set up a local Docker environment to deploy the dependent microservices.
I have an application where multiple users are able to specify Spark workflows, which are then sent to the driver and executed on the cluster.
The workflows should now be extended to also support streaming data sources. A possible workflow could involve:
Stream tweets with a specific hashtag
Transform each tweet
Do analysis and visualization on a windowed frame
This works if only a single stream is started at a time, but otherwise it gives the error "Only one StreamingContext may be started in this JVM."
I tried several known approaches, but none of them worked for me (setting "spark.driver.allowMultipleContexts = true", increasing "spark.streaming.concurrentJobs", trying to run each streaming context in a different scheduler pool, etc.).
Can anybody tell me what the current best practice is regarding parallel streams with Spark Streaming?
Thanks in advance!
I assume you're starting your Spark Streaming jobs programmatically within an existing application, hence the error about the JVM. Spark is specifically not designed to run within the scope of a different application, even though this is feasible in standalone mode. If you want to start Spark Streaming jobs programmatically on a cluster, you will want to use the Launcher, which looks like this:
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  // Launch the streaming application as a separate Spark process
  val spark = new SparkLauncher()
    .setSparkHome("...")                      // path to the Spark installation
    .setAppResource("..path to your jar...")  // the application jar
    .setMainClass("..your app...")            // the main class to run
    .setMaster("yarn")
    .launch()

  spark.waitFor()  // block until the launched application finishes
}
There's a blog post with some examples:
https://blog.knoldus.com/2015/06/26/startdeploy-apache-spark-application-programmatically-using-spark-launcher/
The API docs are here:
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/launcher/SparkLauncher.html
I am looking for a pattern and/or framework which can model the following problem in an easily configurable way.
Every 3 minutes or so, I need a set of jobs to kick off in a web application context; they will concurrently hit web services to obtain the latest version of the data and push it to a database. The problem is that the database will simultaneously be under heavy read load for tons of complex calculations on the data. We are currently using Spring, so I have been looking at Spring Batch to run this process. Does anyone have suggestions/patterns/examples of using Spring or other technologies for a similar system?
We have used ServletContextListeners to kick off TimerTasks in our web applications when we needed processes to run repeatedly. The ServletContextListener fires when the app server starts or restarts the application, and the timer task then acts like a separate thread that repeats your code at the specified interval (a sketch follows the links below).
ServletContextListener
http://www.javabeat.net/examples/2009/02/26/servletcontextlistener-example/
TimerTask
http://enos.itcollege.ee/~jpoial/docs/tutorial/essential/threads/timer.html
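A hedged sketch of that approach (the class name and the 3-minute period are only illustrative):

// Registered by the container at startup; schedules a repeating background task.
import java.util.Timer;
import java.util.TimerTask;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

@WebListener
public class DataRefreshListener implements ServletContextListener {
    private Timer timer;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        timer = new Timer("data-refresh", true);          // daemon thread
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // call the web services and push the latest data to the database
            }
        }, 0, 3 * 60 * 1000);                             // every 3 minutes
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        if (timer != null) {
            timer.cancel();   // stop the background task when the app is undeployed
        }
    }
}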
Is refactoring the job out of the web application and into a standalone app a possibility?
That way you could stick the batch job onto a separate batch server (so that the extra load of the batch job wouldn't impact your web application), which then calls the web services and updates the database. The job can then be kicked off using something like cron or Autosys.
We're using Spring-Batch for exactly this purpose.
The database design will also depend on what the batched data is used for. If it is for reporting purposes, I would recommend separating the operational database from the reporting database, using a database link to pull the required data from the operational database into the reporting database, and then running the complex queries against the reporting database. That way the load is shifted off the operational database.
I think it's also worth looking into integration frameworks such as Apache Camel. Also take a look at the so-called Enterprise Integration Patterns; check the catalog, as it might give you some useful vocabulary for thinking about the scaling/scheduling problem at hand.
The framework itself integrates really well with Spring.
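A hedged sketch of what such a Camel route might look like in a Spring application (assuming the camel-spring-boot, camel-http, and camel-jdbc components; the endpoint URIs and table name are made up):

// Polls a web service every 3 minutes and writes the response to the database.
import org.apache.camel.builder.RouteBuilder;
import org.springframework.stereotype.Component;

@Component
public class RefreshRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("timer:refresh?period=180000")               // fire every 3 minutes
            .to("http://example.com/api/latest")          // fetch the latest data
            .process(exchange -> {
                // the jdbc endpoint executes the message body as SQL
                String payload = exchange.getIn().getBody(String.class);
                exchange.getIn().setBody(
                        "insert into latest_data (raw) values ('" + payload.replace("'", "''") + "')");
            })
            .to("jdbc:dataSource");                       // run it against the database
    }
}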