Extract results from Spark Streaming into a Java object

Currently I have integrated Spark Streaming with Kafka in Java and am able to aggregate the stats. However, I cannot figure out a way to store the result in a Java object so I can pass this object around between different methods/classes without storing it in a database. I have spent quite some time searching for tutorials/examples online, but all of them end up using print() to display the result in the console. What I am trying to do instead is return these results as a JSON string when users call a REST API endpoint.
Is it possible to keep these results in memory and pass them around between methods, or do I need to store them in a database first and fetch them from there as needed?

If I understand you correctly, you want to consume the results of your Spark Streaming job via a REST API.
Even though there are ways to accomplish this directly (e.g. using Spark SQL / the Thrift server), I would separate these two tasks. Otherwise, if your Spark Streaming process fails, your service/REST API layer fails with it.
So separating these two layers has its advantages, and you are not forced to use a classical database. You could implement a service that implements/uses JCache and send the results of the Spark Streaming process to it.
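A minimal sketch of that idea, assuming the aggregation ends up in a JavaPairDStream<String, Long> of per-key counts; the class name StatsPublisher and the cache name "spark-stats" are made up for illustration, and any JCache provider (Ehcache, Hazelcast, ...) would do:

import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

import org.apache.spark.streaming.api.java.JavaPairDStream;

public class StatsPublisher {

    // Look the cache up lazily so the foreachRDD closure does not capture
    // anything that is not serializable.
    private static Cache<String, Long> statsCache() {
        CacheManager cacheManager = Caching.getCachingProvider().getCacheManager();
        Cache<String, Long> cache = cacheManager.getCache("spark-stats", String.class, Long.class);
        if (cache == null) {
            cache = cacheManager.createCache("spark-stats",
                    new MutableConfiguration<String, Long>().setTypes(String.class, Long.class));
        }
        return cache;
    }

    public static void publish(JavaPairDStream<String, Long> statsStream) {
        // For each micro-batch, collect the (already aggregated, hence small) result
        // to the driver and push it into the cache, where the REST layer reads it.
        statsStream.foreachRDD(rdd -> rdd.collectAsMap().forEach(statsCache()::put));
    }
}

Your REST controller then reads from the same "spark-stats" cache instead of talking to the streaming job directly, so the two layers stay decoupled.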

Related

Is there a way to have a dynamic query or execute multiple queries with an Apache Beam pipeline?

I am using Apache Beam and Google Cloud Dataflow to insert information into a Cloud SQL database. So far this has been working great writing to one table. The information being sent is now being broadened to include data destined for another table in the database.
I was curious whether there is a way to dynamically choose the SQL query based on the information I am receiving, or whether I can somehow create the pipeline to execute multiple queries. Either would work...
Or am I stuck with having to create a separate pipeline?
Cheers,
EDIT: Adding my current pipeline config
Pipeline MainPipeline = Pipeline.create(options);
MainPipeline
    .apply(PubsubIO.readStrings().fromSubscription(MAIN_SUBSCRIPTION))
    .apply(JdbcIO.<String>write()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create("com.mysql.cj.jdbc.Driver", JDBC_URL)
                .withUsername(JDBC_USER)
                .withPassword(JDBC_PASS))
        .withStatement(QUERY_SQL)
        .withPreparedStatementSetter(new NewPreparedStatementSetter() {
            // binds each element to the fixed QUERY_SQL statement
        }));
I don't think you can have dynamic queries in JdbcIO based on the input elements; as far as I can see, it is configured once at construction time.
However, I can think of a couple of potential workarounds, if they suit your use case.
One is to write your own ParDo in which you call the JDBC driver manually. This basically re-implements part of JdbcIO with the new features you need, but such a ParDo can be as flexible as you like (see the sketch below).
Another is to split the input PCollection into multiple outputs. That works if your use case is limited to a predefined set of queries that you can choose from based on the input. You split the input into multiple PCollections and then attach differently configured IOs to each.
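For the first workaround, a rough sketch of a hand-rolled JDBC DoFn could look like the following; the class name, the routing logic in chooseStatement() and the single-parameter binding are placeholders for your own logic, not a real Beam API:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.beam.sdk.transforms.DoFn;

public class DynamicJdbcWriteFn extends DoFn<String, Void> {

    private final String jdbcUrl;
    private final String user;
    private final String password;
    private transient Connection connection;

    public DynamicJdbcWriteFn(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    @Setup
    public void setup() throws Exception {
        connection = DriverManager.getConnection(jdbcUrl, user, password);
    }

    @ProcessElement
    public void processElement(ProcessContext context) throws Exception {
        String element = context.element();
        // Pick the SQL statement per element -- exactly the flexibility that
        // JdbcIO does not offer, since its statement is fixed at construction time.
        String sql = chooseStatement(element);
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, element); // placeholder parameter binding
            statement.executeUpdate();
        }
    }

    @Teardown
    public void teardown() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }

    // Placeholder routing: decide which table/query an element belongs to.
    private String chooseStatement(String element) {
        return element.startsWith("{\"order\"")
                ? "INSERT INTO orders (payload) VALUES (?)"
                : "INSERT INTO customers (payload) VALUES (?)";
    }
}

You would then swap the JdbcIO.write() transform for something like .apply(ParDo.of(new DynamicJdbcWriteFn(JDBC_URL, JDBC_USER, JDBC_PASS))).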

Can Apache Flink be used to join huge non-real-time data smartly?

I am supposed to join some huge SQL tables with the JSON output of some REST services by a common key (we are talking about multiple SQL tables and a few REST service calls). The thing is that this data is not a real-time/infinite stream, and I also don't think I could order the output of the REST services by the join columns. Now the silly way would be to bring in all the data and then match the rows, but that would imply storing everything in memory or in some storage like Cassandra or Redis.
But I was wondering if Flink could use some kind of stream window to join, say, X elements at a time (so really only keep those elements in RAM at any point), while also storing the unmatched elements for a later match, maybe in some kind of hash map. This is what I mean by a smart join.
The devil is in the details, but yes, in principle this kind of data enrichment is quite doable with Flink. Your requirements aren't entirely clear, but I can provide some pointers.
For starters you will want to acquaint yourself with Flink's managed state interfaces. Using these interfaces will ensure your application is fault tolerant, upgradeable, rescalable, etc.
If you wanted to simply preload some data, then you might use a RichFlatmap and load the data in the open() method. In your case a CoProcessFunction might be more appropriate. This is a streaming operator with two inputs that can hold state and also has access to timers (which can be used to expire state that is no longer needed, and to emit results after waiting for out-of-order data to arrive).
Flink also has support for asynchronous i/o, which can make working with external services more efficient.
One could also consider approaching this with Flink's higher level SQL and Table APIs, by wrapping the REST service calls as user-defined functions.
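As a rough sketch of the CoProcessFunction idea (the types SqlRow, RestRecord and JoinedRecord are placeholders for your own records, both inputs are assumed to be keyed by the join key, and timers/state expiry are left out for brevity):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class EnrichmentJoinFunction
        extends CoProcessFunction<SqlRow, RestRecord, JoinedRecord> {

    // Keyed state: only the rows for the current key are held, plus a buffer
    // for rows that arrived before their matching REST record.
    private ValueState<RestRecord> restSideState;
    private ListState<SqlRow> pendingSqlRows;

    @Override
    public void open(Configuration parameters) {
        restSideState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("rest-record", RestRecord.class));
        pendingSqlRows = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending-sql-rows", SqlRow.class));
    }

    @Override
    public void processElement1(SqlRow row, Context ctx,
                                Collector<JoinedRecord> out) throws Exception {
        RestRecord rest = restSideState.value();
        if (rest != null) {
            out.collect(new JoinedRecord(row, rest));
        } else {
            // No match yet: buffer instead of dropping, as described above.
            pendingSqlRows.add(row);
        }
    }

    @Override
    public void processElement2(RestRecord rest, Context ctx,
                                Collector<JoinedRecord> out) throws Exception {
        restSideState.update(rest);
        // Flush anything that arrived before the REST side did.
        for (SqlRow row : pendingSqlRows.get()) {
            out.collect(new JoinedRecord(row, rest));
        }
        pendingSqlRows.clear();
    }
}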

How to properly initialize task state in Apache Flink?

I am working on a financial anti-fraud system based on Apache Flink. I need to calculate many different aggregates based on financial transactions. I use Kafka as the stream data source. For example, for the average transaction amount calculation I use MapState to store the total transaction count and total amount per card. Aggregated data is stored in Apache Accumulo. I know about persistent state in Flink, but that is not what I need. Is there any way to load initial data into Flink before the computation begins? Can it be done by using two connected streams, one with the latest computed aggregates from Accumulo and one with the transactions? The transaction stream is infinite, but the aggregates stream is not. Which way should I dig? Any help is appreciated.
I've thought about Async I/O, but state can't be used in async functions. My idea was: check for the aggregates in the in-memory state; if there is no data for the card there, call the storage service, fetch the data, perform the computation and update the in-memory state, so the next transaction for that card doesn't need a call to the external data service. But I think that's a big bottleneck.
You could follow Flink's task lifecycle: operator state is initialized and the operators are opened before the run phase starts, so that is where initial data can be made available:
TASK::setInitialState
TASK::invoke
    create basic utils (config, etc.) and load the chain of operators
    setup-operators
    task-specific-init
    initialize-operator-states
    open-operators
    run
    close-operators
    dispose-operators
    task-specific-cleanup
    common-cleanup
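In practice, the lazy-load pattern described in the question maps onto a keyed process function along these lines; AccumuloAggregateClient, Transaction and CardAggregate are made-up placeholders, and the synchronous lookup on a state miss is indeed the potential bottleneck mentioned above (Async I/O would avoid it but cannot use keyed state):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class LazyInitAggregator
        extends KeyedProcessFunction<String, Transaction, CardAggregate> {

    private transient ValueState<CardAggregate> aggregateState;
    private transient AccumuloAggregateClient accumulo; // placeholder client wrapper

    @Override
    public void open(Configuration parameters) {
        aggregateState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("card-aggregate", CardAggregate.class));
        accumulo = AccumuloAggregateClient.connect();    // placeholder, blocking client
    }

    @Override
    public void processElement(Transaction tx, Context ctx, Collector<CardAggregate> out)
            throws Exception {
        CardAggregate aggregate = aggregateState.value();
        if (aggregate == null) {
            // First transaction for this card since the job (re)started:
            // fall back to the last aggregate persisted in Accumulo.
            aggregate = accumulo.loadAggregate(ctx.getCurrentKey());
            if (aggregate == null) {
                aggregate = CardAggregate.empty();
            }
        }
        aggregate = aggregate.update(tx);
        aggregateState.update(aggregate);
        out.collect(aggregate);
    }
}

After the first transaction per card, the aggregate lives in Flink state and Accumulo is not queried again for that card.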

Java Spring: How to efficiently read and save large amount of data from a CSV file?

I am developing a web application in Java Spring where I want the user to be able to upload a CSV file from the front end, see the real-time progress of the import, and, after the import is done, search individual entries from the imported data.
The importing process would consist of actually uploading the file (sending it via REST API POST request) and then reading it and saving its contents to a database so the user would be able to search from this data.
What would be the fastest way to save the data to the database? Just looping over the lines, creating a new entity object for each line and saving it via a JpaRepository takes too much time: around 90 seconds for 10,000 lines. I need to make it a lot faster; I need to add 200k rows in a reasonable amount of time.
Side Notes:
I saw an asynchronous approach with Reactor. This should be faster as it uses multiple threads, and the order in which the rows are saved basically isn't important (although the data has IDs in the CSV).
Then I also saw Spring Batch jobs, but all of the examples use plain SQL. I am using repositories, so I'm not sure whether I can use it or whether it's the best approach.
This GitHub repo compares 5 different methods of batch-inserting data. According to its author, using JdbcTemplate is the fastest (he claims 500,000 records in 1.79 [+- 0.50] seconds). If you use JdbcTemplate with Spring Data, you'll need to create a custom repository; see this section in the docs for detailed instructions.
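A minimal sketch of such a custom repository, assuming a hypothetical CsvRow class and csv_row table (the column names and getters are made up):

import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;

@Repository
public class CsvRowBatchRepository {

    private final JdbcTemplate jdbcTemplate;

    public CsvRowBatchRepository(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Inserts all parsed rows in one JDBC batch instead of one INSERT per entity.
    public void saveAll(List<CsvRow> rows) {
        jdbcTemplate.batchUpdate(
                "INSERT INTO csv_row (id, name, amount) VALUES (?, ?, ?)",
                new BatchPreparedStatementSetter() {
                    @Override
                    public void setValues(PreparedStatement ps, int i) throws SQLException {
                        CsvRow row = rows.get(i);
                        ps.setLong(1, row.getId());
                        ps.setString(2, row.getName());
                        ps.setBigDecimal(3, row.getAmount());
                    }

                    @Override
                    public int getBatchSize() {
                        return rows.size();
                    }
                });
    }
}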
Spring Data's CrudRepository has a save method that takes an Iterable, so you can use that too, although you'll have to time it to see how it performs against the JdbcTemplate. Using Spring Data, the steps are as follows (taken from here, with some edits):
Add rewriteBatchedStatements=true to the end of the connection string.
Make sure you use an ID generator that supports batching in your entity, e.g.
@Id
@GeneratedValue(generator = "generator")
@GenericGenerator(name = "generator", strategy = "increment")
Use the save(Iterable<S> entities) method of the CrudRepository to save the data.
Set the hibernate.jdbc.batch_size configuration property.
The code for the solution #2 is here.
As for using multiple threads, remember that writing to the same table in the database from multiple threads may produce table-level contention and give worse results. You will have to try it and time it. How to write multithreaded code using Project Reactor is a completely separate topic that's out of scope here.
HTH.
If you are using SQL Server, simply create an SSIS package that looks for the file and, when it shows up, grabs it, loads it and then renames the file. That makes it a one-time build that you can execute a million times, and SSIS can load a ton of data fairly fast.
Rick

Database connection: recording data for hundreds to millions of objects

I am running economic simulations. Hundreds or thousands of agents (objects) have to record data over time, and the data always has the same structure. It is typically composed of a number of booleans, floats and integers, possibly arrays/lists (between 5 and 100 different variables). During the simulation the database has no read access. After the simulation the data will not be changed anymore. For every simulation I will create a new database. The current programming languages are Java for one project and Python for a second. It is also possible that in the future the project will run on a network. If it matters: the objects communicate via 0MQ. We are using MySQL and SQLite.
How do I connect the thousands of objects to the database? The end result should all be in one database.
Currently we send the data via ZeroMQ messaging to one object that writes into the database.
If you use Python (and you're okay with storing the data in a non-SQL format), I would recommend the object database ZODB (by Zope). Essentially, you'll be creating a dictionary (which can contain literally any data type you want). I use this for my own research and it has been great. There's also a fairly reasonable learning curve, meaning it won't take you more than a few days to really master it.
Since you mention that during your simulation the database has no read access, you can use ZODB as a standalone tool. However, if you find yourself running multiple simulations in parallel (say, multithreading or cloud computing), you will have to look into ZEO (also by Zope) to make this work.
Each of your simulations need not be its own database with ZODB; it can be a separate key of a single "simulation" database, where the data for each run is neatly nested underneath its key. You'll be able to say something like print(Simulation['run99']['output']) just as easily as you would for, say, the 98th run of your simulation.
