How to read data from a relational database in Apache Flink streaming - Java

How can we read data from a relational database using a custom data source? I am new to Flink streaming and am having trouble adding a new custom data source. Please help me add a custom data source and read data continuously from the source DB.

As suggested by Chengzhi, relational databases are not designed to be processed in a streaming fashion and it would be better to use Kafka, Kinesis or some other system for that.
However, you could write a custom source function that uses a JDBC connection to fetch the data. It would have to continuously query the DB for any new data. The issue here is that you need a way to determine which data you have already read/processed and which you have not. Off the top of my head, you could use a couple of things, like remembering the last processed primary key and using it in a subsequent query like:
SELECT * FROM events WHERE event_id > $last_processed_event_id;
Alternatively, you could drain the events table inside a transaction, like:
SELECT * FROM unprocessed_events;
DELETE FROM unprocessed_events WHERE event_id IN $PROCESSED_EVENT_IDS;
event_id can be anything that uniquely identifies the records; it could be a timestamp or a combination of fields.
Another thing to consider is that you would have to take care of checkpointing the last_processed_event_id offset manually if you want to provide any reasonable at-least-once or exactly-once guarantees.
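Something along these lines could work as a starting point - a minimal sketch of a polling JDBC source, assuming an events table with a monotonically increasing event_id column (class name, connection string, poll interval and the fact that only the id is emitted are all illustrative choices), which keeps the last processed id in operator state so it survives restarts:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

public class JdbcPollingSource extends RichSourceFunction<Long> implements CheckpointedFunction {

    private volatile boolean running = true;
    private long lastProcessedEventId = 0L;
    private transient ListState<Long> offsetState;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")) {
            while (running) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT event_id FROM events WHERE event_id > ? ORDER BY event_id")) {
                    ps.setLong(1, lastProcessedEventId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            long id = rs.getLong("event_id");
                            // emit and advance the offset atomically with respect to checkpoints
                            synchronized (ctx.getCheckpointLock()) {
                                ctx.collect(id); // in real code, map the full row to a POJO
                                lastProcessedEventId = id;
                            }
                        }
                    }
                }
                Thread.sleep(1000L); // poll interval
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        offsetState.clear();
        offsetState.add(lastProcessedEventId);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("last-event-id", Long.class));
        for (Long id : offsetState.get()) {
            lastProcessedEventId = id;
        }
    }
}

You would register it with env.addSource(new JdbcPollingSource()) and enable checkpointing on the environment so the offset state is actually snapshotted; the transactional SELECT/DELETE variant would replace the query and offset bookkeeping inside run().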

Related

ETL design: What Queue should I use instead of my SQL table and still be able to process in parallel?

I need your help redesigning my system. We have a very simple but very old ETL, and now that we handle a massive amount of data it has become extremely slow and inflexible.
The first process is the collector process:
Collector process - always up
The collector collects messages from the queue (RabbitMQ).
It parses the message properties (JSON format) into a Java object (for example, if the JSON contains fields like 'id', 'name' and 'color', we create a Java object with an int field 'id' and String fields 'name' and 'color').
After parsing, we write the object to a CSV file as a CSV row with all the properties of the object.
We send an ack and continue to the next message in the queue.
Processing workflow - happens once every hour
A process named 'Loader' loads all the CSV files (the collector's output) into a DB table named 'Input' using SQL LOAD INFILE; all new rows get a 'Not handled' status. The Input table acts like a queue in this design.
A process named 'Processor' reads from the table all the records with 'Not handled' status, transforms them into Java objects, does some enrichment, and then inserts the records into another table named 'Output' with new fields. Each iteration we process 1000 rows in parallel, using a JDBC batch update for the DB insert.
The major problem in this flow:
The messages are not flexible in the existing flow: if I want, for example, to add a new property to the JSON message (say, to also add 'city'), I also have to add a 'city' column to the table (because of the CSV file load). The table contains a massive amount of data, and it's not possible to add a column every time the message changes.
My conclusion
The table is not the right choice for this design.
I have to get rid of the CSV writing and remove the 'Input' table to have a flexible system. I thought of maybe using a queue such as Kafka instead of the table, and perhaps using tools such as Kafka Streams for the enrichment. This would give me flexibility, and I wouldn't need to add a column to a table every time I want to add a field to the message.
The huge problem is that I wouldn't be able to process in parallel the way I do today.
What can I use instead of table that will allow me to process the data in parallel?
Yes, using Kafka will improve this.
Ingestion
Your process that currently writes CSV files can instead publish to a Kafka topic. This could possibly replace RabbitMQ, depending on your requirements and scope.
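For illustration, the collector's per-message handling could become a publish call like the sketch below (the topic name 'raw-events', the String serializers and the callback-based ack are assumptions, not part of your current setup):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Collector {
    private final KafkaProducer<String, String> producer;

    public Collector() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Called for each RabbitMQ message instead of appending a CSV row;
    // ack the RabbitMQ message only after the send has succeeded.
    public void handleMessage(String messageId, String jsonPayload) {
        producer.send(new ProducerRecord<>("raw-events", messageId, jsonPayload),
                (metadata, exception) -> {
                    if (exception == null) {
                        // send the ack back to RabbitMQ here
                    }
                });
    }
}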
Loader (optional)
Your other process, which loads data in the initial format and writes it to a database table, can instead publish to another Kafka topic in the format you want. This step can be omitted if you can write directly in the format the processor wants.
Processor
The way you use the 'Not handled' status is a way to treat your data as a queue, but this is handled by design in Kafka, which uses a log (whereas a relational database is modeled as a set).
The processor subscribes to the messages written by the loader or ingestion. It transforms them into Java objects and does some enrichment, but instead of inserting the result into a new table, it can publish the data to a new output topic.
Instead of doing work in batches ("each iteration we process 1000 rows in parallel and use a JDBC batch update for the DB insert"), with Kafka and stream processing this is done as a continuous real-time stream, as data arrives.
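With Kafka Streams, this processor can be little more than the sketch below (topic names and the enrich step are placeholders for your own logic):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EnrichmentProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("raw-events");
        events.mapValues(EnrichmentProcessor::enrich)   // transform/enrich each message as it arrives
              .to("enriched-events");                   // publish instead of inserting into 'Output'

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for the enrichment that today happens in the 'Processor'
    private static String enrich(String json) {
        return json; // e.g. parse the JSON, add fields, serialize back
    }
}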
Schema evolvability
"If I want, for example, to add a new property to the JSON message (say, to also add 'city'), I also have to add a 'city' column to the table (because of the CSV infile load); the table contains a massive amount of data, and it's not possible to add a column every time the message changes."
You can solve this by using an Avro schema when publishing to the Kafka topic.
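Roughly, adding 'city' then becomes a schema change with a default value rather than an ALTER TABLE on a huge table. The sketch below assumes Avro GenericRecords and, in the final comment, Confluent's schema-registry serializer; both are choices on top of what the question describes:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class EventRecords {
    // Version 2 of the event schema: 'city' is new and has a default,
    // so consumers still using the old schema keep working.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"color\",\"type\":\"string\"},"
      + "{\"name\":\"city\",\"type\":\"string\",\"default\":\"\"}]}");

    // Publish the result with a KafkaProducer<String, GenericRecord> whose value.serializer is
    // io.confluent.kafka.serializers.KafkaAvroSerializer (requires a schema.registry.url setting).
    public static GenericRecord event(int id, String name, String color, String city) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("id", id);
        record.put("name", name);
        record.put("color", color);
        record.put("city", city);
        return record;
    }
}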

Configuring database change notification to get only newly inserted or updated data in Java

I am building an application that does some processing after looking up a database (oracle).
Currently, I have configured the application with Spring Integration and it polls data in a periodic fashion regardless of whether any data is updated or inserted.
The problem here is that I cannot add or use any column to distinguish between old and new records. Also, even when there is no insert or update in the table, the poller still polls data from the database and feeds it into the message channel.
For that reason, I want to switch to database change notification, and I need to register a query, something like:
SELECT * FROM EMPLOYEE WHERE STATUS='ACTIVE'
Now, this ACTIVE status is true for both old and new entries, and I want to eliminate the old entries from my list, so that only after a new insert or an update of an existing row do I get the data that was newly added or recently updated.
Well, it is really unfortunate that you can't modify the data model in the database. I'd suggest insisting on changing the table for your convenience. For example, it might really just be one more column, LAST_MODIFIED, so you could filter out the old records and only poll those whose date is fresh.
There is also the possibility of using an Oracle trigger, so you can perform some action on INSERT/UPDATE and modify some other table for your purposes.
Otherwise you don't have a choice other than to use one more persistence service to track loaded records, for example a MetadataStore based on Redis or MongoDB: https://docs.spring.io/spring-integration/docs/4.3.12.RELEASE/reference/html/system-management-chapter.html#metadata-store
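As a rough sketch of that last option (assuming the Redis-backed MetadataStore from spring-integration-redis; the key prefix and filter class are invented for illustration), such a check could sit between the poller and the message channel:

import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.integration.metadata.ConcurrentMetadataStore;
import org.springframework.integration.redis.metadata.RedisMetadataStore;

public class ProcessedRowsFilter {
    private final ConcurrentMetadataStore store;

    public ProcessedRowsFilter(RedisConnectionFactory connectionFactory) {
        this.store = new RedisMetadataStore(connectionFactory);
    }

    // Returns true only the first time a given employee id is seen,
    // so rows that were already polled can be dropped before they reach the channel.
    public boolean isNew(long employeeId) {
        return store.putIfAbsent("employee." + employeeId, "processed") == null;
    }
}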

When will the data streamed to BigQuery table be available for Query operations?

I have a use case in which I do the following:
Insert some rows into a BigQuery table (t1) which is date partitioned.
Run some queries on t1 to aggregate the data and store them in another table.
In the above use case, I faced an issue today where the queries I ran had some discrepancy in the aggregated data. When I executed the same queries some time later from the BigQuery Web UI, the aggregations were fine. My suspicion is that some of the inserted rows were not available to the query.
I read this documentation on BigQuery data availability, and I have the following questions about it:
The link says that "Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table". Is there an upper limit on the number of seconds to wait before it is available for real-time analysis?
From the same link: "Data can take up to 90 minutes to become available for copy and export operations". Do the following operations come under this restriction?
Copy the result of a query to another table
Exporting the result of a query to a csv file in cloud storage
Also from the same link: "when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column". Does this mean that I should not use _PARTITIONTIME in my queries while data is still in the streaming buffer?
Can somebody please clarify these?
You can use _PARTITIONTIME IS NULL to detect which rows are in the buffer. You can actually use this logic to further UNION the buffer onto whichever date you wish (like today): you can wire in some logic that reads the buffer and, where the time is NULL, sets a time for the rest of the query logic.
The buffer is by design a bit delayed, but if you need immediate access to the data you need to use the IS NULL trick to be able to query it.
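For example, a small check with the google-cloud-bigquery Java client (project, dataset and table names below are placeholders) shows how many rows are still buffer-only:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class StreamingBufferCheck {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Rows still in the streaming buffer have a NULL _PARTITIONTIME,
        // so this counts how much data partition-filtered queries would miss.
        String sql = "SELECT COUNT(*) AS buffered FROM `my_project.my_dataset.t1` "
                   + "WHERE _PARTITIONTIME IS NULL";
        TableResult result = bigquery.query(
                QueryJobConfiguration.newBuilder(sql).setUseLegacySql(false).build());
        for (FieldValueList row : result.iterateAll()) {
            System.out.println("rows still in the streaming buffer: " + row.get("buffered").getLongValue());
        }
    }
}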
For the questions:
Do the following operations come under this restriction?
Copy the result of a query to another table
Exporting the result of a query to a csv file in cloud storage
The results of a query are immediately available for any operation (like copy and export), even if that query was run on streamed data still in the buffer.

How to Convert SQL table into Redis Data

Hi, I am new to Redis and would like some help here. I am using Java, SQL Server 2008 and a Redis server. To interact with Redis I am using the Jedis API for Java. I know that Redis is used to store key-value data: every key has a value.
Problem Background:
I have a table named "user" which stores data like id, name, email, age and country. This is the schema of the SQL table. The table has some rows (that is, some data as well). The primary key is id, and it is just for DB use; it is of no use to me in the application.
In SQL I can insert a new row, update a row, search for any user, and delete a user.
I want to store this table's data in Redis and then perform similar operations on Redis as well, such as search, insert and delete. With a good design for storing this info in the DB and Redis, these operations can be carried out simply. Remember, I can have multiple tables as well, so I should store data in Redis on a per-table basis.
My Problem
Please advise on any design or approach for how I can convert DB data to Redis and perform all these operations. I am asking because I know Facebook also uses Redis to store data, so how do they store it?
Any help would be much appreciated.
This is a very hard question to answer, as there are multiple ways you could do it.
The best way, in my opinion, would be to use hashes. A hash is basically a nested key-value type, so your key would map to a hash in which you can store username, password, etc.
One problem is indexing: you would need to have an ID stored in the key. For example, each user would have a key like USER:21414.
The second thing: unless you want to look at commands like KEYS or SCAN, you are going to have to maintain your own list of users to iterate over (only if you need to do that). For this you will need to look at lists or sorted sets.
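Translating those two points into the basic Redis commands (HMSET/HGETALL for the hash, SADD for the index), a minimal Java sketch with Jedis might look like this - the key names and the USERS index set are just one possible convention, not something Redis prescribes:

import java.util.HashMap;
import java.util.Map;
import redis.clients.jedis.Jedis;

public class UserStore {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // One hash per row, keyed by an application-generated id
            Map<String, String> user = new HashMap<>();
            user.put("name", "alice");
            user.put("email", "alice@example.com");
            user.put("age", "30");
            user.put("country", "NL");
            jedis.hmset("USER:21414", user);   // hset(key, map) in newer Jedis versions

            // Secondary index so users can be listed without KEYS/SCAN
            jedis.sadd("USERS", "21414");

            Map<String, String> loaded = jedis.hgetAll("USER:21414");  // "search" by id
            jedis.hset("USER:21414", "country", "DE");                 // update a single field

            jedis.del("USER:21414");                                   // delete the user...
            jedis.srem("USERS", "21414");                              // ...and drop it from the index
        }
    }
}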
To be honest, there is no single true answer to this question; SQL-style data does not map onto key-value structures in any direct way. You usually have to do a lot more of the work yourself.
I would suggest reading as much as you can and would start here http://redis.io/commands and here http://redis.io/documentation.
I have no experience using Jedis, so I can't help on that side. If you want an example, I have an open-source social networking site that uses Redis as its sole data store. You can take a look at the code to get some ideas: https://github.com/pjuu/pjuu/blob/master/pjuu/auth/backend.py. It uses Python, but Redis is so easy to use anywhere that there won't be much difference.
Edit: my site above no longer uses Redis exclusively; you will need to check an older branch such as 0.4 or 0.3 :)

Processing large number of data

The question goes like this.
From one application I am getting approximately 200,000 encrypted values.
Task:
Read all the encrypted values into one VO/list.
Reformat them and add headers/trailers.
Dump these records to the DB in one shot, with the header and trailer in separately defined columns.
I don't want to use any file between the processes.
What would be the best way to store the 200,000 records - a list or something else?
How do I dump these records into the DB in one shot? Is it better to divide them into chunks and use separate threads to work on them?
Please suggest a less time-consuming solution for this.
I am using Spring Batch for this, and this process will be one job.
Spring Batch is made for this type of operation. You will want a chunk-oriented tasklet. This type of tasklet uses a reader, an item processor, and a writer. It also streams the items, so you will never have all of them in memory at one time.
I'm not sure of the incoming format of your data, but there are existing readers for pretty much any use case, and if you can't find the type you need, you can create your own. You will then want to implement an ItemProcessor to handle any modifications you need to make.
For writing, you can just use JdbcBatchItemWriter.
As for these headers/footers, I would need more details on this. If they are an aggregation of all the records, you will need to process them beforehand. You can put the end results into the ExecutionContext.
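For concreteness, a rough Java-config sketch of such a chunk step with a JdbcBatchItemWriter might look like the following (it assumes @EnableBatchProcessing is in place; the reader, SQL, table, and header/trailer values are placeholders for your own logic):

import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LoadJobConfig {

    @Bean
    public Step loadStep(StepBuilderFactory steps,
                         ItemReader<String> reader,   // whatever reader matches the incoming format
                         DataSource dataSource) {
        JdbcBatchItemWriter<Map<String, Object>> writer =
                new JdbcBatchItemWriterBuilder<Map<String, Object>>()
                        .dataSource(dataSource)
                        .sql("INSERT INTO output (header, payload, trailer) VALUES (:header, :payload, :trailer)")
                        .columnMapped()
                        .build();

        return steps.get("loadStep")
                .<String, Map<String, Object>>chunk(1000)   // 1000 items per transaction, streamed
                .reader(reader)
                .processor(this::toRow)                     // the ItemProcessor: reformat, add header/trailer
                .writer(writer)                             // one JDBC batch update per chunk
                .build();
    }

    private Map<String, Object> toRow(String encryptedValue) {
        Map<String, Object> row = new HashMap<>();
        row.put("header", "H01");            // however you derive the header
        row.put("payload", encryptedValue);  // decrypt/transform here as needed
        row.put("trailer", "T99");           // and the trailer
        return row;
    }
}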
There are a couple of generic tricks to make bulk insertion go faster:
Consider using the database's native bulk insert.
Sort the records into ascending order on the primary key before you insert them.
If you are inserting into an empty table, drop the secondary indexes first and then recreate them.
Don't do it all in one database transaction.
I don't know how well these tricks translate to Spring Batch ... but if they don't, you could consider bypassing Spring Batch and going directly to the database.
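If you do go straight to JDBC, the "don't do it all in one transaction" and batching advice looks roughly like this (table and column names are made up; pair it with your database's native bulk-load tool where available):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BulkInserter {
    // 'records' stands in for the reformatted rows; commit every 1000 rows
    // instead of holding one huge transaction open.
    public void insertAll(String url, String user, String pass, List<String[]> records) throws Exception {
        try (Connection conn = DriverManager.getConnection(url, user, pass)) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO output (header, payload, trailer) VALUES (?, ?, ?)")) {
                int count = 0;
                for (String[] r : records) {
                    ps.setString(1, r[0]);
                    ps.setString(2, r[1]);
                    ps.setString(3, r[2]);
                    ps.addBatch();
                    if (++count % 1000 == 0) {
                        ps.executeBatch();   // send the batch to the DB
                        conn.commit();       // keep transactions small
                    }
                }
                ps.executeBatch();           // remaining rows
                conn.commit();
            }
        }
    }
}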
