Spark window function with synthetic timestamp? - java

Say I have a datafile with records where each record has a timestamp, like this:
foo,bar,blaz,timestamp1
foo,flibble,baz,timestamp2
bleh,foo,gnarly,timestamp3
...
and I want to process these records using Spark, in a way that requires the window() function. Is there any way to read these records into a DStream so that the timestamp used by the window() function is provided explicitly by my code (based on parsing the timestamp field of each input record)?

No, the default Spark Streaming windowing is based on the system (processing) time. If you want to build windows based on event time, I suggest you use the updateStateByKey function and handle the windowing logic inside the update function.
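For example, here is a minimal sketch in Java (the file source, field layout, and bucket size are assumptions) of bucketing records by their own parsed timestamp and keeping per-bucket state with updateStateByKey, instead of relying on the processing-time window():

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class EventTimeBuckets {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("EventTimeBuckets");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));
        ssc.checkpoint("/tmp/checkpoint"); // updateStateByKey requires checkpointing

        // Hypothetical source; any DStream of "foo,bar,blaz,timestamp" lines works.
        JavaDStream<String> lines = ssc.textFileStream("/data/incoming");

        final long bucketMillis = 60_000L; // 1-minute event-time buckets
        JavaPairDStream<Long, Long> ones = lines.mapToPair(line -> {
            String[] fields = line.split(",");
            long eventTs = Long.parseLong(fields[3]);              // timestamp parsed from the record itself
            long bucket = (eventTs / bucketMillis) * bucketMillis; // truncate to the bucket start
            return new Tuple2<>(bucket, 1L);
        });

        // State = running count per event-time bucket, keyed by bucket start.
        Function2<List<Long>, Optional<Long>, Optional<Long>> updateFn = (values, state) -> {
            long sum = state.isPresent() ? state.get() : 0L;
            for (Long v : values) sum += v;
            return Optional.of(sum);
        };
        JavaPairDStream<Long, Long> countsPerBucket = ones.updateStateByKey(updateFn);

        countsPerBucket.print();
        ssc.start();
        ssc.awaitTermination();
    }
}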

Related

ETL design: What Queue should I use instead of my SQL table and still be able to process in parallel?

I need your help redesigning my system. We have a very simple ETL, but it is also very old, and now that we handle a massive amount of data it has become extremely slow and inflexible.
The first process is the collector process:
Collector process (always up):
The collector collects messages from the queue (RabbitMQ).
It parses the message properties (JSON format) into a Java object (for example, if the JSON contains fields like 'id', 'name' and 'color', we create a Java object with an int field 'id', a String field 'name', and a String field 'color').
After parsing, we write the object to a CSV file as a CSV row with all the properties of the object.
We send an ack and continue to the next message in the queue.
Processing workflow (runs once every hour):
A process named 'Loader' loads all the CSV files (the collector's output) into a DB table named 'Input' using SQL INFILE LOAD; all new rows get a 'Not handled' status. The 'Input' table acts like a queue in this design.
A process named 'Processor' reads all records with 'Not handled' status from the table, transforms them into Java objects, does some enrichment, and then inserts the records into another table named 'Output' with new fields. **Each iteration we process 1000 rows in parallel, using a JDBC batch update for the DB insert.**
The major problem in this flow:
The messages are not flexible in the existing flow: if I want, for example, to add a new property to the JSON message (for example, to also add 'city'), I also have to add a 'city' column to the table (because of the CSV file load). The table contains a massive amount of data, and it is not possible to add a column every time the message changes.
My conclusion
The table is not the right choice for this design.
I have to get rid of the CSV writing and remove the 'Input' table to have a flexible system. I thought of maybe using a queue instead of the table, such as Kafka, and maybe using tools such as Kafka Streams for the enrichment. This would give me flexibility, and I wouldn't need to add a column to a table every time I want to add a field to the message.
The huge problem is that I won't be able to process in parallel the way I do today.
What can I use instead of table that will allow me to process the data in parallel?
Yes, using Kafka will improve this.
Ingestion
Your process that currently writes CSV files can instead publish to a Kafka topic. This could even replace RabbitMQ, depending on your requirements and scope.
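As a rough sketch (the broker address and topic name are assumptions), the collector could publish each parsed message like this:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CollectorPublisher {
    private final KafkaProducer<String, String> producer;

    public CollectorPublisher() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // Called once per message pulled from RabbitMQ, after JSON parsing.
    public void publish(String messageId, String messageJson) {
        producer.send(new ProducerRecord<>("collector-topic", messageId, messageJson));
    }
}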
Loader (optional)
Your other process, which loads data in the initial format and writes it to a database table, can instead publish to another Kafka topic in the format you want. This step can be omitted if you can write directly in the format the processor wants.
Processor
The way you use the 'Not handled' status is a way to treat your data as a queue, but this is handled by design in Kafka, which uses a log (whereas a relational database is modeled as a set).
The processor subscribes to the messages written by the loader or ingestion. It transforms them into Java objects and does some enrichment, but instead of inserting the result into a new table, it can publish the data to a new output topic.
Instead of doing work in batches ("each iteration we process 1000 rows in parallel, using a JDBC batch update for the DB insert"), with Kafka and stream processing this is done as a continuous real-time stream, as data arrives.
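A minimal Kafka Streams sketch of that continuous processor (the topic names and the enrich method body are assumptions, not your actual logic):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EnrichmentStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("collector-topic"); // assumed topic name
        KStream<String, String> enriched = input.mapValues(EnrichmentStream::enrich);
        enriched.to("output-topic");                                       // assumed topic name

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for your existing per-message enrichment logic.
    private static String enrich(String json) {
        return json; // ... enrich fields here ...
    }
}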
Schema evolvability
"If I want, for example, to add a new property to the JSON message (for example, to also add 'city'), I also have to add a 'city' column to the table (because of the CSV INFILE load). The table contains a massive amount of data, and it is not possible to add a column every time the message changes."
You can solve this by using an Avro schema when publishing to a Kafka topic.
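For instance, a minimal sketch with the Avro Java API (field names taken from your example): the new optional 'city' field gets a default value, so old and new messages stay compatible and nothing downstream needs an ALTER TABLE:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class MessageSchema {
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"color\",\"type\":\"string\"},"
      + "{\"name\":\"city\",\"type\":[\"null\",\"string\"],\"default\":null}" // newly added field
      + "]}");

    public static void main(String[] args) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("id", 1);
        record.put("name", "foo");
        record.put("color", "blue");
        // "city" may be omitted; readers fall back to the default value
        System.out.println(record);
    }
}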

Is there a way to have a dynamic query or execute multiple queries with an apache beam pipeline?

I am using Apache Beam and Google Cloud Dataflow to insert information into a Cloud SQL database. So far this has been working great writing to one table. The information being sent is broadening, and it now includes information destined for another table in the database.
I was curious whether there is a way to dynamically use a SQL query based on the information I am receiving, or whether I can somehow create the pipeline to execute multiple queries? Either would work...
Or, am I stuck with having to create a separate pipeline?
Cheers,
EDIT: Adding my current pipeline config
MainPipeline = Pipeline.create(options);
MainPipeline.apply(PubsubIO.readStrings().fromSubscription(MAIN_SUBSCRIPTION))
    .apply(JdbcIO.<String>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("com.mysql.cj.jdbc.Driver", JDBC_URL)
            .withUsername(JDBC_USER).withPassword(JDBC_PASS))
        .withStatement(QUERY_SQL)
        .withPreparedStatementSetter(new NewPreparedStatementSetter() {
        }));
I don't think you can have dynamic queries in JdbcIO based on the input elements; it is configured once at construction time, as far as I can see.
However, I can think of a couple of potential workarounds, if they suit your use case.
One is to write your own ParDo in which you call the JDBC driver manually. This is basically re-implementing part of JdbcIO with new features added. Such a ParDo can be as flexible as you like.
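A rough sketch of such a ParDo (the JDBC URL, credentials, and the routing rule are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.beam.sdk.transforms.DoFn;

public class DynamicJdbcWriteFn extends DoFn<String, Void> {
    private transient Connection connection;

    @Setup
    public void setup() throws Exception {
        connection = DriverManager.getConnection(
            "jdbc:mysql://host/db", "user", "pass"); // assumed connection details
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        String element = c.element();
        // Pick a statement based on the element's content (hypothetical routing rule).
        String sql = element.contains("\"type\":\"order\"")
            ? "INSERT INTO orders (payload) VALUES (?)"
            : "INSERT INTO events (payload) VALUES (?)";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setString(1, element);
            stmt.executeUpdate();
        }
    }

    @Teardown
    public void teardown() throws Exception {
        if (connection != null) connection.close();
    }
}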
Another is to split the input PCollection into multiple outputs. That will work if your use case is limited to a predefined set of queries that you can choose from based on the input. This way you split the input into multiple PCollections and then attach differently configured IOs to each.
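A sketch of that split approach, reusing the names from your snippet (MainPipeline, MAIN_SUBSCRIPTION, NewPreparedStatementSetter) plus assumed placeholders (DB_CONFIG, TABLE_A_SQL, TABLE_B_SQL): a multi-output ParDo routes each element to a tag, and each branch gets its own JdbcIO.write().

final TupleTag<String> tableATag = new TupleTag<String>() {};
final TupleTag<String> tableBTag = new TupleTag<String>() {};

PCollectionTuple branches = MainPipeline
    .apply(PubsubIO.readStrings().fromSubscription(MAIN_SUBSCRIPTION))
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Hypothetical routing rule based on the message content.
            if (c.element().contains("\"destination\":\"tableB\"")) {
                c.output(tableBTag, c.element());
            } else {
                c.output(c.element()); // main output -> table A
            }
        }
    }).withOutputTags(tableATag, TupleTagList.of(tableBTag)));

branches.get(tableATag).apply(JdbcIO.<String>write()
    .withDataSourceConfiguration(DB_CONFIG)
    .withStatement(TABLE_A_SQL)
    .withPreparedStatementSetter(new NewPreparedStatementSetter() {}));

branches.get(tableBTag).apply(JdbcIO.<String>write()
    .withDataSourceConfiguration(DB_CONFIG)
    .withStatement(TABLE_B_SQL)
    .withPreparedStatementSetter(new NewPreparedStatementSetter() {}));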

Copy Database table data to text file without looping recordset

Is it possible to get the result set data into a single string (for writing to Notepad)? Using a recordset, we need to loop through each field.
Is there any other way to get a single string without looping through each field? I am able to do this in VBA by copying the entire recordset to an Excel sheet.
There isn't really anything in standard Java that does this for you, except maybe javax.sql.rowset.WebRowSet and one of its writeXml methods, but that produces a very specific and verbose format.
If you want to output a result set in a specific format, you will need to do this yourself, or find a library that does this for you.
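For the do-it-yourself route, a minimal sketch could use ResultSetMetaData so the code loops over columns generically instead of naming each field by hand (the class and method names here are made up):

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

public final class ResultSetToText {
    public static String toDelimitedString(ResultSet rs, String delimiter) throws SQLException {
        ResultSetMetaData meta = rs.getMetaData();
        int columns = meta.getColumnCount();
        StringBuilder sb = new StringBuilder();
        while (rs.next()) {
            for (int i = 1; i <= columns; i++) {
                sb.append(rs.getString(i));       // every column as text
                if (i < columns) sb.append(delimiter);
            }
            sb.append(System.lineSeparator());    // one line per row
        }
        return sb.toString();
    }
}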
Well, in general you have to convert the result set into whatever DTO class you want and then implement its toString() method the way you want it.
There are some ways to achieve this. Here is one:
Mapping a JDBC ResultSet to an object

Spark Context not Serializable?

So, I am getting the infamous Task Not Serializable error in Spark. Here's the related code block:
val labeledPoints: RDD[LabeledPoint] = events.map(event => {
  var eventsPerEntity = try {
    HBaseHelper.scan(...filter entity here...)(sc).map(newEvent => {
      Try(new Object(...))
    }).filter(_.isSuccess).map(_.get)
  } catch {
    case e: Exception => {
      logger.error(s"Failed to convert event ${event}." +
        s"Exception: ${e}.")
      throw e
    }
  }
})
Basically, what I am trying to achieve is accessing sc, my SparkContext object, inside map. At runtime, I get the Task Not Serializable error.
Here is a potential solution I could think of:
Query HBase without sc, which I can do, but then I will have a list. (If I try to parallelize it, I have to use sc again.) Having a list means I am not able to use reduceByKey, which is advised here in my other question. So I could not successfully achieve this one either, as I don't know how I would do it without reduceByKey. Also, I would really like to use RDDs :)
So I am looking for another solution, and I am also asking whether I am doing something wrong. Thanks in advance!
Update
So basically, my question has become this:
I have an RDD named events. This is the whole HBase table. Note: every event is performed by a performerId, which is again a field in event, i.e. event.performerId.
For every event in events, I need to calculate the ratio of event.numericColumn to the average numericColumn of the events (a subset of events) that are performed by the same performerId.
I was trying to do this while mapping events. Within map I was trying to filter events according to their performerId.
Basically, I am trying to convert every event to a LabeledPoint, and the ratio above is going to be one of the features in my Vector. I.e., for every event, I am trying to get:
// I am trying to calculate the average, but cannot use filter, because I am in the map block.
LabeledPoint(
  event.someColumn,
  Vectors.dense(
    averageAbove,
    ...
  )
)
I would appreciate any help. Thanks!
One option, if applicable, is to load the entire HBase table (or all the elements that might match one of the events in the events RDD, if you have any way of isolating them without going over the RDD) into a DataFrame, and then use join.
To load data from an HBase table into a DataFrame, you can use the preview Spark-HBase Connector from Hortonworks. Then, performing the right join operation between the two DataFrames should be easy.
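For the ratio you described, a minimal DataFrame sketch could look like this (the column names performerId/numericColumn come from your question; everything else is an assumption): compute the per-performer average once, join it back, and derive the ratio as a column, with no SparkContext used inside a closure.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

public class RatioFeature {
    public static Dataset<Row> withRatio(Dataset<Row> events) {
        // Average of numericColumn per performerId.
        Dataset<Row> averages = events
            .groupBy("performerId")
            .agg(avg("numericColumn").alias("avgNumeric"));

        // Join the average back onto every event and compute the ratio.
        return events
            .join(averages, "performerId")
            .withColumn("ratio", col("numericColumn").divide(col("avgNumeric")));
    }
}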
You can add the list as a new field on the event, thereby getting a new RDD (event + list of entities). You can then use regular Spark operations to "explode" the list and thus get multiple event + list-item records (it is easier to do this with DataFrames/Datasets than with RDDs, though).
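A tiny sketch of that idea in the DataFrame API (the "entities"/"entity" column names are assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodeEntities {
    // "entities" is the array column previously attached to each event;
    // explode turns it into one row per (event, entity) pair.
    public static Dataset<Row> oneRowPerEntity(Dataset<Row> eventsWithEntities) {
        return eventsWithEntities.withColumn("entity", explode(col("entities")));
    }
}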
It's simple: you can't use the Spark context inside an RDD closure, so find another approach to handle this.

Processing large number of data

My question goes like this:
From one application I am getting approximately 200,000 encrypted values.
Task:
Read all encrypted values into one VO/list.
Reformat them and add headers/trailers.
Dump these records to the DB in one shot, with the header and trailer in separate, defined columns.
I don't want to use any files between the processes.
What would be the best way to store 200,000 records: a list or something else?
How do I dump these records into the DB in one shot? Is it better to divide them into chunks and use separate threads to work on them?
Please suggest a less time-consuming solution for this.
I am using Spring Batch for this, and this process will be one job.
Spring Batch is made for this type of operation. You will want a chunk-oriented step. This type of step uses a reader, an item processor, and a writer. It also streams the items, so you will never have all of them in memory at one time.
I'm not sure of the incoming format of your data, but there are existing readers for pretty much any use case, and if you can't find the type you need, you can create your own. You will then want to implement ItemProcessor to handle any modifications you need to make.
For writing, you can just use JdbcBatchItemWriter.
As for the headers/footers, I would need more details. If they are an aggregation of all the records, you will need to process them beforehand. You can put the end results into the ExecutionContext.
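A hedged sketch of how such a step could be wired up with Java config (the bean names, DTO, SQL, and table are assumptions, not your actual schema):

import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EncryptedValuesJobConfig {

    @Bean
    public JdbcBatchItemWriter<DecryptedRecord> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<DecryptedRecord>()
            .dataSource(dataSource)
            .sql("INSERT INTO output (header, payload, trailer) VALUES (:header, :payload, :trailer)")
            .beanMapped() // maps the named parameters to DecryptedRecord getters
            .build();
    }

    @Bean
    public Step loadStep(StepBuilderFactory steps,
                         ItemReader<String> encryptedReader,
                         ItemProcessor<String, DecryptedRecord> reformatProcessor,
                         JdbcBatchItemWriter<DecryptedRecord> writer) {
        return steps.get("loadStep")
            .<String, DecryptedRecord>chunk(1000) // items are streamed, never all in memory
            .reader(encryptedReader)
            .processor(reformatProcessor)
            .writer(writer)
            .build();
    }

    // Simple DTO for the reformatted record (hypothetical fields).
    public static class DecryptedRecord {
        private String header;
        private String payload;
        private String trailer;
        public String getHeader() { return header; }
        public String getPayload() { return payload; }
        public String getTrailer() { return trailer; }
        public void setHeader(String h) { this.header = h; }
        public void setPayload(String p) { this.payload = p; }
        public void setTrailer(String t) { this.trailer = t; }
    }
}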
There are a couple of generic tricks to make bulk insertion go faster:
Consider using the database's native bulk insert.
Sort the records into ascending order on the primary key before you insert them.
If you are inserting into an empty table, drop the secondary indexes first and then recreate them.
Don't do it all in one database transaction.
I don't know how well these tricks translate to Spring Batch ... but if they don't, you could consider bypassing Spring Batch and going directly to the database.
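For illustration, a bare-JDBC sketch of the last two points, committing every N rows instead of using one giant transaction (the table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public final class ChunkedInsert {
    public static void insert(Connection conn, List<String> rows) throws SQLException {
        final int chunkSize = 1000;
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO output (payload) VALUES (?)")) {
            int count = 0;
            for (String row : rows) {
                ps.setString(1, row);
                ps.addBatch();
                if (++count % chunkSize == 0) {
                    ps.executeBatch();
                    conn.commit(); // keep each transaction small
                }
            }
            ps.executeBatch(); // flush the remaining partial batch
            conn.commit();
        }
    }
}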
