I am using Apache Beam to read messages from PubSub and write them to BigQuery. What I'm trying to do is write to multiple tables according to the information in the input. To reduce the number of writes, I am using windowing on the input from PubSub.
A small example:
messages
    .apply(new PubsubMessageToTableRow(options))
    .get(TRANSFORM_OUT)
    .apply(ParDo.of(new CreateKVFromRow()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(10L))))
    // group by key
    .apply(GroupByKey.create())
    // Are these two rows what I want?
    .apply(Values.create())
    .apply(Flatten.iterables())
    .apply(BigQueryIO.writeTableRows()
        .withoutValidation()
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withExtendedErrorInfo()
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
            // Simplified for readability
            Integer destination = (Integer) input.getValue().get("key");
            return new TableDestination(
                new TableReference()
                    .setProjectId(options.getProjectID())
                    .setDatasetId(options.getDatasetID())
                    .setTableId(destination + "_Table"),
                "Table Destination");
        }));
I couldn't find anything in the documentation, but I was wondering how many writes are done per window? If the elements go to multiple tables, is it one write per table for all elements in the window? Or is it one write per element, since the table might be different for each element?
Since you're using Pub/Sub as a source, your job is a streaming job. Therefore, the default insertion method is STREAMING_INSERTS (see docs). I don't see any benefit in or reason for reducing writes with this method, as billing is based on the size of the data. By the way, your example doesn't really reduce the number of writes effectively anyway.
Although it is a streaming job, the FILE_LOADS method has also been supported for a few versions now. If withMethod is set to FILE_LOADS, you can define withTriggeringFrequency on BigQueryIO. This setting defines the frequency at which load jobs are started. Here the connector handles everything for you, and you don't need to group by key or window the data. A load job will be started for each table.
Since it seems totally fine for you if it takes some time until your data is in BigQuery, I'd suggest using FILE_LOADS, as loading is free, as opposed to streaming inserts. Just mind the quotas when defining the triggering frequency.
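For illustration, a minimal sketch of what that could look like; "rows" stands for your PCollection of TableRows, destinationFn for the same dynamic table function you already use, and the triggering frequency and shard count are example values, not recommendations:

rows.apply(BigQueryIO.writeTableRows()
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
    // How often a load job is started per destination table.
    .withTriggeringFrequency(Duration.standardMinutes(10))
    // Needed when combining FILE_LOADS with a triggering frequency on an unbounded source.
    .withNumFileShards(10)
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .to(destinationFn));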
Related
I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day. To ignore duplicate records, we are planning to use the Deduplication function provided by the Beam framework.
The documentation states neither the maximum input count for which the Deduplication function will work nor the maximum duration for which it can persist the data.
Would it be good design to simply throw 50M records at the deduplication function, of which around half would be duplicates, and keep the persistence duration at 7 days?
The deduplication function, as described in the link you provide, performs deduplication per window.
If you have a window of 1 hour and your duplicates arrive every 3 hours, the function won't deduplicate them, because they fall in different windows.
So you can define a window over 1 day, or more; there is no limit. The data is stored on the workers (to persist it) and also kept in memory (for efficiency). The more data you have, the stronger and bigger the worker configuration must be to manage that quantity of data.
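For reference, a minimal sketch of the built-in Deduplicate transform in the Beam Java SDK, assuming "records" is a PCollection<String> of record keys and that your SDK version supports withDuration (exact options vary between versions):

// Keep deduplication state for 7 days, matching the duration in the question.
PCollection<String> uniqueRecords = records
    .apply("DeduplicateRecords",
        Deduplicate.<String>values()
            .withDuration(Duration.standardDays(7)));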
I have a pipeline that takes URLs for files, downloads them, and generates BigQuery table rows for each line apart from the header.
To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.
For this to work I need to either store the history in a database that enforces unique values, or (perhaps easier) use BigQuery for this as well, but then access to the table must be strictly serial.
Can I enforce single-thread execution (on a single machine) to satisfy this for part of my pipeline only?
(After this point, each of my hundreds of URLs/files would be suitable for processing on a separate thread; each single file gives rise to 10,000-10,000,000 rows, so throttling at that point will almost certainly not cause performance issues.)
Beam is designed for parallel processing of data, and it tries to explicitly stop you from synchronizing or blocking except through a few built-in primitives, such as Combine.
It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. This operator uses a Combine per-key to group the elements by key (your URL in this case), then emits each key only the first time it is seen.
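A minimal sketch, assuming "urls" is a PCollection<String> of the URLs you encounter:

// Distinct emits each distinct URL only once per window.
PCollection<String> firstSeenUrls = urls.apply(Distinct.<String>create());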
I am using Apache Beam & Google Cloud Dataflow to insert information into a Cloud SQL database. So far this has been working great writing to one table. The information being sent is now being broadened to include data destined for another table in the database.
I was curious if there is a way to dynamically use an SQL query based on the information I am receiving, or whether I can somehow create the pipeline to execute multiple queries? Either would work...
Or, am I stuck with having to create a separate pipeline?
Cheers,
EDIT: Adding my current pipeline config
MainPipeline = Pipeline.create(options);

MainPipeline.apply(PubsubIO.readStrings().fromSubscription(MAIN_SUBSCRIPTION))
    .apply(JdbcIO.<String>write()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create("com.mysql.cj.jdbc.Driver", JDBC_URL)
                .withUsername(JDBC_USER)
                .withPassword(JDBC_PASS))
        .withStatement(QUERY_SQL)
        .withPreparedStatementSetter(new NewPreparedStatementSetter() {
        }));
I don't think you can have dynamic queries in JdbcIO based on the input elements; as far as I can see, it is configured once at construction time.
However, I can think of a couple of potential workarounds, if they suit your use case.
One is to write your own ParDo in which you call the JDBC driver manually. This basically re-implements part of JdbcIO with the features you need added, but such a ParDo can be as flexible as you like.
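A rough sketch of that idea; the element type MyEvent, its getPayload() accessor, and the chooseStatement helper are hypothetical placeholders, the connection handling is simplified, and JDBC_URL, JDBC_USER, JDBC_PASS come from your snippet:

class DynamicJdbcWriteFn extends DoFn<MyEvent, Void> {
    private transient Connection connection;

    @Setup
    public void setup() throws SQLException {
        connection = DriverManager.getConnection(JDBC_URL, JDBC_USER, JDBC_PASS);
    }

    @ProcessElement
    public void processElement(@Element MyEvent event) throws SQLException {
        // Pick the SQL statement based on the element itself.
        try (PreparedStatement statement =
                 connection.prepareStatement(chooseStatement(event))) {
            statement.setString(1, event.getPayload());
            statement.executeUpdate();
        }
    }

    @Teardown
    public void teardown() throws SQLException {
        if (connection != null) {
            connection.close();
        }
    }
}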
Another is to split the input PCollection into multiple outputs. That will work if your use case is limited to a predefined set of queries that you can choose from based on the input. This way you split the input into multiple PCollections and then attach differently configured IOs to each.
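A sketch of that second approach, splitting with output tags and then attaching a separately configured JdbcIO.write() to each branch; the routing condition, dataSourceConfig, TABLE_A_SQL and tableASetter are illustrative placeholders:

final TupleTag<String> tableARows = new TupleTag<String>() {};
final TupleTag<String> tableBRows = new TupleTag<String>() {};

PCollectionTuple split = input.apply(ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(@Element String row, MultiOutputReceiver out) {
        // Route each element by whatever marker it carries (simplified).
        if (row.contains("tableA")) {
            out.get(tableARows).output(row);
        } else {
            out.get(tableBRows).output(row);
        }
    }
}).withOutputTags(tableARows, TupleTagList.of(tableBRows)));

split.get(tableARows).apply(JdbcIO.<String>write()
    .withDataSourceConfiguration(dataSourceConfig)
    .withStatement(TABLE_A_SQL)
    .withPreparedStatementSetter(tableASetter));
// ... and similarly for split.get(tableBRows) with its own statement and setter.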
I am reading a stream of data in my Spark application from a Kafka stream. My requirement is to produce product recommendations for a user whenever they make a request (search/browse etc.).
I already have a trained model containing scores for users. I am using Java and the org.apache.spark.mllib.recommendation.MatrixFactorizationModel model, which I read once at the start of my Spark application. Whenever there is a browsing event, I call the recommendProducts(user_id, num_of_recommended_products) API to produce recommendations for that user from the already trained model.
This API takes ~3-5 seconds to generate results per user, which is very slow, and hence my stream processing lags behind. Are there any ways in which I can optimise the time this API takes? I am considering increasing the stream duration from 15 seconds to 1 minute as an optimisation (though I'm not sure of its results yet).
Calling recommendProducts in real time doesn't make much sense. Since an ALS model can only make predictions for users that were seen in the training dataset, it is better to call recommendProductsForUsers once, store the output in a store that supports fast lookups by key, and fetch results from there when needed.
If adding a storage layer is not an option, you can also take the output of recommendProductsForUsers, partition it by id, checkpoint and cache the predictions, and then join with the input stream by id.
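A rough Java sketch of the precompute-and-store idea, assuming "model" is your already loaded MatrixFactorizationModel and saveToStore is a hypothetical helper that writes to whatever key-value store you pick:

// Precompute the top 10 products for every user known to the model.
JavaRDD<Tuple2<Object, Rating[]>> recommendations =
    model.recommendProductsForUsers(10).toJavaRDD();

// Persist each user's recommendations under the user id for fast lookups later.
recommendations.foreach(userRecs ->
    saveToStore((Integer) userRecs._1(), userRecs._2()));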
I’m struggling with how to design a Spring Batch job. The overall goal is to retrieve ~20 million records and save them to a SQL database.
I’m doing it in two parts. First I retrieve the 20 million ids of the records I want to retrieve and save those to a file (or DB). This is a relatively fast operation. Second, I loop through my file of Ids, taking batches of 2,000, and retrieve their related records from an external service. I then repeat this, 2,000 Ids at a time, until I’ve retrieved all of the records. For each batch of 2,000 records I retrieve, I save them to a database.
Some may be asking why I’m doing this in two steps. I eventually plan to make the second step run in parallel so that I can retrieve batches of 2,000 records in parallel and hopefully greatly speed up the download. Having the Ids allows me to partition the job into batches. For now, let’s not worry about parallelism and just focus on how to design a simpler sequential job.
Imagine I already have solved the first problem of saving all of the Ids locally. They are in a file, one Id per line. How do I design the steps for the second part?
Here’s what I’m thinking…
Read 2,000 Ids using a flat file reader. I’ll need an aggregator since I only want to do one query to my external service for each batch of 2K Ids. This is where I’m struggling. Do I nest a series of readers? Or can I do ‘reading’ in the processor or writer?
Essentially, my problem is that I want to read lines from a file, aggregate those lines, and then immediately do another ‘read’ to retrieve the respective records. I almost want to chain readers together.
Finally, once I’ve retrieved the records from the external service, I’ll have a List of records. Which means when they arrive at the Writer, I’ll have a list of lists. I want a list of objects so that I can use the JdbcBatchItemWriter out of the box.
Thoughts? Hopefully that makes sense.
Andrew
This is a matter of design and is subjective, but based on the Spring Batch example I found (from SpringSource) and my personal experience, the pattern of doing additional reading in the processor step is a good solution to this problem. You can also chain together multiple processors/readers in the 'processor' step. So, while the names don't exactly match, I find myself doing more and more 'reading' in my processors.
http://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#drivingQueryBasedItemReaders
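For illustration, a minimal sketch of "reading inside the processor" (one lookup per driving id; the once-per-chunk variant is covered in the next answer). ExternalRecordService, fetchRecord and Record are hypothetical placeholders for your external service and its result type:

import org.springframework.batch.item.ItemProcessor;

public class RecordLookupProcessor implements ItemProcessor<Long, Record> {

    private final ExternalRecordService service;

    public RecordLookupProcessor(ExternalRecordService service) {
        this.service = service;
    }

    @Override
    public Record process(Long id) throws Exception {
        // One lookup per driving id; returning null would filter the item out.
        return service.fetchRecord(id);
    }
}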
Given that you want to call your external service just once per chunk of 2,000 records, you'll actually want to do this service call in an ItemWriter. That is the standard recommended way to do chunk-level processing.
You can create a custom ItemWriter<Long> implementation. It will receive the list of 2,000 ids as input and call the external service. The result from the external service should allow you to create a List<Item>. Your writer can then simply forward this List<Item> to your JdbcBatchItemWriter<Item> delegate.
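A hedged sketch of such a writer, using the pre-Spring Batch 5 write(List) signature; ExternalRecordService, fetchRecords(...) and Item are hypothetical placeholders, and JdbcBatchItemWriter is Spring Batch's out-of-the-box JDBC writer:

import java.util.List;

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcBatchItemWriter;

public class ExternalServiceItemWriter implements ItemWriter<Long> {

    private final ExternalRecordService service;
    private final JdbcBatchItemWriter<Item> delegate;

    public ExternalServiceItemWriter(ExternalRecordService service,
                                     JdbcBatchItemWriter<Item> delegate) {
        this.service = service;
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends Long> ids) throws Exception {
        // One call to the external service for the whole chunk (e.g. 2,000 ids).
        List<Item> items = service.fetchRecords(ids);
        // Forward the resulting items to the JDBC delegate in the same transaction.
        delegate.write(items);
    }
}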