I am reading a stream of data in my Spark application from a Kafka stream. My requirement is to produce product recommendations for a user whenever they make a request (search, browse, etc.).
I already have a trained model containing user scores. I am using Java and org.apache.spark.mllib.recommendation.MatrixFactorizationModel, and I load the model once at the start of my Spark application. Whenever there is a browsing event, I call the recommendProducts(user_id, num_of_recommended_products) API to produce recommendations for that user from the existing trained model.
This API takes ~3-5 seconds per user, which is very slow, so my stream processing lags behind. Is there any way to reduce the time this API takes? I am considering increasing the stream duration from 15 seconds to 1 minute as an optimisation (not sure of the results yet).
Calling recommendProducts in real time doesn't make much sense. Since an ALS model can make predictions only for users that were seen in the training dataset, it is better to call recommendProductsForUsers once, store the output in a store that supports fast lookups by key, and fetch the results from there when needed.
If adding a storage layer is not an option, you can also take the output of recommendProductsForUsers, partition it by id, checkpoint and cache the predictions, and then join them with the input stream by id.
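A minimal sketch of the lookup-by-key idea, assuming the recommendations have already been computed in a batch step. The class name RecommendationCache and the in-memory map are illustrative stand-ins for whichever key-value store you choose; nothing here is Spark API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an external key-value store of precomputed results.
final class RecommendationCache {
    private final Map<Integer, List<Integer>> topProductsByUser = new HashMap<>();

    // Called once per batch run with the precomputed user -> product ids mapping.
    void load(Map<Integer, List<Integer>> precomputed) {
        topProductsByUser.clear();
        topProductsByUser.putAll(precomputed);
    }

    // Called per streaming event: a constant-time lookup instead of a model call.
    // Users absent from the training data simply get an empty list.
    List<Integer> recommend(int userId) {
        return topProductsByUser.getOrDefault(userId, Collections.<Integer>emptyList());
    }
}
```

The streaming job would then call recommend(userId) for each browsing event, and the batch step would call load whenever the model is retrained.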
I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day. To ignore duplicate records, we are planning to use the deduplication function provided by the Beam framework.
The documentation doesn't state the maximum input count for which the deduplication function works, nor the maximum duration for which it can persist the data.
Would it be a good design to simply throw 50M records at the deduplication function, of which around half would be duplicates, and keep the persistence duration at 7 days?
The deduplication function, as described in the link that you provided, performs deduplication per window.
If you have a window of 1H and your duplicates arrive every 3H, the function doesn't deduplicate them, because they are in different windows.
So you can define a window over 1 day or more. There is no limit. The data are stored on the workers (to persist them) and also kept in memory (for efficiency). The more data you have, the bigger and more powerful the worker configuration must be to handle it.
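To make the per-window behaviour described above concrete, here is a small plain-Java model. It is only an illustration of the semantics, not Beam's implementation; the class WindowedDedup, its method, and the use of non-negative millisecond timestamps are all assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual model of deduplication scoped to fixed windows (not Beam code).
final class WindowedDedup {
    // ids.get(i) arrived at timestampsMillis.get(i); timestamps are assumed >= 0.
    static List<String> dedup(List<String> ids, List<Long> timestampsMillis, long windowMillis) {
        Map<Long, Set<String>> seenPerWindow = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            // Integer division assigns each element to its fixed window.
            long window = timestampsMillis.get(i) / windowMillis;
            Set<String> seen = seenPerWindow.computeIfAbsent(window, w -> new HashSet<>());
            // Set.add returns false when the id was already seen in this window.
            if (seen.add(ids.get(i))) {
                kept.add(ids.get(i));
            }
        }
        return kept;
    }
}
```

With a 1H window, the same id arriving at 0H, 0.5H and 3H is kept twice (once per window); with a 1-day window it is kept once. The per-window sets are also what grows with the window size, which is why longer windows need more worker capacity.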
I am using Apache Beam to read messages from PubSub and write them to BigQuery. What I'm trying to do is write to multiple tables according to the information in the input. To reduce the number of writes, I am using windowing on the input from PubSub.
A small example:
messages
.apply(new PubsubMessageToTableRow(options))
.get(TRANSFORM_OUT)
.apply(ParDo.of(new CreateKVFromRow()))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(10L))))
// group by key
.apply(GroupByKey.create())
// Are these two rows what I want?
.apply(Values.create())
.apply(Flatten.iterables())
.apply(BigQueryIO.writeTableRows()
.withoutValidation()
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
// Simplified for readability
Integer destination = (Integer) input.getValue().get("key");
return new TableDestination(
new TableReference()
.setProjectId(options.getProjectID())
.setDatasetId(options.getDatasetID())
.setTableId(destination + "_Table"),
"Table Destination");
}));
I couldn't find anything in the documentation, but I was wondering how many writes are done for each window. If there are multiple tables, is it one write per table for all elements in the window? Or is it one write per element, as the table might be different for each element?
Since you're using PubSub as a source, your job appears to be a streaming job. Therefore, the default insertion method is STREAMING_INSERTS (see docs). I don't see any benefit in or reason for reducing writes with this method, as billing is based on the size of the data. By the way, your example does not really reduce the number of writes.
Although it is a streaming job, the FILE_LOADS method has also been supported for a few versions now. If withMethod is set to FILE_LOADS, you can define withTriggeringFrequency on BigQueryIO. This setting defines how often the load job runs. The connector handles everything for you, so you don't need to group by key or window the data. A load job will be started for each table.
Since it seems to be fine for you if it takes some time until your data is in BigQuery, I'd suggest using FILE_LOADS, as loading is free, as opposed to streaming inserts. Just mind the quotas when defining the triggering frequency.
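As a rough mental model of "one load job per table per triggering interval", the following plain-Java sketch groups the rows buffered during one interval by destination table. It is not the connector's code: LoadBatcher and groupByTable are hypothetical names, rows are modelled as plain maps, and the table naming simply mirrors the example in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual model only: batches rows by destination table for one interval.
final class LoadBatcher {
    static Map<String, List<Map<String, Object>>> groupByTable(List<Map<String, Object>> rows) {
        Map<String, List<Map<String, Object>>> batches = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            // Same destination naming as in the question: <key>_Table.
            String tableId = row.get("key") + "_Table";
            batches.computeIfAbsent(tableId, t -> new ArrayList<>()).add(row);
        }
        // Each map entry corresponds to one load for that table in this interval.
        return batches;
    }
}
```

Under this model, the number of loads per interval equals the number of distinct destination tables seen in that interval, regardless of how many rows each table receives.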
I have around 15 million records in MySQL (read only), which will be fetched using joins across 10 tables. Around 50,000 new records are inserted daily. The number will keep increasing in the future.
Each record will be processed independently by a Java program. Multiple processing steps will be applied to the same record, and the output will be calculated from them.
Results will be stored in another database.
Processing shall be completed within an hour.
My questions are:
How do I design the processing engine (a cluster of Java programs) in a distributed manner, making the processing as fast as possible? To be more precise, I want to boot many spot instances at that time and finish the processing.
Will MySQL be a read bottleneck?
I don't have any experience with big data solutions. Should I use Spark or any other MapReduce solution? If yes, how should I proceed?
I was in a similar situation where we were collecting about 15 million records per day. What I did was create some collection tables that I rotated and performed initial processing. Once that was done, I moved the data to the next phase where further processing was done before adding it to the large collection of data. Breaking it down will get the best performance and avoid having to run through a large set of data.
I'm not sure what you mean by processing data or why you want to do it in Java; you may have a good reason for that. I would imagine that performance would be much better if you offloaded that to MySQL and let it do as much of the processing as possible.
I’m struggling with how to design a Spring Batch job. The overall goal is to retrieve ~20 million records and save them to a SQL database.
I’m doing it in two parts. First I retrieve the 20 million ids of the records I want to retrieve and save those to a file (or DB). This is a relatively fast operation. Second, I loop through my file of Ids, taking batches of 2,000, and retrieve their related records from an external service. I then repeat this, 2,000 Ids at a time, until I’ve retrieved all of the records. For each batch of 2,000 records I retrieve, I save them to a database.
Some may be asking why I’m doing this in two steps. I eventually plan to make the second step run in parallel so that I can retrieve batches of 2,000 records in parallel and hopefully greatly speed up the download. Having the Ids allows me to partition the job into batches. For now, let’s not worry about parallelism and just focus on how to design a simpler sequential job.
Imagine I already have solved the first problem of saving all of the Ids locally. They are in a file, one Id per line. How do I design the steps for the second part?
Here’s what I’m thinking…
Read 2,000 Ids using a flat file reader. I’ll need an aggregator since I only want to do one query to my external service for each batch of 2K Ids. This is where I’m struggling. Do I nest a series of readers? Or can I do ‘reading’ in the processor or writer?
Essentially, my problem is that I want to read lines from a file, aggregate those lines, and then immediately do another ‘read’ to retrieve the respective records. I almost want to chain readers together.
Finally, once I’ve retrieved the records from the external service, I’ll have a List of records. Which means when they arrive at the Writer, I’ll have a list of lists. I want a list of objects so that I can use the JdbcItemWriter out of the box.
Thoughts? Hopefully that makes sense.
Andrew
This is a matter of design and is subjective, but based on the Spring Batch example I found (from SpringSource) and my personal experience, the pattern of doing additional reading in the processor step is a good solution to this problem. You can also chain together multiple processors/readers in the 'processor' step. So, while the names don't exactly match, I find myself doing more and more 'reading' in my processors.
http://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#drivingQueryBasedItemReaders
Given that you want to call your external service just once per chunk of 2,000 records, you'll actually want to do this service call in an ItemWriter. That is the standard recommended way to do chunk-level processing.
You can create a custom ItemWriter<Long> implementation. It will receive the list of 2,000 IDs as input and call the external service. The result from the external service should allow you to create a List<Item>. Your writer can then simply forward this List<Item> to your JdbcItemWriter<Item> delegate.
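The fetch-then-delegate writer described above can be modelled in plain Java as follows. This is a sketch under stated assumptions: none of the types below are Spring Batch classes, FetchingChunkWriter is a hypothetical name, the external service is represented by a Function, and the JDBC writer by a Consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java model of a chunk-level writer; it only mirrors the idea.
final class FetchingChunkWriter<I, O> {
    private final Function<List<I>, List<O>> externalService; // hypothetical: one call per ID chunk
    private final Consumer<List<O>> delegate;                 // stands in for the JDBC writer

    FetchingChunkWriter(Function<List<I>, List<O>> externalService, Consumer<List<O>> delegate) {
        this.externalService = externalService;
        this.delegate = delegate;
    }

    // Receives one chunk of IDs, makes a single service call, forwards a flat list.
    void write(List<I> idChunk) {
        List<O> records = externalService.apply(idChunk);
        delegate.accept(records);
    }

    // Splits the full ID list into consecutive chunks, as a chunk-oriented step would.
    static <I> List<List<I>> chunks(List<I> ids, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<I>> result = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, ids.size());
            result.add(new ArrayList<>(ids.subList(start, end)));
        }
        return result;
    }
}
```

Because the delegate always receives a flat list of records, the "list of lists" problem from the question does not arise: the writer sees IDs, and only the delegate sees records.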
Part of our application needs to load a large set of data (>2000 entities) and perform a computation on this set. The size of each entity is approximately 5 KB.
In our initial, naïve implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We have tried several strategies to speed up entity retrieval:
Splitting the retrieval request into several parallel instances and then merging the results: ~20 seconds for 2000 entities.
Storing the entities in an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The result needs to be computed dynamically, so precomputing at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200,000 entities of the same kind in the datastore, and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained can be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that needs to be analyzed to determine whether it's a good investment or not.
The analysis requires complex computations based on many factors, both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and outputs a single "investment score" value.
The user can request that the derivatives be sorted by their investment score and ask to be presented with the N highest-scored derivatives.
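Once the entities are available in memory, the "N highest-scored" request described above could be served with a bounded min-heap, which avoids sorting the whole set. This is a generic sketch: TopN, highest and the score function are hypothetical names, and the score is computed exactly once per item since the analysis is said to be expensive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.ToDoubleFunction;

final class TopN {
    // Pairs an item with its score so the score function runs once per item.
    private static final class Scored<T> {
        final T item;
        final double score;

        Scored(T item, double score) {
            this.item = item;
            this.score = score;
        }
    }

    // Returns up to n items with the highest scores, best first.
    static <T> List<T> highest(Iterable<T> items, ToDoubleFunction<T> scoreFn, int n) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap: the root is always the weakest of the current top n.
        PriorityQueue<Scored<T>> heap =
            new PriorityQueue<>(n, Comparator.comparingDouble((Scored<T> s) -> s.score));
        for (T item : items) {
            double score = scoreFn.applyAsDouble(item);
            if (heap.size() < n) {
                heap.add(new Scored<>(item, score));
            } else if (score > heap.peek().score) {
                heap.poll();
                heap.add(new Scored<>(item, score));
            }
        }
        // Polling yields ascending scores; reverse to get best first.
        while (!heap.isEmpty()) {
            result.add(heap.poll().item);
        }
        Collections.reverse(result);
        return result;
    }
}
```

This keeps at most n scored items in the heap at any time, so memory for the selection itself stays small even when all 200,000 entities are scanned.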
200,000 × 5 KB is 1 GB. You could keep all this in memory on the largest backend instance or spread it over multiple instances. This would be the fastest solution; nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.
This is very interesting, but yes, it's possible, and I've seen some mind-boggling results.
I would have done the same: the map-reduce concept.
It would be great if you could provide more metrics: how many parallel instances do you use, and what are the results for each instance?
Also, does your process include retrieval alone, or retrieval and storing?
How many elements do you have in your data store? 4000? 10000? The reason I ask is that you could cache the results from the previous request.
regards
In the end, it does not appear that we can retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we find a better strategy/implementation for this problem, I will change or update the accepted answer.
Our solution involves periodically reading entities in a background task and storing the result in a JSON blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in JavaScript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately, the Java library for MapReduce is broken for me, so we're using a non-optimal task to do all the reading, but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1MB, so at the moment we're loading the content from a compressed entity and expanding it into memory; that way the final payload gets gzipped. I don't like that, as it introduces latency. I hope they fix the issues with GZIP soon!