Creating objects in parallel using RxJava - java

I have written a Spring Boot microservice using RxJava (an aggregating service) to implement the following simplified use case. The big picture: when an instructor uploads a course content document, a set of questions should be generated and saved.
User uploads a document to the system.
The system calls a Document Service to convert the document into text.
Then it calls another question-generating service to generate a set of questions from that text content.
Finally these questions are posted to a basic CRUD microservice to be saved.
When a user uploads a document, many questions are created from it (maybe hundreds). The problem is that I am posting the questions to the CRUD service one at a time, sequentially. This slows the operation down drastically because of the IO-intensive network calls; the entire process takes around 20 seconds. Here is the current code, assuming all the questions have already been formulated.
questions.flatMapIterable(list -> list).flatMap(q -> createQuestion(q)).toList();
private Observable<QuestionDTO> createQuestion(QuestionDTO question) {
    return Observable.<QuestionDTO>create(sub -> {
        QuestionDTO questionCreated = restTemplate.postForEntity(QUESTIONSERVICE_API,
                new org.springframework.http.HttpEntity<QuestionDTO>(question), QuestionDTO.class).getBody();
        sub.onNext(questionCreated);
        sub.onCompleted();
    }).doOnNext(s -> log.debug("Question was created successfully."))
      .doOnError(e -> log.error("An ERROR occurred while creating a question: " + e.getMessage()));
}
Now my requirement is to post all the questions to the CRUD service in parallel and merge the results on completion. Also note that the CRUD service accepts only one question object at a time, and that cannot be changed. I know that I could use the Observable.zip operator for this purpose, but I have no idea how to apply it in this context, since the actual number of questions is not predetermined. How can I change the first snippet above to improve the performance of the application? Any help is appreciated.

By default the observables in flatMap operate on the same scheduler that you subscribed on. In order to run your createQuestion observables in parallel, you have to subscribe each of them on its own worker. Since createQuestion performs a blocking network call, Schedulers.io() (backed by a growing thread pool) is a better fit than the computation scheduler, which is sized to the number of CPU cores:
questions.flatMapIterable(list -> list)
         .flatMap(q -> createQuestion(q).subscribeOn(Schedulers.io()))
         .toList();
Check this article for a full explanation.
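One caveat: subscribing every createQuestion on its own worker fires all the POSTs at once, which can overwhelm the CRUD service. The flatMap overload that takes a maxConcurrent argument (available in RxJava 1.x) caps the parallelism. A minimal sketch, assuming a limit of 10 is acceptable for your service:
questions.flatMapIterable(list -> list)
         .flatMap(q -> createQuestion(q).subscribeOn(Schedulers.io()), 10) // at most 10 concurrent POSTs
         .toList();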

Related

Persisting state into Kafka using Kafka Streams

I am trying to wrap my head around Kafka Streams and have some fundamental questions that I can't seem to figure out on my own. I understand the concept of a KTable and Kafka state stores, but am having trouble deciding how to approach this. I am also using Spring Cloud Stream, which adds another level of complexity on top of this.
My use case:
I have a rule engine that reads in a Kafka event, processes the event, returns a list of rules that matched and writes it into another topic. This is what I have so far:
@Bean
public Function<KStream<String, ProcessNode>, KStream<String, List<IndicatorEvaluation>>> process() {
    return input -> input.mapValues(this::analyze).filter((host, evaluation) -> evaluation != null);
}

public List<IndicatorEvaluation> analyze(final String host, final ProcessNode process) {
    // Does stuff
}
Some of the stateful rules look like:
[some condition] REPEATS 5 TIMES WITHIN 1 MINUTE
[some condition] FOLLOWEDBY [some condition] WITHIN 1 MINUTE
[rule A exists and rule B exists]
My current implementation is storing all this information in memory to be able to perform the analysis. For obvious reasons, it is not easily scalable. So I figured I would persist this into a Kafka State Store.
I am unsure of the best way to go about it. I know there is a way to create custom state stores that allow for a higher level of flexibility. I'm not sure if the Kafka DSL will support this.
Still new to Kafka Streams and wouldn't mind hearing a variety of suggestions.
From the description you have given, I believe this use case can still be implemented using the DSL in Kafka Streams. The code you have shown above does not track any state. In your topology, you need to add state by tracking the counts of the rules and storing them in a state store. Then you only need to send the output rules when that count hits a threshold. Here is the general idea as pseudo-code. Obviously, you will have to tweak this to satisfy the particular specifications of your use case.
@Bean
public Function<KStream<String, ProcessNode>, KStream<String, List<IndicatorEvaluation>>> process() {
    return input -> input
        .mapValues(this::analyze)
        .filter((host, evaluation) -> evaluation != null)
        ...
        .groupByKey(...)
        .windowedBy(TimeWindows.of(Duration.ofHours(1)))
        .count(Materialized.as("rules"))
        .filter((key, value) -> value > 4)
        .toStream()
        ....
}
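If the windowed count turns out to be too restrictive for rules like FOLLOWEDBY, the DSL also supports custom state via transformValues plus a registered state store, so you still would not need to drop down to the raw Processor API. A minimal sketch of that wiring, where streamsBuilder and input come from your surrounding topology code, and RuleState and ruleStateSerde are hypothetical placeholders for your own state class and its serde:
// Register a persistent key-value store with the topology.
StoreBuilder<KeyValueStore<String, RuleState>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("rule-state"),
                Serdes.String(),
                ruleStateSerde);
streamsBuilder.addStateStore(storeBuilder);

// Attach the store to a transformer that tracks rule matches per host.
KStream<String, List<IndicatorEvaluation>> matches = input.transformValues(
        () -> new ValueTransformerWithKey<String, ProcessNode, List<IndicatorEvaluation>>() {
            private KeyValueStore<String, RuleState> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, RuleState>) context.getStateStore("rule-state");
            }

            @Override
            public List<IndicatorEvaluation> transform(String host, ProcessNode node) {
                RuleState state = store.get(host);   // previous state for this host, may be null
                // ... evaluate the stateful rules against node + state ...
                store.put(host, state);              // persist the updated state
                return null;                         // or the list of rules that matched
            }

            @Override
            public void close() { }
        },
        "rule-state");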

How to dynamically update an RX Observable?

(Working in RxKotlin and RxJava, but using metacode for simplicity)
Many Reactive Extensions guides begin by creating an Observable from already available data. In The introduction to Reactive Programming you've been missing, it is created from a single string:
var sourceStream = Rx.Observable.just('https://api.github.com/users');
Similarly, the front page of RxKotlin creates one from a populated list:
val list = listOf(1,2,3,4,5)
list.toObservable()
Now consider a simple filter that yields an outStream,
var outStream = sourceStream.filter({x > 3})
In both guides the source events are declared a priori, which means the timeline of events has the form
source: ----1,2,3,4,5-------
out:    --------------4,5---
How can I modify sourceStream to be more of a pipeline? In other words, what if no input data is available when sourceStream is created? When a source event becomes available, it should be immediately processed by out:
source: ---1--2--3-4---5-------
out:    ------------4---5-------
I expected to find an Observable.add() for dynamic updates:
var sourceStream = Observable.empty()
var outStream = sourceStream.filter({x > 3})

// print each element as it's added
sourceStream.subscribe({println(it)})
outStream.subscribe({println(it)})

for i in range(5):
    sourceStream.add(i)
Is this possible?
I'm new, but how could I solve my problem without a subject? If I'm testing an application, and I want it to "pop" an update every 5 seconds, how else can I do it other than this publish-subscribe business? Can someone post an answer to this question that doesn't involve a Subject?
If you want to pop an update every five seconds, then create an Observable with the interval operator; don't use a Subject. There are a few dozen operators for constructing Observables, so you rarely need a Subject.
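For example, a minimal RxJava 1.x sketch of the five-second case with interval, no Subject involved:
import java.util.concurrent.TimeUnit;
import rx.Observable;

Observable<Long> sourceStream = Observable.interval(5, TimeUnit.SECONDS); // emits 0, 1, 2, ... every 5 seconds
Observable<Long> outStream = sourceStream.filter(x -> x > 3);             // same filter as in the question
outStream.subscribe(x -> System.out.println(x));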
That said, sometimes you do need one, and they come in very handy when testing code. I use them extensively in unit tests.
To Use Subject Or Not To Use Subject? is an excellent article on the subject of Subjects.
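And for the test-code case where you really do want to push events by hand, a Subject is exactly the missing Observable.add(): onNext plays that role. A minimal RxJava 1.x sketch:
import rx.Observable;
import rx.subjects.PublishSubject;

PublishSubject<Integer> sourceStream = PublishSubject.create();
Observable<Integer> outStream = sourceStream.filter(x -> x > 3);

sourceStream.subscribe(x -> System.out.println("source: " + x));
outStream.subscribe(x -> System.out.println("out: " + x));

// the dynamic updates the question asked about
for (int i = 0; i < 5; i++) {
    sourceStream.onNext(i);
}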

crawler4j asynchronously saving results to file

I'm evaluating crawler4j for ~1M crawls per day.
My scenario is this: I'm fetching a URL and parsing its description, keywords, and title; now I would like to save each URL and its words into a single file.
I've seen how it's possible to save crawled data to files. However, since I have many crawls to perform, I want different threads to perform the save operation on the file system (so as not to block the fetcher threads). Is that possible with crawler4j? If so, how?
Thanks
Consider using a Queue (BlockingQueue or similar) where you put the data to be written and which is then drained by one or more worker Threads (this approach is nothing crawler4j-specific). Search for "producer consumer" to get some general ideas.
Concerning your follow-up question on how to pass the Queue to the crawler instances, this should do the trick (based only on a look at the source code; I haven't used crawler4j myself):
final BlockingQueue<Data> queue = …

// use a factory instead of supplying the crawler type, to pass the queue
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(queue);
    }
}, numberOfCrawlers);
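On the consuming side, a minimal sketch of a worker thread that drains the queue and appends to a single file; Data is the same placeholder type as above, and the output file name is an assumption:
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

Runnable writerTask = () -> {
    try (BufferedWriter out = Files.newBufferedWriter(
            Paths.get("crawl-results.txt"),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
        while (!Thread.currentThread().isInterrupted()) {
            Data data = queue.take();       // blocks until a crawler offers data
            out.write(data.toString());     // assumes Data has a useful toString()
            out.newLine();
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // shut down cleanly
    } catch (IOException e) {
        e.printStackTrace();
    }
};
new Thread(writerTask, "file-writer").start();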

Passing outputs between spring batch steps [duplicate]

This question already has answers here:
How can we share data between the different steps of a Job in Spring Batch?
(12 answers)
Closed 3 years ago.
I have two business logic steps:
1. download XML from an external resource, parse it, and transform it into objects
2. dispatch the output (the object list) to an external queue
@Bean
public Job job() throws Exception {
    return this.jobs.get("job").start(getXmlViaHttpStep()).next(pushMessageToQueue()).build();
}
So my first step is a Tasklet which downloads the file (via HTTP) and converts it into objects.
My second step is another Tasklet that is supposed to dispatch the output of the previous step.
Now how do I pass the output list from step 1 into step 2 (as its input)?
I could save it to a temp file, but isn't there a better practice for this?
I can see at least two options that are both viable.
Option 1: set up the job as one step
You can set up your job to contain one step where the reader simply reads the input from your URL and the writer posts to your queue.
Option 2: set up the job as two steps with intermediate storage
However, you may want to divide the job into two steps to be able to re-run a step if it fails, to simplify debugging, etc. In that case, the following approach may work for you:
Step 1: Create a step in which a FlatFileItemReader or similar downloads the file, and configure a FlatFileItemWriter to write the contents to disk.
Step 2: Open the file produced by the ItemWriter in the previous step. One alternative is to use the org.springframework.batch.item.xml.StaxEventItemReader together with a Jaxb2Marshaller to handle the processing (as described in this blog). Configure the output step to post messages to a queue by using e.g. org.springframework.batch.item.jms.JmsItemWriter. The writer is (as always) chunked, so multiple messages can be posted per write.
Personally, I would probably set the whole thing up as Option 2. I find simple steps without too many transformations easier to follow and also easier to test, but that is just a matter of taste.
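To make Option 2 concrete, here is a minimal sketch of the second step's wiring; Item stands in for your object type, and the file path, fragment element name, and chunk size are assumptions you would adapt:
@Bean
public StaxEventItemReader<Item> xmlReader() {
    Jaxb2Marshaller marshaller = new Jaxb2Marshaller();
    marshaller.setClassesToBeBound(Item.class);           // Item must be JAXB-annotated

    StaxEventItemReader<Item> reader = new StaxEventItemReader<>();
    reader.setResource(new FileSystemResource("/tmp/step1-output.xml")); // file written in step 1
    reader.setFragmentRootElementName("item");            // XML element wrapping one object
    reader.setUnmarshaller(marshaller);
    return reader;
}

@Bean
public JmsItemWriter<Item> queueWriter(JmsTemplate jmsTemplate) {
    JmsItemWriter<Item> writer = new JmsItemWriter<>();
    writer.setJmsTemplate(jmsTemplate);                   // the template's default destination is the target queue
    return writer;
}

@Bean
public Step pushMessageToQueue(StepBuilderFactory steps,
                               StaxEventItemReader<Item> xmlReader,
                               JmsItemWriter<Item> queueWriter) {
    return steps.get("pushMessageToQueue")
            .<Item, Item>chunk(50)                        // posts up to 50 messages per transaction
            .reader(xmlReader)
            .writer(queueWriter)
            .build();
}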

Sequential execution of async operations in Android

Sequential execution of asynchronous operations in Android is, to say the least, complicated.
Sequential execution, which used to be just a semicolon between two statements, as in do_this(); do_that(), now requires chaining listeners, which is ugly and barely readable.
Oddly enough, the examples that demonstrate the need for chaining sequential operations usually look contrived, but today I found a perfectly reasonable one.
In Android there is in-app billing; an application can support multiple so-called in-app products (also known as SKUs, stock keeping units), letting the user, for example, buy (pay for) only the functionality that he/she needs (and, alas, also letting bearded men sell bitmaps to teenagers).
The function that retrieves in-app product info is
public void queryInventoryAsync(final boolean querySkuDetails,
                                final List<String> moreSkus,
                                final QueryInventoryFinishedListener listener)
and it has a restriction that the list must contain at most 20 items. (Yes it does.)
Even if only a few of these 20 are registered as in-app products.
I want to retrieve, say, information about one hundred in-app products. The first thought would be to invoke this function in a loop, but only one asynchronous operation with the market is allowed at any moment.
One may of course say "do not reuse, change the source", and even provide very good arguments for that, and this is probably what I will finally do, but I write this because I want to see an elegant reuse solution.
Is there an elegant (= not cumbersome) pattern or trick that allows chaining several asynchronous operations in the general case?
(I underline that the asynchronous operation that uses a listener is pre-existing code.)
UPD: this is what is called "callback hell" (http://callbackhell.com/) in the JavaScript world.
You can sequence AsyncTasks one after the other by calling the execute() method of the next AsyncTask in the onPostExecute() method of the previous one.
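A minimal sketch of that chaining, using the question's do_this()/do_that() placeholders and assuming do_this() produces a result that do_that() consumes; FirstTask and SecondTask are hypothetical names:
class FirstTask extends AsyncTask<Void, Void, String> {
    @Override
    protected String doInBackground(Void... params) {
        return do_this();                     // first async operation
    }

    @Override
    protected void onPostExecute(String result) {
        new SecondTask().execute(result);     // start the next task only when this one is done
    }
}

class SecondTask extends AsyncTask<String, Void, Void> {
    @Override
    protected Void doInBackground(String... params) {
        do_that(params[0]);                   // second async operation
        return null;
    }
}

// kick off the chain:
new FirstTask().execute();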
Handlers are useful for sequential work on any thread, not only on the UI thread.
Check out HandlerThread, create a Handler based on its Looper, and post background work to the handler.
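A minimal sketch, again with the do_this()/do_that() placeholders; the posted Runnables run in order because the HandlerThread has a single Looper:
HandlerThread handlerThread = new HandlerThread("background-worker");
handlerThread.start();
Handler handler = new Handler(handlerThread.getLooper());

handler.post(() -> do_this());   // runs first, off the UI thread
handler.post(() -> do_that());   // runs only after do_this() has returned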
It looks like ReactiveX promises exactly this.
http://blog.danlew.net/2014/09/22/grokking-rxjava-part-2/
query("Hello, world!") // Returns a List of website URLs based on a text search
.flatMap(urls -> Observable.from(urls))
.flatMap(url -> getTitle(url)) // long operation
.filter(title -> title != null)
.subscribe(title -> System.out.println(title));
ReactiveX for Android:
https://github.com/ReactiveX/RxAndroid
Retrolambda: https://github.com/orfjackal/retrolambda (Lambdas for Java 5,6,7)
