RxJava vs Java 8 Parallelism Stream - java

What are the similarities and differences between them? It looks like Java parallel streams have some of the elements available in RxJava; is that right?

Rx is an API for creating and processing observable sequences. The Streams API is for processing iterable sequences. Rx sequences are push-based; you are notified when an element is available. A Stream is pull-based; it "asks" for items to process. They may appear similar because they both support similar operators/transforms, but the mechanics are essentially opposites of each other.
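To make the pull/push distinction above concrete, here is a minimal sketch using only the JDK. The `PushSource` class is a hypothetical stand-in for an observable, not RxJava's API; it only illustrates who drives the flow of items.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class PushVsPull {
    // Pull: the consumer drives the loop and asks the source for the next item.
    static List<String> pull(Iterator<String> source) {
        List<String> seen = new ArrayList<>();
        while (source.hasNext()) {
            seen.add("pulled " + source.next());
        }
        return seen;
    }

    // Push: a hypothetical minimal source that notifies listeners when an item arrives.
    static class PushSource {
        private final List<Consumer<String>> listeners = new ArrayList<>();

        void subscribe(Consumer<String> listener) {
            listeners.add(listener);
        }

        void emit(String item) {
            for (Consumer<String> listener : listeners) {
                listener.accept(item);
            }
        }
    }

    // The consumer only registers a callback; the source decides when it runs.
    static List<String> push() {
        List<String> seen = new ArrayList<>();
        PushSource source = new PushSource();
        source.subscribe(item -> seen.add("pushed " + item));
        source.emit("a");
        source.emit("b");
        return seen;
    }
}
```

In the pull version the caller's loop decides when the next element is read; in the push version the caller has no loop at all and simply reacts.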

Stream is pull-based. Personally, I feel it is Oracle's answer to C#'s IEnumerable<>, LINQ, and their related extension methods.
RxJava is push-based. I am not sure whether .NET's Reactive Extensions or the Rx project itself went live first.
Conceptually they are totally different and their applications are also different.
If you are implementing a text search over a file so large that it cannot be loaded into memory, you would probably want to use Stream, since you can easily tell whether more lines are available by keeping track of your iterator, and scan line by line.
Another application of Stream is parallel computation on a collection of data. Nowadays every machine has multiple cores, but you cannot easily know how many cores your client's machine has, so it is hard to pre-configure the number of threads to use. Instead we use a parallel stream and let the JVM determine that for us (which is supposed to be closer to optimal).
On the other hand, if you are implementing a program that takes a user's input string and searches for available videos on the web, you would use Rx, since you do not know when the program will start getting results (or receive a network timeout error). To keep your program responsive, you let it "subscribe" to network updates and completion signals.
Another common application of Rx is in GUIs, to detect that the user has finished typing without requiring a button click to confirm. For example, you have a text field and want to start searching whenever the user stops typing, without waiting for a click on a "Search" button. In this case you use Rx to create an observable on "KeyEvent" and "throttle" it (e.g. at 500 ms; in RxJava this operator is called debounce), so that whenever the user has stopped typing for 500 ms you receive an onNext() to start searching.
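The "fire only after 500 ms of silence" rule from the search-box example can be sketched without any Rx dependency by applying it to already recorded timestamps. This is an illustration of the rule only, not how RxJava implements it; the method name, the arrays, and the assumption that silence follows the last event are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class DebounceSketch {
    // Returns the inputs that were followed by at least quietMillis without another event.
    // Assumption: the final event is followed by silence, so it always fires.
    static List<String> debounce(long[] timesMillis, String[] texts, long quietMillis) {
        List<String> fired = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            boolean isLast = i == texts.length - 1;
            if (isLast || timesMillis[i + 1] - timesMillis[i] >= quietMillis) {
                fired.add(texts[i]);
            }
        }
        return fired;
    }
}
```

With key presses at 0, 100, 200, 900 and 1000 ms, only the text present at 200 ms (followed by a 700 ms pause) and the final text would trigger a search.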

There is also a difference in threading.
Stream#parallel splits the sequence into parts, and each part is processed on a separate thread.
Observable#subscribeOn and Observable#observeOn both move execution to another thread, but they don't split the sequence.
In other words, for any particular processing stage:
a parallel Stream may process different elements on different threads;
an Observable will use one thread for the stage.
E.g., suppose we have an Observable/Stream of many elements and two processing stages:
Observable.create(...)
    .observeOn(Schedulers.io())
    .map(x -> stage1(x))
    .observeOn(Schedulers.io())
    .map(y -> stage2(y))
    .forEach(...);

Stream.generate(...)
    .parallel()
    .map(x -> stage1(x))
    .map(y -> stage2(y))
    .forEach(...);
The Observable will use no more than 2 additional threads (one per stage), so within a stage no two x's (or y's) are handled by different threads. The Stream, on the contrary, may spread each stage across several threads.
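The threading difference can be observed with the JDK alone by recording which threads touch the elements of a stage. In this sketch a single-thread executor stands in for the one worker that observeOn gives a stage; that substitution is an assumption made to avoid an RxJava dependency, and the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

class StageThreads {
    // Parallel stream: elements of one stage may be processed by several worker threads.
    static Set<String> parallelStageThreads(int n) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, n)
                .parallel()
                .map(x -> {
                    names.add(Thread.currentThread().getName());
                    return x * 2;
                })
                .sum();
        return names;
    }

    // Stand-in for an observeOn stage: every element is handed to the same single thread.
    static Set<String> singleThreadStage(int n) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        ExecutorService stage = Executors.newSingleThreadExecutor();
        for (int i = 0; i < n; i++) {
            stage.execute(() -> names.add(Thread.currentThread().getName()));
        }
        stage.shutdown();
        try {
            stage.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return names;
    }
}
```

On a multi-core machine the first method typically reports several thread names (how many depends on the common pool), while the second always reports exactly one.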

Related

Let a queue build up to a certain amount before processing

So let me give you an idea of what I'm trying to do:
I've got a program that records statistics, lots and lots of them, but it records them as they happen one at a time and puts them into an ArrayList, for example:
Please note this is an example, I'm not recording these stats, I'm just simplifying it a bit
User clicks -> Add user_click to array
User clicks -> Add user_click to array
Key press -> Add key_press to array
After each event (clicks, key presses, etc.) it checks the size of the ArrayList; if it is > 150, the following happens:
A new thread is created
That thread is given a copy of the ArrayList
The original ArrayList is .clear()'ed
The new thread combines similar items so user_click would now be one item with a quantity of 2, instead of 2 items with a quantity of 1 each
The thread processes the data to a MySQL db
I would love to find a better approach to this, although this works just fine. The issue with thread pools and processing immediately is that there would be literally thousands of MySQL queries per day without combining them first.
Is there a better way to accomplish this? Is my method okay?
The other thing to keep in mind is the thread where events are fired and recorded can't be slowed down so I don't really want to combine items in the main thread.
If you've got code examples, that would be great; if not, just an idea of a good way to do this would be awesome as well!
For anyone interested, this project is hosted on GitHub, the main thread is here, the queue processor is here, and please forgive my poor naming conventions and general code cleanliness; I'm still (always) learning!
The logic described seems pretty good, with two adjustments:
1. Don't copy the list and clear the original. Send the original and create a new list for future events. This eliminates the O(n) cost of copying the entries.
2. Don't create a new thread each time. Events are delayed anyway, since you're collecting them, so timeliness of writing to the database is not your major concern. Two choices:
   a. Start a single thread up front, then use a BlockingQueue to send each list from thread 1 to thread 2. If thread 2 is falling behind, the lists will simply accumulate in the queue until thread 2 can catch up, without delaying thread 1 and without overloading the system with too many threads.
   b. Submit the job to a thread pool, e.g. using an Executor. This would allow multiple (but a limited number of) threads to process the lists, in case processing is slower than event generation. The disadvantage is that events may be written out of order.
For separation of concerns and reusability, you should encapsulate the logic of collecting events and sending them to the thread in blocks for processing in a separate class, rather than embedding that logic in the event-generation code.
That way you can easily add extra features, e.g. a timeout for flushing pending events before reaching the normal threshold (150), so events don't sit there too long if event generation slows down.
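A minimal sketch of such a class, assuming events are plain strings and the hand-off uses a BlockingQueue (the single-consumer option above). The class name, method names, and the configurable threshold are hypothetical; the database write is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class EventBatcher {
    private final int threshold;
    private final BlockingQueue<List<String>> handoff = new LinkedBlockingQueue<>();
    private List<String> current = new ArrayList<>();

    EventBatcher(int threshold) {
        this.threshold = threshold;
    }

    // Called on the event thread: appends, and swaps lists when the batch is full.
    synchronized void record(String event) {
        current.add(event);
        if (current.size() >= threshold) {
            handoff.add(current);        // hand over the full list as-is, no copy
            current = new ArrayList<>(); // start a fresh list for future events
        }
    }

    // Called on the worker thread: waits up to timeoutMillis for a batch and
    // combines similar items into counts. Returns an empty map if none arrived.
    Map<String, Integer> pollAndCombine(long timeoutMillis) {
        try {
            List<String> batch = handoff.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (batch == null) {
                return Collections.emptyMap();
            }
            Map<String, Integer> counts = new HashMap<>();
            for (String event : batch) {
                counts.merge(event, 1, Integer::sum);
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Collections.emptyMap();
        }
    }
}
```

A worker thread would call pollAndCombine in a loop and write each non-empty map to the database in one statement. A time-based flush could be added by also handing over a non-empty `current` list when the poll times out.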

JVM: is it possible to manipulate frame stack?

Suppose I need to execute N tasks in the same thread. The tasks may sometimes need some values from an external storage. I have no idea in advance which task may need such a value and when. It is much faster to fetch M values in one go rather than the same M values in M queries to the external storage.
Note that I cannot expect cooperation from the tasks themselves; they can be considered nothing more than java.lang.Runnable objects.
Now, the ideal procedure, as I see it, would look like
1. Execute all tasks in a loop. If a task requests an external value, remember this, suspend the task and switch to the next one.
2. Fetch the values requested at the previous step, all at once.
3. Remove all completed tasks (suspended ones don't count as completed).
4. If there are still tasks left, go to step 1, but instead of executing a task, continue its execution from the suspended state.
As far as I see, the only way to "suspend" and "resume" something would be to remove its related frames from JVM stack, store them somewhere, and later push them back onto the stack and let JVM continue.
Is there any standard (not involving hacking at lower level than JVM bytecode) way to do this?
Or can you maybe suggest another possible way to achieve this (other than starting N threads or making tasks cooperate in some way)?
It's possible using something like Quasar, which does stack-slicing via an agent. Some degree of cooperation from the tasks is helpful, but it is possible to use AOP to insert suspension points from outside.
(IMO it's better to be explicit about what's going on (using e.g. Future and ForkJoinPool). If some plain code runs on one thread for a while and is then "magically" suspended and jumps to another thread, this can be very confusing to debug or reason about. With modern languages and libraries the overhead of being explicit about the asynchronicity boundaries should not be overwhelming. If your tasks are written in terms of generic types then it's fairly easy to pass-through something like scalaz Future. But that wouldn't meet your requirements as given).
As mentioned, Quasar does exactly that (it usually schedules N fibers on M threads, but you can set M to 1), using bytecode transformations. It even gives each task (AKA "fiber") its own stack trace, so you can dump it and get a complete stack trace without any interference from any other task sharing the thread.
Well, you could try this. You need:
1. A mechanism to save the current state of the task, because when the task returns, its frame is popped from the call stack. Based on the return value or something similar, you can determine whether it completed or not. Since you need to re-execute it from the point where it left off, you need to preserve the state information.
2. A request data structure (DS) for each task. Whenever a task wants to request something, it logs the request there. The data structure should support all the possible requests a task can make.
3. A Map holding these DS. At the end of the loop you can query the DS to determine the kind of resource required by each task.
4. A step that gets the resource, puts it in the DS, and restarts the task from the state in which it returned.
5. The task then queries the DS and gets the resource.
6. The task should use this DS whenever it wants to use an external resource.
You would need to design the method through which a resource is requested with special care, since when you re-execute the task you would need to call this method yourself so that the task can continue from where it left off.
*DS -> Data Structure
Hope it helps.
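The request-logging scheme above can be sketched as a small scheduler loop. Note that this requires the tasks to cooperate, which the question rules out, so it only illustrates this answer's idea. The `Task` interface, the `run` method, and the `bulkFetch` function are hypothetical names; a task is re-invoked with the values fetched so far rather than truly resumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class BatchingScheduler {
    // Hypothetical cooperative task: returns null when finished,
    // or the key of the external value it still needs.
    interface Task {
        String step(Map<String, String> fetched);
    }

    // Runs all tasks on the calling thread and fetches requested values in bulk.
    // Returns the number of bulk fetches performed.
    // Assumption: bulkFetch returns a value for every requested key,
    // otherwise a task would keep asking and the loop would not terminate.
    static int run(List<Task> tasks, Function<Set<String>, Map<String, String>> bulkFetch) {
        Map<String, String> fetched = new HashMap<>();
        List<Task> pending = new ArrayList<>(tasks);
        int fetches = 0;
        while (!pending.isEmpty()) {
            Set<String> wanted = new LinkedHashSet<>();
            List<Task> suspended = new ArrayList<>();
            for (Task task : pending) {
                String key = task.step(fetched);
                if (key != null) {       // task is "suspended" waiting for key
                    wanted.add(key);
                    suspended.add(task);
                }
            }
            if (!wanted.isEmpty()) {
                fetched.putAll(bulkFetch.apply(wanted)); // one query for all keys
                fetches++;
            }
            pending = suspended;         // completed tasks are dropped
        }
        return fetches;
    }
}
```

Two tasks that each need a different value would be served by a single bulk fetch instead of two separate queries.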

Time Based Streaming

I am trying to figure out how to get time-based streaming but on an infinite stream. The reason is pretty simple: Web Service call latency results per unit time.
But, that would mean I would have to terminate the stream (as I currently understand it) and that's not what I want.
In words: If 10 WS calls came in during a 1 minute interval, I want a list/stream of their latency results (in order) passed to stream processing. But obviously, I hope to get more WS calls at which time I would want to invoke the processors again.
I could totally be misunderstanding this. I had thought of using Collectors.groupingBy(x -> someTimeGrouping), so all calls are grouped by whatever measurement interval I chose. But then no code will be aware of this until I call a closing function, at which point the monitoring process is done.
Just trying to learn Java 8 by applying it to previous code.
By definition and construction a stream can only be consumed once, so if you send your results to an infinite stream, you will not be able to access them more than once. Based on your description, it looks like it would make more sense to store the latency results in a collection, say an ArrayList, and when you need to analyse the data use the stream functionality to group them.
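As a sketch of that suggestion, assuming each recorded call carries a timestamp and a latency in milliseconds (the `Call` class and method name are hypothetical), the stored results can be grouped per one-minute interval with Collectors.groupingBy whenever an analysis is needed:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class LatencyWindows {
    // Hypothetical record of one web-service call.
    static final class Call {
        final long timestampMillis;
        final long latencyMillis;

        Call(long timestampMillis, long latencyMillis) {
            this.timestampMillis = timestampMillis;
            this.latencyMillis = latencyMillis;
        }
    }

    // Groups latencies by minute index (timestamp / 60 000), keeping arrival order
    // inside each group; TreeMap keeps the minutes sorted.
    static Map<Long, List<Long>> latenciesPerMinute(List<Call> calls) {
        return calls.stream().collect(Collectors.groupingBy(
                (Call c) -> c.timestampMillis / 60_000,
                TreeMap::new,
                Collectors.mapping((Call c) -> c.latencyMillis, Collectors.toList())));
    }
}
```

Because the collection persists, this can be called repeatedly as more calls arrive; each call creates a fresh stream over the stored data.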

Understanding NodeJS & Non-Blocking IO

So, I've recently been injected with the Node virus which is spreading in the Programming world very fast.
I am fascinated by its "Non-Blocking IO" approach and have indeed tried out a couple of programs myself.
However, I fail to understand certain concepts at the moment.
I need answers in layman terms (someone coming from a Java background)
1. Multithreading & Non-Blocking IO.
Let's consider a practical scenario. Say, we have a website where users can register. Below would be the code.
..
..
// Read HTTP Parameters
// Do some Database work
// Do some file work
// Return a confirmation message
..
..
In a traditional programming language, the above happens in a sequential way. And, if there are multiple requests for registration, the web server creates a new thread and the rest is history. Of course, programmers can create threads of their own to work on Line 2 and Line 3 simultaneously.
In Node, as I understand, Lines 2 & 3 will be run in parallel while the rest of the program gets executed and the Interpreter polls the lines 2 & 3 every 'x' ms.
Now, my question is, if Node is a single threaded language, what does the job of lines 2 & 3 while the rest of the program is being executed?
2. Scalability
I recently read that LinkedIn has adopted Node as a back-end for their mobile apps and has seen massive improvements.
Can anyone explain how it has made such a difference?
3. Adoption in other programming languages
If people are claiming that Node makes a lot of difference when it comes to performance, why haven't other programming languages adopted this Non-Blocking IO paradigm?
I'm sure I'm missing something. It would be helpful if you could explain this to me and guide me with some links.
Thanks.
A similar question was asked and probably contains all the info you're looking for: How the single threaded non blocking IO model works in Node.js
But I'll briefly cover your 3 parts:
1.
Lines 2 and 3 in a very simple form could look like:
db.query(..., function(query_data) { ... });
fs.readFile('/path/to/file', function(file_data) { ... });
Now the function(query_data) and function(file_data) are callbacks. The functions db.query and fs.readFile will send the actual I/O requests but the callbacks allow the processing of the data from the database or the file to be delayed until the responses are received. It doesn't really "poll lines 2 and 3". The callbacks are added to an event loop and associated with some file descriptors for their respective I/O events. It then polls the file descriptors to see if they are ready to perform I/O. If they are, it executes the callback functions with the I/O data.
I think the phrase "Everything runs in parallel except your code" sums it up well. For example, something like "Read HTTP parameters" would execute sequentially, but I/O functions like in lines 2 and 3 are associated with callbacks that are added to the event loop and execute later. So basically the whole point is it doesn't have to wait for I/O.
2.
Because of the things explained in 1., Node scales well for I/O intensive requests and allows many users to be connected simultaneously. It is single threaded, so it doesn't necessarily scale well for CPU intensive tasks.
3.
This paradigm has been used with JavaScript because JavaScript has support for callbacks, event loops and closures that make this easy. This isn't necessarily true in other languages.
I might be a little off, but this is the gist of what's happening.
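For readers coming from Java, the callback mechanism described in part 1 can be approximated with a small single-threaded loop. This is only a sketch of the idea and not how Node is implemented: a thread pool stands in for the operating system's I/O layer, and all class and method names are hypothetical. An operation that throws would leave the loop waiting, which a real implementation would have to handle.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

class MiniEventLoop {
    private final BlockingQueue<Runnable> ready = new LinkedBlockingQueue<>();
    private final ExecutorService io = Executors.newCachedThreadPool(); // stand-in for the I/O layer
    private int inFlight = 0; // only touched by the loop thread

    // Starts a simulated I/O operation; its callback is queued for the loop thread.
    <T> void async(Supplier<T> operation, Consumer<T> callback) {
        inFlight++;
        io.execute(() -> {
            T result = operation.get();
            ready.add(() -> {
                inFlight--;
                callback.accept(result);
            });
        });
    }

    // Runs the program, then executes callbacks one at a time on the calling
    // thread until no operation is pending.
    void run(Runnable program) {
        program.run();
        try {
            while (inFlight > 0) {
                ready.take().run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            io.shutdown();
        }
    }
}
```

The program passed to run returns immediately after starting its I/O, and the callbacks run later on the same thread, one after another, which mirrors "everything runs in parallel except your code".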
Q1. " what does the job of lines 2 & 3 while the rest of the program is being executed?"
Answer: "Nothing". Lines 2 and 3 each start their respective jobs, but those jobs cannot be done immediately because (for example) the disk sectors required are not loaded yet. So the operating system issues a call to the disk to go get those sectors, then "nothing happens" (Node goes on with its next task) until the disk subsystem (later) issues an interrupt to report they're ready, at which point Node runs the callbacks registered by lines 2 and 3.
Q2. Single-threaded non-blocking I/O dedicates almost no resources to each incoming connection (just some housekeeping data about the connected socket). It's very memory efficient. Traditional process-per-connection web servers "fork" a whole new process to handle each new connection, which means making a humongous copy of every bit of code and data needed, and time-slicing the CPU to deal with it all. That's massively wasteful of resources. Thus, if your load is a lot of idle connections waiting for stuff, as was theirs, Node makes loads more sense.
Q3. Almost every programming language already has non-blocking I/O if you want to use it. Node is not a programming language; it's a runtime that runs JavaScript and uses non-blocking I/O (e.g. I personally wrote my own equivalent web server 10 years ago in Perl, as did Google (in C) when they started, and I'm sure loads of other people have similar web servers too). The non-blocking I/O is not the hard part; getting the programmer to understand how to use it is the tricky bit. JavaScript happens to work well for that, because those programmers are already familiar with event programming.
Even though node.js has been around for a few years, its performance model is still a bit mysterious.
I recently started a blog and decided that the node.js model would be a good first topic since I wanted to understand it better myself and it would be helpful to others to share what I learned. Here are a couple of articles I wrote that explain the high level concepts and some tradeoffs:
Blocking vs. Non-Blocking I/O – What’s going on?
Understanding node.js Performance

Java simple Analytics/Event Stream Processing with front end

My application takes a lot of measurements of its internal processes. For example, I time certain methods, I time external web service calls, and I also have variables which have a changing value, and processes which have a 'state' (e.g. PAUSED, WAITING, etc.).
The application uses 100 to 200 threads, and each bit of data would be associated with a particular thread.
I am looking for some software that I can channel all this information into that would produce useful metrics and graphs of the data (ideally in real time or close to real time), let me set thresholds to trigger warnings, would allow me to filter the data by thread or thread group, etc etc.
The application is performing time critical tasks so the software/api would need to be very fast and never block.
The application is written in java, and ideally the software/api would be in java as well. I think what I'm looking for is called Event Stream Processing, but I'm really not sure what language to use to describe it.
All I've found so far are Esper and ERMA. Can anyone give me a recommendation? I'm the only one working on this project so I'm hoping for something that is pretty easy to set up and use, and has a workable front end.
In the end I found Graphite, which was pretty close to being exactly what I wanted. It is not the simplest to set up and configure, but I got it working in the end.
http://graphite.wikidot.com/
In my case I send data directly from my application to StatsD (via UDP), which collects the data and does some pre-processing before it ends up in the Whisper back end. There is a simple example of a Java interface here: https://github.com/etsy/statsd/commit/2253223f3c19d2149d65ec5bc802198ff93da4cb
Alternatively, you could send your data directly to Graphite; there is an example here: http://neopatel.blogspot.co.uk/2011/04/logging-to-graphite-monitoring-tool.html
