Processing changing source data in Java Akka streams

Two threads are started. dataListUpdateThread adds the number 2 to a List. processFlowThread sums the values in the same List and prints the sum to the console. Here is the code:
import akka.NotUsed;
import akka.actor.ActorSystem;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ExecutionException;
import static java.lang.Thread.sleep;
public class SourceExample {
    private final static ActorSystem system = ActorSystem.create("SourceExample");
    private static void delayOneSecond() {
        try {
            sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    private static void printValue(CompletableFuture<Integer> integerCompletableFuture) {
        try {
            System.out.println("Sum is " + integerCompletableFuture.get().intValue());
        } catch (ExecutionException | InterruptedException e) {
            e.printStackTrace();
        }
    }
    public static void main(String[] args) {
        final List<Integer> dataList = new ArrayList<>();
        final Thread dataListUpdateThread = new Thread(() -> {
            while (true) {
                dataList.add(2);
                System.out.println(dataList);
                delayOneSecond();
            }
        });
        dataListUpdateThread.start();
        final Thread processFlowThread = new Thread(() -> {
            while (true) {
                final Source<Integer, NotUsed> source = Source.from(dataList);
                final Sink<Integer, CompletionStage<Integer>> sink =
                    Sink.fold(0, (agg, next) -> agg + next);
                final CompletionStage<Integer> sum = source.runWith(sink, system);
                printValue(sum.toCompletableFuture());
                delayOneSecond();
            }
        });
        processFlowThread.start();
    }
}
I've tried to create the simplest example to frame the question. dataListUpdateThread could be populating the List from a REST service or a Kafka topic instead of just adding the value 2 to the List. Instead of using Java threads, how should this scenario be implemented? In other words, how should dataList be shared with the Akka stream for processing?

Mutating the collection passed to Source.from will only ever accomplish this by coincidence: as soon as the collection is exhausted, Source.from completes the stream. That is because it is intended for finite, strictly evaluated data. The use cases are basically (a) simple examples for the docs and (b) situations where you want to bound resource consumption while performing an operation on each element of a collection in the background (think of a list of URLs that you want to send HTTP requests to).
NB: I haven't written Java to any great extent since the Java 7 days, so I'm not providing Java code, just an outline of approaches.
As mentioned in a prior answer, Source.queue is probably the best option (besides using something like Akka HTTP or an Alpakka connector). In a case such as this, the stream's materialized value is a future that won't be completed until the stream completes, and Source.queue will never complete the stream on its own (because there's no way for it to know that its reference is the only reference). Introducing a KillSwitch and propagating it through viaMat and toMat would give you the ability to complete the stream from outside the stream.
An alternative to Source.queue is Source.actorRef, which lets you complete the stream by sending a distinguished message (akka.Done.done() in the Java API is pretty common for this). That source materializes as an ActorRef to which you can tell messages, and those messages (at least those that match the type of the stream) will be available for the stream to consume.
With both Source.queue and Source.actorRef, it's often useful to prematerialize them. The alternative, in a situation like your example where you also want the materialized value of the sink, is to make heavy use of the Mat operators to customize materialized values. In Scala, tuples at least simplify combining multiple materialized values; in Java, once you get beyond a pair (as you would with queue), I'm pretty sure you'd have to define a class just to hold the three materialized values (queue, kill switch, and the future for the completed value).
It's also worth noting that, since Akka Streams run on actors in the background (and thus get scheduled as needed onto the ActorSystem's threads), there's almost never a reason to create a thread on which to run a stream.
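Since no Java code is given above, here is a minimal sketch of the Source.queue approach. It assumes Akka 2.6 (the version implied by source.runWith(sink, system) in the question) and has not been run against a specific release. The class name QueueSourceSketch, the buffer size of 16, and the five offers are illustrative choices. For brevity it ends the stream by calling complete() on the materialized queue instead of wiring in a KillSwitch, and it uses Keep.both() so that only a pair of materialized values is needed.

```java
import akka.actor.ActorSystem;
import akka.japi.Pair;
import akka.stream.OverflowStrategy;
import akka.stream.javadsl.Keep;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import akka.stream.javadsl.SourceQueueWithComplete;
import java.util.concurrent.CompletionStage;

public class QueueSourceSketch {
    public static void main(String[] args) throws Exception {
        final ActorSystem system = ActorSystem.create("QueueSourceSketch");

        final Sink<Integer, CompletionStage<Integer>> sink =
            Sink.fold(0, (agg, next) -> agg + next);

        // Materialize the stream once; keep both the queue (to push data in)
        // and the future sum (completed when the stream completes).
        final Pair<SourceQueueWithComplete<Integer>, CompletionStage<Integer>> materialized =
            Source.<Integer>queue(16, OverflowStrategy.backpressure())
                .toMat(sink, Keep.both())
                .run(system);

        final SourceQueueWithComplete<Integer> queue = materialized.first();

        // Stand-in for data arriving from a REST service or Kafka topic.
        for (int i = 0; i < 5; i++) {
            // Wait for each offer so backpressure is respected.
            queue.offer(2).toCompletableFuture().get();
        }

        // Tell the stream that no more elements will arrive.
        queue.complete();

        System.out.println("Sum is " + materialized.second().toCompletableFuture().get());
        system.terminate();
    }
}
```

Note that no thread is created to run the stream here; the only blocking calls are the get() calls on the main thread, which are there just to keep the sketch short.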

Related

Atomic copy-and-clear on Java collection

I know similar questions are often asked, but I could not find anything that would help me.
The situation is like this:
One worker is adding elements to a collection.
The second one waits for some time (until the elements mature) or for the collection to reach a certain size, and then starts its job.
The thing is: how do I copy the collection for the second worker (I think it's best to work on a copy) and then clear the original collection, making sure that nothing is lost (the first worker is writing all the time) while holding the lock on the original collection for as short a time as possible?
Thanks

This kind of thing will be far easier if you use the purpose-built concurrency tools like LinkedBlockingQueue rather than a plain HashSet. Have the producer add elements to the queue, and the consumer can use drainTo to extract elements from the queue in batches as it requires them. There's no need for any synchronization, as BlockingQueue implementations are designed to be threadsafe.
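A minimal sketch of that idea, using only the JDK. The class and method names (DrainToSketch, takeBatch) are made up for illustration, and the producer is joined before draining only to make the demo output predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class DrainToSketch {
    // Moves everything that is currently queued into a fresh list.
    // Elements added during or after the drain stay in the queue for the next batch.
    static <T> List<T> takeBatch(BlockingQueue<T> queue) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch);
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        Thread producer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                queue.add("item-" + i);
            }
        });
        producer.start();
        producer.join(); // only to make the demo deterministic

        System.out.println(takeBatch(queue)); // prints [item-0, item-1, item-2, item-3, item-4]
        System.out.println(queue.isEmpty());  // prints true
    }
}
```

The consumer never needs its own lock: the copy-and-clear step is the single drainTo call.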

Ian's LinkedBlockingQueue solution is the simplest.
For higher throughput (potentially at the cost of latency) in a single-producer, single-consumer scenario, you may want to consider the example in the java.util.concurrent.Exchanger documentation.
After swapping, you have the whole collection to yourself.
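A rough sketch of the swap, not the example from the Exchanger documentation itself. The names ExchangerSketch and swapOnce are hypothetical, and a real producer and consumer would loop and exchange repeatedly rather than once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Exchanger;

class ExchangerSketch {
    // Runs one producer that fills a buffer with 0..count-1 and swaps it
    // with the caller's empty buffer; returns what the caller received.
    static List<Integer> swapOnce(int count) throws InterruptedException {
        Exchanger<List<Integer>> exchanger = new Exchanger<>();

        Thread producer = new Thread(() -> {
            List<Integer> buffer = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                buffer.add(i);
            }
            try {
                // Hand over the full buffer and get an empty one back.
                buffer = exchanger.exchange(buffer);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // The consumer offers an empty buffer and receives the full one.
        List<Integer> received = exchanger.exchange(new ArrayList<>());
        producer.join();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(swapOnce(3)); // prints [0, 1, 2]
    }
}
```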

works for me
import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
public class MyClass {
    private final Map<String, Integer> cachedData = new ConcurrentHashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    // Writers share the read lock, so puts can run concurrently with each other.
    private final Lock sharedLock = lock.readLock();
    // Copy-and-flush takes the write lock, so no put can interleave with it.
    private final Lock copyAndFlushLock = lock.writeLock();
    public void putData(String key, Integer value) {
        sharedLock.lock();
        try {
            cachedData.put(key, value);
        } finally {
            sharedLock.unlock();
        }
    }
    public Collection<Integer> copyAndFlush() {
        copyAndFlushLock.lock();
        try {
            // values() is only a view of the map, so copy it before clearing.
            Collection<Integer> values = new ArrayList<>(cachedData.values());
            cachedData.clear();
            return values;
        } finally {
            copyAndFlushLock.unlock();
        }
    }
}

Java 8 lambda api

I'm working to migrate from Rx Java to Java 8 lambdas. One example I can't find is a way to buffer requests. For example, in Rx Java, I can say the following.
Observable.create(getIterator()).buffer(20, 1000, TimeUnit.MILLISECONDS).doOnNext(list -> doWrite(list));
Here we buffer 20 elements into a list, or time out at 1000 milliseconds, whichever happens first.
Observables in Rx are "push" style, whereas Java Streams are pull-based. Would this be possible by implementing my own map operation in streams, or does the inability to emit cause problems, since doOnNext has to poll the previous element?

One way to do it would be to use a BlockingQueue and Guava. Using Queues.drain, you can create a Collection that you could then call stream() on and do your transformations. Here's a link: Guava Queues.drain
And here's a quick example:
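import com.google.common.collect.Queues;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
public void transform(BlockingQueue<Something> input) throws InterruptedException {
    List<Something> buffer = new ArrayList<>(20);
    // Returns once 20 elements have been drained or 1000 ms have passed, whichever comes first.
    Queues.drain(input, buffer, 20, 1000, TimeUnit.MILLISECONDS);
    doWrite(buffer);
}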
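If adding Guava is not an option, roughly the same size-or-timeout batching can be sketched with java.util.concurrent alone. This is not Guava's implementation; the names BatchBySizeOrTime and nextBatch are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BatchBySizeOrTime {
    // Collects up to maxSize elements, giving up once the timeout has elapsed.
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxSize, long timeout, TimeUnit unit)
            throws InterruptedException {
        List<T> batch = new ArrayList<>(maxSize);
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (batch.size() < maxSize) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                break; // time is up
            }
            T next = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (next == null) {
                break; // nothing arrived before the deadline
            }
            batch.add(next);
        }
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) {
            queue.add(i);
        }
        // The first batch fills up by size, the second ends on the timeout.
        System.out.println(nextBatch(queue, 3, 1, TimeUnit.SECONDS));        // prints [0, 1, 2]
        System.out.println(nextBatch(queue, 3, 50, TimeUnit.MILLISECONDS)); // prints [3, 4]
    }
}
```

Each resulting list can then be handed to doWrite or turned into a stream with stream().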

simple-react has similar operators, but not this exact one. It's pretty extensible though, so it should be possible to write your own. With the caveat that I haven't written this in an IDE or tested it, roughly a buffer by size with timeout operator for simple-react would look something like this
import com.aol.simple.react.async.Queue;
import com.aol.simple.react.stream.traits.LazyFutureStream;
import com.aol.simple.react.async.Queue.ClosedQueueException;
import com.aol.simple.react.util.SimpleTimer;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;
static <U> LazyFutureStream batchBySizeAndTime(LazyFutureStream stream, int size, long time, TimeUnit unit) {
    Queue queue = stream.toQueue();
    Function<Supplier<U>, Supplier<Collection<U>>> fn = s -> {
        return () -> {
            SimpleTimer timer = new SimpleTimer();
            List<U> list = new ArrayList<>();
            try {
                do {
                    // Stop as soon as the batch is full.
                    if (list.size() == size)
                        return list;
                    list.add(s.get());
                } while (timer.getElapsedNanoseconds() < unit.toNanos(time));
            } catch (ClosedQueueException e) {
                throw new ClosedQueueException(list);
            }
            return list;
        };
    };
    return stream.fromStream(queue.streamBatch(stream.getSubscription(), fn));
}

Java Multithreading large arrays access

My main class generates multiple threads based on some rules (20-40 threads that live for a long time).
Each thread creates several short-lived threads; I am using an executor for this.
I need to work on multidimensional arrays in the short-lived threads. I wrote it as shown in the code below, but I think it is inefficient, since I pass the array so many times to so many threads/tasks. I tried to access it directly from the threads by declaring it public, without success. I would be happy to get comments and advice on how to improve it.
As a next step, I am also looking at returning a one-dimensional array as a result (it might be better just to update it in the AssetFactory class), and I am not sure how to do that.
Please see the code below.
thanks
Paz
import java.util.concurrent.*;
import java.util.logging.Level;
public class AssetFactory implements Runnable {
    private volatile boolean stop = false;
    private volatile String feed;
    private double[][][] PeriodRates = new double[10][500][4];
    private String TimeStr, Bid, periodicalRateIndicator;
    private final BlockingQueue<String> workQueue;
    ExecutorService IndicatorPool = Executors.newCachedThreadPool();
    public AssetFactory(BlockingQueue<String> workQueue) {
        this.workQueue = workQueue;
    }
    @Override
    public void run() {
        while (!stop) {
            try {
                feed = workQueue.take();
                periodicalRateIndicator = CheckPeriod(TimeStr, Bid);
                if (periodicalRateIndicator.length() > 0) {
                    IndicatorPool.submit(new CalcMvg(periodicalRateIndicator, PeriodRates));
                }
                if ("Stop".equals(feed)) {
                    stop = true;
                }
            } // try
            catch (InterruptedException ex) {
                logger.log(Level.SEVERE, null, ex);
                stop = true;
            }
        } // while
    } // run
    // CheckPeriod and logger are not shown in the question
}
Here is the CalcMVG class
public class CalcMvg implements Runnable {
    private double[][][] PeriodRates = new double[10][500][4];
    public CalcMvg(String Periods, double[][][] PeriodRates) {
        System.out.println(Periods);
        this.PeriodRates = PeriodRates;
    }
    @Override
    public void run() {
        try {
            // do some work with the data of the PeriodRates array, e.g. print it (no changes to the array)
            System.out.println(PeriodRates[1][1][1]);
        }
        catch (Exception ex) {
            System.out.println(Thread.currentThread().getName() + ex.getMessage());
            logger.log(Level.SEVERE, null, ex);
        }
    } // run
} // mvg class

There are several things going on here which seem to be wrong, but it is hard to give a good answer with the limited amount of code presented.
First the actual coding issues:
There is no need to define a variable as volatile if only one thread ever accesses it (stop, feed)
You should declare variables that are only used in a local context (run method) locally in that function and not globally for the whole instance (almost all variables). This allows the JIT to do various optimizations.
An InterruptedException should terminate the thread, because it is thrown as a request to stop the thread's work.
In your code example the workQueue doesn't seem to do anything but to put the threads to sleep or stop them. Why doesn't it just immediately feed the actual worker-threads with the required workload?
And then the code structure issues:
You use threads to feed threads with work. This is inefficient, as you only have a limited amount of cores that can actually do the work. As the execution order of threads is undefined, it is likely that the IndicatorPool is either mostly idle or overfilling with tasks that have not yet been done.
If you have a finite set of work to be done, the ExecutorCompletionService might be helpful for your task.
I think you will gain the best speed increase by redesigning the code structure. Imagine the following (assuming that I understood your question correctly):
There is a blocking queue of tasks that is fed by some data source (e.g. file-stream, network).
A set of worker-threads equal to the amount of cores is waiting on that data source for input, which is then processed and put into a completion queue.
A specific data set is the "terminator" for your work (e.g. "null"). If a thread encounters this terminator, it finishes its loop and shuts down.
Now the following holds true for this construct:
Case 1: The data source is the bottleneck. It cannot be sped up by using multiple threads, as your hard disk or network won't work faster if you ask more often.
Case 2: The processing power of your machine is the bottleneck, as you cannot process more data than the worker threads/cores on your machine can handle.
In both cases the conclusion is that the worker threads need to be the ones that ask for new data as soon as they are ready to process it: either they have to wait for data, or they throttle the incoming data. This ensures maximum throughput.
If all worker threads have terminated, the work is done. This can be tracked, for example, with a CyclicBarrier or a Phaser.
Pseudo-code for the worker threads:
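public void run() {
    DataType e;
    try {
        // Pull work until the data source hands out the terminator (null).
        while ((e = dataSource.next()) != null) {
            process(e);
        }
        // Signal that this worker has finished.
        barrier.await();
    } catch (InterruptedException ex) {
        // Interrupted: restore the flag and let the thread end.
        Thread.currentThread().interrupt();
    }
}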
I hope this is helpful in your case.
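To make the pseudo-code concrete, here is a small runnable sketch of workers that pull from a blocking queue until they see a terminator. Several details are assumptions made for illustration: the names PoisonPillWorkers and sumWithWorkers, summing integers as the "processing", an empty Optional as the terminator (a BlockingQueue cannot hold null), and a CountDownLatch in place of the CyclicBarrier or Phaser mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

class PoisonPillWorkers {
    static long sumWithWorkers(List<Integer> inputs, int workers) throws InterruptedException {
        BlockingQueue<Optional<Integer>> queue = new LinkedBlockingQueue<>();
        AtomicLong total = new AtomicLong();
        CountDownLatch done = new CountDownLatch(workers);

        for (int w = 0; w < workers; w++) {
            new Thread(() -> {
                try {
                    Optional<Integer> item;
                    // Each worker asks for work when it is ready; an empty Optional ends its loop.
                    while ((item = queue.take()).isPresent()) {
                        total.addAndGet(item.get());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        for (Integer input : inputs) {
            queue.put(Optional.of(input));
        }
        // One terminator per worker, so every worker sees exactly one.
        for (int w = 0; w < workers; w++) {
            queue.put(Optional.empty());
        }

        done.await(); // all workers have terminated, so the work is done
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(sumWithWorkers(Arrays.asList(1, 2, 3, 4), cores)); // prints 10
    }
}
```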

Passing the array as an argument to the constructor is a reasonable approach, although unless you intend to copy the array it isn't necessary to initialize PeriodRates with a large array. It seems wasteful to allocate a large block of memory and then reassign its only reference straight away in the constructor. I would initialize it like this:
private final double[][][] PeriodRates;
public CalcMvg(String Periods, double[][][] PeriodRates) {
    System.out.println(Periods);
    this.PeriodRates = PeriodRates;
}
The other option is to define CalcMvg as an inner class of AssetFactory and declare PeriodRates as final. This would allow instances of CalcMvg to access PeriodRates in the outer instance of AssetFactory.
Returning the result is more difficult since it involves publishing the result across threads. One way to do this is to use synchronized methods:
private double[] result = null;
private synchronized void setResult(double[] result) {
    this.result = result;
}
public synchronized double[] getResult() {
    if (result == null) {
        throw new RuntimeException("Result has not been initialized for this instance: " + this);
    }
    return result;
}
There are more advanced multi-threading concepts available in the Java libraries, e.g. Future, that might be appropriate in this case.
Regarding your concerns about the number of threads, allowing a library class to manage the allocation of work to a thread pool might solve this concern. Something like an Executor might help with this.
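As a rough illustration of the Future route, the sketch below submits a task to an executor and reads a one-dimensional result array back through Future.get(). The class FutureResultSketch and the periodAverages calculation are hypothetical stand-ins for CalcMvg, whose real computation is not shown in the question.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureResultSketch {
    // Hypothetical calculation: the average of column 0 for each period.
    static double[] periodAverages(double[][][] periodRates) {
        double[] result = new double[periodRates.length];
        for (int p = 0; p < periodRates.length; p++) {
            double sum = 0;
            for (double[] row : periodRates[p]) {
                sum += row[0];
            }
            result[p] = sum / periodRates[p].length;
        }
        return result;
    }

    static double[] computeInBackground(double[][][] periodRates)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // The array is shared by reference; nothing is copied when the task is submitted.
            Future<double[]> future = pool.submit(() -> periodAverages(periodRates));
            return future.get(); // blocks until the task has finished
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][][] rates = {{{1, 0}, {3, 0}}, {{10, 0}}};
        System.out.println(Arrays.toString(computeInBackground(rates))); // prints [2.0, 10.0]
    }
}
```

Because get() publishes the result safely to the calling thread, no synchronized getter and setter pair is needed in this variant.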

Java multithreading and iterators, should be simple, beginner

First, I'd like to say that I'm working my way up from Python to more complicated code. I'm now on to Java and I'm extremely new to it. I understand that Java is really good at multithreading, which is good because I'm using it to process terabytes of data.
The input data is simply fed into an iterator, and I have a class that encapsulates a run function that takes one line from the iterator, does some analysis, and then writes the analysis to a file. The only bit of info the threads have to share with each other is the name of the object they are writing to. Simple, right? I just want each thread executing the run function simultaneously so we can iterate through the input data quickly. In Python it would be simple:
from multiprocessing import Pool
f = open('someoutput.csv', 'w')
def run(x):
    f.write(analyze(x))
p = Pool(8)
p.map(run, iterator_of_input_data)
So in Java, I have my 10K lines of analysis code and can very easily iterate through my input passing it my run function which in turn calls on all my analysis code sending it to an output object.
public class cool {
    ...
    public static void run(Input input, File output) {
        Analysis an = new Analysis(input, output);
    }
    public static void main(String args[]) throws Exception {
        Iterator<Input> iterator = new Parser(new File(input_file)).iterator();
        File output = new File(output_object);
        while (iterator.hasNext()) {
            cool.run(iterator.next(), output);
        }
    }
}
All I want to do is get multiple threads taking the iterator objects and executing the run statement. Everything is independent. I keep looking at Java multithreading material, but it's all about talking over networks, sharing data, etc. Is this as simple as I think it is? If someone can just point me in the right direction, I would be happy to do the legwork.
thanks

An ExecutorService (ThreadPoolExecutor) would be the Java equivalent.
ExecutorService executorService =
    new ThreadPoolExecutor(
        maxThreads, // core thread pool size
        maxThreads, // maximum thread pool size
        1, // how long idle threads above the core size are kept alive
        TimeUnit.MINUTES,
        new ArrayBlockingQueue<Runnable>(maxThreads, true),
        new ThreadPoolExecutor.CallerRunsPolicy());
ConcurrentLinkedQueue<ResultObject> resultQueue = new ConcurrentLinkedQueue<>();
while (iterator.hasNext()) {
    executorService.execute(new MyJob(iterator.next(), resultQueue));
}
Implement your job as a Runnable.
class MyJob implements Runnable {
    /* collect useful parameters in the constructor */
    public MyJob(...) {
        /* omitted */
    }
    public void run() {
        /* job here, submit result to resultQueue */
    }
}
The resultQueue is there to collect the results of your jobs.
See the Java API documentation for detailed information.

how to use Thread in java?

I have code that uses the Google search API.
I want to use a Thread to improve the speed of my program, but I have a problem.
Here is the code:
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.List;
import org.json.JSONArray;
import org.json.JSONObject;
import com.yahoo.search.WebSearchResult;
/**
* Simple Search using Google ajax Web Services
*
* @author Daniel Jones Copyright 2006 Daniel Jones Licensed under BSD open
* source license http://www.opensource.org/licenses/bsd-license.php
*/
public class GoogleSearchEngine extends Thread {
private String queryString;
private int maxResult;
private ArrayList<String> resultGoogleArrayList = null;
public ArrayList<String> getResultGoogleArrayList() {
return resultGoogleArrayList;
}
public void setResultGoogleArrayList(ArrayList<String> resultGoogleArrayList) {
this.resultGoogleArrayList = resultGoogleArrayList;
}
public String getQueryString() {
return queryString;
}
public void setQueryString(String queryString) {
this.queryString = queryString;
}
public int getMaxResult() {
return maxResult;
}
public void setMaxResult(int maxResult) {
this.maxResult = maxResult;
}
// Put your website here
public final static String HTTP_REFERER = "http://www.example.com/";
public static ArrayList<String> makeQuery(String query, int maxResult) {
ArrayList<String> finalArray = new ArrayList<String>();
ArrayList<String> returnArray = new ArrayList<String>();
try {
query = URLEncoder.encode(query, "UTF-8");
int i = 0;
String line = "";
StringBuilder builder = new StringBuilder();
while (true) {
// Call GoogleAjaxAPI to submit the query
URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?start=" + i + "&rsz=large&v=1.0&q=" + query);
URLConnection connection = url.openConnection();
if (connection == null) {
break;
}
// Value i to stop while or Max result
if (i >= maxResult) {
break;
}
connection.addRequestProperty("Referer", HTTP_REFERER);
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(),"utf-8"));
while ((line = reader.readLine()) != null) {
builder.append(line);
}
String response = builder.toString();
JSONObject json = new JSONObject(response);
JSONArray ja = json.getJSONObject("responseData").getJSONArray("results");
for (int j = 0; j < ja.length(); j++) {
try {
JSONObject k = ja.getJSONObject(j);
// Break string into 2 parts: URL and Title by <br>
returnArray.add(k.getString("url") + "<br>" + k.getString("titleNoFormatting"));
}
catch (Exception e) {
e.printStackTrace();
}
}
i += 8;
}
// Remove objects that is over the max number result required
if (returnArray.size() > maxResult) {
for (int k=0; k<maxResult; k++){
finalArray.add(returnArray.get(k));
}
}
else
return returnArray;
return finalArray;
}
catch (Exception e) {
e.printStackTrace();
}
return null;
}
@Override
public void run() {
// TODO Auto-generated method stub
//super.run();
this.resultGoogleArrayList = GoogleSearchEngine.makeQuery(queryString, maxResult);
System.out.println("Code run here ");
}
public static void main(String[] args)
{
Thread test = new GoogleSearchEngine();
((GoogleSearchEngine) test).setQueryString("data ");
((GoogleSearchEngine) test).setMaxResult(10);
test.start();
ArrayList<String> returnGoogleArrayList = null;
returnGoogleArrayList = ((GoogleSearchEngine) test).getResultGoogleArrayList();
System.out.print("contents of al:" + returnGoogleArrayList);
}
}
When I run it, it enters the run method, but it doesn't seem to execute the makeQuery method, and I get a null array back.
When I don't use a Thread, it works normally.
Can you tell me why, or suggest a solution?
Thanks

One of the main problems is that you didn't wait for the asynchronous computation to complete. You can wait by using Thread.join(), but it'll be even better if you use a Future<V>, such as a FutureTask<V> instead.
A Future represents the result of an asynchronous computation. Methods are provided to check if the computation is complete, to wait for its completion, and to retrieve the result of the computation. The result can only be retrieved using method get when the computation has completed, blocking if necessary until it is ready.
API links
Package java.util.concurrent (contains many high level concurrency utilities)
interface Future<V> (represents result of asynchronous computation)
interface RunnableFuture<V> (a Future that is Runnable)
class FutureTask<V> (implementation that wraps a Callable or Runnable object)
interface Executor ("normally used instead of explicitly creating threads")
class Executors (provides factory and utility methods)
Tutorials and lessons
Concurrency
High level concurrency objects
Concurrency utilities language guide
See also
Effective Java 2nd Edition
Item 68: Prefer executors and tasks to threads
Item 69: Prefer concurrency utilities to wait and notify
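A minimal sketch of the FutureTask variant. To keep it self-contained, fakeQuery is a hypothetical stand-in for makeQuery that returns canned strings instead of calling a web service; the class name FutureTaskSketch is also made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

class FutureTaskSketch {
    // Hypothetical stand-in for makeQuery: no network access, just canned results.
    static List<String> fakeQuery(String query, int maxResult) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < maxResult; i++) {
            results.add(query + "-result-" + i);
        }
        return results;
    }

    static List<String> queryInBackground(String query, int maxResult)
            throws InterruptedException, ExecutionException {
        // FutureTask is both a Runnable (so a Thread can run it) and a Future (so we can wait on it).
        FutureTask<List<String>> task = new FutureTask<>(() -> fakeQuery(query, maxResult));
        new Thread(task).start();
        // get() blocks until the computation has completed, so the result is never null here.
        return task.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("contents of al:" + queryInBackground("data", 2));
        // prints contents of al:[data-result-0, data-result-1]
    }
}
```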

Your problem is simply that you're not waiting for the thread to perform its job, so you can't get the result. This can be fixed by simply doing
test.join();
before getting the result. Of course, that way the code isn't any faster than if you were doing everything in the main thread. Threads don't make things magically faster. You'll only get a benefit if you do multiple queries in parallel.

I believe you need to wait for the thread to complete before you can read the results.
Use Thread.join() for that purpose.

You don't wait for the thread to finish its calculation before getting the result, therefore you won't get the result.
Doing the same work in a single new thread will not be any faster than doing it in the main thread.
Doing multiple requests in multiple threads may be faster than doing them serially in a single thread.
You should avoid handling threads directly when it's much simpler to use a thread pool (via an ExecutorService implementation as returned by one of the helper methods in Executors), which gives the same benefits and saves you from doing all the manual synchronization and waiting, which is very error-prone.
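For illustration, here is a sketch of several queries running in parallel on a thread pool. slowQuery is a hypothetical placeholder that sleeps instead of making a network call, and the pool size of 4 and the class name ParallelQueriesSketch are arbitrary choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelQueriesSketch {
    // Hypothetical stand-in for a slow network call such as makeQuery.
    static String slowQuery(String query) throws InterruptedException {
        Thread.sleep(200);
        return "results for " + query;
    }

    static List<String> queryAll(List<String> queries)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String query : queries) {
                tasks.add(() -> slowQuery(query));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task is done and returns futures in submission order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(queryAll(Arrays.asList("akka", "java", "threads")));
        // prints [results for akka, results for java, results for threads]
    }
}
```

With three queries and four pool threads, the three sleeps overlap instead of running one after another, which is where any speed-up comes from.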

When you call test.start(), the new thread test is started while the original main thread continues. You then immediately carry on processing on the main thread, calling test.getResultGoogleArrayList(), which at that point is still null because the thread test is most likely still executing makeQuery.
What you are trying to do is not really geared towards multi-threading, and you are not likely to see any performance improvement simply by executing something on its own thread.
Multi-threading is only useful if you have more than one task that can be processed concurrently, whereas what you are doing fits the linear or synchronous paradigm.

Before you start trying 'to use threads to improve the speed of your program' you should understand exactly what threading is. It is not some magic tool that just makes things faster. It allows you to perform multiple tasks 'simultaneously', depending on your hardware etc. If you have a single core processor it won't be simultaneous but execution can switch from one thread to the other whenever one thread has to wait for something to happen (e.g. user interaction etc.). There are other reasons for threads to switch execution and it depends on a lot of factors, which you don't really need to know, just appreciate that you can't take anything for granted when it comes to threading (don't assume something will happen at a specific time unless you specifically tell it to do so).
In your example above you have 2 threads, the main thread and the GoogleSearchEngine thread. In the main thread you create the second thread and tell it to start. But as I said they can both run simultaneously so execution in the main thread will continue onto the next line to get the results, but the other thread may not even have started, or at least not got round to doing anything worthwhile, which is why you are getting null.
In this case there is absolutely no reason to use multiple threads. Your program does one task then ends, so it might as well do it all in one thread.
The sun java tutorial on concurrency.

It is making the call to your makeQuery method, but that call fails with a connection timeout exception, as shown below.
Call test.join(); before printing the contents to avoid the program ending abruptly. Handle the connection exception as you need.
As an alternative to join(), I would also recommend a timed waiting mechanism such as wait(timeout) or CountDownLatch.await(timeout).
java.net.ConnectException: Connection timed out: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
....
....
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:570)
at co.uk.mak.GoogleSearchEngine.makeQuery(GoogleSearchEngine.java:81)
at co.uk.mak.GoogleSearchEngine.run(GoogleSearchEngine.java:124)
