Concurrency on Vertx

I have joined the ranks of Vert.x lovers. However, the single-threaded main loop may not work for me, because my server might receive 50 file download requests at the same moment. As a workaround I have created this class:
public abstract T onRun() throws Exception;
public abstract void onSuccess(T result);
public abstract void onException();
private static final int poolSize = Runtime.getRuntime().availableProcessors();
private static final long maxExecuteTime = 120000;
private static WorkerExecutor mExecutor;
private static final String BG_THREAD_TAG = "BG_THREAD";
protected RoutingContext ctx;
private boolean isThreadInBackground(){
return Thread.currentThread().getName() != null && Thread.currentThread().getName().equals(BG_THREAD_TAG);
}
//on success will not be called if exception be thrown
public BackgroundExecutor(RoutingContext ctx){
this.ctx = ctx;
if(mExecutor == null){
mExecutor = MyVertxServer.vertx.createSharedWorkerExecutor("my-worker-pool",poolSize,maxExecuteTime);
}
if(!isThreadInBackground()){
/** we are unlocking the lock before res.succeeded , because it might take long and keeps any thread waiting */
mExecutor.executeBlocking(future -> {
try{
Thread.currentThread().setName(BG_THREAD_TAG);
T result = onRun();
future.complete(result);
}catch (Exception e) {
GUI.display(e);
e.printStackTrace();
onException();
future.fail(e);
}
/** false here means they should not be parallel , and will run without order multiple times on same context*/
},false, res -> {
if(res.succeeded()){
onSuccess((T)res.result());
}
});
}else{
GUI.display("AVOIDED DUPLICATE BACKGROUND THREADING");
System.out.println("AVOIDED DUPLICATE BACKGROUND THREADING");
try{
T result = onRun();
onSuccess((T)result);
}catch (Exception e) {
GUI.display(e);
e.printStackTrace();
onException();
}
}
}
This allows the handlers to extend it and use it like this:
public abstract class DefaultFileHandler implements MyHttpHandler{
public abstract File getFile(String suffix);
@Override
public void Handle(RoutingContext ctx, VertxUtils utils, String suffix) {
new BackgroundExecutor<Void>(ctx) {
@Override
public Void onRun() throws Exception {
File file = getFile(URLDecoder.decode(suffix, "UTF-8"));
if(file == null || !file.exists()){
utils.sendResponseAndEnd(ctx.response(),404);
return null;
}else{
utils.sendFile(ctx, file);
}
return null;
}
@Override
public void onSuccess(Void result) {}
@Override
public void onException() {
utils.sendResponseAndEnd(ctx.response(),404);
}
};
}
And here is how I initialize my Vert.x server:
vertx.deployVerticle(MainDeployment.class.getCanonicalName(),res -> {
if (res.succeeded()) {
GUI.display("Deployed");
} else {
res.cause().printStackTrace();
}
});
server.requestHandler(router::accept).listen(port);
And here is my MainDeployment class:
public class MainDeployment extends AbstractVerticle{
@Override
public void start() throws Exception {
// Different ways of deploying verticles
// Deploy a verticle and don't wait for it to start
for(Entry<String, MyHttpHandler> entry : MyVertxServer.map.entrySet()){
MyVertxServer.router.route(entry.getKey()).handler(new Handler<RoutingContext>() {
@Override
public void handle(RoutingContext ctx) {
String[] handlerID = ctx.request().uri().split(ctx.currentRoute().getPath());
String suffix = handlerID.length > 1 ? handlerID[1] : null;
entry.getValue().Handle(ctx, new VertxUtils(), suffix);
}
});
}
}
}
This works fine when and where I need it, but I still wonder whether there is a better way to handle concurrency like this in Vert.x. If so, an example would be really appreciated. Thanks a lot.

I don't fully understand your problem and reasons for your solution. Why don't you implement one verticle to handle your http uploads and deploy it multiple times? I think that handling 50 concurrent uploads should be a piece of cake for vert.x.
When deploying a verticle using a verticle name, you can specify the number of verticle instances that you want to deploy:
DeploymentOptions options = new DeploymentOptions().setInstances(16);
vertx.deployVerticle("com.mycompany.MyOrderProcessorVerticle", options);
This is useful for scaling easily across multiple cores. For example, you might have a web-server verticle to deploy and multiple cores on your machine, so you want to deploy multiple instances to utilise all the cores.
http://vertx.io/docs/vertx-core/java/#_specifying_number_of_verticle_instances

Vert.x is designed so that concurrency issues do not normally arise.
In general, Vert.x does not recommend a multi-threaded model,
because it is not easy to handle correctly.
If you choose a multi-threaded model, you have to think about shared data.
If you simply want to spread the work across several event loops,
first check how many CPU cores you have,
and then set the number of instances accordingly:
DeploymentOptions options = new DeploymentOptions().setInstances(4);
vertx.deployVerticle("com.mycompany.MyOrderProcessorVerticle", options);
However, if you have a 4-core CPU, do not deploy more than 4 instances.
Deploying more instances than you have cores will not improve performance.
Vert.x concurrency reference:
http://vertx.io/docs/vertx-core/java/

Related

Correct way of sharing singleton clients across verticles in Vert.x

I have a vertx application where I deploy multiple instances of verticle A (HttpVerticle.java) and multiple instances of verticle B (AerospikeVerticle.java). The aerospike verticles need to share a single AerospikeClient. The HttpVerticle listens to port 8888 and calls AerospikeVerticle using the event bus. My questions are:
Is using sharedData the right way to share singleton client instances? Is there any other recommended / cleaner approach? I plan to create and share more such singleton objects (cosmos db clients, meterRegistry etc.) in the application. I plan to use sharedData.localMap to share them in a similar fashion.
Is it possible to use vertx's eventloop as the backing eventloop for aerospike client? Such that the aerospike client initialisation does not need to create its own new eventloop? Currently looks like the onRecord part of the aerospike get call runs on aerospike's eventloop.
public class SharedAerospikeClient implements Shareable {
public final EventLoops aerospikeEventLoops;
public final AerospikeClient client;
public SharedAerospikeClient() {
EventPolicy eventPolicy = new EventPolicy();
aerospikeEventLoops = new NioEventLoops(eventPolicy, 2 * Runtime.getRuntime().availableProcessors());
ClientPolicy clientPolicy = new ClientPolicy();
clientPolicy.eventLoops = aerospikeEventLoops;
client = new AerospikeClient(clientPolicy, "localhost", 3000);
}
}
Main.java
public class Main {
public static void main(String[] args) {
Vertx vertx = Vertx.vertx();
LocalMap localMap = vertx.sharedData().getLocalMap("SHARED_OBJECTS");
localMap.put("AEROSPIKE_CLIENT", new SharedAerospikeClient());
vertx.deployVerticle("com.demo.HttpVerticle", new DeploymentOptions().setInstances(2 * 4));
vertx.deployVerticle("com.demo.AerospikeVerticle", new DeploymentOptions().setInstances(2 * 4));
}
}
HttpVerticle.java
public class HttpVerticle extends AbstractVerticle {
@Override
public void start(Promise<Void> startPromise) throws Exception {
vertx.createHttpServer().requestHandler(req -> {
vertx.eventBus().request("read.aerospike", req.getParam("id"), ar -> {
req.response()
.putHeader("content-type", "text/plain")
.end(ar.result().body().toString());
System.out.println(Thread.currentThread().getName());
});
}).listen(8888, http -> {
if (http.succeeded()) {
startPromise.complete();
System.out.println("HTTP server started on port 8888");
} else {
startPromise.fail(http.cause());
}
});
}
}
AerospikeVerticle.java
public class AerospikeVerticle extends AbstractVerticle {
private SharedAerospikeClient sharedAerospikeClient;
@Override
public void start(Promise<Void> startPromise) throws Exception {
EventBus eventBus = vertx.eventBus();
sharedAerospikeClient = (SharedAerospikeClient) vertx.sharedData().getLocalMap("SHARED_OBJECTS").get("AEROSPIKE_CLIENT");
MessageConsumer<String> consumer = eventBus.consumer("read.aerospike");
consumer.handler(this::getRecord);
System.out.println("Started aerospike verticle");
startPromise.complete();
}
public void getRecord(Message<String> message) {
sharedAerospikeClient.client.get(
sharedAerospikeClient.aerospikeEventLoops.next(),
new RecordListener() {
@Override
public void onSuccess(Key key, Record record) {
if (record != null) {
String result = record.getString("value");
message.reply(result);
} else {
message.reply("not-found");
}
}
@Override
public void onFailure(AerospikeException exception) {
message.reply("error");
}
},
sharedAerospikeClient.client.queryPolicyDefault,
new Key("myNamespace", "mySet", message.body())
);
}
}

I don't know about the Aerospike Client.
Regarding sharing objects between verticles, indeed shared data maps are designed for this purpose.
However, it is easier to:
create the shared client in your main class or custom launcher
provide the client as a parameter of the verticle constructor
The Vertx interface has a deployVerticle(Supplier<Verticle>, DeploymentOptions) method which is convenient in this case:
MySharedClient client = initSharedClient();
vertx.deployVerticle(() -> new SomeVerticle(client), deploymentOptions);

The use of threads and the Java Future interface in AWS Lambda

I want to create an AWS Lambda function in Java that writes to a database in Firestore. The short story is that the code does what it should when I execute it on my own computer using NetBeans (in truth it works most of the time, but not always, maybe due to problems with my internet connection), but nothing at all happens when I deploy it as a Lambda function and invoke it. I suspect that this has less to do with Firestore itself than with how AWS Lambda handles asynchronous operations.
Now to the details!
As a simple example, the method that writes to the Firestore object db reads
public static void writeFirestore(Firestore db){
try{
DateTime now = DateTime.now();
String time = now.toString();
Map<String, String> data = new HashMap<>();
data.put("time", time);
String collTitle = "Notebook";
String docTitle = "Document: "+time;
db.collection(collTitle).document(docTitle).set(data);
System.out.println("wrote to Firestore");
}
catch(Exception e){
System.out.println("Could not write to db: "+e.toString());
}
}
Now, as it takes some time to connect to Firestore and initialize db, I want to make sure that db is not passed as an argument into writeFirestore() before it has been properly retrieved. So I define db in the form of a Future object, using ExecutorService, and then retrieve the object db with the get() method. For this, I define the class TaskRunner:
public class TaskRunner {
ExecutorService executor;
public TaskRunner(){
executor = Executors.newSingleThreadExecutor();
}
public static interface Callback<T>{
public void onCallback(T result);
}
public <T> void executeAsync(Callable<T> callable, Callback<T> callback) throws Exception{
try{
Future future = executor.submit(callable);
Object result = future.get();
if(result != null){
System.out.println("result is not null; applying callback...");
callback.onCallback((T) result);
}
else{
System.out.println("result is null");
}
}
catch(Exception e){
System.out.println("Problem running executeAsync: "+e.toString());
}
}
}
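For comparison, here is a minimal sketch of the same "submit, then get()" idea with a typed Future. The class name TypedTaskRunner and the method runAndWait are hypothetical and not part of the question's code. The sketch makes two things visible: get() blocks the calling thread until the callable has finished, so the call is synchronous from the caller's point of view, and the executor is shut down afterwards so that its worker thread does not outlive the call. Whether either point explains the Lambda behaviour described in this question is not established here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical variant of TaskRunner, for illustration only.
final class TypedTaskRunner {

    // Runs the callable on a worker thread, blocks until it finishes, then
    // hands a non-null result to the callback on the calling thread.
    // Returns true only if the callback was invoked.
    static <T> boolean runAndWait(Callable<T> callable, Consumer<T> callback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(callable);
            T result = future.get(); // blocks until the callable has returned
            if (result == null) {
                return false;
            }
            callback.accept(result);
            return true;
        } catch (ExecutionException e) {
            // The callable threw; the original exception is the cause.
            System.out.println("Task failed: " + e.getCause());
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        } finally {
            executor.shutdown(); // let the worker thread exit
        }
    }

    public static void main(String[] args) {
        boolean ran = TypedTaskRunner.<String>runAndWait(
                () -> "db-handle",
                value -> System.out.println("callback received " + value));
        System.out.println("callback ran: " + ran);
    }
}
```

Because Future<T> carries the result type, the unchecked cast from the original executeAsync() is no longer needed.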
Writing the example document to my fixed database db now goes as follows:
I define the class FirestoreCreator that implements Callable with the purpose of retrieving the Firestore object db:
public static class FirestoreCreator implements Callable<Firestore>{
@Override
public Firestore call() throws Exception {
String projectId = "myProjectId";
GoogleCredentials credentials =
GoogleCredentials.fromStream(new FileInputStream("myCredentialsFile.json"));
FirestoreOptions firestoreOptions = FirestoreOptions.getDefaultInstance()
.toBuilder()
.setProjectId(projectId)
.setCredentials(credentials)
.build();
Firestore db = firestoreOptions.getService();
return db;
}
}
I implement the TaskRunner.Callback interface using writeFirestore().
I create a TaskRunner object, taskRunner, and call its executeAsync() method with the above two objects as parameters.
These three steps are collected in the final method testUpdateFirestoreInterface(), which does the job:
public static void testUpdateFirestoreInterface(){
FirestoreCreator fsCreator = new FirestoreCreator();
TaskRunner.Callback<Firestore> updateCallback = new TaskRunner.Callback<Firestore>() {
@Override
public void onCallback(Firestore result) {
writeFirestore(result);
}
};
TaskRunner taskRunner = new TaskRunner();
try {
taskRunner.executeAsync(fsCreator, updateCallback);
} catch (Exception ex) {
System.out.println("Failed to run executeAsync");
}
}
As I already mentioned in the introduction, the code works (most times) when I run it on my computer, but not at all in AWS Lambda. No exception is thrown, and yet no document has been written in Firestore.
The discussion about threads in AWS Lambda (https://dzone.com/articles/multi-threaded-programming-with-aws-lambda) made me suspect that the reason is that a thread started by ExecutorService is not being handled properly.
Does anyone know what goes wrong and what a solution could look like?

RxNetty reuse the connection

I want to use Netflix Ribbon as a TCP client load balancer without Spring Cloud, and I wrote some test code.
public class App implements Runnable
{
public static String msg = "hello world";
public BaseLoadBalancer lb;
public RxClient<ByteBuf, ByteBuf > client;
public Server echo;
App(){
lb = new BaseLoadBalancer();
echo = new Server("localhost", 8000);
lb.setServersList(Lists.newArrayList(echo));
DefaultClientConfigImpl impl = DefaultClientConfigImpl.getClientConfigWithDefaultValues();
client = RibbonTransport.newTcpClient(lb, impl);
}
public static void main( String[] args ) throws Exception
{
for( int i = 40; i > 0; i--)
{
Thread t = new Thread(new App());
t.start();
t.join();
}
System.out.println("Main thread is finished");
}
public String sendAndRecvByRibbon(final String data)
{
String response = "";
try {
response = client.connect().flatMap(new Func1<ObservableConnection<ByteBuf, ByteBuf>,
Observable<ByteBuf>>() {
public Observable<ByteBuf> call(ObservableConnection<ByteBuf, ByteBuf> connection) {
connection.writeStringAndFlush(data);
return connection.getInput();
}
}).timeout(1, TimeUnit.SECONDS).retry(1).take(1)
.map(new Func1<ByteBuf, String>() {
public String call(ByteBuf ByteBuf) {
return ByteBuf.toString(Charset.defaultCharset());
}
})
.toBlocking()
.first();
}
catch (Exception e) {
System.out.println(((LoadBalancingRxClientWithPoolOptions) client).getMaxConcurrentRequests());
System.out.println(lb.getLoadBalancerStats());
}
return response;
}
public void run() {
for (int i = 0; i < 200; i++) {
sendAndRecvByRibbon(msg);
}
}
}
I find that it creates a new socket every time I call sendAndRecvByRibbon, even though poolEnabled is set to true. This confuses me; am I missing something?
There is also no option to configure the size of the pool, but there are PoolMaxThreads and MaxConnectionsPerHost.
My question is how to use a connection pool in my simple code, and what is wrong with my sendAndRecvByRibbon: it opens a socket and then uses it only once. How can I reuse the connection? Thanks for your time.
The server is just a simple echo server written in Python 3. I commented out conn.close() because I want to use a long-lived connection.
import socket
import threading
import time
import socketserver
class ThreadedTCPRequestHandler(socketserver.BaseRequestHandler):
def handle(self):
conn = self.request
while True:
client_data = conn.recv(1024)
if not client_data:
time.sleep(5)
conn.sendall(client_data)
# conn.close()
class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
pass
if __name__ == "__main__":
HOST, PORT = "localhost", 8000
server = ThreadedTCPServer((HOST, PORT), ThreadedTCPRequestHandler)
ip, port = server.server_address
server_thread = threading.Thread(target=server.serve_forever)
server_thread.daemon = True
server_thread.start()
server.serve_forever()
And the Maven POM: I just added two dependencies to the IDE's auto-generated POM.
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>com.netflix.ribbon</groupId>
<artifactId>ribbon</artifactId>
<version>2.2.2</version>
</dependency>
The code for printing src_port:
@Sharable
public class InHandle extends ChannelInboundHandlerAdapter {
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
System.out.println(ctx.channel().localAddress());
super.channelRead(ctx, msg);
}
}
public class Pipeline implements PipelineConfigurator<ByteBuf, ByteBuf> {
public InHandle handler;
Pipeline() {
handler = new InHandle();
}
public void configureNewPipeline(ChannelPipeline pipeline) {
pipeline.addFirst(handler);
}
}
and change client = RibbonTransport.newTcpClient(lb, impl); to Pipeline pipe = new Pipeline(); client = RibbonTransport.newTcpClient(lb, pipe, impl, new DefaultLoadBalancerRetryHandler(impl));

So, your App() constructor does the initialization of lb/client/etc.
Then you're starting 40 different threads with 40 different RxClient instances (each instance has own pool by default) by calling new App() in the first for loop. To make things clear - the way you spawn multiple RxClient instances here does not allow them to share any common pool. Try to use one RxClient instance instead.
What if you change your main method like below, does it stop creating extra sockets?
public static void main( String[] args ) throws Exception
{
App app = new App(); // Create things just once
for( int i = 40; i > 0; i--)
{
Thread t = new Thread(()->app.run()); // pass the run()
t.start();
t.join();
}
System.out.println("Main thread is finished");
}
If the above does not fully help (at the least it should reduce the number of created sockets by a factor of 40), can you please clarify how exactly you determined that:
i find it will create a new socket everytime i call sendAndRecvByRibbon
and what your measurements are after you update the constructor with this line:
DefaultClientConfigImpl impl = DefaultClientConfigImpl.getClientConfigWithDefaultValues();
impl.set(CommonClientConfigKey.PoolMaxThreads,1); //Add this one and test
Update
Yes, looking at sendAndRecvByRibbon, it seems that it never marks the PooledConnection as no longer acquired, which is done by calling close once you don't expect any further reads from it.
As long as you expect only a single read event, just change this line
return connection.getInput();
to
return connection.getInput().zipWith(Observable.just(connection), new Func2<ByteBuf, ObservableConnection<ByteBuf, ByteBuf>, ByteBuf>() {
@Override
public ByteBuf call(ByteBuf byteBuf, ObservableConnection<ByteBuf, ByteBuf> conn) {
conn.close();
return byteBuf;
}
});
Note that if you design a more complex protocol over TCP, the input ByteBuf can be analyzed for your specific 'end of communication' marker, which indicates that the connection can be returned to the pool.

Java threads - waiting on all child threads in order to proceed

So a little background;
I am working on a project in which a servlet is going to release crawlers upon a lot of text files within a file system. I was thinking of dividing the load under multiple threads, for example:
a crawler enters a directory, finds 3 files and 6 directories. it will start processing the files and start a thread with a new crawler for the other directories. So from my creator class I would create a single crawler upon a base directory. The crawler would assess the workload and if deemed needed it would spawn another crawler under another thread.
My crawler class looks like this
package com.fujitsu.spider;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
public class DocumentSpider implements Runnable, Serializable {
private static final long serialVersionUID = 8401649393078703808L;
private Spidermode currentMode = null;
private String URL = null;
private String[] terms = null;
private float score = 0;
private ArrayList<SpiderDataPair> resultList = null;
public enum Spidermode {
FILE, DIRECTORY
}
public DocumentSpider(String resourceURL, Spidermode mode, ArrayList<SpiderDataPair> resultList) {
currentMode = mode;
setURL(resourceURL);
this.setResultList(resultList);
}
@Override
public void run() {
try {
if (currentMode == Spidermode.FILE) {
doCrawlFile();
} else {
doCrawlDirectory();
}
} catch (Exception e) {
e.printStackTrace();
}
System.out.println("SPIDER # " + URL + " HAS FINISHED.");
}
public Spidermode getCurrentMode() {
return currentMode;
}
public void setCurrentMode(Spidermode currentMode) {
this.currentMode = currentMode;
}
public String getURL() {
return URL;
}
public void setURL(String uRL) {
URL = uRL;
}
public void doCrawlFile() throws Exception {
File target = new File(URL);
if (target.isDirectory()) {
throw new Exception(
"This URL points to a directory while the spider is in FILE mode. Please change this spider to DIRECTORY mode.");
}
procesFile(target);
}
public void doCrawlDirectory() throws Exception {
File baseDir = new File(URL);
if (!baseDir.isDirectory()) {
throw new Exception(
"This URL points to a FILE while the spider is in DIRECTORY mode. Please change this spider to FILE mode.");
}
File[] directoryContent = baseDir.listFiles();
for (File f : directoryContent) {
if (f.isDirectory()) {
DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.DIRECTORY, this.resultList);
spider.terms = this.terms;
(new Thread(spider)).start();
} else {
DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.FILE, this.resultList);
spider.terms = this.terms;
(new Thread(spider)).start();
}
}
}
public void procesDirectory(String target) throws IOException {
File base = new File(target);
File[] directoryContent = base.listFiles();
for (File f : directoryContent) {
if (f.isDirectory()) {
procesDirectory(f.getPath());
} else {
procesFile(f);
}
}
}
public void procesFile(File target) throws IOException {
BufferedReader br = new BufferedReader(new FileReader(target));
String line;
while ((line = br.readLine()) != null) {
String[] words = line.split(" ");
for (String currentWord : words) {
for (String a : terms) {
if (a.toLowerCase().equalsIgnoreCase(currentWord)) {
score += 1f;
}
;
if (currentWord.toLowerCase().contains(a)) {
score += 1f;
}
;
}
}
}
br.close();
resultList.add(new SpiderDataPair(this, URL));
}
public String[] getTerms() {
return terms;
}
public void setTerms(String[] terms) {
this.terms = terms;
}
public float getScore() {
return score;
}
public void setScore(float score) {
this.score = score;
}
public ArrayList<SpiderDataPair> getResultList() {
return resultList;
}
public void setResultList(ArrayList<SpiderDataPair> resultList) {
this.resultList = resultList;
}
}
The problem I am facing is that my root crawler holds this list of results from every crawler, which I want to process further. The operation that processes the data from this list is called from the servlet (or the main method in this example). However, the operation is always called before all of the crawlers have completed their processing, so the results are processed too soon, which leads to incomplete data.
I tried solving this using the join methods, but unfortunately I can't seem to figure this one out.
package com.fujitsu.spider;
import java.util.ArrayList;
import com.fujitsu.spider.DocumentSpider.Spidermode;
public class Main {
public static void main(String[] args) throws InterruptedException {
ArrayList<SpiderDataPair> results = new ArrayList<SpiderDataPair>();
String [] terms = {"SERVER","CHANGE","MO"};
DocumentSpider spider1 = new DocumentSpider("C:\\Users\\Mark\\workspace\\Spider\\Files", Spidermode.DIRECTORY, results);
spider1.setTerms(terms);
DocumentSpider spider2 = new DocumentSpider("C:\\Users\\Mark\\workspace\\Spider\\File2", Spidermode.DIRECTORY, results);
spider2.setTerms(terms);
Thread t1 = new Thread(spider1);
Thread t2 = new Thread(spider2);
t1.start();
t1.join();
t2.start();
t2.join();
for(SpiderDataPair d : spider1.getResultList()){
System.out.println("PATH -> " + d.getFile() + " SCORE -> " + d.getSpider().getScore());
}
for(SpiderDataPair d : spider2.getResultList()){
System.out.println("PATH -> " + d.getFile() + " SCORE -> " + d.getSpider().getScore());
}
}
}
TL;DR
I really wish to understand this subject so any help would be immensely appreciated!.

You need a couple of changes in your code:
In the spider:
List<Thread> threads = new LinkedList<Thread>();
for (File f : directoryContent) {
if (f.isDirectory()) {
DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.DIRECTORY, this.resultList);
spider.terms = this.terms;
Thread thread = new Thread(spider);
threads.add(thread);
thread.start();
} else {
DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.FILE, this.resultList);
spider.terms = this.terms;
Thread thread = new Thread(spider);
threads.add(thread);
thread.start();
}
}
for (Thread thread : threads) thread.join();
The idea is to create a new thread for each spider and start it. Once they are all running, you wait until each one is done before the spider itself finishes. This way each spider thread keeps running until all of its work is done (thus the top thread runs until all children and their children are finished).
You also need to change your runner so that it runs the two spiders in parallel instead of one after another like this:
Thread t1 = new Thread(spider1);
Thread t2 = new Thread(spider2);
t1.start();
t2.start();
t1.join();
t2.join();
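The "start every child, then join every child" pattern can be sketched in isolation as follows. This is a minimal illustration and not the spider itself: the class JoinChildrenDemo, its methods and the depth/fanOut parameters are hypothetical stand-ins for directories and their sub-directories. The sketch also uses a synchronized list for the shared results, on the assumption that several threads add to it at once; a plain ArrayList is not safe for concurrent add() calls.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the spider: each worker records its name,
// spawns fanOut children while depth > 0, and waits for all of them.
final class JoinChildrenDemo {

    static void crawl(String name, int depth, int fanOut, List<String> results) {
        results.add(name);
        if (depth == 0) {
            return;
        }
        List<Thread> children = new ArrayList<>();
        for (int i = 0; i < fanOut; i++) {
            String childName = name + "/" + i;
            Thread child = new Thread(() -> crawl(childName, depth - 1, fanOut, results));
            children.add(child);
            child.start(); // start every child before waiting on any of them
        }
        for (Thread child : children) {
            try {
                child.join(); // returns only after that child and its descendants finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Runs the whole tree and returns once every worker has finished.
    static List<String> crawlAll(int depth, int fanOut) {
        List<String> results = Collections.synchronizedList(new ArrayList<>());
        crawl("root", depth, fanOut, results);
        return results;
    }

    public static void main(String[] args) {
        // depth 2 with fan-out 3 visits 1 + 3 + 9 = 13 nodes
        System.out.println(crawlAll(2, 3).size());
    }
}
```

Because every worker joins its own children before returning, the caller of crawlAll() only sees the list after the whole tree has been visited.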

You should use a higher-level library than bare Thread for this task. I would suggest looking into ExecutorService in particular and all of java.util.concurrent generally. There are abstractions there that can manage all of the threading issues while providing well-formed tasks a properly protected environment in which to run.
For your specific problem, I would recommend some sort of blocking queue of tasks and a standard producer-consumer architecture. Each task knows how to determine if its path is a file or directory. If it is a file, process the file; if it is a directory, crawl the directory's immediate contents and enqueue new tasks for each sub-path. You could also use some properly-synchronized shared state to cap the number of files processed, depth, etc. Also, the service provides the ability to await termination of its tasks, making the "join" simpler.
With this architecture, you decouple the notion of threads and thread management (handled by the ExecutorService) from your business logic of tasks (typically a Runnable or Callable). The service itself can be tuned in how it creates threads, such as a fixed maximum number of threads or a scalable number depending on how many concurrent tasks exist (see the factory methods on java.util.concurrent.Executors). Threads, which are more expensive than the Runnables they execute, are re-used to conserve resources.
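A minimal sketch of this task-based approach, under a few assumptions: the class name PooledCrawler is hypothetical, the "processing" of a file is reduced to collecting it, and a java.util.concurrent.Phaser is used to count outstanding tasks. The Phaser is a choice made for this sketch, not something prescribed above; it is used because tasks submit further tasks, so the caller cannot simply shut the pool down and await termination before the crawl has finished.

```java
import java.io.File;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

// Hypothetical crawler: one task per path, all run on a fixed-size pool.
final class PooledCrawler {

    // Visits everything under root and returns the regular files found.
    static Queue<File> crawl(File root, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<File> files = new ConcurrentLinkedQueue<>();
        Phaser pending = new Phaser(1); // the single initial party is the calling thread
        submit(root, pool, files, pending);
        pending.arriveAndAwaitAdvance(); // returns once every task has deregistered
        pool.shutdown();
        return files;
    }

    private static void submit(File path, ExecutorService pool, Queue<File> files, Phaser pending) {
        pending.register(); // count the task before it is queued
        pool.execute(() -> {
            try {
                File[] children = path.listFiles(); // null unless path is a readable directory
                if (children == null) {
                    if (path.isFile()) {
                        files.add(path); // stand-in for processing the file
                    }
                } else {
                    for (File child : children) {
                        submit(child, pool, files, pending); // enqueue one task per sub-path
                    }
                }
            } finally {
                pending.arriveAndDeregister(); // this task is done
            }
        });
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println(crawl(root, 4).size() + " files found under " + root);
    }
}
```

Tasks never block on each other here, so a small fixed pool is enough regardless of how deep the directory tree is.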
If your objective is primarily something functional that works in production quality, then the library is the way to go. However, if your objective is to understand the lower-level details of thread management, then you may want to investigate the use of latches and perhaps thread groups to manage them at a lower level, exposing the details of the implementation so you can work with the details.
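As a minimal sketch of the latch idea, assuming the number of workers is known before they start: the class LatchDemo and its method are hypothetical, and the "work" is reduced to incrementing a counter. A CountDownLatch needs its count up front, so for a recursive crawl where the number of tasks is discovered along the way it would have to be combined with some other form of bookkeeping.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: wait for a known number of worker threads with a latch.
final class LatchDemo {

    // Starts the given number of workers, blocks until each has counted
    // down, and returns how many of them ran.
    static int runAndAwait(int workers) {
        CountDownLatch done = new CountDownLatch(workers);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    completed.incrementAndGet(); // stand-in for processing one file
                } finally {
                    done.countDown(); // always signal, even if the work throws
                }
            }).start();
        }
        try {
            done.await(); // blocks until the count reaches zero
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndAwait(8) + " workers finished");
    }
}
```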

Using multiple threads to get data from twitter using twitter4j

I have a set of the keywords (over 600) and I want to use streaming api to track tweets with them. Twitter api limits the number of keywords, which you are allowed to track, to 200. So I decided to have several threads that will do it, using several OAuth tokens for this. This is how I do it:
String[] dbKeywords = KeywordImpl.listKeywords();
List<String[]> keywords = ditributeKeywords(dbKeywords);
for (String[] subList : keywords) {
StreamCrawler streamCrawler = new StreamCrawler();
streamCrawler.setKeywords(subList);
Thread crawlerThread = new Thread(streamCrawler);
crawlerThread.start();
}
This is how words are distributed among threads. Each thread receives no more than 200 words.
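The ditributeKeywords method itself is not shown in the question. A minimal sketch of what such a helper could look like, assuming it simply cuts the array into consecutive chunks of at most 200 entries (the class name KeywordChunker, the method name and the maxPerChunk parameter are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the question's actual implementation.
final class KeywordChunker {

    // Splits keywords into consecutive chunks of at most maxPerChunk entries.
    static List<String[]> distributeKeywords(String[] keywords, int maxPerChunk) {
        if (maxPerChunk <= 0) {
            throw new IllegalArgumentException("maxPerChunk must be positive");
        }
        List<String[]> chunks = new ArrayList<>();
        for (int start = 0; start < keywords.length; start += maxPerChunk) {
            int end = Math.min(start + maxPerChunk, keywords.length);
            chunks.add(Arrays.copyOfRange(keywords, start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] keywords = new String[650];
        for (int i = 0; i < keywords.length; i++) {
            keywords[i] = "kw" + i;
        }
        List<String[]> chunks = distributeKeywords(keywords, 200);
        // 650 keywords give chunks of 200, 200, 200 and 50
        System.out.println(chunks.size() + " chunks, last has " + chunks.get(chunks.size() - 1).length);
    }
}
```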
This is the implementation of the StreamCrawler:
public class StreamCrawler extends Crawler implements Runnable {
...
private String[] keywords;
public void setKeywords(String[] keywords) {
this.keywords = keywords;
}
@Override
public void run() {
TwitterStream twitterStream = getTwitterInstance();
StatusListener listener = new StatusListener() {
ArrayDeque<Tweet> tweetbuffer = new ArrayDeque<Tweet>();
ArrayDeque<TwitterUser> userbuffer = new ArrayDeque<TwitterUser>();
@Override
public void onException(Exception arg0) {
System.out.println(arg0);
}
@Override
public void onDeletionNotice(StatusDeletionNotice arg0) {
System.out.println(arg0);
}
@Override
public void onScrubGeo(long arg0, long arg1) {
System.out.println(arg1);
}
@Override
public void onStatus(Status status) {
...Doing something with message
}
@Override
public void onTrackLimitationNotice(int arg0) {
System.out.println(arg0);
try {
System.out.println("Will sleep for 5 minutes!");
Thread.sleep(5 * 60 * 1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
@Override
public void onStallWarning(StallWarning arg0) {
System.out.println(arg0);
}
};
FilterQuery fq = new FilterQuery();
String keywords[] = getKeywords();
System.out.println(keywords.length);
System.out.println("Listening for " + Arrays.toString(keywords));
fq.track(keywords);
twitterStream.addListener(listener);
twitterStream.filter(fq);
}
private long getCurrentThreadId() {
return Thread.currentThread().getId();
}
private TwitterStream getTwitterInstance() {
TwitterConfiguration configuration = null;
TwitterStream twitterStream = null;
while (configuration == null) {
configuration = TokenFactory.getAvailableToken();
if (configuration != null) {
System.out
.println("Token was obtained " + getCurrentThreadId());
System.out.println(configuration.getTwitterAccount());
setToken(configuration);
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true);
cb.setOAuthConsumerKey(configuration.getConsumerKey());
cb.setOAuthConsumerSecret(configuration.getConsumerSecret());
cb.setOAuthAccessToken(configuration.getAccessToken());
cb.setOAuthAccessTokenSecret(configuration.getAccessSecret());
twitterStream = new TwitterStreamFactory(cb.build())
.getInstance();
} else {
// If there is no available configuration, wait for 2 minutes
// and try again
try {
System.out
.println("There were no available tokens, sleeping for 2 minutes.");
Thread.sleep(2 * 60 * 1000);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
return twitterStream;
}
}
So my problem is that when I start, for example, 2 threads, I get notifications that both of them are opening a stream and receiving it. But actually only the first one really receives the stream and calls the onStatus method. The array used in the second thread is not empty, and its TwitterConfiguration is also valid and unique. So I don't understand what the reason for this behavior might be. Why does only the first thread return tweets?

As far as I see you're trying to make two simultaneous connections to the public streaming endpoints (a.k.a. general streams or stream.twitter.com) from the same IP.
More specifically, I think you want two active connections to stream.twitter.com/1.1/statuses/filter.json from the same IP.
Although the Twitter streaming API documentation doesn't clearly state that only one standing connection to the public endpoints is allowed, Twitter employees clarify this on the dev site: https://dev.twitter.com/discussions/7542
For general streams, you should only make one connection from the same IP.
This means that it doesn't matter whether you use two different Twitter applications/accounts to connect to public streams; as long as you're connecting from the same IP address, you can have only one standing connection to the public streams. You said that you got both streams connected, and the explanation for this behaviour is given by a Twitter employee: https://dev.twitter.com/discussions/14935
You may find that at times stream.twitter.com lets you get away with more open connections here or there, but that behavior shouldn't be counted on.
If, for instance, you try to connect to the user stream instead in the 2nd thread (twitter4j TwitterStream user() method), then you'll really start getting both filter and user streams.
Regarding the 200 track keywords limit, the twitter4j.org javadoc is probably a little outdated. Here is what the Twitter API docs say:
The default access level allows up to 400 track keywords, 5,000 follow userids and 25 0.1-360 degree location boxes. If you need elevated access to the Streaming API, you should explore our partner providers of Twitter data ...
So, if you need to go beyond 400, you'll probably want to ask Twitter for an increased track access level for your Twitter account application, or work with certified partner providers of Twitter data.
Another thing you don't necessarily need is to start new threads for getting the streams, since the twitter4j filter (or user) "method internally creates a thread which manipulates TwitterStream and calls adequate listener methods continuously" (quoted from an example code by Yusuke Yamamoto).
I hope this helps. (I couldn't post more links because I'm getting this: "You need at least 10 reputation to post more than 2 links")
