I am trying to submit a JAR containing a Spark job to a YARN cluster from Java code, using SparkLauncher to submit the SparkPi example:
Process spark = new SparkLauncher()
        .setAppResource("C:\\spark-1.4.1-bin-hadoop2.6\\lib\\spark-examples-1.4.1-hadoop2.6.0.jar")
        .setMainClass("org.apache.spark.examples.SparkPi")
        .setMaster("yarn-cluster")
        .launch();
System.out.println("Waiting for finish...");
int exitCode = spark.waitFor();
System.out.println("Finished! Exit code:" + exitCode);
There are two problems:
While submitting in "yarn-cluster" mode, the application is successfully submitted to YARN and executes successfully (it is visible in the YARN UI, reported as SUCCESS, and pi is printed in the output). However, the submitting application is never notified that processing has finished; it hangs forever after printing "Waiting for finish...". The log of the container can be found here
While submitting in "yarn-client" mode, the application does not appear in the YARN UI and the submitting application hangs at "Waiting for finish...". When the hanging code is killed, the application shows up in the YARN UI and is reported as SUCCESS, but the output is empty (pi is not printed). The log of the container can be found here
I tried executing the submitting application with both Oracle Java 7 and Java 8.
I got help on the Spark mailing list. The key is to read (drain) both getInputStream() and getErrorStream() of the Process; otherwise the child process may fill up its output buffer and deadlock - see the Oracle docs regarding Process. The streams should be read in separate threads:
Process spark = new SparkLauncher()
        .setSparkHome("C:\\spark-1.4.1-bin-hadoop2.6")
        .setAppResource("C:\\spark-1.4.1-bin-hadoop2.6\\lib\\spark-examples-1.4.1-hadoop2.6.0.jar")
        .setMainClass("org.apache.spark.examples.SparkPi")
        .setMaster("yarn-cluster")
        .launch();
InputStreamReaderRunnable inputStreamReaderRunnable = new InputStreamReaderRunnable(spark.getInputStream(), "input");
Thread inputThread = new Thread(inputStreamReaderRunnable, "LogStreamReader input");
inputThread.start();
InputStreamReaderRunnable errorStreamReaderRunnable = new InputStreamReaderRunnable(spark.getErrorStream(), "error");
Thread errorThread = new Thread(errorStreamReaderRunnable, "LogStreamReader error");
errorThread.start();
System.out.println("Waiting for finish...");
int exitCode = spark.waitFor();
System.out.println("Finished! Exit code:" + exitCode);
where the InputStreamReaderRunnable class is:
public class InputStreamReaderRunnable implements Runnable {

    private BufferedReader reader;
    private String name;

    public InputStreamReaderRunnable(InputStream is, String name) {
        this.reader = new BufferedReader(new InputStreamReader(is));
        this.name = name;
    }

    public void run() {
        System.out.println("InputStream " + name + ":");
        try {
            String line = reader.readLine();
            while (line != null) {
                System.out.println(line);
                line = reader.readLine();
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Since this is an old post, I would like to add an update that might help whoever reads this later. Spark 1.6.0 added some functions to the SparkLauncher class, notably:
def startApplication(listeners: SparkAppHandle.Listener*): SparkAppHandle
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.launcher.SparkLauncher
You can run the application without the need for additional threads for stdout and stderr handling, plus there is nice status reporting of the running application. Use this code:
import scala.collection.JavaConverters._ // needed for env.asJava

val env = Map(
  "HADOOP_CONF_DIR" -> hadoopConfDir,
  "YARN_CONF_DIR" -> yarnConfDir
)

val handle = new SparkLauncher(env.asJava)
  .setSparkHome(sparkHome)
  .setAppResource("Jar/location/.jar")
  .setMainClass("path.to.the.main.class")
  .setMaster("yarn-client")
  .setConf("spark.app.id", "AppID if you have one")
  .setConf("spark.driver.memory", "8g")
  .setConf("spark.akka.frameSize", "200")
  .setConf("spark.executor.memory", "2g")
  .setConf("spark.executor.instances", "32")
  .setConf("spark.executor.cores", "32")
  .setConf("spark.default.parallelism", "100")
  .setConf("spark.driver.allowMultipleContexts", "true")
  .setVerbose(true)
  .startApplication()

println(handle.getAppId)
println(handle.getState)
You can keep querying the state of the Spark application until it succeeds or otherwise reaches a final state.
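For example, a minimal polling loop could look like this (shown in Java, where the SparkAppHandle API is defined; handle is the value returned by startApplication above):

// Poll until the application reaches a final state (FINISHED, FAILED or KILLED).
// (Handle InterruptedException as appropriate in real code.)
while (!handle.getState().isFinal()) {
    Thread.sleep(1000); // check once per second
}
System.out.println("Final state: " + handle.getState());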
For information about how the Spark launcher server works in 1.6.0, see this link:
https://github.com/apache/spark/blob/v1.6.0/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java
I implemented this using a CountDownLatch, and it works as expected.
This is for SparkLauncher version 2.0.1, and it works in yarn-cluster mode too.
...
final CountDownLatch countDownLatch = new CountDownLatch(1);
SparkAppListener sparkAppListener = new SparkAppListener(countDownLatch);
SparkAppHandle appHandle = sparkLauncher.startApplication(sparkAppListener);
Thread sparkAppListenerThread = new Thread(sparkAppListener);
sparkAppListenerThread.start();
long timeout = 120;
countDownLatch.await(timeout, TimeUnit.SECONDS);
...
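Note that await returns false if the timeout elapses before the latch reaches zero, so a fuller sketch might check the return value; the kill() call below is just one possible reaction, an assumption rather than part of the original:

if (!countDownLatch.await(timeout, TimeUnit.SECONDS)) {
    // Timed out before the job reached a final state; react as appropriate,
    // e.g. kill the application (hypothetical choice).
    appHandle.kill();
}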
private static class SparkAppListener implements SparkAppHandle.Listener, Runnable {

    private static final Log log = LogFactory.getLog(SparkAppListener.class);
    private final CountDownLatch countDownLatch;

    public SparkAppListener(CountDownLatch countDownLatch) {
        this.countDownLatch = countDownLatch;
    }

    @Override
    public void stateChanged(SparkAppHandle handle) {
        String sparkAppId = handle.getAppId();
        State appState = handle.getState();
        if (sparkAppId != null) {
            log.info("Spark job with app id: " + sparkAppId + ",\t State changed to: " + appState + " - "
                    + SPARK_STATE_MSG.get(appState));
        } else {
            log.info("Spark job's state changed to: " + appState + " - " + SPARK_STATE_MSG.get(appState));
        }
        if (appState != null && appState.isFinal()) {
            countDownLatch.countDown();
        }
    }

    @Override
    public void infoChanged(SparkAppHandle handle) {}

    @Override
    public void run() {}
}
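The SPARK_STATE_MSG map is referenced above but not defined in the original post; a minimal sketch of what it could look like (the descriptions are illustrative assumptions):

import java.util.EnumMap;
import java.util.Map;

private static final Map<SparkAppHandle.State, String> SPARK_STATE_MSG =
        new EnumMap<>(SparkAppHandle.State.class);
static {
    SPARK_STATE_MSG.put(SparkAppHandle.State.UNKNOWN,   "The application has not reported back yet.");
    SPARK_STATE_MSG.put(SparkAppHandle.State.CONNECTED, "The application has connected to the launcher.");
    SPARK_STATE_MSG.put(SparkAppHandle.State.SUBMITTED, "The application has been submitted to the cluster.");
    SPARK_STATE_MSG.put(SparkAppHandle.State.RUNNING,   "The application is running.");
    SPARK_STATE_MSG.put(SparkAppHandle.State.FINISHED,  "The application finished with a successful status.");
    SPARK_STATE_MSG.put(SparkAppHandle.State.FAILED,    "The application finished with a failed status.");
    SPARK_STATE_MSG.put(SparkAppHandle.State.KILLED,    "The application was killed.");
}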
OK, after a few tries, I would like to rephrase my question:
I have developed a web app with Angular 5 (frontend), Spring Boot (backend), and Java 8.
The next step is to launch partner software, installed on a remote server, from the interface.
It's an .exe program with some parameters, but I wish to test by just launching PuTTY.
My Java class (using @Ankesh's answer):
@Service
public class DictService extends CoreServices {

    public ApiResponse<?> launch(String idWS, String ipp, String nom, String prenom, String ddn) {
        try {
            boolean isWindows = System.getProperty("os.name")
                    .toLowerCase().startsWith("windows");
            String path = "C:\\klinck\\PuTTY\\putty.exe";
            ProcessBuilder builder = new ProcessBuilder();
            if (isWindows) {
                // builder.command("cmd.exe", "/c", "dir");
                builder.command("cmd.exe", "/c", path);
            } else {
                // this is for bash on linux (can be omitted)
                builder.command("sh", "-c", "ls");
            }
            System.out.println(builder);
            // builder.directory(new File(System.getProperty("user.home")));
            // Redirect the error stream into the output stream
            builder.redirectErrorStream(true);
            System.out.println("" + builder.redirectErrorStream());
            System.out.println("before start");
            // Start the process here
            Process process = builder.start();
            System.out.println("after start");
            // Follow the process to get logging if required
            StreamGobbler streamGobbler =
                    new StreamGobbler(process.getInputStream(), System.out::println);
            // Submit log collection to an executor for proper scheduling and collection of logs
            System.out.println("before executors submit");
            Executors.newSingleThreadExecutor().submit(streamGobbler);
            // Collect the exit code
            System.out.println("before waitFor");
            int exitCode = process.waitFor();
            System.out.println("after waitFor");
            // Validate that the application exited without errors using the exit code
            assert exitCode == 0;
        } catch (Exception ure) {
            return new ApiResponse<>(false, "Une erreur interne est survenue. Merci de contacter le support", null, 0);
        }
        return new ApiResponse<>(true, null, null, 0);
    }

    private static class StreamGobbler implements Runnable {
        private InputStream inputStream;
        private Consumer<String> consumer;

        public StreamGobbler(InputStream inputStream, Consumer<String> consumer) {
            this.inputStream = inputStream;
            this.consumer = consumer;
        }

        @Override
        public void run() {
            System.out.println("StreamGobbler run");
            new BufferedReader(new InputStreamReader(inputStream)).lines()
                    .forEach(consumer);
        }
    }
}
A process is launched but the GUI doesn't appear.
I noticed that this process locks the Tomcat log files (stderr and stdout).
I saw that it should be possible:
Best Way to Launch External Process from Java Web-Service?
Executing external Java program from a webapp
But I haven't succeeded in adapting this code.
Here is the log:
java.lang.ProcessBuilder#7b1ead05
true
before start
after start
before executors submit
before waitFor
StreamGobbler run
It seems like the process is launched.
I don't understand. How can I solve this?
You can use your Spring Boot backend to achieve this.
In Java, the ProcessBuilder class can help you launch a command (which will run your executable). Here we are running cmd.exe, which is available on the system path.
ProcessBuilder builder = new ProcessBuilder();
if (isWindows) {
    builder.command("cmd.exe", "/c", "dir");
} else {
    // this is for bash on linux (can be omitted)
    builder.command("sh", "-c", "ls");
}
builder.directory(new File(System.getProperty("user.home")));
// Start the process here
Process process = builder.start();
// Follow the process to get logging if required
StreamGobbler streamGobbler =
        new StreamGobbler(process.getInputStream(), System.out::println);
// Submit log collection to an executor for proper scheduling and collection of logs
Executors.newSingleThreadExecutor().submit(streamGobbler);
// Collect the exit code
int exitCode = process.waitFor();
// Validate that the application exited without errors using the exit code
// (note: assert statements are no-ops unless the JVM is started with -ea)
assert exitCode == 0;
Reference: https://www.baeldung.com/run-shell-command-in-java
I am using the "selenium-java.jar" file to open headless Chrome drivers.
We use threads to open headless Chrome, and if there is an error, a thread sometimes quits without closing its browser.
So I want to implement a solution: if any headless Chrome instance has been idle for the last 20 minutes, close/quit it.
I searched on Google and found many solutions built around the standalone Selenium server, like this: https://github.com/seleniumhq/selenium/issues/1106
My problem is that I cannot switch to the standalone server now, so I have to figure out a solution with the current library.
So, is there any way to close all headless Chrome browsers that have been idle for the last 20 minutes?
Please guide.
I use selenium-java.jar with TestNG, and while I don't run headless browsers, I do clean up after a test run in the TestNG after-method, which is not quite the same as your 20-minute wait but might be of help.
When running tests on a Windows OS, I check to see if the process is running by name and terminate it:
public final class OsUtils
{
    private static final String TASKLIST = "tasklist";
    private static final String KILL = "taskkill /F /IM ";

    public static final String IE_EXE = "iexplore.exe";
    public static final String CHROME_EXE = "chrome.exe";
    public static final String EDGE_EXE = "MicrosoftEdge.exe";
    public static final String FIREFOX_EXE = "firefox.exe";

    public static boolean isProcessRunning(String processName)
    {
        Process process;
        try
        {
            process = Runtime.getRuntime().exec(TASKLIST);
        }
        catch (IOException ex)
        {
            Logger.error("Error on get runtime" + ex.getMessage());
            return false;
        }
        String line;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream())))
        {
            while ((line = reader.readLine()) != null) {
                if (line.contains(processName)) {
                    Logger.log("Process found");
                    return true;
                }
            }
        }
        catch (IOException ex)
        {
            Logger.error("Error on check for process " + processName + ": " + ex.getMessage());
        }
        return false;
    }

    public static void killProcessIfRunning(String processName)
    {
        Logger.log("Trying to kill process: " + processName);
        try
        {
            if (isProcessRunning(processName))
            {
                Runtime.getRuntime().exec(KILL + processName);
            }
        }
        catch (IOException ex)
        {
            Logger.error("Error on kill process " + processName + ": " + ex.getMessage());
        }
    }

    ...
}
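For example, a cleanup call from a TestNG after-method could look like this (the method below is an illustrative sketch, not part of the original class):

@AfterMethod(alwaysRun = true)
public void tearDownBrowserProcesses() {
    // Kill any Chrome instance a crashed thread may have left behind.
    OsUtils.killProcessIfRunning(OsUtils.CHROME_EXE);
}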
When running Safari on a Mac mini I have a similar kill command (which works for both Safari proper and the Technology Preview):
public static void killSafariProcess()
{
    Logger.log("Trying to kill Safari processes if running.");
    try
    {
        Process p = Runtime.getRuntime().exec(new String[]{"bash", "-c",
                "ps ux | grep -i app/Contents/MacOs/Safari | grep -v grep | awk '{print $2}' | xargs kill -9"});
    }
    catch (IOException ex)
    {
        Logger.error("Error on kill Safari processes: " + ex.getMessage());
    }
}
The custom Logger class just uses System.out.println(message)
You can probably do some analysis on the start times of the processes that match your driver criteria. I don't think that will tell you how long a process has been idle, but you can probably assume that if it's been running for 20 minutes (assuming your tests normally complete within minutes), it's probably orphaned.
I found this answer that shows how you can use Java to get a list of processes and see their start time. From there you should be able to find all of the drivers that are old and kill them.
An alternative might be to use Powershell to get the processes, start time, and deal with it in that way. It just depends on what you are looking for. Here's an answer to get you started down this path.
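As a concrete sketch of the Java route (this assumes Java 9+, and that matching on a command path containing "chrome" is good enough for your environment):

import java.time.Duration;
import java.time.Instant;

// Kill every process whose command path contains "chrome" and which started
// more than 20 minutes ago (ProcessHandle is a Java 9+ API).
Instant cutoff = Instant.now().minus(Duration.ofMinutes(20));
ProcessHandle.allProcesses()
        .filter(ph -> ph.info().command().map(c -> c.contains("chrome")).orElse(false))
        .filter(ph -> ph.info().startInstant().map(s -> s.isBefore(cutoff)).orElse(false))
        .forEach(ProcessHandle::destroy); // or destroyForcibly() for stubborn processes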
You could subclass ChromeDriver and implement your own proxy class with a timer that quits after 20 minutes of idle time:
public class TimedChromeDriver extends ChromeDriver {

    Timer timeOut;

    private void initTimer() {
        timeOut = new Timer();
    }

    private void startTimer() {
        // Timer.cancel() terminates the timer, so a fresh Timer is needed
        // before scheduling the next timeout task.
        timeOut.cancel();
        timeOut = new Timer();
        timeOut.schedule(
                new TimerTask() {
                    @Override
                    public void run() {
                        quit();
                    }
                },
                20 * 60 * 1000);
    }

    public TimedChromeDriver() {
        initTimer();
    }

    @Override
    public void get(String url) {
        super.get(url);
        startTimer();
    }

    // override every method of WebDriver and ChromeDriver in the same way
}
This will only work if your Java VM is not terminated before the timer is triggered. The garbage collector could also interfere. Overriding the finalize method is deprecated.
I would invest some analysis effort into your threads quitting ungracefully. This would solve your problems at the source.
I'll give some context information and hope you can get an idea of how this issue could happen.
First, the main-thread code for the whole app is attached here:
public static void main(String args[]) throws Exception {
    AppConfig appConfig = AppConfig.getInstance();
    appConfig.initBean("applicationContext.xml");
    SchedulerFactory factory = new StdSchedulerFactory();
    Scheduler _scheduler = factory.getScheduler();
    _scheduler.start();
    Thread t = new Thread((Runnable) appConfig.getBean("consumeGpzjDataLoopTask"));
    t.start();
}
The main method does just three things: it initializes beans the Spring way, starts the Quartz job scheduler, and starts the sub-thread, which subscribes to one channel in Jedis and listens for messages continuously. Here is the code for the sub-thread that starts subscribing:
@Override
public void run() {
    Properties pros = new Properties();
    Jedis sub = new Jedis(server, defaultPort, 0);
    sub.subscribe(subscriber, channelId);
}
(A screenshot of the thread stack when a message is received was attached here.)
But something weird happened in the production environment. The Quartz job scheduler runs properly, while the consumeGpzjDataLoopTask thread seems to have exited somehow! I really can't see how this issue could even happen. As you can see, the sub-thread inits one Jedis instance with timeout 0, which stands for blocking indefinitely, so I thought the sub-thread should not be closed unless something terrible happened in the main thread. But in production, the message publisher published messages normally, the messages disappeared, and nothing related could be found in the log file, just as if the subscriber thread were already dead. BTW, I never ran into this situation when self-testing on my local machine.
Could you help me with the possible causes of this issue? Comment if any extra info is needed for the analysis. Thanks.
Edited: for @code, here's the code for the subscriber.
public class GpzjDataSubscriber extends JedisPubSub {

    private static final Logger logger = LoggerFactory.getLogger(GpzjDataSubscriber.class);
    private static final String META_INSERT_SQL = "insert into dbo.t_cl_tj_transaction_meta_attributes\n" +
            "(transaction_id, meta_key, meta_value) VALUES (%d, '%s', '%s')";
    private static final String GET_EVENT_ID_SQL = "select id from t_cl_tj_monthly_golden_events_dict where target = ?";
    private static final String TRANSACTION_TB_NAME = "t_cl_tj_monthly_golden_stock_transactions";
    private static Map<String, Object> insertParams = new HashMap<String, Object>();
    private static Collection<String> metaSqlContainer = new ArrayList<String>();

    @Autowired(required = false)
    @Qualifier(value = "gpzjDao")
    private GPZJDao gpzjDao;

    public GpzjDataSubscriber() {}

    public void onMessage(String channel, String message) {
        consumeTransactionMessage(message);
        logger.info(String.format("gpzj data subscriber receives redis published message, channel %s, message %s", channel, message));
    }

    public void onSubscribe(String channel, int subscribedChannels) {
        logger.info(String.format("gpzj data subscriber subscribes redis channel success, channel %s, subscribedChannels %d",
                channel, subscribedChannels));
    }

    @Transactional(isolation = Isolation.READ_COMMITTED)
    private void consumeTransactionMessage(String msg) {
        final GpzjDataTransactionOrm jsonOrm = JSON.parseObject(msg, GpzjDataTransactionOrm.class);
        if (jsonOrm != null) {
            // Parse the extended attributes only after the null check on jsonOrm
            Map<String, String> extendedAttrs = (jsonOrm.getAttr() == null || jsonOrm.getAttr().isEmpty())
                    ? null : JSON.parseObject(jsonOrm.getAttr(), HashMap.class);
            SimpleJdbcInsert insertActor = gpzjDao.getInsertActor(TRANSACTION_TB_NAME);
            initInsertParams(jsonOrm);
            Long transactionId = insertActor.executeAndReturnKey(insertParams).longValue();
            if (extendedAttrs == null || extendedAttrs.isEmpty()) {
                return;
            }
            metaSqlContainer.clear();
            for (Map.Entry<String, String> e : extendedAttrs.entrySet()) {
                metaSqlContainer.add(String.format(META_INSERT_SQL, transactionId.intValue(), e.getKey(), e.getValue()));
            }
            int[] insertMetaResult = gpzjDao.batchUpdate(metaSqlContainer.toArray(new String[0]));
        }
    }

    private void initInsertParams(GpzjDataTransactionOrm orm) {
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Integer eventId = gpzjDao.queryForInt(GET_EVENT_ID_SQL, orm.getTarget());
        insertParams.clear();
        insertParams.put("khid", orm.getKhid());
        insertParams.put("attr", orm.getAttr());
        insertParams.put("event_id", eventId);
        insertParams.put("user_agent", orm.getUser_agent());
        insertParams.put("referrer", orm.getReferrer());
        insertParams.put("page_url", orm.getPage_url());
        insertParams.put("channel", orm.getChannel());
        insertParams.put("os", orm.getOs());
        insertParams.put("screen_width", orm.getScreen_width());
        insertParams.put("screen_height", orm.getScreen_height());
        insertParams.put("note", orm.getNote());
        insertParams.put("create_time", df.format(new Date()));
        insertParams.put("already_handled", 0);
    }
}
I send a message using the EventBus and I want to get the reply message into a variable and then return it. This is the code block:
public class MessageExecute {

    private static final Logger logger = LoggerFactory.getLogger(MessageExecute.class);

    public static <T> T sendMessage(Vertx vertx, String address, T message) {
        Future<Message<T>> future = Future.future();
        vertx.eventBus().send(address, message, future.completer());
        future.setHandler(new Handler<AsyncResult<Message<T>>>() {
            @Override
            public void handle(AsyncResult<Message<T>> event) {
                logger.info("received reply message | thread - " + Thread.currentThread().getName());
            }
        });
        boolean notFound = true;
        while (notFound) {
            try {
                if (future.result() != null) {
                    notFound = false;
                }
            } catch (Exception e) {
            }
        }
        return message;
    }
}
Actually, this is working fine. But sometimes the while block never exits; that is, future.result() never gets the value, even after the reply message is received. I don't know whether this is the correct way, and I don't have a clear idea of how Futures work in Vert.x. Is there any other way to implement this kind of scenario?
I recommend reading about the vertx-sync project - http://vertx.io/docs/vertx-sync/java/
Among its examples there is one that appears very similar to your case:
EventBus eb = vertx.eventBus();
HandlerReceiverAdaptor<Message<String>> adaptor = streamAdaptor();
eb.<String>consumer("some-address").handler(adaptor);
// Receive 10 messages from the consumer:
for (int i = 0; i < 10; i++) {
Message<String> received1 = adaptor.receive();
System.out.println("got message: " + received1.body());
}
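For the request-reply pattern in your question, vertx-sync also offers awaitResult. A minimal sketch (this assumes the code runs on a fiber, e.g. inside a SyncVerticle, with the io.vertx:vertx-sync dependency on the classpath; the address and payload are illustrative):

import io.vertx.core.eventbus.Message;
import static io.vertx.ext.sync.Sync.awaitResult;

// Send a message and block the fiber (not the event loop) until the reply arrives.
Message<String> reply = awaitResult(h -> vertx.eventBus().send("some-address", "ping", h));
System.out.println("got reply: " + reply.body());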
So, a little background:
I am working on a project in which a servlet is going to release crawlers upon a lot of text files within a file system. I was thinking of dividing the load over multiple threads, for example:
a crawler enters a directory, finds 3 files and 6 directories; it starts processing the files and spawns a new crawler in a new thread for each of the directories. So from my creator class I would create a single crawler on a base directory. The crawler would assess the workload and, if deemed necessary, spawn another crawler under another thread.
My crawler class looks like this:
package com.fujitsu.spider;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
public class DocumentSpider implements Runnable, Serializable {

    private static final long serialVersionUID = 8401649393078703808L;

    private Spidermode currentMode = null;
    private String URL = null;
    private String[] terms = null;
    private float score = 0;
    private ArrayList<SpiderDataPair> resultList = null;

    public enum Spidermode {
        FILE, DIRECTORY
    }

    public DocumentSpider(String resourceURL, Spidermode mode, ArrayList<SpiderDataPair> resultList) {
        currentMode = mode;
        setURL(resourceURL);
        this.setResultList(resultList);
    }

    @Override
    public void run() {
        try {
            if (currentMode == Spidermode.FILE) {
                doCrawlFile();
            } else {
                doCrawlDirectory();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("SPIDER @ " + URL + " HAS FINISHED.");
    }

    public Spidermode getCurrentMode() {
        return currentMode;
    }

    public void setCurrentMode(Spidermode currentMode) {
        this.currentMode = currentMode;
    }

    public String getURL() {
        return URL;
    }

    public void setURL(String uRL) {
        URL = uRL;
    }

    public void doCrawlFile() throws Exception {
        File target = new File(URL);
        if (target.isDirectory()) {
            throw new Exception(
                    "This URL points to a directory while the spider is in FILE mode. Please change this spider to FILE mode.");
        }
        procesFile(target);
    }

    public void doCrawlDirectory() throws Exception {
        File baseDir = new File(URL);
        if (!baseDir.isDirectory()) {
            throw new Exception(
                    "This URL points to a FILE while the spider is in DIRECTORY mode. Please change this spider to DIRECTORY mode.");
        }
        File[] directoryContent = baseDir.listFiles();
        for (File f : directoryContent) {
            if (f.isDirectory()) {
                DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.DIRECTORY, this.resultList);
                spider.terms = this.terms;
                (new Thread(spider)).start();
            } else {
                DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.FILE, this.resultList);
                spider.terms = this.terms;
                (new Thread(spider)).start();
            }
        }
    }

    public void procesDirectory(String target) throws IOException {
        File base = new File(target);
        File[] directoryContent = base.listFiles();
        for (File f : directoryContent) {
            if (f.isDirectory()) {
                procesDirectory(f.getPath());
            } else {
                procesFile(f);
            }
        }
    }

    public void procesFile(File target) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(target));
        String line;
        while ((line = br.readLine()) != null) {
            String[] words = line.split(" ");
            for (String currentWord : words) {
                for (String a : terms) {
                    if (a.equalsIgnoreCase(currentWord)) {
                        score += 1f;
                    }
                    if (currentWord.toLowerCase().contains(a)) {
                        score += 1f;
                    }
                }
            }
        }
        br.close();
        resultList.add(new SpiderDataPair(this, URL));
    }

    public String[] getTerms() {
        return terms;
    }

    public void setTerms(String[] terms) {
        this.terms = terms;
    }

    public float getScore() {
        return score;
    }

    public void setScore(float score) {
        this.score = score;
    }

    public ArrayList<SpiderDataPair> getResultList() {
        return resultList;
    }

    public void setResultList(ArrayList<SpiderDataPair> resultList) {
        this.resultList = resultList;
    }
}
The problem I am facing is that in my root crawler I have a list of results from every crawler that I want to process further. The operation that processes the data from this list is called from the servlet (or the main method in this example). However, the operation is always called before all of the crawlers have completed their processing, thus launching the operation to process the results too soon, which leads to incomplete data.
I tried solving this using the join methods, but unfortunately I can't seem to figure this one out.
package com.fujitsu.spider;
import java.util.ArrayList;
import com.fujitsu.spider.DocumentSpider.Spidermode;
public class Main {

    public static void main(String[] args) throws InterruptedException {
        ArrayList<SpiderDataPair> results = new ArrayList<SpiderDataPair>();
        String[] terms = {"SERVER", "CHANGE", "MO"};

        DocumentSpider spider1 = new DocumentSpider("C:\\Users\\Mark\\workspace\\Spider\\Files", Spidermode.DIRECTORY, results);
        spider1.setTerms(terms);

        DocumentSpider spider2 = new DocumentSpider("C:\\Users\\Mark\\workspace\\Spider\\File2", Spidermode.DIRECTORY, results);
        spider2.setTerms(terms);

        Thread t1 = new Thread(spider1);
        Thread t2 = new Thread(spider2);

        t1.start();
        t1.join();
        t2.start();
        t2.join();

        for (SpiderDataPair d : spider1.getResultList()) {
            System.out.println("PATH -> " + d.getFile() + " SCORE -> " + d.getSpider().getScore());
        }
        for (SpiderDataPair d : spider2.getResultList()) {
            System.out.println("PATH -> " + d.getFile() + " SCORE -> " + d.getSpider().getScore());
        }
    }
}
TL;DR: I really wish to understand this subject, so any help would be immensely appreciated!
You need a couple of changes in your code:
In the spider:
List<Thread> threads = new LinkedList<Thread>();
for (File f : directoryContent) {
    if (f.isDirectory()) {
        DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.DIRECTORY, this.resultList);
        spider.terms = this.terms;
        Thread thread = new Thread(spider);
        threads.add(thread);
        thread.start();
    } else {
        DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.FILE, this.resultList);
        spider.terms = this.terms;
        Thread thread = new Thread(spider);
        threads.add(thread);
        thread.start();
    }
}
// Wait for every child spider to finish before this spider finishes.
for (Thread thread : threads) thread.join();
The idea is to create a new thread for each spider and start it. Once they are all running, you wait until each one is done before the spider itself finishes. This way each spider thread keeps running until all of its work is done (thus the top thread runs until all children and their children are finished).
You also need to change your runner so that it runs the two spiders in parallel instead of one after the other, like this:
Thread t1 = new Thread(spider1);
Thread t2 = new Thread(spider2);
t1.start();
t2.start();
t1.join();
t2.join();
You should use a higher-level library than bare Thread for this task. I would suggest looking into ExecutorService in particular and all of java.util.concurrent generally. There are abstractions there that can manage all of the threading issues while providing well-formed tasks a properly protected environment in which to run.
For your specific problem, I would recommend some sort of blocking queue of tasks and a standard producer-consumer architecture. Each task knows how to determine if its path is a file or directory. If it is a file, process the file; if it is a directory, crawl the directory's immediate contents and enqueue new tasks for each sub-path. You could also use some properly-synchronized shared state to cap the number of files processed, depth, etc. Also, the service provides the ability to await termination of its tasks, making the "join" simpler.
With this architecture, you decouple the notion of threads and thread management (handled by the ExecutorService) from your business logic of tasks (typically a Runnable or Callable). The service itself can tune how it instantiates threads, such as a fixed maximum number or a number that scales with how many concurrent tasks exist (see the factory methods on java.util.concurrent.Executors). Threads, which are more expensive than the Runnables they execute, are re-used to conserve resources. A sketch of this architecture follows.
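A minimal sketch of that architecture (class and method names, and the fixed pool size, are illustrative, not from the original code):

import java.io.File;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlService {

    private final ExecutorService pool = Executors.newFixedThreadPool(8); // tune as needed
    private final AtomicInteger pending = new AtomicInteger();            // tasks in flight
    private final CountDownLatch done = new CountDownLatch(1);
    private final List<String> results = new CopyOnWriteArrayList<>();    // thread-safe result sink

    public List<String> crawl(File root) throws InterruptedException {
        submit(root);
        done.await();        // blocks until the last task has finished
        pool.shutdown();
        return results;
    }

    private void submit(File path) {
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                if (path.isDirectory()) {
                    File[] children = path.listFiles();
                    if (children != null) {
                        for (File f : children) submit(f);   // "producer" side: enqueue sub-paths
                    }
                } else {
                    results.add(path.getPath());             // "consumer" side: process the file here
                }
            } finally {
                // The last task to finish releases the latch.
                if (pending.decrementAndGet() == 0) done.countDown();
            }
        });
    }
}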
If your objective is primarily something functional that works in production quality, then the library is the way to go. However, if your objective is to understand the lower-level details of thread management, then you may want to investigate the use of latches and perhaps thread groups to manage them at a lower level, exposing the details of the implementation so you can work with the details.