Java parallelStream spawns other parallelStreams and fails seldom - java

Considering the following function:
public void execute4() {
    File filePath = new File(filePathData);
    File[] files = filePath.listFiles((File filePathData) -> filePathData.getName().endsWith("CDR"));
    List<CDR> cdrs = new ArrayList<CDR>();
    Arrays.asList(files).parallelStream().forEach(file -> readCDRP(cdrs, file));
    cdrs.sort(cdrsorter);
}
which reads a list of Files containing CDR and executes the readCDRP() which is this:
private void readCDRP(List<CDR> cdrs, File file) {
    final CDR cdr = new CDR(file.getName());
    try (BufferedReader bfr = new BufferedReader(new FileReader(file))) {
        List<String> lines = bfr.lines().collect(Collectors.toList());
        lines.parallelStream().forEach(e -> {
            String[] data = e.split(",", -1);
            CDREntry entry = new CDREntry(file.getName());
            for (int i = 0; i < data.length; i++) {
                entry.setField(i, data[i]);
            }
            cdr.addEntry(entry);
        });
        if (cdr != null) {
            cdrs.add(cdr);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
What I observe is that occasionally, and NOT all the time, I get an ArrayIndexOutOfBoundsException either in the readCDRP function at the line below (which is awkward, as the list of CDRs is an ArrayList):
cdr.addEntry(entry);
or
at the last line of execute4(), where I apply the sorting.
I think the issue is that the first parallelStream from execute4() is not in a separate space in memory from the second parallelStream execution inside readCDRP(), and that they wrongly share data. I say "think" because I can't confirm it; it's just a hunch.
The questions are:
Is my code buggy to the bone from a JDK 8 perspective?
Is there a workaround using the same flow, for example using a CountDownLatch?
Is this a limitation of the ForkJoinPool?
Thanks for any response.
EDIT(1):
The addEntry method is part of the CDR class itself:
class CDR {
    public final String fileName;
    private final List<CDREntry> entries = new ArrayList<CDREntry>();

    public CDR(String fileName) {
        super();
        this.fileName = fileName;
    }

    public List<CDREntry> getEntries() {
        return entries;
    }

    public List<CDREntry> addEntry(CDREntry e) {
        entries.add(e);
        return entries;
    }

    public String getFileName() {
        return this.fileName;
    }
}

Your code is broken from a thread safety point of view. In readCDRP you add elements to the cdrs list, which is an ArrayList that does not support concurrent writes. That is why it breaks.
A better approach would be to have readCDR return a cdr object and do something like:
List<CDR> cdrs = Arrays.stream(files)
        .parallel()
        .map(this::readCDR)
        .collect(Collectors.toList());
Also, using parallel streams for IO related operations is generally a bad idea, but that is another discussion.
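If the IO cost matters, one way around that is to keep the blocking reads off the common ForkJoinPool (which all parallel streams in the process share) by using a dedicated ExecutorService. A minimal sketch, assuming readCDR(File) returns a fully built CDR as suggested above and cdrsorter is the comparator from the question; the pool size is an assumption to tune for your disks:

public List<CDR> executeWithPool(File[] files) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4); // size is a guess; tune it
    try {
        List<Future<CDR>> futures = new ArrayList<>();
        for (File f : files) {
            futures.add(pool.submit(() -> readCDR(f))); // each file is read on a pool thread
        }
        List<CDR> cdrs = new ArrayList<>();
        for (Future<CDR> future : futures) {
            cdrs.add(future.get());                     // collected on one thread: no shared mutation
        }
        cdrs.sort(cdrsorter);
        return cdrs;
    } finally {
        pool.shutdown();
    }
}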

When you start programming in a functional style, you should prefer immutable objects that can be fully created via their constructor (or perhaps using the builder pattern or a factory method). So your CDREntry class may look like this:
class CDREntry {
    private final String[] fields;
    private final String name;

    public CDREntry(String name, String[] fields) {
        this.name = name;
        this.fields = fields;
    }
    // Add getters and whatever
}
And your CDR class may look like this:
class CDR {
    private final String fileName;
    private final List<CDREntry> entries;

    public CDR(String fileName, List<CDREntry> entries) {
        this.fileName = fileName;
        this.entries = entries;
    }

    public List<CDREntry> getEntries() {
        return entries;
    }

    public String getFileName() {
        return this.fileName;
    }
}
Having such classes things become easier. The rest of the code can be rewritten like this:
public void execute4() {
    File filePath = new File(filePathData);
    File[] files = filePath.listFiles((File dir, String name) ->
            name.endsWith("CDR")); // fixed this line: filter on the name argument, not the directory
    List<CDR> cdrs = Arrays.stream(files).parallel()
            .map(this::readCDRP).sorted(cdrsorter)
            .collect(Collectors.toList());
}
private CDR readCDRP(File file) {
    try (BufferedReader bfr = new BufferedReader(new FileReader(file))) {
        // I'm not sure that collecting the lines into a list
        // before the main processing was actually necessary
        return bfr.lines().parallel()
                .map(e -> new CDREntry(file.getName(), e.split(",", -1)))
                .collect(Collectors.collectingAndThen(
                        Collectors.toList(), list -> new CDR(file.getName(), list)));
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
In general, remember that forEach is usually not the cleanest way to solve a task. It may be helpful when you integrate streams into legacy code, but in general it should be avoided.

You are using a parallel stream with a lambda that has side effects
(the lambda updates the ArrayList cdrs).
Try to use a Collector or a reduction operation instead.
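For example, the per-line loop from readCDRP can be expressed as a map/collect so the stream itself accumulates the entries; Collectors.toList() merges the per-thread partial lists safely, so no shared list is ever mutated (a sketch using the CDREntry API from the question):

List<CDREntry> entries = lines.parallelStream()
        .map(line -> {
            String[] data = line.split(",", -1);
            CDREntry entry = new CDREntry(file.getName());
            for (int i = 0; i < data.length; i++) {
                entry.setField(i, data[i]);
            }
            return entry;
        })
        .collect(Collectors.toList()); // the collector merges per-thread lists safely
entries.forEach(cdr::addEntry);        // back on a single thread: no racy addEntry calls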

Related

Java : OutOfMemoryError even after using GSON Streaming API

I have been working on a problem where we have a huge JSON response coming in. When we parsed it using the conventional Gson parsing technique, it threw an OutOfMemoryError, because that approach stores the whole payload in memory before processing it. As a solution, I switched to streaming the JSON response so that it would not put everything in memory. That worked fine up to around 1.6 million records, and then even that broke. This is the exception we are getting:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
This is the entire code I'm using for this:
// Getting reponse into InputStream and casting it to JsonReader object for parsing
InputStream liInStream = luURLConn.getInputStream();
lCycleTimeReader = new JsonReader(new InputStreamReader(liInStream, "UTF-8"));
Our JSON looks like this:
{
    "Report_Entry": [
        {
            "key1": "value",
            "key2": "value",
            "key3": "value",
            "key4": "value",
            "key5": "value"
        },
        {
            "key1": "value",
            "key2": "value",
            "key3": "value",
            "key4": "value",
            "key5": "value"
        }
    ]
}
This reader object is then passed into our parsing method:
public HashMap<String, HashMap<String, String>> getcycleTimeMap(JsonReader poJSONReaderObj,
        CycleTimeConstant cycleTimeConstant, int processId) {
    Integer counter = 0;
    HashMap<String, HashMap<String, String>> cycleTimeMap = new HashMap<String, HashMap<String, String>>();
    HashMap<String, HashMap<String, String>> finalcycleTimeMap = new HashMap<String, HashMap<String, String>>();
    try {
        CycleTime cycleTime = new CycleTime();
        poJSONReaderObj.beginObject();
        while (poJSONReaderObj.hasNext()) {
            String name = poJSONReaderObj.nextName();
            if (name.equals("Report_Entry")) {
                poJSONReaderObj.beginArray();
                while (poJSONReaderObj.hasNext()) {
                    JsonToken nextToken2 = poJSONReaderObj.peek();
                    if (JsonToken.BEGIN_OBJECT.equals(nextToken2)) {
                        poJSONReaderObj.beginObject();
                    } else if (JsonToken.END_OBJECT.equals(nextToken2)) {
                        poJSONReaderObj.endObject();
                    } else {
                        String nextString = "";
                        if (JsonToken.STRING.equals(nextToken2)) {
                            nextString = poJSONReaderObj.nextString();
                        } else if (JsonToken.NAME.equals(nextToken2)) {
                            nextString = poJSONReaderObj.nextName();
                        }
                        switch (nextString) {
                        case "key1":
                            cycleTime.setKey1(poJSONReaderObj.nextString());
                            break;
                        case "key2":
                            cycleTime.setKey2(poJSONReaderObj.nextString());
                            break;
                        case "key3":
                            cycleTime.setKey3(poJSONReaderObj.nextString());
                            break;
                        case "key4":
                            cycleTime.setKey4(poJSONReaderObj.nextString());
                            break;
                        case "key5":
                            cycleTime.setKey5(poJSONReaderObj.nextString());
                            break;
                        }
                    }
                    poJSONReaderObj.endObject();
                    System.out.println("Value of Map is : " + new Gson().toJson(cycleTime) + "counter : " + counter);
                    counter++;
                    System.out.println("Counter : " + counter);
                    cycleTimeMap = (HashMap<String, HashMap<String, String>>) cycleTimeBpProcessIterator(
                            cycleTime, cycleTimeConstant, counter, processId);
                }
                finalcycleTimeMap.putAll(cycleTimeMap);
            }
        }
        JsonToken nextToken = poJSONReaderObj.peek();
        if (JsonToken.END_OBJECT.equals(nextToken)) {
            poJSONReaderObj.endObject();
        } else if (JsonToken.END_ARRAY.equals(nextToken)) {
            poJSONReaderObj.endArray();
        }
    } catch (IOException ioException) {
        ioException.printStackTrace();
    }
    System.out.println("FINAL MAP TO BE LOADED : " + new Gson().toJson(finalcycleTimeMap));
    return finalcycleTimeMap;
}
POJO class for handling response:
public class CycleTime {
    private String key1 = "";
    private String key2 = "";
    private String key3 = "";
    private String key4 = "";
    private String key5 = "";

    public String getKey1() {
        return key1;
    }

    public void setKey1(String key1) {
        this.key1 = key1;
    }

    public String getKey2() {
        return key2;
    }

    public void setKey2(String key2) {
        this.key2 = key2;
    }

    public String getKey3() {
        return key3;
    }

    public void setKey3(String key3) {
        this.key3 = key3;
    }

    public String getKey4() {
        return key4;
    }

    public void setKey4(String key4) {
        this.key4 = key4;
    }

    public String getKey5() {
        return key5;
    }

    public void setKey5(String key5) {
        this.key5 = key5;
    }
}
I'm not sure what the culprit is here, but it seems to give the same error. I'm wondering what the next approach should be to avoid this OutOfMemoryError.
Streamed reading does not help you if you still collect the entire document into a single object.
Moreover, Gson uses streaming under the hood anyway; the JsonReader API is just an optional, lower-level way of reading and writing.
Your approach, however, is very far from being good:
Gson things:
The main thing is: use Gson properly, in full, and let it do its job. I couldn't run your code against the JSON document you provided: it works neither for the root JSON object nor for the only top-level entry (your deserializer is broken due to improper use of hasNext and the beginObject/endObject pair).
Common Java things:
don't catch exceptions mid-method and return a partially composed object (is it even correct?);
don't use Throwable.printStackTrace (use proper logging facilities);
if you don't want to use loggers, then print to System.err (that is the proper standard stream for such purposes);
an Integer counter is a bad idea because it creates many boxed values, especially for huge documents (a plain int is just fine);
enum values can (and should) be checked for equality with == (this is safe since they are singletons);
you can also use switch for enums (both shorter and more compile-time safe);
don't create Gson instances in a loop, especially one with that many iterations (Gson instances are immutable and thread-safe, but not that cheap to construct);
don't use maps where you can have statically typed plain objects;
what's the purpose of returning an always-one-key-value-pair map? (return the value);
Common design things:
use the most general types possible for declarations: not HashMap, but Map (what if someday you need a map with ordered keys? or no map at all?);
invert dependencies (what if you don't need a CycleTime with five keys?);
Streaming things:
if it runs into an OOM error, then what's the point of collecting a huge map that obviously cannot fit in your application's RAM? (use callbacks or promises to process a single element (a pushing approach), iterators or streams (a pulling approach), reactive streams (a pushing approach), whatever);
collect the result only if it has a small memory footprint, or use aggregation (otherwise you risk an OOM).
This is how you can reduce the memory footprint by using a pushing approach via callbacks:
@UtilityClass
public final class StreamSupport {

    public static void acceptArrayElements(@WillNotClose final JsonReader jsonReader, final Consumer<? super JsonReader> acceptElement)
            throws IOException {
        jsonReader.beginArray();
        while ( jsonReader.hasNext() ) {
            acceptElement.accept(jsonReader);
        }
        jsonReader.endArray();
    }
}
@UtilityClass
public final class CycleDeserializer {

    public static void readCycles(final JsonReader jsonReader, final Consumer<? super JsonReader> acceptJsonReader)
            throws IOException {
        jsonReader.beginObject();
        while ( jsonReader.hasNext() ) {
            switch ( jsonReader.nextName() ) {
            case "Report_Entry":
                StreamSupport.acceptArrayElements(jsonReader, acceptJsonReader);
                break;
            default:
                jsonReader.skipValue();
                break;
            }
        }
        jsonReader.endObject();
    }
}
private static final Gson gson = new GsonBuilder()
        .disableHtmlEscaping()
        .disableInnerClassSerialization()
        .create();

@Test
public void test()
        throws IOException {
    try ( final JsonReader jsonReader = openTheHugeDocument() ) {
        CycleDeserializer.readCycles(jsonReader, jr -> {
            final CycleTime cycleTime = gson.fromJson(jr, CycleTime.class);
            System.out.println(cycleTime);
        });
    }
    // do the simplest aggregation operation: `COUNT`
    try ( final JsonReader jsonReader = openTheHugeDocument() ) {
        final AtomicInteger count = new AtomicInteger();
        CycleDeserializer.readCycles(jsonReader, jr -> {
            try {
                jr.skipValue();
                count.incrementAndGet();
            } catch ( final IOException ex ) {
                throw new RuntimeException(ex);
            }
        });
        System.out.println("Count = " + count);
    }
    // this will probably fail when the document is huge because it is collected into a single collection
    // (you need to let your JVM use as much RAM as possible if it is a must for you)
    try ( final JsonReader jsonReader = openTheHugeDocument() ) {
        final Collection<CycleTime> cycleTimes = new ArrayList<>();
        CycleDeserializer.readCycles(jsonReader, jr -> {
            final CycleTime cycleTime = gson.fromJson(jr, CycleTime.class);
            cycleTimes.add(cycleTime);
        });
        System.out.println("Count in list = " + cycleTimes.size());
    }
}
As you can see, in the runner above you can choose the way you prefer to process your entries: dumb logging, a simple count, or a simple collect operation.
For the pull approach via a Stream, please see: https://stackoverflow.com/a/69282822/12232870
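For reference, that pull approach essentially adapts the JsonReader to an Iterator and wraps it in a lazy Stream. A minimal sketch, assuming the reader is already positioned inside the Report_Entry array (java.util.stream.StreamSupport is fully qualified because this answer already defines its own StreamSupport class):

public static Stream<CycleTime> cycleTimes(final JsonReader jsonReader, final Gson gson) {
    final Iterator<CycleTime> iterator = new Iterator<CycleTime>() {
        @Override
        public boolean hasNext() {
            try {
                return jsonReader.hasNext();           // more elements left in the array?
            } catch ( final IOException ex ) {
                throw new UncheckedIOException(ex);
            }
        }

        @Override
        public CycleTime next() {
            return gson.fromJson(jsonReader, CycleTime.class); // deserialize one element lazily
        }
    };
    return java.util.stream.StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED | Spliterator.NONNULL),
            false); // sequential on purpose: JsonReader is not thread-safe
}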

Read two lines of a file at once in a flink streaming process

I want to process files with a Flink stream in which two lines belong together: the first line is a header and the second line the corresponding text.
The files are located on my local file system. I am using the readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) method with a custom FileInputFormat.
My streaming job class looks like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Read> inputStream = env.readFile(new ReadInputFormatTest("path/to/monitored/folder"),
        "path/to/monitored/folder", FileProcessingMode.PROCESS_CONTINUOUSLY, 100);
inputStream.print();
env.execute("Flink Streaming Java API Skeleton");
and my ReadInputFormatTest like this:
public class ReadInputFormatTest extends FileInputFormat<Read> {
    private transient FileSystem fileSystem;
    private transient BufferedReader reader;
    private final String inputPath;
    private String headerLine;
    private String readLine;

    public ReadInputFormatTest(String inputPath) {
        this.inputPath = inputPath;
    }

    @Override
    public void open(FileInputSplit inputSplit) throws IOException {
        FileSystem fileSystem = getFileSystem();
        this.reader = new BufferedReader(new InputStreamReader(fileSystem.open(inputSplit.getPath())));
        this.headerLine = reader.readLine();
        this.readLine = reader.readLine();
    }

    private FileSystem getFileSystem() {
        if (fileSystem == null) {
            try {
                fileSystem = FileSystem.get(new URI(inputPath));
            } catch (URISyntaxException | IOException e) {
                throw new RuntimeException(e);
            }
        }
        return fileSystem;
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return headerLine == null;
    }

    @Override
    public Read nextRecord(Read r) throws IOException {
        r.setHeader(headerLine);
        r.setSequence(readLine);
        headerLine = reader.readLine();
        readLine = reader.readLine();
        return r;
    }
}
As expected, the headers and the text are stored together in one object. However, the file is read eight times. So the problem is the parallelization. Where and how can I specify that a file is processed only once, but several files in parallel?
Or do I have to change my custom FileInputFormat even further?
I would modify your source to emit the available filenames (instead of the actual file contents) and then add a new processor to read a name from the input stream and then emit pairs of lines. In other words, split the current source into a source followed by a processor. The processor can be made to run at any degree of parallelism and the source would be a single instance.
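A minimal sketch of that split, assuming Read has a no-arg constructor and using env.fromElements with hypothetical paths as a stand-in for a real single-instance filename source:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

// Source: emits file names only (keep this at parallelism 1 in practice).
DataStream<String> fileNames = env.fromElements("path/to/a.txt", "path/to/b.txt");

// Processor: reads a named file and emits header/sequence pairs; can run at any parallelism.
DataStream<Read> reads = fileNames
        .flatMap(new FlatMapFunction<String, Read>() {
            @Override
            public void flatMap(String path, Collector<Read> out) throws Exception {
                try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                    String header;
                    while ((header = reader.readLine()) != null) {
                        Read r = new Read();
                        r.setHeader(header);
                        r.setSequence(reader.readLine()); // the line belonging to this header
                        out.collect(r);
                    }
                }
            }
        })
        .setParallelism(4); // several files in parallel, each file read exactly once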

REST Api with Multithreading for handling Files in Spring Boot

Is there any way I can use multithreading to run these uploads in parallel and make execution faster? I created a @RestController that accepts a String and a List<MultipartFile> as request parameters, and the code works fine. The problem is that I'm parsing one file after the other in a for loop, so execution takes a long time.
Below is the controller:
@RequestMapping(value = "/csvUpload", method = RequestMethod.POST)
public List<String> csvUpload(@RequestParam String parentPkId, @RequestParam List<MultipartFile> file)
        throws IOException {
    log.info("Entered method csvUpload() of DaoController.class");
    List<String> response = new ArrayList<String>();
    String temp = parentPkId.replaceAll("[-+.^:,]", "");
    for (MultipartFile f : file) {
        String resp = uploadService.csvUpload(temp, f);
        response.add(resp);
    }
    return response;
}
From the controller I'm calling the uploadService.csvUpload() method, where I'm parsing the files one after the other in a for loop.
Below is my UploadService class:
public String csvUpload(String parentPkId, MultipartFile file) {
    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(file.getInputStream()));
        String line = "";
        int header = 0;
        while ((line = br.readLine()) != null) {
            // TO SKIP HEADER
            if (header == 0) {
                header++;
                continue;
            }
            header++;
            // Use Comma As Separator
            String[] csvDataSet = line.split(",");
            // Saving it to DB
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return "Successfully Uploaded " + file.getOriginalFilename();
}
How do I make this controller multithreaded so that the processing is parallel and fast? I'm new to multithreading, and I tried using the Callable interface, but the call() method does not take parameters.
Any leads and suggestions are welcome; thanks in advance.
You need to create a class that implements Callable, as below, store the resulting futures in a list, and finally process the futures:
public class ProcessMultipartFile implements Callable<String> {
    private MultipartFile file;
    private String temp;
    private UploadService uploadService;

    public ProcessMultipartFile(MultipartFile file, String temp, UploadService uploadService) {
        this.file = file;
        this.temp = temp;
        this.uploadService = uploadService;
    }

    public String call() throws Exception {
        return uploadService.csvUpload(temp, file);
    }
}
In your controller, create an executor and a list of Future objects:
ExecutorService executor = Executors.newFixedThreadPool(10);
List<Future<String>> futureList = new ArrayList<Future<String>>();
.
.
.
for (MultipartFile f : file) {
    futureList.add(executor.submit(new ProcessMultipartFile(f, temp, uploadService)));
}
Finally, in your controller, collect the results and shut down the executor:
for (Future<String> f : futureList) {
    response.add(f.get()); // get() throws InterruptedException/ExecutionException; declare or handle them
}
// shutting down the executor
executor.shutdown();
Hope this helps.
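If you'd rather avoid the explicit Callable class, a hedged alternative is CompletableFuture, reusing the same executor and uploadService as above:

List<CompletableFuture<String>> futures = file.stream()
        .map(f -> CompletableFuture.supplyAsync(() -> uploadService.csvUpload(temp, f), executor))
        .collect(Collectors.toList());
List<String> response = futures.stream()
        .map(CompletableFuture::join) // join() rethrows unchecked, so no try/catch is needed here
        .collect(Collectors.toList());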
You can execute the uploading code using a parallel stream:
List<String> response = file.parallelStream()
        .map(f -> uploadService.csvUpload(temp, f))
        .collect(Collectors.toList());
You can execute streams in serial or in parallel. When a stream executes in parallel, the Java runtime partitions the stream into multiple substreams. Aggregate operations iterate over and process these substreams in parallel and then combine the results.
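One caveat: by default a parallel stream runs on the JVM-wide common ForkJoinPool, which is sized for CPU-bound work rather than blocking IO like these uploads. A common workaround is to invoke the stream from inside a dedicated pool; this sketch leans on long-standing but undocumented behavior of the stream implementation, so treat it as a workaround, not an API:

ForkJoinPool ioPool = new ForkJoinPool(8); // size is an assumption; IO-bound work often wants more threads than cores
try {
    List<String> response = ioPool.submit(() ->
            file.parallelStream()
                .map(f -> uploadService.csvUpload(temp, f))
                .collect(Collectors.toList()))
        .join(); // the stream's tasks run in ioPool instead of the common pool
} finally {
    ioPool.shutdown();
}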

How do I turn my input text file into individual variables? Java

I need help with a small Mars lander video game I'm making for my computer science class. We have to read a game config text file using Scanner and use it as the rules for the different aspects of our game (gravity, amount of fuel you have, etc.). She gave us different text files that all have different difficulties and values, but they all have the same format, so I need to be able to simply point at a different text file and have a new level ready to play. My question is:
How do I get the input from the file into separate variables so that I can manipulate them to create the game?
Here's the code for reading the text file; it also prints it out to the console:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class MarsLander {
    public static void main(String[] args) {
        try {
            Scanner sc = new Scanner(new File("gameConfig.txt"));
            while (sc.hasNext()) {
                String s = sc.next();
                System.out.println(s);
            }
            sc.close();
        } catch (FileNotFoundException e) {
            System.out.println("Failed to open file!");
        }
    }
}
Here is one of the text game config files:
1000 500
mars_sky.jpg
ship.png ship_bottom.png ship_left.png ship_right.png ship_landed.png ship_crashed.png
20 50
500.0 400.0
100
thrust.wav yay.wav explosion.wav
-0.1
2.0
0.5
500 50
In my solution, I was thinking about something slightly more generic. Let me first show you the piece of code I wrote. I will then explain its behaviour and particularities.
public class GameExample {
    private static class Game {
        private Long x, y;
        private List<String> images = new ArrayList<>();

        private Game(final Long x, final Long y, final List<String> images) {
            this.x = x;
            this.y = y;
            this.images = images;
        }

        public Long getX() {
            return x;
        }

        public Long getY() {
            return y;
        }

        public List<String> getImages() {
            return images;
        }

        public static class Builder {
            // Parsing methods used by the builder to read the file and build the configuration
            // TODO: add here builder methods for each line of the file
            private final List<BiFunction<String, Game.Builder, Game.Builder>> parsingMethods = Arrays.asList(
                    (str, builder) -> builder.withPositions(str),
                    (str, builder) -> builder.withImages(str));

            private Long x, y;
            private List<String> images = new ArrayList<>();

            private Builder withPositions(final String str) {
                String[] positions = str.split(" ");
                x = Long.valueOf(positions[0]);
                y = Long.valueOf(positions[1]);
                return this;
            }

            private Builder withImages(final String str) {
                Stream.of(str.split(" ")).forEach(imgStr -> images.add(imgStr));
                return this;
            }

            public Game build(final String filename) throws IOException {
                // Read the file line by line (try-with-resources closes the underlying stream)
                try (Stream<String> lineStream = Files.lines(Paths.get(filename))) {
                    List<String> lines = lineStream.collect(Collectors.toList());
                    // Iterate over each line and call the configured parsing method
                    IntStream.range(0, lines.size()).forEach(
                            index -> parsingMethods.get(index)
                                    .apply(lines.get(index), this));
                }
                // Build an instance of the game
                return new Game(x, y, images);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        final Game.Builder builder = new Game.Builder();
        Game game = builder.build("file.txt");
        System.out.println(game.getX() + ":" + game.getY());
        System.out.println(game.getImages());
    }
}
This piece of code would output:
10:15
[test.jpg]
with a given configuration file containing:
10 15
test.jpg
Let me explain what was done. We define a Game builder that has only one public method, with the signature Game build(final String filename). It takes the filename and builds the game from the content of this file. The cornerstone of this approach is that the builder defines a list that determines which method of the builder is used for each line of the file:
private final List<BiFunction<String, Game.Builder, Game.Builder>> parsingMethods = Arrays.asList(
        (str, builder) -> builder.withPositions(str),
        (str, builder) -> builder.withImages(str));
This list says:
Use the method withPositions for the first line
Use the method withImages for the second line
Now, the build method implements the logic that executes these methods on their lines:
// Iterate over each line and call the configured parsing method
IntStream.range(0, lines.size()).forEach(
        index -> parsingMethods.get(index)
                .apply(lines.get(index), this));
We can therefore easily parse a new line of data by doing the following (see the sketch after this list):
Add a new private method in the builder describing how to parse the line;
Add this method to the list called parsingMethods.
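As a hedged illustration, supporting a third config line holding gravity (the withGravity method below is hypothetical, matching the sample file's single double such as -0.1) would look like:

// Step 1: a new private method describing how to parse the line
private Double gravity;

private Builder withGravity(final String str) {
    gravity = Double.valueOf(str.trim());
    return this;
}

// Step 2: register it in the existing initializer of parsingMethods
private final List<BiFunction<String, Game.Builder, Game.Builder>> parsingMethods = Arrays.asList(
        (str, builder) -> builder.withPositions(str),
        (str, builder) -> builder.withImages(str),
        (str, builder) -> builder.withGravity(str)); // third line of the file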

Java threads - waiting on all child threads in order to proceed

So, a little background:
I am working on a project in which a servlet releases crawlers upon a lot of text files within a file system. I was thinking of dividing the load across multiple threads. For example:
a crawler enters a directory, finds 3 files and 6 directories. It starts processing the files and starts a thread with a new crawler for each of the other directories. So from my creator class I would create a single crawler upon a base directory. That crawler would assess the workload and, if deemed necessary, spawn another crawler under another thread.
My crawler class looks like this
package com.fujitsu.spider;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;

public class DocumentSpider implements Runnable, Serializable {
    private static final long serialVersionUID = 8401649393078703808L;
    private Spidermode currentMode = null;
    private String URL = null;
    private String[] terms = null;
    private float score = 0;
    private ArrayList<SpiderDataPair> resultList = null;

    public enum Spidermode {
        FILE, DIRECTORY
    }

    public DocumentSpider(String resourceURL, Spidermode mode, ArrayList<SpiderDataPair> resultList) {
        currentMode = mode;
        setURL(resourceURL);
        this.setResultList(resultList);
    }

    @Override
    public void run() {
        try {
            if (currentMode == Spidermode.FILE) {
                doCrawlFile();
            } else {
                doCrawlDirectory();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("SPIDER @ " + URL + " HAS FINISHED.");
    }

    public Spidermode getCurrentMode() {
        return currentMode;
    }

    public void setCurrentMode(Spidermode currentMode) {
        this.currentMode = currentMode;
    }

    public String getURL() {
        return URL;
    }

    public void setURL(String uRL) {
        URL = uRL;
    }

    public void doCrawlFile() throws Exception {
        File target = new File(URL);
        if (target.isDirectory()) {
            throw new Exception(
                    "This URL points to a directory while the spider is in FILE mode. Please change this spider to DIRECTORY mode.");
        }
        procesFile(target);
    }

    public void doCrawlDirectory() throws Exception {
        File baseDir = new File(URL);
        if (!baseDir.isDirectory()) {
            throw new Exception(
                    "This URL points to a FILE while the spider is in DIRECTORY mode. Please change this spider to FILE mode.");
        }
        File[] directoryContent = baseDir.listFiles();
        for (File f : directoryContent) {
            if (f.isDirectory()) {
                DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.DIRECTORY, this.resultList);
                spider.terms = this.terms;
                (new Thread(spider)).start();
            } else {
                DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.FILE, this.resultList);
                spider.terms = this.terms;
                (new Thread(spider)).start();
            }
        }
    }

    public void procesDirectory(String target) throws IOException {
        File base = new File(target);
        File[] directoryContent = base.listFiles();
        for (File f : directoryContent) {
            if (f.isDirectory()) {
                procesDirectory(f.getPath());
            } else {
                procesFile(f);
            }
        }
    }

    public void procesFile(File target) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(target));
        String line;
        while ((line = br.readLine()) != null) {
            String[] words = line.split(" ");
            for (String currentWord : words) {
                for (String a : terms) {
                    if (a.toLowerCase().equalsIgnoreCase(currentWord)) {
                        score += 1f;
                    }
                    if (currentWord.toLowerCase().contains(a)) {
                        score += 1f;
                    }
                }
            }
        }
        br.close();
        resultList.add(new SpiderDataPair(this, URL));
    }

    public String[] getTerms() {
        return terms;
    }

    public void setTerms(String[] terms) {
        this.terms = terms;
    }

    public float getScore() {
        return score;
    }

    public void setScore(float score) {
        this.score = score;
    }

    public ArrayList<SpiderDataPair> getResultList() {
        return resultList;
    }

    public void setResultList(ArrayList<SpiderDataPair> resultList) {
        this.resultList = resultList;
    }
}
The problem I am facing is that in my root crawler I have a list of results from every crawler that I want to process further. However, the operation that processes the data from this list is called from the servlet (or the main method in this example) before all of the crawlers have completed their work, so it launches too soon and sees incomplete data.
I tried solving this using the join method, but unfortunately I can't seem to figure it out.
package com.fujitsu.spider;

import java.util.ArrayList;

import com.fujitsu.spider.DocumentSpider.Spidermode;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        ArrayList<SpiderDataPair> results = new ArrayList<SpiderDataPair>();
        String[] terms = {"SERVER", "CHANGE", "MO"};

        DocumentSpider spider1 = new DocumentSpider("C:\\Users\\Mark\\workspace\\Spider\\Files", Spidermode.DIRECTORY, results);
        spider1.setTerms(terms);
        DocumentSpider spider2 = new DocumentSpider("C:\\Users\\Mark\\workspace\\Spider\\File2", Spidermode.DIRECTORY, results);
        spider2.setTerms(terms);

        Thread t1 = new Thread(spider1);
        Thread t2 = new Thread(spider2);
        t1.start();
        t1.join();
        t2.start();
        t2.join();

        for (SpiderDataPair d : spider1.getResultList()) {
            System.out.println("PATH -> " + d.getFile() + " SCORE -> " + d.getSpider().getScore());
        }
        for (SpiderDataPair d : spider2.getResultList()) {
            System.out.println("PATH -> " + d.getFile() + " SCORE -> " + d.getSpider().getScore());
        }
    }
}
TL;DR:
I really want to understand this subject, so any help would be immensely appreciated!
You need a couple of changes in your code:
In the spider:
List<Thread> threads = new LinkedList<Thread>();
for (File f : directoryContent) {
    if (f.isDirectory()) {
        DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.DIRECTORY, this.resultList);
        spider.terms = this.terms;
        Thread thread = new Thread(spider);
        threads.add(thread);
        thread.start();
    } else {
        DocumentSpider spider = new DocumentSpider(f.getPath(), Spidermode.FILE, this.resultList);
        spider.terms = this.terms;
        Thread thread = new Thread(spider);
        threads.add(thread);
        thread.start();
    }
}
for (Thread thread : threads) thread.join();
The idea is to create a new thread for each spider and start it. Once they are all running, you wait until each one is done before the spider itself finishes. This way each spider thread keeps running until all of its work is done (and thus the top thread runs until all children, and their children, are finished).
You also need to change your runner so that it runs the two spiders in parallel instead of one after the other, like this:
Thread t1 = new Thread(spider1);
Thread t2 = new Thread(spider2);
t1.start();
t2.start();
t1.join();
t2.join();
You should use a higher-level library than bare Thread for this task. I would suggest looking into ExecutorService in particular and all of java.util.concurrent generally. There are abstractions there that can manage all of the threading issues while providing well-formed tasks a properly protected environment in which to run.
For your specific problem, I would recommend some sort of blocking queue of tasks and a standard producer-consumer architecture. Each task knows how to determine if its path is a file or directory. If it is a file, process the file; if it is a directory, crawl the directory's immediate contents and enqueue new tasks for each sub-path. You could also use some properly-synchronized shared state to cap the number of files processed, depth, etc. Also, the service provides the ability to await termination of its tasks, making the "join" simpler.
With this architecture, you decouple threads and thread management (handled by the ExecutorService) from your business logic of tasks (typically a Runnable or Callable). The service itself can be tuned in how it instantiates threads, such as a fixed maximum number of threads or a number that scales with how many concurrent tasks exist (see the factory methods on java.util.concurrent.Executors). Threads, which are more expensive than the Runnables they execute, are re-used to conserve resources.
If your objective is primarily something functional that works in production quality, then the library is the way to go. However, if your objective is to understand the lower-level details of thread management, then you may want to investigate the use of latches and perhaps thread groups to manage them at a lower level, exposing the details of the implementation so you can work with the details.
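As a concrete illustration of the task-based idea, here is a minimal fork/join sketch (one flavor of the producer-consumer design described above; scoreFile is an assumed helper standing in for procesFile's counting logic):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class CrawlTask extends RecursiveTask<Float> {
    private final File path;
    private final String[] terms;

    CrawlTask(File path, String[] terms) {
        this.path = path;
        this.terms = terms;
    }

    @Override
    protected Float compute() {
        if (path.isFile()) {
            return scoreFile(path, terms);  // assumed helper: score one file
        }
        List<CrawlTask> subtasks = new ArrayList<>();
        File[] children = path.listFiles();
        if (children != null) {
            for (File child : children) {
                CrawlTask task = new CrawlTask(child, terms);
                task.fork();                // enqueue the sub-path as a new task
                subtasks.add(task);
            }
        }
        float total = 0;
        for (CrawlTask task : subtasks) {
            total += task.join();           // the built-in "join": wait for all children
        }
        return total;
    }
}

// usage: float score = new ForkJoinPool().invoke(new CrawlTask(baseDir, terms));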
