I need to write a program in Java which will read a relatively large number (~50,000) of files in a directory tree, process the data, and output the processed data in a separate (flat) directory.
Currently I have something like this:
private void crawlDirectoryAndProcessFiles(File directory) {
for (File file : directory.listFiles()) {
if (file.isDirectory()) {
crawlDirectoryAndProcessFiles(file);
} else {
Data d = readFile(file);
ProcessedData p = d.process();
writeFile(p,file.getAbsolutePath(),outputDir);
}
}
}
Suffice to say that each of those methods is removed and trimmed down for ease of reading, but they all work fine. The whole process works fine, except that it is slow. The processing of data occurs via a remote service and takes between 5 and 15 seconds per file. Multiply that by 50,000...
I've never done anything multi-threaded before, but I figure I can get some pretty good speed increases if I do. Can anyone give some pointers how I can effectively parallelise this method?
I would use a ThreadPoolExecutor to manage the threads. You can do something like this:
private class Processor implements Runnable {
private final File file;
public Processor(File file) {
this.file = file;
}
@Override
public void run() {
Data d = readFile(file);
ProcessedData p = d.process();
writeFile(p,file.getAbsolutePath(),outputDir);
}
}
private void crawlDirectoryAndProcessFiles(File directory, Executor executor) {
for (File file : directory.listFiles()) {
if (file.isDirectory()) {
crawlDirectoryAndProcessFiles(file,executor);
} else {
executor.execute(new Processor(file));
}
}
}
You would obtain an Executor using:
ExecutorService executor = Executors.newFixedThreadPool(poolSize);
where poolSize is the maximum number of threads you want going at once. (It's important to have a reasonable number here; 50,000 threads isn't exactly a good idea. A reasonable number might be 8.) Note that after you've queued all the files, your main thread can wait until things are done by calling executor.shutdown() followed by executor.awaitTermination.
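To make the shutdown step concrete, here is a minimal sketch of the submit/shutdown/await sequence. The class name PoolDemo, the helper processAll, and the counter that stands in for the real read/process/write work are all illustrative assumptions, not part of the original code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolDemo {
    // Hypothetical stand-in for readFile/process/writeFile: each task only bumps a counter.
    static int processAll(List<String> names, int poolSize) {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        AtomicInteger done = new AtomicInteger();
        for (String name : names) {
            executor.execute(() -> done.incrementAndGet());
        }
        // Stop accepting new tasks; already queued tasks still run.
        executor.shutdown();
        try {
            // Block until every queued task has finished (or the timeout expires).
            if (!executor.awaitTermination(1, TimeUnit.HOURS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a.txt", "b.txt", "c.txt"), 2));
    }
}
```

Without the shutdown() call, awaitTermination would simply wait for the full timeout, because the pool never reaches the terminated state on its own.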
Assuming you have a single hard disk (i.e. something that can only serve one read operation at a time, not an SSD, RAID array, network file system, etc.), then you only want one thread performing IO (reading from/writing to the disk). Also, you only want as many threads doing CPU-bound operations as you have cores; otherwise time will be wasted in context switching.
Given the above restrictions, the code below should work for you. The single threaded executor ensures that only one Runnable executes at any one time. The fixed thread pool ensures no more than NUM_CPUS Runnables are executing at any one time.
One thing this does not do is to provide feedback on when processing is finished.
private final static int NUM_CPUS = 4;
private final Executor _fileReaderWriter = Executors.newSingleThreadExecutor();
private final Executor _fileProcessor = Executors.newFixedThreadPool(NUM_CPUS);
private final class Data {}
private final class ProcessedData {}
private final class FileReader implements Runnable
{
private final File _file;
FileReader(final File file) { _file = file; }
@Override public void run()
{
final Data data = readFile(_file);
_fileProcessor.execute(new FileProcessor(_file, data));
}
private Data readFile(File file) { /* ... */ return null; }
}
private final class FileProcessor implements Runnable
{
private final File _file;
private final Data _data;
FileProcessor(final File file, final Data data) { _file = file; _data = data; }
@Override public void run()
{
final ProcessedData processedData = processData(_data);
_fileReaderWriter.execute(new FileWriter(_file, processedData));
}
private ProcessedData processData(final Data data) { /* ... */ return null; }
}
private final class FileWriter implements Runnable
{
private final File _file;
private final ProcessedData _data;
FileWriter(final File file, final ProcessedData data) { _file = file; _data = data; }
@Override public void run()
{
writeFile(_file, _data);
}
private Data writeFile(final File file, final ProcessedData data) { /* ... */ return null; }
}
public void process(final File file)
{
if (file.isDirectory())
{
for (final File subFile : file.listFiles())
process(subFile);
}
else
{
_fileReaderWriter.execute(new FileReader(file));
}
}
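The pipeline above gives no signal when the last file has been written. One possible way to add that, assuming the number of files is known before submission, is a CountDownLatch that the final stage counts down. The sketch below is a hedged illustration only: PipelineCompletionDemo, runPipeline and the counter standing in for real work are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class PipelineCompletionDemo {
    // Runs `jobs` three-stage tasks (read -> process -> write) and blocks until all are written.
    static int runPipeline(int jobs) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        ExecutorService cpu = Executors.newFixedThreadPool(4);
        CountDownLatch finished = new CountDownLatch(jobs);
        AtomicInteger written = new AtomicInteger();
        for (int i = 0; i < jobs; i++) {
            io.execute(() ->               // stage 1: "read" on the IO thread
                cpu.execute(() ->          // stage 2: "process" on a CPU thread
                    io.execute(() -> {     // stage 3: "write" back on the IO thread
                        written.incrementAndGet();
                        finished.countDown();
                    })));
        }
        try {
            finished.await();              // returns once every job reached stage 3
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            io.shutdown();
            cpu.shutdown();
        }
        return written.get();
    }

    public static void main(String[] args) {
        System.out.println(runPipeline(10));
    }
}
```

If the file count is not known up front (as with a recursive crawl), a counter incremented on submission and decremented on completion, or a java.util.concurrent.Phaser, could play the same role.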
The easiest (and probably one of the most reasonable) way is to have a thread pool (take a look at the corresponding Executor). The main thread is responsible for crawling the directory. When a file is encountered, create a "Job" (which is a Runnable/Callable) and let the Executor handle the job.
(This should be sufficient for you to start, I prefer not giving too much concrete code coz it should not be difficult for you to figure out once you have read the Executor, Callable etc part)
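For readers who want a starting point anyway, here is a small sketch of the "Job as a Callable" idea, assuming each job returns a result the main thread wants to collect. The names CallableJobsDemo and lengths are made up for illustration, and computing a string length stands in for the real per-file work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CallableJobsDemo {
    // Builds one Callable per input, runs them on a pool, and returns results in input order.
    static List<Integer> lengths(List<String> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<Callable<Integer>> jobs = new ArrayList<>();
            for (String s : inputs) {
                jobs.add(() -> s.length());   // hypothetical stand-in for real work
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until all jobs are done and keeps the submission order.
            for (Future<Integer> f : pool.invokeAll(jobs)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(lengths(Arrays.asList("a", "bb", "ccc")));
    }
}
```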
I have an array of int with size 4, only one thread can access an array cell at a time.
I thought about using Semaphore but I don't know how or if there is a way to get the acquired index
I built a code example to explain better:
public class Temp {
private ExecutorService executeService;
private Semaphore semaphore;
private int[] syncArray; // only one thread can access an array cell at the same time
public Temp() {
syncArray = new int[]{1,2,3,4};
executeService = Executors.newFixedThreadPool(10);
semaphore = new Semaphore(syncArray.length, true);
for(int i = 0;i < 100; i++) {
executeService.submit(new Runnable() {
@Override
public void run() {
semaphore.acquire();
// here I want to access one of the array cell
// does not matter which one as long as no other thread is currently using it
int syncArrayIndex = semaphore.getAcquiredIndex(); // is something like this possible?
syncArray[syncArrayIndex] += ...;
semaphore.release();
}
});
}
}
}
Edit:
This is a piece of code that looks closer to my real problem:
public class Temp {
private ExecutorService executeService;
private Semaphore semaphore;
private static ChromeDriver driver;
public Temp() {
executeService = Executors.newFixedThreadPool(10);
}
public Future<WikiPage> getWikiPage(String url) {
return executeService.submit(new PageRequest(url));
}
private static class PageRequest implements Callable<WikiPage> {
String url;
public PageRequest(String url) {
this.url = url;
}
@Override
public WikiPage call() throws Exception {
String html = "";
synchronized (driver) {
html = ...// get the wiki page, this part takes a long time
};
WikiPage ret = ...// parse the data to the WikiPage class
// this part takes less time but depend on the sync block above
return ret;
}
}
}
@Kayaman I'm not sure I understand your comment; the problem is that I return a future. Do you have any suggestions on how to improve my code to run faster?
No, semaphore isn't useful here. It only knows about how many permits it has, there are no "indices" in a semaphore.
You can use AtomicIntegerArray instead, although if you explain your root problem, there may be a more suitable class to use.
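As a rough illustration of the AtomicIntegerArray suggestion, the sketch below lets many tasks update cells concurrently without any explicit lock. It assumes each task only needs an atomic read-modify-write on a single cell. The class name AtomicCellsDemo, the method runAndSum, and the choice of cell (i % length) are illustrative assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicIntegerArray;

class AtomicCellsDemo {
    // Starts from {1, 2, 3, 4}, lets `tasks` tasks each add 1 to some cell, returns the total.
    static int runAndSum(int tasks) {
        AtomicIntegerArray cells = new AtomicIntegerArray(new int[]{1, 2, 3, 4});
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < tasks; i++) {
            final int index = i % cells.length();
            // incrementAndGet is atomic per cell, so concurrent updates are never lost.
            pool.execute(() -> cells.incrementAndGet(index));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int sum = 0;
        for (int i = 0; i < cells.length(); i++) {
            sum += cells.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(runAndSum(100)); // initial sum 10 plus 100 increments
    }
}
```

This only covers short atomic updates. If a thread needs exclusive use of a cell for a longer operation, a different structure (for example a BlockingQueue holding the free indices) may fit better.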
I have a bunch of objects representing some data. These objects can be written to their corresponding files. User may request some changes to be made quicker than previous changes written to the file.
Say I make changes to File A, File B and File C and submit them for execution. Then, while they are being written, I make changes to File A and post it. Suppose there are 3 threads operating. Once the first changes to A, B and C have been executed (written to files), the 1st and 2nd changes to A will be executed almost simultaneously. However, I want the 2nd change to be applied after the 1st one is done.
How can I do that in rxJava?
Another point: in a different place I want to run an action with the latest changes. One option is to wait until all tasks have finished.
Is there appropriate RxJava primitive/approach that would hopefully cover these 2 use cases?
I am new to RxJava, but I hope this makes sense. Subjects come to mind as relevant, but there are going to be hundreds of files.
I already have the implementation using custom Executor.
public class OrderingExecutor
implements Executor
{
@Delegate
private final Executor delegate;
private final Map<Object, Queue<Runnable>> keyedTasks = new HashMap<>();
public OrderingExecutor(
Executor delegate)
{
this.delegate = delegate;
}
public void execute(
Runnable task,
Object key)
{
Objects.requireNonNull(key);
boolean first;
Runnable wrappedTask;
synchronized (keyedTasks)
{
Queue<Runnable> dependencyQueue = keyedTasks.get(key);
first = (dependencyQueue == null);
if (dependencyQueue == null)
{
dependencyQueue = new LinkedList<>();
keyedTasks.put(key, dependencyQueue);
}
wrappedTask = wrap(task, dependencyQueue, key);
if (!first)
{
dependencyQueue.add(wrappedTask);
}
}
// execute method can block, call it outside synchronize block
if (first)
{
delegate.execute(wrappedTask);
}
}
private Runnable wrap(
Runnable task,
Queue<Runnable> dependencyQueue,
Object key)
{
return new OrderedTask(task, dependencyQueue, key);
}
class OrderedTask
implements Runnable
{
private final Queue<Runnable> dependencyQueue;
private final Runnable task;
private final Object key;
public OrderedTask(
Runnable task,
Queue<Runnable> dependencyQueue,
Object key)
{
this.task = task;
this.dependencyQueue = dependencyQueue;
this.key = key;
}
@Override
public void run()
{
try
{
task.run();
}
finally
{
Runnable nextTask = null;
synchronized (keyedTasks)
{
if (dependencyQueue.isEmpty())
{
keyedTasks.remove(key);
}
else
{
nextTask = dependencyQueue.poll();
}
}
if (nextTask != null)
{
delegate.execute(nextTask);
}
}
}
}
}
Maybe some sensible way to plug it into rxJava?
It's not fully clear what you are trying to achieve here, but you can layer a priority queue on top of RxJava.
class OrderedTask implements Comparable<OrderedTask> { ... }
PriorityBlockingQueue<OrderedTask> queue = new PriorityBlockingQueue<>();
PublishSubject<Integer> trigger = PublishSubject.create();
trigger.flatMap(v -> {
OrderedTask t = queue.poll();
return someAPI.workWith(t);
}, 1)
.subscribe(result -> { }, error -> { });
queue.offer(new SomeOrderedTask(1));
trigger.onNext(1);
queue.offer(new SomeOrderedTask(2));
trigger.onNext(2);
I am using Spring Boot.
public interface StringConsume extends Consumer<String> {
default public void strHandel(String str) {
accept(str);
}
}
Impl
@Component("StrImpl")
public class StringConsumeImpl implements StringConsume {
BlockingQueue<String> queue = new ArrayBlockingQueue<>(500);
final ExecutorService exService = Executors.newSingleThreadExecutor();
Future<?> future = CompletableFuture.completedFuture(true);
@Override
public void accept(String t) {
try {
queue.put(t);
} catch (InterruptedException e) {
e.printStackTrace();
}
while (null != queue.peek()) {
if (future.isDone()) {
future = exService.submit(() -> queue.take());
}
}
}
}
Class
@Component
public class Test {
@Resource(name="StrImpl")
private @Autowired StringConsume handler;
public void insertIntoQueue(String str) {
handler.accept(str);
}
}
In StringConsumeImpl, do I need to synchronize the while loop? Suppose StringConsumeImpl is called five times: will the while loop create 5 processes or only 1? And what is the best replacement for the while loop in StringConsumeImpl, if any?
There are some problems with that code.
First of all, the consumer doesn't really "consume" anything, it just adds the string to the queue then takes it back out. Let's say for the sake of the argument that it also "consumes" it by printing it to console or something.
Secondly, the consumer will only get called once due to the loop unless it is running in a thread of its own. E.g. if you do:
public static void main(String[]args) {
StringConsume consumer = new StringConsumeImpl();
consumer.accept("hello");
}
The consumer will put "hello" into the queue, take it out immediately and then stay in the loop, waiting for more elements to take out; however, no one is there to actually add any.
The usual concept of doing what it looks like you're trying to do is "producer/consumer". This means that there is a "producer" that puts items into a queue and a "consumer" taking them out and doing stuff with them.
So in your case what your class does is "consume" the string by putting it into the queue, making it a "producer", then "consuming" the string by taking it back out of the queue. Of course, there's also the "actual" producer of the string, ie the class calling this.
So in general you'd do something like this:
/** Produces random Strings */
class RandomStringProducer {
Random random = new Random();
public String produceString() {
return Double.toString(random.nextDouble());
}
}
/** Prints a String */
class PrintConsumer implements StringConsume {
public void accept(String s) { System.out.println(s); }
}
/** Consumes String by putting it into a queue */
class QueueProducer implements StringConsume {
BlockingQueue<String> queue;
public QueueProducer(BlockingQueue<String> q) { queue = q; }
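public void accept(String s) {
try {
queue.put(s); // blocks while the queue is full
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // restore the interrupt flag
}
}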
}
public static void main(String[] args) {
// the producer
RandomStringProducer producer = new RandomStringProducer();
// the end consumer
StringConsume printConsumer = new PrintConsumer();
// the queue that links producer and consumer
BlockingQueue<String> queue = new ArrayBlockingQueue<>(100); // a capacity is required; 100 is an arbitrary choice
// the consumer putting strings into the queue
QueueProducer queuePutter = new QueueProducer(queue);
// now, let's tie them together
// one thread to produce strings and put them into the queue
ScheduledExecutorService producerService = Executors.newScheduledThreadPool(1);
Runnable createStringAndPutIntoQueue = () -> {
String created = producer.produceString();
queuePutter.accept(created);
};
// put string into queue every 100ms
producerService.scheduleAtFixedRate(createStringAndPutIntoQueue, 0, 100, TimeUnit.MILLISECONDS); // initial delay 0, period 100
// one thread to consume strings
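Runnable takeStringFromQueueAndPrint = () -> {
try {
while (true) {
String takenFromQueue = queue.take(); // this will block until a string is available
printConsumer.accept(takenFromQueue);
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // stop consuming when interrupted
}
};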
// let it run in a different thread
ExecutorService consumerService = Executors.newSingleThreadExecutor();
consumerService.submit(takeStringFromQueueAndPrint);
// this will be printed; we are in the main thread and code is still being executed
System.out.println("the produce/consume has started");
}
So when you run this, there will be three threads: the main thread, the producer thread and the consumer thread. The producer and consumer will be doing their thing concurrently, and the main thread will also continue to run (as exemplified by the System.out.println in the last line).
So, I am new to threads, and I'm still learning how everything works. So, I couldn't find an answer that would provide an explanation for my problem (to my level of understanding).
I have a Runnable class that looks like so:
public class Request implements Runnable {
private Boolean ok = true;
public synchronized void setOk(Boolean ok) {
this.ok = ok;
}
public synchronized Boolean getOk() {
return ok;
}
private synchronized void foo() {
//if something happens
setOk(false);
}
@Override
public void run() {
while (true)
foo();
}
}
And then I have another class that does the following:
private static Request request;
private static void spawnThreads() {
ExecutorService e = Executors.newFixedThreadPool(4);
request = new Request();
e.execute(request);
}
public static void main(String[] args) {
spawnThreads();
while (true) {
System.out.println(request.getOk());
if (!request.getOk())
request.setOk(true);
TimeUnit.SECONDS.sleep(10);
}
}
In the main thread, I need to check whether getOk() returns false and, if so, do something and set it back to true. Vice versa, the worker thread sets it to false (and I need that thread to keep going, no matter what the value of ok is at any given time).
As this code is, I can't get the value of request.getOk() in the main thread. If I remove the synchronized words from the getter and setter, I can access the value in the main thread until a point in time when it is changed by the thread, and never again.
Also, the executor is used because I would create multiple Request objects, and waiting for it to shutdown before accessing the variable would contradict my reason for doing this, as I would need all the threads to keep running.
That thread is making http requests to a server (that randomly times out, denies response, etc) and is used to retrieve some information. The ok variable is there to take a note when the thread acquires an ok response and some information from the server.
How do I solve this so that the thread can update that variable, while the main thread can retrieve it whenever needed, regardless of whether the thread has changed it in the meantime?
Would changing my Runnable to a Callable help? If yes, how?
Your example still leaves some holes in thread safety. As mentioned by @Radiodef, using AtomicBoolean can relieve you of most of the synchronisation if used properly.
Using your example, this is a thread-safe Request class that accepts a message, like an answer to an HTTP request.
public final class Request implements Runnable {
private final AtomicBoolean ok = new AtomicBoolean(false);
// volatile variables promote reference changes through all threads
private volatile String msg;
private boolean setMessage(String responseMessage) {
if (this.ok.compareAndSet(false, true)) {
this.msg = responseMessage;
return true;
}
return false;
}
public boolean hasMessage() {
// *pure* getters don't need synchronisation!
return this.ok.get();
}
public String getMessageAndReset() {
// make a copy before resetting the OK
String msgCopy = this.msg;
this.ok.compareAndSet(true, false);
return msgCopy;
}
public void run() {
final Random rand = new Random();
try {
while(true) {
// sleep at random max 5 seconds
// (simulate unpredictable network)
TimeUnit.SECONDS.sleep(rand.nextInt(5));
while(!setMessage("Incoming message")) {
// busy waiting ... waits until the current value has
// been retrieved by the main thread
Thread.sleep(100);
}
}
} catch (Exception e) {
System.out.println(e);
}
}
}
And your main class:
public final class MainClazz implements Runnable {
private final ExecutorService exec;
private final Request request;
public MainClazz() {
this.exec = Executors.newFixedThreadPool(4);
this.request = new Request();
this.exec.execute(request);
}
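public void run() {
try {
while (true) {
if (request.hasMessage()) {
System.out.println(request.getMessageAndReset());
}
TimeUnit.SECONDS.sleep(10);
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // stop polling when interrupted
}
}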
public static void main(String[] args) {
MainClazz main = new MainClazz();
main.run();
}
}
In this implementation, the Request class only holds a single value at a time. Depending on the amount of data you expect, you might want to think about using a buffer.
Also, like many others have mentioned, don't use while (true)! Get a synchronisation object from the java concurrent package!
More light reading on the AtomicBoolean object.
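One possible reading of "a synchronisation object from the java concurrent package" is a BlockingQueue, which also serves as the buffer mentioned above. The sketch below is only an illustration under that assumption: the worker puts messages into the queue and the main thread blocks in take() instead of polling with sleep. MessageHandOffDemo and collect are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

class MessageHandOffDemo {
    // A worker produces `count` messages; the caller receives them in order without polling.
    static List<String> collect(int count) {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        ExecutorService exec = Executors.newFixedThreadPool(4);
        exec.execute(() -> {
            for (int i = 0; i < count; i++) {
                inbox.add("message " + i);   // never blocks: this queue is unbounded
            }
        });
        List<String> received = new ArrayList<>();
        try {
            for (int i = 0; i < count; i++) {
                received.add(inbox.take());  // blocks until a message is available
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            exec.shutdown();
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(collect(3));
    }
}
```

With several Request workers, they could all share one queue, and the main thread would then handle responses in arrival order.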
In my program I am repeatedly reading a number of files like this:
String myLetter = "CoverSheet.rtf"; // actually has a full path
FileInputStream in = new FileInputStream(myLetter);
letterSection.importRtfDocument(in);
in.close();
Because there are many small files which are components to add to the document with importRtfDocument, and thousands of letters to generate in a run, the processing is quite slow.
The importRtfDocument method comes from a library I'm using, and needs to be given a FileInputStream. This is where I'm stumped. I tried a few things like declaring a FileInputStream for each file in the class and keeping them open - but reset() isn't supported.
I have looked at other similar questions like this one:
How to Cache InputStream for Multiple Use
However, none seem to address my problem, to wit, how can I cache a FileInputStream?
I normally create my own pool to cache files. Consider the following simple code:
class CachedPool {
private Map<URI, CachedFile> pool = new HashMap<>();
public CachedPool(){
}
public <T> T getResource(URI uri) {
CachedFile file;
if(pool.containsKey(uri)){
file = pool.get(uri);
} else {
file = new CachedFile(uri); // Injecting point to add resources
pool.put(uri, file);
}
return file.getContent();
}
}
class CachedFile {
private URI uri;
private int counter;
private Date cachedTime;
private Object content;
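public CachedFile(URI uri) {
this.uri = uri;
try {
this.content = uri.toURL().getContent();
} catch (IOException e) {
throw new UncheckedIOException(e); // both classes are in java.io
}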
this.cachedTime = new Date();
this.counter = 0;
}
public <T> T getContent(){
counter++;
return (T) content;
}
/** Override equals() and hashCode() **/
/** Write getters for all instance variables **/
}
You can use counter of CachedFile to remove the files that are rarely being used after a certain time period or when heap memory is very low.
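The eviction step is not shown above. As one possible sketch, and swapping in a different policy than the usage counter described (least recently used rather than least frequently used), a LinkedHashMap in access order can drop the stalest entry automatically. LruCache and LruCacheDemo are illustrative names, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true);          // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each put; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}

class LruCacheDemo {
    public static void main(String[] args) {
        LruCache<String, byte[]> cache = new LruCache<>(2);
        cache.put("a", new byte[0]);
        cache.put("b", new byte[0]);
        cache.get("a");                  // touch "a" so that "b" becomes the eldest
        cache.put("c", new byte[0]);     // exceeds the limit, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

For use from several threads, the map could be wrapped with Collections.synchronizedMap or guarded by a lock.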