I have a Java method which writes content into a text file with values delimited by the | symbol. The contents need to be picked up from 10 tables, depending on conditions, and written into the file. Currently I am doing it with the following method. Could anyone please suggest a better alternative approach for this requirement? Does this method have a performance bottleneck?
public static void createFile()
{
queryFromTable1
whileLoopForqueryFromTable1
{
writer.write(value1+"|"+value2+"|".....)
}
queryFromTable2
whileLoopForqueryFromTable2
{
writer.write("||||"+value4+"|".....)
}
queryFromTable2
whileLoopForqueryFromTable2
{
writer.write("||"+value5+"|".....)
}
}
There is no performance bottleneck in this pseudo code.
If you are using a BufferedWriter, I think it's fine to call write() many times. Just don't forget to close the writer when you are done.
We don't know what is behind your DB queries, though. Maybe they can be optimized. Do you use prepared statements?
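For illustration, here is a minimal sketch of that pattern, assuming a JDBC DataSource and using try-with-resources so the writer, connection and statement are always closed. The class, file, table and column names here are hypothetical, not from your code:
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class FileExporter {

    private static final String DELIMITER = "|";

    // Hypothetical query and column names, just to show the shape of the code.
    public static void createFile(DataSource dataSource) throws IOException, SQLException {
        try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("export.txt"), StandardCharsets.UTF_8);
             Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(
                     "SELECT col1, col2 FROM table1 WHERE status = ?")) {

            statement.setString(1, "ACTIVE");
            try (ResultSet rs = statement.executeQuery()) {
                while (rs.next()) {
                    writer.write(rs.getString("col1") + DELIMITER + rs.getString("col2"));
                    writer.newLine();
                }
            }
        }
    }
}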
You can create a separate method for extracting data from each table, e.g.
private List<String> getDataFromTable(String query) { ... }
The obvious bottleneck is the string concatenation value1+"|"+value2+"|". You'd be better off using a single write per element:
for (int i = 0; i < tableData.size(); i++) {
    String str = tableData.get(i);
    if (checkPassed(str)) {
        writer.write(str);
        // don't print the last |
        if (i < tableData.size() - 1) {
            writer.write(DELIMITER); // private static final String DELIMITER = "|";
        }
    }
}
More information would allow us to give a better answer.
Try breaking it down into several methods; below is pseudo code:
void createFile() {
    writeTo(out, query1);
    writeTo(out, query2);
    writeTo(out, query3);
    ...
}

void writeTo(out, query) {
    execute query
    loop {
        out.write(...)
    }
}
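As a sketch of what writeTo could look like with JDBC (assuming a java.sql.Connection and a BufferedWriter are passed in; writing every column via ResultSetMetaData is just one way of doing it, not part of the original pseudo code):
// Sketch: writes every column of every row, pipe-delimited, one row per line.
static void writeTo(BufferedWriter out, Connection connection, String query)
        throws SQLException, IOException {
    try (Statement statement = connection.createStatement();
         ResultSet rs = statement.executeQuery(query)) {
        int columnCount = rs.getMetaData().getColumnCount();
        while (rs.next()) {
            StringBuilder line = new StringBuilder();
            for (int i = 1; i <= columnCount; i++) {
                if (i > 1) {
                    line.append('|');
                }
                line.append(rs.getString(i));
            }
            out.write(line.toString());
            out.newLine();
        }
    }
}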
A possible bottleneck is if any of the queries is slow; in that case all of the remaining queries have to wait for the earlier ones to complete. Another possible bottleneck is that the algorithm is single-threaded.
To solve that, one solution would be to do the reads in parallel and write the results into separate writers. When all of them have completed, simply merge the outputs in the correct order (single-threaded). The Java 8 class CompletableFuture provides some nice features that can be used here: simply create some futures that complete and merge the output.
Check out the CompletableFuture JavaDocs for more info.
An example of the algorithm could be something like the code below. Please note that this is simply an example and not a full-fledged solution. The use of the StringWriter is just for convenience and is just one way of handling the data.
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class AlgorithmTest {
    public static void main(String[] args)
            throws IOException, ExecutionException, InterruptedException {
        // Setup async processing of task 1
        final CompletableFuture<String> q1 = CompletableFuture.supplyAsync(() -> {
            // Setup result data
            StringWriter result = new StringWriter();
            // execute query 1
            // Process result
            result.write("result from 1");
            // Return the result (can of course be handled in other ways).
            return result.toString();
        });
        // Setup async processing of task 2
        final CompletableFuture<String> q2 = CompletableFuture.supplyAsync(() -> {
            // Setup result data
            StringWriter result = new StringWriter();
            // execute query 2
            // Process result
            result.write("result from 2");
            // Return the result (can of course be handled in other ways).
            return result.toString();
        });
        // Write the whole thing to file (i.e. merge the streams)
        final Path path = new File("result.txt").toPath();
        Files.write(path, Arrays.asList(q1.get(), q2.get()));
    }
}
Related
I have a large set of words and I need to execute a task on each individual word. I want to make it multithreaded in order to increase the speed. Currently, I am just using a foreach loop to iterate through each item in the list. What I want to do is have 8 threads that check the word I give them and then write a result to a file.
Currently, this is the code I am using:
public static void main(String[] args) {
System.setProperty("http.agent", "Chrome");
readWords();
Collections.shuffle(words);
words.forEach(word -> {
if (CheckValidity.checkValidity(word)) {
System.out.println(word);
try(PrintWriter writer = new PrintWriter(new FileWriter("output.txt",true)))
{
writer.printf("%s\r\n", word);
} catch (IOException e) {
e.printStackTrace();
}
}
});
System.out.println("Done!");
}
How would I implement this with multithreading? I couldn't find any information that made sense to me on how to hand a value to whichever thread is free. Sorry if this isn't quite how multithreading works; I've never written anything with more than one thread before, so I don't know what's possible and what's not.
The quickest way to parallelize your calls to CheckValidity would be to use a parallel Stream. Something like
public static void main(String[] args) {
List<String> words = readWords();
Collections.shuffle(words);
words.stream()
.unordered()
.parallel()
.filter(CheckValidity::checkValidity)
.forEach(word -> {
System.out.println(word);
try(PrintWriter writer = new PrintWriter(new FileWriter("output.txt",true)))
{
writer.printf("%s\r\n", word);
} catch (IOException e) {
e.printStackTrace();
}
});
System.out.println("Done!");
}
However, this should not be your production solution if your application also does other things in parallel, as this internally uses the common ForkJoinPool, and blocking it with non-CPU-bound operations may slow down other parts of your application (for example, other parallel streams).
For a more robust solution, you should have a look at ThreadPoolExecutor, which allows you to create separate thread pools with defined sizes, timeouts, etc.
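A rough sketch of that approach, assuming readWords() returns the word list and CheckValidity.checkValidity() exists as in the question (both are taken from your code, not defined here), with a fixed pool doing the checks and a single shared PrintWriter guarded by a lock:
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WordCheckRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> words = readWords();      // assumed, as in the question, to load the word list
        Collections.shuffle(words);

        ExecutorService pool = Executors.newFixedThreadPool(8);
        try (PrintWriter writer = new PrintWriter(new FileWriter("output.txt", true))) {
            for (String word : words) {
                pool.submit(() -> {
                    if (CheckValidity.checkValidity(word)) {   // CheckValidity comes from the question
                        System.out.println(word);
                        synchronized (writer) {                // one shared writer, guarded by a lock
                            writer.printf("%s\r\n", word);
                        }
                    }
                });
            }
            pool.shutdown();                          // stop accepting new tasks
            pool.awaitTermination(1, TimeUnit.HOURS); // wait for the submitted checks to finish
        }
        System.out.println("Done!");
    }
}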
Is there a way to safely and immediately stop the execution of a Thread in Java? Especially if the logic inside the run() method of the Runnable implementation executes only a single iteration and does not regularly check any flag that tells it to stop?
I am building a Web Application, using which a user can translate the contents of an entire document from one language to another.
Assuming the documents are extra large and, consequently, that each translation is going to take a long time (say 20-25 minutes), my application creates a separate Thread for each translation initiated by its users. A user can see a list of active translations and decide to stop a particular translation job if he/she wishes.
This is my Translator.java
public class Translator {
public void translate(File file, String sourceLanguage, String targetLanguage) {
//Translation happens here
//.......
//Translation ends and a new File is created.
}
}
I have created a TranslatorRunnable class which implements the Runnable interface as follows:
public class TranslatorRunnable implements Runnable {
private File document;
private String sourceLanguage;
private String targetLanguage;
public TranslatorRunnable(File document, String sourceLanguage, String targetLanguage) {
this.document = document;
this.sourceLanguage = sourceLanguage;
this.targetLanguage = targetLanguage;
}
public void run() {
// TODO Auto-generated method stub
Translator translator = new Translator();
translator.translate(this.document, this.sourceLanguage, this.targetLanguage);
System.out.println("Translator thread is finished.");
}
}
I'm creating the thread for translating a document from an outer class like this:
TranslatorRunnable tRunnable = new TranslatorRunnable(document, "ENGLISH", "FRENCH");
Thread t = new Thread(tRunnable);
t.start();
Now my problem is how do I stop a translation process (essentially a Thread) when the user clicks on "Stop" in the GUI?
I have read a few posts on Stack Overflow as well as on other sites, which tell me to keep a volatile boolean flag inside the Runnable implementation, check it regularly from inside the run() method, and decide when to stop. See this post.
This doesn't work for me, as the run() method just calls the Translator.translate() method, which itself takes a long time. I have no option here.
The next thing I read was to use an ExecutorService and its shutdownNow() method. But even then, I'd have to handle InterruptedException somewhere regularly within my code. This, again, is out of the question. I referred to this documentation of the ExecutorService class.
I know I cannot use Thread.stop() as it is deprecated and may cause issues with objects that are commonly used by all threads.
What options do I have?
Is my requirement really feasible without substantial changes to my design? If yes, please tell me how.
If it is absolutely necessary for me to change the design, could anyone tell me what is the best approach I can take?
Thanks,
Sriram
Is there a way to safely and immediately stop the execution of a Thread in Java?
No. Each thread is responsible for periodically checking whether it has been interrupted, and for exiting as soon as possible:
if (Thread.currentThread().isInterrupted()) {
    // release resources; finish quickly what it was doing
}
If you want a more responsive application, you have to change the logic (for example, divide each job into smaller batches) so that each thread performs this check more often than every 20-25 minutes.
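As a sketch of that idea, assuming (hypothetically) that the document can be split into chunks and translated piece by piece; DocumentChunk, splitIntoChunks and the per-chunk translate overload are illustrations, not part of the original Translator API:
// Hypothetical chunked translation loop: the interruption check runs
// once per chunk instead of once per 20-25 minute job.
public void run() {
    Translator translator = new Translator();
    for (DocumentChunk chunk : splitIntoChunks(document)) {          // hypothetical helper
        if (Thread.currentThread().isInterrupted()) {
            System.out.println("Translation cancelled, cleaning up.");
            return;
        }
        translator.translate(chunk, sourceLanguage, targetLanguage); // hypothetical overload
    }
    System.out.println("Translator thread is finished.");
}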
If you are the one who created the Translator class, what's stopping you from adding some kind of flag inside the method that is checked periodically and, if needed, stops reading lines from the file? Something like this:
// "needsToStop" would be a volatile boolean flag that the GUI's Stop action sets to true.
public static List<String> readFile(String filename)
{
    List<String> records = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(filename)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            String[] split = line.split("\\s+");
            records.addAll(Arrays.asList(split));
            if (needsToStop) {
                break; // or throw an exception
            }
        }
        return records;
    }
    catch (Exception e)
    {
        System.err.format("Exception occurred trying to read '%s'.", filename);
        e.printStackTrace();
        return null;
    }
}
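A minimal sketch of how that flag could be wired into the Runnable from the question, so the GUI's Stop button has something to call; the requestStop() and needsToStop() names are just illustrations:
public class TranslatorRunnable implements Runnable {

    // Set from the GUI thread; volatile so the translation thread sees the change promptly.
    private volatile boolean needsToStop = false;

    // Called by the GUI when the user clicks "Stop".
    public void requestStop() {
        needsToStop = true;
    }

    // Checked periodically by the long-running translation code (e.g. the file-reading loop above).
    public boolean needsToStop() {
        return needsToStop;
    }

    public void run() {
        // ... pass "this" (or the flag) down to the code doing the long-running work ...
    }
}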
I am writing lots of files, like below.
public void call(Iterator<Tuple2<Text, BytesWritable>> arg0)
throws Exception {
// TODO Auto-generated method stub
while (arg0.hasNext()) {
Tuple2<Text, BytesWritable> tuple2 = arg0.next();
System.out.println(tuple2._1().toString());
PrintWriter writer = new PrintWriter("/home/suv/junk/sparkOutPut/"+tuple2._1().toString(), "UTF-8");
writer.println(new String(tuple2._2().getBytes()));
writer.close();
}
}
Is there any better way to write the files, without creating and closing a PrintWriter every time?
There is no significantly better way to write lots of files. What you are doing is inherently I/O intensive.
UPDATE - @Michael Anderson is right, I think. Using multiple threads to write the files will (probably) speed things up considerably. However, the I/O is still going to be the ultimate bottleneck, in a couple of respects:
Creating, opening and closing files involves file & directory metadata access and update. This entails non-trivial CPU.
The file data and metadata changes need to be written to disc. That is possibly multiple disc writes.
There are at least 3 syscalls for each file written.
Then there are the thread switching overheads.
Unless the quantity of data written to each file is significant (multiple kilobytes per file), I doubt that techniques like NIO, direct buffers, JNI and so on will be worthwhile. The real bottlenecks will be in the kernel: file system operations and low-level disk I/O.
... without closing or creating printwriter every time.
No. You need to create a new PrintWriter ( or Writer or OutputStream ) for each file.
However, this ...
writer.println(new String(tuple2._2().getBytes()));
... looks rather peculiar. You appear to be:
calling getBytes() on a String (?),
converting the byte array to a String
calling the println() method on the String, which will copy it, and then convert it back into bytes before finally outputting them.
What gives? What is the point of the String -> bytes -> String conversion?
I'd just do this:
writer.println(tuple2._2());
This should be faster, though I wouldn't expect the percentage speed-up to be that large.
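If the payload really is just raw bytes, another option (an assumption on my part, not something from the answer above) is to skip strings entirely and write the bytes straight to the file; this assumes the java.nio.file and java.util imports are available:
// Sketch: write the BytesWritable payload directly, avoiding the bytes -> String -> bytes round trip.
// getLength() matters because the BytesWritable backing array can be larger than the valid data.
Path out = Paths.get("/home/suv/junk/sparkOutPut/" + tuple2._1().toString());
Files.write(out, Arrays.copyOf(tuple2._2().getBytes(), tuple2._2().getLength()));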
I'm assuming you're after the fastest way. Because everyone knows fastest is best ;)
One simple way is to use a bunch of threads to do your writing for you.
However, you're not going to get much benefit from doing this unless your filesystem scales well. (I use this technique on Lustre-based cluster systems, and in cases where "lots of files" could mean 10k; in that case many of the writes will be going to different servers/disks.)
The code would look something like this: (Note I think this version is not right as for small numbers of files this fills the work queue - but see the next version for the better version anyway...)
public void call(Iterator<Tuple2<Text, BytesWritable>> arg0) throws Exception {
    int nThreads = 5;
    ExecutorService threadPool = Executors.newFixedThreadPool(nThreads);
    ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(threadPool);

    int nJobs = 0;
    while (arg0.hasNext()) {
        ++nJobs;
        final Tuple2<Text, BytesWritable> tuple2 = arg0.next();
        ecs.submit(new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                System.out.println(tuple2._1().toString());
                String path = "/home/suv/junk/sparkOutPut/" + tuple2._1().toString();
                try (PrintWriter writer = new PrintWriter(path, "UTF-8")) {
                    writer.println(new String(tuple2._2().getBytes()));
                }
                return null;
            }
        });
    }

    for (int i = 0; i < nJobs; ++i) {
        ecs.take().get();
    }
    // Shut the pool down once all jobs have completed.
    threadPool.shutdown();
}
Better yet is to start writing your files as soon as you have data for the first one, not when you've got data for all of them, and to make this writing not block the calculation thread(s).
To do this you split your application into several pieces communicating over a (thread safe) queue.
Code then ends up looking more like this:
public void main() {
    SomeMultithreadedQueue<Data> queue = ...;
    int nGeneratorThreads = 1;
    int nWriterThreads = 5;
    int nThreads = nGeneratorThreads + nWriterThreads;
    ExecutorService threadPool = Executors.newFixedThreadPool(nThreads);
    ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(threadPool);
    AtomicInteger completedGenerators = new AtomicInteger(0);

    // Start some generator threads.
    for (int i = 0; i < nGeneratorThreads; ++i) {
        ecs.submit(() -> {
            while (...) {
                Data d = ...;
                queue.push(d);
            }
            if (completedGenerators.incrementAndGet() == nGeneratorThreads) {
                // One sentinel per writer thread, so they all shut down.
                for (int w = 0; w < nWriterThreads; ++w) {
                    queue.push(null);
                }
            }
            return null;
        });
    }

    // Start some writer threads.
    for (int i = 0; i < nWriterThreads; ++i) {
        ecs.submit(() -> {
            Data d;
            while ((d = queue.take()) != null) {
                String path = d.path();
                try (PrintWriter writer = new PrintWriter(path, "UTF-8")) {
                    writer.println(new String(d.getBytes()));
                }
            }
            return null;
        });
    }

    for (int i = 0; i < nThreads; ++i) {
        ecs.take().get();
    }
}
Note I've not provided an implementation of the queue class; you can easily wrap the standard Java thread-safe ones to get what you need.
There's still a lot more that can be done to reduce latency, etc. Here are some of the further things I've used to get the times down:
don't even wait for all the data to be generated for a given file. Pass another queue containing packets of bytes to write.
Watch out for allocations - you can reuse some of your buffers.
There's some latency in the nio stuff - you can get some performance improvements by using C writes and JNI and direct buffers.
Thread switching can hurt, and the latency in the queues can hurt, so you might want to batch up your data slightly. Balancing this with 1 can be tricky.
I'm working on my first multi-threaded application, for the sake of learning. I really need to learn it. I already have a single-threaded function that reads in all text files in a directory, and replaces all indentation tabs to three spaces.
It has the ability to pass in an Appendable for the sake of optional extra information (listing each file, giving statistics, etcetera). If they pass in null, they want no debugging.
I'm trying to determine what's the best way of handling this in a multi-threaded version, but searching for "debugging multi-threaded java" is giving me nothing but how to diagnose bugs and deadlocks.
Can I safely stick with an Appendable or should I be considering something else? I'm not sure how to deal with interleaving messages, but the first thing I want to figure out is thread safety.
Rather than passing in an Appendable, consider using slf4j in your library to do the logging.
If no logging framework is linked in at run-time, no logging will be done. If the application is doing logging already, then there's probably a front-end to it that slf4j will output to.
I'd recommend using Logback for your logging output, as it's nicely configurable, either through configuration files or directly in code. All you need to do to get rudimentary output is include the JAR.
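For example, a minimal sketch of what that looks like inside the library; the class name and messages are placeholders:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TabReplacer {

    private static final Logger log = LoggerFactory.getLogger(TabReplacer.class);

    public void processFile(String fileName) {
        // Output goes to whatever backend (e.g. Logback) the application has on its classpath;
        // with no backend present, the calls are effectively no-ops.
        log.debug("Processing file {}", fileName);
        // ... do the tab-to-spaces replacement ...
    }
}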
Debugging threads is often a case of trying to figure out presentation. Log4j is great generally. You can configure it to tag each line with the thread name as well as the timestamp. Once you do this you can filter the output based on thread name and follow a single thread.
A good filtering tool is really important. The most basic approach would be to tail the log and pipe it through grep, but if it's something you do a lot you might want to layer something on top of the log, like a GUI with tabs for each thread or something like that.
Log4j itself will have no problem dealing with threads.
If you really want to do it yourself, pass in a DIFFERENT appendable to each thread, then when the thread is done dump it or save it to a file. You probably want to use just one thread to dump/save the appendables.
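A rough sketch of that per-thread approach, assuming each worker task gets its own StringBuilder and a single thread merges them into the caller's Appendable once the workers finish; the names and placeholder work items are illustrative:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadDebugDemo {
    public static void main(String[] args) throws InterruptedException, ExecutionException, IOException {
        Appendable debug = System.out;   // the caller-supplied Appendable (may be null for "no debugging")
        List<String> files = Arrays.asList("a.txt", "b.txt", "c.txt");  // placeholder work items

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<StringBuilder>> results = new ArrayList<>();
        for (String file : files) {
            results.add(pool.submit(() -> {
                StringBuilder perThreadLog = new StringBuilder();  // private to this task, no sharing
                perThreadLog.append("processed ").append(file).append('\n');
                // ... do the real tab-to-spaces work here ...
                return perThreadLog;
            }));
        }
        pool.shutdown();

        // One thread merges the per-task output, so there is no interleaving
        // and no shared mutable Appendable.
        if (debug != null) {
            for (Future<StringBuilder> f : results) {
                debug.append(f.get());
            }
        }
    }
}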
The problem with using Appendable from multiple threads is that it is not specified as thread safe.
Thread safety is the responsibility of classes that extend and implement this interface.
The answer is therefore to use a thread-safe multiplexor. This one uses a BlockingQueue and a thread that pulls data out of it and forwards it to their Appendable.
class TellThemWhatIsHappening implements Appendable {
    // The pipe to their system/log.
    private final Appendable them;
    // My internal queue.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
    // Have I been interrupted?
    private volatile boolean interrupted = false;

    public TellThemWhatIsHappening(Appendable them) {
        // Record the target Appendable.
        this.them = them;
        // Grow my thread.
        Thread t = new Thread(consumer);
        // Make sure it doesn't hold your app open.
        t.setDaemon(true);
        // Start the consumer running.
        t.start();
    }

    // The runnable that consumes the queue and passes it on to them.
    private Runnable consumer = new Runnable() {
        @Override
        public void run() {
            while (!interrupted) {
                try {
                    // Pull from the queue and push to them.
                    them.append(queue.take());
                } catch (InterruptedException ex) {
                    // We got interrupted.
                    interrupted = true;
                } catch (IOException ex) {
                    // Not sure what you should do here. Their Appendable threw you an exception.
                    interrupted = true;
                }
            }
        }
    };
Continued...
    private void append(String s) throws IOException {
        // No point if they are null.
        if (them != null) {
            try {
                queue.put(s);
            } catch (InterruptedException ex) {
                // What should we do here?
                interrupted = true;
            }
        }
    }

    @Override
    public Appendable append(CharSequence csq) throws IOException {
        append(csq.toString());
        return this;
    }

    @Override
    public Appendable append(CharSequence csq, int start, int end) throws IOException {
        append(csq.subSequence(start, end).toString());
        return this;
    }

    @Override
    public Appendable append(char c) throws IOException {
        append("" + c);
        return this;
    }
}
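A usage sketch, assuming all the worker threads share one instance and the target is, say, System.out (any Appendable works):
// Hypothetical usage: one multiplexing instance shared by every worker thread.
Appendable debug = new TellThemWhatIsHappening(System.out);

Runnable worker = () -> {
    try {
        debug.append("worker " + Thread.currentThread().getName() + " finished a file\n");
    } catch (IOException e) {
        // The queue-based implementation above never actually throws here,
        // but Appendable.append declares IOException, so it must be handled.
        e.printStackTrace();
    }
};
new Thread(worker).start();
new Thread(worker).start();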
However, it is a very good idea to use a proper logging framework rather than growing your own.
First I'd like to say that I'm working my way up from Python to more complicated code. I'm now on to Java and I'm extremely new to it. I understand that Java is really good at multithreading, which is good because I'm using it to process terabytes of data.
The data input is simply fed into an iterator, and I have a class that encapsulates a run function that takes one line from the iterator, does some analysis, and then writes the analysis to a file. The only bit of info the threads have to share with each other is the name of the object they are writing to. Simple, right? I just want each thread executing the run function simultaneously so we can iterate through the input data quickly. In Python it would be simple:
from multiprocessing import Pool
f = open('someoutput.csv','w');
def run(x):
f.write(analyze(x))
p = Pool(8);
p.map(run,iterator_of_input_data);
So in Java, I have my 10K lines of analysis code and can very easily iterate through my input, passing each item to my run function, which in turn calls all my analysis code and sends the result to an output object.
public class cool {
    ...
    public static void run(Input input, File output) {
        Analysis an = new Analysis(input, output);
    }

    public static void main(String args[]) throws Exception {
        Iterator<Input> iterator = new Parser(new File(input_file)).iterator();
        File output = new File(output_object);
        while (iterator.hasNext()) {
            cool.run(iterator.next(), output);
        }
    }
}
All I want to do is get multiple threads taking items from the iterator and executing the run function. Everything is independent. I keep looking at Java multithreading material, but it's all about talking over networks, sharing data, etc. Is this as simple as I think it is? If someone can just point me in the right direction I would be happy to do the leg work.
Thanks
An ExecutorService (ThreadPoolExecutor) would be the Java equivalent.
ExecutorService executorService =
    new ThreadPoolExecutor(
        maxThreads, // core thread pool size
        maxThreads, // maximum thread pool size
        1,          // time to wait before resizing pool
        TimeUnit.MINUTES,
        new ArrayBlockingQueue<Runnable>(maxThreads, true),
        new ThreadPoolExecutor.CallerRunsPolicy());

ConcurrentLinkedQueue<ResultObject> resultQueue = new ConcurrentLinkedQueue<>();

while (iterator.hasNext()) {
    executorService.execute(new MyJob(iterator.next(), resultQueue));
}
Implement your job as a Runnable.
class MyJob implements Runnable {
/* collect useful parameters in the constructor */
public MyJob(...) {
/* omitted */
}
public void run() {
/* job here, submit result to resultQueue */
}
}
The resultQueue is there to collect the results of your jobs.
See the Java API documentation for detailed information.
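One thing to add (an assumption about how you would finish up, not part of the answer above): once all jobs are submitted, shut the pool down and wait for it before reading the collected results, for example:
// After the submission loop: stop accepting new jobs and wait for the queued ones to finish.
// awaitTermination throws InterruptedException, so declare or handle it.
executorService.shutdown();
if (!executorService.awaitTermination(1, TimeUnit.HOURS)) {
    executorService.shutdownNow(); // give up after the timeout and interrupt remaining jobs
}
// Now resultQueue holds every ResultObject the completed jobs produced.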