I have several resources in my app that I need to load and dump into my database on first launch, and I want to do this in parallel.
So I created an observable wrapper around reading a file:
@Override
public Observable<List<T>> loadDataFromFile() {
    return Observable.create(new Observable.OnSubscribe<List<T>>() {
        @Override
        public void call(Subscriber<? super List<T>> subscriber) {
            LOG.info("Starting load from file for " + type + " ON THREAD " + Thread.currentThread().getId());
            InputStream inputStream = null;
            try {
                Gson gson = JsonConverter.getExplicitGson();
                inputStream = resourceWrapper.openRawResource(resourceId);
                InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
                List<T> tList = gson.fromJson(inputStreamReader, type);
                subscriber.onNext(tList);
                subscriber.onCompleted();
                LOG.info("Completed load from file for " + type);
            } catch (Exception e) {
                LOG.error("An error occurred loading the file", e);
                subscriber.onError(e);
            } finally {
                if (inputStream != null) {
                    try {
                        inputStream.close();
                    } catch (IOException e) {
                        // ignore failure to close
                    }
                }
            }
        }
    });
}
However, it's not asynchronous. There are two approaches to making it asynchronous that I see:
1) Do the asynchrony inside the observable: spawn a new thread or use a callback-based file-reading API.
2) Use a scheduler to do the work on an I/O thread.
Again, for the DB I have to create my own observable that wraps the database's API, and there is a synchronous version and an asynchronous version with a callback.
So what is the correct way of creating observables that do I/O work?
Secondly, how can I use these observables in a chain to read all the files in parallel, then for each one store the contents in the DB? I want to receive an onCompleted event when the entire process is complete for all my reference data.
One good thing about Rx is that you can control on which thread your "work" is done. You can use
subscribeOn(Schedulers.io())
If you want to load resources in parallel I suggest using the merge (or mergeDelayError) operator.
Assuming you have a function
Observable<List<T>> loadDataFromresource(int resID)
to load one resource, you could first create a list of observables for each resource
for (int i=0 ; i<10; i++) {
obsList.add(loadDataFromresource(i+1).subscribeOn(Schedulers.io()));
}
associating a scheduler with each observable. Merge the observables using
Observable<List<T>> mergedObs = Observable.merge(obsList);
Subscribing to the resulting observable should then load the resources in parallel. If you'd like to delay errors until the end of the merged observable then use
Observable<List<T>> mergedObs = Observable.mergeDelayError(obsList);
I'm not a Java developer, but in C# this is basically how this kind of code should be structured:
public IObservable<string> LoadDataFromFile()
{
return
Observable.Using(
() => new FileStream("path", FileMode.Open),
fs =>
Observable.Using(
() => new StreamReader(fs),
sr => Observable.Start(() => sr.ReadLine())));
}
Hopefully you can adapt from that.
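For the Java/RxJava side, the same shape can be expressed with Observable.using, which ties the reader's lifetime to the subscription. The following is only a sketch, reusing resourceWrapper, resourceId and type from the question:
// Rough RxJava 1 adaptation (sketch): Observable.using closes the reader
// when the subscription completes or errors.
public Observable<List<T>> loadDataFromFile() {
    return Observable.using(
        () -> new InputStreamReader(resourceWrapper.openRawResource(resourceId)),
        reader -> Observable.fromCallable(
            () -> JsonConverter.getExplicitGson().<List<T>>fromJson(reader, type)),
        reader -> {
            try {
                reader.close();
            } catch (IOException e) {
                // ignore failure to close
            }
        });
}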
Related
So I started to play with Advent of Code, and I would like to use Project Reactor to find the solutions in a reactive way.
I have implemented a solution that works partially, but not quite how I want it, because it can also read lines partially if there is no more space in the buffer.
The input to run the following function can be found here: https://adventofcode.com/2022/day/1/input
public static Flux<String> getLocalInputForStackOverflow(String filePath) throws IOException {
Path dayPath = Path.of(filePath);
FileOutputStream resultDay = new FileOutputStream(basePath.resolve("result_day.txt").toFile());
return DataBufferUtils
.readAsynchronousFileChannel(
() -> AsynchronousFileChannel.open(dayPath),
new DefaultDataBufferFactory(),
64)
.map(DataBuffer::asInputStream)
.map(db -> {
try {
resultDay.write(db.readAllBytes());
resultDay.write("\n".getBytes());
return db;
} catch (FileNotFoundException e) {
throw new RuntimeException(e);
} catch (IOException e) {
throw new RuntimeException(e);
}
})
.map(InputStreamReader::new)
.map(is ->new BufferedReader(is).lines())
.flatMap(Flux::fromStream);
}
The point of this function is to read the lines of the file in a reactive way.
I used the FileOutputStream to write what I read into another file and then compare the resulting file with the original, because I noticed that some lines are only partially read when there is no more space in the buffer. So the try-catch in .map() can be ignored.
My questions here would be:
Is there a more optimal way to read files asynchronously in a reactive way?
Is there a more optimal way to read a file asynchronously line by line with a limited buffer and make sure that only whole lines are read?
Workarounds that I've found are:
Increase the buffer to read the whole file in one run -> not an optimal solution.
Use the following function, but this raises a warning:
Possibly blocking call in non-blocking context could lead to thread starvation
public static Flux<String> getLocalInput1(int day ) throws IOException {
Path dayPath = getFilePath(day);
return Flux.using(() -> Files.lines(dayPath),
Flux::fromStream,
BaseStream::close);
}
You're almost there. Just use BufferedReader instead of Files.lines.
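A sketch of what that could look like, keeping the same shape as getLocalInput1 from the question:
// Sketch of the suggested variant: stream lines from a BufferedReader instead
// of Files.lines; the reader is closed when the Flux terminates.
public static Flux<String> getLocalInput1(int day) throws IOException {
    Path dayPath = getFilePath(day); // getFilePath(...) as in the question
    return Flux.using(
            () -> Files.newBufferedReader(dayPath),
            reader -> Flux.fromStream(reader.lines()),
            reader -> {
                try {
                    reader.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
}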
In Spring WebFlux, the optimal way to read files asynchronously in a reactive way is to use the Reactor Core library's Flux.using method. It creates a Flux that consumes a resource, performs some operations on it, and then cleans up the resource when the Flux completes.
Example of reading a file asynchronously and reactively:
Flux<String> flux = Flux.using(
// resource factory creates FileReader instance
() -> new FileReader("/path/to/file.txt"),
// transformer function turns the FileReader into a Flux
reader -> Flux.fromStream(new BufferedReader(reader).lines()),
// resource cleanup function closes the FileReader when the Flux is complete
reader -> reader.close()
);
Subscribe to the Flux and consume the lines of the file as they are emitted; this will print each line of the file to the console as it is read from the file.
flux.subscribe(line -> System.out.println(line));
In a similar way we can solve it by controlling each line explicitly:
Flux<String> flux = Flux.generate(
() -> new BufferedReader(new FileReader("/path/to/file.txt")),
// generator function reads a line from the file and emits it
(bufferedReader, sink) -> {
String line = bufferedReader.readLine();
if (line != null) {
sink.next(line);
} else {
sink.complete();
}
},
reader -> Mono.fromRunnable(() -> {
try {
reader.close();
} catch (IOException e) {
// Handle exception
}
})
);
I have a multi-threaded Spring Boot application in which I am reading data from a table in batches (the table contains around 1 million records).
I am running into Java heap memory issues and I am unable to find a workaround. Below is a code sample.
I call the Spring Boot REST API, which then calls this code. Here I am reading from the DB in the main thread in batches, then passing the batches to the thread pool executorService, and finally processing the result in another thread pool, resultProcessor.
The Worker class implements Callable<WorkerResult>.
ExecutorService executorService = Executors.newFixedThreadPool(15);
Long workerCount = 0L;
ExecutorService resultProcessor = Executors.newFixedThreadPool(10);
List<CompletableFuture<WorkerResult>> futures = new ArrayList<>();

while (workerCount < totalData) {
    List<Model> dbRecords = repo.getData(workerCount, workerCount + rp, date);
    workerCount += rp + 1;
    try {
        futures.add(CompletableFuture.supplyAsync(() -> {
            try {
                return new Worker(dbRecords).call(); // Here a third-party API is called for each record
            } catch (Exception ex) {
                throw new CompletionException(ex);
            }
            // Or return default value
        }, executorService).thenApplyAsync(result -> {
            service.resultReceived(result); // update the results into the db
            return result;
        }, resultProcessor));
    } catch (RejectedExecutionException e) {
        logData("Can't submit anymore tasks %s ", e.getMessage());
    }
}
Outside the while loop, once I have read all the data from the DB, I call CompletableFuture.allOf to finish any remaining tasks.
Below is the code for that:
try {
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    executorService.shutdown();
    executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
    resultProcessor.shutdown();
    resultProcessor.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
    e.printStackTrace();
}
Here, if I do not add the CompletableFuture.allOf, the method returns without completing all the tasks in the queues.
Instead of calling CompletableFuture.allOf, I have tried futures.forEach(CompletableFuture::join), but that didn't resolve my issue either.
Currently I have assigned 1 GB of RAM to the Tomcat server, so I hit the heap space error after some 100 thousand records have been processed successfully.
What can I do here to get rid of this error and improve code efficiency as well? The solution should be in Java 8 and not the latest versions, if possible.
I don't know how much data there will be in production; this is test-environment data.
I want to read from a file that's actively being written by another application/process.
This code, which I found in another question, does the job: it reads a file while it is being written and only reads the new content.
The problem is that it consumes a lot of CPU even if no data was added to the file. How can I optimize this code?
Use a timer? Or a Thread.sleep to pause?
Another thing to add: the whole program I am trying to write reads a file in real time and processes its content. So Thread.sleep or the timer will pause my whole program. The ideal improvement I am looking for is not to wait a few seconds, but to wait for a certain event to happen => new data is added. Is this possible?
public class FileReader {
    public static void main(String args[]) throws Exception {
        if (args.length > 0) {
            File file = new File(args[0]);
            System.out.println(file.getAbsolutePath());
            if (file.exists() && file.canRead()) {
                long fileLength = file.length();
                readFile(file, 0L);
                while (true) {
                    if (fileLength < file.length()) {
                        readFile(file, fileLength);
                        fileLength = file.length();
                    }
                }
            }
        } else {
            System.out.println("no file to read");
        }
    }

    public static void readFile(File file, Long fileLength) throws IOException {
        String line = null;
        BufferedReader in = new BufferedReader(new java.io.FileReader(file));
        in.skip(fileLength);
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}
The ideal improvement I am looking for is not to wait few seconds, but
for a certain event to happen => New data is added. Is this possible ?
Best solution: data pushing.
The application that produces the content should inform the other application that new content may be read.
You could use any channel that can convey this information between the two applications.
For example, a specific file that the writer updates to signal that there are new things to read.
The writer could write/overwrite an update date in this file, and the reader would read the data file only if it hasn't read any content since that date.
A more robust way, but with more overhead, could be exposing a notification service on the reader side.
Why not a REST service?
In this way, the writer could notify the reader via the service as new content becomes ready.
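As a rough illustration of the marker-file idea (all file names here are made up, and readFile(...) refers to the question's own method):
// Hypothetical sketch: the writer overwrites "data.ready" after appending to
// "data.log"; the reader re-reads the data file only when the marker's
// last-modified timestamp moves forward.
static void followWithMarker(File dataFile, Path markerFile) throws IOException, InterruptedException {
    long lastSeen = 0L;
    long readOffset = 0L;
    while (true) {
        long modified = Files.getLastModifiedTime(markerFile).toMillis();
        if (modified > lastSeen) {
            lastSeen = modified;
            readFile(dataFile, readOffset); // readFile(...) from the question's code
            readOffset = dataFile.length();
        }
        Thread.sleep(500); // light pause; no tight busy loop
    }
}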
Another thing to add, the whole program I am trying to write reads a
file in real-time and process its content. So this means that
thread.sleep or the timer will pause my whole program.
A workaround solution: data pulling performed by a dedicated thread.
You probably have a multi-core CPU.
So create a separate thread to read the produced file, allowing the other threads of your application to stay runnable.
Besides, you could also perform a regular pause, Thread.sleep(), to limit the CPU used by the reading thread.
It could look like:
Thread readingThread = new Thread(new MyReadingProcessing());
readingThread.start();
Where MyReadingProcessing is a Runnable:
public class MyReadingProcessing implements Runnable {
    public void run() {
        while (true) {
            readFile(...);
            try {
                Thread.sleep(1000); // 1 second here, but choose a reasonable time according to the metrics of your producer application
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (isAllReadFile()) { // condition to stop the reading process
                return;
            }
        }
    }
}
Instead of a busy-wait loop, use a WatchService on changes to the directory's entries.
Path path = Paths.get("...");
try (WatchService watchService = path.getFileSystem().newWatchService()) {
    WatchKey watchKey = path.register(watchService,
            StandardWatchEventKinds.ENTRY_MODIFY);
    for (;;) { // watchKey.poll with timeout, or take, blocking
        WatchKey taken = watchService.take();
        for (WatchEvent<?> event : taken.pollEvents()) {
            Path changedPath = (Path) event.context();
            if (changedPath.equals(path)) {
                ...
            }
        }
        boolean valid = taken.reset();
        if (!valid) {
            ... unregistered
        }
    }
}
Note that the above has to be adapted to use poll or take.
I have a system where, when files of a certain type are found, I download, encode, and upload them in a separate thread.
while (true) {
    for (SftpClient c : clients) {
        try {
            filenames = c.list("*.wav", "_rdy_");
        } catch (SftpException e) {
            e.printStackTrace();
        }
        if (filenames.size() > 0) {
            // AudioThread run() method handles the download, encode, and upload
            AudioThread at = new AudioThread(filenames);
            at.setNode(c.getNode());
            Thread t = new Thread(at);
            t.start();
        }
    }
    try {
        Thread.sleep(3000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
The run method from AudioThread
public void run() {
    System.out.println("Running...");
    this.buildAsteriskMapping();
    this.connectToSFTP();
    ac = new AudioConvert();
    this.connectToS3();

    String downloadDir = "_rough/" + getNode() + "/" + Time.getYYYYMMDDDate() + "/";
    String encodeDir = "_completed" + getNode() + "/" + Time.getYYYYMMDDDate() + "/";
    String uploadDir = getNode() + "/" + Time.getYYYYMMDDDate() + "/";

    System.out.println("Downloading...");
    try {
        sftp.get(filenames, downloadDir);
    } catch (SftpException e) {
        // download failed
        System.out.println("DL Failed...");
        e.printStackTrace();
    }

    System.out.println("Encoding...");
    try {
        ac.encodeWavToMP3(filenames, downloadDir, encodeDir);
    } catch (IllegalArgumentException | EncoderException e) {
        System.out.println("En Failed...");
        e.printStackTrace();
    }

    System.out.println("Uploading...");
    try {
        s3.upload(filenames, encodeDir, uploadDir);
    } catch (AmazonClientException e) {
        System.out.println("Up Failed...");
        e.printStackTrace();
    }
}
The download method:
public void get(ArrayList<String> src, String dest) throws SftpException {
for(String file : src) {
System.out.println(dest + file);
channel.get(file, dest + file);
}
}
The encode method:
public void encodeWavToMP3(ArrayList<String> filenames, String downloadDir, String encodeDir) throws IllegalArgumentException, EncoderException {
    for (String f : filenames) {
        File wav = new File(downloadDir + f);
        File mp3 = new File(encodeDir + wav.getName().replace(".wav", ".mp3"));
        encoder.encode(wav, mp3, attrs);
    }
}
The upload method:
public void upload(ArrayList<String> filenames, String encodeDir, String uploadDir) throws AmazonClientException, AmazonServiceException {
for(String f : filenames) {
s3.putObject(new PutObjectRequest(bucketName, uploadDir, new File(encodeDir + f)));
}
}
The issue is that I keep downloading the same files (or roughly the same files) in every thread. I want to add a variable for each client that holds the files that are being downloaded, but I don't know how to remove the lists/filenames from this variable. What would be a solution? My boss would also like to allow only x threads to run.
It's kind of hard to see the problem, as the code that actually does the download is missing :P
However, I would use some kind of ExecutorService instead.
Basically, I would add each download request to the service (wrapped in a "DownloadTask" with a reference to the file to be downloaded and any other relevant information it might need to get the file) and let the service take care of the rest.
The download tasks could be coded to take into account existing files as you see fit.
Depending on your requirements, this could be a single-threaded or multi-threaded service. It could also allow you to place upload requests in it as well.
Check out the Executors trail for more info
The general idea is to use a kind of producer/consumer pattern. You would have (at least) one thread that looks up all the files to be downloaded, and for each file, you would add a task to the executor service. After a file has been downloaded, I would queue an upload request into the same service.
This way, you avoid all the mess with synchronization and thread management :D
You could use the same idea with the scan tasks: for each client, you could add a task to a separate service, as sketched below.
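A minimal sketch of that idea; DownloadTask is a hypothetical wrapper, and the actual download/encode/upload calls would be the ones already in AudioThread:
// Hypothetical sketch: a fixed-size pool limits how many files are processed
// at once, and each DownloadTask owns exactly one file, so no file is handed
// to two threads.
ExecutorService service = Executors.newFixedThreadPool(4); // "x amount of threads"

class DownloadTask implements Runnable {
    private final String filename;
    private final String node;

    DownloadTask(String filename, String node) {
        this.filename = filename;
        this.node = node;
    }

    @Override
    public void run() {
        // download this one file, encode it, upload it (AudioThread logic for a single file)
    }
}

for (String filename : filenames) {
    service.submit(new DownloadTask(filename, c.getNode()));
}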
There is a problem in your code where you instantiate AudioThread in a while loop.
Note that after you create a thread and call t.start(), all the downloading, encoding and uploading happens asynchronously. Therefore, after you start the thread, the loop continues and makes another call to c.list(...) while the first thread you created is still processing the first set of files. Most probably the same set of files is returned by the subsequent c.list() calls, since you specified a file pattern in the call and there is no code that marks which files are currently being processed.
My suggestion:
Use Executors.newFixedThreadPool(int nThreads) as mentioned in the previous post, and set the number of threads to the number of processors on your machine. Do this before your while loop.
For each filename you retrieved from the FTP c.list() call, create a Callable and call ExecutorService.invokeAll(Collection<Callable<T>> tasks), as sketched after this list. The code in the Callable you create is your AudioThread code. Modify the AudioThread code to process only one file at a time (if possible); this way the downloads, uploads and encoding happen in parallel for each file.
Add code which marks which files were already processed. I would suggest renaming the files you have processed, so they are not returned by the next c.list() call.
Call ExecutorService.shutdown() after your while loop block.
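As a sketch of suggestions 1 and 2 (the task body is a placeholder for the per-file AudioThread logic):
// Hypothetical sketch: one Callable per file, run on a fixed-size pool sized
// to the number of processors; invokeAll waits until the whole batch is done.
int nThreads = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(nThreads);

List<Callable<Void>> tasks = new ArrayList<>();
for (String filename : filenames) {
    tasks.add(() -> {
        // download, encode and upload this single file (AudioThread logic)
        return null;
    });
}
try {
    pool.invokeAll(tasks); // blocks until every file in this batch is processed
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}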
I have a kind of tricky problem involving multi-threading. What I do is that I use a thread pool (ExecutorService) that is tasked with opening connections and putting them in a LinkedBlockingQueue.
So far I have used:
//run method in "getter threads"
public void run() {
try {
URL url = new URL(url_s); //url_s is given as a constructor argument
//if I am correct then url.openStream will wait until we have the content
InputStream stream = url.openStream();
Request req = new Request(); //a class with two variables:
req.html_stream = new InputSource(stream);
req.source = stream;
//this is a class variable (LinkedBlockingQueue<Request>)
blocking_queue.put(req);
} catch (Exception ex) {
logger.info("Getter thread died from an exeption",ex);
return;
}
}
I then have a consumer thread (java.lang.Thread) that takes these InputSources and InputStreams and does:
public void run() {
    while (running) {
        try {
            logger.info("waiting for data to eat");
            Request req = blocking_queue.take();
            if (req.html_stream != null)
                eat_data(req);
        } catch (Exception ex) {
            logger.error(ex);
            return;
        }
    }
}
Where eat_data calls an external library that takes an InputSource. The library uses a singleton instance to do the processing, so I can't put this step in the "getter" threads.
When I tested this code with small amounts of data it worked fine, but when I supplied it with several thousand URLs I started to have real problems. It's not easy to find out exactly what is wrong, but I suspect that the connections time out before the consumer thread gets to them, sometimes even causing deadlock.
I implemented it this way because it was so easy to go from url.openStream() to an InputSource, but I realize that I really must store the data locally for this to work.
How do I get from url.openStream() to some object I can store in my LinkedBlockingQueue (all data in memory) that I can later turn into an InputSource when my consumer thread has time to process it?
You can copy the contents of the URL into a ByteArrayOutputStream and close the URL Stream. Then store the ByteArrayInputStream in the queue.
Pseudo Code :
InputStream in = null;
try {
    in = url.openStream();
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    IOUtils.copy(in, buffer);
    ByteArrayInputStream bin = new ByteArrayInputStream(buffer.toByteArray());
    queue.put(bin);
} catch (IOException | InterruptedException e) {
    // handle a failed download or interruption as appropriate
} finally {
    IOUtils.closeQuietly(in);
}
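On the consumer side, the queued bytes can be wrapped into an InputSource only when the consumer is ready; a sketch reusing the question's Request fields (InterruptedException handling omitted):
// Consumer side (sketch): the content is already fully in memory, so no
// network connection is held open while the request waits in the queue.
ByteArrayInputStream bin = queue.take();
Request req = new Request();
req.html_stream = new InputSource(bin);
req.source = bin;
eat_data(req);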
References:
java.io.ByteArrayInputStream
java.io.ByteArrayOutputStream
org.apache.commons.io.IOUtils