Nonblocking IO with Java libraries that use InputStream


I'm reading up on non-blocking I/O because I'm using Akka and Play, where, as far as I can tell, blocking should be avoided whenever possible. However, I can't see how to make it fit my use case:
1) Get a file over the network (NIO alternatives exist here, but right now I'm using URL.openStream)
2) Decrypt the file (PGP) using BouncyCastle (here I'm limited to InputStream)
3) Unzip the file using standard Java GZIP (limited to InputStream)
4) Read each line in the file, which is a position-based flat file, and convert it to a case class (no constraints on the reading method here; right now I use scalax.io.Resource)
5) Persist using Slick/JDBC (not sure whether JDBC is blocking or not)
It's working right now basically using InputStreams all the way. However, in the interest of learning and improving my understanding, I'm investigating if I could do this with nonblocking IO.
I'd basically like to stream the file through a pipeline where I apply each step above and finally persist the data without blocking.
I can easily provide code if required, but I'm looking for a solution on a general level: what do I do when I'm dependent on libraries using java.io?

I hope this helps with some of your points:
1/2/3/4) Akka can work well with libraries that use java.io.InputStream and java.io.OutputStream. See this page, specifically this section: http://doc.akka.io/docs/akka/snapshot/scala/io.html
A ByteStringBuilder can be wrapped in a java.io.OutputStream via the asOutputStream method. Likewise, ByteIterator can be wrapped in a java.io.InputStream via asInputStream. Using these, akka.io applications can integrate legacy code based on java.io streams.
1) You say you get a file over the network. I'm guessing via HTTP? You could look into an asynchronous HTTP library. There are many fairly mature async HTTP libraries out there. I like using Spray Client in Scala as it is built on top of Akka, so it plays well in an Akka environment. It supports GZIP, but not PGP.
4) Another option: Is the file small enough to store in memory? If so you need not worry about being asynchronous as you will not be doing any IO. You will not be blocking whilst waiting for IO, you will instead be constantly using the CPU as memory is fast.
5) JDBC is blocking. You call a method with the SQL query as the argument, and the return type is a result set with the data. The method must block whilst performing the IO to be able to return this data.
There are some Java async database drivers, but all the ones I have seen seem unmaintained, so I haven't used them.
Fear not. Read this section of the Akka docs for how to deal with blocking libraries in an Akka environment:
http://doc.akka.io/docs/akka/snapshot/general/actor-systems.html#Blocking_Needs_Careful_Management
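The general idea in that section (keep blocking calls on their own threads) can be sketched in plain Java without Akka. This is only an illustration: the class name BlockingPipeline, its methods, and the choice of a caller-supplied thread pool are assumptions, and the PGP step is left out. Only the GZIP and line-reading steps are shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: not part of Akka, Play, or any library mentioned above.
class BlockingPipeline {

    // Blocking step: unzip the stream and read every line into memory.
    static List<String> readGzipLines(InputStream compressed) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the blocking step on a dedicated pool, so the calling thread
    // (for example a request-handling thread) only receives a future.
    static CompletableFuture<List<String>> readGzipLinesAsync(
            InputStream compressed, ExecutorService blockingPool) {
        return CompletableFuture.supplyAsync(() -> readGzipLines(compressed), blockingPool);
    }
}
```

The work is still blocking; it is simply confined to threads that are reserved for it, so a slow download or database call cannot starve the rest of the application.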

Decrypt file (PGP) using BouncyCastle (here I'm limited to InputStream)
As you are limited to an InputStream in this step you've answered your own question. You can do the part involving the network with NIO but your step (2) requires an InputStream. You could spool the file from the network to disk using NIO and then use streams from then on, for unzipping and decrypting (CipherInputStream) ... still blocking in theory but continuous in practice.
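A rough Java sketch of the "spool to disk, then use streams" idea. The class and method names are made up for illustration, and the source channel is assumed to be a blocking channel (for example one obtained from a socket or from Channels.newChannel). Decryption is omitted; only the GZIP step is shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration only.
class SpoolToDisk {

    // Copies everything from the source channel into the target file
    // and returns the number of bytes written.
    static long spool(ReadableByteChannel source, Path target) {
        try (FileChannel out = FileChannel.open(target, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0;
            long transferred;
            // With a blocking source, transferFrom returns 0 once the source is exhausted.
            while ((transferred = out.transferFrom(source, position, 1 << 20)) > 0) {
                position += transferred;
            }
            return position;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // From here on the java.io-based libraries can read from the local file.
    static InputStream openGzip(Path spooled) {
        try {
            return new GZIPInputStream(Files.newInputStream(spooled));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```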

I know this isn't exactly non-blocking IO, but I think you should look at composing Futures (or Promises) with map, which is non-blocking in the Play Framework sense of things.
def getFile(location: String): File = ??? // blocking code
def decrypt(file: File): File = ..
def unzip(file: File): File = ..
def store(file: File): String = ..
def result(status: String): SimpleResult[Json] = ..
AsyncResult {
  Promise.pure(getFile("someloc")) map decrypt map unzip map store map result
}

Related

Ways to buffer REST response

There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint timeout.
In other words, processing speed is lower than network throughput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough": incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release the REST endpoint connection before the timeout occurs.
What I've come up with so far is downloading incoming data to a temporary file and reading (processing) said file simultaneously using OutputStream/InputStream.
Sort of buffering, using a file.
This brings its own problems:
what if processing speed becomes faster than downloading speed for some time and I get EOF?
the file parser operates with ObjectInputStream, and it behaves weirdly in cases of empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: http server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app's processing speed is far greater than their sample client performance metrics. Still, we cannot match our app's performance with the network throughput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
In short: we can download all the data in the given time before the timeout occurs, but we cannot process it.
Having an adapter between InputStream and OutputStream that acts as a blocking queue would help a ton.

You're using something like new ObjectInputStream(new FileInputStream(...)), and the solution for EOF could be to wrap the FileInputStream first in a WriterAwareStream (a class you would write yourself), which would block on hitting EOF for as long as the writer is still writing.
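One possible shape for such a wrapper, as a sketch only. The class name, the writerDone flag and the polling interval are all assumptions; the writer is expected to set the flag only after its last write has been flushed and the file closed.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.util.function.BooleanSupplier;

// Hypothetical wrapper: treats EOF as "no data yet" while the writer is still active.
class WriterAwareInputStream extends FilterInputStream {
    private final BooleanSupplier writerDone;
    private final long pollMillis;

    WriterAwareInputStream(InputStream in, BooleanSupplier writerDone, long pollMillis) {
        super(in);
        this.writerDone = writerDone;
        this.pollMillis = pollMillis;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) == -1 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            // Sample the flag BEFORE reading: if the writer had already finished,
            // an EOF seen by the following read is the real end of the data.
            boolean done = writerDone.getAsBoolean();
            int n = in.read(buf, off, len);
            if (n != -1) {
                return n;
            }
            if (done) {
                return -1;
            }
            try {
                Thread.sleep(pollMillis); // wait for the writer to append more bytes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("interrupted while waiting for writer");
            }
        }
    }
}
```

This relies on FileInputStream returning newly appended bytes on a later read after it has reported EOF once, which is how it behaves on a growing file. Polling is the simplest approach; a lock/condition shared with the writer would avoid the sleep.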
Anyway, if latency doesn't matter much, I would not bother starting the processing before the download has finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid, it's optimized for small sizes (with tens of gigabytes expected, there's no point of trying to use memory).

Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes the Range Requests
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.

This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
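The chunking scheme described above (series ID, ordinal, final "done" record) might look roughly like this in Java. No queueing client is used here: the Consumer stands in for whatever publish call the chosen broker offers, and all names are invented for the sketch. Reassembling into a byte array is for illustration only; with tens of gigabytes the consumer would process chunks as they arrive.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; not tied to RabbitMQ, Kafka, or Kinesis APIs.
class Chunking {

    // One message: which series it belongs to, its position, its bytes, and an EOF marker.
    static final class Chunk {
        final String seriesId;
        final long ordinal;
        final byte[] payload;
        final boolean done;

        Chunk(String seriesId, long ordinal, byte[] payload, boolean done) {
            this.seriesId = seriesId;
            this.ordinal = ordinal;
            this.payload = payload;
            this.done = done;
        }
    }

    // Reads the HTTP body in fixed-size pieces and hands each piece to the publisher.
    static void publishInChunks(String seriesId, InputStream in, int chunkSize,
                                Consumer<Chunk> publish) {
        try {
            byte[] buf = new byte[chunkSize];
            long ordinal = 0;
            int n;
            while ((n = in.readNBytes(buf, 0, chunkSize)) > 0) {
                publish.accept(new Chunk(seriesId, ordinal++, Arrays.copyOf(buf, n), false));
            }
            // Final empty record with the "done" flag acts as EOF for the consumer.
            publish.accept(new Chunk(seriesId, ordinal, new byte[0], true));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Consumer side: restore the order by ordinal and concatenate the payloads.
    // Gaps in the ordinals are not checked in this sketch.
    static byte[] reassemble(List<Chunk> received) {
        List<Chunk> sorted = new ArrayList<>(received);
        sorted.sort(Comparator.comparingLong((Chunk c) -> c.ordinal));
        if (sorted.isEmpty() || !sorted.get(sorted.size() - 1).done) {
            throw new IllegalStateException("series is not complete yet");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Chunk c : sorted) {
            out.write(c.payload, 0, c.payload.length);
        }
        return out.toByteArray();
    }
}
```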

not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a command-line util such as aria2c, curl, wget et al., an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?

You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/

Camel could maybe help you regulate the network load between the REST producer and consumer?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, and apply some throttling policy before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
    .throttle(...)
    .to("http://myserver.com:8080/myrealwebservice");
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.

If you have enough memory, maybe you can use an in-memory data store like Redis.
When you get data from your REST endpoint, you can save it into a Redis list (or any other data structure that is appropriate for you).
Your consumer will consume data from the list.

Specify InputStream for ServletResponse instead of copying InputStream to OutputStream

In short, I have a servlet which retrieves pictures, videos, etc. from an underlying data store.
In order to achieve this I need to copy the file's InputStream to the ServletResponse *OutputStream*.
From my point of view this is not efficient, since I'll need to copy the file into memory before sending it. It would be more convenient to specify an InputStream from which the OutputStream would read data and send it straight away, after reading some data into the buffer.
I looked at the ServletResponse documentation, and it has a buffer for the message data, so I have a few questions regarding it.
Is this the right mechanism?
What If I decide not to send the file at the end of Servlet processing?
For example:
If I have copied the InputStream to the OutputStream and then find out that the request is not authorized and the user has no right to see this object (a mistake in design, maybe), would I still send some data to the client, even though this is not what I intended?

To address your first concern, you can easily copy an InputStream to an OutputStream using IOUtils from Apache Commons IO:
IOUtils.copy(fileInputStream, servletOutputStream);
It uses 4K buffer, so memory consumption should not be a concern. In fact you cannot just send straight away data from InputStream. At the lowest level the operating system still has to read file contents to some memory location and in order to send it to socket, you need to provide a memory location where the data to be sent resides. Streams are just a useful abstraction.
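For reference, a copy loop of this kind can be written by hand in a few lines. This is a plain-Java sketch, not the Commons IO source; the class name and the 4096-byte buffer size are choices made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper showing a fixed-size-buffer stream copy.
class StreamCopy {

    // Copies the input to the output through one reusable 4 KB buffer,
    // so memory use does not depend on the size of the file.
    static long copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[4096];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

In a servlet the two arguments would be the file's input stream and the response's output stream; only one buffer's worth of the file is in application memory at any moment.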
About your second question: this is how HTTP works: if you start streaming data to the client, servlet container sends all response headers first. If you abort in the middle, from the client perspective it looks like interrupted download.

Is this the right mechanism?
Basically, it is the only mechanism provided by the Servlet APIs. You need to design your servlet with this in mind.
(It is hard to see how it could be done any other way. A read syscall reads data into memory from a device (the disk). A write syscall writes data from memory to a device (the network interface). There is no syscall to transfer data directly from one device to another. The best you can do is to reduce the amount of copying of data within the application. If you use something like IOUtils.copy, it should minimize that as far as possible. The only way you could avoid going through application memory would be to use some special purpose hardware / operating system combination optimized for content delivery.)
However, this is probably moot anyway. In most cases, the performance bottleneck is likely to be movement of data over the network. Data can probably be read from disk to memory, copied, and written to the network interface orders of magnitude faster than it can move through the network to the user's web browser (or whatever).
If it is NOT moot, then a practical way to do content delivery would be to use a separate web server implemented in native code that is optimized for delivering static content; e.g. something like nginx.
What If I decide not to send the file at the end of Servlet processing? For example: If I have copied InputStream in OutputStream, and then find out that this is not authorized request, and user have no right to see this Object (Mistake in design maybe) I would still send some data to the client, although this is not what I intended, or not.
You should write your servlet to do the access checks BEFORE reading the content into memory. And ideally, before you "commit" the response by sending the response header.

Storing state in Java

Broad discussion question.
Are there any libraries already which allow me to store the state of execution of my application in Java?
E.g. I have an application which processes files, and the application may be forced to shut down suddenly at some point. I want to store information on which files have been processed and which have not, and what stage the processing had reached for the files in progress.
Are there already any libraries which abstract this functionality, or would I have to implement it from scratch?

It seems like what you are looking for is serialization which can be performed with the Java Serialization API.
You can write even less code if you decide to use known libraries such as Apache Commons Lang, and its SerializationUtils class which itself is built on top the Java Serialization API.
Using the latter, serializing/deserializing your application state to and from a file is done in a few lines.
The only thing you have to do is create a class holding your application state, let's call it... ApplicationState :-) It can look like that:
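import java.io.Serializable;
import java.util.List;

// Must implement Serializable, otherwise SerializationUtils.serialize will not accept it
class ApplicationState implements Serializable {
    private static final long serialVersionUID = 1L;

    enum ProcessState {
        READ_DONE,
        PROCESSING_STARTED,
        PROCESSING_ENDED,
        ANOTHER_STATE
    }

    private List<String> filesDone, filesToDo;
    private String currentlyProcessingFile;
    private ProcessState currentProcessState;
}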
With such a structure, and using SerializationUtils, serializing is done the following way:
try {
    ApplicationState state = new ApplicationState();
    ...
    // File to serialize the object to
    String fileName = "applicationState.ser";
    // New file output stream for the file
    FileOutputStream fos = new FileOutputStream(fileName);
    // Serialize the application state
    SerializationUtils.serialize(state, fos);
    fos.close();
    // Open a FileInputStream to the file
    FileInputStream fis = new FileInputStream(fileName);
    // Deserialize and cast back to ApplicationState (not String)
    ApplicationState restored = (ApplicationState) SerializationUtils.deserialize(fis);
    System.out.println(restored);
    fis.close();
} catch (Exception e) {
    e.printStackTrace();
}

It sounds like the Java Preferences API might be a good option for you. This can store user/system settings with minimal effort on your part and you can update/retrieve at any time.
https://docs.oracle.com/javase/8/docs/technotes/guides/preferences/index.html

It's pretty simple to make from scratch. You could follow this:
Have a DB (or just a file) that stores the information of processing progress. Something like:
Id|fileName|status|metadata
As soon as you start processing a file, make an entry in this table and mark its status as PROCESSING. Then you can store intermediate states, and finally, when you're done, set the status to DONE. This way, on restart, you would know which files were processed and which files were in progress when the process shut down or crashed. And (obviously) where to start.
In large enterprise environments where applications are loosely coupled (and there is no guarantee that an application will be available or won't crash), we use a message queue to do much the same thing and ensure a reliable architecture.
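A file-based version of this bookkeeping could look like the following sketch. The class name and the simplified fileName|status line format (without the Id and metadata columns) are assumptions made for the example; each status change is appended as a new line and the last line for a file wins.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical append-only journal of processing progress.
class ProgressJournal {
    private final Path journal;

    ProgressJournal(Path journal) {
        this.journal = journal;
    }

    // Appends one "fileName|status" line; earlier lines are never rewritten.
    void record(String fileName, String status) {
        String line = fileName + "|" + status + System.lineSeparator();
        try {
            Files.write(journal, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // On restart: replay the journal and keep the most recent status of each file.
    Map<String, String> latestStatuses() {
        Map<String, String> latest = new LinkedHashMap<>();
        if (!Files.exists(journal)) {
            return latest;
        }
        try {
            for (String line : Files.readAllLines(journal, StandardCharsets.UTF_8)) {
                int sep = line.lastIndexOf('|');
                if (sep > 0) {
                    latest.put(line.substring(0, sep), line.substring(sep + 1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return latest;
    }
}
```

After a crash, every file whose latest status is not DONE is a candidate for reprocessing.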

There are almost too many ways to mention. I would choose the option you believe is simplest.
You can use:
a file to record what is done (and what is to be done)
a persistent queue on JMS (which supports multiple processes, even on different machines)
an embedded or remote database.
An approach I rave about is using memory mapped files. A nice feature is that information is not lost if the application dies or is killed (provided the OS doesn't crash) which means you don't have to flush it, nor worry about losing data if you don't.
This works because the data is partly managed by the OS which means it uses little heap (even for TB of data) and the OS deals with loading and flushing to disk making it much faster (and making sizes much larger than your main memory practical).
BTW: This approach works even with a kill -9 as the OS flushes the data to disk. To test this I use Unsafe.getByte(0) which crashes the application with a SEG fault immediately after making a change (as in the next machine code instruction) and it still writes the change to disk.
This won't work if you pull the power, but you have to be really quick. You can use memory mapped files to force the data to disk before continuing, but I don't know how you can test this really works. ;)
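A minimal sketch of the memory-mapped approach using only java.nio. The class name and the idea of storing a single long "checkpoint" value are assumptions for the example; force() is the call that asks the OS to write the mapped changes to the storage device.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: keeps one long value in a memory-mapped file.
class MappedCheckpoint {

    // Writes the value through a mapping; the OS owns the dirty page from here on.
    static void write(Path file, long value) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
            buf.putLong(0, value);
            buf.force(); // optional: push the change to the device before continuing
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the value back as an ordinary file to show that it reached the file.
    static long read(Path file) {
        try {
            return ByteBuffer.wrap(Files.readAllBytes(file)).getLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```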
I have a library which could make memory mapped files easier to use
https://github.com/peter-lawrey/Java-Chronicle
It's not a long read, and you can use it as an example.

Apache Commons Configuration API: http://commons.apache.org/proper/commons-configuration/userguide/howto_filebased.html#File-based_Configurations

Interactions/Communications between a C++ and a Java program

I've got an app written in Java and some native C++ code with system hooks. These two have to communicate with each other. I mean the C++ subprogram must send some data to the Java one. I would've written the whole thing in one language if it was possible for me. What I'm doing now is really silly, but works. I'm hiding the C++ program's window and sending its data to its standard output and then I'm reading that output with Java's standard input!!!
Ok, I know what JNI is but I'm looking for something easier for this (if any exists).
Can anyone give me any idea on how to do this?
Any help will be greatly appreciated.

Sockets & Corba are two techniques that come to mind.
Also, try Google's Protocol Buffers or Apache Thrift.

If you don't find JNI 'easy' then you are in need of an IPC (Inter process communication) mechanism. So from your C++ process you could communicate with your Java one.
What you are doing with your console redirection is a form of IPC; in essence, that's what IPC is.
Since the nature of what you are sending isn't exactly clear, it's very hard to give you a good answer. But if you have 'simple' objects or 'commands' that could be serialised easily into a simple protocol, then you could use a communication protocol such as protocol buffers.
#include <cstddef>
#include <cstring>
#include <fstream>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

int main() {
    // Create an IPC enabled file
    const int FileSize = 1000;
    {
        std::filebuf fbuf;
        fbuf.open("cpp.out", std::ios_base::in | std::ios_base::out
                           | std::ios_base::trunc | std::ios_base::binary);
        // Set the size
        fbuf.pubseekoff(FileSize - 1, std::ios_base::beg);
        fbuf.sputc(0);
    } // fbuf is closed here, so the file has its final size before it is mapped
    // use boost IPC to use the file as a memory mapped region
    namespace ipc = boost::interprocess;
    ipc::file_mapping out("cpp.out", ipc::read_write);
    // Map the whole file with read-write permissions in this process
    ipc::mapped_region region(out, ipc::read_write);
    // Get the address of the mapped region
    void* addr = region.get_address();
    std::size_t size = region.get_size();
    // Write to the memory 0x01
    std::memset(addr, 0x01, size);
    // Flush the mapped region to the file (file_mapping itself has no flush())
    region.flush();
    return 0;
}
Now your java file could open 'cpp.out' and read the contents like a normal file.
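On the Java side, reading that file might look like the sketch below. It maps the file read-only and copies the bytes out; the class and method names are invented, and a plain FileInputStream would work just as well for a one-off read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical reader for the file produced by the C++ process.
class CppOutReader {

    // Maps the whole file read-only and returns a copy of its current contents.
    static byte[] readMapped(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] copy = new byte[region.remaining()];
            region.get(copy);
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the C++ snippet above, CppOutReader.readMapped(Paths.get("cpp.out")) would be expected to return 1000 bytes, each with the value 0x01. Note that nothing here tells the Java side when the C++ side has finished writing; some signalling convention is still needed.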

Two approaches off the top of my head:
1) create two processes and use any suitable IPC;
2) compile C++ app into dynamic library and export functions with standard C interface, those should be callable from any language.

Display % of upload done

I have a java applet for uploading files to server.
I want to display the % of data sent, but when I use ObjectOutputStream.write() it just writes to the buffer and does not wait until the data has actually been sent. How can I achieve this?
Perhaps I need to use thread synchronization or something. Any clues would be most helpful.

Don't use ObjectOutputStream. It's for writing serialized Java objects, not for writing raw binary data. It may indeed block the stream. Rather just write directly to the OutputStream of the URL connection.
That said, the code looks pretty overcomplicated. Even after re-reading several times and blinking my eyes countless times, I can't get it right. I suggest you to send those files according the multipart/form-data encoding with help of Commons HttpClient. You can find here a basic code example. You just have to modify Part[] parts to include all the files. The servlet on the other side can in turn use Commons FileUpload to parse the multipart/form-data request.
To calculate the progress, I'd suggest to pick CountingOutputStream of Commons IO. Just wrap the OutputStream with it and write to it.
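If pulling in Commons IO is not an option, the same idea can be sketched by hand. This is not the Commons IO class: the name ProgressOutputStream, the total-size parameter and the percentage callback are assumptions for the example.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.IntConsumer;

// Hypothetical wrapper: counts bytes passed through and reports a percentage.
class ProgressOutputStream extends FilterOutputStream {
    private final long totalBytes;
    private final IntConsumer onPercent;
    private long written;

    ProgressOutputStream(OutputStream out, long totalBytes, IntConsumer onPercent) {
        super(out);
        if (totalBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be positive");
        }
        this.totalBytes = totalBytes;
        this.onPercent = onPercent;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        advance(1);
    }

    // Overridden so that bulk writes go straight to the wrapped stream
    // and are counted once, instead of byte by byte.
    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        out.write(buf, off, len);
        advance(len);
    }

    private void advance(int n) {
        written += n;
        onPercent.accept((int) (written * 100 / totalBytes));
    }
}
```

The count reflects bytes handed to the wrapped stream. If that stream buffers the whole body internally, the percentage will run ahead of what has actually gone over the network, so the figure is only as accurate as the underlying stream allows.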
Update: if you don't like to ship your applet with more 3rd party libraries (which I imagine is reasonable), then have a look at this code snippet (and the original question mentioned as 1st link) how to create a multipart/form-data body yourself using URLConnection.

What I was looking for was actually:
setFixedLengthStreamingMode(int contentLength)
This prevents any internal buffering, allowing me to know exactly how much data has been sent.
