Is there a generally-accepted way to return a large list of objects using Java EE?
For example, if you had a database ResultSet that had millions of objects how would you return those objects to a (remote) client application?
Another example -- that is closer to what I'm actually doing -- would be to aggregate data from hundreds of sources, normalize it, and incrementally transfer it to a client system as a single "list".
Since all the data cannot fit in memory, I was thinking that a combination of a stateful SessionBean and some sort of custom Iterator that called back to the server would do the trick.
So, in other words, if I have an API like Iterator<Data> getData() then what's a good way to implement getData() and Iterator<Data>?
How have you successfully solved this problem in the past?
Definitely don't duplicate the entire DB into Java's memory. That makes no sense and only makes things unnecessarily slow and memory-hogging. Rather, introduce pagination at the database level: query only the data you actually need to display on the current page, just as Google does.
If you have a hard time implementing this properly and/or figuring out the SQL query for the specific database, then have a look at this answer. For the JPA/Hibernate equivalent, have a look at this answer.
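For illustration, here is a minimal JPA sketch of that kind of pagination (the Item entity and the page parameters are just placeholders; the linked answers cover the raw SQL and Hibernate variants in more detail):
import java.util.List;
import javax.persistence.EntityManager;

public class ItemRepository {

    // Fetch exactly one page worth of rows; the database does the skipping and limiting.
    public List<Item> findPage(EntityManager em, int page, int pageSize) {
        return em.createQuery("SELECT i FROM Item i ORDER BY i.id", Item.class)
                 .setFirstResult(page * pageSize)   // offset of the first row of this page
                 .setMaxResults(pageSize)           // number of rows per page
                 .getResultList();
    }
}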
Update as per the comments (which actually changes the entire question subject...), here's a basic (pseudo) kickoff example:
List<Source> inputSources = createItSomehow();
Source outputSource = createItSomehow();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        outputSource.write(inputSource.read());
    }
}
This way you effectively end up with a single entry in Java's memory instead of the entire collection as in the following (inefficient) example:
List<Source> inputSources = createItSomehow();
List<Entry> entries = new ArrayList<Entry>();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        entries.add(inputSource.read());
    }
}

Source outputSource = createItSomehow();

for (Entry entry : entries) {
    outputSource.write(entry);
}
Pagination is a good solution when working with a web-based UI. Sometimes, however, it is much more efficient to stream everything in one call. The rmiio library was written explicitly for this purpose, and it is already known to work in a variety of app servers.
If your list is huge, you must assume that it can't fit in memory. Or at least that, if your server needs to handle many concurrent accesses, you run a high risk of an OutOfMemoryError.
So basically, what you do is paging combined with batch reading: say you load one thousand objects from your database, send them to the client in the request response, and loop until you have processed all the objects (see the response from BalusC).
The problem is the same on the client side, and you'll likely need to stream the data to the file system to prevent OutOfMemory errors.
Please also note: it is okay to load millions of objects from a database as an administrative task, for example when performing a backup or an export in some 'exceptional' case. But you should not do it for a request that any user could make. It will be slow and drain server resources.
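As a rough sketch of that batch-reading loop on the server side (the table, column and sender names are made up, and whether rows are actually streamed in batches depends on the driver and its fetch-size handling):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class BatchReader {

    // Streams rows in batches of ~1000 instead of materializing the whole result set.
    public void copyAll(DataSource dataSource) throws SQLException {
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false);                // some drivers need this for cursor-based fetching
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id, payload FROM data ORDER BY id")) {
                ps.setFetchSize(1000);               // hint: pull about 1000 rows per round trip
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        send(rs.getLong("id"), rs.getString("payload"));
                    }
                }
            }
        }
    }

    // Hypothetical sender pushing one entry to the client (file, socket, response stream, ...).
    private void send(long id, String payload) {
        // ...
    }
}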
I'm developing a small web app whose servlets periodically access a shared resource, which is a simple text file on the server side holding some lines of mutable data. Most of the time, servlets just read the file for the data, but some servlets may also update it, adding new lines to the file or removing and replacing existing lines. Although the file contents are not updated very often, there is still a small chance of data inconsistency and file corruption if two or more servlets decide to read and write the file at the same time.
The first goal is to make the file reading/writing safe. For this purpose, I've created a helper FileReaderWriter class providing some static methods for thread-safe file access. The read and write methods are coordinated by a ReentrantReadWriteLock. The rule is quite simple: multiple threads may read from the file at any time as long as no other thread is writing to it at the same time.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FileReaderWriter {

    private static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public static List<String> read(Path path) {
        List<String> list = new ArrayList<>();
        rwLock.readLock().lock();   // multiple readers may hold this lock concurrently
        try {
            list = Files.readAllLines(path);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            rwLock.readLock().unlock();
        }
        return list;
    }

    public static void write(Path path, List<String> list) {
        rwLock.writeLock().lock();  // exclusive: blocks until all readers and writers are done
        try {
            Files.write(path, list);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}
Then, every servlet may use the above method for file reading like this:
String dataDir = getServletContext().getInitParameter("data-directory");
Path filePath = Paths.get(dataDir, "test.txt");
List<String> list = FileReaderWriter.read(filePath);
Similarly, writing may be done with the FileReaderWriter.write(filePath, list) method. Note: if some data needs to be replaced or removed (which means fetching the data from the file, processing it, and writing the updated data back to the file), then the whole code path for this operation should be locked by rwLock.writeLock() for atomicity reasons.
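For illustration, such an atomic replace/remove operation could be one more method inside FileReaderWriter, sketched below (it needs an extra import of java.util.function.UnaryOperator; the method name is arbitrary):
// One more method inside FileReaderWriter: the read, the transformation and the write
// all happen under the write lock, so no other reader or writer can interleave.
public static void update(Path path, UnaryOperator<List<String>> change) {
    rwLock.writeLock().lock();
    try {
        List<String> lines = Files.readAllLines(path);
        Files.write(path, change.apply(lines));
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        rwLock.writeLock().unlock();
    }
}
A caller would then pass the whole transformation as a function, so the read-modify-write sequence stays atomic with respect to the other servlets.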
Now that access to the shared file seems to be safe (at least, I hope so), the next step is to make it fast. From a scalability perspective, reading the file on every user request to a servlet doesn't sound reasonable. So what I thought of is to read the contents of the file into an ArrayList (or another collection) only once, during context initialization, and then share this ArrayList (not the file) as a context-scoped data-holder attribute. The context-scoped attribute can then be shared by servlets with the same locking mechanism as described above, and the contents of the updated ArrayList may be independently stored back to the file on a regular basis.
Another solution (in order to avoid locking) would be to use a CopyOnWriteArrayList (or some other collection from the java.util.concurrent package) for holding the shared data and designate a single-threaded ExecutorService to dump its contents into the file when needed. I have also heard of Java memory-mapped files for mapping an entire file into memory, but I'm not sure whether such an approach is appropriate for this particular situation.
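A rough sketch of that second idea, assuming a periodic flush is acceptable (the class name, the flush interval and the choice of a scheduled executor are my own assumptions):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SharedData {

    private final List<String> lines = new CopyOnWriteArrayList<>();  // safe for concurrent reads and writes
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public SharedData(Path file) throws IOException {
        lines.addAll(Files.readAllLines(file));                       // load the file once at startup
        // persist the in-memory copy back to the file every 30 seconds (arbitrary interval)
        flusher.scheduleAtFixedRate(() -> {
            try {
                Files.write(file, lines);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, 30, 30, TimeUnit.SECONDS);
    }

    public List<String> snapshot() {
        return new ArrayList<>(lines);   // copy so callers never observe concurrent modifications
    }

    public void add(String line) {
        lines.add(line);
    }

    public void shutdown() {
        flusher.shutdown();              // e.g. from contextDestroyed(), so the dump thread stops
    }
}
An instance of this could be created in a ServletContextListener and stored as a context attribute, matching the context-scoped approach described above.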
So, could anybody please guide me through the most effective ways (maybe suggesting some other alternatives) to solve this shared-file-access problem, provided that writes to the file are quite infrequent and its contents are not expected to exceed a few dozen lines?
You don't explain your real problem, only your current attempt, so it is difficult to provide a good solution.
Your approach has two serious problems:
Problem 1: concurrency
a shared resource which is a simple text-file on the server side
holding some lines of mutable data
90% of the solution to a problem is a good data structure, and a mutable file is not one. Even popular database engines have important concurrency limitations (e.g. SQLite); don't try to reinvent the wheel.
Problem 2: horizontal scalability
Even if you solve your local concurrency problems (e.g. with synchronized methods), you won't be able to deploy multiple instances (nodes/servers) of your application.
Solution 1: use the right tool for the job
You don't explain exactly the nature of your (data management) problem but probably any NoSQL database will do you good (reading about MongoDB can be a good starting point).
(Bad) solution 2: use FileLock
If for some reason you insist on doing what you describe, use low-level file locks via FileLock. You will only have to deal with file locks (even partial ones), and even these can be distributed horizontally. You won't have to worry about synchronizing other resources either, as file-level locks will suffice.
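A minimal sketch of that approach, using a whole-file exclusive lock around an append (note that FileLock coordinates between processes; threads inside one JVM still need their own synchronization, as in the question's code):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockedFile {

    // Appends one line while holding an exclusive OS-level lock on the whole file.
    public static void appendLine(Path path, String line) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
             FileLock lock = channel.lock()) {   // blocks until no other process holds the lock
            channel.write(ByteBuffer.wrap((line + System.lineSeparator())
                    .getBytes(StandardCharsets.UTF_8)));
        }
    }
}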
(Limited) solution 3: in memory structure
If you don't need horizontal scalability, you can use a shared in-memory structure like a ConcurrentHashMap, but you will lose horizontal scalability and you could lose transactions if you do not persist the information before an application stop.
Conclusion
Although there are more exotic distributed data models, using a database for even a single table may be the best and simplest solution.
There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint timeout.
Meaning, processing speed is less than network throughput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've come up with so far is downloading the incoming data to a temporary file and reading (processing) said file simultaneously, using an OutputStream/InputStream pair.
Sort of buffering, using a file.
This brings its own problems:
what if processing speed becomes faster than downloading speed for some time and I get EOF?
the file parser operates with ObjectInputStream and it behaves weirdly in cases of empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: the HTTP server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app's processing speed is far greater than their sample clients' performance metrics. Still, we cannot match our app's performance with the network throughput.
The server does not support HTTP range requests or pagination.
There's no way to divide the data into chunks for loading, as there's no filtering attribute to guarantee that every chunk will be small enough.
In short: we can download all the data within the given time before the timeout occurs, but we cannot process it.
Having an adapter between the InputStream and the OutputStream, performing as a blocking queue, would help a ton.
You're using something like new ObjectInputStream(new FileInputStream(...)), and the solution for EOF could be wrapping the FileInputStream first in a WriterAwareStream which would block when hitting EOF as long as the writer is still writing.
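WriterAwareStream is not an existing library class, just a name for the idea; a hand-rolled sketch could look like this, with the downloading thread flipping a shared flag once it has finished writing the temp file:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper: EOF of the underlying file only counts once the writer has finished.
public class WriterAwareStream extends FilterInputStream {

    private final AtomicBoolean writerDone;   // the downloading thread sets this to true when done

    public WriterAwareStream(InputStream in, AtomicBoolean writerDone) {
        super(in);
        this.writerDone = writerDone;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n;
        while ((n = super.read(buf, off, len)) == -1 && !writerDone.get()) {
            try {
                Thread.sleep(100);            // crude back-off; a lock/condition would be nicer
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for the writer", e);
            }
        }
        return n;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xFF;
    }
}
The ObjectInputStream would then be built on top of this wrapper instead of the raw FileInputStream, and the download thread would call writerDone.set(true) once the HTTP body is fully written.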
Anyway, if latency doesn't matter much, I would not bother starting processing before the download has finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream that internally uses a queue, reads from its input stream, and, in case it has a lot of data, spits it out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid it's optimized for small sizes (with tens of gigabytes expected, there's no point in trying to use memory).
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes Range Requests:
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
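According to the update, the vendor's server currently rejects range requests, so this only helps if that ever changes; still, for reference, pulling one chunk at a time with the JDK 11 HttpClient could look roughly like this (URL and chunk boundaries are placeholders):
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RangeDownloader {

    // Requests bytes [offset, offset + length - 1]; a 206 Partial Content status means the
    // server honoured the Range header, a plain 200 means it sent the whole representation.
    public static byte[] fetchChunk(HttpClient client, String url, long offset, long length)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
                .GET()
                .build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        return response.body();
    }
}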
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
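A rough sketch of the publishing side of that idea with the Kafka producer API (the topic name, chunk size and the use of the series ID as the record key are all my own assumptions):
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ChunkPublisher {

    public static void publish(InputStream httpBody, String seriesId) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] buffer = new byte[1024 * 1024];   // 1 MB chunks (arbitrary size)
            int read;
            while ((read = httpBody.read(buffer)) != -1) {
                // same key => same partition, so the broker preserves chunk order via offsets
                producer.send(new ProducerRecord<>("raw-data-chunks", seriesId, Arrays.copyOf(buffer, read)));
            }
            // an empty record acts as the "done"/EOF marker for this series
            producer.send(new ProducerRecord<>("raw-data-chunks", seriesId, new byte[0]));
        }
    }
}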
Not sure if this would help in your case because you haven't mentioned what structure and format the data come to you in; however, I'll assume a beautifully normalised, deeply nested hierarchical XML (i.e. pretty much the worst case for streaming, right? ... pega bix?)
I propose a partial solution that could allow you to sidestep the limitation of not being able to control how your client interacts with the HTTP data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in HTTP querying library, a command-line util such as aria2c, curl, wget et al., an ETL (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying XML on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this), but this would take effort - I could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version of 'new data' is available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Camel could maybe help you regulate the network load between the REST producer and consumer.
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice);
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, maybe you can use an in-memory data store like Redis.
When you get data from your REST endpoint, you can save the data into a Redis list (or any other data structure which is appropriate for you).
Your consumer will consume data from the list.
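With the Jedis client, for example, this could look roughly like the following (the list key and host handling are placeholders):
import redis.clients.jedis.Jedis;

public class RedisBuffer {

    // Producer side: push each record received from the REST endpoint onto a Redis list.
    public static void push(Jedis jedis, String record) {
        jedis.rpush("incoming-data", record);
    }

    // Consumer side: pop one record at a time, at the application's own pace.
    public static String pop(Jedis jedis) {
        return jedis.lpop("incoming-data");   // returns null when the list is empty
    }
}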
Guys
I have the following code to add visited links in my crawler.
After extracting links, I have a for loop which loops through each individual href tag.
After I have visited a link and opened it, I add the URL to a visited-links collection variable, defined as:
private final Collection<String> urlForntier = Collections.synchronizedSet(new HashSet<String>());
The crawler implementation is multithreaded. Assume I have visited 100,000 URLs; if I don't terminate the crawler, the collection will grow day by day, and it will create memory issues. What options do I have to refresh the variable without creating inconsistency across threads?
Thanks in advance!
If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.
Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.
See https://github.com/crawler-commons/url-frontier
The most practical approach for modern crawling systems is to use a NoSQL database.
This solution is notably slower than a HashSet. That is why you can leverage a different caching strategy such as Redis, or even Bloom filters.
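For instance, a Guava BloomFilter answers "have I probably seen this URL before?" in a fixed amount of memory (the sizing numbers below are arbitrary, and false positives mean a few URLs may be skipped):
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class VisitedUrls {

    // Roughly 10 million URLs with a 1% false-positive rate; tune for your crawl size.
    private final BloomFilter<CharSequence> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    // Returns true the first time a URL is offered, false if it has (probably) been seen before.
    public synchronized boolean markVisited(String url) {
        if (seen.mightContain(url)) {
            return false;
        }
        seen.put(url);
        return true;
    }
}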
But given the specific nature of URLs, I'd like to recommend the Trie data structure, which gives you lots of options to manipulate and search by URL string. (A discussion of Java implementations can be found in this Stack Overflow topic.)
As per the question, I would recommend using Redis to replace the use of the Collection. It's an in-memory database for data-structure storage and is super fast at inserting and retrieving data, with support for all standard data structures. In your case that is a Set, and you can check the existence of a key in the set with the SISMEMBER command.
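A small sketch of that with the Jedis client (the key name is a placeholder); SADD already reports whether the member is new, so one atomic call both records and tests the URL:
import redis.clients.jedis.Jedis;

public class RedisFrontier {

    // SADD returns 1 only when the member was newly added, 0 when it was already present.
    public static boolean markVisited(Jedis jedis, String url) {
        return jedis.sadd("visited-urls", url) == 1;
    }

    // SISMEMBER is the plain membership check named above.
    public static boolean alreadyVisited(Jedis jedis, String url) {
        return jedis.sismember("visited-urls", url);
    }
}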
Apache Nutch is also good to explore.
I have an RDF file that is 7 MB and has ~80k statements.
When starting the application, I have the following code, which retrieves a list of items I need to show to the user:
NodeIterator iterator = technologyModel.listObjectsOfProperty(subject);
while (iterator.hasNext()) {
    RDFNode node = iterator.nextNode();
    myCollection.add(node.asLiteral().getString().trim());
}
Note: this code works just fine and returns about 3k results, and this is the first time the "technologyModel" is accessed.
Obviously, before doing that, I have to load the dataset/model, and here is the problem.
Case (1) When I load the dataset/model from a RDF file, doing this:
InputStream in = FileManager.get().open(ParamsHelper.sourceRDF);
technologyModel.read(in, "RDF/XML-ABBREV");
the technologyModel seems instantly loaded and the first code posted runs in less than a second.
Case (2) However, when I try to load the model from a TDB database (previously loaded with the same RDF file used on first case), with this code:
dataset = TDBFactory.createDataset(ParamsHelper.tdbBaseDir);
dataset.begin(ReadWrite.READ) ;
technologyModel = dataset.getNamedModel("http://a.example.biz/technology");
dataset.end();
the technologyModel doesn't seem to be instantly loaded, and even though the first code posted returns as expected, it runs in about 30 seconds on the first call.
If I call that same code after the first time, or, for example, insert another operation like technologyModel.listSubjects() before calling this code for the first time, it will run immediately, as expected.
It seems to me that in the second case, the model is really loaded only after the first operation performed on it. Does that make any sense?
I don't want to keep my data in an RDF file, but rather have a TDB database storing the triples. That's why the second option seems to fit me better.
Can anyone help me with this? I hope I have described the problem correctly.
Thanks in advance.
There are two effects here:
TDBFactory.createDataset doesn't load any data - it connects to the database. Data is loaded into memory (cached) as it is used, so when you are doing listObjectsOfProperty the first time, all caches are cold and the database may well be slow. It will be quite sensitive to the hardware you are running on at this point.
The second is that Model API calls can have access patterns that are database-unfriendly. It is better to use SPARQL on the dataset.
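For example, the lookup from the question could be expressed as a SPARQL query over the dataset; the sketch below assumes a current Apache Jena release (org.apache.jena packages) and uses a placeholder property URI in place of the one passed to listObjectsOfProperty:
import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.query.ResultSet;

public class TechnologyQuery {

    public static void listValues(Dataset dataset) {
        String sparql =
            "SELECT ?value WHERE { GRAPH <http://a.example.biz/technology> " +
            "{ ?s <http://a.example.biz/someProperty> ?value } }";
        dataset.begin(ReadWrite.READ);           // TDB access happens inside a read transaction
        try {
            QueryExecution qexec = QueryExecutionFactory.create(sparql, dataset);
            try {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.getLiteral("value").getString().trim());
                }
            } finally {
                qexec.close();
            }
        } finally {
            dataset.end();
        }
    }
}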
By the way: listObjectsOfProperty does not take a subject - it takes a property and can access a lot of the database. If myCollection is a set, then you may be adding a lot more than 3k items.
Let's say I want to calculate the cumulative estimate of my defects. I do
double estimate = 0.0;
Double tEstimate = 0.0;
Collection<Defect> defects = project.getDefects(null);

for (Defect d : defects) {
    tEstimate = d.getEstimate();
    if (tEstimate != null) {
        estimate += tEstimate;
    }
}
Here each call to d.getEstimate() does a callback to the server, meaning this code runs extremely slowly. I would like to take the one-time performance hit up front and download all the info along with the Defect object, probably including getting some information I won't use, but avoid hitting the latency of a server callback during each iteration of the loop.
You are using the VersionOne Object Model SDK. It does lack robustness because of the very thing you are complaining about. One of the inefficiencies is that when it knows you are requesting a list of assets, it first gets all of the assets with a predetermined set of attributes such as AssetState and checks whether each is a dead asset. After this, it makes another call to get the same list of assets again, but with your specified attributes. This could be remedied by applying a greedy algorithm that grabs a set of attributes such that each member of the set is returned regardless of which attributes are requested in your .get_() method. Why? This already (sort of) happens in the REST-based VersionOne API as it stands. If the query returned all attributes, it would probably be a little wasteful, especially for humongous backlogs.
Anyway, VersionOne will be deprecating the Object Model in the near future, so if you plan on doing a lot of coding using the OM, consider this.
Here are some ways to circumvent this problem
1) Rewrite your code to use the VersionOne APIClient SDK. It has XML plumbing that will save you a lot of time writing your own. This is a little more verbose, but it is more powerful, fast and efficient. The Object Model is actually built upon the APIClient.
2) Rewrite your code using Java and the raw VersionOne REST API - this requires that you understand HTTP and the VersionOne REST API.
3) If you cannot change from the Object Model, you can mix the two SDKs. When you need to read large amounts of data, just use APIClient code to manage that segment of the code. This is kind of pointless when you can just learn the APIClient and use it exclusively, unless you have a huge investment in the Object Model and can't change. The code gets mucky real fast. Not recommended.
The rest-1.v1 API endpoint exposes operations for assets, including DeepCopy. There is no client code that enumerates all of the operations, so you must first explore the asset using the meta.v1 API endpoint. Using the API Client backdoor from the Object Model, you can get to the classes that will allow you to call an operation once you know the name.