Spring JPA and Streaming - Is the data fetched incrementally?

I am looking at streaming query results section of the Spring documentation. Does this functionality fetch all the data at once but provide it as a stream? Or does it fetch data incrementally so that it will be more memory efficient?
If it doesn't fetch data incrementally, is there any other way to achieve this with spring data jpa?

It depends on your platform.
Instead of simply wrapping the query results in a Stream, data-store-specific methods are used to perform the streaming.
With MySQL, for example, the results are truly streamed, but if the underlying data store (or the driver being used) doesn't support such a mechanism (yet), it won't make a difference.
MySQL is, IIRC, currently the only driver that can stream in this fashion without additional configuration, whereas other databases/drivers rely on the standard fetch size setting, as described by the venerable Vlad Mihalcea here: https://vladmihalcea.com/whats-new-in-jpa-2-2-stream-the-result-of-a-query-execution/. Note the trade-off between performance and memory use. Other databases will most likely need a reactive database client to perform true streaming at all.
Whatever the underlying streaming method, what matters most is how you process the stream. Using Spring's StreamingResponseBody, for example, would allow you to stream large amounts of data directly from the database to the client with minimal memory use. Still, it's a very specific use case, so don't start streaming everything just yet unless you're sure it's worth it.
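To make the difference between "fetch everything, then wrap it in a Stream" and "fetch in batches as the Stream is consumed" concrete, here is a minimal JDK-only sketch. It does not use Spring Data or JDBC: PagedSource, fetchPage and the generated integer rows are hypothetical stand-ins for a driver that honors a fetch size, and roundTrips only counts simulated round trips.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Hypothetical stand-in for a driver that returns rows in batches of fetchSize. */
class PagedSource {
    private final int totalRows;
    private final int fetchSize;
    int roundTrips = 0; // number of simulated database round trips so far

    PagedSource(int totalRows, int fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        this.totalRows = totalRows;
        this.fetchSize = fetchSize;
    }

    /** Simulates one round trip returning at most fetchSize rows starting at offset. */
    private List<Integer> fetchPage(int offset) {
        roundTrips++;
        List<Integer> page = new ArrayList<>();
        for (int i = offset; i < Math.min(offset + fetchSize, totalRows); i++) {
            page.add(i);
        }
        return page;
    }

    /** Lazy stream: a page is fetched only when the buffered rows are used up. */
    Stream<Integer> stream() {
        Iterator<Integer> rows = new Iterator<Integer>() {
            private Iterator<Integer> buffer = Collections.emptyIterator();
            private int offset = 0;

            @Override
            public boolean hasNext() {
                if (!buffer.hasNext() && offset < totalRows) {
                    List<Integer> page = fetchPage(offset);
                    offset += page.size();
                    buffer = page.iterator();
                }
                return buffer.hasNext();
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return buffer.next();
            }
        };
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(rows, Spliterator.ORDERED), false);
    }
}
```

With new PagedSource(1000, 100).stream().limit(5), only the first page is fetched, so at most fetchSize rows are held in memory at a time. A larger fetch size means fewer round trips but more memory per batch, which is the trade-off mentioned above.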

Related

Custom caching implementation in Java

I want to implement some sort of lightweight caching in Java which is easily integrable in Java and should be easy to deploy with a Java application.
The cache layer will be between the application and the database layer: no database caching, no Spring, no Hibernate, no EHcache, no http caching.
We can use a file system or a nano database so that the cache can be restored after the process restarts.
I tried LRU Cache:
http://stackoverflow.com/questions/224868/easy-simple-to-use-lru-cache-in-java
http://www.programcreek.com/2013/03/leetcode-lru-cache-java/
But I am not sure what to do after overflow: should I save the data into a database (and which database would be best for fast inserts and lookups), or should I use the file system?
Does anyone have better input on implementing a caching mechanism in Java?

But I am not sure what to do after overflow: should I save the data into a database (and which database would be best for fast inserts and lookups), or should I use the file system?
It depends on the use case. If your cached values are very big, you can store each of them in its own file and use the hash of the cache key as the file name.
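A minimal sketch of that file-per-entry idea, using only the JDK. The class name FilePerEntryStore and its methods are made up for illustration, and choosing SHA-256 as the hash is an assumption; the sketch ignores atomic writes, concurrency and eviction.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Stores each cached value in its own file, named after the SHA-256 hash of the key. */
class FilePerEntryStore {
    private final Path dir;

    FilePerEntryStore(Path dir) {
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.dir = dir;
    }

    /** Maps a key to a fixed-length, file-system-safe hex file name. */
    Path fileFor(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(key.getBytes(StandardCharsets.UTF_8));
            return dir.resolve(String.format("%064x", new BigInteger(1, hash)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Writes (or overwrites) the file belonging to the key. */
    void put(String key, byte[] value) {
        try {
            Files.write(fileFor(key), value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the stored bytes, or null if there is no file for the key. */
    byte[] get(String key) {
        Path file = fileFor(key);
        try {
            return Files.exists(file) ? Files.readAllBytes(file) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the file belonging to the key, if any. */
    void remove(String key) {
        try {
            Files.deleteIfExists(fileFor(key));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Hashing the key gives every entry a file name of the same length with no characters the file system could reject, whatever the key looks like.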
If your values are small, storing them as separate files would be a lot of overhead, so it is better to store the cached entries in one file or a few files. To implement this you need to learn about "external indexes" and "memory management" or "free space management" (e.g. best fit, next fit and compaction strategies). This actually leads to the implementation of a tiny database, so maybe just use one :) Some that come to mind: LevelDB, MapDB, LMDB, RocksDB
Keep in mind that cache operations come in concurrently from the application, so the cache may evict a value while a request for the same key comes in at the same time. Will you implement just the basic operations like Cache.get and Cache.put, or also CAS operations like Cache.putIfAbsent? Do you want to use multi-core systems efficiently, as they are common today?
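As a baseline for those questions, here is a small sketch of an in-memory LRU cache with get, put and putIfAbsent, built on the JDK's LinkedHashMap. The class name SimpleLruCache is made up for illustration. It is safe under concurrent access only because every operation takes the same lock, which is exactly what does not scale on multi-core systems.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Small LRU cache; all operations are serialized on one lock. */
class SimpleLruCache<K, V> {
    private final LinkedHashMap<K, V> map;

    SimpleLruCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first,
        // and removeEldestEntry evicts it once the limit is exceeded.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value or null; counts as a use of the entry. */
    synchronized V get(K key) {
        return map.get(key);
    }

    /** Stores the value, possibly evicting the least recently used entry. */
    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    /** Stores the value only if the key is absent; returns the existing value or null. */
    synchronized V putIfAbsent(K key, V value) {
        return map.putIfAbsent(key, value);
    }

    synchronized int size() {
        return map.size();
    }
}
```

Because eviction and putIfAbsent run under the same lock, a value cannot be evicted halfway through a concurrent request for the same key. A cache aimed at high concurrency would replace the single lock with finer-grained structures, which is where much of the engineering effort goes.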
Still, when using a tiny database, you will need to prepare for some months of engineering work.
Does anyone have better input on implementing a caching mechanism in Java?
You can read my blog at cruftex.net for more input on implementing lightweight and fast caching in Java.
For a cache implementation with overflow, you can take a look at imcache. But imcache is not a fully fledged generic cache, because CAS operations, for example, are missing; see the Cache interface.
My own high-performance Java cache implementation, cache2k, features CAS operations, events, loaders and writers, expiry, etc., and it will eventually get some overflow to disk, too. However, I am not sure about the time frame... If you are interested in working in this area, contributions are welcome!

Choice between REST API or Java API

I have been reading about Neo4j for the last few days. I got very confused about whether I need to use the REST API or whether I can go with the Java APIs.
My need is to create millions of nodes with connections among them. I want to add indexes on a few node attributes for searching. Initially I started with the embedded mode of GraphDB with the Java API, but I soon hit an OutOfMemory error with indexing on a few nodes. So I thought it would be better to run Neo4j as a service and connect to it through the REST API; it would then do all the memory management itself by swapping data in and out of the underlying files. Is my assumption right?
Further, I plan to scale my solution to billions of nodes, which I believe won't be possible with a single machine's Neo4j installation. I also believe Neo4j can run in distributed mode. For this reason, too, I thought continuing with a REST API implementation was the best idea.
However, I couldn't find any good documentation on how to run Neo4j in a distributed environment.
Can I do stuff like batch insertion, etc. using REST APIs as well, which I do with Java APIs with Graph DB running in embedded mode?

Do you know why you are getting your OutOfMemory exception? It sounds like you are creating all these nodes in the same transaction, which causes them to live in memory. Try committing small chunks at a time, so that Neo4j can write them to disk. You don't have to manage Neo4j's memory yourself, aside from things like the cache.
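The chunked-commit pattern can be sketched independently of any database. In the sketch below, Tx is a hypothetical one-method interface standing in for whatever transaction handle your database API provides; it is NOT the Neo4j API, and ChunkedWriter and writeInChunks are made-up names.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical minimal transaction handle; NOT the Neo4j API. */
interface Tx {
    void commit();
}

class ChunkedWriter {
    /**
     * Applies work to every item, committing after each batchSize items
     * so that no single transaction holds the whole data set in memory.
     * Returns the number of commits performed.
     */
    static <T> int writeInChunks(List<T> items, int batchSize,
                                 Supplier<Tx> beginTx, Consumer<T> work) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int pending = 0;
        Tx tx = null;
        for (T item : items) {
            if (tx == null) {
                tx = beginTx.get(); // start a new transaction lazily
            }
            work.accept(item);
            if (++pending == batchSize) {
                tx.commit();
                commits++;
                tx = null;
                pending = 0;
            }
        }
        if (tx != null) {
            tx.commit(); // commit the final, partially filled chunk
            commits++;
        }
        return commits;
    }
}
```

For 10 items and a batch size of 4, this commits three times (4 + 4 + 2 items). A suitable batch size depends on how much memory each pending change occupies, so it is worth measuring.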
Distributed mode uses a master/slave architecture, so you'll still have a copy of the entire DB on each system. Neo4j is very efficient for disk storage: a node takes 9 bytes, a relationship takes 33 bytes, and properties vary in size.
There is a Batch REST API, which groups many calls into a single HTTP call; however, making REST calls is still slower than running embedded.
There are some disadvantages to using the REST API that you did not mention, such as transactions. If you need atomic operations, where you create several nodes and relationships and change properties, and nothing is committed if any step fails, you cannot do this with the REST API.

Options for In-memory databases (Open source and Java-based)

I have a web app that makes external web service calls on behalf of its clients. I want to cache the data returned by some of the web services in the web app so that other clients can reuse this data and run filters and queries on the cached data.
The current architecture of the web app uses Apache Camel, Spring and Jetty. I'm looking for in-memory database options and their pros and cons.

Hazelcast (Java API) - you can distribute the in-memory datagrid (with map, multimap, sets, lists, queues, topics) over multiple nodes very easily & use load/store interface implementation with a disk based DB. You can do something similar with EHCache.
Redis is another option (use the Java client to access it). You can simply configure the conf file to write data to disk (or avoid it altogether) & should not have to write your own load/store classes.
Besides these, there are a number of options you could use. Not sure if you are only looking at open source options, looking at distributed options or not.
Hope it helps.

Have you considered using MemCached? It is not a database, but a caching system you can control from inside your application.
Here are a few more thoughts about in-memory databases. First, almost every modern RDBMS has a memory caching system inside it. The more memory you give to the database server (and configure for caching), the more it will keep in memory for later. If you put together a system with enough memory to cache all the tables, you will have an "in memory" cache without the overhead of another database.
Most total "in memory" databases are used for high volume/large data systems where performance is totally key. And, because they are for extreme performance systems, you are going to pay for them. Or more specifically, pay extra for them. For example, the SAP/Sybase DB's that support full in-memory can cost you from 40% to 300% more than our existing products.
So, in answer to your question, do you really need one?

Try Redisson: distributed and scalable versions of familiar Java data structures (Set, Map, ConcurrentMap, List, Queue, Lock, AtomicLong, CountDownLatch, Publish / Subscribe) on top of the in-memory database Redis.

Can a streaming collection be implemented in Java?

I need to implement a utility server that tracks a few custom variables sent from any other server. To track the variables, a key-value collection, either JDK-defined or custom, needs to be used.
Here are a few considerations:
Keeping all the variables in the server's memory all the time is memory intensive.
This server needs to be very lightweight, and I do not want heavy database operations.
Is there a pre-defined streaming collection that can serialize the data once memory use passes a threshold and retrieve it as needed?
I hope I have defined the problem clearly.
Please suggest a better approach if there is one.

This looks very promising, but is still in the development stage:
JDBM3
Edit: the current version of the file-backed collections is MapDB.

Database
What you've described sounds exactly like you should use a database (i.e. indexed key/value store, too big for memory but want performance benefits of in-memory caching where possible).
I'd recommend a lightweight embedded database such as H2 - it's small, fast and should suit your purposes very well.

Have you thought of using an off-the-shelf NoSQL key-value store? Redis, for example?
If you want it Java-only, you have the option of using a library like Ehcache, which would have the functionality you need.

java embedded database w/ ability to store as one file

I need to create a storage file format for some simple data in a tabular format, was trying to use HDF5 but have just about given up due to some issues, and I'd like to reexamine the use of embedded databases to see if they are fast enough for my application.
Is there a reputable embedded Java database out there that has the option to store data in one file? The only one I'm aware of is SQLite (Java bindings available). I tried H2 and HSQLDB but out of the box they seem to create several files, and it is highly desirable for me to have a database in one file.
edit: reasonably fast performance is important. Object storage is not; for performance concerns I only need to store integers and BLOBs. (+ some strings but nothing performance critical)
edit 2: storage data efficiency is important for larger datasets, so XML is out.

Nitrite Database http://www.dizitart.org/nitrite-database.html
NOsql Object (NO2 a.k.a Nitrite) database is an open source nosql
embedded document store written in Java with MongoDB like API. It
supports both in-memory and single file based persistent store.

H2 uses only one file, if you use the latest H2 build with the PAGE_STORE option. It's a new feature, so it might not be solid.

If you only need read access then H2 is able to read the database files from a zip file.
Likewise if you don't need persistence it's possible to have an in-memory only version of H2.
If you need both read/write access and persistence, then you may be out of luck with standard SQL-type databases, as these pretty much all uniformly maintain the index and data files separately.

I once used an object database that saved its data to a file. It has a Java and a .NET interface. You might want to check it out. It's called db4o.

Chronicle Map is an embedded pure Java database.
It stores data in one file, i.e.:
    ChronicleMap<Integer, String> map = ChronicleMap
        .of(Integer.class, String.class)
        .averageValue("my-value")
        .entries(10_000)
        .createPersistedTo(databaseFile);
Chronicle Map is mature (no severe storage bugs reported for months now, while it's in active use).
Independent benchmarks show that Chronicle Map is the fastest and the most memory-efficient key-value store for Java.
The major disadvantage for your use case is that Chronicle Map supports only a simple key-value model; however, a more complex solution could be built on top of it.
Disclaimer: I'm the developer of Chronicle Map.

If you are looking for a small and fast database to ship with another program, I would check out Apache Derby. I don't know how you would define an embedded database, but I have used it in some projects as a debugging database that can be checked in with the source and is instantly available on every developer machine.

This isn't an SQL engine, but if you use Prevayler with XStream, you can easily create a single XML file with all your data. (Prevayler calls it a snapshot file.)
Although it isn't SQL-based, and so requires a little elbow grease, its self-contained nature makes development (and especially good testing) much easier. Plus, it's incredibly fast and reliable.

You may want to check out jdbm - we use it on several projects, and it is quite fast. It does use 2 files (a database file and a log file) if you are using it for ACID type apps, but you can drop directly to direct database access (no log file) if you don't need solid ACID.
JDBM will easily support integers and blobs (anything you want), and is quite fast. It isn't really designed for concurrency, so you have to manage the locking yourself if you have multiple threads, but if you are looking for a simple, solid embedded database, it's a good option.

Since you mentioned sqlite, I assume that you don't mind a native db (as long as good java bindings are available). Firebird works well with java, and does single file storage by default.
Both H2 and HSQLDB would be excellent choices, if you didn't have the single file requirement.

I think for now I'm just going to continue to use HDF5 for the persistent data storage, in conjunction with H2 or some other database for in-memory indexing. I can't get SQLite to use BLOBs with the Java driver I have, and I can't get embedded Firebird up and running, and I don't trust H2 with PAGE_STORE yet.
