Hazelcast keySet streaming? - java

I'm new to Hazelcast, and I'm trying to use it to store data in a map that is too large to fit on a single machine.
One of the processes I need to implement is to go over each of the values in the map and do something with it - no accumulation or aggregation, and I don't need to see all the data at once, so there is no memory concern there.
My trivial implementation would be to call IMap.keySet() and then iterate over the keys, fetching each stored value in turn (and allowing it to be GCed after processing), but my concern is that there will be so much data in the system that even just retrieving the set of keys will put undue stress on the system.
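For illustration, a rough sketch of what I mean (MyEntity and process() are placeholders for my actual types and logic):
IMap<String, MyEntity> map = hazelcastInstance.getMap("data");
// keySet() pulls every key in the cluster onto this member at once
for (String key : map.keySet()) {
    MyEntity value = map.get(key); // one network round trip per key
    process(value);                // value becomes eligible for GC after this
}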
I was hoping there was a streaming API with which I could stream keys (or even full entries) in such a way that the local node does not have to cache the entire set locally - but I failed to find anything that seemed relevant in the documentation.
I would appreciate any suggestions that you may come up with. Thanks.

Hazelcast Jet provides a distributed version of java.util.stream and adds "streaming" capabilities to IMap.
It allows you to execute the Java Streams API on the Hazelcast cluster.
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.stream.IStreamList;
import com.hazelcast.jet.stream.IStreamMap;

import static com.hazelcast.jet.stream.DistributedCollectors.toIList;

JetInstance instance1 = Jet.newJetInstance(); // or Jet.newJetClient() to connect to an existing cluster

final IStreamMap<String, Integer> streamMap = instance1.getMap("source");
// stream of entries; you can grab the keys from it
IStreamList<String> counts = streamMap.stream()
        .map(entry -> entry.getKey().toLowerCase())
        .filter(key -> key.length() >= 5)
        .sorted()
        // the result is stored on the cluster as well,
        // so there is no data movement between client and cluster
        .collect(toIList());
Please find more info about Jet here and more examples here.
Cheers,
Vik

While the Hazelcast Jet stream implementation looks impressive, I didn't have a lot of time to invest in looking at upgrading to Hazelcast Jet (in our pretty much bog-standard vert.x setup). Instead I used IMap.executeOnEntries, which seems to do about the same thing as detailed for Hazelcast Jet by @Vik Gamov, except with a more annoying syntax.
My example:
myMap.executeOnEntries(new EntryProcessor<String, MyEntity>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Object process(Entry<String, MyEntity> entry) {
        entry.getValue().fondle();
        return null;
    }

    @Override
    public EntryBackupProcessor<String, MyEntity> getBackupProcessor() {
        return null;
    }
});
As you can see, the syntax is quite annoying:
* We need to create an actual object that can be serialized to the cluster - no fancy lambdas here (don't use my serial version UID if you copy & paste this - it's broken by design).
* One reason it can't be a lambda is that the interface is not functional - you need another method to handle backup copies (or at least to declare that you don't want to handle them, as I do). While I acknowledge the importance of backup handling, it isn't important all of the time, and I'd guess it only matters in rare cases. (A possible way to trim this boilerplate is sketched after this list.)
* Obviously you can't (or at least it isn't trivial to) return data from the process - which doesn't matter in my case, but still.
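If you happen to be on a Hazelcast 3.x release (a stock vert.x cluster setup usually is), AbstractEntryProcessor may cut some of this down, since it provides the backup-processor plumbing for you. A minimal sketch, assuming the 3.x API (the class was removed in later major versions, so check against your version):
// Assumes Hazelcast 3.x: com.hazelcast.map.AbstractEntryProcessor supplies
// getBackupProcessor(); passing 'false' means the processor is not applied
// to backup replicas at all.
myMap.executeOnEntries(new AbstractEntryProcessor<String, MyEntity>(false) {
    private static final long serialVersionUID = 1L;

    @Override
    public Object process(Entry<String, MyEntity> entry) {
        entry.getValue().fondle(); // same hypothetical per-entry work as above
        return null;
    }
});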

Related

Sandboxed java scripting replacement for Nashorn

I've been using Nashorn for awk-like bulk data processing. The idea is that there's a lot of incoming data, coming row by row, one after another, and each row consists of named fields. These data are processed by user-defined scripts stored somewhere externally and editable by users. The scripts are simple, like if (c > 10) a = b + 3, where a, b and c are fields in the incoming data rows. The amount of data is really huge. The code looks like this (an example to show the use case):
ScriptEngine engine = new NashornScriptEngineFactory().getScriptEngine(
        new String[]{"-strict", "--no-java", "--no-syntax-extensions", "--optimistic-types=true"},
        null,
        scr -> false);
Invocable inv = (Invocable) engine;
Bindings bd = engine.getBindings(ScriptContext.ENGINE_SCOPE);
bd.remove("load");
bd.remove("loadWithNewGlobal");
bd.remove("exit");
bd.remove("eval");
bd.remove("quit");

String scriptText = readScriptText();
CompiledScript cs = ((Compilable) engine).compile("function foo() {\n" + scriptText + "\n}");
cs.eval();

Map params = readIncomingData();
while (params != null) {
    Map<String, Object> res = (Map) inv.invokeFunction("foo", params);
    writeProcessedData(res);
    params = readIncomingData();
}
Now Nashorn is obsolete and I'm looking for alternatives. I've been googling for a few days but didn't find an exact match for my needs. The requirements are:
* Speed. There's a lot of data, so it has to be really fast; I assume precompilation is a must.
* It has to work under Linux/OpenJDK.
* It must support sandboxing, at least for data access/code execution.
Nice to have:
* Simple, C-like syntax (not Lua ;)
* Sandboxing for CPU usage.
So far I've found that Rhino is still alive (last release dated 13 Jan 2020), but I'm not sure whether it is still supported and how fast it is - as I remember, one of the reasons Java switched to Nashorn was speed, and speed is very important in my case. I also found J2V8, but Linux is not supported. GraalVM looks like a bit of an overkill, and I haven't figured out how to use it for such a task yet - maybe I need to explore further whether it is suitable, but it looks like a complete JVM replacement and cannot be used as a library.
It doesn't necessarily have to be JavaScript; maybe there are other alternatives.
Thank you.
GraalVM's JavaScript can be used as a library, with the dependencies obtained like any other Maven artifact. While the recommended way to run it is on the GraalVM distribution, there are some explanations of how to run it on OpenJDK.
You can restrict what a script has access to, like Java classes, creating threads, etc. (a small sketch follows the list below):
From the documentation:
The following access parameters may be configured:
* Allow access to other languages using allowPolyglotAccess.
* Allow and customize access to host objects using allowHostAccess.
* Allow and customize host lookup to host types using allowHostLookup.
* Allow host class loading using allowHostClassLoading.
* Allow the creation of threads using allowCreateThread.
* Allow access to native APIs using allowNativeAccess.
* Allow access to IO using allowIO and proxy file accesses using fileSystem.
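For a concrete starting point, here is a minimal sketch using the GraalVM polyglot API (org.graalvm.polyglot); the a/b/c variables mirror the fields from the question and are only illustrative, so check the exact builder options against the GraalVM version you use:
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class RuleRunner {
    public static void main(String[] args) {
        // Nothing is allowed unless explicitly enabled: no host access,
        // no thread creation, no native access, no IO.
        try (Context context = Context.newBuilder("js")
                .allowAllAccess(false)
                .build()) {

            Value bindings = context.getBindings("js");
            // Expose one incoming "row" as plain global variables (field names are examples).
            bindings.putMember("a", 0);
            bindings.putMember("b", 5);
            bindings.putMember("c", 42);

            // The user-defined rule from the question.
            context.eval("js", "if (c > 10) a = b + 3;");

            System.out.println(bindings.getMember("a").asInt()); // prints 8
        }
    }
}
For many rows you would typically parse the script once via Source and re-evaluate it per row, or expose each row as a ProxyObject instead of individual globals; the GraalVM embedding documentation covers both.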
And it is several times faster than Nashorn. Some measurements can be found for example in this article:
GraalVM CE provides performance comparable or superior to Nashorn with
the composite score being 4 times higher. GraalVM EE is even faster.

Persisting state into Kafka using Kafka Streams

I am trying to wrap my head around Kafka Streams and have some fundamental questions that I can't seem to figure out on my own. I understand the concept of a KTable and Kafka state stores, but am having trouble deciding how to approach this. I am also using Spring Cloud Stream, which adds another level of complexity on top of this.
My use case:
I have a rule engine that reads in a Kafka event, processes the event, returns a list of rules that matched and writes it into another topic. This is what I have so far:
@Bean
public Function<KStream<String, ProcessNode>, KStream<String, List<IndicatorEvaluation>>> process() {
    return input -> input.mapValues(this::analyze).filter((host, evaluation) -> evaluation != null);
}

public List<IndicatorEvaluation> analyze(final String host, final ProcessNode process) {
    // Does stuff
}
Some of the stateful rules look like:
[some condition] REPEATS 5 TIMES WITHIN 1 MINUTE
[some condition] FOLLOWEDBY [some condition] WITHIN 1 MINUTE
[rule A exists and rule B exists]
My current implementation is storing all this information in memory to be able to perform the analysis. For obvious reasons, it is not easily scalable. So I figured I would persist this into a Kafka State Store.
I am unsure of the best way to go about it. I know there is a way to create custom state stores that allow for a higher level of flexibility. I'm not sure if the Kafka DSL will support this.
Still new to Kafka Streams and wouldn't mind hearing a variety of suggestions.
From the description you have given, I believe this use case can still be implemented using the DSL in Kafka Streams. The code you have shown above does not track any state. In your topology, you need to add state by tracking the counts of the rules and storing them in a state store; then you only send the output rules when that count hits a threshold. Here is the general idea as pseudo-code. Obviously, you will have to tweak it to satisfy the particular specifications of your use case.
@Bean
public Function<KStream<String, ProcessNode>, KStream<String, List<IndicatorEvaluation>>> process() {
    return input -> input
            .mapValues(this::analyze)
            .filter((host, evaluation) -> evaluation != null)
            ...
            .groupByKey(...)
            .windowedBy(TimeWindows.of(Duration.ofHours(1)))
            .count(Materialized.as("rules"))
            .filter((key, value) -> value > 4)
            .toStream()
            ....
}

Operations on Multiple Streams

Here's what I am doing: I have an event from an RSS feed that is telling me that a Ticket was edited. To get the changes made to that ticket, I have to call a REST service.
So I wanted to do it with a more compact, functional approach, but it just turned into a bunch of craziness - when in fact the straight old-style Java is this simple:
/**
 * Since the primary interest is in what has been changed, we focus on getting changes
 * often and pushing them into the appropriate channels.
 *
 * @return changes made since we last checked
 */
public List<ProcessEventChange> getLatestChanges() {
    List<ProcessEventChange> recentChanges = new ArrayList<>();
    List<ProcessEvent> latestEvents = getLatestEvents();
    for (ProcessEvent event : latestEvents) {
        recentChanges.addAll(getChanges(event));
    }
    return recentChanges;
}
There were a couple of questions on here related to this that did not seem to have straightforward answers. I am asking this question so that there's a very specific example and the question is crystal clear: is it worth reworking this with streams, and if so, how?
If streams are not good for things like this, they are really not good for much. The reason I say that is that this is a very common requirement: that some piece of data be enriched with more information from another source.
What you need is flatMap, which can map a single ProcessEvent object of the input list to multiple ProcessEventChange objects, and flatten all those objects to a single Stream of ProcessEventChange.
List<ProcessEventChange> recentChanges = getLatestEvents()
        .stream()
        .flatMap(e -> getChanges(e).stream())
        .collect(Collectors.toList());

How to cache file handles?

I have an application that wants to keep many files open: periodically it receives a client request saying "add some data to file X", and it would be ideal to have that file already opened, and the file's header section already parsed, so that this write is fast. However, keeping this many files open is not very nice to the operating system, and could become impossible if our data-storage needs grow.
So I would like a "give me this file handle, opening it if it's not cached" function, and some process for automatically closing files which have not been written to for, say, five minutes. For the specific case of caching file handles which are written to in short spurts, this is probably enough, but it seems a general enough problem that there ought to be functions like "give me the object named X, from cache if possible" and "I'm now done with object X, so make it eligible for eviction five minutes from now".
core.cache looks like it might be suitable for this purpose, but the documentation is quite lacking and the source provides no particular clues about how to use it. Its TTLCache looks promising, but as well as being unclear how to use, it relies on garbage collection to evict items, so I can't cleanly close a resource when I'm ready to expire it.
I could roll my own, of course, but there are a number of tricky spots and I'm sure I would get some things subtly wrong, so I'm hoping someone out there can point me to an implementation of this functionality. My code is in Clojure, but of course using a Java library would be perfectly fine if that's where the best implementation can be found.
Check out Guava's cache implementation.
* You can supply a Callable (or a CacheLoader) to the get method for "if the handle is cached, return it; otherwise open, cache and return it" semantics.
* You can configure timed eviction such as expireAfterAccess.
* You can register a RemovalListener to close the handles on removal.
Modifying the code examples from the linked Guava page slightly, using CacheLoader:
LoadingCache<Key, Handle> graphs = CacheBuilder.newBuilder()
        .maximumSize(100) // sensible value for open handles?
        .expireAfterAccess(5, TimeUnit.MINUTES)
        .removalListener(removalListener)
        .build(
                new CacheLoader<Key, Handle>() {
                    public Handle load(Key key) throws AnyException {
                        return openHandle(key);
                    }
                });

RemovalListener<Key, Handle> removalListener =
        new RemovalListener<Key, Handle>() {
            public void onRemoval(RemovalNotification<Key, Handle> removal) {
                Handle h = removal.getValue();
                h.close(); // tear down properly
            }
        };
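For the Callable flavour of get mentioned above (handy when there is no single loading function), a rough sketch along the same lines; Key, Handle and openHandle are the same placeholders as in the snippet above:
// Plain Cache without a CacheLoader; the loading logic is supplied per call.
Cache<Key, Handle> handles = CacheBuilder.newBuilder()
        .expireAfterAccess(5, TimeUnit.MINUTES)
        .removalListener(removalListener)
        .build();

// "if cached, return it; otherwise open, cache and return it"
// (get(K, Callable) throws ExecutionException if the loader fails)
Handle h = handles.get(key, () -> openHandle(key));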
* DISCLAIMER * I have not used the cache myself this way, ensure you test this sensibly.
If you don't mind some java, see http://ehcache.org/ and http://ehcache.org/apidocs/net/sf/ehcache/event/CacheEventListener.html.

Retrieving Large Lists of Objects Using Java EE

Is there a generally-accepted way to return a large list of objects using Java EE?
For example, if you had a database ResultSet that had millions of objects how would you return those objects to a (remote) client application?
Another example -- that is closer to what I'm actually doing -- would be to aggregate data from hundreds of sources, normalize it, and incrementally transfer it to a client system as a single "list".
Since all the data cannot fit in memory, I was thinking that a combination of a stateful SessionBean and some sort of custom Iterator that called back to the server would do the trick.
So, in other words, if I have an API like Iterator<Data> getData() then what's a good way to implement getData() and Iterator<Data>?
How have you successfully solved this problem in the past?
Definitely don't duplicate the entire DB into Java's memory. This makes no sense and only makes things unnecessarily slow and memory-hogging. Rather, introduce pagination at the database level. You should query only the data you actually need to display on the current page, as Google does.
If you have a hard time implementing this properly and/or figuring out the SQL query for your specific database, have a look at this answer. For the JPA/Hibernate equivalent, have a look at this answer.
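As a rough illustration of database-level pagination over plain JDBC (the table, columns, page size, and the dataSource/page/items variables here are made up, and the LIMIT/OFFSET syntax differs per database):
// Fetch one page of 100 rows instead of the whole result set (exception handling omitted).
String sql = "SELECT id, name FROM item ORDER BY id LIMIT ? OFFSET ?";
try (Connection con = dataSource.getConnection();
     PreparedStatement ps = con.prepareStatement(sql)) {
    ps.setInt(1, 100);        // page size
    ps.setInt(2, page * 100); // first row of the requested page
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            items.add(new Item(rs.getLong("id"), rs.getString("name")));
        }
    }
}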
Update as per the comments (which actually changes the entire question subject...), here's a basic (pseudo) kickoff example:
List<Source> inputSources = createItSomehow();
Source outputSource = createItSomehow();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        outputSource.write(inputSource.read());
    }
}
This way you effectively end up with a single entry in Java's memory instead of the entire collection as in the following (inefficient) example:
List<Source> inputSources = createItSomehow();
List<Entry> entries = new ArrayList<Entry>();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        entries.add(inputSource.read());
    }
}

Source outputSource = createItSomehow();

for (Entry entry : entries) {
    outputSource.write(entry);
}
Pagination is a good solution when working with a web-based UI. Sometimes, however, it is much more efficient to stream everything in one call. The rmiio library was written explicitly for this purpose, and is already known to work in a variety of app servers.
If your list is huge, you must assume that it can't fit in memory - or at least that if your server needs to handle many such requests concurrently, you run a high risk of an OutOfMemoryError.
So basically, what you do is paging with batch reading: say you load a thousand objects from your database, you send them in the client request/response, and you loop until you have processed all objects (see the response from BalusC).
The problem is the same on the client side, and you'll likely need to stream the data to the file system to prevent out-of-memory errors.
Please also note: it is okay to load millions of objects from a database as an administrative task, like performing a backup or an export in some 'exceptional' case. But you should not offer it as a request that any user could make - it will be slow and drain server resources.
