JavaSpaces-like patterns in Ignite - java

I've used GigaSpaces in the past and I'd like to know if I can use Ignite in a similar fashion. Specifically, I need to implement a master-worker pattern where one set of processes writes objects to the in-memory data grid and another set reads those objects, does some processing, and possibly writes results back to the grid. One important GigaSpaces/JavaSpaces feature I need is leasing: if I write an object to the space and it isn't picked up within a certain time period, it should automatically expire and I should get some kind of notification.
Is Apache Ignite a good match for this use case?

I've worked with GigaSpaces before. What you are looking for in Ignite is probably "continuous queries". They let you create a filter for a specific predicate, e.g. checking a field of a new object being written to the grid. Whenever the filter matches, it triggers a listener that can execute the logic you require and write results or changes back to the grid. You can create as many of these queries as you like and chain them, similar to the "notification container" in GigaSpaces. And, as you would expect, you can control the thread pools for this separately.
As for the master-worker pattern, you can configure Ignite client nodes to write the data and server nodes to store and process it. You can even use other client nodes as remote listeners for data changes, as you mentioned.
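For reference, here is a minimal sketch of what such a continuous query could look like in Java. The Task value class, the "NEW"/"DONE" status convention and the cache name are assumptions made for illustration, not anything Ignite prescribes:

```java
import javax.cache.configuration.Factory;
import javax.cache.event.CacheEntryEvent;
import javax.cache.event.CacheEntryEventFilter;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;

public class WorkerListener {

    /** Hypothetical value class standing in for whatever the masters write to the grid. */
    public static class Task implements java.io.Serializable {
        public String status;
        public String payload;
    }

    public static void main(String[] args) {
        Ignite ignite = Ignition.start();                        // worker node (could also be a client node)
        IgniteCache<Long, Task> tasks = ignite.getOrCreateCache("tasks");

        ContinuousQuery<Long, Task> qry = new ContinuousQuery<>();

        // Remote filter: evaluated on the server nodes that own the data,
        // so only matching updates are shipped to the local listener.
        qry.setRemoteFilterFactory(new Factory<CacheEntryEventFilter<Long, Task>>() {
            @Override public CacheEntryEventFilter<Long, Task> create() {
                return evt -> "NEW".equals(evt.getValue().status);
            }
        });

        // Local listener: runs on this node for every update that passed the filter.
        qry.setLocalListener(evts -> {
            for (CacheEntryEvent<? extends Long, ? extends Task> e : evts) {
                Task t = e.getValue();
                t.status = "DONE";                               // worker logic goes here
                tasks.put(e.getKey(), t);                        // write the result back to the grid
            }
        });

        // The query stays active (and keeps delivering events) until the cursor is closed.
        QueryCursor<?> cursor = tasks.query(qry);
    }
}
```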
Check these links:
https://apacheignite.readme.io/docs/continuous-queries
https://apacheignite.readme.io/docs/clients-vs-servers


Related

Database polling when there's an insert - Spring Data JPA

I have a requirement where, whenever a row is inserted into the table, I want to trigger an event. I have used EntityListeners (a Spring Data JPA concept) for this, which works perfectly fine; but the issue is that the insert can also happen through stored procedures or manual entry. I searched online and found the Spring JPA inbound and outbound channel adapter concept, but I don't think it helps me achieve what I want. Can anybody clarify whether this concept applies here (I don't know much about it), or suggest another way to achieve this?
There are no "great" mechanisms for raising events "from the data layer" in SQL Server.
There are three "OK" ones:
Triggers (only arguably OK)
Triggers seem like an obvious solution, but then you have to ask yourself... what will the trigger actually do? If it just writes data into another table, you still haven't gotten yourself outside the database. There are various arcane tricks you could try to use for this, like CLR procedures, or a few extended procedures.
But if you go down that route, you have to start thinking about another consideration: Triggers happen in the same transaction as the DML operation that caused them to fire. If they take time to execute, you'll be slowing down your OLTP workloads. If they do anything that is potentially unreliable they could fail, causing your transaction to roll back.
Triggers plus service broker
Service Broker provides a mechanism - perhaps the only even-half-sensible mechanism - to get your data out of SQL Server and into some kind of listener in a "push"-based manner. You still have a trigger, but the trigger writes data to a Service Broker queue. A listener can use a special WAITFOR (RECEIVE ...) statement to block until data appears in the queue. The nice thing about this is that once the trigger has pushed data into a broker queue, its job is done. The "receipt" of that data is decoupled from the transaction that caused it to be enqueued in the first place. This Service Broker mechanism is what powers things like the SqlDependency class built into .NET.
The two main issues with Service Broker are complexity and performance. Service Broker has a steep learning curve, and it's easy to get things wrong. Performance becomes a concern at scale, because while it's "easy" to build XML or JSON payloads, large set-based data changes can make those payloads massive.
In any case, if you want to explore this route, you're going to want to read (all of) the excellent articles on the subject by Remus Rusanu.
Bear in mind that this is an asynchronous "near real time" mechanism, not a synchronous "real time" mechanism like triggers.
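To make the shape of such a listener concrete, here is a rough Java/JDBC sketch of the WAITFOR (RECEIVE ...) loop. The queue name, connection string and message format are hypothetical, and real code would also deal with transactions, conversation handling and poison messages:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BrokerListener {
    // Hypothetical connection string; adjust for your environment.
    private static final String URL =
            "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(URL);
             Statement stmt = con.createStatement()) {
            while (true) {
                // Block for up to 5 seconds waiting for a message the trigger enqueued.
                try (ResultSet rs = stmt.executeQuery(
                        "WAITFOR (RECEIVE TOP (1) "
                      + "  conversation_handle, message_type_name, "
                      + "  CAST(message_body AS NVARCHAR(MAX)) AS body "
                      + "FROM dbo.EventQueue), TIMEOUT 5000")) {
                    while (rs.next()) {
                        String body = rs.getString("body");
                        // Hand the payload off to the rest of the application here.
                        System.out.println("change event: " + body);
                    }
                }
            }
        }
    }
}
```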
Polling a built-in change detection mechanism: CDC or Change Tracking
SQL Server comes with two flavours of technology that can natively "watch" changes that happen in tables and record them: Change Tracking and Change Data Capture (CDC).
Neither of these pushes data out of the database; they're both "pull" based. What they do is store additional data in the database when changes happen. CDC can provide a complete log of every change, whereas Change Tracking "points to" rows that have changed via their primary key values. Though both of these involve "polling history", there are significant differences between them, so read the fine print.
Note that CDC is "doubly asynchronous": the data is read from the transaction log, so recording it is not part of the original transaction, and then you have to poll the CDC data; it isn't pushed out to you. Furthermore, the functions generated by Microsoft when you enable CDC can be unbelievably slow as soon as you ask for something useful, like net changes with mask (which can tell you which columns really changed their value), and enabling CDC comes with a lot of caveats and limitations (again, read the docs for all of this).
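For the Change Tracking route, a polling loop from Java could look roughly like the sketch below. The table name, primary key column, polling interval and connection string are hypothetical, and a real implementation would persist the last synced version and validate it against the retention window:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class ChangeTrackingPoller {
    // Hypothetical connection string and table (dbo.Orders with PK OrderId).
    private static final String URL =
            "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(URL)) {
            long lastVersion = currentVersion(con);   // baseline; persist this between runs

            while (true) {
                Thread.sleep(5_000);                  // polling interval
                long newVersion = currentVersion(con);

                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT ct.OrderId, ct.SYS_CHANGE_OPERATION "
                      + "FROM CHANGETABLE(CHANGES dbo.Orders, ?) AS ct")) {
                    ps.setLong(1, lastVersion);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // I = insert, U = update, D = delete
                            System.out.println(rs.getString("SYS_CHANGE_OPERATION")
                                    + " -> OrderId " + rs.getLong("OrderId"));
                        }
                    }
                }
                lastVersion = newVersion;
            }
        }
    }

    private static long currentVersion(Connection con) throws Exception {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT CHANGE_TRACKING_CURRENT_VERSION()")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```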
As to which of these is "best", well, that's a matter of opinion and circumstance. I have used CDC extensively, service broker rarely, and triggers almost never, as a way of getting events out of SQL. I have never actually used change tracking in a production environment, but if I had the choice again I would probably have chosen change tracking rather than change data capture, at least until or unless there were requirements that mandated the use of CDC because of its additional functionality beyond what change tracking can provide.
One last note: If you need to "guarantee" that the events that get raised have in fact been collected by a listener and successfully forwarded to subscribers, well, you have some work ahead of you! Guaranteed messaging is hard.

Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

I have a pipeline that takes URLs for files, downloads them, and generates a BigQuery table row for each line apart from the header.
To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.
For this to work I need to either store the history in a database that enforces unique values, or (perhaps easier) use BigQuery for this as well, but then access to the table must be strictly serial.
Can I enforce single-thread execution (on a single machine) to satisfy this for part of my pipeline only?
(After this point, each of my hundreds of URLs/files would be suitable for processing on a separate thread; each single file gives rise to 10,000-10,000,000 rows, so throttling at that point will almost certainly not cause performance issues.)
Beam is designed for parallel processing of data and it tries to explicitly stop you from synchronizing or blocking except using a few built-in primitives, such as Combine.
It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. This operator uses a Combine per-key to group the elements by key (your URL in this case), then emits each key only the first time it is seen.
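A minimal sketch with the Beam Java SDK; Create.of stands in for however the URLs actually enter your pipeline:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.PCollection;

public class DedupUrls {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // In the real pipeline the URLs come from your own source; Create.of is a stand-in.
        PCollection<String> urls = p.apply(Create.of(
                "https://example.com/a.csv",
                "https://example.com/a.csv",
                "https://example.com/b.csv"));

        // Distinct groups by the element itself and emits each URL exactly once,
        // without forcing any step to run single-threaded on one machine.
        PCollection<String> uniqueUrls = urls.apply(Distinct.create());

        // ... continue with the download / parse / BigQuery-write steps here.

        p.run().waitUntilFinish();
    }
}
```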

Contact Java client from a Hazelcast node

I have a Hazelcast cluster that performs several calculations for a Java client, triggered from the command line. I need to persist parts of the calculated results on the client system while the nodes are still working. I am going to store parts of the data in Hazelcast maps. Now I am looking for a way to inform the client that a node has stored data in the map and that it can start using it. Is there a way to trigger client operations from any Hazelcast node?
Your question is not very clear, but it looks like you could use com.hazelcast.core.EntryListener to trigger a callback that notifies the client when a new entry is stored in the data map.
Your member node can publish intermediate results (or just a notification message) to a Hazelcast IQueue, ITopic or Ringbuffer.
The flow looks like this:
a client registers a listener for, say, a ringbuffer or topic.
a client submits the command to perform on the cluster.
a member persists intermediate results to an IMap or any other data structure.
a member sends a message to the topic about the availability of partial results.
a client receives the message and accesses the data in the IMap.
a member sends a message when it's done with its task.
Something like that.
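A rough sketch of the topic-plus-IMap variant in Java, assuming Hazelcast 3.x package names and hypothetical map/topic names:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.ITopic;

public class PartialResultsFlow {

    /** Client side: listen on a topic and pull finished parts out of the IMap. */
    public static void runClient() {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, byte[]> results = client.getMap("partial-results");
        ITopic<String> topic = client.getTopic("result-notifications");

        topic.addMessageListener(msg -> {
            String key = msg.getMessageObject();    // the member publishes the map key
            byte[] part = results.get(key);
            // persist 'part' on the client system here
        });
    }

    /** Member side: store a partial result, then tell listeners it is available. */
    public static void publishPartialResult(HazelcastInstance member, String key, byte[] data) {
        member.getMap("partial-results").put(key, data);
        member.<String>getTopic("result-notifications").publish(key);
    }
}
```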
You can find some examples here
Let me know if you have any questions about it.
Cheers,
Vik
There are several ways to solve this problem. The simplest is to use a dedicated IMap or any other of Hazelcast's synchronized collections: you simply write data into such a map and retrieve/remove it after it has been added. But this causes a huge overhead, because the data has to be synchronized throughout the cluster. If the data is quite big and the cluster is huge, with a few hundred nodes all over the world or at least the USA, the data will be synchronized across all nodes just to be deleted a few moments later, and the deletion has to be synchronized as well. Not deleting is not an option, because the data can grow to several GB, which would make synchronizing it even more expensive. The question has been answered, but that solution is not suited for every scenario.

Pipelining web service response

I am designing a web service that wraps a very large data source and I would be very grateful for any suggestions whether my design is appropriate or I am missing something substantially better.
So here is the problem:
We have several data sources that all provide the same interface with the "most important" method being RowIterator select(Table table, String where). Now, functionally everything is going fine for all our implementations but the problem is that the web service that we need to wrap around one of the sources would (in a naive implementation) upon receiving a query
wait for the wrapped data source to return the whole result set
marshal the whole result set before sending it to the client
at the client side unmarshal the whole result set before returning it to the caller
Only after this sequence would the caller be able to see the first row. This is quite disappointing behavior, as the caller has to wait unnecessarily for the whole result set, twice. I want some pipelining instead: the caller must be able to see the first results while the service is still sending rows. I am planning to achieve this by implementing some kind of paging that is encapsulated in my client-side row iterator. The service would maintain a session id (with a timeout) that is created upon receiving a query and can be used to fetch chunks of data. The session id could be returned even before the actual query is sent to the wrapped data source. The client would then fetch chunks (pages) until a chunk is empty or smaller than the expected (= requested) chunk size.
So, in this design the caller would be able to see the first results while the service is still sending rows. However, I am wondering whether there is a way to efficiently pipeline results on a per-row basis using a SOAP web service?
Also, would it be possible to return the results to the caller without repeatedly asking for more results?
In the end I used MTOM to transmit the data in binary and used blocking queues at the client and the server to achieve the desired parallelism. I sketched this here: Streaming large SOAP attachments
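For what it's worth, here is a rough sketch of the client-side iterator described in the question, combining the paging idea with a blocking queue so the caller sees rows as soon as the first chunk arrives. Row, RowChunkPort and fetchChunk are hypothetical stand-ins for the generated SOAP artifacts:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Client-side row iterator that hides the chunked fetching behind a plain Iterator. */
public class PagedRowIterator implements Iterator<PagedRowIterator.Row> {

    /** Hypothetical row type. */
    public static class Row {}

    /** Hypothetical interface mirroring the generated SOAP port's paging operation. */
    public interface RowChunkPort {
        List<Row> fetchChunk(String sessionId, int chunkSize);
    }

    private static final Row END = new Row();                 // sentinel marking end of stream
    private final BlockingQueue<Row> queue = new ArrayBlockingQueue<>(10_000);
    private Row next;

    public PagedRowIterator(RowChunkPort port, String sessionId, int chunkSize) {
        Thread fetcher = new Thread(() -> {
            try {
                while (true) {
                    List<Row> chunk = port.fetchChunk(sessionId, chunkSize);
                    for (Row r : chunk)
                        queue.put(r);                          // blocks if the caller is slow
                    if (chunk.size() < chunkSize) {            // short or empty chunk = last one
                        queue.put(END);
                        return;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.setDaemon(true);
        fetcher.start();
    }

    @Override public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take();                           // blocks until a row (or END) arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return next != END;
    }

    @Override public Row next() {
        if (!hasNext()) throw new NoSuchElementException();
        Row r = next;
        next = null;
        return r;
    }
}
```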

How to create a copy of a table in HBase on the same cluster? Or, how to serve requests using the original state while operating on a working state

Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I have no knowledge of whether it has been designed to handle that scenario efficiently.
Use the export+import jobs. Doing that sounds like a hack but since I'm new to HBase maybe that might be a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is perform the reads that serve your requests with this parameter pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
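A small sketch of that read path with the HBase Java client, assuming the newer Connection/Table API and a hypothetical table name; batchStartTs would be the timestamp you record just before the batch starts:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class SnapshotReads {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long batchStartTs = Long.parseLong(args[0]);   // millis recorded just before the batch began

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("objects"))) {   // hypothetical table name

            // Only cells written before batchStartTs are visible to this scan, so requests
            // keep seeing the pre-batch "snapshot" while the batch writes newer versions.
            Scan scan = new Scan();
            scan.setTimeRange(0L, batchStartTs);       // [minStamp, maxStamp) in milliseconds

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // serve the request from 'row'
                }
            }
        }
    }
}
```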
