Here is a situation I have encountered: I have two similar Java applications running on different servers. Both applications obtain data from the same website through the web service it provides, but the site of course doesn't know that the first app has already taken the same piece of data as the second app. After fetching, the data should be saved in a database, so I have the problem of saving the same data twice.
How can I avoid duplicate entries in my db?
Probably there are two ways:
1) On the database side: write something that looks like "insert if unique".
2) On the server side: write some intermediate service that receives responses from the two data fetchers and processes them somehow.
I suppose the second solution is more efficient.
Can you advise something on this topic?
How would you implement that intermediate service? How would you implement communication between the services? If we used HashMaps to store the received data, how could we estimate the maximum size of HashMap our system can handle?
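Whichever side enforces uniqueness, the core operation is an atomic "insert if not already present". A minimal in-process sketch of that idea, using `ConcurrentHashMap.putIfAbsent` (the record-ID scheme is an assumption; on the database side the equivalent is a unique constraint, with the insert failing on a duplicate):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DedupSketch {
    // Keyed by whatever uniquely identifies a fetched record (assumption:
    // the web service exposes such an ID). putIfAbsent is atomic, so two
    // callers racing on the same key cannot both "win".
    private final ConcurrentMap<String, String> seen = new ConcurrentHashMap<>();

    /** Returns true if this record was not seen before and should be persisted. */
    public boolean shouldPersist(String recordId, String payload) {
        return seen.putIfAbsent(recordId, payload) == null;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        System.out.println(dedup.shouldPersist("rec-1", "data")); // true
        System.out.println(dedup.shouldPersist("rec-1", "data")); // false: duplicate
    }
}
```

Note that this only deduplicates within a single JVM; with two fetchers on different servers you would still need the database-side unique constraint (or a shared store) as the final arbiter.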
Do you really need to fetch data on two servers simultaneously? Checking during every insert whether an entry is already present could be expensive, and merging several fetches can be time-consuming as well. Is there any benefit to fetching in parallel? Consider having one fetcher at a time.
The problem you will face is choosing which one of your distributed processes should perform the data fetching and store it in the DB.
This is a kind of leader election problem.
Take a look at Apache ZooKeeper, which is a distributed coordination service.
There is a recipe for implementing leader election with ZooKeeper.
There are frameworks that already implement this recipe. I'd recommend Netflix Curator; more details about leader election with Curator are available on its wiki.
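To make the recipe concrete, here is a toy, single-JVM illustration of its core idea: each candidate registers under an increasing sequence number and the lowest live sequence number is the leader. In the real recipe this bookkeeping is done by ZooKeeper ephemeral sequential znodes (e.g. via Curator's LeaderSelector), not an in-memory map; all names below are made up for the sketch.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

public class LeaderElectionSketch {
    private final AtomicLong sequence = new AtomicLong();
    // Maps each candidate's sequence number to its name, lowest first.
    private final TreeMap<Long, String> candidates = new TreeMap<>();

    /** Register a candidate; stands in for creating an ephemeral sequential znode. */
    public synchronized long register(String nodeName) {
        long seq = sequence.getAndIncrement();
        candidates.put(seq, nodeName);
        return seq;
    }

    /** The candidate with the lowest sequence number leads. */
    public synchronized String leader() {
        Map.Entry<Long, String> first = candidates.firstEntry();
        return first == null ? null : first.getValue();
    }

    /** Stands in for an ephemeral znode vanishing when its session dies. */
    public synchronized void deregister(long seq) {
        candidates.remove(seq);
    }

    public static void main(String[] args) {
        LeaderElectionSketch election = new LeaderElectionSketch();
        long a = election.register("fetcher-A");
        election.register("fetcher-B");
        System.out.println(election.leader()); // fetcher-A fetches and writes to the DB
        election.deregister(a);                // A's session dies
        System.out.println(election.leader()); // fetcher-B takes over automatically
    }
}
```

With this in place, only the current leader fetches and writes to the database, which removes the duplicate-insert problem entirely.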
There are distributed frameworks for this sort of problem.
Hazelcast - will allow you to have a single distributed ConcurrentMap across multiple JVMs.
Terracotta - using its DSO (Distributed Shared Objects, I think), it will maintain a Map implementation across JVMs.
Related
During some testing of multiple memcached instances I realized that the spymemcached Java client was evenly distributing the key data across the configured instances. I know that memcached is distributed, but is there a way to configure a client to write key data to all configured instances? I know that memory-cache approaches like this are not designed to replace persistent storage (a DB), but I have zero need for persistent storage and need a lightweight way to synchronize basic key/value data between two or more instances of my service.
The Java test code I prototyped worked great, and I feel the spymemcached API would integrate well, but I need to replicate the data between memcached instances. I assumed that if I specified multiple MC instances, the data would be distributed to all of them, not across all of them. Thanks.
There are some memcached clients that allow data replication among multiple memcached servers. From what I can tell, SpyMemcached is not one of them.
I do not understand, however, why you want this. Lightweight synchronization works just as well without replication. Memcached clients (including SpyMemcached) generally use consistent hashing to map from a key to a server, so every instance of your service will look for a given key on the same server.
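Here is a minimal consistent-hash ring sketch showing why no replication is needed for lookups. It uses `String.hashCode` purely to stay self-contained; real memcached clients use stronger hashes (e.g. Ketama) plus virtual nodes, but the property is the same: any two clients configured with the same server list map a given key to the same server.

```java
import java.util.TreeMap;

public class HashRing {
    // Ring positions (non-negative hashes) mapped to server addresses.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addServer(String server) {
        ring.put(server.hashCode() & Integer.MAX_VALUE, server);
    }

    /** The first server at or after the key's position on the ring, wrapping around. */
    public String serverFor(String key) {
        int h = key.hashCode() & Integer.MAX_VALUE;
        Integer pos = ring.ceilingKey(h);
        if (pos == null) pos = ring.firstKey(); // wrap past the end of the ring
        return ring.get(pos);
    }

    public static void main(String[] args) {
        HashRing clientA = new HashRing();
        HashRing clientB = new HashRing();
        for (String s : new String[] {"mc1:11211", "mc2:11211", "mc3:11211"}) {
            clientA.addServer(s);
            clientB.addServer(s);
        }
        // Two independent clients agree on which server owns any given key.
        System.out.println(clientA.serverFor("user:42").equals(clientB.serverFor("user:42"))); // true
    }
}
```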
With a CQRS architecture, in write-intensive real-time applications like trading systems, the traditional approach of loading aggregates from database + distributed cache + distributed lock does not perform well.
The actor model (Akka) fits well here, but I am looking for an alternative solution. What I have in mind is to use Kafka for sending commands, making use of topic partitioning to ensure that commands for the same aggregate always arrive on the same node, and then use database + local cache + local pessimistic lock to load aggregate roots and handle commands. This brings three main benefits:
aggregates are distributed across multiple nodes
no network traffic for looking up a central cache or distributed locks
no serialization & deserialization when saving and loading aggregates
One problem with this approach is that when consumer groups rebalance, the local cache may hold stale aggregate state; setting short cache timeouts should work most of the time.
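The "same aggregate always lands on the same node" property comes from key-based partitioning. A self-contained sketch of the idea (Kafka's default partitioner actually uses murmur2 over the key bytes; `hashCode` is used here only to avoid external dependencies, and the aggregate IDs are made up):

```java
public class PartitionSketch {
    /**
     * Key-based partitioning: the same aggregate ID always maps to the same
     * partition, so with one consumer per partition the same node handles
     * every command for that aggregate, and a local cache + local lock suffice.
     */
    static int partitionFor(String aggregateId, int numPartitions) {
        return (aggregateId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("order-1001", 12);
        int p2 = partitionFor("order-1001", 12);
        System.out.println(p1 == p2); // always true: commands for one aggregate stay together
    }
}
```

This also makes the rebalance caveat visible: when partition ownership moves to another consumer, that consumer's local cache has no (or stale) state for the aggregate until it reloads from the database.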
Has anyone used this approach in real projects?
Is it a good design, and what are the downsides?
Please share your thoughts and experiences. Thank you.
IMHO Kafka will do the job for you. You need to ensure that the network is fast enough.
In our project we react in soft real time to customer needs and purchases, and we send information over Kafka to different services, which perform the business logic. This works well.
Confirmations at the network level are handled well within the Kafka broker.
For example, when one of the broker nodes crashes, we do not lose messages.
Another matter is if you need very strong transactional confirmation for all actions; then you need to be careful in your design. Perhaps you need more topics, so you can send both the information and all the needed logical confirmations.
If you need to implement more logic, such as confirming once a message has been processed by another external service, you will probably also need to disable auto commits.
I do not know if this is a complete answer to your question.
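Disabling auto commit is a consumer configuration change. A hedged sketch of the relevant properties (the broker address and group ID below are placeholders; only the `Properties` object is built here, since actually creating a `KafkaConsumer` and calling `consumer.commitSync()` after processing requires the Kafka client library and a running broker):

```java
import java.util.Properties;

public class ManualCommitConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder address
        props.put("group.id", "business-logic-service");  // placeholder group id
        // Turn off auto commit so the offset is only committed after the
        // external service has confirmed processing (via consumer.commitSync()).
        props.put("enable.auto.commit", "false");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("enable.auto.commit")); // false
    }
}
```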
I have a case where I need to frequently update and retrieve values in a map. This map should have the same keys and values across all four servers: if one server updates the map, the change should be reflected on the other servers.
I believe I should be caching this..
Can I have some example code showing how I should achieve this?
Thank you.
You need a distributed cache. Choosing one is another issue...
see here.
Example of using EhCache - here.
I would suggest using a distributed cache for this, e.g. Hazelcast's implementation of a distributed map.
You could set up a Hazelcast cluster and implement MapStore.
You will also need to configure a Hazelcast client on each Tomcat server. These clients will load the distributed map and take care of syncing the data.
Hazelcast has great documentation and plenty of examples, so it should be easy for you to work with.
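For orientation, wiring a MapStore into a map is done in the Hazelcast configuration; a sketch of the relevant `hazelcast.xml` fragment (the class name is a placeholder for your own MapStore implementation, and surrounding elements are omitted):

```xml
<!-- Fragment of hazelcast.xml: back the distributed map "data" with a MapStore. -->
<map name="data">
    <map-store enabled="true">
        <!-- Placeholder: your class implementing com.hazelcast.core.MapStore -->
        <class-name>com.example.DataMapStore</class-name>
        <write-delay-seconds>0</write-delay-seconds> <!-- 0 = write-through -->
    </map-store>
</map>
```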
In Java, I have a HashMap containing objects (which can be serializable, if it helps). Elsewhere on a network, I have another HashMap in another copy of the application that I would like to stay in sync with the first.
For example, if on computer A someone runs myMap.put("Hello", "World"); and on computer B someone runs myMap.put("foo", "bar");, then after some time delay for changes to propagate, both computers would have myMap.get("Hello") == "World" and myMap.get("foo") == "bar".
Is this requirement met by an existing facility in the Java language, a library, or some other program? If this is already a "solved problem" it would be great not to have to write my own code for this.
If there are multiple ways of achieving this I would prefer, in priority order:
Changes are guaranteed to propagate 100% of the time (doesn't matter how long it takes)
Changes propagate rapidly
Changes propagate with minimal bandwidth use between computers.
(Note: I have had trouble searching for solutions as results are dominated by questions about synchronizing access to a Map from multiple threads in the same application. This is not what my question is about.)
You could look at the Hazelcast in-memory data grid.
It's an open-source solution designed for distributed architectures.
It maps really well to your problem, since the Hazelcast IMap extends java.util.Map.
Link: Hazelcast IMap
What you are trying to do is clustering between two nodes. Here are some options:
1) Serialization. You can meet your requirement by making your map serializable, reading and writing the map's state at some interval, and syncing it between the nodes. This is the most basic way to achieve the functionality, but it means you have to manage the synchronization of the map manually (i.e., you have to write that code yourself).
2) Hazelcast. An open-source distributed caching mechanism; it is a solid API with a rich library for building a clustered environment and sharing data between different nodes.
3) Coherence. Oracle's Coherence*Web also provides a mechanism for clustering.
4) Ehcache. A cache library introduced in 2003 to improve performance by reducing the load on underlying resources. Ehcache can be used for general-purpose caching as well as Hibernate second-level caching, data access objects, security credentials, and web pages. It can also be used for SOAP and RESTful server caching, application persistence, and distributed caching.
Of all of the above, Hazelcast is the best API; go through it and it will surely help you.
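To illustrate the hand-rolled serialization option, here is a bare-bones sketch: snapshot a map with Java serialization so it can be shipped to another node. The "network" here is just a byte array; a real version would send the bytes over a socket on some interval and merge them on receipt, which is exactly the bookkeeping the distributed-cache libraries do for you.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class MapSnapshot {
    /** Serialize the map's state to bytes, ready to send to another node. */
    static byte[] serialize(HashMap<String, String> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(map);
        }
        return bytes.toByteArray();
    }

    /** Rebuild the map from a received snapshot. */
    @SuppressWarnings("unchecked")
    static HashMap<String, String> deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> nodeA = new HashMap<>();
        nodeA.put("Hello", "World");
        // "Send" the snapshot to node B and rebuild the map there.
        HashMap<String, String> nodeB = deserialize(serialize(nodeA));
        System.out.println(nodeB.get("Hello")); // World
    }
}
```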
I have a large set of data (more than 1 TB) that will be accessed by more than 1,000 people concurrently. Storing it in one database will make the application really slow, so I was planning to store it across different databases. Does MongoDB support routing between different databases, or should this be done in our application? I am developing in Java and use the Spring framework to interact with Mongo.
Given that the reason for splitting your data into multiple databases is to improve performance, I would suggest sharding a single database rather than splitting across multiple databases. If location is granular enough and you would like to split load across servers, you could then use tag-aware sharding to pin specific locations or location ranges to a specific server. There is a good tutorial on this available here.
Before following this route I would suggest performing load tests on your application with your database on the hardware you plan to use for your system. It is worth confirming that you really do need to shard/split the data and, if so, how many servers you may need. If your database is going to be read-intensive rather than write-intensive, a non-sharded database might handle your load, provided your working set fits in memory.
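For orientation, tag-aware sharding is set up with a few admin commands from the mongo shell; a hedged sketch, where the database, collection, field, and shard names are all placeholders, and the actual tag ranges depend on your location encoding:

```javascript
// Shard one collection on a location-prefixed key, then pin a
// location range to a tagged shard (all names are placeholders).
sh.enableSharding("mydb")
sh.shardCollection("mydb.records", { location: 1, _id: 1 })
sh.addShardTag("shard0000", "EU")
sh.addTagRange("mydb.records",
               { location: "EU", _id: MinKey },
               { location: "EV", _id: MinKey },
               "EU")
```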