A few days ago, Google published this article:
https://cloud.google.com/blog/big-data/2018/07/developing-a-janusgraph-backed-service-on-google-cloud-platform
It states that it is common to deploy JanusGraph as a separate deployment behind an internal load balancer.
Our project has pretty much the same architecture: Bigtable, GKE with JanusGraph, and an app that calls JanusGraph through a load balancer. The only difference is that we use an external load balancer rather than an internal one (I don't know whether that matters).
So, the question is: what is the state of load balancing when using the Gremlin driver from a Java application? Our research shows that it does not work. Because connections are stateful, traffic is not spread across the JanusGraph replicas; once a connection sticks to one replica, it stays with that replica until the end.
Worse, when that replica is killed, the connection simply hangs, with no exception, warning, or log entry of any kind; there is no information about the connection's state at all. This is bad, because if you rely on an autoscaler that spins up additional replicas on demand, it simply will not work.
We are using JanusGraph 0.2.1 with the corresponding TinkerPop driver 3.2.9 (though we have tried many different combinations) and the pattern stays the same: load balancing does not work for us, and neither does failover when a pod gets killed. To make things worse, it is not really deterministic: we had tests where it worked, but when we returned to the same test later, it didn't.
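For context, a driver configuration of the kind we are using looks roughly like this (a sketch, not our exact file; hostnames are placeholders). Everything enters through the single load-balancer address, so once the driver's pooled connections are established they stay pinned to whichever replica the balancer picked:

```yaml
# remote.yaml sketch for the TinkerPop Java driver
hosts: [janus-lb.example.internal]   # single LB address: no client-side spreading
port: 8182
serializer: {
  className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0,
  config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] } }
```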
Do you, fellow Stack Overflow users, have any idea what the state of this problem is?
We are currently preparing Hazelcast to go live in the next few weeks. One bigger issue remains that troubles our ops department and could be a showstopper if we cannot fix it.
Since we are maintaining a high-availability payment application, we have to survive when the cluster is not available. Possible reasons:
Someone messed up the Hazelcast configuration and a map on the cluster grows until we hit an OOM (we had this on the test system).
There is some issue with the network cards/hardware that temporarily breaks the connection to the cluster.
The ops guys reconfigured the firewall and accidentally blocked some necessary ports.
Whatever else
I spent some time looking for a good existing solution, but the only one so far was to increase the number of backup servers, which of course does not cover these cases.
During my current tests the application completely stopped working, because after a certain number of retries the clients disconnect from the cluster and the Hibernate second-level cache stops working. Since we use Hazelcast throughout the whole ecosystem, this would take down about 40 Java clients almost instantly.
So I wonder how we could keep the applications working, albeit more slowly, when the cluster is down. Our current approach is to switch over to a local Ehcache, but I would think Hazelcast has a solution for this problem as well?
If I were you, I would use a LocalSessionFactoryBean and set the cacheRegionFactory to a Spring bean that can delegate each call to either Hazelcast or a NoCachingRegionFactory when the Hazelcast server is down.
This is desirable because Hibernate assumes the cache implementation is always available, so you need to provide your own CacheRegion proxy that can decide the cache-region routing at runtime.
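To make the delegation idea concrete, here is a minimal, self-contained sketch of the pattern. The `SimpleCache` interface and all class names are invented for illustration; in a real setup the "real" side would be Hibernate's Hazelcast-backed region and the fallback would behave like NoCachingRegionFactory:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Invented stand-in for Hibernate's cache SPI. */
interface SimpleCache {
    void put(String key, Object value);
    Object get(String key);
}

/** Stands in for the Hazelcast-backed region. */
class MapCache implements SimpleCache {
    private final Map<String, Object> store = new ConcurrentHashMap<>();
    public void put(String key, Object value) { store.put(key, value); }
    public Object get(String key) { return store.get(key); }
}

/** No-op fallback: every get misses, so Hibernate falls through to the DB. */
class NoOpCache implements SimpleCache {
    public void put(String key, Object value) { /* drop silently */ }
    public Object get(String key) { return null; }
}

/** Proxy that routes to the real cache only while the cluster reports healthy. */
public class FailoverCache implements SimpleCache {
    private final SimpleCache real = new MapCache();
    private final SimpleCache noop = new NoOpCache();
    private final BooleanSupplier clusterUp;

    public FailoverCache(BooleanSupplier clusterUp) {
        this.clusterUp = clusterUp;
    }

    private SimpleCache delegate() {
        return clusterUp.getAsBoolean() ? real : noop;
    }

    public void put(String key, Object value) { delegate().put(key, value); }
    public Object get(String key) { return delegate().get(key); }

    public static void main(String[] args) {
        AtomicBoolean up = new AtomicBoolean(true);
        FailoverCache cache = new FailoverCache(up::get);
        cache.put("user:1", "alice");
        System.out.println(cache.get("user:1")); // alice
        up.set(false);                           // cluster "goes down"
        System.out.println(cache.get("user:1")); // null -> the DB would be hit
    }
}
```

The health check (`BooleanSupplier`) would in practice be driven by a Hazelcast LifecycleListener or a periodic ping, so the switch happens without any retries blocking the request path.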
Please note: if the cache systems mentioned in this question work so differently from one another that an answer is nearly impossible, then I would narrow the question to anything that is JCache (JSR 107) compliant.
The major players in the distributed cache game, for Java at least, are EhCache, Hazelcast and Infinispan.
First of all, my understanding of a distributed cache is that it is a cache that lives inside a running JVM process, but that is constantly synchronizing its in-memory contents across other multiple JVM processes running elsewhere. Hence Process 1 (P1) is running on Machine 1 (M1), P2 is running on M2 and P3 is running on M3. An instance of the same distributed cache is running on all 3 processes, but they somehow all know about each other and are able to keep their caches synchronized with one another.
I believe EhCache accomplishes this inter-process synchrony via JGroups. Not sure what the others are using.
Furthermore, my understanding is that these configurations are limiting because, for each node/instance/process, you have to configure it and tell it about the other nodes/instances/processes in the system, so they can all sync their caches with one another. Something like this:
<cacheConfig>
    <peers>
        <instance uri="myapp01:12345" />
        <instance uri="myapp02:12345" />
        <instance uri="myapp03:12345" />
    </peers>
</cacheConfig>
So to begin with, if anything I have stated is incorrect or misguided, please start by correcting me!
Assuming I'm more or less on track, then I'm confused how distributed caches could possibly work in an elastic/cloud environment where nodes are regulated by auto-scalers. One minute, load is peaking and there are 50 VMs serving your app. Hence, you would need 50 "peer instances" defined in your config. Then the next minute, load dwindles to a crawl and you only need 2 or 3 load balanced nodes. Since the number of "peer instances" is always changing, there's no way to configure your system properly in a static config file.
So I ask: How do distributed caches work on the cloud if there are never a static number of processes/instances running?
One way to handle that problem is to have an external (almost static) caching cluster that holds the data, while your application (or the frontend servers) uses clients to connect to the cluster. You can still scale the caching cluster up and down as needed, but most of the time you'll need fewer nodes in the caching cluster than frontend servers.
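It's also worth noting that the static peer list from the question isn't how these caches are usually deployed in elastic environments: Hazelcast, for instance, can discover members at runtime instead of reading them from config. A sketch of a hazelcast.xml fragment enabling multicast discovery (no peer addresses needed, so nodes can come and go with the autoscaler):

```xml
<hazelcast>
    <network>
        <join>
            <!-- members announce themselves; nothing is hard-coded -->
            <multicast enabled="true"/>
            <tcp-ip enabled="false"/>
        </join>
    </network>
</hazelcast>
```

Cloud environments that block multicast typically use a discovery plugin (AWS, Kubernetes, etc.) in the same `<join>` section instead.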
I have a Java application, and I need it to be highly available.
I was thinking of using FastMPJ: run multiple instances on different PCs, and every minute each app checks whether the master instance is running; if not, another instance takes over.
I'd like to ask whether this is a good solution, or whether there is a better one.
A more general solution is to use a load-balancing system: you have N instances of the application running with the same privileges (if possible on different hardware), and a redundant load balancer in front selects one of them for each request/task, based on the current load.
The benefit of this solution is obvious: the hardware is actually used rather than sitting idle waiting for the 0.01% case to jump in. Every instance is exercised all the time, so errors (such as faulty hardware) are reported when they happen, and you avoid the "Oh... the backup isn't even working" surprise. On top of that, the load is balanced across machines adaptively.
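The core of such a balancer is tiny. A toy, self-contained sketch (all names invented; a real balancer would run health checks over the network rather than via an in-process predicate):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Toy round-robin balancer that skips instances whose health check fails. */
public class RoundRobinBalancer {
    private final List<String> instances;
    private final Predicate<String> healthy;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinBalancer(List<String> instances, Predicate<String> healthy) {
        this.instances = instances;
        this.healthy = healthy;
    }

    /** Returns the next healthy instance, or null if none responds. */
    public String pick() {
        for (int i = 0; i < instances.size(); i++) {
            String candidate = instances.get(
                    Math.floorMod(next.getAndIncrement(), instances.size()));
            if (healthy.test(candidate)) {
                return candidate;
            }
        }
        return null; // every instance failed its health check
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(
                List.of("app01", "app02", "app03"),
                host -> !host.equals("app02")); // pretend app02 is down
        System.out.println(lb.pick()); // app01
        System.out.println(lb.pick()); // app02 skipped -> app03
        System.out.println(lb.pick()); // app01 again
    }
}
```

The same round-robin-with-health-check loop is what off-the-shelf balancers do at their core, plus connection draining, retries, and weighting.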
In one of my projects, while implementing an exchange, we used Apache Qpid for high availability, and my experience was quite satisfactory. It scales very well too; I have run the application on clusters of up to 32 nodes. Please find further details here, and let me know if you need any more information:
http://qpid.apache.org/releases/qpid-0.18/books/AMQP-Messaging-Broker-Java-Book/html/High-Availability.html
Hope it helps:)
One often forgets that there must also be high availability from the application to the database. In my experience, the data access layer is where most application bottlenecks occur, so make sure you have a good application-aware DB load balancer. Oracle has a solid solution, but it is for Oracle databases only. Postgres has an open-source version. Heimdall Data is a commercial solution.
I have a cluster set up and running: JBoss 7.1.1.Final and mod_cluster 1.2.6.Final.
mod_cluster load balancing is happening between two nodes, nodeA and nodeB.
But when I stop one node and start it again, mod_cluster still sends all the load to the other node. It does not redistribute the load after the node comes back.
What configuration changes are required for this? I can see both nodes enabled in mod_cluster_manager, but it directs load only to one node even after the other node comes back from failover.
Thanks
If you are seeing existing requests being forwarded to the active node, that is because sticky sessions are enabled; this is the default behavior.
If you are seeing new requests not being forwarded to the returning node (even when it is not busy), that is a different issue. You may want to look at the load-balancing factor/algorithm you are currently using in your mod_cluster subsystem.
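For illustration, a mod_cluster subsystem fragment that disables sticky sessions and balances on the busyness metric (attribute values, connector name, and namespace version are illustrative and may differ for your JBoss release):

```xml
<subsystem xmlns="urn:jboss:domain:modcluster:1.1">
    <mod-cluster-config advertise-socket="modcluster"
                        sticky-session="false"
                        connector="ajp">
        <dynamic-load-provider>
            <!-- route new requests toward the least-busy worker -->
            <load-metric type="busyness"/>
        </dynamic-load-provider>
    </mod-cluster-config>
</subsystem>
```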
It came to my mind, that you might actually be seeing the correct behaviour -- within a short time span. Take a look at my small FAQ: I started mod_cluster and it looks like it's using only one of the workers. TL;DR: If you send only a relatively small amount of requests, it might look like the load balancing doesn't work whereas it's actually correct not to flood fresh newcomers with a barrage of requests at once.
I have this in mind:
On each server: (they all are set up identically)
A free database like MySQL or PostgreSQL.
Tomcat 6.x for hosting Servlet based Java applications
Hibernate 3.x as the ORM tool
Spring 2.5 for the business layer
Wicket 1.3.2 for the presentation layer
I place a load balancer in front of the servers and a replacement load balancer in case my primary load balancer goes down.
I use Terracotta to have the session information replicated between the servers. If a server goes down the user should be able to continue their work at another server, ideally as if nothing happened.
What is left to "solve" (I haven't actually tested this, and for example I don't yet know what to use as a load balancer) is the database replication that is needed.
If a user interacts with the application and the database changes, that change must be replicated to the databases on the other server machines. How should I go about doing that? Should I use MySQL, PostgreSQL, or something else (ideally free, as we have a limited budget)? Do the other choices above sound sensible?
Clarification: I cluster to get high availability first and foremost and I want to be able to add servers and use them all at the same time to get high scalability.
Since you're already using Terracotta, and you believe that a second DB is a good idea (agreed), you might consider expanding Terracotta's role. We have customers who use Terracotta for database replication. Here's a brief example/description, though I think they have stopped supporting clients for this product:
http://www.terracotta.org/web/display/orgsite/TCCS+Asynchronous+Data+Replication
You are trying to create multi-master replication, which is a very bad idea: every change to any database has to be replicated to every other database. This is terribly slow. On a single server you can get several hundred transactions per second using a couple of fast disks and RAID 1 or RAID 10, and much more with a good RAID controller with a battery-backed cache. Add the overhead of communicating with all your servers, and you'll get at most tens of transactions per second.
If you want high availability, you should go for a warm standby solution: you have a server that is replicated but not used, and when the main server dies a replacement takes over. You can lose some recent transactions if your main server dies.
You can also go for one-master, multiple-slave asynchronous replication. Every change to the database has to be performed on the one master server, but you can have several read-only slave servers. Data on these slaves can be several transactions behind the master, so here too you can lose some recent transactions if a server dies.
PostgreSQL has both types of replication: warm standby using log shipping, and one master with multiple slaves using Slony.
Only if you have a very small number of writes can you go for synchronous replication. This can also be set up for PostgreSQL using pgpool-II or Sequoia.
Please read High Availability, Load Balancing, and Replication chapter in Postgres documentation for more.
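As a rough illustration of the warm-standby (log-shipping) option, with placeholder paths:

```
# postgresql.conf on the primary: archive each completed WAL segment
archive_mode    = on
archive_command = 'cp %p /archive/%f'

# recovery.conf on the standby: continuously replay shipped segments
restore_command = 'cp /archive/%f %p'
```

In production the `cp` commands would be replaced by something network-aware (rsync, scp) and the standby stays in recovery until you promote it.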
For my (Perl-driven) website, I am using MySQL on two servers with database replication. Each MySQL server is both slave and master at the same time. I did this for redundancy, not for performance, but the setup has worked fine for the past three years; we had almost no downtime at all during this period.
Regarding Kent's question / comment: I am using the standard replication that comes with MySQL.
Regarding the failover mechanism: I am using DNSMadeEasy.com's failover functionality. I have a Perl script run every 5 minutes via cron that checks if replication is still running (and also lots of other things such as server load, HDD sanity, RAM usage, etc.). During normal operation, the faster of the two servers delivers all web pages. If the script detects that something is wrong with the server (or if the server is just plain down), DNSMadeEasy switches DNS entries so that the secondary server becomes primary. Once the "real" primary server is back up, MySQL automatically catches up on missing database changes and DNSMadeEasy automatically switches back.
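For reference, the usual way to keep two write-capable MySQL masters from colliding on auto-increment keys is a my.cnf along these lines (values illustrative):

```
# my.cnf on server A; server B mirrors this with
# server-id = 2 and auto_increment_offset = 2
server-id                = 1
log-bin                  = mysql-bin
auto_increment_increment = 2   # both masters step ids by 2
auto_increment_offset    = 1   # A takes odd ids, B takes even
```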
Here's an idea: read Theo Schlossnagle's book Scalable Internet Architectures.
What you're proposing is not the best idea.
Load balancers are expensive and not as valuable as they would appear. Use something simpler for distributing the load between your servers (something like Wackamole).
Rather than fool around with DB replication, spend your money on a reliable DB server separate from your front-end web servers. Do regular backups and in the very unlikely event of DB failure, get back running as quickly as possible from ordinary backups.
AFAIK, MySQL does a better job of being scalable. See the documentation:
http://dev.mysql.com/doc/mysql-ha-scalability/en/ha-overview.html
And there is a blog where you can take a look at real-life examples:
http://highscalability.com/tags/mysql