Moving data from many machines to a server using Hazelcast - java

We need to move database table rows, represented as text files, from several machines to a single machine. Our current solution is file based:
- Zip the files, then send them over the wire.
- The server receives the zip files from those machines and unzips them into a folder accordingly.
There are lots of other file-moving operations happening in between, which is really error-prone.
I'm thinking of using Hazelcast to move each "row" String to the server. Is Hazelcast up to this kind of job?
The text files are generated on many machines at a rate of 200K to 300K per day. These files must be sent to the server. So I want to migrate this to Hazelcast.

You can do this with Hazelcast, but it is the wrong use case for it. Hazelcast synchronizes in all directions: if you add an entry on client1, it will be transferred to the server, but also to client2. Even if this doesn't scare you, it shows Hazelcast is misused here.
You would be better off implementing a simple web service on the server to which the clients push the "rows".
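For illustration, a minimal sketch of such a service using the JDK's built-in com.sun.net.httpserver (the port, the /rows path, and the output file are assumptions); each client would simply POST one row at a time, e.g. with java.net.http.HttpClient:

import com.sun.net.httpserver.HttpServer;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class RowReceiver {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/rows", exchange -> {
            // Each request body is one "row" string from a client machine.
            String row = new String(exchange.getRequestBody().readAllBytes(),
                                    StandardCharsets.UTF_8);
            synchronized (RowReceiver.class) {        // serialize concurrent appends
                try (Writer w = new FileWriter("rows.txt", true)) {
                    w.write(row);
                    w.write(System.lineSeparator());
                }
            }
            exchange.sendResponseHeaders(204, -1);    // 204 No Content, empty body
            exchange.close();
        });
        server.start();
    }
}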

Related

How do I do coordination between multiple application instances?

Need help deciding which frameworks I can use in this scenario. I'm exploring ZooKeeper, but I'm not completely sure how to solve this use case.
Background:
Say there is an application that connects to a streaming source (Kafka, ActiveMQ, etc.) and writes the messages it processes from the stream to a file.
This application is deployed as 4 instances. Each instance writes the messages it processed in the last hour to its own file; for example, the filename servername_8.00 is used for messages processed from 8 to 9.
The requirement is to transfer all the files created in the last hour, but only if every instance created a file in that window, and also to send a single consolidated file listing all 4 file names and the number of records in each.
What I'm looking for:
1. How do I make sure the application instances know whether the other instances also created files, so that they only transmit their files once every instance has created one?
2. Whichever instance sends the consolidated file should know what was transmitted.
What frameworks can I use to solve this?
You can definitely use ZooKeeper for this. I would use Apache Curator as well (note: I'm the main author of Curator).
Do all the instances share a file server? i.e. can each instance see all of the created files? If so, you can nominate a leader using ZooKeeper/Curator and only the leader does all of the work. You can see sample leader election code here: https://github.com/apache/curator/tree/master/curator-examples/src/main/java/leader
If the instances do not share a file server, you could still use ZooKeeper to coordinate the writing of the shared file. You'd again nominate a leader who exposes an API of some kind that all the instances can write to and the leader creates the shared file.
You also might find the Curator barrier recipes useful: http://curator.apache.org/curator-recipes/double-barrier.html and http://curator.apache.org/curator-recipes/barrier.html
You'd have to give a lot more detail on your use case if you want a more detailed design.
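For illustration, a minimal leader-election sketch with Curator's LeaderLatch (the ZooKeeper connection string, the latch path, and the leader-only work are placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConsolidationLeader {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",            // placeholder ensemble
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        try (LeaderLatch latch = new LeaderLatch(client, "/app/consolidation-leader")) {
            latch.start();
            latch.await();    // blocks until this instance becomes the leader
            // Leader-only work goes here: verify all 4 hourly files exist,
            // transmit them, and write the consolidated file.
        }
    }
}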

Lightweight distributed file system implementation in Java

We are working on a Java EE application where users can upload images and videos (usually a zip containing a lot of small files) to the server. Generally we save the file(s) to a certain local directory, and then users can access them.
But things get complicated once the application is deployed on multiple servers behind a load balancer. Say there are two servers, Server1 and Server2, and a user tries to upload some files; the request is dispatched to Server2, and nothing goes wrong. Later, another user tries to access a file, and his request is dispatched to Server1; now the application cannot find the file.
Sounds like I need a distributed file system, though I only need a few features:
1) Nodes can detect each other by themselves.
2) Read and write files/directories.
3) Unzip archives.
4) Automatically distribute data based on the available space of the nodes.
HDFS is too big for my application, and I do not need to process the data; I only care about the storage.
Is there a lightweight, Java-based alternative that can be embedded in my application?
I'd probably solve this at the operating-system level using shared network-attached storage. (If you don't have a dedicated NAS available, you can have one application server act as the NAS by sharing the relevant directory over NFS.)

multiple clients querying data with no server

I'm working on a school project where the client needs to have multiple users querying and writing to a single data source. The users have access to shared network drives and all functionality has to be in the client application, the IT department won't allow a service to run from one of their servers and external server hosting isn't an option.
The amount of data that needs to be stored is actually very little: about 144 rows maximum.
I've looked into embedded databases (SQLite, HSQL, ObjectDB, etc.), but they seem like overkill for how little data needs to be saved. It also seemed like with HSQL, once anyone accessed the database it would be completely locked to any other user. Concurrency wouldn't be much of an issue anyway, since there will only be 5-7 people using the system, and only a few times a year at that.
Would using something like XQuery and serializing everything as XML be a viable option, or simply using the Java Serializable API?
A distributed, client-side database writing files to the shared network drive could be a good solution for this use case. Take a look at Cloud DB; it might be what you're looking for.
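If plain Java serialization (the alternative raised in the question) turns out to be enough for ~144 rows, a minimal sketch follows; the file location on the shared drive and the String[] row type are assumptions, and note this does nothing about two users writing at once:

import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class SharedStore {
    @SuppressWarnings("unchecked")
    static List<String[]> load(File file) throws IOException, ClassNotFoundException {
        if (!file.exists()) return new ArrayList<>();   // first run: empty data set
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (List<String[]>) in.readObject();
        }
    }

    static void save(List<String[]> rows, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(rows);
        }
    }
}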

How to perform load balancing that is as fair as possible, based on specific resource paths

I have an application that serves artifacts from files (pages from PDF files as images), the original PDF files live on S3 and they are downloaded to the servers that generate the images when a client hits one of them. These machines have a local caching mechanism that guarantees that each PDF file is downloaded only once.
So, when a client comes with a request like "give me page 1 of PDF 123.pdf", this cache is checked; if the PDF is not there, it is downloaded from S3 and stored in the local cache, and then a process generates page 1 and sends the image back to the client.
The client itself does not know it's connected to a special server; it looks like it is just accessing the website's server. But, for the sake of performance, I would like to make sure this client is always directed to the same file server that served its first request (and downloaded the file from S3).
I could just set a cookie on the client so it always downloads from that specific file server, but placing this on the client leads to unfair usage, as some users are going to open many documents and some are not, so I would like to perform this load balancing at the resource level (the PDF document).
Each document has a unique identifier (an integer primary key in the database), and my first solution was using Redis, storing the document id as the key and the host of the server machine that currently has the document cached as the value. But I would like to remove Redis, or find a simpler approach that would not require looking up keys somewhere else.
Also, it would be nice if the defined algorithm or idea would allow for adding more file servers on the fly.
What would be the best way to perform this kind of load balancing with affinity based on resources?
For what it's worth, this app is a mix of Ruby, Java, and Scala.
I'd use the following approach in the load balancer:
Strip the requested resource URL to remove the query and fragment parts.
Turn the stripped URL into a String and take its hashcode.
Use the hashcode to select the back end server from the list of available servers; e.g.
String[] serverNames = ...                   // the available back-end servers
int hash = strippedUrl.hashCode();           // note: hashCode() may be negative
String serverName = serverNames[Math.floorMod(hash, serverNames.length)];
This spreads the load evenly across all servers, and always sends the same request to the same server. If you add more servers, it adjusts itself ... though you take a performance hit while the cache warms up again.
I don't think you want to aim for "fairness", i.e. some kind of guarantee that each request takes roughly the same time. To achieve fairness you would need to actively monitor the load on each back end and dispatch according to load. That is going to (somewhat) negate the caching/affinity, and it is going to consume resources to do the measurement and the load-balancing decision making. A dumb load-spreading approach (e.g. my suggestion) should give you better throughput overall for your use case.
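Putting the three steps together, a minimal sketch (the class and method names, and the use of java.net.URI to strip the query and fragment, are my own choices rather than part of the answer above):

import java.net.URI;

public class ServerPicker {
    // Strip the query/fragment, hash the remainder, and pick a server.
    // Math.floorMod keeps the index non-negative even when hashCode() is negative.
    static String pickServer(String requestUrl, String[] serverNames) {
        URI u = URI.create(requestUrl);
        String stripped = u.getScheme() + "://" + u.getAuthority() + u.getPath();
        return serverNames[Math.floorMod(stripped.hashCode(), serverNames.length)];
    }
}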

Storing Java objects in server memory

I've got a Java web project handling several objects (each containing n objects of type A (e.g. time and value) and m objects of type B (e.g. time and a String array)). The web project itself contains several servlets/JSPs for visualization, as well as some logic for data manipulation, and currently runs on an Apache Tomcat.
Is it possible to keep all the data in the server's (most of the time: local) memory while the server is running? When Tomcat is shut down, the data could be stored in a simple file; no restrictions there. On server startup, I just want to read in the files and write the objects to memory. How can I get Tomcat to do this?
The reason I do not want to use an extra database is that I want to deliver a zip file containing Tomcat with the deployed *.war file (as I don't want my prof getting stuck with Tomcat server setup, etc.).
Thanks, ChrisH
You could implement ServletContextListener and put the load-from-file and save-to-file logic in the contextInitialized() and contextDestroyed() methods, which are invoked during the webapp's startup and shutdown respectively.
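A minimal sketch of that listener (the file name, the "data" attribute, and the use of plain object serialization are assumptions for illustration; register it via @WebListener or in web.xml):

import java.io.*;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class PersistenceListener implements ServletContextListener {

    private static final File STORE = new File("datastore.ser");   // placeholder path

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // Load previously saved objects on webapp startup.
        if (STORE.exists()) {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(STORE))) {
                sce.getServletContext().setAttribute("data", in.readObject());
            } catch (IOException | ClassNotFoundException e) {
                throw new IllegalStateException("Could not load data store", e);
            }
        }
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // Save the objects back to disk on webapp shutdown.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(STORE))) {
            out.writeObject(sce.getServletContext().getAttribute("data"));
        } catch (IOException e) {
            throw new IllegalStateException("Could not save data store", e);
        }
    }
}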
You can read and write objects to disk, but they all need to implement java.io.Serializable first. Here is a Serialization tutorial with code examples.
That said, have you considered an embedded database, so that you don't need to install a database server? You could use JDK 6's built-in JavaDB for this, or its competitor HSQLDB. Alternatively, if the data is pure key-value pairs, you could also just use the java.util.Properties API (tutorial here). Just place the properties file somewhere in the classpath and use ClassLoader#getResourceAsStream() to get an InputStream of it, or place it somewhere in WEB-INF and use ServletContext#getResourceAsStream().
I think that HSQLDB is exactly what you need: a small database engine that can run embedded in your application (and hence inside Apache Tomcat). It stores data in memory and can also write and read contents from a file.
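For example, a file-backed embedded HSQLDB needs nothing more than a JDBC URL (the database path below is a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class EmbeddedDb {
    // "file:" URLs make HSQLDB persist to disk; shutdown=true writes the
    // data files out when the last connection closes.
    static Connection open() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:hsqldb:file:data/appdb;shutdown=true", "SA", "");
    }
}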
If the app shuts down unexpectedly, you'll lose all your data, because it won't have time to write it to disk.
You could use a database like SQLite/Derby/HSQL etc., which store their data on the filesystem.
If you don't want to mess with a DB, then you could store everything in memory and flush it to disk every time it's modified. A couple of tips here:
Serialization can make this really easy. Make all your objects implement Serializable, and give them a serial version id.
Use a BufferedOutputStream when writing to disk; this is faster than a straight FileOutputStream.
DO NOT overwrite your old data file directly! Write to a new file, and when done writing, move the completed file on top of your old file (see the sketch after these tips). That way, if the server shuts down while you're in the middle of writing your data file, you still have the good file which was written before.
You should acquire a read lock on your data while writing it out; any other code that modifies the data should acquire a write lock.
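A minimal sketch of the write-then-move tip (the class and method names are placeholders; ATOMIC_MOVE throws on file systems that cannot do the replace atomically):

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeSaver {
    static void saveData(Serializable data, Path target) throws IOException {
        // Write the complete snapshot to a temporary sibling file first.
        Path temp = target.resolveSibling(target.getFileName() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(temp)))) {
            out.writeObject(data);
        }
        // Only after the write succeeded, move it over the old file.
        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }
}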
If you don't care about the possibility that your application may scribble all over its data files, that your Tomcat/JVM may crash, or that your machine may die losing all in-memory objects, then managing persistence as you suggest is an option. But you'll have quite a bit of infrastructure to build, test and maintain, and you'll miss out on the "value add" tools that most RDBMSes provide: backup, a query tool, optimizers, replication, etc.
But if catastrophic data loss is not an option for you, you should use an RDBMS, an ODBMS, or whatever, to do your persistence.
