We are working on a Java EE application where users can upload images and videos (usually a zip containing many small files) to the server. Generally we save the file(s) to a local directory, and then users can access them.
But things get complicated once the application is deployed on multiple servers behind a load balancer. Say there are two servers, Server1 and Server2, and one user tries to upload some files; the request is dispatched to Server2, and nothing goes wrong. Later, another user tries to access those files, his request is dispatched to Server1, and the application cannot find them.
Sounds like I need a distributed file system, though I only need a few features:
1) Nodes can detect each other by themselves.
2) Read and write files/directories.
3) Unzip archives.
4) Automatically distribute data based on the available space of each node.
HDFS is too big for my application, and I don't need to process the data; I only care about storage.
Is there a java based lightweight alternative solution which can be embedded in my application?
I'd probably solve this at the operating system level using shared network-attached storage. (If you don't have a dedicated NAS available, you can have an application server act as a NAS by sharing the relevant directory over NFS.)
Related
I'm building a framework application for managing files between compute engines based on their roles. I've been able to find a lot of information for managing the compute engines themselves, but nothing for getting access to them to manage files.
There is no Google API for reading and writing to instance filesystems from outside the instance. You will need to run a service on the instance (ftp, sftp, whatever) if you want external access to its filesystem.
Fratzke, first of all, your question is incomplete: what platform are you using on the Compute Engine instances, and do you only want read-only access?
Note that if you want a persistent disk to be shared among many Compute Engine instances, you can only do that in read-only mode. The Google documentation clearly states:
It is possible to attach a persistent disk to more than one instance. However, if you attach a persistent disk to multiple instances, all instances must attach the persistent disk in read-only mode. It is not possible to attach the persistent disk to multiple instances in read-write mode.
However, you can share files between multiple instances using managed services like Google Cloud Storage, Google Cloud Datastore, etc. An answer to a similar question is already on SO; refer to this link:
Share a persistent disk between Google Compute Engine VMs
Edit:
If you want to remotely access a filesystem on multiple instances using Java, the instances have to make the filesystem available through a service. Such a service is typically a background job which handles incoming requests and returns responses, e.g. for authentication, authorization, reading, and writing. The specification of the request/response pattern is called a protocol, for example NFS on UNIX/Linux. You can write your own file service provider and run it on the instances. As a transport layer for such an endeavor, sockets (TCP/IP) can be used. A good transport layer would be the HTTP protocol, e.g. with a RESTful service.
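As a sketch of such a RESTful file service, here is a toy read-only version built on the JDK's bundled com.sun.net.httpserver server. The SimpleFileService name and the /files/ route are illustrative; a real service would add authentication, authorization, and write support as described above:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;

// A toy read-only file service over HTTP:
// GET /files/<name> returns the named file from a root directory.
public class SimpleFileService {
    private final HttpServer server;

    public SimpleFileService(Path root, int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/files/", exchange -> {
            String name = exchange.getRequestURI().getPath()
                                  .substring("/files/".length());
            Path file = root.resolve(name).normalize();
            // Reject path traversal attempts and missing files.
            if (!file.startsWith(root) || !Files.isRegularFile(file)) {
                exchange.sendResponseHeaders(404, -1);
                exchange.close();
                return;
            }
            byte[] body = Files.readAllBytes(file);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    public void start() { server.start(); }
    public void stop()  { server.stop(0); }
    public int port()   { return server.getAddress().getPort(); }
}
```

Passing port 0 lets the OS pick a free port, which is convenient for testing; in production you would bind a fixed port and put TLS in front of it.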
Edit - 2
Another possibility is using sshfs: mount the filesystem on each instance and use it within your Java code as a mounted network share.
Hope this helps. Warm regards.
In Cloud Foundry, there is no persistent filesystem storage in the app containers. The standard suggestion is to use a DB, but I have a specific scenario where writing to a persistent file is needed: I use a 3rd-party app that requires storing config data in a file. I can select the file path, but I cannot override the persistent file storage requirement.
Is there any library that presents a filesystem abstraction to Java but actually stores the files on Amazon S3?
It is only a single file that grows as we go along. The file size is about 1 MB but could reach a few MBs.
The application design documentation from cloud foundry recommends not writing to the local file system:
http://docs.cloudfoundry.org/devguide/deploy-apps/prepare-to-deploy.html#filesystem
This keeps you compliant with what is called a twelve-factor application, which uses backing services to access things like storage systems and runs processes that don't rely on shared storage.
Unfortunately there does not appear to be a file system service for Cloud Foundry, although it's been discussed:
https://groups.google.com/a/cloudfoundry.org/forum/#!topic/vcap-dev/Kj08I2H7HHc
Such a service will eventually appear in order to support applications like Drupal, in spite of recommendations to use more efficient mechanisms like S3 to store files.
I am planning to migrate a previously created Java web application to Azure. The application has used log4j for application-level logs, which were saved to a locally created file. The problem is that, with the Azure Role having multiple instances, I must collect and aggregate these logs and also make sure they are stored in persistent storage instead of on the virtual machine's hard drive.
Logging is a critical component of the application, but it must not slow down the actual work. I have considered multiple options and am curious about the best practice: the best solution considering security, log consistency, and performance both at storage time and in later processing. Here is a list of the options:
Using log4j with a custom Appender to store information in Azure SQL.
Using log4j with a custom Appender to store information in Azure Tables storage.
Writing an additional tool that transfers data from local hard drive to either of the above persistent storages.
Is there any other method or are there any complete solutions for this problem for Java?
Which of the above would be best considering the above mentioned criteria?
There's no out-of-the-box solution right now, but... a custom appender for Table Storage makes sense, as you can then query your logs in a similar fashion to diagnostics (perf counters, etc.).
The only consideration is whether you're writing log statements in massive quantity (like hundreds of times per second). At that rate, you'll start to notice transaction costs showing up on the monthly bill. At a penny per 10,000 transactions and 100 per second, you're looking at about $250 per instance per month. If you have multiple instances, the cost goes up from there. With SQL Azure, you'd have no transaction cost, but you'd have a higher storage cost.
If you want to go with a storage-transfer approach, you can set up Windows Azure diagnostics to watch a directory and upload files periodically to blob storage. The only snag is that Java doesn't have direct support for configuring diagnostics. If you're building your project from Eclipse, you only have a script file that launches everything, so you'd need to write a small .NET app or use something like AzureRunMe. If you're building a Visual Studio project to launch your Java app, then you can set up diagnostics without a separate app.
There's a blog post from Persistent Systems that just got published, regarding Java and diagnostics setup. I'll update this answer with a link once it's live. Also, have a look at Cloud Ninja for Java, which implements Tomcat logging (and related parsing) by using an external .net exe that sets up diagnostics, as described in the upcoming post.
Please visit my blog and download the document. Look for the chapter "Tomcat Solution Diagnostics" for an error logging solution. The document was written a while back, but you can certainly use this method to produce any kind of Java-based logging (log4j included) in Tomcat and view it directly.
Chapter 6: Tomcat Solution Diagnostics
Error Logging
Viewing Log Files
http://blogs.msdn.com/b/avkashchauhan/archive/2010/10/29/windows-azure-tomcat-solution-accelerator-full-solution-document.aspx
In any scenario with a custom application (java.exe, php.exe, python, etc.), I suggest creating the log file directly in the "Local Storage" folder and then initializing Azure Diagnostics in the worker role (WorkerRole.cs) to export these custom log files from the Azure VM to your Azure Blob storage.
How to create custom logs on local storage is described here.
Using Azure Diagnostics and sending logs to Azure Blob storage would be cheaper and more robust than any other method you have described.
Finally I decided to write a log4j Appender. I didn't need to gather diagnostics information; my main goal was only to gather the log files in an easily exchangeable way. My first fear was that it would slow down the application, but by buffering in memory and only periodically writing the log data out to Azure Tables, it works perfectly without making too many API calls.
Here are the main steps for my implementation:
First I created an entity class to be stored in Azure Tables, called LogEntity that extends com.microsoft.windowsazure.services.table.client.TableServiceEntity.
Next I wrote the appender that extends org.apache.log4j.AppenderSkeleton containing a java.util.List<LogEntity>.
In the overridden method protected void append(LoggingEvent event) I only add to this collection; a separate thread periodically empties the list and writes the data to Azure Tables.
Finally I added the newly created Appender to my log4j configuration file.
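The buffering pattern in the steps above can be sketched in plain Java. Here the log4j AppenderSkeleton hook and the Azure Tables client are replaced with placeholders (BufferedLogWriter is an illustrative name, and flush() is where the real batch insert into Azure Tables would go):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Collects log lines in memory and flushes them in batches on a background
// thread, so the logging call itself never blocks on a network round trip.
public class BufferedLogWriter {
    private final List<String> buffer = new ArrayList<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public BufferedLogWriter(long flushIntervalSeconds) {
        scheduler.scheduleAtFixedRate(this::flush,
                flushIntervalSeconds, flushIntervalSeconds, TimeUnit.SECONDS);
    }

    // Called from the appender's append(LoggingEvent) override.
    public synchronized void append(String formattedEvent) {
        buffer.add(formattedEvent);
    }

    // Drains the buffer; a real implementation would write this batch
    // to Azure Tables instead of just returning it.
    public synchronized List<String> flush() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        return batch;
    }

    public void shutdown() {
        scheduler.shutdown();
    }
}
```

Because both append and flush synchronize on the writer, the logging thread only pays for an in-memory list add; all network I/O happens on the scheduler thread.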
Another alternative;
Can we not continue using log4j the standard way (such as DailyRollingFileAppender), with the file created on a UNC path on a VM (IaaS)?
This VM will only need a bit of disk space and need not have much processing power. So one could share an existing VM, or create a VM with the most minimal configuration, preferably in the same region and cloud service.
The accumulated log files can be accessed via RDP/ FTP etc.
That way one will not incur transaction costs or the cost of developing a special log4j appender; it could turn out to be a cheaper alternative.
thanks
Jeevan
PS: I am referring to one's application logging, not the app-server logs (the catalina/manager .log or .out files, or WebLogic's).
I have multiple clients:
client 1 - 40 users
client 2 - 50 users
client 3 - 60 users
And I have a web application that is supposed to serve all the clients.
The application is deployed to Tomcat. Each client has its own database.
What I want is a single web application instance that serves all the clients. The client (and the database to connect to) is identified by the context path in the URL.
I.e. I imply the following scenario:
Some user requests the http://mydomain.com/client1/
Tomcat invokes a single instance of my application (no matter which context is requested)
My application processes the rest of the request as if it were deployed to the /client1 context path, i.e. all redirects or relative URLs are resolved against http://mydomain.com/client1/
When client 2 requests http://mydomain.com/client2/, I want the same application instance to process it just as if it were deployed to the /client2 context path.
Is this possible with Tomcat?
Your application has to do this, not Tomcat. You could deploy your application in three contexts (client1, client2, client3) with slightly different database configuration for each, and if you are careful to use relative URLs (i.e. don't do things like /images) you can do this without code changes. This is the transparent way of making your application reusable: each instance is unaware of the global picture, namely that three different instances of it are running. That means you can easily deploy more without changing your application; you just configure a new instance and go. The only requirement is that you don't use absolute URLs to resources. Using ServletContext.getContextPath(), and using .. in your CSS, scripts, etc., helps here as well.
Probably one of the biggest advantages of working this way is that your app doesn't care about global concerns. Because it isn't involved in those decisions, you can run 3 instances on one Tomcat server, or, if one client needs more scaling, move that client to its own Tomcat server easily. Making your app portable forces you to deal with how to install it in any environment. This is a pillar of horizontal scaling, which your situation can take full advantage of because you can split your DB data without ever having to rejoin it (a huge advantage). The option you asked about doesn't force you to deal with this, so when the time comes to deal with it, it will be painful.
The other option is more involved and requires significant changes to your application. It means parsing the incoming URL, pulling out the name of the client, and using that name to look up (in a configuration file) the database that should be used for that client. Spring MVC can handle things like extracting variables from URL paths. You then have to make sure everything you render back points to that client's portion of the URL. This has much the same requirements as the first option: you can use absolute URLs for things like JavaScript, CSS, and images, but URLs into your app have to be rewritten at runtime so they are relative to the requesting client. The benefit is that you load your application only once.
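A minimal sketch of the second option's tenant lookup follows. The client names, the JDBC URLs, and the TenantResolver class are all illustrative; in a real app the map would come from a configuration file, and a framework like Spring MVC would do the path parsing:

```java
import java.util.HashMap;
import java.util.Map;

// Resolves the first path segment of a request URI to a per-client
// database URL, as in http://mydomain.com/client1/... -> client1's DB.
public class TenantResolver {
    private final Map<String, String> dbUrlsByClient = new HashMap<>();

    public TenantResolver() {
        // Illustrative values; in practice these come from configuration.
        dbUrlsByClient.put("client1", "jdbc:postgresql://db1/client1");
        dbUrlsByClient.put("client2", "jdbc:postgresql://db2/client2");
    }

    // Extracts the client name from a path like "/client1/orders/42".
    public String clientFor(String requestPath) {
        String[] segments = requestPath.split("/");
        if (segments.length < 2 || !dbUrlsByClient.containsKey(segments[1])) {
            throw new IllegalArgumentException(
                    "Unknown client in path: " + requestPath);
        }
        return segments[1];
    }

    public String dbUrlFor(String requestPath) {
        return dbUrlsByClient.get(clientFor(requestPath));
    }
}
```

A servlet filter (or a Spring interceptor) would call dbUrlFor once per request and bind the matching DataSource to the current thread before the rest of the application runs.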
Just as an aside, if you host your CSS, Javascript, images on a CDN in production then both of these options must be relative URL aware. Upsides and downsides to using CDNs as well.
While that sounds good, it might not be, because all clients then run the same version of the app. If you bring down the app to do maintenance for client1, it affects all clients. If you think you'll need per-client customization, this option will get messy quickly. Upgrading a single client means all clients must upgrade, and depending on your business model that might not be acceptable. Furthermore, I'm not sure you'll save much memory running only a single copy of the application, because most apps take up only around 10 MB of loaded code. The vast majority of the memory is in the VM and in processing requests, and using a single Tomcat instance means you share the VM either way. Whether 1 or 3 contexts are deployed, you still serve the same number of requests. You might see a difference of 30-100 MB, which in today's world is chump change, and none of those other concerns are addressed if you choose to save only a couple of MB.
Essentially there are facilities in Tomcat to aid you here (multiple contexts), but it's mostly up to your application, especially if it's a single instance.
I have an application that serves artifacts from files (pages from PDF files as images), the original PDF files live on S3 and they are downloaded to the servers that generate the images when a client hits one of them. These machines have a local caching mechanism that guarantees that each PDF file is downloaded only once.
So, when a client comes with a request give me page 1 of pdf 123.pdf this cache is checked, if there is no pdf file in there, it's downloaded from S3 and stored at the local cache and then a process generates this page 1 and sends the image back to the client.
The client itself does not know it's connected to a special server; it all looks like it's just accessing the website. But for the sake of performance I would like to make sure a client is always directed to the same file server that served its first request (and downloaded the file from S3).
I could just set a cookie so the client always downloads from that specific file server, but keying this on the client leads to unfair usage, since some users open many documents and some do not. I would rather perform this load balancing at the resource level (the PDF document).
Each document has a unique identifier (an integer primary key in the database). My first solution was using Redis, storing the document id as the key and, as the value, the host of the server machine that currently has the document cached. But I would like to remove Redis, or find a simpler approach that does not require looking up keys somewhere else.
Also, it would be nice if the defined algorithm or idea would allow for adding more file servers on the fly.
What would be the best way to perform this kind of load balancing with affinity based on resources?
Just for the sake of saying, this app is a mix of Ruby, java and Scala.
I'd use the following approach in the load balancer:
Strip the requested resource URL to remove the query and fragment parts.
Turn the stripped URL into a String and take its hashcode.
Use the hashcode to select the back end server from the list of available servers; e.g.
String[] serverNames = ...
int hash = strippedUrl.hashCode();
// Math.floorMod avoids a negative index when hashCode() is negative.
String serverName = serverNames[Math.floorMod(hash, serverNames.length)];
This spreads the load evenly across all servers, and always sends the same request to the same server. If you add more servers, it adjusts itself ... though you take a performance hit while the caching warms up again.
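Putting the three steps together, a minimal sketch (the AffinityRouter name and the server list are illustrative):

```java
import java.net.URI;

// Routes each resource URL to a fixed back-end server by hashing the
// URL with its query and fragment parts stripped off.
public class AffinityRouter {
    private final String[] serverNames;

    public AffinityRouter(String[] serverNames) {
        this.serverNames = serverNames;
    }

    public String route(String requestUrl) {
        URI uri = URI.create(requestUrl);
        // Keep only scheme, host, and path; drop query and fragment.
        String stripped = uri.getScheme() + "://" + uri.getAuthority()
                + uri.getPath();
        // floorMod keeps the index non-negative for negative hash codes.
        return serverNames[Math.floorMod(stripped.hashCode(),
                serverNames.length)];
    }
}
```

Because the hash ignores the query string, requests for different pages of the same PDF (e.g. ?page=1 and ?page=2) land on the same server and hit its cache.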
I don't think you want to aim for "fairness", i.e. some kind of guarantee that each request takes roughly the same time. To achieve fairness you need to actively monitor the load on each backend and dispatch according to load. That will (somewhat) negate the caching affinity, and it consumes resources to do the measurement and the load-balancing decision making. A dumb load-spreading approach (e.g. my suggestion) should give you better overall throughput for your use case.