In Cloud Foundry, there is no persistent file-system storage in the app containers. The standard suggestion is to use a database, but I have a specific scenario where writing to a persistent file is needed: I use a third-party app that requires storing config data in a file. I can select the file path, but I cannot override the requirement for persistent file storage.
Is there any library that abstracts the filesystem visible to Java and actually stores the files on Amazon S3?
It is only a single file that builds up as we go along. The file size is about 1 MB but could reach a few MB.
The application design documentation from Cloud Foundry recommends not writing to the local file system:
http://docs.cloudfoundry.org/devguide/deploy-apps/prepare-to-deploy.html#filesystem
This is to be compliant with what is called a twelve-factor application, where one uses backing services to access things like storage systems and runs processes that don't rely on shared storage.
Unfortunately there does not appear to be a file system service for Cloud Foundry, although it's been discussed:
https://groups.google.com/a/cloudfoundry.org/forum/#!topic/vcap-dev/Kj08I2H7HHc
Such a service will eventually appear in order to support applications like Drupal, in spite of recommendations to use more efficient mechanisms like S3 to store files.
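I am not aware of a widely used library that transparently mounts S3 as a Java filesystem, but for a single config file a pragmatic workaround along those lines is to sync the file with S3 around the third-party app's reads and writes. A minimal sketch using the AWS SDK for Java (the bucket name, key and local path are placeholders):

    import java.io.File;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;

    public class S3ConfigSync {

        // Placeholder bucket/key; replace with your own values.
        private static final String BUCKET = "my-config-bucket";
        private static final String KEY = "thirdparty/config.dat";

        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        /** Pull the latest copy of the config file from S3 to the local (ephemeral) path on startup. */
        public void download(File localFile) {
            s3.getObject(new GetObjectRequest(BUCKET, KEY), localFile);
        }

        /** Push the locally modified config file back to S3 so it survives container restarts. */
        public void upload(File localFile) {
            s3.putObject(BUCKET, KEY, localFile);
        }
    }

The app would call download() on startup and upload() whenever the third-party library rewrites the file (for example on a timer or a shutdown hook). With multiple instances writing the same key, the last writer wins, so this only works if concurrent modification is acceptable.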
Related
I'm building a framework application for managing files between Compute Engine instances based on their roles. I've been able to find a lot of information about managing the instances themselves, but nothing about getting access to them to manage files.
There is no Google API for reading and writing to instance filesystems from outside the instance. You will need to run a service on the instance (ftp, sftp, whatever) if you want external access to its filesystem.
Fratzke, first of all your question is incomplete: what platform are you using on the Compute Engine instances, and do you want read-only access only?
Note that if you want a persistent disk to be shared among many Compute Engine instances, you can do that only in read-only mode. The Google documentation clearly states:
It is possible to attach a persistent disk to more than one instance. However, if you attach a persistent disk to multiple instances, all instances must attach the persistent disk in read-only mode. It is not possible to attach the persistent disk to multiple instances in read-write mode.
However, you can share files between multiple instances using managed services like Google Cloud Storage, Google Cloud Datastore, etc. An answer to a similar question is already on SO; refer to this link:
Share a persistent disk between Google Compute Engine VMs
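As a rough illustration of the managed-storage route, here is a minimal sketch using the google-cloud-storage Java client, where every instance reads and writes the same object instead of sharing a disk (the bucket and object names are placeholders):

    import com.google.cloud.storage.BlobId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SharedFileStore {
        private final Storage storage = StorageOptions.getDefaultInstance().getService();

        /** Upload a local file so that every instance can see it. */
        public void put(String bucket, String objectName, String localPath) throws Exception {
            BlobId blobId = BlobId.of(bucket, objectName);
            storage.create(BlobInfo.newBuilder(blobId).build(),
                    Files.readAllBytes(Paths.get(localPath)));
        }

        /** Download the shared file on any other instance. */
        public byte[] get(String bucket, String objectName) {
            return storage.readAllBytes(BlobId.of(bucket, objectName));
        }
    }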
Edit:
If you want to access a filesystem on multiple instances remotely using Java, then the instances have to make the filesystem available through a service. Such a service is typically a background job that handles incoming requests and returns responses, e.g. for authentication, authorization, reading and writing. The specification of the request/response pattern is called a protocol, e.g. NFS on UNIX/Linux. You can write your own file service provider and run it on the instances. Plain sockets (TCP/IP) can serve as the transport layer, but a good choice would be HTTP, e.g. with a RESTful service.
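A minimal sketch of such a RESTful file endpoint, using only the JDK's built-in HttpServer (the port and base directory are placeholders, and there is no authentication or path validation here):

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SimpleFileService {
        public static void main(String[] args) throws Exception {
            Path baseDir = Paths.get("/srv/shared");          // placeholder base directory
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

            server.createContext("/files/", exchange -> {
                // Map the URL path below /files/ onto a file inside the base directory.
                String name = exchange.getRequestURI().getPath().substring("/files/".length());
                Path file = baseDir.resolve(name).normalize();

                if ("PUT".equals(exchange.getRequestMethod())) {
                    Files.copy(exchange.getRequestBody(), file,
                            java.nio.file.StandardCopyOption.REPLACE_EXISTING);
                    exchange.sendResponseHeaders(204, -1);      // stored, no response body
                } else {                                        // treat everything else as GET
                    byte[] body = Files.readAllBytes(file);
                    exchange.sendResponseHeaders(200, body.length);
                    try (OutputStream out = exchange.getResponseBody()) {
                        out.write(body);
                    }
                }
                exchange.close();
            });
            server.start();
        }
    }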
Edit - 2
Another possibility is using sshfs: mount the filesystem on each instance and use it within your Java code as a mounted network shared drive.
Hope this helps you ...Warm Regards.
We are working on a Java EE application where users can upload images and videos (usually a zip containing a lot of small files) to the server. Generally we save the file(s) to a certain local directory, and then users can access them.
But things get complicated once the application is deployed on multiple servers behind a load balancer. Say there are two servers, Server1 and Server2, and a user uploads some files; this request is dispatched to Server2, and nothing is wrong so far. Later, another user tries to access the files, but his request is dispatched to Server1, and the application cannot find them.
Sounds like I need a distributed file system, although I only need a few features:
1) Nodes can detect each other by themselves.
2) Read and write files/directories.
3) Unzip archives.
4) Automatically distribute data based on the available space of the nodes.
HDFS is too big for my application, and I do not need to process the data; I only care about the storage.
Is there a lightweight Java-based alternative that can be embedded in my application?
I'd probably solve this at the operating system level using shared network attached storage. (If you don't have a dedicated NAS available, you can have an application server act as the NAS by sharing the relevant directory over NFS.)
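With such a shared mount in place, the application code on every node simply reads and writes ordinary paths; a minimal sketch, assuming the share is mounted at the hypothetical path /mnt/uploads on all servers:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SharedUploadStore {
        // Placeholder: the NFS share mounted at the same path on every application server.
        private static final Path UPLOAD_ROOT = Paths.get("/mnt/uploads");

        /** Save an uploaded file; any node behind the load balancer will see it. */
        public Path save(String fileName, byte[] content) throws IOException {
            Path target = UPLOAD_ROOT.resolve(fileName);
            Files.createDirectories(target.getParent());
            return Files.write(target, content);
        }

        /** Read the file back, regardless of which node handled the upload. */
        public byte[] load(String fileName) throws IOException {
            return Files.readAllBytes(UPLOAD_ROOT.resolve(fileName));
        }
    }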
We have two ways of keeping files - database or filesystem.
So I have an EJB that can be accessed both from a servlet and from a standalone Java application.
The problem is working with binary files. I found here that there is a restriction for EJBs (NOT for other Java EE containers) against writing/reading files to/from the filesystem. They say to keep them in the database.
OK, I said. But here and here everybody says that's a very bad idea, and I totally agree.
Another solution is JCR, in its Apache Jackrabbit implementation. But I consider it a mix of both approaches: here it says it also keeps files in the filesystem.
So what is the best practice? Why can't we write files to the filesystem without JCR?
You could use a JCA connector to access the filesystem from your EJBs; that way the filesystem becomes a managed resource. See http://code.google.com/p/txfs/ for basic filesystem access,
or XADisk for more complete, transactional access.
Nuxeo doesn't use JCR anymore, and there is also ModeShape (the JBoss/Red Hat implementation) on top of Infinispan as a data grid.
Keeping the binary files on the file system has the least overall performance impact, and that's the most appropriate way to go (applications managing tons of binary content, like Alfresco, use the file system and it works fine for them). If you need your files to be accessed from multiple servers, you could store them on a single server and share the directory on the network, or just use a NAS. However, working with the file system is not transactional; if that is critical, you may use tools like XADisk that support transactions at the file system level.
JCR is also a good way to go, but it is not meant to store only binary files; it is meant to store "content" (binary data plus metadata describing it) in a hierarchical way. However, using JCR adds another level of abstraction, and that may have a performance impact. As the most popular JCR implementations like Apache Jackrabbit, Alfresco and Nuxeo use the file system behind the scenes, you should consider whether you really need the additional features that JCR provides (metadata, versioning, thumbnails, document preview, etc.) at the cost of a performance penalty and integration effort, or whether you can just go with the file system if it satisfies your needs.
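For orientation, here is a minimal sketch of storing a binary file through the standard javax.jcr API, assuming a Jackrabbit TransientRepository with its default configuration (the file name, credentials and MIME type are placeholders):

    import java.io.FileInputStream;
    import java.util.Calendar;
    import javax.jcr.Binary;
    import javax.jcr.Node;
    import javax.jcr.Repository;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;
    import org.apache.jackrabbit.core.TransientRepository;

    public class JcrUploadExample {
        public static void main(String[] args) throws Exception {
            Repository repository = new TransientRepository();   // local test repository
            Session session = repository.login(
                    new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                // nt:file node with the mandatory jcr:content child of type nt:resource
                Node file = session.getRootNode().addNode("report.pdf", "nt:file");
                Node content = file.addNode("jcr:content", "nt:resource");

                try (FileInputStream in = new FileInputStream("report.pdf")) {
                    Binary binary = session.getValueFactory().createBinary(in);
                    content.setProperty("jcr:data", binary);
                }
                content.setProperty("jcr:mimeType", "application/pdf");
                content.setProperty("jcr:lastModified", Calendar.getInstance());

                session.save();
            } finally {
                session.logout();
            }
        }
    }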
I am planning to migrate a previously created Java web application to Azure. The application previously used log4j for application-level logs that were saved in a locally created file. The problem is that, with the Azure role having multiple instances, I must collect and aggregate these logs and also make sure that they are stored in persistent storage instead of on the virtual machine's hard drive.
Logging is a critical component of the application, but it must not slow down the actual work. I have considered multiple options and I am curious about the best practice: the best solution considering security, log consistency and performance, both at storage time and during later processing. Here is a list of the options:
Using log4j with a custom Appender to store information in Azure SQL.
Using log4j with a custom Appender to store information in Azure Tables storage.
Writing an additional tool that transfers data from local hard drive to either of the above persistent storages.
Is there any other method or are there any complete solutions for this problem for Java?
Which of the above would be best considering the above mentioned criteria?
There's no out-of-the-box solution right now, but... a custom appender for Table Storage makes sense, as you can then query your logs in a similar fashion to diagnostics (perf counters, etc.).
The only consideration is whether you're writing log statements in massive quantities (like hundreds of times per second). At that rate, you'll start to notice transaction costs showing up on the monthly bill. At a penny per 10,000 transactions and 100 per second, you're looking at about $250 per instance per month. If you have multiple instances, the cost goes up from there. With SQL Azure you'd have no transaction cost, but you'd have a higher storage cost.
If you want to go with a storage transfer approach, you can set up Windows Azure diagnostics to watch a directory and upload files periodically to blob storage. The only snag is that Java doesn't have direct support for configuring diagnostics. If you're building your project from Eclipse, you only have a script file that launches everything, so you'd need to write a small .net app, or use something like AzureRunMe. If you're building a Visual Studio project to launch your Java app, then you have the ability to set up diagnostics without a separate app.
There's a blog post from Persistent Systems that just got published, regarding Java and diagnostics setup. I'll update this answer with a link once it's live. Also, have a look at Cloud Ninja for Java, which implements Tomcat logging (and related parsing) by using an external .net exe that sets up diagnostics, as described in the upcoming post.
Please visit my blog and download the document. In this document, look for the chapter "Tomcat Solution Diagnostics" for the error logging solution. This document was written a while back, but you can certainly use this method to generate any kind of Java-based logging (log4j included) in Tomcat and view it directly.
Chapter 6: Tomcat Solution Diagnostics
Error Logging
Viewing Log Files
http://blogs.msdn.com/b/avkashchauhan/archive/2010/10/29/windows-azure-tomcat-solution-accelerator-full-solution-document.aspx
In any scenario where there is a custom application (i.e. java.exe, php.exe, python, etc.), I suggest creating the log file directly in the "Local Storage" folder and then initializing Azure Diagnostics in the worker role (WorkerRole.cs) to export these custom log files directly from the Azure VM to your Azure Blob storage.
How to create custom logs on local storage is described here.
Using Azure Diagnostics and sending logs to Azure Blob storage would be cheaper and more robust than any other method you have described.
Finally, I decided to write a Log4J appender. I didn't need to gather diagnostics information; my main goal was only to collect the log files in an easily exchangeable way. My first fear was that it would slow down the application, but by writing only to memory and only periodically flushing the log data to Azure Tables, it works perfectly without making too many API calls.
Here are the main steps for my implementation (a rough sketch follows the list):
First I created an entity class to be stored in Azure Tables, called LogEntity that extends com.microsoft.windowsazure.services.table.client.TableServiceEntity.
Next I wrote the appender that extends org.apache.log4j.AppenderSkeleton containing a java.util.List<LogEntity>.
In the overridden method protected void append(LoggingEvent event) I only add to this collection, and I created a thread that periodically empties the list and writes the data to Azure Tables.
Finally I added the newly created Appender to my log4j configuration file.
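A rough sketch of an appender along the lines of these steps; the in-memory buffering and the flush thread are shown, while the LogEntity conversion and the actual Azure Tables insert are left as a commented placeholder because the exact calls depend on the SDK version used:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.log4j.AppenderSkeleton;
    import org.apache.log4j.spi.LoggingEvent;

    public class AzureTableAppender extends AppenderSkeleton {

        // In-memory buffer; append() only touches this, so logging stays fast.
        private final List<LoggingEvent> buffer = new ArrayList<>();

        private final ScheduledExecutorService flusher =
                Executors.newSingleThreadScheduledExecutor();

        public AzureTableAppender() {
            // Flush the buffer to Azure Tables every 30 seconds (placeholder interval).
            flusher.scheduleAtFixedRate(this::flush, 30, 30, TimeUnit.SECONDS);
        }

        @Override
        protected void append(LoggingEvent event) {
            synchronized (buffer) {
                buffer.add(event);
            }
        }

        private void flush() {
            List<LoggingEvent> batch;
            synchronized (buffer) {
                batch = new ArrayList<>(buffer);
                buffer.clear();
            }
            // Hypothetical helper: convert each event to a LogEntity and batch-insert
            // it into the Azure table with the storage SDK of your choice, e.g.
            // azureTableClient.insertBatch(toLogEntities(batch));
        }

        @Override
        public void close() {
            flush();
            flusher.shutdown();
        }

        @Override
        public boolean requiresLayout() {
            return false;
        }
    }

The original implementation buffers LogEntity objects directly and writes them with the Azure Table client; the structure is the same either way.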
Another alternative:
Can we not continue using log4j in the standard way (e.g. with a DailyRollingFileAppender), only with the file created on a UNC path on a VM (IaaS)?
This VM will only need a bit of disk space and need not have any great processing power, so one could share an available VM or create one with the most minimal configuration, preferably in the same region and cloud service.
The accumulated log files can be accessed via RDP, FTP, etc.
That way one will not incur the transaction cost, nor the cost of developing a special Log4j appender; it could turn out to be a cheaper alternative (a rough sketch of the setup follows below).
thanks
Jeevan
PS: I am referring more to one's application logging, not to the app-server logs (catalina/manager .log files or the .out files of WebLogic).
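As a rough illustration of this alternative, the standard appender can simply be pointed at the UNC path, e.g. programmatically (the share \\logvm\logs is a placeholder):

    import java.io.IOException;

    import org.apache.log4j.DailyRollingFileAppender;
    import org.apache.log4j.Logger;
    import org.apache.log4j.PatternLayout;

    public class UncPathLogging {
        public static void main(String[] args) throws IOException {
            // Placeholder UNC path on the small "log collector" VM.
            String logFile = "\\\\logvm\\logs\\app.log";

            DailyRollingFileAppender appender = new DailyRollingFileAppender(
                    new PatternLayout("%d{ISO8601} [%t] %-5p %c - %m%n"),
                    logFile,
                    "'.'yyyy-MM-dd");                 // roll the file daily

            Logger.getRootLogger().addAppender(appender);
            Logger.getLogger(UncPathLogging.class).info("Logging to the shared UNC path");
        }
    }

In practice each instance would typically write to its own file name on the share (for example including the instance ID in the path) so that concurrent instances do not contend for the same file.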
In our project we use Jackrabbit with Spring and Tomcat to manage PDF files.
Currently a MySQL database is being used to store blob files (in Jackrabbit terms, via the BundleDbPersistenceManager).
As the number of generated files grows, we have thought of using the file system instead of the database to boost performance and eliminate replication overhead.
In the spec, the Jackrabbit team recommends using BundleFsPersistenceManager instead, but with comments like this:
Not meant to be used in production environments (except for read-only uses)
Does anyone have experience using BundleFsPersistenceManager, and can anyone point to resources on a painless migration from blobs in a MySQL database to files in the filesystem?
Thank you very much in advance
Persistence in Jackrabbit is a bit complicated; it makes sense to read the configuration overview documentation first.
In Jackrabbit, binaries are stored in the data store by default, and not in the persistence manager. Even if you use the BundleDbPersistenceManager, large binary files are stored in the data store. You can combine the (default) FileDataStore with the BundleDbPersistenceManager.
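For reference, a typical repository.xml fragment for that combination looks roughly like the following (the path shown is the Jackrabbit default; binaries larger than minRecordLength bytes go to the data store files, smaller ones stay inline in the persistence manager):

    <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
        <!-- Directory for the binary files; ${rep.home} is the repository home -->
        <param name="path" value="${rep.home}/repository/datastore"/>
        <!-- Binaries smaller than this many bytes are stored inline instead -->
        <param name="minRecordLength" value="100"/>
    </DataStore>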
I would recommend not using the BundleFsPersistenceManager, because data can get corrupted quite easily if the program gets killed while writing.