How to store and manage large files - java

Did anybody know a java framework to store and handle large number of files? I am searching for an alternative to store my files into a SQL BLOB field.

Apache Hadoop claims to support large number of large files..

If you need meta data for the files as well, you might want to look at something like Jackrabbit.

Related

Simplest and fastest way to find rows in a large CSV

I have several CSV files and I need to load them and search for rows by column value.
Someone suggests to use OpenCSV project to load CSV. But I don't know if this is the best way.
Does OpenCSV provide some search/filter utility?
Is there a better way to do what I need?
You can load the data from your CSV files into your favourite SQL engine, like e.g. MySQL or SQLite, and use SQL to filter conveniently and fast. This is a common task so databases have ready to use tools for importing data from CSV files, this is how you can do it in SQLite: http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
If your CSV files are too big to keep in memory and you don't want to resort to storing everything in a database first (this would be a tedious disk to memory to disk operation) then there is another approach nobody seems to have mentioned: streaming.
The approach would consist of reading a number of rows from the file, processing them and then discarding the ones that don't match your search. You could do this with the Apache commons FileUtils for example. It could be some of the existing CSV API's offer this, I haven't checked that.
Use an embedded database, separating CSV from search functionality.
Something like Apache Commons CSV will simply give you a 2-dimensional string array of values. I doubt any solution will give you something more than this (given no type/schema info in a CVS file) and I suspect a well-crafted loop over these results is all you need. That'll be the simplest and fastest (as requested).
If you want to do more, you can run up the standard Java-provided JavaDb database in-JVM, load the results into that and perform SQL queries without an external datasource/service.
Note that memory may be a problem if you load a sizable CSV, but just how big are these ? Memory is very cheap these days.

Output logging in Java

I'm currently working on applying genetic algorithms to a particular application, and the issue is that there is a large amount of data that I need to analyze, graph and simply, tabulate. Upto this point I have been using csv files, but they have been kind of limited as I still have to generate charts manually, and its an issue when this needs to be done on over 100 documents.
Are there any other options for output logging in Java, for analysis other than CSV files? Any link to any API of any kind would also be useful.
P.S: (The question seems common enough to have been asked already, but I couldn't find it.) I'm not asking about how to log data in Java, or how to redirect it to a file, but if there are any existing ways to easily tabulate and graph large amounts of output.
The kind of data I'm working with involves a lot of numerical data, specifically the attributes of different generations and different organisms inside those generations. I'm trying to find and interpret trends within the numerical data which would mean that I need to generate separate graphs for different populations or test runs, and also find representative values for each file and graph those against specific test run conditions.
Also, there is a time parameter which references the speed of the algorithm. Which methods let me log output without letting the post-processing and disk access affect my test runs? Is it possible?
You could use Apache POI to write out an Excel spreadsheet directly. You can also have it start with a spreadsheet already containing macros and whatever else you need to display your information.
There are plenty of choices in exporting reports / data. There is an open source project call JasperReport that can do charts, PDF, XML, CSV, plain text exporting. But it is an involved process, but does offer Java API to accomplish that task.
How about writing table into a database, like MySQL?
Using such, you can search your data by better means than in a text file.
Have a look here: http://www.vogella.de/articles/MySQLJava/article.html
On linux, you also have the possibility to stay using your csv files and then generate plots using gnuplot scipts.
Sounds like you need to be saving the data to a database and then using Jasper Reports to build a report containing the graphs and whatever else you need using the data that you stored in the database instead of trying to use excel. Jasper is fairly easy to use and you can have your java application generate the report for you after the data has been stored in the database.

Ideal place to store Binary data that can be rendered by calling a url

I am looking for an ideal (performance effective and maintainable) place to store binary data. In my case these are images. I have to do some image processing,scale the images and store in a suitable place which can be accesses via a RESTful service.
From my research so far I have a few options, like:
NoSql solution like MongoDB,GridFS
Storing as files in a file system in a directory hierarchy and then using a web server to access the images by url
Apache Jackrabbit Document repository
Store in a cache something like Memcache,Squid Proxy
Any thoughts of which one you would pick and why would be useful or is there a better way to do it?
Just started using GridFS to do exactly what you described.
From my experience thus far, the main advantage to GridFS is that it obviates the need for a separate file storage system. Our entire persistency layer is already put into Mongo, and so the next logical step would be to store our filesystem there as well. The flat namespacing just rocks and allows you a rich query language to fetch your files based off whatever metadata you want to attach to them. In our app we used an 'appdata' object that embedded all the ownership information, ensure
Another thing to consider with NoSQL file storage, and especially GridFS, is that it will shard and expand along with your other data. If you've got your entire DB key-value store inside the mongo server, then eventually if you ever have to expand your server cluster with more machines, your filesystem will grow along with it.
It can feel a little 'black box' since the binary data itself is split into chunks, a prospect that frightens those used to a classic directory based filesystem. This is alleviated with the help of admin programs like RockMongo.
All in all to store images in GridFS is as easy as inserting the docs themselves, most of the drivers for all the major languages handle everything for you. In our environment we took image uploads at an endpoint and used PIL to perform resizing. The images were then fetched from mongo at another endpoint that just output the data and mimetyped it as a jpeg.
Best of luck!
EDIT:
To give you an example of a trivial file upload with GridFS, here's the simplest approach in PyMongo, the python library.
from pymongo import Connection
import gridfs
binary_data = 'Hello, world!'
db = Connection().test_db
fs = gridfs.GridFS(db)
#the filename kwarg sets the filename in the mongo doc, but you can pass anything in
#and make custom key-values too.
file_id = fs.put(binary_data, filename='helloworld.txt',anykey="foo")
output = fs.get(file_id).read()
print output
>>>Hello, world!
You can also query against your custom values if you like, which can be REALLY useful if you want your queries to be based off custom information relative to your application.
try:
file = fs.get_last_version({'anykey':'foo'})
return file.read()
catch gridfs.errors.NoFile:
return None
These are just some simple examples, and the drivers for alot of the other languages (PHP, Ruby etc.) all have cognates.
I would go for jackrabbit in combination with its REST framework sling http://sling.apache.org
Sling allows you to upload/download files via REST calls or webdav while the underlying jackrabbit repository gives you a performant storage with the possibility to store your files in a tree structure (or flat if you like).
Both jackrabbit and sling support an event mechanism where you can asynchronously process the image after upload to i.e. create thumbnails.
The manual at http://sling.apache.org/site/manipulating-content-the-slingpostservlet-servletspost.html describes how to manipulate data using the REST interface provided by sling.
Storing the images as blobs in an RDBMS in another option, and you immediately get some guarantees about integrity, security etc (if this is setup properly on the database), store extra metadata, manage the collection with SQL etc.

Alternative of Storing data except databases like mysql,sql etc

I had completed my project Address Book in Java core, in which my data is stored in database (MySql).
I am facing a problem that when i run my program on other computer than tere is the requirement of creating the hole data base again.
So please tell me any alternative for storing my data without using any database software like mysql, sql etc.
You can use an in-memory database such as HSQLDB, Derby (a.k.a JavaDB), H2, ..
All of those can run without any additional software installation and can be made to act like just another library.
I would suggest using an embeddable, lightweight database such as SQLite. Check it out.
From the features page (under the section Suggested Uses For SQLite):
Application File Format. Rather than
using fopen() to write XML or some
proprietary format into disk files
used by your application, use an
SQLite database instead. You'll avoid
having to write and troubleshoot a
parser, your data will be more easily
accessible and cross-platform, and
your updates will be transactional.
The whole point of StackOverflow was so that you would not have to email around questions/answers :)
You could store data in a filesystem, memory (use serialisation etc) which are simple alternatives to DB. You can even use HSQLDB which can be run completely in memory
If you data is not so big, you may use simple txt file and store everything in it. Then load it in memory. But this will lead to changing the way you modify/query data.
Database software like mysql, sql etc provides an abstraction in terms of implementation effort. If you wish to avoid using the same, you can think of having your own database like XML or flat files. XML is still a better choice as XML parsers or handlers are available. Putting your data in your customised database/flat files will not be manageable in the long run.
Why don't you explore sqlite? It is file based, means you don't need to install it separately and still you have the standard SQL to retrieve or interact with the data? I think, sqlite will be a better choice.
Just use a prevayler (.org). Faster and simpler than using a database.
I assume from your question that you want some form of persistent storage to the local file system of the machine your application runs on. In addition to that, you need to decide on how the data in your application is to be used, and the volume of it. Do you need a database? Are you going to be searching the data different fields? Do you need a query language? Is the data small enough to fit in to a simple data structure in memory? How resilient does it need to be? The answers to these types of questions will help lead to the correct choice of storage. It could be that all you need is a simple CSV file, XML or similar. There are a host of lightweight databases such as SQLite, Berkelely DB, JavaDB etc - but whether or not you need the power of a database is up to your requirements.
A store that I'm using a lot these days is Neo4j. It's a graph database and is not only easy to use but also is completely in Java and is embedded. I much prefer it to a SQL alternative.
In addition of the others answers about embedded databases I was working on a objects database that directly serialize java objects without the need for ORM. Its name is Sofof and I use it in my projects. It has many features which are described in its website page.

java embedded database w/ ability to store as one file

I need to create a storage file format for some simple data in a tabular format, was trying to use HDF5 but have just about given up due to some issues, and I'd like to reexamine the use of embedded databases to see if they are fast enough for my application.
Is there a reputable embedded Java database out there that has the option to store data in one file? The only one I'm aware of is SQLite (Java bindings available). I tried H2 and HSQLDB but out of the box they seem to create several files, and it is highly desirable for me to have a database in one file.
edit: reasonably fast performance is important. Object storage is not; for performance concerns I only need to store integers and BLOBs. (+ some strings but nothing performance critical)
edit 2: storage data efficiency is important for larger datasets, so XML is out.
Nitrite Database http://www.dizitart.org/nitrite-database.html
NOsql Object (NO2 a.k.a Nitrite) database is an open source nosql
embedded document store written in Java with MongoDB like API. It
supports both in-memory and single file based persistent store.
H2 uses only one file, if you use the latest H2 build with the PAGE_STORE option. It's a new feature, so it might not be solid.
If you only need read access then H2 is able to read the database files from a zip file.
Likewise if you don't need persistence it's possible to have an in-memory only version of H2.
If you need both read/write access and persistence, then you may be out of luck with standard SQL-type databases, as these pretty much all uniformly maintain the index and data files separately.
Once i used an object database that saved its data to a file. It has a Java and a .NET interface. You might want to check it out. It's called db4o.
Chronicle Map is an embedded pure Java database.
It stores data in one file, i. e.
ChronicleMap<Integer, String> map = ChronicleMap
.of(Integer.class, String.class)
.averageValue("my-value")
.entries(10_000)
.createPersistedTo(databaseFile);
Chronicle Map is mature (no severe storage bugs reported for months now, while it's in active use).
Idependent benchmarks show that Chronicle Map is the fastest and the most memory efficient key-value store for Java.
The major disadvantage for your use case is that Chronicle Map supports only a simple key-value model, however more complex solution could be build on top of it.
Disclaimer: I'm the developer of Chronicle Map.
If you are looking for a small and fast database to maybe ship with another program I would check Apache Derby I don't know how you would define embedded-database but I used this in some projects as a debugging database that can be checked in with the source and is available on every developer machine instantaneous.
This isn't an SQL engine, but If you use Prevayler with XStream, you can easily create a single XML file with all your data. (Prevayler calls it a snapshot file.)
Although it isn't SQL-based, and so requires a little elbow grease, its self-contained nature makes development (and especially good testing) much easier. Plus, it's incredibly fast and reliable.
You may want to check out jdbm - we use it on several projects, and it is quite fast. It does use 2 files (a database file and a log file) if you are using it for ACID type apps, but you can drop directly to direct database access (no log file) if you don't need solid ACID.
JDBM will easily support integers and blobs (anything you want), and is quite fast. It isn't really designed for concurrency, so you have to manage the locking yourself if you have multiple threads, but if you are looking for a simple, solid embedded database, it's a good option.
Since you mentioned sqlite, I assume that you don't mind a native db (as long as good java bindings are available). Firebird works well with java, and does single file storage by default.
Both H2 and HSQLDB would be excellent choices, if you didn't have the single file requirement.
I think for now I'm just going to continue to use HDF5 for the persistent data storage, in conjunction with H2 or some other database for in-memory indexing. I can't get SQLite to use BLOBs with the Java driver I have, and I can't get embedded Firebird up and running, and I don't trust H2 with PAGE_STORE yet.

Categories

Resources