I have a huge 20GB CSV file to copy into Hadoop/HDFS. Of course I need to manage any error cases (if the server or the transfer/load application crashes).
In such a case, I need to restart the processing (on another node or not) and continue the transfer without re-reading the CSV file from the beginning.
What is the best and easiest way to do that?
Using Flume? Sqoop? a native Java application? Spark?
Thanks a lot.
If the file is not hosted in HDFS, Flume won't be able to parallelize that file (same issue with Spark or other Hadoop-based frameworks). Can you mount your HDFS on NFS and then use a file copy?
One advantage of using Flume would be to read the file, publish each line as a separate record, and let Flume write one record to HDFS at a time; if something goes wrong you could resume from that record instead of starting from the beginning.
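If you go the native Java route instead, the resume logic boils down to remembering how far you got and skipping that much on restart. A minimal sketch, assuming the standard Hadoop FileSystem API and a cluster with append enabled (the class and method names are purely illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ResumableHdfsCopy {
        // Copies localCsv to hdfsTarget, starting at resumeOffset (0 for a fresh copy).
        static void copyFrom(File localCsv, Path hdfsTarget, long resumeOffset) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            try (RandomAccessFile in = new RandomAccessFile(localCsv, "r");
                 FSDataOutputStream out = resumeOffset == 0
                         ? fs.create(hdfsTarget)        // fresh transfer
                         : fs.append(hdfsTarget)) {     // continue a partial transfer
                in.seek(resumeOffset);                  // skip the bytes already written
                byte[] buf = new byte[1 << 20];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                    // periodically persist (resumeOffset + bytes written so far) to a
                    // checkpoint file, so a crashed transfer can restart from there
                }
            }
        }
    }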
I know that it is possible to distribute jobs over a Hadoop cluster. I also know that it is possible to read and write semi-directly to SQL databases from within a Hadoop system.
My question is more along these lines: is it done, in the real world, that data is read from files and a relational database within Hadoop jobs and then, after processing, written back to the relational database? That is, using Hadoop directly as a process balancer, with something like Hibernate and without the use of HDFS.
Thanks
This is not possible, because you won't have access to the records in the setup and cleanup tasks of the mapper and reducer.
Outside of HDFS, the only way to run the jobs is to read input from and write output to the local file system.
In our project we use Jackrabbit with Spring and Tomcat to manage PDF files.
Currently a MySQL database is being used to store the files as BLOBs (in Jackrabbit terms, via the BundleDbPersistenceManager).
As the number of generated files grows, we are thinking of using the file system instead of the database to boost performance and to eliminate the replication overhead.
In the spec, the Jackrabbit team recommends using BundleFsPersistenceManager instead, but with comments like this:
Not meant to be used in production environments (except for read-only uses)
Does anyone have any experience using BundleFsPersistenceManager, and can you reference any resources on a painless migration from BLOBs in a MySQL database to files in the filesystem?
Thank you very much in advance
Persistence in Jackrabbit is a bit complicated, it makes sense to read the configuration overview documentation first.
In Jackrabbit, binaries are stored in the data store by default, and not in the persistence manager. Even if you use the BundleDbPersistenceManager, large binary files are stored in the data store. You can combine the (default) FileDataStore with the BundleDbPersistenceManager.
I would recommend not using the BundleFsPersistenceManager, because data can get corrupted quite easily if the process is killed while writing.
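For what it's worth, binaries reach the data store through the normal JCR API, so the choice of persistence manager doesn't change how you write them. A rough sketch with plain JCR 2.0 calls (the path and file name are placeholders):

    import javax.jcr.Binary;
    import javax.jcr.Node;
    import javax.jcr.Session;
    import java.io.FileInputStream;
    import java.io.InputStream;

    public class PdfStoreExample {
        // session must be an already-authenticated JCR session into the repository
        static void storePdf(Session session) throws Exception {
            try (InputStream in = new FileInputStream("/tmp/report.pdf")) {
                Node file = session.getRootNode().addNode("report.pdf", "nt:file");
                Node content = file.addNode("jcr:content", "nt:resource");
                content.setProperty("jcr:mimeType", "application/pdf");
                Binary binary = session.getValueFactory().createBinary(in);
                content.setProperty("jcr:data", binary); // large binaries go to the DataStore, not the PM
                session.save();
            }
        }
    }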
I need to know whether there is any way to import data from MySQL to HDFS; there are some conditions I need to mention.
I know HBase, Hive and Sqoop can help me, but I don't want any extra layers. Just MapReduce and the Hadoop Java API.
I also need to update HDFS as the data is updated in MySQL.
I need to know the best way to import MySQL data into HDFS and update it in real time.
Why don't you want to use Sqoop? It does what you would have to do anyway (open a JDBC connection, get the data, write it to Hadoop); see this presentation from Hadoop World '09.
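For comparison, the hand-rolled version would look roughly like the sketch below; the connection URL, table and query are placeholders, and there is no parallelism, retry or incremental logic here, which is exactly what Sqoop adds for free:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcToHdfs {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                         "jdbc:mysql://dbhost/mydb", "user", "password");           // placeholder URL
                 Statement stmt = db.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, name FROM my_table"); // placeholder query
                 FileSystem fs = FileSystem.get(new Configuration());
                 FSDataOutputStream out = fs.create(new Path("/data/my_table.csv"))) {
                while (rs.next()) {
                    // stream each row out as a CSV line
                    String line = rs.getLong("id") + "," + rs.getString("name") + "\n";
                    out.write(line.getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }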
You can do real-time import using CDC and Talend.
http://www.talend.com/talend-big-data-sandbox
Yes, you can access the database and HDFS via JDBC connectors and the Hadoop Java API.
But in MapReduce, things will be out of your control when accessing a database:
Each mapper/reducer tries to establish a separate connection to the database, which eventually impacts database performance.
There won't be any clue as to which mapper/reducer executes which portion of the query result set.
If there is only a single mapper/reducer accessing the database, then Hadoop's parallelism is lost.
A fault-tolerance mechanism has to be implemented in case any mapper/reducer fails.
The list goes on...
To overcome all these hurdles, Sqoop was developed to transfer data between RDBMSs and HDFS.
I need to create a storage file format for some simple data in a tabular format. I was trying to use HDF5 but have just about given up due to some issues, and I'd like to re-examine the use of embedded databases to see whether they are fast enough for my application.
Is there a reputable embedded Java database out there that has the option to store data in one file? The only one I'm aware of is SQLite (Java bindings available). I tried H2 and HSQLDB but out of the box they seem to create several files, and it is highly desirable for me to have a database in one file.
Edit: reasonably fast performance is important. Object storage is not; for performance purposes I only need to store integers and BLOBs (plus some strings, but nothing performance-critical).
Edit 2: data storage efficiency is important for larger datasets, so XML is out.
Nitrite Database http://www.dizitart.org/nitrite-database.html
NOsql Object (NO2, a.k.a. Nitrite) database is an open source NoSQL embedded document store written in Java with a MongoDB-like API. It supports both in-memory and single-file based persistent stores.
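A minimal sketch of opening a single-file Nitrite store, assuming the Nitrite 3.x builder API (the path and collection name are placeholders):

    import org.dizitart.no2.Document;
    import org.dizitart.no2.Nitrite;
    import org.dizitart.no2.NitriteCollection;

    public class NitriteExample {
        public static void main(String[] args) {
            // everything is persisted into the single file /tmp/test.db
            Nitrite db = Nitrite.builder()
                    .filePath("/tmp/test.db")
                    .openOrCreate();
            NitriteCollection collection = db.getCollection("test");
            collection.insert(Document.createDocument("key", "value"));
            db.close();
        }
    }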
H2 uses only one file, if you use the latest H2 build with the PAGE_STORE option. It's a new feature, so it might not be solid.
If you only need read access then H2 is able to read the database files from a zip file.
Likewise if you don't need persistence it's possible to have an in-memory only version of H2.
If you need both read/write access and persistence, then you may be out of luck with standard SQL-type databases, as these pretty much all uniformly maintain the index and data files separately.
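For reference, H2 selects those modes through the JDBC URL; a sketch assuming the 1.1/1.2-era syntax (the paths are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class H2Modes {
        public static void main(String[] args) throws Exception {
            Class.forName("org.h2.Driver");
            // single-file page store (experimental at the time)
            Connection file = DriverManager.getConnection("jdbc:h2:~/data;PAGE_STORE=TRUE", "sa", "");
            // read-only access to a database packed inside a zip archive
            Connection zipped = DriverManager.getConnection("jdbc:h2:zip:~/data.zip!/data", "sa", "");
            // purely in-memory, nothing is persisted
            Connection mem = DriverManager.getConnection("jdbc:h2:mem:test", "sa", "");
            file.close(); zipped.close(); mem.close();
        }
    }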
Once I used an object database that saved its data to a single file. It has both a Java and a .NET interface. You might want to check it out: it's called db4o.
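A minimal sketch of the db4o embedded API, assuming db4o 7.4+ (the Note class is just an illustrative persistent type):

    import com.db4o.Db4oEmbedded;
    import com.db4o.ObjectContainer;
    import com.db4o.ObjectSet;

    public class Db4oExample {
        static class Note { String text; Note(String text) { this.text = text; } }

        public static void main(String[] args) {
            ObjectContainer db = Db4oEmbedded.openFile(Db4oEmbedded.newConfiguration(), "data.db4o");
            try {
                db.store(new Note("hello"));                  // persist a plain object
                ObjectSet<Note> notes = db.query(Note.class); // query by type
                System.out.println(notes.size());
            } finally {
                db.close();
            }
        }
    }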
Chronicle Map is an embedded pure Java database.
It stores data in one file, i.e.:
import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;

File databaseFile = new File("chronicle-map.dat"); // the single data file on disk
ChronicleMap<Integer, String> map = ChronicleMap
    .of(Integer.class, String.class)
    .averageValue("my-value")        // sample value used to size variable-length values
    .entries(10_000)                 // expected number of entries
    .createPersistedTo(databaseFile);
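After that it behaves like an ordinary ConcurrentMap backed by the file, for example:

    map.put(1, "hello");
    String value = map.get(1); // still readable after a restart, from the same file
    map.close();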
Chronicle Map is mature (no severe storage bugs reported for months now, while it's in active use).
Independent benchmarks show that Chronicle Map is the fastest and the most memory-efficient key-value store for Java.
The major disadvantage for your use case is that Chronicle Map supports only a simple key-value model; however, a more complex solution could be built on top of it.
Disclaimer: I'm the developer of Chronicle Map.
If you are looking for a small and fast database to ship with another program, I would check out Apache Derby. I don't know how you would define an embedded database, but I have used it in some projects as a debugging database that can be checked in with the source and is available on every developer machine instantly.
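A minimal sketch of opening an embedded Derby database (note that Derby stores a database as a directory rather than a single file, so it may not meet the one-file requirement):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DerbyExample {
        public static void main(String[] args) throws Exception {
            // creates the ./devdb directory on first run; no server process involved
            Connection conn = DriverManager.getConnection("jdbc:derby:devdb;create=true");
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("CREATE TABLE t (id INT PRIMARY KEY, payload BLOB)");
            }
            conn.close();
        }
    }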
This isn't an SQL engine, but if you use Prevayler with XStream, you can easily create a single XML file with all your data. (Prevayler calls it a snapshot file.)
Although it isn't SQL-based, and so requires a little elbow grease, its self-contained nature makes development (and especially good testing) much easier. Plus, it's incredibly fast and reliable.
You may want to check out JDBM - we use it on several projects, and it is quite fast. It does use two files (a database file and a log file) if you are using it for ACID-type apps, but you can drop down to direct database access (no log file) if you don't need solid ACID.
JDBM will easily support integers and blobs (anything you want), and is quite fast. It isn't really designed for concurrency, so you have to manage the locking yourself if you have multiple threads, but if you are looking for a simple, solid embedded database, it's a good option.
Since you mentioned SQLite, I assume that you don't mind a native DB (as long as good Java bindings are available). Firebird works well with Java, and does single-file storage by default.
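With the Jaybird JDBC driver, the embedded engine is selected via the connection URL; a sketch assuming an existing .fdb file and the default SYSDBA credentials:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class FirebirdEmbedded {
        public static void main(String[] args) throws Exception {
            Class.forName("org.firebirdsql.jdbc.FBDriver");
            // single .fdb file, opened in-process through the embedded Firebird engine
            Connection conn = DriverManager.getConnection(
                    "jdbc:firebirdsql:embedded:/path/to/data.fdb", "SYSDBA", "masterkey");
            conn.close();
        }
    }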
Both H2 and HSQLDB would be excellent choices, if you didn't have the single file requirement.
I think for now I'm just going to continue to use HDF5 for the persistent data storage, in conjunction with H2 or some other database for in-memory indexing. I can't get SQLite to use BLOBs with the Java driver I have, and I can't get embedded Firebird up and running, and I don't trust H2 with PAGE_STORE yet.