Best way to import 20GB CSV file to Hadoop - java

I have a huge 20GB CSV file to copy into Hadoop/HDFS. Of course I need to handle error cases (if the server or the transfer/load application crashes).
In such a case, I need to restart the processing (on another node or not) and continue the transfer without starting the CSV file from the beginning.
What is the best and easiest way to do that?
Using Flume? Sqoop? a native Java application? Spark?
Thanks a lot.

If the file is not hosted in HDFS, Flume won't be able to parallelize reading it (the same issue applies to Spark and other Hadoop-based frameworks). Can you mount your HDFS over NFS and then use a file copy?
One advantage of reading with Flume would be to read the file, publish each line as a separate record, and let Flume write one record at a time to HDFS; if something goes wrong you can restart from that record instead of from the beginning.
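If you end up writing your own loader in Java, a minimal sketch of a resumable copy using the HDFS FileSystem API is shown below; the NameNode URI and the paths are placeholders, and it assumes append is enabled on the cluster and that nothing else writes to the target file while the copy runs.

import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ResumableHdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path target = new Path("/data/huge.csv");
        // Whatever made it to HDFS before the crash does not need to be sent again.
        long alreadyCopied = fs.exists(target) ? fs.getFileStatus(target).getLen() : 0L;

        try (InputStream local = new FileInputStream("/local/huge.csv");
             FSDataOutputStream out = alreadyCopied > 0 ? fs.append(target) : fs.create(target)) {
            // Skip the bytes that are already in HDFS.
            long skipped = 0;
            while (skipped < alreadyCopied) {
                long n = local.skip(alreadyCopied - skipped);
                if (n <= 0) break;
                skipped += n;
            }
            // Stream the remainder with a 64 KB buffer; streams are closed by try-with-resources.
            IOUtils.copyBytes(local, out, 64 * 1024, false);
        }
        fs.close();
    }
}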

Related

Design a Spring Batch application to read data from different resources (flat files)

I am developing a batch application using Spring Boot, Java, and Spring Batch, and I need to read data from different locations. Below is my use case:
There are multiple paths, such as C://Temp//M1 and C://Temp//M2, and both locations can contain identical files with the same data, e.g. C://Temp//M1//File1.txt and C://Temp//M2//File1.txt, or C://Temp//M1//File2.txt and C://Temp//M2//File2.txt.
Before the batch starts, if an identical file exists at both locations, I need to merge the two copies in memory, remove the duplicates, and pass the merged in-memory data to the reader.
I have designed the batch using MultiResourceItemReader, which reads the flat files and processes them, but I am not able to achieve the in-memory merging and duplicate removal across multiple files.
Could you please have a look and suggest how I can achieve this?
In my experience the BeanIO library is invaluable when dealing with flat files, and it integrates with Spring Batch.
http://beanio.org/
With regard to reading from the two locations, you can:
- Implement your reader as a composite that reads a line from file 1 and then from file 2
- Read file 1 through the reader and enrich it with the data from file 2 inside the processor
- Pre-merge the files (see the sketch below)
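For the pre-merge option, a rough sketch is shown here: it merges both copies in memory, drops duplicate lines, and hands the result to a FlatFileItemReader as a ByteArrayResource, so nothing extra is written to disk. The hard-coded paths and the assumption that duplicates can be detected at the line level are mine, not part of the question.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.core.io.ByteArrayResource;

public class MergedReaderFactory {

    // Builds a reader over the in-memory merge of the two copies of a file.
    public static FlatFileItemReader<String> mergedReader() throws Exception {
        Set<String> lines = new LinkedHashSet<>();   // keeps insertion order, drops duplicate lines
        lines.addAll(Files.readAllLines(Paths.get("C:/Temp/M1/File1.txt")));
        lines.addAll(Files.readAllLines(Paths.get("C:/Temp/M2/File1.txt")));

        byte[] merged = String.join(System.lineSeparator(), lines)
                              .getBytes(StandardCharsets.UTF_8);

        FlatFileItemReader<String> reader = new FlatFileItemReader<>();
        reader.setResource(new ByteArrayResource(merged)); // the reader never touches the disk copies
        reader.setLineMapper(new PassThroughLineMapper());  // each merged line becomes one item
        return reader;
    }
}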
If you are familiar with Kafka, try the Kafka Connect framework, and use the Confluent platform to easily use their connectors.
Then consume from Kafka in your Spring application.
https://www.confluent.io/hub
If you are interested in Kafka, I'll explain in more detail.

Would it be possible to use Hadoop for automatic process balancing without using HDFS

I know that it is possible to distribute jobs over a Hadoop cluster. I also know that it is possible to read from and write semi-directly to SQL databases from within a Hadoop system.
My question is rather: is it done, in the real world, that data is read from files and a relational database within Hadoop jobs and then, after processing, written back to the relational database? In other words, using Hadoop directly as a process balancer, with something like Hibernate and without the use of HDFS.
Thanks
This is not possible, because you won't have access to the records in the setup and cleanup tasks of the mapper and reducer.
Outside of HDFS, the only way to execute the jobs is to read input from and write output to the local file system.

HBASE : Bulk load (Is my understanding correct)

Bulk load usually uses MapReduce to create a file on HDFS, and this file is then associated with a region.
If that's the case, can my client create this file (locally) and put it on HDFS? Since we already know what the keys and values are, we could do it locally without loading the server.
Can someone point to an example of how an HFile can be created (any language will be fine)?
Regards
Nothing actually stops anyone from preparing HFiles 'by hand', but by doing so you start to depend on HFile compatibility issues. According to this (https://hbase.apache.org/book/arch.bulk.load.html) you just need to put your files on HDFS ('closer' to HBase) and call completebulkload.
Proposed strategy:
- Check the HFileOutputFormat2.java file from the HBase sources. It is a standard MapReduce OutputFormat. What you need as a basis is just a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Check its writer logic for this; that is the only part you need.
- You also need to build an efficient solution for turning a stream of Puts into the KeyValues that go into the HFile. The first places to look are TotalOrderPartitioner and PutSortReducer.
If you do all these steps you will have a solution that can take a sequence of Puts (it is no problem to generate them from any data) and produce a local HFile as a result. It should take up to a week to get something working reasonably well.
I didn't go this way because, already having a good InputFormat and a data-transforming mapper (which I wrote long ago), I can now use the standard TotalOrderPartitioner and HFileOutputFormat2 INSIDE the MapReduce framework and have everything work using the full cluster power. Confused by a 10 GB SQL dump loaded in 5 minutes? Not me. You can't beat such speed using a single server.
This solution did require careful design of the SQL requests used for the ETL process from the SQL database, but now it's an everyday procedure.
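For reference, a hedged sketch of that MapReduce route is shown below: a mapper turns CSV lines into Puts, HFileOutputFormat2.configureIncrementalLoad wires in PutSortReducer and TotalOrderPartitioner, and the generated HFiles are then handed to the regions (the equivalent of completebulkload). The table name, column family, paths and CSV layout are invented for the example, and class locations and deprecations vary between HBase versions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // Assumed CSV layout: rowkey,value
    public static class CsvToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",", -1);
            if (cols.length < 2) return;                       // skip malformed lines
            byte[] rowKey = Bytes.toBytes(cols[0]);
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(cols[1]));
            ctx.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("my_table");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName);
             Admin admin = conn.getAdmin()) {

            Job job = Job.getInstance(conf, "prepare-hfiles");
            job.setJarByClass(BulkLoadDriver.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(CsvToPutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            FileInputFormat.addInputPath(job, new Path("/data/input.csv"));
            Path hfileDir = new Path("/data/hfiles");
            FileOutputFormat.setOutputPath(job, hfileDir);

            // Sets PutSortReducer, TotalOrderPartitioner and HFileOutputFormat2
            // so the HFiles come out sorted and split per region.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            if (!job.waitForCompletion(true)) System.exit(1);

            // Equivalent of the completebulkload tool: moves the HFiles into the regions.
            new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
        }
    }
}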

Hadoop with MongoDB Concept

Hi, I am new to Hadoop and NoSQL technologies. I started learning with the word-count program, reading a file stored in HDFS and processing it. Now I want to use Hadoop with MongoDB. I started from the program here.
My confusion is that it stores the MongoDB data on my local file system, reads the data from the local file system into HDFS in map/reduce, and then writes it back to MongoDB on the local file system. When I studied HBase, it could be configured to store its data on HDFS, and Hadoop could process it directly there (map/reduce). How do I configure MongoDB to store its data on HDFS?
I think it is a better approach to store the data in HDFS for fast processing, not on the local file system. Am I right? Please clear up my understanding if I am going in the wrong direction.
MongoDB isn't built to work on top of HDFS, and it's not really necessary, since Mongo already has its own approach to scaling horizontally and working with data stored across multiple machines.
A better approach, if you need to work with MongoDB and Hadoop, is to use MongoDB as the source of your data but process everything in Hadoop (which will use HDFS for any temporary storage). Once you're done processing the data you can write it back to MongoDB, S3, or wherever you want.
I wrote a blog post that goes into a little more detail about how you can work with Mongo and Hadoop here: http://blog.mortardata.com/post/43080668046/mongodb-hadoop-why-how
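As a rough illustration of that pattern (MongoDB as the source, processing and output in Hadoop), a minimal job using the mongo-hadoop connector might look like the sketch below. The collection URI, the "status" field and the output path are invented for the example, and package and property names may differ between connector versions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;

public class MongoSourceJob {

    // Counts documents per value of a hypothetical "status" field.
    public static class StatusMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, BSONObject doc, Context ctx)
                throws IOException, InterruptedException {
            Object status = doc.get("status");
            if (status != null) {
                ctx.write(new Text(status.toString()), ONE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // MongoDB is only the source; results land in HDFS (or S3, or back in Mongo).
        conf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.events");

        Job job = Job.getInstance(conf, "mongo-source-example");
        job.setJarByClass(MongoSourceJob.class);
        job.setInputFormatClass(MongoInputFormat.class);   // splits come straight from the collection
        job.setMapperClass(StatusMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/mongo-job-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}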
HDFS is a distributed file system, while HBase is a NoSQL database that uses HDFS as its file system to provide a fast and efficient integration with Hadoop that has been proven to work at scale. Being able to work with HBase data directly in Hadoop, or to push it into HDFS, is one of the big advantages of picking HBase as a NoSQL database solution; I don't believe MongoDB provides such tight integration with Hadoop and HDFS, which would mitigate the performance and efficiency concerns of moving data to and from a database.
Please look at this blog post for a detailed analysis of how well MongoDB integrates with Hadoop; one of the conclusions was that writes to HDFS from MongoDB didn't perform well: http://www.ikanow.com/how-well-does-mongodb-integrate-with-hadoop/

Moving data from many machines to a server using Hazelcast

We need to move database table rows, represented as text files and coming from several machines, to a single machine. Our current solution is file-based:
- Zip the files, then send them over the wire
- The server receives the zip files from those machines and unzips them into the corresponding folders.
There are lots of other file-moving operations happening in between, and the whole process is quite error-prone.
I'm thinking of using Hazelcast to move each "row" String to the server. Is Hazelcast up to this kind of job?
The text files are generated by many machines at a rate of 200K to 300K per day, and these files must be sent to the server. So I want to migrate this to Hazelcast.
You can do this with Hazelcast, but it is the wrong use case for it. Hazelcast synchronizes in all directions: if you add an entry on client1, it will be transferred to the server but also to client2. Even if this doesn't scare you, it shows that Hazelcast is being misused here.
You would be better off implementing a simple web service on the server to which the clients push the "rows".
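A bare-bones sketch of such a web service, using only the JDK's built-in com.sun.net.httpserver, is shown here; the port, the /rows path and the append-to-a-file sink are placeholders, and in a Spring shop a small @RestController would do the same job. Clients POST each row (or a small batch of rows) and the server appends it to one file.

import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import com.sun.net.httpserver.HttpServer;

public class RowCollectorServer {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/rows", exchange -> {
            try (InputStream body = exchange.getRequestBody()) {
                // Append the received row(s) to a single server-side file.
                Files.write(Paths.get("/data/incoming-rows.txt"), body.readAllBytes(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
            exchange.sendResponseHeaders(204, -1);   // 204 No Content, empty response body
            exchange.close();
        });
        server.start();
        System.out.println("Listening on http://localhost:8080/rows");
    }
}

A client can then push a row with a plain HTTP POST, for example: curl -X POST --data 'some,row,values' http://server:8080/rows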
