Hi, I am quite new to Hadoop and I am trying to import a CSV table into HBase using MapReduce.
I am using Hadoop 1.2.1 and HBase 1.1.1.
I have data in the following format:
Wban Number, YearMonthDay, Time, Hourly Precip
03011,20060301,0050,0
03011,20060301,0150,0
I have written the following code for the bulk load:
public class BulkLoadDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
    }

    public static enum COUNTER_TEST {FILE_FOUND, FILE_NOT_FOUND};

    public String tableName = "hpd_table"; // name of the table to be inserted in HBase

    @Override
    public int run(String[] args) throws Exception {
        //Configuration conf = this.getConf();
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "BulkLoad");
        job.setJarByClass(getClass());
        job.setMapperClass(bulkMapper.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);
        TableMapReduceUtil.initTableReducerJob(tableName, null, job); // for HBase table
        job.setNumReduceTasks(0);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    private static class bulkMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        //static class bulkMapper extends TableMapper<ImmutableBytesWritable, Put> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] val = value.toString().split(",");
            // store the split values in the bytes format so that they can be added to the PUT object
            byte[] wban = Bytes.toBytes(val[0]);
            byte[] ymd = Bytes.toBytes(val[1]);
            byte[] tym = Bytes.toBytes(val[2]);
            byte[] hPrec = Bytes.toBytes(val[3]);
            Put put = new Put(wban);
            put.add(ymd, tym, hPrec);
            System.out.println(wban);
            context.write(new ImmutableBytesWritable(wban), put);
            context.getCounter(COUNTER_TEST.FILE_FOUND).increment(1);
        }
    }
}
I have created a jar for this and ran the following in the terminal:
hadoop jar ~/hadoop-1.2.1/MRData/bulkLoad.jar bulkLoad.BulkLoadDriver /MR/input/200603hpd.txt hpd_table
But the output that I get is hundreds of lines of the following type:
attempt_201509012322_0001_m_000000_0: [B@2d22bfc8
attempt_201509012322_0001_m_000000_0: [B@445cfa9e
I am not sure what they mean or how to perform this bulk upload. Please help.
Thanks in advance.
There are several ways to import data into HBase. Please have a look at the following link:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_import.html
HBase BulkLoad:
Data file in CSV format
Process your data into HFile format. See http://hbase.apache.org/book/hfile_format.html for details about the HFile format. Usually you use a MapReduce job for the conversion, and you often need to write the Mapper yourself because your data is unique. The job must emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; configure it using HFileOutputFormat.configureIncrementalLoad(), which does the following:
Inspects the table to configure a total order partitioner
Uploads the partitions file to the cluster and adds it to the DistributedCache
Sets the number of reduce tasks to match the current number of regions
Sets the output key/value class to match HFileOutputFormat requirements
Sets the Reducer to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
One HFile is created per region in the output folder. Input data is almost completely re-written, so you need available disk space at least twice the size of the original data set. For example, for a 100 GB output from mysqldump, you should have at least 200 GB of available disk space in HDFS. You can delete the original input file at the end of the process.
Load the files into HBase. Use the LoadIncrementalHFiles command (more commonly known as the completebulkload tool), passing it a URL that locates the files in HDFS. Each file is loaded into the relevant region on the RegionServer for the region. You can limit the number of versions that are loaded by passing the --versions=N option, where N is the maximum number of versions to include, from newest to oldest (largest timestamp to smallest timestamp).
If a region was split after the files were created, the tool automatically splits the HFile according to the new boundaries. This process is inefficient, so if your table is being written to by other processes, you should load as soon as the transform step is done.
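To make the two steps above concrete, here is a rough, untested driver sketch of one way to wire this up with the HBase 1.x client API. The table name and input path are taken from the question; the HFile output directory /MR/hfiles, the surrounding run() method, and the mapper class are assumptions for illustration only.

    // Step 1: map-only conversion job that emits (ImmutableBytesWritable, Put) pairs.
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hpd-bulkload");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(bulkMapper.class);                     // your own mapper
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("/MR/input/200603hpd.txt"));
    Path hfileDir = new Path("/MR/hfiles");                   // placeholder directory for the generated HFiles
    FileOutputFormat.setOutputPath(job, hfileDir);

    try (Connection connection = ConnectionFactory.createConnection(conf)) {
        TableName tableName = TableName.valueOf("hpd_table");
        Table table = connection.getTable(tableName);
        RegionLocator regionLocator = connection.getRegionLocator(tableName);

        // Wires up the reducer, the total order partitioner and HFileOutputFormat2.
        HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);

        if (!job.waitForCompletion(true)) {
            return 1;
        }

        // Step 2: move the generated HFiles into the table's regions.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(hfileDir, connection.getAdmin(), table, regionLocator);
    }
    return 0;

Alternatively, instead of calling LoadIncrementalHFiles programmatically, you can run the completebulkload tool from the command line against the HFile directory once the job has finished.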
Related
I have a Dataflow pipeline in which I'm parsing a file; if I find any incorrect records, I write them to a GCS bucket. But when there are no errors in the input file, TextIO still writes an empty file (containing only the header) to the GCS bucket.
So how can we prevent this? If the PCollection size is zero, can we skip this step?
errorRecords.apply("WritingErrorRecords", TextIO.write().to(options.getBucketPath())
    .withHeader("ID|ERROR_CODE|ERROR_MESSAGE")
    .withoutSharding()
    .withSuffix(".txt")
    .withShardNameTemplate("-SSS")
    .withNumShards(1));
TextIO.write() always writes at least one shard, even if it is empty. As you are writing to a single shard anyway, you could get around this behavior by doing the write manually in a DoFn that takes the to-be-written elements as a side input, e.g.
PCollectionView<List<String>> errorRecordsView = errorRecords.apply(
    View.<String>asList());

// Your "main" PCollection is a PCollection with a single input,
// so the DoFn will get invoked exactly once.
p.apply(Create.of(new String[]{"whatever"}))
    // The side input is your error records.
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(
                @Element String unused,
                OutputReceiver<String> out,
                ProcessContext c) {
            List<String> errors = c.sideInput(errorRecordsView);
            if (!errors.isEmpty()) {
                // Open the file manually and write all the errors to it.
            }
        }
    }).withSideInputs(errorRecordsView));
Being able to do so with the native Beam writes is a reasonable request. At the time of writing this was not supported directly; newer Beam releases add a skipIfEmpty setting on the write for exactly this case.
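If your SDK version exposes it, the usage would look roughly like the sketch below. This assumes skipIfEmpty() is available on TextIO.write() in your Beam release; check the SDK docs for your version.

    // Sketch: same write as in the question, but skipping the output entirely
    // when the PCollection is empty. skipIfEmpty() only exists in newer Beam SDKs.
    errorRecords.apply("WritingErrorRecords", TextIO.write()
        .to(options.getBucketPath())
        .withHeader("ID|ERROR_CODE|ERROR_MESSAGE")
        .withoutSharding()
        .withSuffix(".txt")
        .skipIfEmpty());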
I have a file ("small file") with 10,000 rows of Key,Value pairs.
Different keys in the small file can have the same value.
I have to do a word count on a different file ("big file"),
but I need to replace the key from the "big file" with the value from the "small file" in the Mapper.
Only after that should it be counted in the Reducer.
I would like to achieve this using a single MapReduce job WITHOUT using Pig/Hive.
Could you help me and guide me how to do it?
The small file will be on HDFS and I'm not sure how the other nodes would be able to read from it. I don't think it's even recommended, because the node with the small file would have to work really hard sending data to each map task.
You could do a map-side join and then count the results on the reduce side. Place your small file in the distributed cache so that your data is available to all the nodes. In your mapper, store all the key,value pairs from the small file in a Java HashMap in the setup method, stream the big file through, and do the join in the map method. This will yield something like this:
Small file (K,V)
Big file (K1,V1)
Mapper output (V(key),V1(value))
Then do a count in the reducer based on V (or interchange the key/value pair in the map output to achieve your need).
How to read from a distributed cache:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Path[] filelist = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    for (Path findlist : filelist) {
        if (findlist.getName().toString().trim().equals("mapmainfile.dat")) {
            fetchvalue(findlist, context);
        }
    }
}

public void fetchvalue(Path realfile, Context context) throws NumberFormatException, IOException {
    BufferedReader buff = new BufferedReader(new FileReader(realfile.toString()));
    //some operations with the file
}
How to add a file to distributed cache:
DistributedCache.addCacheFile(new URI("/user/hduser/test/mapmainfile.dat"),conf);
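Putting the pieces together, a rough sketch of the mapper side of this join is shown below (new API). The file name mapmainfile.dat follows the snippet above; the comma delimiter and the word-count output types are assumptions.

    // Sketch: load the small file into a HashMap in setup(), then join in map().
    public class ReplaceKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Map<String, String> lookup = new HashMap<String, String>();
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            for (Path file : files) {
                if (file.getName().trim().equals("mapmainfile.dat")) {
                    BufferedReader reader = new BufferedReader(new FileReader(file.toString()));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] kv = line.split(",");
                        lookup.put(kv[0], kv[1]);   // small-file key -> replacement value
                    }
                    reader.close();
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] kv = value.toString().split(",");
            String replaced = lookup.get(kv[0]);    // join: swap the big-file key for the small-file value
            if (replaced != null) {
                context.write(new Text(replaced), one); // the reducer then sums the counts per value
            }
        }
    }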
I am working on a batch application using Apache Spark. I want to write the final RDD as a text file; currently I am using the saveAsTextFile("filePath") method available on RDD.
My text file contains fields delimited with the \u0001 delimiter, so in the model class's toString() method I added all the fields separated by the \u0001 delimiter.
Is this the correct way to handle this, or is there a better approach?
Also, what if I iterate over the RDD and write the file content using the FileWriter class available in Java?
Please advise on this.
Regards,
Shankar
To write a single file there are a few options. If you're writing to HDFS or a similar distributed store, you can first coalesce your RDD down to a single partition (note that your data must then fit on a single worker), or you can collect the data to the driver and then use a FileWriter.
public static boolean copyMerge(JavaSparkContext sc, JavaRDD<String> rdd, String dstPath) throws IOException, URISyntaxException {
    Configuration hadoopConfig = sc.hadoopConfiguration();
    hadoopConfig.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConfig.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String tempFolder = "s3://bucket/folder";
    rdd.saveAsTextFile(tempFolder);
    FileSystem hdfs = FileSystem.get(new URI(tempFolder), hadoopConfig);
    return FileUtil.copyMerge(hdfs, new Path(tempFolder), hdfs, new Path(dstPath), false, hadoopConfig, null);
}
This solution is for S3 or any HDFS system. Achieved in two steps:
Save the RDD by saveAsTextFile, this generates multiple files in the folder.
Run Hadoop "copyMerge".
Instead of doing collect and bringing everything to the driver, I would rather suggest using coalesce, which is better for avoiding memory problems.
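For example, a minimal sketch of the coalesce approach (the output path is a placeholder):

    // Coalesce to one partition so Spark writes a single part file.
    // Note: all the data must fit on the single executor that writes it.
    rdd.coalesce(1).saveAsTextFile("hdfs:///output/final");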
First question here... and learning Hadoop...
I've spent the last 2 weeks trying to understand everything about Hadoop, but it seems every hill has a mountain behind it.
Here's the setup:
Lots (1 million) of small (<50MB) XML files (documents formatted as XML).
Each file is a single record.
Pseudo-distributed Hadoop cluster (1.1.2)
Using the old mapred API (can change, if the new API supports what's needed)
I have found XmlInputFormat ("Mahout XMLInputFormat") as a good starting point for reading files, as I can specify the entire XML document as a single record.
My understanding is that XmlInputFormat will take care of ensuring each file is its own record (as 1 tag exists per file/record).
My issue is this: I want to use Hadoop to process every document, search for information, and then, for each file/record, re-write or output a new XML document with new XML tags added.
Not afraid of reading and learning, but a skeleton to play with would really help me 'play' with and learn Hadoop.
Here is my driver:
public static void main(String[] args) {
    JobConf conf = new JobConf(myDriver.class);
    conf.setJobName("bigjob");

    // Input/Output Directories
    if (args[0].length() == 0 || args[1].length() == 0) System.exit(-1);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.set("xmlinput.start", "<document>");
    conf.set("xmlinput.end", "</document>");

    // Mapper & Combiner & Reducer
    conf.setMapperClass(Mapper.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(0);

    // Input/Output Types
    conf.setInputFormat(XmlInputFormat.class);
    conf.setOutputFormat(?????);
    conf.setOutputKeyClass(????);
    conf.setOutputValueClass(????);

    try {
        JobClient.runJob(conf);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I would say a simple solution would be to use TextOutputFormat and then use Text as the output key and NullWritable as the output value.
TextOutputFormat uses a delimiting character to separate the key and value pairs you output from your job. For your requirement, you don't need this arrangement, but you'd just like to output a single body of XML. If you pass null or NullWritable as the output key or value, TextOutputFormat will not write the null or the delimiter, just the non-null key or value.
Another approach to using XmlInputFormat would be to use a whole-file input format (as detailed in Tom White's Hadoop: The Definitive Guide).
Either way, you'll need to write your mapper to consume the input value Text object (maybe with an XML SAX or DOM parser) and then output the transformed XML as a Text object.
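To make that concrete, the missing driver settings and a mapper skeleton could look roughly like the sketch below (old mapred API). XmlProcessorMapper and the actual transformation are placeholders; only the output format / key / value choices come from the answer above.

    // Driver: plain text output, the rewritten XML as the key, nothing as the value.
    conf.setMapperClass(XmlProcessorMapper.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(NullWritable.class);

    // Mapper: parse each <document>...</document> value, add the new tags, emit the result.
    public static class XmlProcessorMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
            String xml = value.toString();
            // TODO: parse with a SAX/DOM parser, search for the information you need,
            // and build the rewritten document with the extra tags added.
            String rewritten = xml; // placeholder for the real transformation
            output.collect(new Text(rewritten), NullWritable.get());
        }
    }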
The output files produced by my Reduce operation are huge (1 GB after gzipping). I want it to break the output into smaller files of 200 MB. Is there a property/Java class to split reduce output by size or number of lines?
I cannot increase the number of reducers, because that has a negative impact on the performance of the Hadoop job.
I'm curious as to why you cannot just use more reducers, but I will take you at your word.
One option you can do is use MultipleOutputs and write to multiple files from one reducer. For example, say that the output file for each reducer is 1GB and you want 256MB files instead. This means you need to write 4 files per reducer rather than one file.
In your job driver, do this:
JobConf conf = ...;
// You should probably pass this in as parameter rather than hardcoding 4.
conf.setInt("outputs.per.reducer", 4);
// This sets up the infrastructure to write multiple files per reducer.
MultipleOutputs.addMultiNamedOutput(conf, "multi", YourOutputFormat.class, YourKey.class, YourValue.class);
In your reducer, do this:
@Override
public void configure(JobConf conf) {
    numFiles = conf.getInt("outputs.per.reducer", 1);
    multipleOutputs = new MultipleOutputs(conf);
    // other init stuff
    ...
}

@Override
public void reduce(YourKey key,
        Iterator<YourValue> valuesIter,
        OutputCollector<OutKey, OutVal> ignoreThis,
        Reporter reporter) throws IOException {
    // Do your business logic just as you're doing currently.
    OutKey outputKey = ...;
    OutVal outputVal = ...;

    // Now this is where it gets interesting. Hash the value to find
    // which output file the data should be written to. Don't use the
    // key since all the data will be written to one file if the number
    // of reducers is a multiple of numFiles.
    int fileIndex = (outputVal.hashCode() & Integer.MAX_VALUE) % numFiles;

    // Now use multiple outputs to actually write the data.
    // This will create output files named: multi_0-r-00000, multi_1-r-00000,
    // multi_2-r-00000, multi_3-r-00000 for reducer 0. For reducer 1, the files
    // will be multi_0-r-00001, multi_1-r-00001, multi_2-r-00001, multi_3-r-00001.
    multipleOutputs.getCollector("multi", Integer.toString(fileIndex), reporter)
        .collect(outputKey, outputVal);
}

@Override
public void close() throws IOException {
    // You must do this!!!!
    multipleOutputs.close();
}
This pseudo-code was written with the old mapred API in mind. Equivalent APIs exist in the new mapreduce API, though, so either way you should be all set.
There's no property to do this. You'll need to write your own output format & record writer.
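If you do go that route, here is a rough, untested sketch of the idea using the old mapred API. The class name and the max.output.bytes property are made up for illustration; it extends FileOutputFormat and rolls over to a new part file once the current one passes a byte threshold.

    // Sketch of a text output format whose record writer starts a new part file
    // after roughly max.output.bytes bytes (default 200 MB). Not production code.
    public class SizeRollingTextOutputFormat<K, V> extends FileOutputFormat<K, V> {

        @Override
        public RecordWriter<K, V> getRecordWriter(FileSystem ignored, final JobConf job,
                final String name, Progressable progress) throws IOException {
            final long maxBytes = job.getLong("max.output.bytes", 200L * 1024 * 1024);
            final Path base = FileOutputFormat.getTaskOutputPath(job, name);
            final FileSystem fs = base.getFileSystem(job);

            return new RecordWriter<K, V>() {
                private FSDataOutputStream out;   // opened lazily, reopened when rolling
                private int part = 0;
                private long written = 0;

                @Override
                public void write(K key, V value) throws IOException {
                    byte[] line = (key + "\t" + value + "\n").getBytes("UTF-8");
                    if (out == null || written + line.length > maxBytes) {
                        if (out != null) {
                            out.close();
                        }
                        out = fs.create(new Path(base + "-" + part++)); // start the next chunk
                        written = 0;
                    }
                    out.write(line);
                    written += line.length;
                }

                @Override
                public void close(Reporter reporter) throws IOException {
                    if (out != null) {
                        out.close();
                    }
                }
            };
        }
    }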