Unable to load the HFiles into HBase using mapreduce.LoadIncrementalHFiles - java

I want to insert the output of my map-reduce job into an HBase table using the HBase bulk loading API LoadIncrementalHFiles.doBulkLoad(new Path(), hTable).
I am emitting the KeyValue data type from my mapper and then using HFileOutputFormat to prepare my HFiles with its default reducer.
When I run my map-reduce job, it completes without any errors and creates the output files; however, the final step of inserting the HFiles into HBase does not happen. I get the error below after my map-reduce job completes:
13/09/08 03:39:51 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:54310/user/xx.xx/output/_SUCCESS
13/09/08 03:39:51 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory output/. Does it contain files in subdirectories that correspond to column family names?
But I can see the output directory containing:
1. _SUCCESS
2. _logs
3. _0/2aa96255f7f5446a8ea7f82aa2bd299e file (which contains my data)
I have no clue as to why the bulk loader is not picking up the files from the output directory.
Below is the code of my Map-Reduce driver class:
public static void main(String[] args) throws Exception {
    String inputFile = args[0];
    String tableName = args[1];
    String outFile = args[2];
    Path inputPath = new Path(inputFile);
    Path outPath = new Path(outFile);
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // set the configurations
    conf.set("mapred.job.tracker", "localhost:54311");

    // Input data to HTable using Map Reduce
    Job job = new Job(conf, "MapReduce - Word Frequency Count");
    job.setJarByClass(MapReduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, inputPath);
    fs.delete(outPath);
    FileOutputFormat.setOutputPath(job, outPath);
    job.setMapperClass(MapReduce.MyMap.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    HTable hTable = new HTable(conf, tableName.toUpperCase());

    // Auto configure partitioner and reducer
    HFileOutputFormat.configureIncrementalLoad(job, hTable);
    job.waitForCompletion(true);

    // Load generated HFiles into table
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path(outFile), hTable);
}
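(The mapper itself is not shown in the post. For context, a mapper that feeds HFileOutputFormat emits ImmutableBytesWritable/KeyValue pairs and typically looks roughly like the sketch below; the row key, column family, qualifier, and input line format are made up for illustration, and imports from org.apache.hadoop.hbase.* and org.apache.hadoop.io.* are assumed.)
public static class MyMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // hypothetical input format: "word<TAB>count"
        String[] parts = value.toString().split("\t");
        byte[] rowKey = Bytes.toBytes(parts[0]);
        KeyValue kv = new KeyValue(rowKey,
                Bytes.toBytes("cf"),    // column family - hypothetical
                Bytes.toBytes("count"), // qualifier - hypothetical
                Bytes.toBytes(parts[1]));
        context.write(new ImmutableBytesWritable(rowKey), kv);
    }
}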
How can I figure out what is going wrong here and preventing my data from being inserted into HBase?

Finally, I figured out why my HFiles were not getting loaded into HBase. Below are the details:
My CREATE TABLE DDL did not specify any column family, so my guess is that Phoenix created the default column family "_0". I was able to see this column family in my HDFS /hbase directory.
However, when I used HBase's LoadIncrementalHFiles API to fetch the files from my output directory, it did not pick up the directory named after that column family ("_0" in my case). I debugged the LoadIncrementalHFiles code and found that it skips every directory in the output path whose name starts with "_" (e.g. "_logs").
I retried the same job, this time specifying a column family explicitly, and everything worked perfectly fine. I am able to query the data using Phoenix SQL.
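For reference, a minimal sketch of what specifying the column family explicitly can look like through the Phoenix JDBC driver; the table name, column names, and ZooKeeper host below are made-up examples, not the original DDL:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePhoenixTable {
    public static void main(String[] args) throws Exception {
        // connect through the Phoenix JDBC driver (ZooKeeper quorum assumed to be localhost)
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // Prefixing a column with "CF." puts it in an explicit column family named CF,
            // so the HFiles end up under output/CF/ instead of the skipped output/_0/
            stmt.execute("CREATE TABLE IF NOT EXISTS WORD_FREQUENCY ("
                    + "WORD VARCHAR NOT NULL PRIMARY KEY, "
                    + "CF.FREQUENCY BIGINT)");
        }
    }
}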

Related

Hadoop File Empty after Write

We have an application that retrieves data from MongoDB and writes it to a Hadoop cluster.
The data is a list of strings that are converted to JSON and written to Hadoop using the following logic:
Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
conf.set("fs.defaultFS", HadoopConstants.HDFS_HOST + HadoopConstants.HDFS_DEFAULT_FS);

FSDataOutputStream out = null;
FileSystem fileSystem = FileSystem.get(conf);

// Create Hadoop FS Path and Directory Structure
if (!fileSystem.exists(new Path(dir))) {
    // Create new Directory
    fileSystem.mkdirs(new Path(dir), FsPermission.getDefault());
    out = fileSystem.create(new Path(filepath));
} else if (fileSystem.exists(new Path(dir))) {
    if (!fileSystem.exists(new Path(filepath))) {
        out = fileSystem.create(new Path(filepath));
    } else if (fileSystem.exists(new Path(filepath))) {
        // should not reach here
        fileSystem.delete(new Path(filepath), true);
        out = fileSystem.create(new Path(filepath));
    }
}

for (Iterator<String> it = list.iterator(); it.hasNext();) {
    String node = it.next();
    out.writeBytes(node);
    out.writeBytes("\n");
}
LOGGER.debug("Write to HDFS successful");
out.close();
The application works well in the QA and staging environments.
In the production environment, which sits behind an additional firewall (this firewall has now been opened to grant write access), the following error is seen.
The file is being created, but the final Hadoop file is empty, i.e. its size is 0 bytes.
The output of the hadoop fs -du and hadoop fsck commands on the file being written is attached in the screenshot. The size after replication increases to 384M during the write but then drops back to 0.
Is this because out.close() in the above code is not being called?
That would not explain the QA data being written correctly.
Could it be a firewall issue?
The file is being created correctly, so it does not seem to be a connectivity issue, unless the data written after the file is created and opened is not being flushed correctly and is therefore never saved.
Here are the file's details during the write:
$ hadoop fs -du -h file.json
0 384M ...
The "size after replication" value above increases to 384M and changes to 0 after a while. Does this mean data is arriving but not being flushed to disk correctly?
$ hadoop fsck
What are some ways I could verify whether the data is actually arriving on the Hadoop side?
**** UPDATE ****
The following exception is thrown in the client logs during execution of this line:
out.close();
HDFSWriter ::Write Failed :: Could not get block locations. Source file "part-m-2017102304-0000.json" - Aborting...
The Hadoop httpfs.out log has the following:
hadoop-httpfs ... INFO httpfsaudit: [/part-m-2017102304-0000.json] offset [0] len [204800]
This means that you have firewall access to the namenode (which is why the file can be created), but not to the datanodes (which are needed to actually write data into the file).
Get the firewall rules updated so that you also have access to the datanodes.
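If it helps, a rough way to test this from the client side is a minimal write that forces data out to the datanodes: create() only needs the namenode, while hflush() is the first call that needs the datanodes, so it fails with the same "Could not get block locations" error when they are unreachable. The test path and config file locations below are assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DatanodeWriteCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        FileSystem fs = FileSystem.get(conf);
        Path testFile = new Path("/tmp/datanode-write-check.txt"); // hypothetical test path

        try (FSDataOutputStream out = fs.create(testFile, true)) {
            out.writeBytes("datanode connectivity test\n");
            // hflush() pushes the buffered bytes into the datanode pipeline,
            // so it only succeeds if the datanodes are reachable through the firewall
            out.hflush();
        }
        System.out.println("Write and flush succeeded; datanodes are reachable.");
    }
}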

It is not possible to read txt files from a folder in Spark Streaming mode

I would like to read some text files from a folder and do some operations on them. I should mention that I am using IntelliJ IDEA on Mac OS. Here's what I did during my tests:
1- I copied and pasted files into the folder while my program was running
2- I moved files in from another folder
3- I renamed them
4- I put in new files for each test
5- I tested with both Scala and Java
Although I repeated the test under each of the above conditions, nothing could ever be read from the folder. Here is my Java code:
public static void main(String[] args) {
    String outputPathPrefix;
    String inputFolder;
    inputFolder = "/Users/saeedtkh/Desktop/SparkStreamingWordCountFolder/inputfolder";
    outputPathPrefix = "/Users/saeedtkh/Desktop/SparkStreamingWordCountFolder/output/";

    // Create a configuration object and set the name of the application
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Spark Streaming word count");

    // Create a Spark Streaming Context object
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Create a DStream reading the content of the input folder
    JavaDStream<String> lines = jssc.textFileStream(inputFolder);

    // Apply the "standard" transformations to perform the word count task
    // However, the "returned" RDDs are DStream/PairDStream RDDs
    JavaDStream<String> words = lines.flatMap(new Split());
    JavaPairDStream<String, Integer> wordsOnes = words.mapToPair(new WordOne());
    JavaPairDStream<String, Integer> wordsCounts = wordsOnes.reduceByKey(new Sum());

    wordsCounts.print();
    wordsCounts.dstream().saveAsTextFiles(outputPathPrefix, "");

    // Start the computation
    jssc.start();
    jssc.awaitTerminationOrTimeout(120000);
    jssc.close();
}
Can you help me to find the problem? Did I miss something?

Overwriting HDFS file/directory through Spark

Problem
I have a file saved in HDFS, and all I want to do is run my Spark application, compute a result JavaRDD and use saveAsTextFile() to store the new "file" in HDFS.
However, Spark's saveAsTextFile() does not work if the file already exists: it does not overwrite it.
What I tried
So I searched for a solution and found that a possible way to make it work could be to delete the file through the HDFS API before trying to save the new one.
I added this code:
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}
filerdd.saveAsTextFile("/hdfs/" + filename);
When I tried to run my Spark application, the file was deleted, but I got a FileNotFoundException.
Considering that this exception occurs when something tries to read a file from a path that does not exist, this makes no sense to me: after deleting the file, there is no code that tries to read it.
Part of my code
JavaRDD<String> filerdd = sc.textFile("/hdfs/" + filename); // load the file here
...
...
// Transformations here
filerdd = filerdd.map(....);
...
...
// Delete old file here
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}

// Write new file here
filerdd.saveAsTextFile("/hdfs/" + filename);
I am trying to do the simplest thing here, but I have no idea why this does not work. Maybe filerdd is somehow still connected to the path?
The problem is that you use the same path for input and output. Spark RDDs are evaluated lazily: the computation only runs when you call saveAsTextFile, and at that point you have already deleted newFolderPath, so filerdd complains.
In any case, you should not use the same path for input and output.
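A sketch of one way around this while keeping the same final path: write the result to a temporary location first (which forces the lazy pipeline to run while the input still exists), then delete the old file and rename the temporary output into place. The "_tmp" suffix and the map() body are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;

// ... inside the driver, with sc and filename defined as in the question ...
JavaRDD<String> filerdd = sc.textFile("/hdfs/" + filename)
        .map(line -> line.toUpperCase()); // placeholder transformation

String tmpOutput = "/hdfs/" + filename + "_tmp";
filerdd.saveAsTextFile(tmpOutput); // the lazy computation actually executes here

// only now is it safe to remove the original input and move the new output into place
FileSystem hdfs = FileSystem.get(new Configuration());
Path finalPath = new Path("/hdfs/" + filename);
hdfs.delete(finalPath, true);
hdfs.rename(new Path(tmpOutput), finalPath);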

Hadoop, MapReduce - Multiple Input/Output Paths

I am using the hadoop-local command to run the JAR for my MapReduce job. I wanted to know whether, instead of explicitly specifying the path of each file in my input folder, I could simply pass all the files from my input folder to the MapReduce job. The contents and number of files change due to the nature of the job I am trying to configure, and since I do not know the number of files in advance, only their contents, is there a way to pass every file from the input folder into my MapReduce program and then iterate over each file to compute a certain function, sending the results to the reducer? I am only using one Map/Reduce program and I am coding in Java. I am able to use the hadoop-moonshot command, but I am working with hadoop-local at the moment.
Thanks.
You don't have to pass each individual file as input to a MapReduce job.
The FileInputFormat class already provides an API to accept a list of multiple files and directories as input to a MapReduce program.
public static void setInputPaths(Job job,
                                 Path... inputPaths)
                          throws IOException
Set the array of Paths as the list of inputs for the map-reduce job.
Parameters:
job - The Job to modify
inputPaths - the Paths of the input directories or files for the map-reduce job
Example code from the Apache tutorial:
Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
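Since the question is about passing a whole folder rather than individual files, note that the Path you pass can be a directory: all files directly under it are used as input (files whose names start with "_" or "." are ignored). A small sketch with made-up paths:
// a directory path pulls in every file directly under it
FileInputFormat.addInputPath(job, new Path("/user/me/inputfolder"));

// or set several input paths (files and/or directories) in one call
FileInputFormat.setInputPaths(job,
        new Path("/user/me/run1"),
        new Path("/user/me/run2"));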
MultipleInputs provides the API below:
public static void addInputPath(Job job,
                                Path path,
                                Class<? extends InputFormat> inputFormatClass,
                                Class<? extends Mapper> mapperClass)
Add a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.
Related SE question:
Can hadoop take input from multiple directories and files
Refer to the MultipleOutputs API regarding your second query about multiple output paths:
FileOutputFormat.setOutputPath(job, outDir);
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
LongWritable.class, Text.class);
// Defines additional sequence-file based output 'sequence' for the job
MultipleOutputs.addNamedOutput(job, "seq",
SequenceFileOutputFormat.class,
LongWritable.class, Text.class);
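To actually write to those named outputs, you create a MultipleOutputs instance inside the reducer (or mapper). A minimal sketch, with the key/value types and the "text" output name matching the configuration above:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    private MultipleOutputs<LongWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // route the record to the named output registered as "text" above
            mos.write("text", key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush and close all named outputs
    }
}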
Have a look at related SE questions regarding multiple output files.
Writing to multiple folders in hadoop?
hadoop method to send output to multiple directories

Copying HDFS directory to local node

I'm working on a single-node Hadoop 2.4 cluster.
I'm able to copy a directory and all its content from HDFS using hadoop fs -copyToLocal myDirectory.
However, I'm unable to successfully do the same operation via this Java code:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Configuration conf = new Configuration(true);
    FileSystem hdfs = FileSystem.get(conf);
    hdfs.copyToLocalFile(false, new Path("myDirectory"),
            new Path("C:/tmp"));
}
This code only copies a part of myDirectory. I also receive some error messages:
14/08/13 14:57:42 INFO mapreduce.Job: Task Id : attempt_1407917640600_0013_m_000001_2, Status : FAILED
Error: java.io.IOException: Target C:/tmp/myDirectory is a directory
My guess is that multiple instances of the mapper are trying to copy the same file to the same node at the same time. However, I don't see why the content is not copied completely.
Is that the reason for my errors, and how could I solve it?
You can use the DistributedCache (documentation) to copy your files to all the datanodes, or you could try to copy the files in the setup() of your mapper.
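A rough sketch of the second option, copying once per task in setup() rather than in map(), and into a task-specific local directory so concurrent mappers don't overwrite each other. The paths mirror the ones in the question and are only illustrative:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CopyMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem hdfs = FileSystem.get(conf);
        // copy into a directory named after the task attempt to avoid collisions
        Path localDir = new Path("C:/tmp/" + context.getTaskAttemptID());
        hdfs.copyToLocalFile(false, new Path("myDirectory"), localDir);
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // normal map logic goes here; the local copy is already in place
    }
}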
