DistributedCache in Hadoop 2.x - Java

I have a problem with the DistributedCache in the new Hadoop 2.x API. I found some people working around this issue, but the suggested solution does not work for me, because I get a NullPointerException when trying to retrieve the data from the DistributedCache.
My Configuration is as follows:
Driver
public int run(String[] arg) throws Exception {
    Configuration conf = this.getConf();
    Job job = new Job(conf, "job Name");
    ...
    job.addCacheFile(new URI(arg[1]));
Setup
protected void setup(Context context)
        throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFiles = context.getCacheFiles();
    BufferedReader dtardr = new BufferedReader(new FileReader(cacheFiles[0].toString()));
Here, when it starts creating the buffered reader, it throws the NullPointerException. This happens because context.getCacheFiles() always returns null. How can I solve this problem, and where are the cache files stored (HDFS, or the local file system)?

If you use the local JobRunner in Hadoop (non-distributed mode, as a single Java process), then no local data directory is created; the getLocalCacheFiles() or getCacheFiles() call will return an empty set of results. Make sure that you are running your job in distributed or pseudo-distributed mode.
The Hadoop framework will copy files placed in the distributed cache to the local working directory of each task in the job.
There are copies of all cached files in the local file system of each worker machine (in a subdirectory of mapred.local.dir).
Refer to this link to understand more about DistributedCache.
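For reference, here is a minimal sketch of the Hadoop 2.x pattern (the HDFS path, the #lookup symlink name, and the helper method are hypothetical, not from the question). In the driver, the file is registered with a fragment so that a symlink of that name is created in each task's working directory; in the mapper's setup(), getCacheFiles() is non-null once the job really runs in pseudo-distributed or distributed mode:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Driver side (hypothetical path): the fragment after '#' names the symlink
    // created in every task's working directory.
    public static void registerCacheFile(Job job) throws Exception {
        job.addCacheFile(new URI("hdfs://localhost:8020/user/user1/lookup.txt#lookup"));
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Non-null only when the job runs in pseudo-distributed or distributed mode.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles == null || cacheFiles.length == 0) {
            throw new IOException("No cache files - is the job using the local job runner?");
        }
        // Read the cached copy through the "lookup" symlink in the working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("./lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // ... parse the side data into memory here ...
            }
        }
    }
}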

Related

File copying in Java (cannot create file) with Hadoop

I currently want to copy a file from HDFS to the local computer. I have finished most of the work using file input and output streams, but then I encounter the following issue:
Java I/O exception: Mkdirs failed to create file
I have done some research and figured out that, because I am using
FileSystem.create() (a Hadoop function)
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20org.apache.hadoop.util.Progressable)
the reason is as follows:
if I set my path to a non-existing folder, the folder is created and the downloaded file ends up inside it;
if I set my path to an existing folder (say, the current directory), the above I/O exception occurs.
Given that I already have the path and the file input stream right, what should I use (preferably from the FileSystem library) to work around this problem?
My code:
// src and dst are the input and output paths
Configuration conf = new Configuration();
FileSystem inFS = FileSystem.get(URI.create(src), conf);
FileSystem outFS = FileSystem.get(URI.create(dst), conf);
FSDataInputStream in = null;
FSDataOutputStream out = null;
in = inFS.open(new Path(src));
out = outFS.create(new Path(dst),
        new Progressable() {
            /*
             * Print a dot whenever 64 KB of data has been written to
             * the datanode pipeline.
             */
            public void progress() {
                System.out.print(".");
            }
        });
In the "File" class there is a method called
createNewFile() that will create a new file only if one doent exist.
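A practical way around the original "Mkdirs failed" issue, sketched here under the assumption that the destination may be an existing directory (the class and method names are illustrative, not from the post): build a full file path inside that directory, reusing the source file's name, before calling create().
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsToLocalCopy {
    public static void copy(String src, String dst) throws Exception {
        Configuration conf = new Configuration();
        FileSystem inFS = FileSystem.get(URI.create(src), conf);
        FileSystem outFS = FileSystem.get(URI.create(dst), conf);

        Path srcPath = new Path(src);
        Path dstPath = new Path(dst);
        // If dst already exists as a directory, write to <dst>/<srcFileName>
        // instead of trying to create a file with the directory's own name.
        if (outFS.exists(dstPath) && outFS.getFileStatus(dstPath).isDirectory()) {
            dstPath = new Path(dstPath, srcPath.getName());
        }

        try (FSDataInputStream in = inFS.open(srcPath);
             FSDataOutputStream out = outFS.create(dstPath, true)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}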

Some errors happen when loading data to HDFS

I have a Java program trying to load data to HDFS:
public class CopyFileToHDFS {
    public static void main(String[] args) {
        try {
            Configuration configuration = new Configuration();
            String msg = "message1";
            String file = "hdfs://localhost:8020/user/user1/input.txt";
            FileSystem hdfs = FileSystem.get(new URI(file), configuration);
            FSDataOutputStream outputStream = hdfs.create(new Path(file), true);
            outputStream.write(msg.getBytes());
            outputStream.close(); // close the stream so the data is flushed to HDFS
        }
        catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
When I run the program, it gives me an error:
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3.S3FileSystem not found
It looks like a configuration issue. Can anyone give me some suggestions?
Thanks
Something is telling the ServiceLoader that org.apache.hadoop.fs.FileSystem includes an S3 provider. One possible cause is an old, stale META-INF/services file; see this Spark bug report.
If you're creating an uber-jar, it could be somewhere in there. If you can't find and eliminate the entry that's causing the problem, a workaround is to include the AWS and Hadoop jars where the Spark driver/executors can find them; see this Stack Overflow question.
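To track down where the stale entry comes from, a small diagnostic sketch (plain Java; nothing beyond standard classloader APIs is assumed) can print every service file on the classpath that registers FileSystem providers, so you can see which jar still declares the missing S3 class:
import java.net.URL;
import java.util.Enumeration;

public class FindFileSystemProviders {
    public static void main(String[] args) throws Exception {
        // List every META-INF/services registration for Hadoop's FileSystem.
        Enumeration<URL> resources = Thread.currentThread().getContextClassLoader()
                .getResources("META-INF/services/org.apache.hadoop.fs.FileSystem");
        while (resources.hasMoreElements()) {
            System.out.println(resources.nextElement());
        }
    }
}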

Copying HDFS directory to local node

I'm working on a single node Hadoop 2.4 cluster.
I'm able to copy a directory and all its content from HDFS using hadoop fs -copyToLocal myDirectory .
However, I'm unable to successfully do the same operation via this Java code:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Configuration conf = new Configuration(true);
    FileSystem hdfs = FileSystem.get(conf);
    hdfs.copyToLocalFile(false, new Path("myDirectory"),
            new Path("C:/tmp"));
}
This code only copies a part of myDirectory. I also receive some error messages:
14/08/13 14:57:42 INFO mapreduce.Job: Task Id : attempt_1407917640600_0013_m_000001_2, Status : FAILED
Error: java.io.IOException: Target C:/tmp/myDirectory is a directory
My guess is that multiple instances of the mapper are trying to copy the same file to the same node at the same time. However, I don't see why not all of the content gets copied.
Is that the reason for my errors, and how could I solve it?
You can use the DistributedCache (documentation) to copy your files to all datanodes, or you could try copying the files in the setup of your mapper, as sketched below.
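A minimal sketch of the second suggestion (the directory name and the C:/tmp target come from the question; the existence check is an added assumption, and it does not fully remove the race between concurrent tasks on the same node):
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CopyOnceMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Do the copy once per task, not once per input record.
        Configuration conf = context.getConfiguration();
        FileSystem hdfs = FileSystem.get(conf);
        File localTarget = new File("C:/tmp/myDirectory");
        // Skip the copy if another task on this node has already created the
        // directory, to avoid the "Target ... is a directory" collision.
        if (!localTarget.exists()) {
            hdfs.copyToLocalFile(false, new Path("myDirectory"), new Path("C:/tmp"));
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... per-record processing ...
    }
}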

How to index entire local Hard Drive into Apache Solr?

Is there a good approach, with Solr or a client library feeding into Solr, to index an entire hard drive? This should include the content of zip files, including zip files nested recursively within other zip files.
This should be able to run on Linux (no Windows-only clients).
This will of course involve making a single scan over the entire file system from the root (or from any folder, actually). I'm not concerned at this point with keeping the index up to date, just with creating it initially. This would be similar to the old "Google Desktop" app, which Google discontinued.
You can manipulate Solr using the SolrJ API.
Here's the API documentation: http://lucene.apache.org/solr/4_0_0/solr-solrj/index.html
And here's an article on how to use SolrJ to index files on your hard drive:
http://blog.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/
Files are represented by SolrInputDocument, and you use addField to attach the fields that you'd like to search on at a later time.
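For illustration, a minimal SolrJ sketch (the core URL and field names are hypothetical, not taken from the article; they must match your own Solr schema):
import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleFileIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; adjust to your Solr installation.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        File file = new File(args[0]);
        SolrInputDocument doc = new SolrInputDocument();
        // Example field names; they must exist in your schema.
        doc.addField("id", file.getAbsolutePath());
        doc.addField("filename_s", file.getName());
        doc.addField("size_l", file.length());

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}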
Here's example code for an Index Driver:
public class IndexDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // TODO: Add some checks here to validate the input path
        int exitCode = ToolRunner.run(new Configuration(),
                new IndexDriver(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), IndexDriver.class);
        conf.setJobName("Index Builder - Adam S @ Cloudera");
        conf.setSpeculativeExecution(false);

        // Set input and output paths
        FileInputFormat.setInputPaths(conf, new Path(args[0].toString()));
        FileOutputFormat.setOutputPath(conf, new Path(args[1].toString()));

        // Use TextInputFormat
        conf.setInputFormat(TextInputFormat.class);

        // Mapper has no output
        conf.setMapperClass(IndexMapper.class);
        conf.setMapOutputKeyClass(NullWritable.class);
        conf.setMapOutputValueClass(NullWritable.class);

        conf.setNumReduceTasks(0);
        JobClient.runJob(conf);
        return 0;
    }
}
Read the article for more info.
Compressed files
Here's info on handling compressed files: Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats
There seems to be a bug with Solr not handling zip files; here's the bug report with a fix: https://issues.apache.org/jira/browse/SOLR-2416
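By way of illustration, and not taken from the linked pages: sending an archive through the ExtractingRequestHandler from SolrJ looks roughly like the sketch below. It assumes the default /update/extract handler is enabled in solrconfig.xml, the core URL and file name are placeholders, and exact SolrJ method signatures can vary a little between versions.
import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ZipExtractExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL and file name.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Send the archive to the ExtractingRequestHandler (Solr Cell / Tika).
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("archive.zip"), "application/zip");
        req.setParam("literal.id", "archive.zip"); // document id for the archive
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
        server.shutdown();
    }
}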

Distributed cache

I am working with Hadoop 0.19 on openSUSE Linux. I am not using any cluster, rather running my Hadoop code on my machine itself. I am following the standard technique for putting files into the distributed cache, but instead of accessing the files from the distributed cache again and again, I store the contents of the file in an array. This part of extracting from the file is done in the configure() function. I am getting a NullPointerException when I try to use the file name. This is the relevant part of the code:
// ... part of main() ...
DistributedCache.addCacheFile(new URI("/home/hmobile/hadoop-0.19.2/output/part-00000"), conf2);
DistributedCache.addCacheFile(new URI("/home/hmobile/hadoop-0.19.2/output/part-00001"), conf2);

// ... part of the mapper ...
public void configure(JobConf conf2)
{
    String wrd;
    String line;
    try {
        localFiles = DistributedCache.getLocalCacheFiles(conf2);
        System.out.println(localFiles[0].getName()); // NullPointerException here
    } catch (IOException ex) {
        Logger.getLogger(blur2.class.getName()).log(Level.SEVERE, null, ex);
    }
    for (Path f : localFiles) // NullPointerException here
    {
        if (!f.getName().endsWith("crc"))
        {
            BufferedReader br = null;
            try {
                br = new BufferedReader(new FileReader(f.toString()));
Can such processing not be done in configure()?
This depends on whether you're using the local job runner (mapred.job.tracker=local) or running in pseudo-distributed mode (i.e. mapred.job.tracker=localhost:8021 or =mynode.mydomain.com:8021). The distributed cache does NOT work in local mode, only in pseudo-distributed and fully distributed modes.
Otherwise, using the distributed cache in configure() is fine.
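For completeness, a defensive version of configure() (a sketch using the old org.apache.hadoop.mapred API, not the poster's exact class) that checks for the null result you get under the local job runner:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CacheReadingMapperBase extends MapReduceBase {

    @Override
    public void configure(JobConf conf) {
        try {
            // Null under the local job runner (mapred.job.tracker=local); populated
            // only when the job runs on a pseudo-distributed or distributed cluster.
            Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
            if (localFiles == null) {
                System.err.println("No local cache files - running with the local job runner?");
                return;
            }
            for (Path f : localFiles) {
                if (f.getName().endsWith("crc")) {
                    continue; // skip checksum files
                }
                try (BufferedReader br = new BufferedReader(new FileReader(f.toString()))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        // ... load the line into an in-memory structure here ...
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to read distributed cache files", e);
        }
    }
}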
