Copying HDFS directory to local node - java

I'm working on a single node Hadoop 2.4 cluster.
I'm able to copy a directory and all its content from HDFS using hadoop fs -copyToLocal myDirectory .
However, I'm unable to successfully do the same operation via this Java code:
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = new Configuration(true);
FileSystem hdfs = FileSystem.get(conf);
hdfs.copyToLocalFile(false, new Path("myDirectory"),
new Path("C:/tmp"));
}
This code only copies part of myDirectory. I also receive some error messages:
14/08/13 14:57:42 INFO mapreduce.Job: Task Id : attempt_1407917640600_0013_m_000001_2, Status : FAILED
Error: java.io.IOException: Target C:/tmp/myDirectory is a directory
My guess is that multiple instances of the mapper are trying to copy the same file to the same node at the same time. However, I don't see why that would prevent all of the content from being copied.
Is that the reason for my errors, and how can I solve it?

You can use the DistributedCache (see the documentation) to copy your files to all datanodes, or you could try copying the files in the setup of your mapper, as sketched below.
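For the second option, here's a minimal sketch, assuming the same paths as in the question; the per-task-attempt target directory is an illustrative choice to keep concurrent mappers on one node from colliding on the same local target:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CopyToLocalMapper extends Mapper<Object, Text, Text, Text> {

    private Path localTarget;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem hdfs = FileSystem.get(conf);
        // A directory unique to this task attempt, so concurrent mappers
        // on the same node do not write to the same local path.
        localTarget = new Path("C:/tmp/" + context.getTaskAttemptID());
        hdfs.copyToLocalFile(false, new Path("myDirectory"), localTarget);
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The copied directory is available under localTarget for this task.
    }
}

If every task needs the same read-only data, the DistributedCache route is usually cheaper, since the framework localizes the files once per node rather than once per task.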

Related

How to read files from HDFS using Spark?

I have built a recommendation system using Apache Spark with datasets stored locally in my project folder; now I need to access these files from HDFS.
How can I read files from HDFS using Spark?
This is how I initialize my Spark session:
SparkContext context = new SparkContext(new SparkConf().setAppName("spark-ml").setMaster("local")
.set("fs.default.name", "hdfs://localhost:54310").set("fs.defaultFS", "hdfs://localhost:54310"));
Configuration conf = context.hadoopConfiguration();
conf.addResource(new Path("/usr/local/hadoop-3.1.2/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop-3.1.2/etc/hadoop/hdfs-site.xml"));
conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
conf.set("fs.hdfs.impl", "org.apache.hadoop.fs.LocalFileSystem");
this.session = SparkSession.builder().sparkContext(context).getOrCreate();
System.out.println(conf.getRaw("fs.default.name"));
System.out.println(context.getConf().get("fs.defaultFS"));
All the outputs return hdfs://localhost:54310 which is the correct uri for my HDFS.
When trying to read a file from HDFS:
session.read().option("header", true).option("inferSchema", true).csv("hdfs://localhost:54310/recommendation_system/movies/ratings.csv").cache();
I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:54310/recommendation_system/movies/ratings.csv, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:730)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:636)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:65)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:281)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:253)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:253)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:361)
at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:360)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at com.dastamn.sparkml.analytics.SparkManager.<init>(SparkManager.java:36)
at com.dastamn.sparkml.Main.main(Main.java:22)
What can I do to solve this issue?
A couple of things about the code snippet pasted:
1. When a Hadoop property has to be set as part of a SparkConf, it has to be prefixed with spark.hadoop.; in this case the key fs.default.name needs to be set as spark.hadoop.fs.default.name, and likewise for the other properties.
2. The argument to the csv function does not have to include the HDFS endpoint; Spark will figure it out from the default properties, since they are already set.
session.read().option("header", true).option("inferSchema", true).csv("/recommendation_system/movies/ratings.csv").cache();
If the default filesystem properties are not set as part of the Hadoop configuration, the complete URI is required for Spark/Hadoop to figure out the filesystem to use.
(Also, the object named conf is not used.)
3. In the above case, it looks like Hadoop was not able to find a FileSystem for the hdfs:// URI prefix and resorted to the default filesystem, which is local in this case (since it is using RawLocalFileSystem to process the path).
Make sure that hadoop-hdfs.jar, which contains DistributedFileSystem, is present on the classpath, so that the FS objects for HDFS can be instantiated.
Here's the configuration that solved the problem:
SparkContext context = new SparkContext(new SparkConf().setAppName("spark-ml").setMaster("local[*]")
.set("spark.hadoop.fs.default.name", "hdfs://localhost:54310").set("spark.hadoop.fs.defaultFS", "hdfs://localhost:54310")
.set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
.set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
.set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName()));
this.session = SparkSession.builder().sparkContext(context).getOrCreate();

Hadoop File Empty after Write

We have an application that retrieves data from MongoDB and writes to a Hadoop cluster.
The data is a list of strings that are converted to JSON and written to Hadoop using the following logic:
Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
conf.set("fs.defaultFS", HadoopConstants.HDFS_HOST + HadoopConstants.HDFS_DEFAULT_FS);
FSDataOutputStream out = null;
// Obtain the FileSystem for the configured fs.defaultFS
FileSystem fileSystem = FileSystem.get(conf);
// Create Hadoop FS path and directory structure
if (!fileSystem.exists(new Path(dir))) {
// Create new Directory
fileSystem.mkdirs(new Path(dir), FsPermission.getDefault());
out = fileSystem.create(new Path(filepath));
} else if (fileSystem.exists(new Path(dir))) {
if (!fileSystem.exists(new Path(filepath))) {
out = fileSystem.create(new Path(filepath));
} else if (fileSystem.exists(new Path(filepath))) {
//should not reach here .
fileSystem.delete(new Path(filepath), true);
out = fileSystem.create(new Path(filepath));
}
}
for (Iterator<String> it = list.iterator(); it.hasNext();) {
String node = it.next();
out.writeBytes(node);
out.writeBytes("\n");
}
LOGGER.debug("Write to HDFS successful");
out.close();
The application works well in the QA and Staging environments.
In the production environment, which requires an additional firewall to connect to (this firewall has now been opened to grant access for writes), the following error is seen.
The file is being created, but the final Hadoop file is empty, i.e. its size is 0 bytes.
The output of the hadoop fs -du and hadoop fsck commands on the file being written is attached in the screenshot. The size after replication during the write increases to 384M but then becomes 0 again.
Is this because out.close() in the above code is not being called?
That wouldn't explain the QA data being written correctly.
Could it be a firewall issue?
The file is being created correctly, so it doesn't seem to be a connectivity issue, unless after the file is created and opened the data is written but not flushed correctly, so it is never saved.
Here are the file's specifications during the write:
$ hadoop fs -du -h file.json
0 384M ...
The "size after replication" value above increases to 384M and changes to 0 after a while. Does this mean the data is arriving but not being flushed to disk correctly?
$ hadoop fsck
What are some ways I could verify whether the data is being fetched and arriving on the Hadoop side?
**** UPDATE ****
The following exception is thrown in the client logs during execution of the following line:
out.close();
HDFSWriter ::Write Failed :: Could not get block locations. Source file "part-m-2017102304-0000.json" - Aborting...
The Hadoop httpfs.out logs contain the following:
hadoop-httpfs ... INFO httpfsaudit: [/part-m-2017102304-0000.json] offset [0] len [204800]
It means that you have firewall access to the namenode (which can create the file), but not to the datanodes (which are needed to write data to the files).
Get the firewall rules updated so that you also have access to the datanodes. The sketch below shows one way to list which datanode hosts the client needs to reach.
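As a client-side check, here's a minimal sketch, assuming the same configuration files as in the question; the file path is illustrative. It prints the datanode hosts holding each block of a file, so you can verify firewall access to each of them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(conf);

        // Illustrative path; point this at the file that ends up empty.
        FileStatus status = fs.getFileStatus(new Path("/data/part-m-2017102304-0000.json"));

        // Each BlockLocation lists the datanodes storing that block; the
        // client must be able to reach these hosts to write or read data.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            for (String host : loc.getHosts()) {
                System.out.println("block @ " + loc.getOffset() + " -> " + host);
            }
        }
    }
}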

DistributedCache in Hadoop 2.x

I have a problem with the DistributedCache in the new Hadoop 2.x API. I found some people working around this issue, but their example solution does not work for me, because I get a NullPointerException when trying to retrieve the data from the DistributedCache.
My configuration is as follows:
Driver
public int run(String[] arg) throws Exception {
Configuration conf = this.getConf();
Job job = new Job(conf, "job Name");
...
job.addCacheFile(new URI(arg[1]));
Setup
protected void setup(Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
URI[] cacheFiles = context.getCacheFiles();
BufferedReader dtardr = new BufferedReader(new FileReader(cacheFiles[0].toString()));
Here, when it starts creating the buffered reader, it throws the NullPointerException. This happens because context.getCacheFiles() always returns null. How can I solve this problem, and where are the cache files stored (HDFS, or the local file system)?
If you use the local JobRunner in Hadoop (non-distributed mode, as a single Java process), then no local data directory is created; the getLocalCacheFiles() or getCacheFiles() call will return an empty set of results. Make sure that you are running your job in distributed or pseudo-distributed mode.
The Hadoop framework will copy files set in the distributed cache to the local working directory of each task in the job.
There are copies of all cached files, placed in the local file system of each worker machine. (They will be in a subdirectory of mapred.local.dir.)
You can refer to this link to understand more about the DistributedCache. A minimal sketch of the working pattern follows.
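This sketch assumes the job runs in distributed or pseudo-distributed mode and the driver calls job.addCacheFile(...) as in the question; the mapper's key/value types are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // In distributed or pseudo-distributed mode this is non-null and
        // contains the URIs passed to job.addCacheFile(...) in the driver.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles == null || cacheFiles.length == 0) {
            throw new IOException("No cache files found; is the job running in local mode?");
        }
        // The framework localizes each cached file and symlinks it into the
        // task's working directory under its base name.
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // use the cached data ...
            }
        }
    }
}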

ListFiles from HDFS Cluster

I am an amateur with Hadoop and related tools. Now, I am trying to access the Hadoop cluster (HDFS) and retrieve the list of files from a client Eclipse application. I can do the following operations after setting up the required configurations in the Hadoop Java client.
I can perform copyFromLocalFile, copyToLocalFile operations accessing HDFS from client.
Here's what I am facing. When I call the listFiles() method, I get:
org.apache.hadoop.fs.LocatedFileStatus#d0085360
org.apache.hadoop.fs.LocatedFileStatus#b7aa29bf
MainMethod
Properties props = new Properties();
props.setProperty("fs.defaultFS", "hdfs://<IPOFCLUSTER>:8020");
props.setProperty("mapreduce.jobtracker.address", "<IPOFCLUSTER>:8032");
props.setProperty("yarn.resourcemanager.address", "<IPOFCLUSTER>:8032");
props.setProperty("mapreduce.framework.name", "yarn");
FileSystem fs = FileSystem.get(toConfiguration(props)); // Setting up the required configurations
Path p4 = new Path("/user/myusername/inputjson1/");
RemoteIterator<LocatedFileStatus> ritr = fs.listFiles(p4, true);
while(ritr.hasNext())
{
System.out.println(ritr.next().toString());
}
I have also tried FileContext and ended up only getting the FileStatus object's string representation. Is there a way to get the filenames when I iterate over the remote HDFS directory? There is a method called getPath(); is that the only way to retrieve the full path of the filenames using the Hadoop API, or is there another method so that I can retrieve only the names of the files in a specified directory path? Please help me with this. Thanks.
You can indeed use getPath(); it returns a Path object which lets you query the name of the file.
Path p = ritr.next().getPath();
// returns the filename or directory name if directory
String name = p.getName();
The FileStatus object you get can tell you if this is a file or directory.
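For example, reusing the iterator from the question, a small sketch that prints only the file names:

while (ritr.hasNext()) {
    // getPath().getName() yields just the file name, without the directory part
    System.out.println(ritr.next().getPath().getName());
}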
Here is more API documentation:
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/Path.html
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/FileStatus.html

How to put a serialized object into the Hadoop DFS and get it back inside the map function?

I'm new to Hadoop and recently I was asked to do a test project using Hadoop.
So while I was reading about Big Data, I happened to learn about Pail. Now what I want to do is something like this: first create a simple object, then serialize it using Thrift and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want with it. But I have no idea how to get that object inside the map function.
Can someone please tell me of any references or explain how to do that?
I can think of three options:
Use the -files option and name the file in HDFS (preferable as the task tracker will download the file once for all jobs running on that node)
Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line
Load the file directly from HDFS (less efficient as you're pulling the file over HDFS for each task)
As for some code, put the load logic into your mapper's setup(...) or configure(...) method (depending on whether you're using the new or old API), as follows:
protected void setup(Context context) throws IOException, InterruptedException {
    // the -files option makes the named file available in the local directory
    File file = new File("filename.dat");
    // open file and load contents ...

    // or load the file directly from HDFS
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load file contents from stream...
}
DistributedCache has some example code in the Javadocs
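For the Thrift part of the question, a hedged sketch of turning the loaded bytes back into an object; MyThriftObject is a hypothetical stand-in for your Thrift-generated class, and the path is the placeholder from the snippet above:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

// Called from setup(...): reads the serialized bytes from HDFS and lets
// Thrift rebuild the object.
private MyThriftObject loadObject(Context context) throws IOException, TException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (InputStream in = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"))) {
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
    }
    TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
    MyThriftObject obj = new MyThriftObject(); // hypothetical Thrift-generated type
    deserializer.deserialize(obj, buffer.toByteArray());
    return obj;
}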
