I have built a recommendation system using Apache Spark, with datasets stored locally in my project folder; now I need to access these files from HDFS.
How can I read files from HDFS using Spark?
This is how I initialize my Spark session:
SparkContext context = new SparkContext(new SparkConf().setAppName("spark-ml").setMaster("local")
.set("fs.default.name", "hdfs://localhost:54310").set("fs.defaultFS", "hdfs://localhost:54310"));
Configuration conf = context.hadoopConfiguration();
conf.addResource(new Path("/usr/local/hadoop-3.1.2/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop-3.1.2/etc/hadoop/hdfs-site.xml"));
conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
conf.set("fs.hdfs.impl", "org.apache.hadoop.fs.LocalFileSystem");
this.session = SparkSession.builder().sparkContext(context).getOrCreate();
System.out.println(conf.getRaw("fs.default.name"));
System.out.println(context.getConf().get("fs.defaultFS"));
All the outputs return hdfs://localhost:54310, which is the correct URI for my HDFS.
When trying to read a file from HDFS:
session.read().option("header", true).option("inferSchema", true).csv("hdfs://localhost:54310/recommendation_system/movies/ratings.csv").cache();
I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:54310/recommendation_system/movies/ratings.csv, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:730)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:636)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:65)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:281)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:253)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:253)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:361)
at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:360)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at com.dastamn.sparkml.analytics.SparkManager.<init>(SparkManager.java:36)
at com.dastamn.sparkml.Main.main(Main.java:22)
What can I do to solve this issue?
A couple of things stand out from the code snippet you pasted:
1. When a Hadoop property is set through SparkConf, it has to be prefixed with spark.hadoop.; in this case the key fs.default.name needs to be set as spark.hadoop.fs.default.name, and likewise for the other properties.
2. The argument to the csv function does not need to include the HDFS endpoint; Spark will figure it out from the default filesystem properties, since they are already set.
session.read().option("header", true).option("inferSchema",
true).csv("/recommendation_system/movies/ratings.csv").cache();
If the default filesystem properties are not set as part of the Hadoop Configuration, the complete URI is required for Spark/Hadoop to figure out which filesystem to use.
(Also, the object named conf is not used.)
3. In this case, it looks like Hadoop was not able to find a FileSystem implementation for the hdfs:// URI prefix and fell back to the default filesystem, which is the local one here (since it is using RawLocalFileSystem to process the path).
Make sure that hadoop-hdfs.jar is present on the classpath; it contains DistributedFileSystem, which is needed to instantiate the FS objects for HDFS (a quick way to check this is sketched below).
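As a quick sanity check (a minimal sketch; the URI and port are taken from the question above, the class name is just for illustration), you can ask Hadoop which FileSystem implementation it resolves for the hdfs scheme. If hadoop-hdfs is missing from the classpath, this will not print DistributedFileSystem:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsFsCheck {
    public static void main(String[] args) throws Exception {
        // Ask Hadoop which FileSystem implementation backs the hdfs:// scheme
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:54310"), new Configuration());
        // Expect org.apache.hadoop.hdfs.DistributedFileSystem when hadoop-hdfs is on the classpath;
        // a "No FileSystem for scheme: hdfs" error or LocalFileSystem here points to a classpath problem
        System.out.println(fs.getClass().getName());
    }
}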
Here's the configuration that solved the problem:
SparkContext context = new SparkContext(new SparkConf().setAppName("spark-ml").setMaster("local[*]")
.set("spark.hadoop.fs.default.name", "hdfs://localhost:54310").set("spark.hadoop.fs.defaultFS", "hdfs://localhost:54310")
.set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
.set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
.set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName()));
this.session = SparkSession.builder().sparkContext(context).getOrCreate();
Related
I'd like to use Google's JIMFS for creating a virtual file system for testing purposes. I have trouble getting started, though.
I looked at this tutorial: http://www.hascode.com/2015/03/creating-in-memory-file-systems-with-googles-jimfs/
However, when I create the file system, it actually gets created in the existing file system, i.e. I cannot do:
Files.createDirectory("/virtualfolder");
because I am denied access.
Am I missing something?
Currently, my code looks something like this:
Test Class:
FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
Path vTargetFolder = fs.getPath("/Store/homes/linux/abc/virtual");
TestedClass test = new TestedClass(vTargetFolder.toAbsolutePath().toString());
Java class somewhere:
targetPath = Paths.get(targetName);
Files.createDirectory(targetPath);
// etc., creating files and writing them to the target directory
However, I created a separate class just to test JIMFS, and here the creation of the directory doesn't fail, but I cannot create a new file like this:
FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
Path data = fs.getPath("/virtual");
Path dir = Files.createDirectory(data);
Path file = Files.createFile(Paths.get(dir + "/abc.txt")); // throws NoSuchFileException
What am I doing wrong?
The problem is a mix of Default FileSystem and new FileSystem.
Problem 1:
Files.createDirectory("/virtualfolder");
This will actually not compile, so I suspect you meant:
Files.createDirectory(Paths.get("/virtualfolder"));
This attempts to create a directory in the root directory of the default filesystem. You need privileges to do that and probably should not do it in a test. I suspect you tried to work around this problem by using strings and ran into
Problem 2:
Let's look at your code, with comments:
FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
// now get path in the new FileSystem
Path data = fs.getPath("/virtual");
// create a directory in the new FileSystem
Path dir = Files.createDirectory(data);
// create a file in the default FileSystem
// with a parent that was never created there
Path file = Files.createFile(Paths.get(dir + "/abc.txt")); // throws NoSuchFileException
Let's look at the last line:
dir + "/abc.txt" >> is the string "/virtual/abc.txt"
Paths.get(dir + "/abc.txt") >> is that string as a path in the default filesystem
Remember that the virtual filesystem lives parallel to the default filesystem.
Paths belong to a filesystem and cannot be used in another filesystem. They are not just names.
Notes:
When working with virtual filesystems, avoid the Paths class; it always works in the default filesystem. Files is fine, because you have created a Path in the correct filesystem first.
If your original plan was to work with a virtual filesystem mounted into the default filesystem, you need a bit more. I have a project where I create a WebDAV server based on a virtual filesystem and then use the OS's built-in methods to mount that as a volume.
In your shell, try ls /
The output should contain the "/virtual" directory.
If this is not the case, which I suspect, then:
The program is masking a:
java.nio.file.AccessDeniedException: /virtual/abc.txt
In reality the code should be failing at Path dir = Files.createDirectory(data);
But for some reason this exception is silent, and the program continues without having created the directory (while thinking it has) and attempts to write to a directory that doesn't exist,
leaving a misleading java.nio.file.NoSuchFileException.
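For reference, here is a minimal corrected sketch of the original test (same Jimfs setup as in the question; the class name is only for illustration): both the directory and the file are created through paths obtained from the Jimfs FileSystem, so nothing falls back to the default filesystem.
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.Files;
import java.nio.file.Path;
import com.google.common.jimfs.Configuration;
import com.google.common.jimfs.Jimfs;

public class JimfsSketch {
    public static void main(String[] args) throws IOException {
        FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
        // the directory lives inside the in-memory filesystem
        Path dir = Files.createDirectory(fs.getPath("/virtual"));
        // resolve() keeps the Jimfs filesystem context, unlike Paths.get(...)
        Path file = Files.createFile(dir.resolve("abc.txt"));
        System.out.println(Files.exists(file)); // prints true
    }
}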
I suggest you use memoryfilesystem instead. It has a much more complete implementation than Jimfs; in particular, it supports POSIX attributes when creating a "Linux" filesystem etc.
Using it, your code will actually work:
try (
final FileSystem fs = MemoryFileSystemBuilder.newLinux()
.build("testfs");
) {
// create a directory, a file within this directory etc
}
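For instance, a small sketch under the assumption that the memoryfilesystem artifact is on the classpath ("testfs" is just the name used above, the class name is only for illustration), exercising the POSIX attribute support the answer mentions:
import java.nio.file.FileSystem;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;
import com.github.marschall.memoryfilesystem.MemoryFileSystemBuilder;

public class MemoryFsSketch {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = MemoryFileSystemBuilder.newLinux().build("testfs")) {
            Path dir = Files.createDirectory(fs.getPath("/virtual"));
            Path file = Files.createFile(dir.resolve("abc.txt"));
            // POSIX attributes are supported on the "Linux" flavour
            Files.setPosixFilePermissions(file, PosixFilePermissions.fromString("rw-r--r--"));
            System.out.println(Files.getPosixFilePermissions(file));
        }
    }
}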
It seems that instead of
Path file = Files.createFile(Paths.get(dir + "/abc.txt"));
you should be doing
Path file = Files.createFile(dir.resolve("abc.txt"));
This way, the context of dir (its filesystem) is not lost.
I'm working on a single node Hadoop 2.4 cluster.
I'm able to copy a directory and all its content from HDFS using hadoop fs -copyToLocal myDirectory .
However, I'm unable to successfully do the same operation via this Java code:
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = new Configuration(true);
FileSystem hdfs = FileSystem.get(conf);
hdfs.copyToLocalFile(false, new Path("myDirectory"),
new Path("C:/tmp"));
}
This code only copies a part of myDirectory. I also receive some error messages :
14/08/13 14:57:42 INFO mapreduce.Job: Task Id : attempt_1407917640600_0013_m_000001_2, Status : FAILED
Error: java.io.IOException: Target C:/tmp/myDirectory is a directory
My guess is that multiple instances of the mapper are trying to copy the same file to the same node at the same time. However, I don't see why that would keep part of the content from being copied.
Is that the reason for my errors, and how could I solve it?
You can use the DistributedCache (documentation) to copy your files to all datanodes, or you could try copying the files in the setup of your mapper.
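If you take the mapper-setup route, here is a minimal sketch (the local target directory and the generic Mapper type parameters are assumptions); copying into a directory named after the task attempt avoids concurrent map tasks colliding on the same C:/tmp target:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CopyInSetupMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem hdfs = FileSystem.get(conf);
        // each task attempt copies into its own local directory,
        // so concurrent mappers do not overwrite each other's copy
        Path localTarget = new Path("C:/tmp/" + context.getTaskAttemptID());
        hdfs.copyToLocalFile(false, new Path("myDirectory"), localTarget);
    }
}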
I want to insert the output of my map-reduce job into an HBase table using the HBase bulk loading API LoadIncrementalHFiles.doBulkLoad(new Path(), hTable).
I am emitting the KeyValue data type from my mapper and then using HFileOutputFormat to prepare my HFiles with its default reducer.
When I run my map-reduce job, it completes without any errors and creates the output file; however, the final step of inserting the HFiles into HBase does not happen. I get the error below after my map-reduce completes:
13/09/08 03:39:51 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:54310/user/xx.xx/output/_SUCCESS
13/09/08 03:39:51 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory output/. Does it contain files in subdirectories that correspond to column family names?
But I can see the output directory containing:
1. _SUCCESS
2. _logs
3. _0/2aa96255f7f5446a8ea7f82aa2bd299e file (which contains my data)
I have no clue why my bulk loader is not picking up the files from the output directory.
Below is the code of my Map-Reduce driver class:
public static void main(String[] args) throws Exception{
String inputFile = args[0];
String tableName = args[1];
String outFile = args[2];
Path inputPath = new Path(inputFile);
Path outPath = new Path(outFile);
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
//set the configurations
conf.set("mapred.job.tracker", "localhost:54311");
//Input data to HTable using Map Reduce
Job job = new Job(conf, "MapReduce - Word Frequency Count");
job.setJarByClass(MapReduce.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, inputPath);
fs.delete(outPath);
FileOutputFormat.setOutputPath(job, outPath);
job.setMapperClass(MapReduce.MyMap.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
HTable hTable = new HTable(conf, tableName.toUpperCase());
// Auto configure partitioner and reducer
HFileOutputFormat.configureIncrementalLoad(job, hTable);
job.waitForCompletion(true);
// Load generated HFiles into table
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path(outFile), hTable);
}
How can I figure out what is going wrong here and preventing my data from being inserted into HBase?
Finally, I figured out why my HFiles were not getting dumped into HBase. Below are the details:
My CREATE statement DDL did not specify any column family, so my guess is that Phoenix created the default column family as "_0". I was able to see this column family in my HDFS /hbase dir.
However, when I used HBase's LoadIncrementalHFiles API to fetch the files from my output directory, it was not picking up my directory named after the column family ("_0" in my case). I debugged the LoadIncrementalHFiles API code and found that it skips all directories in the output path that start with "_" (e.g. "_logs").
I re-tried the same thing, but this time specifying a column family, and everything worked perfectly fine. I am able to query the data using Phoenix SQL.
I am an amateur with Hadoop and related tools. I am trying to access the Hadoop cluster (HDFS) and retrieve the list of files from a client in Eclipse. I can do the following operations after setting up the required configurations on the Hadoop Java client.
I can perform copyFromLocalFile and copyToLocalFile operations, accessing HDFS from the client.
Here's what I am facing: when I call the listFiles() method I get
org.apache.hadoop.fs.LocatedFileStatus#d0085360
org.apache.hadoop.fs.LocatedFileStatus#b7aa29bf
Main method:
Properties props = new Properties();
props.setProperty("fs.defaultFS", "hdfs://<IPOFCLUSTER>:8020");
props.setProperty("mapreduce.jobtracker.address", "<IPOFCLUSTER>:8032");
props.setProperty("yarn.resourcemanager.address", "<IPOFCLUSTER>:8032");
props.setProperty("mapreduce.framework.name", "yarn");
FileSystem fs = FileSystem.get(toConfiguration(props)); // Setting up the required configurations
Path p4 = new Path("/user/myusername/inputjson1/");
RemoteIterator<LocatedFileStatus> ritr = fs.listFiles(p4, true);
while(ritr.hasNext())
{
System.out.println(ritr.next().toString());
}
I have also tried FileContext and ended up getting only the FileStatus object's string representation. Is there a way to get just the file names when I iterate over a remote HDFS directory? There is a method called getPath(); is that the only way to retrieve the full path of the files using the Hadoop API, or is there another method that returns only the names of the files in a specified directory path? Please help me with this, thanks.
You can indeed use getPath(); this will return a Path object which lets you query the name of the file.
Path p = ritr.next().getPath();
// returns the filename or directory name if directory
String name = p.getName();
The FileStatus object you get can tell you if this is a file or directory.
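Applied to the loop in the question, a small sketch (reusing the ritr variable from the question's code):
while (ritr.hasNext()) {
    // getPath() returns the full HDFS Path; getName() is just the final component, i.e. the file name
    System.out.println(ritr.next().getPath().getName());
}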
Here is more API documentation:
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/Path.html
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/FileStatus.html
I'm new to Hadoop and recently I was asked to do a test project using Hadoop.
So while I was reading about Big Data, I happened to learn about Pail. Now what I want to do is something like this: first create a simple object, then serialize it using Thrift and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want with it. But I have no idea how to get that object inside the map function.
Can someone please point me to some references or explain how to do that?
I can think of three options:
Use the -files option and name the file in HDFS (preferable as the task tracker will download the file once for all jobs running on that node)
Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line
Load the file directly from HDFS (less efficient as you're pulling the file over HDFS for each task)
As for some code, put the load logic into your mapper's setup(...) or configure(..) method (depending on whether you're using the new or old API) as follows:
protected void setup(Context context) throws IOException, InterruptedException {
    // Option 1: the -files option makes the named file available in the task's local working directory
    File file = new File("filename.dat");
    // open the file and load its contents ...

    // Option 3: load the file directly from HDFS
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load the file contents from the stream...
}
DistributedCache has some example code in the Javadocs
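For option 2 above (configuring the DistributedCache via API calls), a rough driver-side sketch using the old mapred-style API; the HDFS path, file name, and symlink name are just placeholders:
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheFileDriver {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf();
        // "#thrift-objects" makes the cached file appear under that name
        // in each task's working directory once symlinks are enabled
        DistributedCache.addCacheFile(new URI("/path/in/hdfs/objects.dat#thrift-objects"), jobConf);
        DistributedCache.createSymlink(jobConf);
        // ... configure mapper/reducer classes and submit the job as usual
    }
}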