Hadoop Map Task: Read the content of a specified input file - java

I'm pretty new to the Hadoop environment. Recently, I ran a basic MapReduce program; it was easy to run.
Now I have an input file with the following contents inside the input path directory:
fileName1
fileName2
fileName3
...
I need to read the lines of this file one by one and create a new file with each of those names (i.e. fileName1, fileName2, and so on) in a specified output directory.
I wrote the map implementation below, but it didn't work:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    String fileName = value.toString();
    String path = outputFilePath + File.separator + fileName;
    File newFile = new File(path);
    newFile.mkdirs();
    newFile.createNewFile();
}
Can somebody explain what I've missed?
Thanks

I think you should get started by studying the FileSystem class; as far as I know, you can only create files in the distributed filesystem. Here's a code example where I opened a file for reading; you probably just need an FSDataOutputStream. In your mapper you can get your configuration out of the Context class.
Configuration conf = job.getConfiguration();
Path inFile = new Path(file);
try {
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(inFile))
        System.out.println("Unable to open settings file: " + file);
    FSDataInputStream in = fs.open(inFile);
    ...
} catch (IOException e) {
    e.printStackTrace();
}
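As a rough sketch of the write side (this assumes the newer org.apache.hadoop.mapreduce API and a hypothetical output.dir property you would set in the driver; it is not the only way to do it), creating one empty HDFS file per input line from inside the mapper could look roughly like this:
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FileCreatingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "output.dir" is a made-up property; set it in the driver, e.g.
        // conf.set("output.dir", "/user/hduser/generated");
        String outputDir = context.getConfiguration().get("output.dir");
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path newFile = new Path(outputDir, value.toString().trim());
        // create() returns an FSDataOutputStream; closing it right away leaves an empty file
        FSDataOutputStream out = fs.create(newFile);
        out.close();
    }
}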

First, get the path of the input directory inside your mapper with the help of FileSplit. Then append to it the name of the file which contains all these lines, and read the lines of this file using FSDataInputStream. Something like this:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    FileSystem fs = FileSystem.get(context.getConfiguration());
    FSDataInputStream in = fs.open(new Path(fileSplit.getPath().getParent() + "/file.txt"));
    String line;
    while ((line = in.readLine()) != null) {
        // create an (empty) file named after each line and close it right away
        FSDataOutputStream out = fs.create(new Path(line));
        out.close();
    }
    in.close();
    // Proceed further....
}

Related

Java: How do I access my properties file?

So I have a property file in my project. I need to access it.
Here's the tree structure:
+ Project Name
|--+ folder1
|--+ propertyfolder
|  |--- file.properties
Or: Project/propertyfolder/file.properties
Here's what I've tried so far (one at a time, not all at once):
// error: java.io.File.<init>(Unknown Source)
File file = new File(System.getProperty("file.properties"));
File file = new File(System.getProperty("propertyfolder/file.properties"));
File file = new File(System.getProperty("propertyfolder\\file.properties"));
File file = new File(System.getProperty("../../propertyfolder/file.properties"));
And:
InputStream inputStream = getClass().getResourceAsStream("file.properties");
InputStream inputStream = getClass().getResourceAsStream("../../propertyfolder/file.properties");
InputStream inputStream = getClass().getResourceAsStream("propertyfolder/file.properties");
InputStream inputStream = getClass().getResourceAsStream("propertyfolder\\file.properties");
And all variations within getClass(), such as getClass().getClassLoader(), etc.
The error I'm getting is a NullPointerException; it's not finding the file. How do I find it correctly?
(taken from comment to answer as OP suggested)
Just use File file = new File("propertyfolder/file.properties"), but you do need to know the Java process's working directory; if you cannot control it, try an absolute path such as /c:/myapp/propertyfolder/file.properties.
You may also use the /myapp/propertyfolder/file.properties path without the C: drive letter to avoid a Windows-only mapping. You can use the / path separator in Java apps; it works on Windows, Linux, and macOS. Watch out for the text file encoding; use InputStreamReader to give an encoding parameter.
File file = new File("propertyfolder/file.properties");
InputStreamReader isr = new InputStreamReader(new FileInputStream(file), "UTF-8");
BufferedReader reader = new BufferedReader(isr);
..read...
reader.close(); // this will close the underlying FileInputStream
In order to use getClass().getResourceAsStream("file.properties") you need to make sure the file is on the classpath.
That is, if your Test.java file is compiled into bin/Test.class, then make sure file.properties is in the bin/ folder along with Test.class.
Otherwise you can use the absolute path, which is not advisable.
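For example, a minimal sketch assuming file.properties really does end up at the root of the classpath (e.g. in bin/ next to Test.class):
Properties props = new Properties();
// the leading "/" makes the lookup relative to the classpath root, not to Test's package
try (InputStream in = Test.class.getResourceAsStream("/file.properties")) {
    if (in == null) {
        throw new FileNotFoundException("file.properties not found on the classpath");
    }
    props.load(in);
}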
Did you set system properties to load file.properties from
1) the command line using -Dpropertyname=value, or
2) the System.setProperty() API, or
3) the System.load(fileName) API?
If you haven't done any one of them, do not use System.getProperty() to locate the file.properties file.
Assuming you have not done any of the above three, the best way to create the file InputStream is
InputStream inputStream = getClass().getResourceAsStream("<file.properties path from classpath without />");
Properties extends Hashtable, so each key and its corresponding value in the property list is a String.
Properties props = new Properties();

// File - reads from the project folder.
InputStream fileStream = new FileInputStream("applicationPATH.properties");
props.load(fileStream);

// ClassLoader - reads from the src folder (standalone application).
ClassLoader appClassLoader = ReadPropertyFile.class.getClassLoader();
props.load(appClassLoader.getResourceAsStream("classPATH.properties"));
for (String key : props.stringPropertyNames()) {
    System.out.format("%s : %s \n", key, props.getProperty(key));
}

// ResourceBundle - reads from the src folder.
ResourceBundle rb = ResourceBundle.getBundle("resourcePATH"); // resourcePATH.properties
Enumeration<String> keys = rb.getKeys();
while (keys.hasMoreElements()) {
    String key = keys.nextElement();
    System.out.format(" %s = %s \n", key, rb.getString(key));
}

// ClassLoader - web application: src folder (or) /WEB-INF/classes/
ClassLoader webappClassLoader = Thread.currentThread().getContextClassLoader();
props.load(webappClassLoader.getResourceAsStream("webprops.properties"));
To read properties from a specific folder, construct the path from the project root:
InputStream fileStream = new FileInputStream("propertyfolder/file.properties");
If the key:value pairs are specified in a .txt file:
public static void readTxtFile_KeyValues() throws IOException {
    props.load(new FileReader("keyValue.txt"));
    // Display all the values in the form of key value
    for (String key : props.stringPropertyNames()) {
        String value = props.getProperty(key);
        System.out.println("Key = " + key + " \t Value = " + value);
    }
    props.clear();
}

file not adding to DistributedCache

I am running Hadoop on my local system, in the Eclipse environment.
I tried to put a local file from the workspace into the distributed cache in the driver function, like this:
DistributedCache.addCacheFile(new Path(
"/home/hduser/workspace/myDir/myFile").toUri(), conf);
but when I try to access it from the mapper, it returns null.
Inside the mapper, I checked whether the file was cached:
System.out.println("Cache: "+context.getConfiguration().get("mapred.cache.files"));
it prints "null", and also
Path[] cacheFilesLocal = DistributedCache.getLocalCacheFiles(context.getConfiguration());
returns null.
What's going wrong?
That's because you can only add files to the DistributedCache from HDFS, not from the local file system, so the Path doesn't exist. Put the file on HDFS and use the HDFS path to refer to it when adding it to the DistributedCache.
See DistributedCache for more information.
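For example, a rough sketch of the driver side (the HDFS destination /cache/myFile below is just an assumed location for illustration):
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path localFile = new Path("/home/hduser/workspace/myDir/myFile");
Path hdfsFile = new Path("/cache/myFile");                // assumed HDFS destination
fs.copyFromLocalFile(false, true, localFile, hdfsFile);   // don't delete the source, overwrite the target
DistributedCache.addCacheFile(hdfsFile.toUri(), conf);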
Add file:// to the path when you add the cache file:
DistributedCache.addCacheFile(new Path("file:///home/hduser/workspace/myDir/myFile").toUri(), conf);
Try this.
DRIVER class
FileSystem fs = FileSystem.get(conf);
Path p = new Path("your/file/path");
FileStatus[] list = fs.globStatus(p);
for (FileStatus status : list) {
    /* Store each matched file in the distributed cache */
    DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}
Mapper class
public void setup(Context context) throws IOException {
    /* Accessing data in the cached file */
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
    /* Accessing the 0th cached file */
    Path getPath = new Path(cacheFiles[0].getPath());
    /* Read data */
    BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
    String setupData = null;
    while ((setupData = bf.readLine()) != null) {
        /* Print file content */
        System.out.println("Setup Line " + setupData);
    }
    bf.close();
}
public void map(){
}

How to change the output file name from part-00000 in reducer to inputfile name

Currently I am able to implement the name change from part-00000 to a custom file name in the mapper; I do this by taking the InputSplit. I tried the same in the reducer to rename the file, but the FileSplit approach is not available there. So, is there a good way to rename the output of a reducer to the input file name? Below is how I achieved it in the mapper.
@Override
public void setup(Context con) throws IOException, InterruptedException {
    fileName = ((FileSplit) con.getInputSplit()).getPath().getName();
    fileName = fileName.substring(0, 36);
    outputName = new Text(fileName);
    final Path baseOutputPath = FileOutputFormat.getOutputPath(con);
    final Path outputFilePath = new Path(baseOutputPath, fileName);
    TextOutputFormat<IntWritable, Text> write = new TextOutputFormat<IntWritable, Text>() {
        @Override
        public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
            return outputFilePath;
        }
    };
}
This is what the Hadoop wiki says:
You can subclass the OutputFormat.java class and write your own. You can locate and browse the code of TextOutputFormat, MultipleOutputFormat.java, etc. for reference. It might be the case that you only need to make minor changes to one of the existing OutputFormat classes; to do that, just subclass that class and override the methods you need to change.
If the output name needs to depend on the key or on the input file, you could create a subclass of MultipleOutputFormat to control the output file name.
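As a hedged sketch of that subclassing approach (not taken from the wiki), an OutputFormat that names the reducer's file after a job property could look roughly like this; output.basename is a made-up property you would set in the driver, for example to the input file name discovered in the mapper:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class NamedTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
        FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
        // "output.basename" is a hypothetical property, e.g. conf.set("output.basename", fileName)
        String baseName = context.getConfiguration().get("output.basename", "part");
        return new Path(committer.getWorkPath(), baseName + extension);
    }
}
In the driver you would then register it with job.setOutputFormatClass(NamedTextOutputFormat.class). Note that with more than one reducer every task would try to use the same name, so this sketch only really makes sense with a single reducer or with a per-task suffix added.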

How do I get last modified date from a Hadoop Sequence File?

I am using a mapper that converts BinaryFiles (jpegs) to a Hadoop Sequence File (HSF):
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String uri = value.toString().replace(" ", "%20");
    Configuration conf = new Configuration();
    FSDataInputStream in = null;
    try {
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        in = fs.open(new Path(uri));
        java.io.ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 1024];
        int bytesRead;
        while ((bytesRead = in.read(buffer, 0, buffer.length)) > 0) {
            // only write the bytes actually read, not the whole buffer
            bout.write(buffer, 0, bytesRead);
        }
        context.write(value, new BytesWritable(bout.toByteArray()));
I then have a second mapper that reads the HSF, thus:
public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text>{
public void map(Text key, BytesWritable value, Context context) throws IOException,InterruptedException {
//get the PHash for this specific file
String PHashStr;
try {
PHashStr = calculatePhash(value.getBytes());
and calculatePhash is:
static String calculatePhash(byte[] imageData) throws NoSuchAlgorithmException {
    // get the PHash for this specific data
    // PHash requires an InputStream rather than a byte array
    InputStream is = new ByteArrayInputStream(imageData);
    String ph;
    try {
        ImagePHash ih = new ImagePHash();
        ph = ih.getHash(is);
        System.out.println("file: " + is.toString() + " phash: " + ph);
    } catch (Exception e) {
        e.printStackTrace();
        return "Internal error with ImagePHash.getHash";
    }
    return ph;
}
This all works fine, but I want calculatePhash to write out each jpeg's last modified date. I know I can use file.lastModified() to get the last modified date in a file but is there any way to get this in either map or calculatePhash? I'm a noob at Java. TIA!
Hi, I think what you want is the modification time of each input file that enters your mapper. If that is the case, you just have to add a few lines to mpkorstanje's solution:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
long modificationTime = fs
        .getFileStatus(((FileSplit) context.getInputSplit()).getPath())
        .getModificationTime();
With these few changes you can get the FileStatus of each InputSplit, and you can add it to your key in order to use it later in your process, or do a MultipleOutputs reduce and write it somewhere else in your reduce phase.
I hope this will be useful.
I haven't used Hadoop much, but I don't think you should use file.lastModified(); Hadoop abstracts the file system somewhat.
Have you tried using FileSystem.getFileStatus(path) in map? It gets you a FileStatus object that has a modification time. Something like:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
long modificationTime = fs.getFileStatus(new Path(uri)).getModificationTime();
Use the following code snippet to get a map of the modification times of all the files under the directory path you provide:
private static HashMap<Path, Long> lastModifiedFileList(FileSystem fs, Path rootDir) {
    HashMap<Path, Long> modifiedList = new HashMap<>();
    try {
        FileStatus[] status = fs.listStatus(rootDir);
        for (FileStatus file : status) {
            modifiedList.put(file.getPath(), file.getModificationTime());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return modifiedList;
}
In Hadoop, each file consists of blocks.
Generally, the Hadoop FileSystem classes are in the package org.apache.hadoop.fs.
If your input files are present in HDFS, you need to import that package and then:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
in = fs.open(new Path(uri));
org.apache.hadoop.fs.FileStatus fileStatus = fs.getFileStatus(new Path(uri));
long modificationDate = fileStatus.getModificationTime();
Date date = new Date(modificationDate);
SimpleDateFormat df2 = new SimpleDateFormat("dd/MM/yy HH:mm:ss");
String dateText = df2.format(date);
I hope this will help you.

Append data to existing file in HDFS Java

I'm having trouble appending data to an existing file in HDFS. I want the method to append a line if the file exists, and to create a new file with the given name if it does not.
Here's my method to write into HDFS.
if (!file.exists(path)) {
    file.createNewFile(path);
}
FSDataOutputStream fileOutputStream = file.append(path);
BufferedWriter br = new BufferedWriter(new OutputStreamWriter(fileOutputStream));
br.append("Content: " + content + "\n");
br.close();
This method does write to HDFS and creates the file, but as I mentioned, it does not append.
This is how I test my method:
RunTimeCalculationHdfsWrite.hdfsWriteFile("RunTimeParserLoaderMapperTest2", "Error message test 2.2", context, null);
The first param is the name of the file, the second is the message, and the other two params are not important.
So anyone have an idea what I'm missing or doing wrong?
Actually, you can append to an HDFS file:
From the perspective of the client, the append operation first calls append on DistributedFileSystem; this operation returns a stream object, FSDataOutputStream out. If the client needs to append data to this file, it can call out.write to write and out.close to close.
I checked the HDFS sources; there is a DistributedFileSystem#append method:
FSDataOutputStream append(Path f, final int bufferSize, final Progressable progress) throws IOException
For details, see the presentation.
Also you can append through command line:
hdfs dfs -appendToFile <localsrc> ... <dst>
Add lines directly from stdin:
echo "Line-to-add" | hdfs dfs -appendToFile - <dst>
Solved..!!
Append is supported in HDFS.
You just have to do some configuration and use the simple code shown below.
Step 1: Set dfs.support.append to true in hdfs-site.xml:
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>
Stop all your daemon services using stop-all.sh and restart them using start-all.sh.
Step 2 (optional): Only if you have a single-node cluster, set the replication factor to 1, as below.
From the command line:
./hdfs dfs -setrep -R 1 filepath/directory
Or you can do the same at run time through Java code:
fsShell.setrepr((short) 1, filePath);
Step 3: Code for creating/appending data into the file:
public void createAppendHDFS() throws IOException {
    Configuration hadoopConfig = new Configuration();
    hadoopConfig.set("fs.defaultFS", hdfsuri);
    FileSystem fileSystem = FileSystem.get(hadoopConfig);
    String filePath = "/test/doc.txt";
    Path hdfsPath = new Path(filePath);
    fsShell.setrepr((short) 1, filePath); // optional, from step 2 (single-node cluster only)
    FSDataOutputStream fileOutputStream = null;
    try {
        if (fileSystem.exists(hdfsPath)) {
            fileOutputStream = fileSystem.append(hdfsPath);
            fileOutputStream.writeBytes("appending into file. \n");
        } else {
            fileOutputStream = fileSystem.create(hdfsPath);
            fileOutputStream.writeBytes("creating and writing into file\n");
        }
    } finally {
        if (fileSystem != null) {
            fileSystem.close();
        }
        if (fileOutputStream != null) {
            fileOutputStream.close();
        }
    }
}
Kindly let me know for any other help.
Cheers.!!
HDFS does not allow append operations. One way to implement the same functionality as appending is the following (a rough sketch is given after the list):
1. Check if the file exists.
2. If the file doesn't exist, create a new file and write to it.
3. If the file exists, create a temporary file.
4. Read each line from the original file and write that same line to the temporary file (don't forget the newline).
5. Write the lines you want to append to the temporary file.
6. Finally, delete the original file and move (rename) the temporary file to the original file name.
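A rough sketch of that approach, under the assumption that the file is small enough to rewrite in full (the method name and ".tmp" suffix are placeholders, and "content" is the string to append as in the question's snippet):
public static void appendLine(FileSystem fs, Path original, String content) throws IOException {
    Path temp = new Path(original.toString() + ".tmp");
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fs.create(temp, true)));
    if (fs.exists(original)) {
        // copy the existing content line by line
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(original)));
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.newLine();
        }
        reader.close();
    }
    // the line we actually wanted to "append"
    writer.write("Content: " + content);
    writer.newLine();
    writer.close();
    // replace the original with the temporary file
    if (fs.exists(original)) {
        fs.delete(original, false);
    }
    fs.rename(temp, original);
}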
