How to read a Hadoop sequence file using Java

I have a sequence file generated by Spark using the saveAsObjectFile function. The file content is just some int numbers, and I want to read it locally with Java. Here is my code:
FileSystem fileSystem = null;
SequenceFile.Reader in = null;
try {
    fileSystem = FileSystem.get(conf);
    Path path = new Path("D:\\spark_sequence_file");
    in = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
    Writable key = (Writable) ReflectionUtils.newInstance(in.getKeyClass(), conf);
    BytesWritable value = new BytesWritable();
    while (in.next(key, value)) {
        byte[] val_byte = value.getBytes();
        int val = ByteBuffer.wrap(val_byte, 0, 4).getInt();
    }
} catch (IOException e) {
    e.printStackTrace();
}
But I can't read it correctly; I just get the same value repeated for every record, which is obviously wrong. Can anybody help me?

In Hadoop, keys are usually of type WritableComparable and values of type Writable. Keeping this basic concept in mind, I read a sequence file in the following way.
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    // do something with key and value
}
reader.close();
The data issue in your case might be because you used saveAsObjectFile() rather than saveAsSequenceFile(String path, scala.Option<Class<? extends org.apache.hadoop.io.compress.CompressionCodec>> codec).
Please try the latter method and see if the issue persists.
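If the file has to stay in the saveAsObjectFile() format, here is a minimal sketch of how it could be decoded, reusing conf and path from the question. It assumes the layout Spark normally uses for object files (NullWritable keys and BytesWritable values, where each value holds a Java-serialized array of records); the exact array type depends on how the RDD was saved.
// Hedged sketch: needs java.io.*, java.util.Arrays and org.apache.hadoop.io.NullWritable.
try (SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
    NullWritable key = NullWritable.get();
    BytesWritable value = new BytesWritable();
    while (reader.next(key, value)) {
        // copy only the valid bytes; getBytes() can return a larger backing array
        byte[] payload = Arrays.copyOf(value.getBytes(), value.getLength());
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(payload))) {
            Object records = ois.readObject(); // an array of the saved element type
            if (records instanceof int[]) {
                for (int v : (int[]) records) {
                    System.out.println(v);
                }
            } else if (records instanceof Object[]) {
                for (Object v : (Object[]) records) {
                    System.out.println(v);
                }
            }
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}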

Related

Create instance of file using value of a property in properties file

I'm trying to create a File instance to parse HTML records from a property value. The problem is with the URL of the file that I must put in the properties file. Here is the corresponding code for reading the file:
public void extraxtElementWithoutId() {
    Map<String, List<List<Element>>> uniciteIds = new HashMap<String, List<List<Element>>>();
    FileReader fileReader = null;
    Document doc = null;
    try {
        fileReader = new FileReader(new ClassPathResource(FILEPROPERTYNAME).getFile());
        Properties prop = new Properties();
        prop.load(fileReader);
        Enumeration<?> enumeration = prop.propertyNames();
        List<List<Element>> fiinalReturn = null;
        while (enumeration.hasMoreElements()) {
            String path = (String) enumeration.nextElement();
            System.out.println("Fichier en question : " + prop.getProperty(path));
            URL url = getClass().getResource(prop.getProperty(path));
            System.out.println(url.getPath());
            File inputFile = new File(url.getPath());
            doc = Jsoup.parse(inputFile, "UTF-8");
            //fiinalReturn = getListofElements(doc);
            //System.out.println(fiinalReturn);
            fiinalReturn = uniciteIds.put("Duplicat Id", getUnicityIds(doc));
            System.out.println(fiinalReturn);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            fileReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Thank you in advance,
Best Regards.
You are making a very common mistake on this line:
URL url = getClass().getResource(prop.getProperty(path));
Try the property value with src removed, i.e. /testHtmlFile/test.html and so on. Don't change the code:
UrlEnterer1=/testHtmlFile/test.html instead of preceding it with src.
prop.getProperty(path) should match the location of the file on your build path. Check your build directory to see how these files are stored: they are not stored under src but directly under the build directory.
This answer explains a little more about path values when reading files from the classpath.
Also, as a side note (not related to the question), consider not calling prop.getProperty(path) manually but instead injecting the property value directly into your class with the org.springframework.beans.factory.annotation.Value annotation.
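For illustration, here is a minimal sketch of that injection, assuming a Spring-managed bean; the property key html.file.path and the class name are made up for the example:
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hedged sketch: "html.file.path" is an illustrative property key, not one from the question.
@Component
public class HtmlFileLocator {

    // Injected from the properties file instead of calling prop.getProperty(...) by hand
    @Value("${html.file.path}")
    private String htmlFilePath;

    public String getHtmlFilePath() {
        return htmlFilePath;
    }
}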

Java: writing dynamically to a String array from an InputStream with IOUtils, specifically copyBytes

Having read the documentation for copyBytes (from IOUtils), we can see its parameters:
copyBytes:
public static void copyBytes(InputStream in,
OutputStream out,
int buffSize,
boolean close) throws IOException
Copies from one stream to another.
Parameters:
in - InputStream to read from
out - OutputStream to write to
buffSize - the size of the buffer
close - whether or not close the InputStream and OutputStream at the end. The streams are closed in the finally clause.
Throws:
IOException
So, with this information in mind, I've got a data structure like this:
List<String> inputLinesObject = IOUtils.readLines(in, "UTF-8");
which I hope is an extensible array list of strings that I can populate with data from the file I'm reading with that copyBytes method.
However, here's the code I use when I call the copyBytes method:
IOUtils.copyBytes(in, inputLinesObject, 4096, false);
Where you see inputLinesObject is where I'd like to put my extensible array list that collects the data and converts it to string format. The way I'm doing it now is not right, and I'm stuck: I can't see how to collect that data as an array list of strings (what is it at this point? Since it comes from an InputStream, does that make it a byte array?).
Here's the full program. It reads files from HDFS and is supposed to (though currently does not) output them to an array list of strings, which is finally logged to the console with System.out.println.
// this concatenates output to terminal from hdfs
public static void main(String[] args) throws IOException {
    // supply this as input
    String uri = args[0];
    // reading in from hdfs
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    // create arraylist for hdfs file to flow into
    //List<String> inputLinesObject = new ArrayList<String>();
    List<String> inputLinesObject = IOUtils.readLines(in, "UTF-8");
    // TODO: how to make this go to a file rather than to the System.out?
    try {
        in = fs.open(new Path(uri));
        // The way:
        IOUtils.copyBytes(in, inputLinesObject, 4096, false);
    } finally {
        IOUtils.closeStream(in);
    }
}
Use a ByteArrayOutputStream; see here:
// supply this as input
String uri = args[0];
// reading in from hdfs
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = null;
// create arraylist for hdfs file to flow into
List<String> list = new ArrayList<String>(); // Initialize List
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStream os = new DataOutputStream(baos);
try {
    in = fs.open(new Path(uri));
    // The way:
    IOUtils.copyBytes(in, os, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
byte[] data = baos.toByteArray();
String dataAsString = new String(data, "UTF-8"); // or whatever encoding
System.out.println(dataAsString);
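If the array list of strings from the question is still needed, a small hedged follow-up (reusing dataAsString from the snippet above) can split the buffered text into lines:
// Hedged follow-up: split the copied text into lines to get a List<String>.
// Needs java.util.Arrays and java.util.List.
List<String> inputLines = Arrays.asList(dataAsString.split("\r?\n"));
System.out.println(inputLines);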

Writing file to HDFS using Java

I'm trying to write a file to HDFS. The file gets created but it is empty on the cluster; however, when I run the code locally it works like a charm.
Here's my code:
FSDataOutputStream recOutputWriter = null;
FileSystem fs = null;
try {
    //OutputWriter = new FileWriter(outputFileName,true);
    Configuration configuration = new Configuration();
    fs = FileSystem.get(configuration);
    Path testOutFile = new Path(outputFileName);
    recOutputWriter = fs.create(testOutFile);
    //outputWriter = new FileWriter(outputFileName,true);
} catch (IOException e) {
    e.printStackTrace();
}
recOutputWriter.writeBytes("======================================\n");
recOutputWriter.writeBytes("OK\n");
recOutputWriter.writeBytes("======================================\n");
if (recOutputWriter != null) {
    recOutputWriter.close();
}
fs.close();
Did I miss something?
In order to write data to the file after creating it on the cluster, I had to add:
System.setProperty("HADOOP_USER_NAME", "vagrant");
Reference: Writing files to Hadoop HDFS using Scala
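For completeness, here is a hedged sketch of how that line could fit into the original code; the "vagrant" user comes from the answer above, outputFileName comes from the question, and try-with-resources is used so the stream is flushed and closed:
// Hedged sketch: set the HDFS user before obtaining the FileSystem, then write
// and close the stream so the data actually reaches the cluster.
System.setProperty("HADOOP_USER_NAME", "vagrant");
Configuration configuration = new Configuration();
try (FileSystem fs = FileSystem.get(configuration);
     FSDataOutputStream out = fs.create(new Path(outputFileName))) {
    out.writeBytes("======================================\n");
    out.writeBytes("OK\n");
    out.writeBytes("======================================\n");
}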

file not adding to DistributedCache

I am running Hadoop on my local system, in the Eclipse environment.
I tried to put a local file from the workspace into the distributed cache in the driver function like this:
DistributedCache.addCacheFile(new Path(
"/home/hduser/workspace/myDir/myFile").toUri(), conf);
but when I tried to access it from the Mapper, it returned null.
Inside the mapper, I checked whether the file was cached:
System.out.println("Cache: "+context.getConfiguration().get("mapred.cache.files"));
it prints "null", also
Path[] cacheFilesLocal = DistributedCache.getLocalCacheFiles(context.getConfiguration());
returns null.
What's going wrong?
That's because you can only add files to the DistributedCache from HDFS, not from the local file system, so the Path doesn't exist. Put the file on HDFS and use the HDFS path to refer to it when adding it to the DistributedCache.
See DistributedCache for more information.
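For example, here is a hedged sketch of that approach; the HDFS destination path /user/hduser/myFile is an assumption for illustration:
// Hedged sketch: copy the local file into HDFS first, then register the HDFS
// path with the DistributedCache. The destination path is illustrative only.
FileSystem fs = FileSystem.get(conf);
Path hdfsPath = new Path("/user/hduser/myFile");
fs.copyFromLocalFile(new Path("/home/hduser/workspace/myDir/myFile"), hdfsPath);
DistributedCache.addCacheFile(hdfsPath.toUri(), conf);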
Add file:// to the path when you add the cache file:
DistributedCache.addCacheFile(new Path("file:///home/hduser/workspace/myDir/myFile").toUri(), conf);
Try this:
Driver class
Path p = new Path(your/file/path);
FileStatus[] list = fs.globStatus(p);
for (FileStatus status : list) {
    /* Store the file in the distributed cache */
    DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}
Mapper class
public void setup(Context context) throws IOException {
    /* Access the data in the cached file */
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
    /* Access the 0th cached file */
    Path getPath = new Path(cacheFiles[0].getPath());
    /* Read the data */
    BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
    String setupData = null;
    while ((setupData = bf.readLine()) != null) {
        /* Print the file content */
        System.out.println("Setup Line " + setupData);
    }
    bf.close();
}

public void map() {
}

How do I get last modified date from a Hadoop Sequence File?

I am using a mapper that converts binary files (JPEGs) to a Hadoop sequence file (HSF):
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String uri = value.toString().replace(" ", "%20");
    Configuration conf = new Configuration();
    FSDataInputStream in = null;
    try {
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        in = fs.open(new Path(uri));
        java.io.ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte buffer[] = new byte[1024 * 1024];
        while (in.read(buffer, 0, buffer.length) >= 0) {
            bout.write(buffer);
        }
        context.write(value, new BytesWritable(bout.toByteArray()));
I then have a second mapper that reads the HSF, thus:
public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text> {

    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // get the PHash for this specific file
        String PHashStr;
        try {
            PHashStr = calculatePhash(value.getBytes());
and calculatePhash is:
static String calculatePhash(byte[] imageData) throws NoSuchAlgorithmException {
    // get the PHash for this specific data
    // PHash requires an InputStream rather than a byte array
    InputStream is = new ByteArrayInputStream(imageData);
    String ph;
    try {
        ImagePHash ih = new ImagePHash();
        ph = ih.getHash(is);
        System.out.println("file: " + is.toString() + " phash: " + ph);
    } catch (Exception e) {
        e.printStackTrace();
        return "Internal error with ImagePHash.getHash";
    }
    return ph;
}
This all works fine, but I want calculatePhash to write out each JPEG's last-modified date. I know I can use file.lastModified() to get the last-modified date of a file, but is there any way to get this in either map or calculatePhash? I'm a noob at Java. TIA!
Hi, I think what you want is the modification time of each input file that enters your mapper. If that is the case, you just have to add a few lines to mpkorstanje's solution:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
long modificationTime = fs
        .getFileStatus(((FileSplit) context.getInputSplit()).getPath())
        .getModificationTime();
With these few changes you can get the FileStatus of each InputSplit, and you can add it to your key to use later in your process, or use MultipleOutputs in your reduce phase to write it somewhere else, as shown in the sketch below.
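As an illustration of the "add it to your key" idea, here is a hedged sketch based on the first mapper from the question; the tab-separated key format is an assumption:
// Hedged sketch: tag the emitted key with the file's modification time so a
// later reduce step (or MultipleOutputs) can use it. uri, fs, value, bout and
// context are the variables from the question's first mapper.
long modificationTime = fs.getFileStatus(new Path(uri)).getModificationTime();
context.write(new Text(value.toString() + "\t" + modificationTime),
        new BytesWritable(bout.toByteArray()));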
I hope this will be useful.
I haven't used Hadoop much, but I don't think you should use file.lastModified(); Hadoop abstracts the file system somewhat.
Have you tried using FileSystem.getFileStatus(path) in map? It gets you a FileStatus object that has a modification time. Something like:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
long modificationTime = fs.getFileStatus(new Path(uri)).getModificationTime();
Use the following code snippet to get a map of all the files and their modification times under a particular directory path you provide:
private static HashMap<Path, Long> lastModifiedFileList(FileSystem fs, Path rootDir) {
    HashMap<Path, Long> modifiedList = new HashMap<Path, Long>();
    try {
        FileStatus[] status = fs.listStatus(rootDir);
        for (FileStatus file : status) {
            modifiedList.put(file.getPath(), file.getModificationTime());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return modifiedList;
}
In Hadoop, each file consists of blocks.
The Hadoop FileSystem API generally lives in the org.apache.hadoop.fs package.
If your input files are in HDFS, you need to import that package and can do the following:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
in = fs.open(new Path(uri));
org.apache.hadoop.fs.FileStatus fileStatus = fs.getFileStatus(new Path(uri));
long modificationDate = fileStatus.getModificationTime();
Date date = new Date(modificationDate);
SimpleDateFormat df2 = new SimpleDateFormat("dd/MM/yy HH:mm:ss");
String dateText = df2.format(date);
I hope this will help you.
