How do I get last modified date from a Hadoop Sequence File? - java

I am using a mapper that converts BinaryFiles (jpegs) to a Hadoop Sequence File (HSF):
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String uri = value.toString().replace(" ", "%20");
    Configuration conf = new Configuration();
    FSDataInputStream in = null;
    try {
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        in = fs.open(new Path(uri));
        java.io.ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 1024];
        int bytesRead;
        // only write the bytes actually read, not the whole buffer
        while ((bytesRead = in.read(buffer, 0, buffer.length)) > 0) {
            bout.write(buffer, 0, bytesRead);
        }
        context.write(value, new BytesWritable(bout.toByteArray()));
I then have a second mapper that reads the HSF, thus:
public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text> {
    public void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // get the pHash for this specific file
        String PHashStr;
        try {
            PHashStr = calculatePhash(value.getBytes());
and calculatePhash is:
static String calculatePhash(byte[] imageData) throws NoSuchAlgorithmException {
    // get the pHash for this specific data
    // pHash requires an InputStream rather than a byte array
    InputStream is = new ByteArrayInputStream(imageData);
    String ph;
    try {
        ImagePHash ih = new ImagePHash();
        ph = ih.getHash(is);
        System.out.println("file: " + is.toString() + " phash: " + ph);
    } catch (Exception e) {
        e.printStackTrace();
        return "Internal error with ImagePHash.getHash";
    }
    return ph;
}
This all works fine, but I want calculatePhash to write out each jpeg's last modified date. I know I can use file.lastModified() to get the last modified date of a file, but is there any way to get this in either map or calculatePhash? I'm a noob at Java. TIA!

Hi, I think what you want is the modification time of each input file that enters your mapper. If that is the case, you just have to add a few lines to mpkorstanje's solution:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
long modificationTime = fs
        .getFileStatus(((FileSplit) context.getInputSplit()).getPath())
        .getModificationTime();
With these few changes you can get the FileStatus of each input split, and you can add it to your key in order to use it later in your process, or write it somewhere else in your reduce phase using MultipleOutputs.
I hope this will be useful.
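For completeness, here is a minimal sketch of the second mapper putting this together. It assumes, as in the question, that the key of each sequence-file record is the original jpeg's URI, and that calculatePhash is the static method defined in the question; emitting the timestamp in the output value is just one illustrative choice.
import java.io.IOException;
import java.net.URI;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImagePHashMapper extends Mapper<Text, BytesWritable, Text, Text> {
    @Override
    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // the key written by the first mapper is the jpeg's URI
        String uri = key.toString();
        FileSystem fs = FileSystem.get(URI.create(uri), context.getConfiguration());
        long modificationTime = fs.getFileStatus(new Path(uri)).getModificationTime();
        String pHash;
        try {
            // note: getBytes() may return a padded buffer; use getLength() if that matters
            pHash = calculatePhash(value.getBytes());
        } catch (NoSuchAlgorithmException e) {
            pHash = "error";
        }
        // emit the pHash together with the jpeg's last-modified timestamp
        context.write(key, new Text(pHash + "\t" + modificationTime));
    }
}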

Haven't used Hadoop much but I don't think you should use file.lastModified(). Hadoop abstracted the file system somewhat.
Have you tried using FileSystem.getFileStatus(path) in map? It gets you a FileStatus object that has a modification time. Something like
FileSystem fs = FileSystem.get(URI.create(uri), conf);
long modificationTime = fs.getFileStatus(new Path(uri)).getModificationTime();

Use the following code snippet to get a map of all the files modified under the particular directory path you provide:
private static Map<Path, Long> lastModifiedFileList(FileSystem fs, Path rootDir) {
    Map<Path, Long> modifiedList = new HashMap<>();
    try {
        // list every file directly under rootDir with its modification time
        FileStatus[] status = fs.listStatus(rootDir);
        for (FileStatus file : status) {
            modifiedList.put(file.getPath(), file.getModificationTime());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return modifiedList;
}
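A hypothetical call might look like this (the /images path and the printing loop are illustrative only):
FileSystem fs = FileSystem.get(new Configuration());
Map<Path, Long> modified = lastModifiedFileList(fs, new Path("/images"));
for (Map.Entry<Path, Long> entry : modified.entrySet()) {
    System.out.println(entry.getKey() + " modified at " + new Date(entry.getValue()));
}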

In Hadoop, each file is made up of blocks.
The Hadoop file system classes are generally found in the package org.apache.hadoop.fs.
If your input files are in HDFS, you need to import the above package, then:
FileSystem fs = FileSystem.get(URI.create(uri), conf);
in = fs.open(new Path(uri));
org.apache.hadoop.fs.FileStatus fileStatus = fs.getFileStatus(new Path(uri));
long modificationDate = fileStatus.getModificationTime();
Date date = new Date(modificationDate);
SimpleDateFormat df2 = new SimpleDateFormat("dd/MM/yy HH:mm:ss");
String dateText = df2.format(date);
I hope this will help you.

Related

How to read Hadoop sequence file using Java

I have a sequence file generated by Spark using the saveAsObjectFile function. The file content is just some int numbers, and I want to read it locally with Java. Here is my code:
Configuration conf = new Configuration();
FileSystem fileSystem = null;
SequenceFile.Reader in = null;
try {
    fileSystem = FileSystem.get(conf);
    Path path = new Path("D:\\spark_sequence_file");
    in = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
    Writable key = (Writable) ReflectionUtils.newInstance(in.getKeyClass(), conf);
    BytesWritable value = new BytesWritable();
    while (in.next(key, value)) {
        byte[] val_byte = value.getBytes();
        int val = ByteBuffer.wrap(val_byte, 0, 4).getInt();
    }
} catch (IOException e) {
    e.printStackTrace();
}
But I can't read it correctly; I just get the same value for every record, and obviously that is wrong. Can anybody help me?
In Hadoop, keys are usually of type WritableComparable and values of type Writable. Keeping this basic concept in mind, I read the sequence file in the way below:
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    // do something
}
reader.close();
The data issue in your case is probably because you used saveAsObjectFile() rather than saveAsSequenceFile(String path, scala.Option<Class<? extends org.apache.hadoop.io.compress.CompressionCodec>> codec).
Please try to use the latter method and see if the issue persists.
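As a sketch of the reading side after that change, assuming the RDD was saved as (NullWritable, IntWritable) pairs — the path and types below are assumptions, not taken from the question:
// assumes the file holds (NullWritable, IntWritable) records
Configuration conf = new Configuration();
Path path = new Path("D:\\spark_sequence_file");
SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
NullWritable key = NullWritable.get();
IntWritable value = new IntWritable();
while (reader.next(key, value)) {
    System.out.println(value.get()); // each int exactly as written
}
reader.close();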

Java: writing dynamically to a String array from an InputStream with IOUtils, specifically copyBytes

Having read the documentation for copyBytes (from IOUtils), we can see its parameters:
copyBytes:
public static void copyBytes(InputStream in,
                             OutputStream out,
                             int buffSize,
                             boolean close) throws IOException
Copies from one stream to another.
Parameters:
in - InputStream to read from
out - OutputStream to write to
buffSize - the size of the buffer
close - whether or not to close the InputStream and OutputStream at the end. The streams are closed in the finally clause.
Throws:
IOException
So, with this information in mind, I've got a data structure like this:
List<String> inputLinesObject = IOUtils.readLines(in, "UTF-8");
which is what I hope would be an extensible array list of strings that I can populate with data from the file I'm reading with that copyBytes method.
However, here's the code I use when I call the copyBytes method:
IOUtils.copyBytes(in, inputLinesObject, 4096, false);
Where you see inputLinesObject is where I'd like to put my extensible array list of strings that collects the data, but the way I'm doing it now is not right, and I'm stuck: I can't see how to collect that data as an array list of strings (what is it at this point? Since it comes from an InputStream, does that make it a byte array?).
Here's the full program. It reads in files from HDFS and is supposed to (though currently does not) output them to an array list of strings, which will finally be logged to the console with System.out.println.
// this concatenates output to terminal from hdfs
public static void main(String[] args) throws IOException {
    // supply this as input
    String uri = args[0];
    // reading in from hdfs
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    // create arraylist for hdfs file to flow into
    //List<String> inputLinesObject = new ArrayList<String>();
    List<String> inputLinesObject = IOUtils.readLines(in, "UTF-8");
    // TODO: how to make this go to a file rather than to the System.out?
    try {
        in = fs.open(new Path(uri));
        // The way:
        IOUtils.copyBytes(in, inputLinesObject, 4096, false);
    } finally {
        IOUtils.closeStream(in);
    }
}
Use ByteArrayOutputStream, see here:
// supply this as input
String uri = args[0];
// reading in from hdfs
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = null;
// capture the bytes in memory, then decode them to a String
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStream os = new DataOutputStream(baos);
try {
    in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, os, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
byte[] data = baos.toByteArray();
String dataAsString = new String(data, "UTF-8"); // or whatever encoding
System.out.println(dataAsString);
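If you still want the result as a List<String>, one entry per line, a small follow-up would do it (splitting on '\n' is an assumption about the file's line endings):
// split the decoded string into lines
List<String> inputLinesObject = new ArrayList<>(Arrays.asList(dataAsString.split("\n")));
for (String line : inputLinesObject) {
    System.out.println(line);
}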

Scanner and Multithreading issues?

I have the following code to read an entire file's data:
calling method(String zipFile) {
    ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFile));
    // get the zipped file list entry
    ZipEntry ze = zis.getNextEntry();
    while (ze != null) {
        String fileName = ze.getName();
        File newFile = new File(Constants.OUTPUT_FOLDER + File.separator + fileName);
        if (ze.isDirectory()) {
            new File(newFile.getParent()).mkdirs();
        } else {
            new File(newFile.getParent()).mkdirs();
            createBlobDomain(zFile, ze);
        }
        ze = zis.getNextEntry();
    }
    zis.closeEntry();
    zis.close();
}

public String method(ZipFile zf, ZipEntry ze) {
    scan = new Scanner(zf.getInputStream(ze));
    if (scan.hasNext())
        fullText = scan.useDelimiter("\\A").next();
    return fullText;
}
Please ignore compilation issues, as I removed some code that is not really relevant here. It works fine when run from the webapp as a single instance, but if I run it from two different browsers at the same time, I hit the exception below. Please advise what could be going wrong and how to fix it.
java.util.InputMismatchException
at java.util.Scanner.throwFor(Scanner.java:840)
at java.util.Scanner.next(Scanner.java:1347)
I believe the line scan = new Scanner(zf.getInputStream(ze)); is creating the problem. What I understand from your code is that scan is an instance variable which you are reassigning to a new Scanner from every thread. I would suggest making it a local variable in your method. Correct me if I misunderstood anything.
Scanner scan = new Scanner(zf.getInputStream(ze));
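As a minimal sketch of that suggestion (the try-with-resources is an addition so each call also closes its own Scanner):
public String method(ZipFile zf, ZipEntry ze) throws IOException {
    // local variable: each call, and therefore each thread, gets its own Scanner
    try (Scanner scan = new Scanner(zf.getInputStream(ze))) {
        return scan.hasNext() ? scan.useDelimiter("\\A").next() : "";
    }
}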
It looks to me like what you want to do is copy the contents of a zip into a given folder.
Provided you use Java 7+, it's actually pretty simple to do that; this code uses java7-fs-more to help you do the job:
public static void extractZip(final String zipfile, final String dstdir)
    throws IOException
{
    final Map<String, ?> env = Collections.singletonMap("readonly", "true");
    final Path path = Paths.get(zipfile);
    final URI uri = URI.create("jar:" + path.toUri());
    try (
        final FileSystem zipfs = FileSystems.newFileSystem(uri, env);
    ) {
        MoreFiles.copyRecursive(zipfs.getPath("/"), Paths.get(dstdir),
            RecursionMode.FAIL_FAST);
    }
}
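A hypothetical call (the paths are illustrative; MoreFiles and RecursionMode come from the third-party java7-fs-more library):
extractZip("/path/to/archive.zip", "/path/to/output-dir");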

Write some part of a bson file into another bson file in java

I have a BSON file (a.bson). I want to read this file, extract some parts of it, and then save those parts into another BSON file (b.bson).
Currently, I can read my source file (a.bson) using org.bson.BSONDecoder and extract the parts I want (e.g., key1 and key2 for each row of the source file). Now I want to save these data in another BSON file (b.bson). I need to save the data in a BSON file because that format keeps the structure, so I can easily check whether a row contains a special value or not. I wrote this code:
import org.bson.BSONDecoder;
import org.bson.BSONObject;
import org.bson.BasicBSONDecoder;

public static void createmyFile(File sourceFile) throws FileNotFoundException, IOException {
    InputStream inputStream = new BufferedInputStream(new FileInputStream(sourceFile));
    BSONDecoder decoder = new BasicBSONDecoder();
    try {
        while (inputStream.available() > 0) {
            BSONObject bsonSingleRow = decoder.readObject(inputStream);
            // ---------------------------------------------------------------------
            // Write bsonSingleRow.get(key1) & bsonSingleRow.get(key2) into new file
            // ---------------------------------------------------------------------
        }
    } catch (IOException e) {
        // ...
    }
}
Please help me complete the above code.
For example, if you want a randomly selected 2% of the data from the source file:
File file = new File("a.bson");
String path = "b.bson";
BasicBSONEncoder encoder = new BasicBSONEncoder();
InputStream inputStream = new BufferedInputStream(new FileInputStream(file));
BSONDecoder decoder = new BasicBSONDecoder();
try {
    while (inputStream.available() > 0) {
        BSONObject bsonSingleRow = decoder.readObject(inputStream);
        String c = bsonSingleRow.get("yourKey").toString();
        if (Math.random() > 0.98) {
            Files.write(Paths.get(path), encoder.encode(bsonSingleRow),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
} catch (IOException e) {
    // ...
}
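To write only the fields you asked about (key1 and key2; the names are placeholders from the question), a sketch could copy them into a fresh BasicBSONObject inside the loop above before encoding:
// build a new document containing only the fields of interest
BSONObject extracted = new BasicBSONObject();
extracted.put("key1", bsonSingleRow.get("key1"));
extracted.put("key2", bsonSingleRow.get("key2"));
Files.write(Paths.get(path), encoder.encode(extracted),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);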

Hadoop Map Task : Read the content of a specified input file

I'm pretty new to the Hadoop environment. Recently, I ran a basic MapReduce program, and it was easy to run.
Now I have an input file with the following contents inside the input path directory:
fileName1
fileName2
fileName3
...
I need to read the lines of this file one by one and create a new file with each of those names (i.e., fileName1, fileName2, and so on) in the specified output directory.
I wrote the map implementation below, but it didn't work out:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    String fileName = value.toString();
    String path = outputFilePath + File.separator + fileName;
    File newFile = new File(path);
    newFile.mkdirs();
    newFile.createNewFile();
}
Can somebody explain what I've missed?
Thanks
I think you should get started by studying the FileSystem class; as far as I know, you can only create files in the distributed filesystem through it. Here's a code example where I opened a file for reading (you probably just need an FSDataOutputStream instead). In your mapper you can get your configuration out of the Context class.
Configuration conf = job.getConfiguration();
Path inFile = new Path(file);
try {
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(inFile))
        System.out.println("Unable to open settings file: " + file);
    FSDataInputStream in = fs.open(inFile);
    // ...
}
First of all, get the path of the input directory inside your mapper with the help of FileSplit. Then append to it the name of the file which contains all these lines, and read that file's lines using FSDataInputStream. Something like this:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    FileSystem fs = FileSystem.get(context.getConfiguration());
    FSDataInputStream in = fs.open(new Path(fileSplit.getPath().getParent() + "/file.txt"));
    while (in.available() > 0) {
        // create an empty file named after each line read
        FSDataOutputStream out = fs.create(new Path(in.readLine()));
        out.close();
    }
    // Proceed further....
}
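Note that FSDataInputStream.readLine() is deprecated; a variant using a BufferedReader instead (the file.txt name is carried over from the snippet above) might look like:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
FileSystem fs = FileSystem.get(context.getConfiguration());
Path listing = new Path(fileSplit.getPath().getParent(), "file.txt");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(listing)))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // create an empty file in HDFS for each name listed
        fs.create(new Path(line)).close();
    }
}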
