Sequence Files in Hadoop - java

How are these sequence files generated? I saw a link about sequence files here:
http://wiki.apache.org/hadoop/SequenceFile
Are these written using the default Java serializer? And how do I read a sequence file?

Sequence files are generated by MapReduce tasks and can be used as a common format for transferring data between MapReduce jobs.
You can read them in the following manner:
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    // perform some operation on the key/value pair
}
reader.close();
You can also generate sequence files yourself using SequenceFile.Writer (a sketch follows the Maven dependency below).
The classes used in the example are the following:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
They are contained within the hadoop-core Maven dependency:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>
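For completeness, here is a minimal sketch of writing a sequence file with the same hadoop-core API; the Text/IntWritable key and value types (also from org.apache.hadoop.io) and the sample record are illustrative choices, not taken from the question:
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
// the key and value classes are fixed when the writer is created
SequenceFile.Writer writer = SequenceFile.createWriter(FileSystem.get(config), config, path,
        Text.class, IntWritable.class);
try {
    writer.append(new Text("some-key"), new IntWritable(42));
} finally {
    writer.close();
}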

Thanks to Lev Khomich's answer, my problem has been solved.
However, that solution has been deprecated for a while; the newer API offers more features and is also easier to use.
Check out the source code of hadoop.io.SequenceFile for details:
Configuration config = new Configuration();
Path path = new Path("/Users/myuser/sequencefile");
SequenceFile.Reader reader = new SequenceFile.Reader(config, SequenceFile.Reader.file(path));
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    System.out.println(key);
    System.out.println(value);
    System.out.println("------------------------");
}
reader.close();
For extra info, here is the sample output when running against a data file generated by the Nutch injector:
------------------------
https://wiki.openoffice.org/wiki/Ru/FAQ
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Apr 13 16:12:59 MDT 2014
Modified time: Wed Dec 31 17:00:00 MST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
------------------------
https://www.bankhapoalim.co.il/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Apr 13 16:12:59 MDT 2014
Modified time: Wed Dec 31 17:00:00 MST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
Thanks!

Related

Getting unknown integer value in Map-reduce output file

I am working on a Hadoop MapReduce program where I am not setting the mapper and reducer, and not setting any other parameters on the Job configuration from my program. I did so assuming that the Job would send the same output as the input to the output file.
But what I found is that it is printing some dummy integer value in the output file, separated from every line by a tab (I guess).
Here is my code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MinimalMapReduce extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        String[] argg = {"/Users/***/Documents/hadoop/input/input.txt",
                "/Users/***/Documents/hadoop/output_MinimalMapReduce"};
        try {
            int exitCode = ToolRunner.run(new MinimalMapReduce(), argg);
            System.exit(exitCode);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
And here is the input:
2011 22
2011 25
2012 40
2013 35
2013 38
2014 44
2015 43
And here is the output:
0 2011 22
8 2011 25
16 2012 40
24 2013 35
32 2013 38
40 2014 44
48 2015 43
How can I get the same output as the input?
I did so assuming that the Job would send the same output as the input to the output file
You were correct in assuming that. Technically, you are getting whatever you have in the file as the output. Remember that mappers and reducers take a key-value pair as input.
The input to a mapper is an input split of the file, and the input to a reducer is the output of the mapper(s).
But what I found is that it is printing some dummy integer value in the output file, separated from every line by a tab
These dummy integers are nothing but the offsets of the lines from the start of the file. Since each row consists of [4 DIGITS]<space>[2 DIGITS]<new-line>, i.e. 8 bytes, your offsets are multiples of eight.
Why are you getting this offset when you haven't defined any mapper or reducer, you might ask? This is because a mapper always runs; the default one simply passes each offset/line pair through unchanged and is referred to as the IdentityMapper.
How can I get the same output as the input?
Well, you can define a mapper that just maps the input lines to the output and strips the offsets:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Some cool logic here
}
In the above code, key contains the dummy integer value, i.e. the offset, and value contains the text of each line, one at a time.
You can write the value out using the context.write function, then use no reducer and set job.setNumReduceTasks(0) to get the desired output.
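As a rough sketch (the class name and the NullWritable output value are my own choices, not taken from the question), a mapper like this would do it:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EchoMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // drop the byte-offset key and emit only the line itself
        context.write(value, NullWritable.get());
    }
}
In the driver you would then add job.setMapperClass(EchoMapper.class), job.setOutputKeyClass(Text.class), job.setOutputValueClass(NullWritable.class) and job.setNumReduceTasks(0); the default TextOutputFormat writes only the key when the value is a NullWritable, so the output lines match the input lines.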
I agree with philantrovert's answer, but here are some more details I found.
According to Hadoop: The Definitive Guide, it is TextInputFormat that produces these offsets as keys. Here is its documentation of TextInputFormat:
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (e.g., newline or carriage return), and is packaged as a Text object. So, a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
Clearly, the keys are not line numbers. This would be impossible to implement in general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed independently. Line numbers are really a sequential notion. You have to keep a count of lines as you consume them, so knowing the line number within a split would be possible, but not within the file.
However, the offset within the file of each line is known by each split independently of the other splits, since each split knows the size of the preceding splits and just adds this onto the offsets within the split to produce a global file offset. The offset is usually sufficient for applications that need a unique identifier for each line. Combined with the file’s name, it is unique within the filesystem. Of course, if all the lines are a fixed width, calculating the line number is simply a matter of dividing the offset by the width.

sklearn2pmml and jpmml-sklearn Usage Errors

I recently came across sklearn2pmml and jpmml-sklearn when looking for a way to convert scikit-learn models to PMML. However, I've been hitting errors when trying the basic usage examples that I'm unable to figure out.
When attempting the usage example in sklearn2pmml, I've been receiving the following issue around casting a long to an int:
Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at numpy.core.NDArrayUtil.getShape(NDArrayUtil.java:66)
at org.jpmml.sklearn.ClassDictUtil.getShape(ClassDictUtil.java:92)
at org.jpmml.sklearn.ClassDictUtil.getShape(ClassDictUtil.java:76)
at sklearn.linear_model.BaseLinearClassifier.getCoefShape(BaseLinearClassifier.java:144)
at sklearn.linear_model.BaseLinearClassifier.getNumberOfFeatures(BaseLinearClassifier.java:56)
at sklearn.Classifier.createSchema(Classifier.java:50)
at org.jpmml.sklearn.Main.run(Main.java:104)
at org.jpmml.sklearn.Main.main(Main.java:87)
Traceback (most recent call last):
File "C:\Users\user\workspace\sklearn_pmml\test.py", line 40, in <module>
sklearn2pmml(iris_classifier, iris_mapper, "LogisticRegressionIris.pmml")
File "C:\Python27\lib\site-packages\sklearn2pmml\__init__.py", line 49, in sklearn2pmml
os.remove(dump)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\user\\appdata\\local\\temp\\tmpmxyp2y.pkl'
Any suggestions as to what is going on here?
Usage code:
#
# Step 1: feature engineering
#
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import pandas
import sklearn_pandas
iris = load_iris()
iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pandas.DataFrame(iris.target, columns = ["Species"])), axis = 1)
iris_mapper = sklearn_pandas.DataFrameMapper([
(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], PCA(n_components = 3)),
("Species", None)
])
iris = iris_mapper.fit_transform(iris_df)
#
# Step 2: training a logistic regression model
#
from sklearn.linear_model import LogisticRegressionCV
iris_X = iris[:, 0:3]
iris_y = iris[:, 3]
iris_classifier = LogisticRegressionCV()
iris_classifier.fit(iris_X, iris_y)
#
# Step 3: conversion to PMML
#
from sklearn2pmml import sklearn2pmml
sklearn2pmml(iris_classifier, iris_mapper, "LogisticRegressionIris.pmml")
EDIT 12/6:
After the new update, the same issue comes up farther down the line:
Dec 06, 2015 5:56:49 PM sklearn_pandas.DataFrameMapper updatePMML
INFO: Updating 1 target field and 3 active field(s)
Dec 06, 2015 5:56:49 PM sklearn_pandas.DataFrameMapper updatePMML
INFO: Mapping target field y to Species
Dec 06, 2015 5:56:49 PM sklearn_pandas.DataFrameMapper updatePMML
INFO: Mapping active field(s) [x1, x2, x3] to [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width]
Traceback (most recent call last):
File "C:\Users\user\workspace\sklearn_pmml\test.py", line 40, in <module>
sklearn2pmml(iris_classifier, iris_mapper, "LogisticRegressionIris.pmml")
File "C:\Python27\lib\site-packages\sklearn2pmml\__init__.py", line 49, in sklearn2pmml
os.remove(dump)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\user\\appdata\\local\\temp\\tmpqeblat.pkl'
JPMML-SkLearn expected ndarray.shape to be a tuple of i4 values (mapped to java.lang.Integer by the Pyrolite library). However, in this case it was a tuple of i8 values (mapped to java.lang.Long); hence the cast exception.
This issue has been addressed in JPMML-SkLearn commit f7c16ac2fb.
If you encounter another exception (data translation between platforms can be tricky), then you should also open a JPMML-SkLearn issue about it.

ZipException thrown when using jogl in eclipse

This is the Eclipse project build path:
Files in the project:
rob#work:~/git/thegame$ ll lib/linux32/
total 708
drwxr-xr-x 2 rob rob 4096 Mar 22 02:37 ./
drwxr-xr-x 4 rob rob 4096 Mar 22 02:23 ../
-rw-r--r-- 1 rob rob 8704 Mar 10 14:00 libgluegen-rt.so
-rw-r--r-- 1 rob rob 666380 Mar 11 03:22 libjogl_desktop.so
-rw-r--r-- 1 rob rob 5944 Mar 11 03:22 libnativewindow_awt.so
-rw-r--r-- 1 rob rob 26604 Mar 11 03:22 libnativewindow_x11.so
rob#work:~/git/thegame$ ll jar/
total 3308
drwxr-xr-x 3 rob rob 4096 Mar 22 02:28 ./
drwxr-xr-x 8 rob rob 4096 Mar 22 02:22 ../
-rw-r--r-- 1 rob rob 289171 Mar 10 14:00 gluegen-rt.jar
drwxr-xr-x 4 rob rob 4096 Mar 22 02:28 javadocs/
-rw-r--r-- 1 rob rob 3082066 Mar 11 03:23 jogl-all.jar
Error when trying to execute application:
Catched ZipException: error in opening zip file, while addNativeJarLibsImpl(classFromJavaJar class com.jogamp.common.os.Platform, classJarURI jar:file:/home/rob/git/thegame/jar/gluegen-rt.jar!/com/jogamp/common/os/Platform.class, nativeJarBaseName gluegen-rt-natives-linux-i586.jar): [ file:/home/rob/git/thegame/jar/gluegen-rt.jar -> file:/home/rob/git/thegame/jar/ ] + gluegen-rt-natives-linux-i586.jar -> slim: jar:file:/home/rob/git/thegame/jar/gluegen-rt-natives-linux-i586.jar!/
Catched ZipException: error in opening zip file, while addNativeJarLibsImpl(classFromJavaJar class jogamp.nativewindow.NWJNILibLoader, classJarURI jar:file:/home/rob/git/thegame/jar/jogl-all.jar!/jogamp/nativewindow/NWJNILibLoader.class, nativeJarBaseName jogl-all-natives-linux-i586.jar): [ file:/home/rob/git/thegame/jar/jogl-all.jar -> file:/home/rob/git/thegame/jar/ ] + jogl-all-natives-linux-i586.jar -> slim: jar:file:/home/rob/git/thegame/jar/jogl-all-natives-linux-i586.jar!/
Catched IOException: TempJarCache: addNativeLibs: jar:file:/home/rob/git/thegame/jar/jogl-all-natives-linux-i586.jar!/, previous load attempt failed, while addNativeJarLibsImpl(classFromJavaJar class jogamp.nativewindow.NWJNILibLoader, classJarURI jar:file:/home/rob/git/thegame/jar/jogl-all.jar!/jogamp/nativewindow/NWJNILibLoader.class, nativeJarBaseName jogl-all-natives-linux-i586.jar): [ file:/home/rob/git/thegame/jar/jogl-all.jar -> file:/home/rob/git/thegame/jar/ ] + jogl-all-natives-linux-i586.jar -> slim: jar:file:/home/rob/git/thegame/jar/jogl-all-natives-linux-i586.jar!/
Catched IOException: TempJarCache: addNativeLibs: jar:file:/home/rob/git/thegame/jar/jogl-all-natives-linux-i586.jar!/, previous load attempt failed, while addNativeJarLibsImpl(classFromJavaJar class jogamp.nativewindow.NWJNILibLoader, classJarURI jar:file:/home/rob/git/thegame/jar/jogl-all.jar!/jogamp/nativewindow/NWJNILibLoader.class, nativeJarBaseName jogl-all-natives-linux-i586.jar): [ file:/home/rob/git/thegame/jar/jogl-all.jar -> file:/home/rob/git/thegame/jar/ ] + jogl-all-natives-linux-i586.jar -> slim: jar:file:/home/rob/git/thegame/jar/jogl-all-natives-linux-i586.jar!/
Main.java
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;
import javax.media.opengl.GLCapabilities;
import javax.media.opengl.GLProfile;
import javax.media.opengl.awt.GLCanvas;
import javax.swing.JFrame;
public class Main {

    public static void main(String[] args) {
        // set up OpenGL Version 2
        GLProfile profile = GLProfile.get(GLProfile.GL2);
        GLCapabilities capabilities = new GLCapabilities(profile);

        // the canvas is the widget that's drawn in the JFrame
        GLCanvas glcanvas = new GLCanvas(capabilities);
        glcanvas.addGLEventListener(new Renderer());
        glcanvas.setSize(300, 300);

        JFrame frame = new JFrame("Hello World");
        frame.getContentPane().add(glcanvas);

        // shut down the program on window close event
        frame.addWindowListener(new WindowAdapter() {
            public void windowClosing(WindowEvent ev) {
                System.exit(0);
            }
        });

        frame.setSize(frame.getContentPane().getPreferredSize());
        frame.setVisible(true);
    }
}
When you download the JARs directly, some web browsers might wrap them into a ZIP file for "security" reasons. Rather, download the 7z archive here and take the JARs it contains. If you follow these detailed instructions, it should work in Eclipse.
I remind you that the separate native libraries are no longer required in JOGL 2; rather, use the JARs containing the native libraries. It is a lot less error prone: just put them into the same directory as the Java libraries relying on them (jogl-all.jar and gluegen-rt.jar).

write logs to oldest file in directory

I want to do some tweaks to my logging for my application...
I would like some help to enhance what I have below in my main method:
public static void main(String[] args) {
    try {
        Date date = new Date();
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        Handler h = new FileHandler("../logs/MyLogFile_"
                + sdf.format(date) + ".log", true);
        h.setFormatter(new SingleLineFormatter());
        h.setLevel(Level.ALL);
        logger.setUseParentHandlers(false);
        logger.addHandler(h);
    }
    //...
}
It creates a log file with a date stamp every time I run the application. But I want to achieve something like this in my Unix directory:
-rw-r--r-- 1 r787848 dev 45271 Feb 4 11:31 MyLogFile.log.06
-rw-r--r-- 1 r787848 dev 45308 Feb 5 11:36 MyLogFile.log.05
-rw-r--r-- 1 r787848 dev 44336 Feb 6 06:50 MyLogFile.log.04
-rw-r--r-- 1 r787848 dev 44379 Feb 7 08:41 MyLogFile.log.03
-rw-r--r-- 1 r787848 dev 44409 Feb 10 08:45 MyLogFile.log.02
-rw-r--r-- 1 r787848 dev 44446 Feb 11 12:36 MyLogFile.log.01
I want to define a set of, let's say, 6 log files to capture the logging of daily runs of the application. When it comes to logging, I want the application to write to the log file that is oldest, so in the above instance, running the application on Feb 12 08:45 should clear MyLogFile.log.06 and write fresh for Feb 12.
How can this be achieved with java.util.logging on top of what I have? Unfortunately, I am not able to configure log4j properties and want to use java.util.logging only.
The only close approximation is to do the following:
Handler h = new FileHandler("../logs/MyLogFile_"
+ sdf.format(date) + ".log", Integer.MAX_VALUE, 6, false);
See: JDK-6350749 - Enhance FileHandler to have Daily Log Rotation capabilities.
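If size-based rotation among a fixed set of files is an acceptable substitute for daily rotation, here is a minimal self-contained sketch; the path, size limit, and logger name are arbitrary choices:
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class RotatingLogSetup {
    private static final Logger logger = Logger.getLogger("myapp");

    public static void main(String[] args) throws IOException {
        // keep 6 files (MyLogFile_0.log ... MyLogFile_5.log); once the current file
        // reaches ~1 MB the files shift one generation and logging continues in MyLogFile_0.log
        FileHandler h = new FileHandler("../logs/MyLogFile_%g.log", 1024 * 1024, 6, true);
        h.setFormatter(new SimpleFormatter());
        h.setLevel(Level.ALL);
        logger.setUseParentHandlers(false);
        logger.addHandler(h);
        logger.info("application started");
    }
}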

Java - Read specific files in specific order from a folder [closed]

In my Java program I process a certain number of files.
Those files are named in this way:
thu 21 mar 2013_01.55.22_128.txt
thu 21 mar 2013_01.55.22_129.txt
thu 21 mar 2013_01.55.22_130.txt
....
sat 23 mar 2013_01.45.55_128.txt
sat 23 mar 2013_01.45.55_129.txt
sat 23 mar 2013_01.45.55_130.txt
where the last three numbers are the cell number.
Consider that I already read the files coming from the same cell in date order.
Consider that all the files are in the same folder.
Consider also that this problem, but for a single cell, was correctly solved in this post.
My question now is: how can I read first all the txt files coming from a specific cell (e.g. 128), then all the files coming from cell 129, and so on? (below: a graphic example)
thu 21 mar 2013_01.55.22_128.txt
sat 23 mar 2013_01.45.55_128.txt
...
thu 21 mar 2013_01.55.22_129.txt
sat 23 mar 2013_01.45.55_129.txt
...
thu 21 mar 2013_01.55.22_130.txt
sat 23 mar 2013_01.45.55_130.txt
I hope I was clear
You may get all the files in the directory into an array using listFiles(), then sort it with a custom comparator:
File[] files = dir.listFiles();
Arrays.sort(files, new Comparator<File>() {
    @Override
    public int compare(File lhs, File rhs) {
        // return a negative number if lhs should go before rhs,
        // zero if the order doesn't matter,
        // or a positive number if lhs should go after rhs
        return 0; // placeholder
    }
});
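For the file names in the question, a rough sketch of such a comparator might look like the following; the date pattern and the cell extraction are assumptions based on the naming scheme shown above:
import java.io.File;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Date;
import java.util.Locale;

public class CellFileSorter {
    // "thu 21 mar 2013_01.55.22_128.txt": the date/time comes before the last '_', the cell after it
    private static final SimpleDateFormat FORMAT =
            new SimpleDateFormat("EEE dd MMM yyyy_HH.mm.ss", Locale.ENGLISH);

    private static int cellOf(File f) {
        String name = f.getName();
        return Integer.parseInt(name.substring(name.lastIndexOf('_') + 1, name.lastIndexOf('.')));
    }

    private static Date dateOf(File f) {
        String name = f.getName();
        try {
            return FORMAT.parse(name.substring(0, name.lastIndexOf('_')));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unexpected file name: " + name, e);
        }
    }

    public static void main(String[] args) {
        File[] files = new File(args[0]).listFiles();
        Arrays.sort(files, new Comparator<File>() {
            @Override
            public int compare(File lhs, File rhs) {
                int byCell = Integer.compare(cellOf(lhs), cellOf(rhs));
                return byCell != 0 ? byCell : dateOf(lhs).compareTo(dateOf(rhs));
            }
        });
        for (File f : files) {
            System.out.println(f.getName());
        }
    }
}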
Well, you could read the folder in order to get the File objects (or maybe just the file names).
Then parse the file names, extract the cell, and put the files into a map whose key is the cell.
Some pseudo code:
Map<String, List<File>> filesPerCell = new LinkedHashMap<String, List<File>>();
File[] files = folder.listFiles();
for (File file : files) {
    String filename = file.getName();
    String cell = ...; // extract from filename
    List<File> l = filesPerCell.get(cell);
    if (l == null) { // create a new list the first time a cell is seen
        l = new ArrayList<File>();
        filesPerCell.put(cell, l);
    }
    l.add(file);
}
for (List<File> cellList : filesPerCell.values()) {
    // do whatever you want with the files for that cell
}
You want your file names sorted by cell number and, within each cell, by date/time. You could do this most easily if your file names were like this:
cellnumber_yyyymmdd_hhmmss
where cellnumber has the same number of digits in all cases.
Otherwise you must write a custom comparator (as RiaD writes), but it is not trivial, because the dates must be parsed before one can decide which is later or earlier.
