I am working on a Hadoop map-reduce program where I am not setting the mapper and reducer, and not setting any other parameters on the Job configuration from my program. I did so assuming that the Job would send the same output as the input to the output file.
But what I found is that it is printing some dummy integer value in the output file, with every line separated by a tab (I guess).
Here is my code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MinimalMapReduce extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        String argg[] = {"/Users/***/Documents/hadoop/input/input.txt",
                "/Users/***/Documents/hadoop/output_MinimalMapReduce"};
        try {
            int exitCode = ToolRunner.run(new MinimalMapReduce(), argg);
            System.exit(exitCode);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
And here is the input:
2011 22
2011 25
2012 40
2013 35
2013 38
2014 44
2015 43
And here is the output:
0 2011 22
8 2011 25
16 2012 40
24 2013 35
32 2013 38
40 2014 44
48 2015 43
How can I get the same output as the input?
I did so assuming that the Job would send the same output as the input to the output file
You were correct in assuming that. Technically, you do get whatever is in the file as the output. Remember that mappers and reducers take key-value pairs as input.
The input to a mapper is an input split of the file, and the input to a reducer is the output of the mapper(s).
But what I found is that it is printing some dummy integer value in the output file, with every line separated by a tab
These dummy integers are nothing but the byte offset of each line from the start of the file. Since each row consists of [4 DIGITS]<space>[2 DIGITS]<new-line>, your offsets are multiples of eight.
Why are you getting these offsets when you haven't defined any mapper or reducer, you might ask? This is because a mapper always runs even if you don't define one; the default one simply passes each (offset, line) pair through unchanged, and it is referred to as the IdentityMapper.
How can I get the same output as the input?
Well, you can define a mapper that just maps the input lines to the output, stripping the offsets:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Some cool logic here
}
In the above code, key contains the dummy integer value, i.e. the byte offset, and value contains the contents of one line at a time.
You can write the value out using the context.write function, use no reducer, and set job.setNumReduceTasks(0) to get the desired output.
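For instance, a minimal sketch of such a mapper (the class name PassThroughMapper is just illustrative) writes each line as the output key with a NullWritable value, so TextOutputFormat prints the line by itself without a trailing tab:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PassThroughMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset supplied by TextInputFormat; we simply drop it.
        // Writing NullWritable as the value makes TextOutputFormat print the line
        // alone, without a trailing tab separator.
        context.write(value, NullWritable.get());
    }
}
In the driver you would then add job.setMapperClass(PassThroughMapper.class), job.setOutputKeyClass(Text.class), job.setOutputValueClass(NullWritable.class) and job.setNumReduceTasks(0); with zero reduce tasks the map output is written straight to the output files.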
I agree with philantrovert's answer, but here are some more details I found.
According to Hadoop: The Definitive Guide, it is TextInputFormat that produces these offsets as keys. Here is what the book says about TextInputFormat:
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (e.g., newline or carriage return), and is packaged as a Text object. So, a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
Clearly, the keys are not line numbers. This would be impossible to implement in general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed independently. Line numbers are really a sequential notion. You have to keep a count of lines as you consume them, so knowing the line number within a split would be possible, but not within the file.
However, the offset within the file of each line is known by each split independently of the other splits, since each split knows the size of the preceding splits and just adds this onto the offsets within the split to produce a global file offset. The offset is usually sufficient for applications that need a unique identifier for each line. Combined with the file’s name, it is unique within the filesystem. Of course, if all the lines are a fixed width, calculating the line number is simply a matter of dividing the offset by the width.
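As a quick illustration with the input from the question, every line is 8 bytes ("2011 22" plus the newline), so dividing each key by the line width recovers the line number. A tiny throwaway sketch of just that arithmetic:
public class OffsetToLineNumber {
    public static void main(String[] args) {
        int lineWidth = 8; // "2011 22" is 7 bytes, plus 1 byte for the newline
        long[] offsets = {0, 8, 16, 24, 32, 40, 48}; // the keys from the output above
        for (long offset : offsets) {
            System.out.println("offset " + offset + " -> line " + (offset / lineWidth));
        }
    }
}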
I am using the Stanford POS tagger toolkit to tag a list of words from academic papers. Here is my code for this part:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words derived from nltk.word_tokenize; they come from normal academic papers, so there are usually several thousand words (mostly 3000-4000). I need to process over 10000 files, so I keep calling these functions. My program works fine on a small test set with 270 files, but when the number of files gets bigger, the program gives this error (Java heap space 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately after execution starts; it happens after the program has been running for some time. I really don't know the reason. Is it because my 3000-4000 words are too many? Thank you very much for the help! (Sorry for the bad formatting, the error message is too long.)
Here is my solution; I faced the same error. Basically, increasing the Java heap size solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print tagger.tag(sentence.split())
I assume you have tried increasing the Java heap via the tagger settings, like so:
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf. the docs; the default is 1000:
def __init__(self, [...], java_options='-mx1000m')
In order to test whether it is a problem with the size of the dataset, you can tokenize your text into sentences, e.g. using the Punkt tokenizer, and output them right after tagging.
I am trying to create a Java implementation of a maxent classifier. I need to classify the sentences into n different classes.
I had a look at ColumnDataClassifier in the Stanford maxent classifier, but I am not able to understand how to create the training data. I need training data that includes POS tags for the words of each sentence, so that the classifier can use features such as the previous word, the next word, etc.
I am looking for training data that has sentences with POS tagging and the sentence class mentioned, for example:
My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS
Any help will be appreciated.
If I understand it correctly, you are trying to treat each sentence as a set of POS tags.
In your example, the sentence "My name is XYZ" would be represented as a set of (PRP$, NN, VBZ, NNP).
That would mean every sentence is actually a binary vector of length 37 (the 36 possible POS tags according to this page, plus the CLASS outcome feature for the whole sentence).
This can be encoded for OpenNLP Maxent as follows:
PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1
or simply:
PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
(For a working code snippet, see my answer here: Training models using openNLP maxent)
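To make the encoding concrete, here is a small sketch (the class and method names are my own, not part of OpenNLP) that turns the POS tags of one sentence into a training line in the second format above:
public class MaxentLineEncoder {

    // Builds e.g. "PRP$ NN VBZ NNP CLASS=SomeClassOfYours1" from the tags of one sentence
    public static String encode(String[] posTags, String outcome) {
        StringBuilder sb = new StringBuilder();
        for (String tag : posTags) {
            sb.append(tag).append(' ');
        }
        sb.append("CLASS=").append(outcome);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tags = {"PRP$", "NN", "VBZ", "NNP"}; // "My name is XYZ"
        System.out.println(encode(tags, "SomeClassOfYours1"));
    }
}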
Some more sample data would be:
"By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
"In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
"As soon as she moved out, the mobile home was demolished, the suit said."
...
This would yield samples:
IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
...
However, I don't expect that such a classification would yield good results. It would be better to make use of other structural features of a sentence, such as the parse tree or dependency tree, which can be obtained using e.g. the Stanford parser.
Edited on 28.3.2016:
You can also use the whole sentence as a training sample. However, be aware that:
- two sentences might contain the same words but have different meanings
- there is a pretty high chance of overfitting
- you should use short sentences
- you need a huge training set
According to your example, I would encode the training samples as follows:
class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
...
Notice that the outcome variable comes as the first element on each line.
Here is a fully working minimal example using opennlp-maxent-3.0.3.jar.
package my.maxent;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import opennlp.maxent.GIS;
import opennlp.maxent.io.GISModelReader;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
import opennlp.model.AbstractModel;
import opennlp.model.AbstractModelWriter;
import opennlp.model.DataIndexer;
import opennlp.model.DataReader;
import opennlp.model.FileEventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassDataIndexer;
import opennlp.model.PlainTextFileDataReader;

public class MaxentTest {

    public static void main(String[] args) throws IOException {

        String trainingFileName = "training-file.txt";
        String modelFileName = "trained-model.maxent.gz";

        // Training a model from data stored in a file.
        // The training file contains one training sample per line.
        DataIndexer indexer = new OnePassDataIndexer(new FileEventStream(trainingFileName));
        MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations

        // Storing the trained model into a file for later use (gzipped)
        File outFile = new File(modelFileName);
        AbstractModelWriter writer =
                new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
        writer.persist();

        // Loading the gzipped model from a file
        FileInputStream inputStream = new FileInputStream(modelFileName);
        InputStream decodedInputStream = new GZIPInputStream(inputStream);
        DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
        MaxentModel loadedMaxentModel = new GISModelReader(modelReader).getModel();

        // Now predicting the outcome using the loaded model
        String[] context = {"is_VBZ", "Gaby_NNP"};
        double[] outcomeProbs = loadedMaxentModel.eval(context);
        String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);

        System.out.println("=======================================");
        System.out.println(outcome);
        System.out.println("=======================================");
    }
}
And some dummy training data (stored as training-file.txt):
class=Male My_PRP name_NN is_VBZ John_NNP
class=Male My_PRP name_NN is_VBZ Peter_NNP
class=Female My_PRP name_NN is_VBZ Anna_NNP
class=Female My_PRP name_NN is_VBZ Gaby_NNP
This yields the following output:
Indexing events using cutoff of 0
Computing event counts... done. 4 events
Indexing... done.
Sorting and merging events... done. Reduced 4 events to 4.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 4
Number of Outcomes: 2
Number of Predicates: 7
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-2.772588722239781 0.5
2: ... loglikelihood=-2.4410105407571203 1.0
...
99: ... loglikelihood=-0.16111520541752372 1.0
100: ... loglikelihood=-0.15953272940719138 1.0
=======================================
class=Female
=======================================
I'm doing a large-scale HBase import using a map-reduce job that I set up like so:
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setMapperClass(BulkMapper.class);
job.setOutputFormatClass(HFileOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
HFileOutputFormat.configureIncrementalLoad(job, hTable); //This creates a text file that will be full of put statements, should take 10 minutes or so
boolean suc = job.waitForCompletion(true);
It uses a mapper that I wrote myself, and HFileOutputFormat.configureIncrementalLoad sets up a reducer. I've done proofs of concept with this setup before; however, when I ran it on a large dataset it died in the reducer with this error:
Error: java.io.IOException: Non-increasing Bloom keys: BLMX2014-02-03nullAdded after BLMX2014-02-03nullRemoved
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.appendGeneralBloomfilter(StoreFile.java:934)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:970)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:576)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
    at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:78)
    at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:43)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
I thought Hadoop was supposed to guarantee sorted input to the reducer; if so, why am I having this issue, and is there anything I can do to avoid it?
I'm deeply annoyed that this worked. The problem was in the way I was keying my map output. I replaced what I used to have with this:
ImmutableBytesWritable HKey = new ImmutableBytesWritable(put.getRow());
context.write(HKey, put);
Basically, the key I was using and the row key of the Put were slightly different, which caused the reducer to receive the Put statements out of order.
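For context, here is a rough sketch of what the corrected mapper might look like; the input line format, column family and qualifier are made up for illustration, and only the keying by put.getRow() reflects the actual fix:
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RowKeyedPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical input format: rowkey<TAB>value
        String[] fields = line.toString().split("\t", 2);
        Put put = new Put(Bytes.toBytes(fields[0]));
        // put.add(...) is the pre-1.0 HBase API (consistent with HFileOutputFormat above);
        // newer versions use put.addColumn(...)
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[1]));

        // The crucial part: key the map output by the Put's own row key so that
        // PutSortReducer receives the Puts in row-key order.
        ImmutableBytesWritable hKey = new ImmutableBytesWritable(put.getRow());
        context.write(hKey, put);
    }
}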
I have a long log file generated with log4j, with 10 threads writing to the log.
I am looking for a log analyzer tool that could find lines where the user waited for a long time (i.e. where the difference between log entries for the same thread is more than a minute).
P.S. I am trying to use OtrosLogViewer, but it only offers filtering by certain values (for example, by thread ID) and does not compare between lines.
P.P.S. The new version of OtrosLogViewer has a "Delta" column that calculates the difference between adjacent log lines (in ms).
Thank you.
This simple Python script may be enough. For testing, I analyzed my local Apache log, which BTW uses the Common Log Format, so you may even reuse the script as-is. I simply compute the difference between two subsequent requests and print the request line for deltas exceeding a certain threshold (1 second in my test). You may want to encapsulate the code in a function that also accepts a parameter with the thread ID, so you can filter further.
#!/usr/bin/env python
import re
from datetime import datetime

THRESHOLD = 1

last = None
for line in open("/var/log/apache2/access.log"):
    # You may insert here something like
    # if not re.match(THREAD_ID, line):
    #     continue

    # Python does not support %z, hence the [:-6]
    current = datetime.strptime(
        re.search(r"\[([^]]+)]", line).group(1)[:-6],
        "%d/%b/%Y:%H:%M:%S")
    if last != None and (current - last).seconds > THRESHOLD:
        print re.search('"([^"]+)"', line).group(1)
    last = current
Based on Raffaele's answer, I made some fixes so it works on any log file (skipping lines that don't begin with the expected timestamp, e.g. a Jenkins console log).
In addition, I added max/min thresholds to filter out lines based on duration limits.
#!/usr/bin/env python
import re
from datetime import datetime

MIN_THRESHOLD = 80
MAX_THRESHOLD = 100

regCompile = r"\w+\s+(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d).*"
filePath = "C:/Users/user/Desktop/temp/jenkins.log"

lastTime = None
lastLine = ""
with open(filePath, 'r') as f:
    for line in f:
        regexp = re.search(regCompile, line)
        if regexp:
            currentTime = datetime.strptime(re.search(regCompile, line).group(1), "%Y-%m-%d %H:%M:%S")
            if lastTime != None:
                duration = (currentTime - lastTime).seconds
                if duration >= MIN_THRESHOLD and duration <= MAX_THRESHOLD:
                    print ("#######################################################################################################################################")
                    print (lastLine)
                    print (line)
            lastTime = currentTime
            lastLine = line
f.closed
Apache Chainsaw has a time delta column.
How can I implement such a requirement via a regexp?
I have a list of filenames as Strings.
LOAD_filesourceA-01012008-00001.dat
LOAD_filesourceB-01012008-00001.dat
LOAD_filesourceB-01012008-00003.dat
LOAD_filesourceA-01012008-00004.dat
LOAD_filesourceA-01012008-000055.dat
LOAD_filesourceB-01012008_000055.dat
...
LOAD_filesourceB-01012008_000058.dat
etc
After loading each file, that file gets moved into an archive directory... and I log the file type and load number (the last 6 characters of the filename).
I have 2 pieces of info:
1- whether the file I wish to load is of type A or B
2- the last loaded file number as integer
Based on these, I would like to get the filename of the next file, i.e. one that is of the same file type and whose load number (the last 6 digits before the ".dat" section) is the next available number. Say the last loaded number was 12; then I will search for 13, and if that is not available, 14, 15, etc., until I process all files in that directory.
Given just a string like "LOAD_filesourceB-01012008_000058.dat", can I check that this is file type B and, assuming the last loaded file number was 57, that it satisfies the requirement of being number 58 (i.e. > 57)?
LOAD_filesource(A|B)-[0-9]+-([0-9]+)\.dat
A or B will end up in group 1, the number of the file in group 2. Then parse group 2 as a decimal integer.
See this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Match {

    Pattern pattern = Pattern.compile("LOAD_filesource(A|B)-[0-9]{8}[_-]([0-9]{5,6})\\.dat");

    String files[] = {
            "LOAD_filesourceA-01012008-00001.dat",
            "LOAD_filesourceB-01012008-00001.dat",
            "LOAD_filesourceB-01012008-00003.dat",
            "LOAD_filesourceA-01012008-00004.dat",
            "LOAD_filesourceA-01012008-000055.dat",
            "LOAD_filesourceB-01012008_000055.dat",
            "LOAD_filesourceB-01012008_000058.dat"
    };

    public static void main(String[] args) {
        new Match().run();
    }

    private void run() {
        for (String file : files) {
            Matcher matcher = pattern.matcher(file);
            System.out.print(String.format("%s %b %s %s\n", file, matcher.matches(), matcher.group(1), matcher.group(2)));
        }
    }
}
with this output:
LOAD_filesourceA-01012008-00001.dat true A 00001
LOAD_filesourceB-01012008-00001.dat true B 00001
LOAD_filesourceB-01012008-00003.dat true B 00003
LOAD_filesourceA-01012008-00004.dat true A 00004
LOAD_filesourceA-01012008-000055.dat true A 000055
LOAD_filesourceB-01012008_000055.dat true B 000055
LOAD_filesourceB-01012008_000058.dat true B 000058
I don't know if it's intentional or not, but you have listed two different formats, one that uses a hyphen as the final separator and one that uses an underscore. If both are really supported, you would want:
LOAD_filesource(A|B)-[0-9]+[_-]([0-9]+)\.dat
Also, your six-digit number is sometimes five digits (e.g. the 00001 in LOAD_filesourceA-...-00001.dat), but the above regular expression only requires that at least one digit be present.
Depending on how many files you're going to attempt to examine, you might be better off loading up a directory listing rather than randomly checking to see if a file exists. With an appropriate compare method, sorting your list could give you your files in an easy-to-work-with order.
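As a rough sketch of that approach (the directory path and class name here are hypothetical), you could list the directory once, match each name against the pattern above, and keep the smallest load number of the requested type that is greater than the last one loaded:
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NextFileFinder {

    private static final Pattern FILE_PATTERN =
            Pattern.compile("LOAD_filesource(A|B)-[0-9]{8}[_-]([0-9]{5,6})\\.dat");

    // Returns the name of the file of the given type ("A" or "B") whose load number is the
    // smallest one greater than lastLoaded, or null if no such file exists.
    public static String findNext(File dir, String type, int lastLoaded) {
        String best = null;
        int bestNumber = Integer.MAX_VALUE;
        File[] files = dir.listFiles();
        if (files == null) {
            return null; // directory missing or unreadable
        }
        for (File f : files) {
            Matcher m = FILE_PATTERN.matcher(f.getName());
            if (!m.matches() || !m.group(1).equals(type)) {
                continue;
            }
            int number = Integer.parseInt(m.group(2));
            if (number > lastLoaded && number < bestNumber) {
                bestNumber = number;
                best = f.getName();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // e.g. find the next type-B file after load number 57
        System.out.println(findNext(new File("/data/incoming"), "B", 57));
    }
}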