I am working on a project that uses the Hadoop and Mahout libraries. I need to use SequenceFile.Writer to write data to a file, but I am getting a NullPointerException when I try to use SequenceFile. To make the problem easier to reproduce, I have written a small test program that recreates it, and I have included the error message and the code that generates the sample data.
First, I generate sample data from a few normal distributions in the MyUtil class. Then, in the Test class, I run canopy clustering on that data using Mahout's canopy clustering library. Finally, I try to write the centroids produced by the canopy clustering algorithm to a file using SequenceFile.Writer, and this is where the NullPointerException occurs (when creating the SequenceFile.Writer).
Thanks in advance for your help.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.mahout.clustering.canopy.Canopy;
import org.apache.mahout.clustering.canopy.CanopyClusterer;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
public class Test {
    public static void main(String[] args) throws IOException {
        List<Vector> sampleData = new ArrayList<Vector>();
        MyUtil.generateSamples(sampleData, 400, 1, 1, 2);
        MyUtil.generateSamples(sampleData, 400, 1, 0, .5);
        MyUtil.generateSamples(sampleData, 400, 0, 2, .1);
        @SuppressWarnings("deprecation")
        List<Canopy> canopies = CanopyClusterer.createCanopies(sampleData,
                new EuclideanDistanceMeasure(), 3.0, 1.5);
        Configuration conf = new Configuration();
        File testData = new File("testData/points");
        if (!testData.exists()) {
            testData.mkdirs(); // create the parent directory as well
        }
        Path path = new Path("testData/points/file1");
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
        for (Canopy canopy : canopies) {
            System.out.println("Canopy ID: " + canopy.getId() + " centers "
                    + canopy.getCenter().toString());
            writer.append(new LongWritable(canopy.getId()),
                    new Text(canopy.getCenter().toString()));
        }
        writer.close();
    }
}
MyUtil.generateSamples just generates the sample data (that code is included below). The error message the code above throws is:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1071)
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1371)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:272)
at Test.main(Test.java:39)
The code to generate the sample data:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.random.Normal;
public class MyUtil {
    public static void generateSamples(List<Vector> vectors, int num,
            double mx, double my, double sd) {
        Normal xDist = new Normal(mx, sd);
        Normal yDist = new Normal(my, sd);
        for (int i = 0; i < num; i++) {
            vectors.add(new DenseVector(new double[]{xDist.sample(), yDist.sample()}));
        }
    }
}
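A note on the stack trace above: it fails inside Hadoop's Shell helper while setting permissions on the local file system, which on Windows typically means Hadoop cannot locate winutils.exe because HADOOP_HOME is not set. A minimal sketch of the usual workaround, assuming a hypothetical C:\hadoop directory whose bin folder contains winutils.exe:
// Assumption: running on Windows without HADOOP_HOME set; point Hadoop at the
// directory containing bin\winutils.exe before any FileSystem/SequenceFile call.
System.setProperty("hadoop.home.dir", "C:\\hadoop");
Configuration conf = new Configuration();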
Related
I have the following class to perform PCA on an ARFF file. I have added the Weka jar to my project, but I am still getting an error saying that DataSource cannot be resolved, and I don't know how to fix it. Can anyone suggest what could be wrong?
package project;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ConverterUtils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.converters.TextDirectoryLoader;
import weka.gui.visualize.Plot2D;
import weka.gui.visualize.PlotData2D;
import weka.gui.visualize.VisualizePanel;
import java.awt.BorderLayout;
import java.io.File;
import java.util.ArrayList;
import javax.swing.JFrame;
import org.math.plot.FrameView;
import org.math.plot.Plot2DPanel;
import org.math.plot.PlotPanel;
import org.math.plot.plots.ScatterPlot;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
public class PCA {
    public static void main(String[] args) {
        try {
            // Load the Data.
            DataSource source = new DataSource("../data/ingredients.arff");
            Instances data = source.getDataSet();
            // Perform PCA.
            PrincipalComponents pca = new PrincipalComponents();
            pca.setVarianceCovered(1.0);
            //pca.setCenterData(true);
            pca.setNormalize(true);
            pca.setTransformBackToOriginal(false);
            pca.buildEvaluator(data);
            // Show the data transformed into the eigenvector basis.
            Instances transformedData = pca.transformedData();
            System.out.println(transformedData);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I'm writing simple Java client code to add values to an HBase table. I'm using put.add(byte[] columnFamily, byte[] columnQualifier, byte[] value), but this method is deprecated in the new HBase API. Can anyone tell me the right way to do this with the new Put API?
Using Maven, I have downloaded the jar for HBase version 1.2.0.
I'm using the following code:
package com.NoSQL;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
public class PopulatingData {
    public static void main(String[] args) throws IOException {
        String table = "Employee";
        Logger.getRootLogger().setLevel(Level.WARN);
        Configuration conf = HBaseConfiguration.create();
        Connection con = ConnectionFactory.createConnection(conf);
        Admin admin = con.getAdmin();
        if (admin.tableExists(TableName.valueOf(table))) {
            Table htable = con.getTable(TableName.valueOf(table));
            /*********** adding a new row ***********/
            // adding a row key
            Put p = new Put(Bytes.toBytes("row1"));
            p.add(Bytes.toBytes("ContactDetails"), Bytes.toBytes("Mobile"), Bytes.toBytes("9876543210"));
            p.add(Bytes.toBytes("ContactDetails"), Bytes.toBytes("Email"), Bytes.toBytes("abhc@gmail.com"));
            p.add(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("Abhinav Rawat"));
            p.add(Bytes.toBytes("Personal"), Bytes.toBytes("Age"), Bytes.toBytes("21"));
            p.add(Bytes.toBytes("Personal"), Bytes.toBytes("Gender"), Bytes.toBytes("M"));
            p.add(Bytes.toBytes("Employement"), Bytes.toBytes("Company"), Bytes.toBytes("UpGrad"));
            p.add(Bytes.toBytes("Employement"), Bytes.toBytes("DOJ"), Bytes.toBytes("11:06:2018"));
            p.add(Bytes.toBytes("Employement"), Bytes.toBytes("Designation"), Bytes.toBytes("ContentStrategist"));
            htable.put(p);
            /**********************/
            System.out.print("Table is Populated");
        } else {
            System.out.println("The HBase Table named " + table + " doesn't exist.");
        }
        System.out.println("Returning Main");
    }
}
Use the addColumn() method:
Put put = new Put(Bytes.toBytes(rowKey));
put.addColumn(NAME_FAMILY, NAME_COL_QUALIFIER, name);
For more details, refer to the Javadoc:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html
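Applied to the code in the question, the deprecated p.add(...) calls map one-for-one onto addColumn(family, qualifier, value); a minimal sketch, reusing the same "Employee" table, column families, and htable handle from the question:
Put p = new Put(Bytes.toBytes("row1"));
// addColumn takes the same (family, qualifier, value) byte arrays as the old add(...)
p.addColumn(Bytes.toBytes("ContactDetails"), Bytes.toBytes("Mobile"), Bytes.toBytes("9876543210"));
p.addColumn(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("Abhinav Rawat"));
p.addColumn(Bytes.toBytes("Employement"), Bytes.toBytes("Company"), Bytes.toBytes("UpGrad"));
htable.put(p); // same Table handle as in the question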
I am trying to implement a function that checks whether a PNG image contains a human face. I am trying to use OpenIMAJ and noticed that it has four detectors (Identity, Haar, etc.).
I would appreciate it if anyone could share a relevant code snippet.
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.openimaj.image.DisplayUtilities;
import org.openimaj.image.FImage;
import org.openimaj.image.ImageUtilities;
import org.openimaj.image.MBFImage;
import org.openimaj.image.colour.RGBColour;
import org.openimaj.image.colour.Transforms;
import org.openimaj.image.processing.face.detection.DetectedFace;
import org.openimaj.image.processing.face.detection.FaceDetector;
import org.openimaj.image.processing.face.detection.HaarCascadeDetector;
import org.openimaj.math.geometry.shape.Rectangle;
public class App {
    public static void main(String[] args) throws Exception {
        final MBFImage image = ImageUtilities.readMBF(new File("d:\\java\\face\\bin.jpeg"));
        FaceDetector<DetectedFace, FImage> fd = new HaarCascadeDetector(200);
        List<DetectedFace> faces = fd.detectFaces(Transforms.calculateIntensity(image));
        System.out.println("# Found faces, one per line.");
        System.out.println("# <x>, <y>, <width>, <height>");
        for (Iterator<DetectedFace> iterator = faces.iterator(); iterator.hasNext();) {
            DetectedFace face = iterator.next();
            Rectangle bounds = face.getBounds();
            image.drawShape(face.getBounds(), RGBColour.RED);
            // System.out.println(bounds.x + ";" + bounds.y + ";" + bounds.width + ";" +
            // bounds.height);
        }
        DisplayUtilities.display(image);
    }
}
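If the goal is only a yes/no check for a single image, the detection result above can be collapsed into a boolean. A minimal sketch along the same lines (the file path and the minimum face size of 80 pixels are just placeholders):
import java.io.File;
import java.io.IOException;
import org.openimaj.image.FImage;
import org.openimaj.image.ImageUtilities;
import org.openimaj.image.processing.face.detection.DetectedFace;
import org.openimaj.image.processing.face.detection.FaceDetector;
import org.openimaj.image.processing.face.detection.HaarCascadeDetector;

public class FaceCheck {
    // Returns true if the Haar cascade detector finds at least one face in the image.
    public static boolean containsFace(File imageFile) throws IOException {
        FImage grey = ImageUtilities.readF(imageFile); // load directly as a grey-level image
        FaceDetector<DetectedFace, FImage> detector = new HaarCascadeDetector(80);
        return !detector.detectFaces(grey).isEmpty();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(containsFace(new File("face.png")));
    }
}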
I am trying to get the predictions on the test set using the evaluateModel function; however, evaluation.evaluateModel(classifier, newTest, output) throws an exception.
Exception in thread "main" weka.core.WekaException: No dataset structure provided!
import weka.classifiers.Evaluation;
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.BestFirst;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.classifiers.evaluation.output.prediction.CSV;
public void evaluateTest() throws Exception {
    DataSource train = new DataSource(trainingData.toString());
    Instances traininstances = train.getDataSet();
    Attribute attr = traininstances.attribute("regressionLabel");
    int trainindex = attr.index();
    traininstances.setClassIndex(trainindex);

    DataSource test = new DataSource(testData.toString());
    Instances testinstances = test.getDataSet();
    Attribute testattr = testinstances.attribute(regressionLabel);
    int testindex = testattr.index();
    testinstances.setClassIndex(testindex);

    AttributeSelection filter = new AttributeSelection();
    weka.classifiers.AbstractClassifier classifier;
    filter.setSearch(this.search);
    filter.setEvaluator(this.eval);
    filter.setInputFormat(traininstances); // initializing the filter once with the training set
    // configures the filter based on the train instances and returns filtered instances
    Instances newTrain = AttributeSelection.useFilter(traininstances, filter);
    Instances newTest = AttributeSelection.useFilter(testinstances, filter);

    classifier = new LinearRegression();
    classifier.buildClassifier(newTrain);

    StringBuffer buffer = new StringBuffer();
    CSV output = new CSV();
    output.setBuffer(buffer);
    output.setOutputFile(predictFile);

    Evaluation evaluation = new Evaluation(newTrain);
    evaluation.evaluateModel(classifier, newTest, output);
}
The same thing works with evaluation.crossValidateModel.
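One thing worth checking, as a guess from the error text: when predictions are routed through an AbstractOutput such as CSV, evaluateModel expects the output object to already know the dataset structure. A minimal sketch of the output setup with the header supplied via setHeader, reusing newTrain, newTest, classifier and predictFile from the code above:
CSV output = new CSV();
output.setBuffer(new StringBuffer());
output.setOutputFile(predictFile);
output.setHeader(newTest); // hand the prediction printer the dataset structure it complains about

Evaluation evaluation = new Evaluation(newTrain);
evaluation.evaluateModel(classifier, newTest, output);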
I am trying to train an en-ner-location.bin file using OpenNLP in Java. The thing is, I got the training text file in the following format:
<START:location> Fontana <END>
<START:location> Palo Verde <END>
<START:location> Picacho <END>
and I trained the file using the following code:
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Collections;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
public class TrainNames {
    @SuppressWarnings("deprecation")
    public void TrainNames() throws IOException {
        File fileTrainer = new File("citytrain.txt");
        File output = new File("en-ner-location.bin");
        ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream(fileTrainer), "UTF-8");
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        System.out.println("lineStream = " + lineStream);
        TokenNameFinderModel model = NameFinderME.train("en", "location", sampleStream,
                Collections.<String, Object>emptyMap(), 1, 0);
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(output));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}
I got no errors or warnings, but when I try to get a city name from a string like cnt = "John is planning to specialize in Electrical Engineering in UC Fontana and pursue a career with IBM.";, it returns the whole string.
Can anybody tell me why?
Welcome to SO! It looks like you need more context around each location annotation. I believe that right now OpenNLP thinks you are training it to find words (any word), because each of your training samples contains only one word. You need to annotate locations within whole sentences, and you will need at least a few hundred samples to start seeing good results.
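For example (a sketch reusing the sentence and location names from the question), each line of the training file would be a whole, tokenized sentence with the locations marked inline:
John is planning to specialize in Electrical Engineering in UC <START:location> Fontana <END> and pursue a career with IBM .
She moved from <START:location> Palo Verde <END> to <START:location> Picacho <END> last year .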
See this answer as well:
How I train an Named Entity Recognizer identifier in OpenNLP?