Cascading - merge 2 aggregations

I have the following problem, which I am trying to solve with Cascading: I have a CSV file of records with the structure o,a,f,i,c.
I need to aggregate the records by o,a,f and sum the i's and c's per group.
For example:
100,200,300,5,1
100,200,300,6,2
101,201,301,20,5
101,201,301,21,6
should yield:
100,200,300,11,3
101,201,301,41,11
I could not work out how to merge the two Every instances that I have (can I aggregate both fields at the same time?).
Do you have any ideas?
Yosi
public class CascMain {
    public static void main(String[] args) {
        Scheme sourceScheme = new TextLine(new Fields("line"));
        Tap source = new Lfs(sourceScheme, "/tmp/casc/group.csv");
        Scheme sinkScheme = new TextDelimited(new Fields("o", "a", "f", "ti", "tc"), ",");
        Tap sink = new Lfs(sinkScheme, "/tmp/casc/output/", SinkMode.REPLACE);
        Pipe assembly = new Pipe("agg-pipe");
        Function function = new RegexSplitter(new Fields("o", "a", "f", "i", "c"), ",");
        assembly = new Each(assembly, new Fields("line"), function);
        Pipe groupAssembly = new GroupBy("group", assembly, new Fields("o", "a", "f"));
        Sum impSum = new Sum(new Fields("ti"));
        Pipe i = new Every(groupAssembly, new Fields("i"), impSum);
        Sum clickSum = new Sum(new Fields("tc"));
        Pipe c = new Every(groupAssembly, new Fields("c"), clickSum);
        // WHAT SHOULD I DO HERE
        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, CascMain.class);
        FlowConnector flowConnector = new FlowConnector(properties);
        Flow flow = flowConnector.connect("agg", source, sink, assembly);
        flow.complete();
    }
}

Use AggregateBy to aggregate multiple fields at the same time:
SumBy impSum = new SumBy(new Fields("i"), new Fields("ti"), long.class);
SumBy clickSum = new SumBy(new Fields("c"), new Fields("tc"), long.class);
assembly = new AggregateBy("totals", Pipe.pipes(assembly), new Fields("o", "a", "f"), 2, impSum, clickSum);
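To make the expected result concrete, here is a minimal plain-Java sketch (no Cascading involved) that groups the sample records by o,a,f and sums i and c in a single pass. The class name GroupSumSketch and the method aggregate are hypothetical names chosen for illustration. The sketch only mirrors what the two SumBy aggregations compute, not how Cascading executes them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupSumSketch {

    // Groups CSV lines of the form o,a,f,i,c by (o,a,f) and sums i and c per group.
    public static List<String> aggregate(List<String> lines) {
        // LinkedHashMap keeps the groups in first-seen order.
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] t = line.split(",");
            String key = t[0] + "," + t[1] + "," + t[2];
            long[] sums = totals.computeIfAbsent(key, k -> new long[2]);
            sums[0] += Long.parseLong(t[3]); // running total of i ("ti")
            sums[1] += Long.parseLong(t[4]); // running total of c ("tc")
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, long[]> e : totals.entrySet()) {
            out.add(e.getKey() + "," + e.getValue()[0] + "," + e.getValue()[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "100,200,300,5,1",
                "100,200,300,6,2",
                "101,201,301,20,5",
                "101,201,301,21,6");
        // Prints one line per group: 100,200,300,11,3 and 101,201,301,41,11
        aggregate(input).forEach(System.out::println);
    }
}
```

Both totals are updated while visiting each record once, which is the same idea as handing both SumBy instances to a single AggregateBy.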

Related

Custom DataProvider Nattable

I create a NatTable in the following way, but I can access the cells only through the getters and setters in my Student class. How else can I access the cells? Should I create my own BodyDataProvider or use IDataProvider? If so, could someone give some examples of implementing such providers?
final ColumnGroupModel columnGroupModel = new ColumnGroupModel();
ColumnHeaderLayer columnHeaderLayer;
String[] propertyNames = { "name", "groupNumber", "examName", "examMark" };
Map<String, String> propertyToLabelMap = new HashMap<String, String>();
propertyToLabelMap.put("name", "Full Name");
propertyToLabelMap.put("groupNumber", "Group");
propertyToLabelMap.put("examName", "Name");
propertyToLabelMap.put("examMark", "Mark");
DefaultBodyDataProvider<Student> bodyDataProvider = new DefaultBodyDataProvider<Student>(students,
propertyNames);
ColumnGroupBodyLayerStack bodyLayer = new ColumnGroupBodyLayerStack(new DataLayer(bodyDataProvider),
columnGroupModel);
DefaultColumnHeaderDataProvider defaultColumnHeaderDataProvider = new DefaultColumnHeaderDataProvider(
propertyNames, propertyToLabelMap);
DefaultColumnHeaderDataLayer columnHeaderDataLayer = new DefaultColumnHeaderDataLayer(
defaultColumnHeaderDataProvider);
columnHeaderLayer = new ColumnHeaderLayer(columnHeaderDataLayer, bodyLayer, bodyLayer.getSelectionLayer());
ColumnGroupHeaderLayer columnGroupHeaderLayer = new ColumnGroupHeaderLayer(columnHeaderLayer,
bodyLayer.getSelectionLayer(), columnGroupModel);
columnGroupHeaderLayer.addColumnsIndexesToGroup("Exams", 2, 3);
columnGroupHeaderLayer.setGroupUnbreakable(2);
final DefaultRowHeaderDataProvider rowHeaderDataProvider = new DefaultRowHeaderDataProvider(bodyDataProvider);
DefaultRowHeaderDataLayer rowHeaderDataLayer = new DefaultRowHeaderDataLayer(rowHeaderDataProvider);
ILayer rowHeaderLayer = new RowHeaderLayer(rowHeaderDataLayer, bodyLayer, bodyLayer.getSelectionLayer());
final DefaultCornerDataProvider cornerDataProvider = new DefaultCornerDataProvider(
defaultColumnHeaderDataProvider, rowHeaderDataProvider);
DataLayer cornerDataLayer = new DataLayer(cornerDataProvider);
ILayer cornerLayer = new CornerLayer(cornerDataLayer, rowHeaderLayer, columnGroupHeaderLayer);
GridLayer gridLayer = new GridLayer(bodyLayer, columnGroupHeaderLayer, rowHeaderLayer, cornerLayer);
NatTable table = new NatTable(shell, gridLayer, true);

As answered in your previous question, "How do I fix NullPointerException and putting data into NatTable", this is explained in the NatTable Getting Started Tutorial.
If you need some sample code, try the NatTable Examples Application.
And knowing your previous question, your data structure does not fit a table, because you have nested objects where the child objects are stored in an array. So this is more of a tree than a table.

How to pass csv mapped bean class to Dataset

I wrote code to read a csv file and map all the columns to a bean class.
Now I'm trying to load these values into a Dataset and get the following error:
7/08/30 16:33:58 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: object is not an instance of declaring class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
If I set the values manually, it works fine.
public void run(String t, String u) throws FileNotFoundException {
    JavaRDD<String> pairRDD = sparkContext.textFile("C:/temp/L1_result.csv");
    JavaPairRDD<String,String> rowJavaRDD = pairRDD.mapToPair(new PairFunction<String, String, String>() {
        public Tuple2<String,String> call(String rec) throws FileNotFoundException {
            String[] tokens = rec.split(";");
            String[] vals = new String[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                vals[i] = tokens[i];
            }
            return new Tuple2<String, String>(tokens[0], tokens[1]);
        }
    });
    ColumnPositionMappingStrategy cpm = new ColumnPositionMappingStrategy();
    cpm.setType(funds.class);
    String[] csvcolumns = new String[]{"portfolio_id", "portfolio_code"};
    cpm.setColumnMapping(csvcolumns);
    CSVReader csvReader = new CSVReader(new FileReader("C:/temp/L1_result.csv"));
    CsvToBean csvtobean = new CsvToBean();
    List csvDataList = csvtobean.parse(cpm, csvReader);
    for (Object dataobject : csvDataList) {
        funds fund = (funds) dataobject;
        System.out.println("Portfolio:" + fund.getPortfolio_id() + " code:" + fund.getPortfolio_code());
    }
    /* funds b0 = new funds();
    b0.setK("k0");
    b0.setSomething("sth0");
    funds b1 = new funds();
    b1.setK("k1");
    b1.setSomething("sth1");
    List<funds> data = new ArrayList<funds>();
    data.add(b0);
    data.add(b1);*/
    System.out.println("Portfolio:" + rowJavaRDD.values());
    // manual set works fine
    // Dataset<Row> fundDf = SQLContext.createDataFrame(data, funds.class);
    Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
    fundDf.printSchema();
    fundDf.write().option("mergeschema", true).parquet("C:/test");
}
The line below, which uses rowJavaRDD.values(), causes the issue:
Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
How can I resolve this? The values from my column mapping should be passed here, but I don't know how to do that. Any ideas would really help.

Dataset<Row> fundDf = SQLContext.createDataFrame(csvDataList, funds.class);
Passing the list of beans instead of the RDD worked!
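A likely explanation, offered as an assumption based on the error message rather than something confirmed in the original post: rowJavaRDD.values() holds String elements, while createDataFrame(..., funds.class) reads each element through the bean's getters via reflection. Invoking a getter of one class on an object of another class raises an IllegalArgumentException. The plain-Java sketch below reproduces that mechanism without Spark; Fund and invokeGetter are hypothetical names used only for illustration.

```java
import java.lang.reflect.Method;

public class BeanMismatchSketch {

    // Hypothetical stand-in for the funds bean from the question.
    public static class Fund {
        private final String portfolioId;

        public Fund(String portfolioId) {
            this.portfolioId = portfolioId;
        }

        public String getPortfolioId() {
            return portfolioId;
        }
    }

    // Calls Fund.getPortfolioId() reflectively on an arbitrary object,
    // roughly what a bean-based reader has to do for every element.
    public static Object invokeGetter(Object element) {
        try {
            Method getter = Fund.class.getMethod("getPortfolioId");
            return getter.invoke(element);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Works: the element really is a Fund.
        System.out.println(invokeGetter(new Fund("P1")));

        // Fails: the element is a String, not a Fund.
        try {
            invokeGetter("P1;CODE1");
        } catch (IllegalArgumentException e) {
            System.out.println("IllegalArgumentException: " + e.getMessage());
        }
    }
}
```

If this is indeed the cause, the elements handed to createDataFrame need to be funds instances, either as a list of beans (as in the fix above) or by mapping each line to a bean first.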

How to pass numerical and categorical features to RandomForestRegressor in Apache Spark: MLlib in Java?

I can do it with either numerical or categorical features alone, but I don't know how to use both together.
My working code is as follows (only numerical features are used for prediction):
String[] featureNumericalCols = new String[]{
"squareM",
"timeTimeToPragueCityCenter",
};
String[] featureStringCols = new String[]{ //not used
"type",
"floor",
"disposition",
};
VectorAssembler assembler = new VectorAssembler().setInputCols(featureNumericalCols).setOutputCol("features");
Dataset<Row> numericalData = assembler.transform(data);
numericalData.show();
RandomForestRegressor rf = new RandomForestRegressor().setLabelCol("price")
.setFeaturesCol("features");
// Chain assembler and forest in a Pipeline
Pipeline pipeline = new Pipeline()
.setStages(new PipelineStage[]{assembler, rf});
// Train model. This also runs the assembler.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);

For anyone out there, this is the solution:
StringIndexer typeIndexer = new StringIndexer()
.setInputCol("type")
.setOutputCol("typeIndex");
preparedData = typeIndexer.fit(preparedData).transform(preparedData);
StringIndexer floorIndexer = new StringIndexer()
.setInputCol("floor")
.setOutputCol("floorIndex");
preparedData = floorIndexer.fit(preparedData).transform(preparedData);
StringIndexer dispositionIndexer = new StringIndexer()
.setInputCol("disposition")
.setOutputCol("dispositionIndex");
preparedData = dispositionIndexer.fit(preparedData).transform(preparedData);
String[] featureCols = new String[]{
"squareM",
"timeTimeToPragueCityCenter",
"typeIndex",
"floorIndex",
"dispositionIndex"
};
VectorAssembler assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features");
preparedData = assembler.transform(preparedData);
// ... some more implementation details
RandomForestRegressor rf = new RandomForestRegressor().setLabelCol("price")
.setFeaturesCol("features");
return rf.fit(preparedData);
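For intuition about what the StringIndexer stages produce, the following plain-Java sketch assigns each distinct label a numeric index ordered by descending frequency, which is how StringIndexer orders labels by default. It is an illustration only, not Spark's implementation. The class and method names are hypothetical, and the alphabetical tie-break is an assumption made here to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LabelIndexSketch {

    // Maps each distinct label to an index: most frequent -> 0.0, next -> 1.0, and so on.
    public static Map<String, Double> fit(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : column) {
            counts.merge(label, 1, Integer::sum);
        }
        List<String> labels = new ArrayList<>(counts.keySet());
        // Sort by descending count; ties are broken alphabetically (assumption of this sketch).
        labels.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        Map<String, Double> index = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            index.put(labels.get(i), (double) i);
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up values for the "type" column.
        List<String> type = Arrays.asList("flat", "house", "flat", "studio", "flat", "house");
        System.out.println(fit(type)); // {flat=0.0, house=1.0, studio=2.0}
    }
}
```

The resulting numeric columns (typeIndex, floorIndex, dispositionIndex) can then sit next to the numerical columns in the VectorAssembler input, as in the solution above.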

RandomForest with Weka in Java

I am working on a project and need some examples of how to implement RandomForest in Java with Weka. I did it with IBk() and it worked. If I do it with RandomForest in the same way, it does not work.
Does anyone have a simple example of how to implement RandomForest and how to get the probability for each class? (With IBk I used the classifier.distributionForInstance(instance) function, which returned the probabilities for each class.) How can I do this for RandomForest? Will I need to get the probabilities from every tree and combine them?
// example
ConverterUtils.DataSource source = new ConverterUtils.DataSource("..../edit.arff");
Instances dataset = source.getDataSet();
dataset.setClassIndex(dataset.numAttributes() - 1);
IBk classifier = new IBk(5);
classifier.buildClassifier(dataset);
Instance instance = new SparseInstance(2);
instance.setValue(0, 65); // example data
instance.setValue(1, 120); // example data
double[] prediction = classifier.distributionForInstance(instance);
// now I get the probability for the first class
System.out.println("Prediction for the first class is: " + prediction[0]);

You can calculate the infogain while building the model in the RandomForest. It is much slower and requires a lot of memory while building the model. I am not so sure about the documentation. You can add options or set values while building the model.
// numFolds is the number of cross-validation folds, usually between 2 and 10
// br is your BufferedReader
Instances trainData = new Instances(br);
trainData.setClassIndex(trainData.numAttributes() - 1);
RandomForest rf = new RandomForest();
rf.setNumTrees(50);
//You can set the options here
String[] options = new String[2];
options[0] = "-R";
rf.setOptions(options);
rf.buildClassifier(trainData);
weka.filters.supervised.attribute.AttributeSelection as = new weka.filters.supervised.attribute.AttributeSelection();
Ranker ranker = new Ranker();
InfoGainAttributeEval infoGainAttrEval = new InfoGainAttributeEval();
as.setEvaluator(infoGainAttrEval);
as.setSearch(ranker);
as.setInputFormat(trainData);
trainData = Filter.useFilter(trainData, as);
Evaluation evaluation = new Evaluation(trainData);
evaluation.crossValidateModel(rf, trainData, numFolds, new Random(1));
// Using a HashMap to store the infogain values of the attributes
int count = 0;
Map<String, Double> infogainscores = new HashMap<String, Double>();
for (int i = 0; i < trainData.numAttributes(); i++) {
    String t_attr = trainData.attribute(i).name();
    //System.out.println(i+trainData.attribute(i).name());
    double infogain = infoGainAttrEval.evaluateAttribute(i);
    if (infogain != 0) {
        //System.out.println(t_attr + "= "+ infogain);
        infogainscores.put(t_attr, infogain);
        count = count + 1;
    }
}
// iterating over the HashMap
Iterator<Map.Entry<String, Double>> it = infogainscores.entrySet().iterator();
while (it.hasNext()) {
    Map.Entry<String, Double> pair = it.next();
    System.out.println(pair.getKey() + " = " + pair.getValue());
    it.remove(); // avoids a ConcurrentModificationException
}
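As a reminder of what the InfoGainAttributeEval scores represent: for a nominal attribute A and class C, information gain is the class entropy minus the class entropy conditioned on the attribute, H(C) - H(C|A). The plain-Java sketch below computes it for small in-memory arrays. It is independent of Weka, the class and method names are hypothetical, and the sample data is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class InfoGainSketch {

    // Shannon entropy (in bits) of a distribution given as positive counts.
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of a nominal attribute with respect to the class labels.
    public static double infoGain(String[] attribute, String[] classes) {
        int n = classes.length;
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Integer> valueCounts = new HashMap<>();
        Map<String, Map<String, Integer>> classCountsByValue = new HashMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(classes[i], 1, Integer::sum);
            valueCounts.merge(attribute[i], 1, Integer::sum);
            classCountsByValue
                    .computeIfAbsent(attribute[i], k -> new HashMap<>())
                    .merge(classes[i], 1, Integer::sum);
        }
        // H(C|A): entropy within each attribute value, weighted by how often the value occurs.
        double conditional = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : classCountsByValue.entrySet()) {
            int size = valueCounts.get(e.getKey());
            conditional += ((double) size / n) * entropy(e.getValue(), size);
        }
        return entropy(classCounts, n) - conditional;
    }

    public static void main(String[] args) {
        String[] classes = {"yes", "yes", "no", "no"};
        String[] informative = {"a", "a", "b", "b"};   // determines the class completely
        String[] uninformative = {"a", "b", "a", "b"}; // tells nothing about the class
        System.out.println(infoGain(informative, classes));
        System.out.println(infoGain(uninformative, classes));
    }
}
```

An attribute that fully determines a balanced two-class label scores about 1 bit, and an attribute that is independent of the class scores about 0, which is why the loop above skips attributes whose infogain is 0.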

crosstab and crossDataSet

I'm trying to create a crosstab with three rows and n columns. I used a CrosstabDataset and a JRBeanCollectionDataSource to supply my data. My problem is that I can only show the last object in my collection data source; I don't have access to the data in my crosstab dataset.
Note: I used JRDesignCrosstab (Java code) to create the crosstab.
public static JRDesignCrosstab CrosstabPanel(String parameterName , JasperDesign jasperDesign, JRDesignDataset subDataset) throws JRException {
// parameter
JRDesignParameter parameter = new JRDesignParameter();
parameter.setName(parameterName);
parameter.setValueClass(java.lang.Object.class);
jasperDesign.addParameter(parameter);
subDataset.addParameter(parameter);
//Crosstab
JRDesignCrosstab crosstab = new JRDesignCrosstab();
crosstab.setX(-90);
crosstab.setY(-4);
crosstab.setWidth(600);
crosstab.setHeight(400);
//Expression :
JRDesignExpression expression = new JRDesignExpression("$P{"+parameterName+"}");
//CrosstabDataset
JRDesignCrosstabDataset dataSet = new JRDesignCrosstabDataset();
//datasetrun
JRDesignDatasetRun dsr = new JRDesignDatasetRun();
dsr.setDatasetName(subDataset.getName());
dsr.setDataSourceExpression(expression);
//datasetrun into CrosstabDataset
dataSet.setResetType(ResetTypeEnum.NONE);
dataSet.setDatasetRun(dsr);
crosstab.setDataset(dataSet);
//Bucket Row
JRDesignCrosstabBucket bucket = new JRDesignCrosstabBucket();
JRDesignExpression expressionField = new JRDesignExpression();
expressionField.setText("$F{commissionSimPaye}");
bucket.setValueClassName("net.sf.jasperreports.engine.DataSource");
bucket.setExpression(expressionField);
//Row Group;
JRDesignCrosstabRowGroup rowGroup = new JRDesignCrosstabRowGroup();
rowGroup.setName("rowGroup");
rowGroup.setBucket(bucket);
rowGroup.setWidth(68*2+1);
rowGroup.setTotalPosition(CrosstabTotalPositionEnum.END);
crosstab.addRowGroup(rowGroup);
//Bucket Second Row
bucket = new JRDesignCrosstabBucket();
expressionField = new JRDesignExpression();
expressionField.setText("$F{commissionSimPaye}");
bucket.setValueClassName("net.sf.jasperreports.engine.ReportContext");
bucket.setExpression(expressionField);
//Row Group;
rowGroup = new JRDesignCrosstabRowGroup();
rowGroup.setName("secondRowGroup");
rowGroup.setBucket(bucket);
rowGroup.setWidth(68*2+1);
rowGroup.setTotalPosition(CrosstabTotalPositionEnum.END);
crosstab.addRowGroup(rowGroup);
//Bucket Column
bucket = new JRDesignCrosstabBucket();
expressionField = new JRDesignExpression();
expressionField.setText("$F{commissionSimCalcule}");
bucket.setValueClassName("java.lang.Object");
bucket.setExpression(expressionField);
//ColumnGroup
JRDesignCrosstabColumnGroup ColumnGroup = new JRDesignCrosstabColumnGroup();
ColumnGroup.setName("columnGroup");
ColumnGroup.setBucket(bucket);
ColumnGroup.setHeight(60);
ColumnGroup.setTotalPosition(CrosstabTotalPositionEnum.END);
crosstab.addColumnGroup(ColumnGroup);
JRDesignExpression expressionMesaure = new JRDesignExpression();
expressionMesaure.setText("$F{commissionSimCalcule}");
JRDesignCrosstabMeasure measure = new JRDesignCrosstabMeasure();
measure.setName("ColumContent"+0);
measure.setValueExpression(expressionMesaure);
measure.setValueClassName("java.lang.Object");
crosstab.addMeasure(measure);
expressionMesaure = new JRDesignExpression();
expressionMesaure.setText("$F{commissionSimPaye}");
measure = new JRDesignCrosstabMeasure();
measure.setName("ColumContent"+1);
measure.setValueExpression(expressionMesaure);
measure.setValueClassName("java.lang.Object");
crosstab.addMeasure(measure);
expressionMesaure = new JRDesignExpression();
expressionMesaure.setText("$F{commissionSimAPaye}");
measure = new JRDesignCrosstabMeasure();
measure.setName("ColumContent"+2);
measure.setValueExpression(expressionMesaure);
measure.setValueClassName("java.lang.Object");
crosstab.addMeasure(measure);
//cell contents
JRDesignTextField textField = new JRDesignTextField();
JRDesignCrosstabCell cell = new JRDesignCrosstabCell();
JRDesignExpression expressionTextField = new JRDesignExpression();
JRDesignCellContents cellContents = new JRDesignCellContents();
textField.setX(0);
textField.setY(0);
textField.setWidth(68);
textField.setHeight(20);
textField.setHorizontalAlignment(HorizontalAlignEnum.RIGHT);
textField.getLineBox().getLeftPen().setLineWidth(1);
textField.getLineBox().getTopPen().setLineWidth(1);
textField.getLineBox().getRightPen().setLineWidth(1);
textField.getLineBox().getBottomPen().setLineWidth(1);
cell.setHeight(20);
cell.setWidth(68);
expressionTextField.setText("$V{ColumContent"+0+"}");
textField.setExpression(expressionTextField);
cellContents.addElement(textField);
cell.setContents(cellContents);
crosstab.addCell(cell);
return crosstab;
}

Problem solved: I reworked the structure of my JRBeanCollectionDataSource.
