I am very new to Spark Machine Learning (just a 3-day-old novice) and I'm trying to predict some data using the Logistic Regression algorithm in Spark via Java. I have referred to a few sites and the documentation, came up with the code below, and am facing an issue when I try to execute it.
I have pre-processed the data and used a VectorAssembler to combine all the relevant columns into one, but I hit the error when I try to fit the model.
public class Sparkdemo {

    static SparkSession session = SparkSession.builder().appName("spark_demo")
            .master("local[*]").getOrCreate();

    @SuppressWarnings("empty-statement")
    public static void getData() {
        Dataset<Row> inputFile = session.read()
                .option("header", true)
                .format("csv")
                .option("inferschema", true)
                .csv("C:\\Users\\WildJasmine\\Downloads\\NKI_cleaned.csv");
        inputFile.show();

        String[] columns = inputFile.columns();
        int beg = 16, end = columns.length - 1;
        String[] featuresToDrop = new String[end - beg + 1];
        System.arraycopy(columns, beg, featuresToDrop, 0, featuresToDrop.length);
        System.out.println("rows are\n " + Arrays.toString(featuresToDrop));

        Dataset<Row> dataSubset = inputFile.drop(featuresToDrop);
        String[] arr = {"Patient", "ID", "eventdeath"};
        Dataset<Row> X = dataSubset.drop(arr);
        X.show();

        Dataset<Row> y = dataSubset.select("eventdeath");
        y.show();

        // Vector Assembler concept for merging all the cols into a single col
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(X.columns())
                .setOutputCol("features");
        Dataset<Row> dataset = assembler.transform(X);
        dataset.show();

        StringIndexer labelSplit = new StringIndexer().setInputCol("features").setOutputCol("label");
        Dataset<Row> data = labelSplit.fit(dataset)
                .transform(dataset);
        data.show();

        Dataset<Row>[] splitsX = data.randomSplit(new double[]{0.8, 0.2}, 42);
        Dataset<Row> trainingX = splitsX[0];
        Dataset<Row> testX = splitsX[1];

        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.3)
                .setElasticNetParam(0.8);

        LogisticRegressionModel lrModel = lr.fit(trainingX);
        Dataset<Row> prediction = lrModel.transform(testX);
        prediction.show();
    }

    public static void main(String[] args) {
        getData();
    }
}
Below is an image of my dataset.
[image: dataset]
Error message:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The input column features must be either string type or numeric type, but got org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.feature.StringIndexerBase$class.validateAndTransformSchema(StringIndexer.scala:86)
at org.apache.spark.ml.feature.StringIndexer.validateAndTransformSchema(StringIndexer.scala:109)
at org.apache.spark.ml.feature.StringIndexer.transformSchema(StringIndexer.scala:152)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:135)
My end goal is to get a predicted value using the features column.
Thanks in advance.
That error occurs when the column of your DataFrame that you apply the StringIndexer transformation to is a Vector. As the Spark documentation shows (https://spark.apache.org/docs/latest/ml-features#stringindexer), the input column must be a string (or numeric) column: the transformer takes the distinct values of that column and creates a new column of integer indices, one per distinct value. It does not work on vectors.
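A minimal sketch of one way to fix it, assuming the value you want to predict is the eventdeath column from your own code, and reusing your dataSubset and X variables; this is not the only possible pipeline, just an illustration:
    // Index the label column, not the assembled features vector.
    // "eventdeath" is taken from the question's code; adjust to your actual label column.
    StringIndexer labelIndexer = new StringIndexer()
            .setInputCol("eventdeath")
            .setOutputCol("label");

    // Assemble the numeric feature columns into a single vector column, as before.
    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(X.columns())
            .setOutputCol("features");

    // Keep eventdeath alongside the assembled features so the indexer can see it.
    Dataset<Row> withFeatures = assembler.transform(dataSubset.drop("Patient", "ID"));
    Dataset<Row> data = labelIndexer.fit(withFeatures).transform(withFeatures);

    // LogisticRegression reads the "features" and "label" columns by default.
    LogisticRegressionModel model = new LogisticRegression()
            .setMaxIter(10)
            .setRegParam(0.3)
            .setElasticNetParam(0.8)
            .fit(data);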
I have a dynamic set of columns in my Spark Dataset. I want to pass an array of columns to a UDF instead of separate columns. How can I write the UDF function so that it accepts an array of columns?
I have tried passing a sequence of strings, but it is failing.
static UDF1<Seq<String>, String> udf = new UDF1<Seq<String>, String>() {
    @Override
    public String call(Seq<String> t1) throws Exception {
        return t1.toString();
    }
};

private static Column generate(Dataset<Row> dataset, SparkSession ss) {
    ss.udf().register("generate", udf, DataTypes.StringType);
    StructField[] columnsStructType = dataset.schema().fields();
    List<Column> columnList = new ArrayList<>();
    for (StructField structField : columnsStructType) {
        columnList.add(dataset.col(structField.name()));
    }
    return functions.callUDF("generate", convertListToSeq(columnList));
}

private static Seq<Column> convertListToSeq(List<Column> inputList) {
    return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
}
I get the following error message when I try to invoke the generate function:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Invalid number of arguments for function generate. Expected: 1; Found: 14;
at org.apache.spark.sql.UDFRegistration.builder$27(UDFRegistration.scala:763)
at org.apache.spark.sql.UDFRegistration.$anonfun$register$377(UDFRegistration.scala:766)
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:115)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1273)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$143.$anonfun$applyOrElse$66(Analyzer.scala:1329)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:1329)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:1312)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:261)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDown$1(QueryPlan.scala:83)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:121)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike.map(TraversableLike.scala:233)
at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:121)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:83)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:74)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13.applyOrElse(Analyzer.scala:1312)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13.applyOrElse(Analyzer.scala:1310)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:1310)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:1309)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:87)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:122)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:118)
at scala.collection.immutable.List.foldLeft(List.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:388)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3407)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1336)
at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2253)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2220)
In short: you should use the array method to combine your columns into one structure before passing that into your UDF.
This code should work (it's the actual working code after some refactoring).
//
// The UDF function implementation
//
static String myFunc(Seq<Object> values) {
    Iterator<Object> iterator = values.iterator();
    while (iterator.hasNext()) {
        Object object = iterator.next();
        // Do something with your column value
    }
    return ...;
}
//
// UDF registration; `sc` here is the Spark SQL context
//
sc.udf().register("myFunc", (UDF1<Seq<Object>, String>) myFunc, DataTypes.StringType);
//
// Calling the UDF; note the `array` method
//
Dataset<Row> ds = ...;
Seq<Column> columns = JavaConversions.asScalaBuffer(Stream
        .of(ds.schema().fields())
        .map(f -> col(f.name()))
        .collect(Collectors.toList()));
ds = ds.withColumn("myColumn", callUDF("myFunc", array(columns)));
I'm making a Sudoku solver in Java, using a small Prolog knowledge base at its core. The Prolog sudoku rule requires a Prolog list of lists. In Java I have an int[][] with the Sudoku values.
I've made the query run successfully with a Prolog list of lists,
e.g. Query q1 = new Query("problem(1, Rows), sudoku(Rows)."); where Rows is a Prolog list of lists,
but I also need to make it run with a Java int[][],
e.g. Query q1 = new Query("sudoku", intArrayTerm);
The relevant Java code:
int s00 = parseTextField(t00);
int s01 = parseTextField(t01);
...
int s87 = parseTextField(t87);
int s88 = parseTextField(t88);
int[] row0 = {s00, s10, s20, s30, s40, s50, s60, s70, s80};
...
int[] row8 = {s08, s18, s28, s38, s48, s58, s68, s78, s88};
int[][] allRows = {row0, row1, row2, row3, row4, row5, row6, row7, row8};
Term rowsTerm = Util.intArrayArrayToList(allRows);
Query q0 = new Query("consult", new Term[]{new Atom("/home/mark/Documents/JavaProjects/SudokuSolver/src/com/company/sudoku.pl")});
System.out.println("consult " + (q0.hasSolution() ? "succeeded" : "failed"));
// Query q1 = new Query("problem(1, Rows), sudoku(Rows).");
Query q1 = new Query("sudoku", rowsTerm);
System.out.println("sudoku " + (q1.hasSolution() ? "succeeded" : "failed"));
Map<String, Term> rowsTermMap = q1.oneSolution();
Term solvedRowsTerm = (rowsTermMap.get("Rows"));
parseSolvedRowsTerm(solvedRowsTerm);
The Prolog code:
sudoku(Rows) :-
length(Rows, 9), maplist(same_length(Rows), Rows),
append(Rows, Vs), Vs ins 1..9,
maplist(all_distinct, Rows),
transpose(Rows, Columns),
maplist(all_distinct, Columns),
Rows = [A,B,C,D,E,F,G,H,I],
blocks(A, B, C), blocks(D, E, F), blocks(G, H, I).
blocks([], [], []).
blocks([A,B,C|Bs1], [D,E,F|Bs2], [G,H,I|Bs3]) :-
all_distinct([A,B,C,D,E,F,G,H,I]),
blocks(Bs1, Bs2, Bs3).
problem(1, [[_,_,_, _,_,_, _,_,_],
[_,_,_, _,_,3, _,8,5],
[_,_,1, _,2,_, _,_,_],
[_,_,_, 5,_,7, _,_,_],
[_,_,4, _,_,_, 1,_,_],
[_,9,_, _,_,_, _,_,_],
[5,_,_, _,_,_, _,7,3],
[_,_,2, _,1,_, _,_,_],
[_,_,_, _,4,_, _,_,9]]).
The functions parseTextField and parseSolvedRowsTerm (in fact, the whole program) work fine with the commented-out Query q1, but not with the uncommented one.
Solved it!
I added an extra argument to q1.
Credit to github.com/zlumyo: I borrowed his buildMatrix helper for convenience, and his code gave me the idea for the extra argument.
Query q1 = new Query("sudoku(" + buildMatrix(allRows) + ", Result)");
buildMatrix is basically just a StringBuilder helper function:
private String buildMatrix(int[][] cells) { // build matrix as string
    StringBuilder result = new StringBuilder("[");
    ArrayList<String> strList = new ArrayList<>();
    for (int[] i : cells) {
        strList.add(buildList(i));
    }
    result.append(String.join(",", strList));
    result.append("]");
    return result.toString();
}
private String buildList(int[] line) { // build a single row as a string
    StringBuilder result = new StringBuilder("[");
    ArrayList<String> intList = new ArrayList<>();
    for (int i : line) {
        // Small adaptation of github.com/zlumyo's version: my Prolog sudoku
        // expects unknown cells as "_" rather than 0.
        String stringval;
        if (i == 0) {
            stringval = "_";
        } else {
            stringval = String.valueOf(i);
        }
        intList.add(stringval);
    }
    result.append(String.join(",", intList));
    result.append("]");
    return result.toString();
}
The Prolog code hasn't changed much: just an extra argument and one extra line.
sudoku(Rows, Result) :-
length(Rows, 9), maplist(same_length(Rows), Rows),
append(Rows, Vs), Vs ins 1..9,
maplist(all_distinct, Rows),
transpose(Rows, Columns),
maplist(all_distinct, Columns),
Rows = [A,B,C,D,E,F,G,H,I],
blocks(A, B, C), blocks(D, E, F), blocks(G, H, I),
Rows = Result. %extra line
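For completeness, here is a minimal sketch of how the solved board can be read back from the Result variable, based on the oneSolution()/get() pattern from my original code (parseSolvedRowsTerm is my own helper from the question):
    Query q1 = new Query("sudoku(" + buildMatrix(allRows) + ", Result)");
    System.out.println("sudoku " + (q1.hasSolution() ? "succeeded" : "failed"));

    // oneSolution() returns the bindings of the first solution;
    // the solved board is bound to the Prolog variable "Result".
    Map<String, Term> solution = q1.oneSolution();
    Term solvedRowsTerm = solution.get("Result");
    parseSolvedRowsTerm(solvedRowsTerm);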
I have the data below, which I fetched from the database using a Hibernate NamedQuery:
TXN_ID END_DATE
---------- ------------
121 15-JUN-16
122 15-JUN-16
123 16-MAY-16
Each row is stored in a Java object.
Now I want to combine the data based on END_DATE: if the END_DATE values are the same, merge the TXN_ID data.
From the above data, the output would be:
TXN_ID END_DATE
---------- ------------
121|122 15-JUN-16
123 16-MAY-16
I want to do this in Java. What is the easiest way to write it?
Using the accepted printMap function to iterate through the HashMap and check that the output is correct, with the code below:
public static void main(String[] args) {
    String[][] b = {{"1","15-JUN-16"},{"2","16-JUN-16"},{"3","13-JUN-16"},{"4","16-JUN-16"},{"5","17-JUN-16"}};
    Map<String, String> mapb = new HashMap<String, String>();
    for (int j = 0; j < b.length; j++) {
        String c = mapb.get(b[j][1]);
        if (c == null)
            mapb.put(b[j][1], b[j][0]);            // first TXN_ID seen for this date
        else
            mapb.put(b[j][1], c + " " + b[j][0]);  // append (use "|" as the separator to get output like 121|122)
    }
    printMap(mapb);
}
You get the following output:
13-JUN-16 = 3
16-JUN-16 = 2 4
17-JUN-16 = 5
15-JUN-16 = 1
I think this will solve your problem.
With Hibernate you can put the query result into a list of objects:
Query q = session.createSQLQuery( sql ).addEntity(ObjDataQuery.class);
List<ObjDataQuery> res = q.list();
Now you can create a HashMap to store the final result; to populate it, iterate over res:
Map<String, String> finalResult = new HashMap<>();
for (int i = 0; i < res.size(); i++) {
    if (finalResult.get(res.get(i).date) == null) {
        // new element: first TXN_ID for this date
        finalResult.put(res.get(i).date, res.get(i).txn);
    } else {
        // update element: append the TXN_ID to the existing value
        finalResult.put(res.get(i).date,
                finalResult.get(res.get(i).date) + "|" + res.get(i).txn);
    }
}
I haven't tested it, but the logic should be correct.
Another way is to change the query so that it returns the final result directly (in Oracle, see LISTAGG).
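If you prefer to stay in plain Java rather than changing the query, the same merge can also be written with the Java 8 Stream API. This is only a sketch: it reuses the ObjDataQuery class with its date and txn fields from the code above and assumes txn is already a String (otherwise wrap it in String.valueOf):
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Merge TXN_IDs that share the same END_DATE, joining them with "|".
    static Map<String, String> mergeByDate(List<ObjDataQuery> res) {
        return res.stream()
                .collect(Collectors.groupingBy(
                        r -> r.date,                        // key: END_DATE
                        Collectors.mapping(r -> r.txn,      // value: TXN_ID
                                Collectors.joining("|")))); // e.g. "121|122"
    }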