I wrote code to read a CSV file and map all the columns to a bean class.
Now I'm trying to set these values on a Dataset and am getting an error.
7/08/30 16:33:58 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: object is not an instance of declaring class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
If I set the values manually, it works fine.
public void run(String t, String u) throws FileNotFoundException {
    JavaRDD<String> pairRDD = sparkContext.textFile("C:/temp/L1_result.csv");
    JavaPairRDD<String, String> rowJavaRDD = pairRDD.mapToPair(new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String rec) throws FileNotFoundException {
            String[] tokens = rec.split(";");
            String[] vals = new String[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                vals[i] = tokens[i];
            }
            return new Tuple2<String, String>(tokens[0], tokens[1]);
        }
    });

    ColumnPositionMappingStrategy cpm = new ColumnPositionMappingStrategy();
    cpm.setType(funds.class);
    String[] csvcolumns = new String[]{"portfolio_id", "portfolio_code"};
    cpm.setColumnMapping(csvcolumns);
    CSVReader csvReader = new CSVReader(new FileReader("C:/temp/L1_result.csv"));
    CsvToBean csvtobean = new CsvToBean();
    List csvDataList = csvtobean.parse(cpm, csvReader);
    for (Object dataobject : csvDataList) {
        funds fund = (funds) dataobject;
        System.out.println("Portfolio:" + fund.getPortfolio_id() + " code:" + fund.getPortfolio_code());
    }

    /* funds b0 = new funds();
    b0.setK("k0");
    b0.setSomething("sth0");
    funds b1 = new funds();
    b1.setK("k1");
    b1.setSomething("sth1");
    List<funds> data = new ArrayList<funds>();
    data.add(b0);
    data.add(b1); */

    System.out.println("Portfolio:" + rowJavaRDD.values());

    // manual set works fine
    // Dataset<Row> fundDf = SQLContext.createDataFrame(data, funds.class);
    Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
    fundDf.printSchema();
    fundDf.write().option("mergeschema", true).parquet("C:/test");
}
The line below, which uses rowJavaRDD.values(), is the one giving the error:
Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
What is the resolution to this? Whatever values I'm column-mapping should be passed here, but how should that be done? Any idea would really help.
Dataset<Row> fundDf = SQLContext.createDataFrame(csvDataList, funds.class);
Passing the list worked!
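For context, the error comes from the fact that rowJavaRDD.values() is a JavaRDD<String>, while createDataFrame is told the elements are funds beans, so the reflective calls on the bean getters fail. If an RDD is preferred over the parsed list, a minimal sketch is below; it assumes funds has setPortfolio_id/setPortfolio_code setters matching the getters above and that sqlContext is an SQLContext instance:
// Map each CSV line to a funds bean first, then build the DataFrame from an RDD of beans.
JavaRDD<funds> fundsRDD = pairRDD.map(line -> {
    String[] tokens = line.split(";");
    funds f = new funds();
    f.setPortfolio_id(tokens[0]);   // assumed setter
    f.setPortfolio_code(tokens[1]); // assumed setter
    return f;
});
Dataset<Row> fundDfFromRdd = sqlContext.createDataFrame(fundsRDD, funds.class);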
I am trying to load data into the table below. I am able to load the data into "array_data",
but how do I load data into the nested array "inside_array"? I have tried the commented-out part to load the data into inside_array, but it did not work.
Here is my code:
Pipeline p = Pipeline.create(options);
org.apache.beam.sdk.values.PCollection<TableRow> output = p.apply(org.apache.beam.sdk.transforms.Create.of("temp"))
        .apply("O/P", ParDo.of(new DoFn<String, TableRow>() {
            private static final long serialVersionUID = 307542945272055650L;

            @ProcessElement
            public void processElemet(ProcessContext c) {
                TableRow row = new TableRow();
                row.set("name", "Jack");
                row.set("phone", "9874563210");
                TableRow ip = new TableRow().set("address", "M G Road").set("email", "abc@gmail.com");
                TableRow ip1 = new TableRow().set("address", "F C Road").set("email", "xyz@gmail.com");
                java.util.List<TableRow> metadata = new ArrayList<TableRow>();
                metadata.add(ip);
                metadata.add(ip1);
                row.set("array_data", metadata);
                LOG.info("O/P:" + row);
                c.output(row);
            }
        }));
output.apply("Write to table", BigQueryIO.writeTableRows().withoutValidation().to("AA.nested_array")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
p.run();
Does anyone have any clue or suggestion? Thanks in advance.
To handle the nested array using Dataflow, create a separate List and add it to your main TableRow array.
I tried it this way and got the expected output.
Pipeline p = Pipeline.create(options);
org.apache.beam.sdk.values.PCollection<TableRow> output = p.apply(org.apache.beam.sdk.transforms.Create.of("temp"))
        .apply("O/P", ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElemet(ProcessContext c) {
                TableRow row = new TableRow();
                row.set("name", "Jack");
                row.set("phone", "9874563210");
                List<TableRow> listDest = new ArrayList<>();
                TableRow t = new TableRow().set("detail1", "one").set("detail2", "two");
                TableRow t1 = new TableRow().set("detail1", "three").set("detail2", "four");
                listDest.add(t);
                listDest.add(t1);
                TableRow ip = new TableRow().set("address", "M G Road").set("email", "abc@gmail.com").set("inside_array", listDest);
                TableRow ip1 = new TableRow().set("address", "F C Road").set("email", "xyz@gmail.com").set("inside_array", listDest);
                java.util.List<TableRow> metadata = new ArrayList<TableRow>();
                metadata.add(ip);
                metadata.add(ip1);
                row.set("array_data", metadata);
                LOG.info("O/P:" + row);
                c.output(row);
            }
        }));
I am also adding an image of the table with its data (not included here).
I hope this will be helpful to anyone working on the same kind of table.
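For reference, the nested layout these rows target could be expressed with the BigQuery Java model classes roughly as follows. The field names are taken from the code above, but the leaf types are assumptions, since the actual table already exists and CREATE_NEVER is used:
// Assumed schema: array_data is a REPEATED RECORD containing address, email
// and a nested REPEATED RECORD inside_array(detail1, detail2).
TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("name").setType("STRING"),
        new TableFieldSchema().setName("phone").setType("STRING"),
        new TableFieldSchema().setName("array_data").setType("RECORD").setMode("REPEATED").setFields(Arrays.asList(
                new TableFieldSchema().setName("address").setType("STRING"),
                new TableFieldSchema().setName("email").setType("STRING"),
                new TableFieldSchema().setName("inside_array").setType("RECORD").setMode("REPEATED").setFields(Arrays.asList(
                        new TableFieldSchema().setName("detail1").setType("STRING"),
                        new TableFieldSchema().setName("detail2").setType("STRING")))))));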
I am very new to Spark machine learning (just a three-day-old novice) and I'm trying to predict some data using the logistic regression algorithm in Spark via Java. I referred to a few sites and the documentation, came up with the code below, and ran into an issue when executing it.
I have pre-processed the data and used a VectorAssembler to club all the relevant columns into one, and the issue appears when I try to fit the model.
public class Sparkdemo {

    static SparkSession session = SparkSession.builder().appName("spark_demo")
            .master("local[*]").getOrCreate();

    @SuppressWarnings("empty-statement")
    public static void getData() {
        Dataset<Row> inputFile = session.read()
                .option("header", true)
                .format("csv")
                .option("inferschema", true)
                .csv("C:\\Users\\WildJasmine\\Downloads\\NKI_cleaned.csv");
        inputFile.show();

        String[] columns = inputFile.columns();
        int beg = 16, end = columns.length - 1;
        String[] featuresToDrop = new String[end - beg + 1];
        System.arraycopy(columns, beg, featuresToDrop, 0, featuresToDrop.length);
        System.out.println("rows are\n " + Arrays.toString(featuresToDrop));

        Dataset<Row> dataSubset = inputFile.drop(featuresToDrop);
        String[] arr = {"Patient", "ID", "eventdeath"};
        Dataset<Row> X = dataSubset.drop(arr);
        X.show();

        Dataset<Row> y = dataSubset.select("eventdeath");
        y.show();

        // Vector Assembler concept for merging all the cols into a single col
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(X.columns())
                .setOutputCol("features");
        Dataset<Row> dataset = assembler.transform(X);
        dataset.show();

        StringIndexer labelSplit = new StringIndexer().setInputCol("features").setOutputCol("label");
        Dataset<Row> data = labelSplit.fit(dataset)
                .transform(dataset);
        data.show();

        Dataset<Row>[] splitsX = data.randomSplit(new double[]{0.8, 0.2}, 42);
        Dataset<Row> trainingX = splitsX[0];
        Dataset<Row> testX = splitsX[1];

        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.3)
                .setElasticNetParam(0.8);
        LogisticRegressionModel lrModel = lr.fit(trainingX);

        Dataset<Row> prediction = lrModel.transform(testX);
        prediction.show();
    }

    public static void main(String[] args) {
        getData();
    }
}
The image below shows my dataset (image not included here).
Error message:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The input column features must be either string type or numeric type, but got org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.feature.StringIndexerBase$class.validateAndTransformSchema(StringIndexer.scala:86)
at org.apache.spark.ml.feature.StringIndexer.validateAndTransformSchema(StringIndexer.scala:109)
at org.apache.spark.ml.feature.StringIndexer.transformSchema(StringIndexer.scala:152)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:135)
My end goal is to get a predicted value using the features column.
Thanks in advance.
That error occurs when the input field of your dataframe to which you want to apply the StringIndexer transformation is a Vector. In the Spark documentation (https://spark.apache.org/docs/latest/ml-features#stringindexer) you can see that the input column must be a string. This transformer performs a distinct on that column and creates a new column with an integer corresponding to each distinct string value. It does not work for vectors.
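A minimal sketch of one way around it, assuming the intent was to index the eventdeath column from the question as the label rather than the assembled features vector (variable names are placeholders, and eventdeath would need to be kept in the dataframe rather than dropped before assembling):
// Keep eventdeath in the dataframe, assemble the features, then index the label column.
StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("eventdeath")
        .setOutputCol("label");
Dataset<Row> labeled = labelIndexer.fit(assembled).transform(assembled);

LogisticRegression lr = new LogisticRegression()
        .setFeaturesCol("features")   // the VectorAssembler output stays as the features column
        .setLabelCol("label");
LogisticRegressionModel lrModel = lr.fit(labeled);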
I have a dynamic set of columns in my Spark Dataset. I want to pass an array of columns to my UDF instead of separate columns. How can I write the UDF so that it accepts an array of columns?
I have tried passing a sequence of strings, but it is failing.
static UDF1<Seq<String>, String> udf = new UDF1<Seq<String>, String>() {
    @Override
    public String call(Seq<String> t1) throws Exception {
        return t1.toString();
    }
};

private static Column generate(Dataset<Row> dataset, SparkSession ss) {
    ss.udf().register("generate", udf, DataTypes.StringType);
    StructField[] columnsStructType = dataset.schema().fields();
    List<Column> columnList = new ArrayList<>();
    for (StructField structField : columnsStructType) {
        columnList.add(dataset.col(structField.name()));
    }
    return functions.callUDF("generate", convertListToSeq(columnList));
}

private static Seq<Column> convertListToSeq(List<Column> inputList) {
    return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
}
I am getting the following error message when I try to invoke the generate function:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Invalid number of arguments for function generate. Expected: 1; Found: 14;
at org.apache.spark.sql.UDFRegistration.builder$27(UDFRegistration.scala:763)
at org.apache.spark.sql.UDFRegistration.$anonfun$register$377(UDFRegistration.scala:766)
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:115)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1273)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$143.$anonfun$applyOrElse$66(Analyzer.scala:1329)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:1329)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$143.applyOrElse(Analyzer.scala:1312)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:261)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDown$1(QueryPlan.scala:83)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:121)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike.map(TraversableLike.scala:233)
at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:121)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:83)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:74)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13.applyOrElse(Analyzer.scala:1312)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13.applyOrElse(Analyzer.scala:1310)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:1310)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:1309)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:87)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:122)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:118)
at scala.collection.immutable.List.foldLeft(List.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:388)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3407)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1336)
at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2253)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2220)
In short: you should use the array method to combine your columns into one structure before passing that into your UDF.
This code should work (it's the actual working code after some refactoring).
//
// The UDF function implementation
//
static String myFunc(Seq<Object> values) {
    Iterator<Object> iterator = values.iterator();
    while (iterator.hasNext()) {
        Object object = iterator.next();
        // Do something with your column value
    }
    return ...;
}

//
// UDF registration; `sc` here is the Spark SQL context
//
sc.udf().register("myFunc", (UDF1<Seq<Object>, String>) values -> myFunc(values), DataTypes.StringType);

//
// Calling the UDF; note the `array` method
//
Dataset<Row> ds = ...;
Seq<Column> columns = JavaConversions.asScalaBuffer(Stream
        .of(ds.schema().fields())
        .map(f -> col(f.name()))
        .collect(Collectors.toList()));
ds = ds.withColumn("myColumn", callUDF("myFunc", array(columns)));
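As a usage note, the same call also works with an explicit set of columns, since array accepts Column varargs (the column names below are just placeholders):
// Hypothetical columns "a", "b" and "c" combined into one array argument for the UDF.
ds = ds.withColumn("myColumn", callUDF("myFunc", array(col("a"), col("b"), col("c"))));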
Scenario
I have two string lists containing text file paths, List a and List b.
I want to do a cartesian product of lists a and b to achieve a cartesian dataframe comparison.
The way I am trying it is to first do the cartesian product, transfer it to a pair RDD, and then apply the operation in a foreach.
List<String> a = Lists.newList("/data/1.text", "/data/2.text", "/data/3.text");
List<String> b = Lists.newList("/data/4.text", "/data/5.text", "/data/6.text");
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
List<Tuple2<String, String>> cartesian = cartesian(a, b);
jsc.parallelizePairs(cartesian).filter(new Function<Tuple2<String, String>, Boolean>() {
    @Override
    public Boolean call(Tuple2<String, String> tup) throws Exception {
        Dataset<Row> text1 = spark.read().text(tup._1); // <-- this throws NullPointerException
        Dataset<Row> text2 = spark.read().text(tup._2);
        return text1.first() == text2.first(); // <-- this is an indicative comparison only
    }
});
I can even use Spark to do the cartesian product, as in:
JavaRDD<String> sourceRdd = jsc.parallelize(a);
JavaRDD<String> allRdd = jsc.parallelize(b);
sourceRdd.cache().cartesian(allRdd).filter(new Function<Tuple2<String, String>, Boolean>() {
    @Override
    public Boolean call(Tuple2<String, String> tup) throws Exception {
        Dataset<Row> text1 = spark.read().text(tup._1); // <-- same issue
        Dataset<Row> text2 = spark.read().text(tup._2);
        return text1.first() == text2.first();
    }
});
Please suggest a good approach to handle this.
Not sure if I completely understood your problem. Here is a sample of a Cartesian product using Spark and Java.
public class CartesianDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CartesianDemo").setMaster("local");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // lists
        List<String> listOne = Arrays.asList("one", "two", "three", "four", "five");
        List<String> listTwo = Arrays.asList("ww", "xx", "yy", "zz");
        // RDDs
        JavaRDD<String> rddOne = jsc.parallelize(listOne);
        JavaRDD<String> rddTwo = jsc.parallelize(listTwo);
        // Cartesian
        JavaPairRDD<String, String> cartesianRDD = rddOne.cartesian(rddTwo);
        // print
        cartesianRDD.foreach(data -> {
            System.out.println("X=" + data._1() + " Y=" + data._2());
        });
        // stop
        jsc.stop();
        jsc.close();
    }
}
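If the end goal is the file comparison from the question, one possible sketch is to collect the cartesian pairs to the driver and read each pair there, which avoids calling the SparkSession inside an RDD function (that nested use is what typically produces the NullPointerException above). This assumes the pair RDD holds file paths and that spark is the SparkSession; the first()-based comparison is indicative only:
// Collect (pathA, pathB) pairs to the driver and compare the files there.
List<Tuple2<String, String>> pairs = cartesianRDD.collect();
for (Tuple2<String, String> pair : pairs) {
    Dataset<Row> left = spark.read().text(pair._1());
    Dataset<Row> right = spark.read().text(pair._2());
    boolean sameFirstLine = left.first().equals(right.first());
    System.out.println(pair._1() + " vs " + pair._2() + " -> " + sameFirstLine);
}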
public static void main(String[] args)
{
    String input = "jack=susan,kathy,bryan;david=stephen,jack;murphy=bruce,simon,mary";
    String[][] family = new String[50][50];
    // assign family and children to data by ;
    StringTokenizer p = new StringTokenizer(input, ";");
    int no_of_family = input.replaceAll("[^;]", "").length();
    no_of_family++;
    System.out.println("family= " + no_of_family);
    String[] data = new String[no_of_family];
    int i = 0;
    while (p.hasMoreTokens())
    {
        data[i] = p.nextToken();
        i++;
    }
    for (int j = 0; j < no_of_family; j++)
    {
        family[j][0] = data[j].split("=")[0];
        // assign child to data by commas
        StringTokenizer v = new StringTokenizer(data[j], ",");
        int no_of_child = data[j].replaceAll("[^,]", "").length();
        no_of_child++;
        System.out.println("data from input = " + data[j]);
        for (int k = 1; k <= no_of_child; k++)
        {
            family[j][k] = data[j].split("=")[1].split(",");
            System.out.println(family[j][k]);
        }
    }
}
I have a list of families in an input string, I separate it into individual families, and I want to store the result in a two-dimensional array, family[i][j].
My goal is:
family[0][0]=1st father's name
family[0][1]=1st child name
family[0][2]=2nd child name and so on...
family[0][0]=jack
family[0][1]=susan
family[0][2]=kathy
family[0][3]=bryan
family[1][0]=david
family[1][1]=stephen
family[1][2]=jack
family[2][0]=murphy
family[2][1]=bruce
family[2][2]=simon
family[2][3]=mary
But I got the error in the title: incompatible types
found: java.lang.String[]
required: java.lang.String
family[j][k] = data[j].split("=")[1].split(",");
What can I do? I need help.
Does anyone know how to use StringTokenizer for this input?
I'm trying to understand why you can't just use split for your nested operation as well.
For example, something like this should work just fine:
for (int j = 0; j < no_of_family; j++)
{
    String[] familySplit = data[j].split("=");
    family[j][0] = familySplit[0];
    String[] childrenSplit = familySplit[1].split(",");
    for (int k = 0; k < childrenSplit.length; k++)
    {
        family[j][k + 1] = childrenSplit[k];
    }
}
You are trying to assign an array of strings to a string. Maybe this will make it clearer:
String[] array = data.split("=")[1].split(",");
Now, if you want the first element of that array you can then do:
family[j][k] = array[0];
I always avoid using arrays directly; they are harder to manipulate than a dynamic list. I implemented the solution using a map of parent to a list of children, Map<String, List<String>> (read Map<Parent, List<Children>>).
public static void main(String[] args) {
    String input = "jack=susan,kathy,bryan;david=stephen,jack;murphy=bruce,simon,mary";
    Map<String, List<String>> parents = new Hashtable<String, List<String>>();
    for (String family : input.split(";")) {
        final String parent = family.split("=")[0];
        final String allChildrens = family.split("=")[1];
        List<String> childrens = new Vector<String>();
        for (String children : allChildrens.split(",")) {
            childrens.add(children);
        }
        parents.put(parent, childrens);
    }
    System.out.println(parents);
}
The output is this:
{jack=[susan, kathy, bryan], murphy=[bruce, simon, mary], david=[stephen, jack]}
With this approach you can directly access a parent's children using the map:
System.out.println(parents.get("jack"));
which outputs:
[susan, kathy, bryan]