How to read and write a custom class from parquet file - java

I am trying to write a parquet read/write class for a certain class type using DataFrame/datasets
class schema:
class A {
long count;
List<B> listOfValues;
}
class B {
String id;
long count;
}
code :
String path = "some path";
List<A> entries = somerandomAentries();
JavaRDD<A> rdd = sc.parallelize(entries, 1);
DataFrame df = sqlContext.createDataFrame(rdd, A.class);
df.write().parquet(path);
DataFrame newDataDF = sqlContext.read().parquet(path);
newDataDF.show();
when i try to run this, it throws an error. what am I missing here? Do I need to provide a schema for the whole class while creating data frames
error:
Caused by: scala.MatchError: B(Id=abc, count=0) (of class B)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:169)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:153)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
at org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1358)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1358)
at org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1.apply(SQLContext.scala:1356)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:263)
... 8 more

You are getting error because nested JavaBeans are not supported in Spark 1.6 version. Please see https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#inferring-the-schema-using-reflection
Currently, Spark SQL does not support JavaBeans that contain nested or contain complex types such as Lists or Arrays.

Related

ClassCastException with Stacktrace Hazelcast version 4.2.5 using ReplicatedMap

Using Hazelcast version 4.2.5 in a webapp deployed on Tomcat on Kubernetes. We're frequently("every 5 seconds") seeing ClassCastException with a stacktrace in the application logs.
Here's the ClassCastException :
java.lang.ClassCastException: class java.lang.String cannot be cast to class com.hazelcast.internal.serialization.impl.HeapData (java.lang.String is in module java.base of loader 'bootstrap'; com.hazelcast.internal.serialization.impl.HeapData is in unnamed module of loader org.apache.catalina.loader.ParallelWebappClassLoader #2f04993d)
27-Oct-2022 22:57:56.357 WARNING [hz.rogueUsers.cached.thread-2] com.hazelcast.internal.metrics.impl.MetricsCollectionCycle.null Collecting metrics from source com.hazelcast.replicatedmap.impl.ReplicatedMapService failed
at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:102)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
at java.base/java.lang.Thread.run(Thread.java:834)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at com.hazelcast.internal.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:217)
at com.hazelcast.spi.impl.executionservice.impl.DelegateAndSkipOnConcurrentExecutionDecorator$DelegateDecorator.run(DelegateAndSkipOnConcurrentExecutionDecorator.java:77)
at com.hazelcast.internal.metrics.impl.MetricsService.collectMetrics(MetricsService.java:154)
at com.hazelcast.internal.metrics.impl.MetricsService.collectMetrics(MetricsService.java:160)
at com.hazelcast.internal.metrics.impl.MetricsRegistryImpl.collect(MetricsRegistryImpl.java:316)
at com.hazelcast.internal.metrics.impl.MetricsCollectionCycle.collectDynamicMetrics(MetricsCollectionCycle.java:88)
at com.hazelcast.replicatedmap.impl.ReplicatedMapService.provideDynamicMetrics(ReplicatedMapService.java:387)
at com.hazelcast.replicatedmap.impl.ReplicatedMapService.getStats(ReplicatedMapService.java:357)
at com.hazelcast.replicatedmap.impl.ReplicatedMapService.getLocalReplicatedMapStats(ReplicatedMapService.java:197)
at com.hazelcast.replicatedmap.impl.LocalReplicatedMapStatsProvider.getLocalReplicatedMapStats(LocalReplicatedMapStatsProvider.java:85)
Here's how we're setting up Hazelcast.
private static HazelcastInstance setupHazelcastConfig() {
Config config = new Config();
config.setInstanceName("rogueUsers");
NetworkConfig network = config.getNetworkConfig();
network.setPort(5701).setPortCount(20);
network.setPortAutoIncrement(true);
JoinConfig join = network.getJoin();
join.getMulticastConfig().setEnabled(true);
// join.getTcpIpConfig()
// .setEnabled(true);
HazelcastInstance hz = Hazelcast.getOrCreateHazelcastInstance(config);
ReplicatedMapConfig replicatedMapConfig =
config.getReplicatedMapConfig("rogueUsers");
replicatedMapConfig.setInMemoryFormat(InMemoryFormat.BINARY);
replicatedMapConfig.setAsyncFillup(true);
replicatedMapConfig.setStatisticsEnabled(true);
replicatedMapConfig.setSplitBrainProtectionName("splitbrainprotection-name");
ReplicatedMap<String, String> map = hz.getReplicatedMap("rogueUsers");
map.addEntryListener(new RogueEntryListener());
return hz;
}
Is this a configuration issue ?
How do I fix this ?
Thanks very much,
The exception is being thrown from the following line:
if (isBinary) {
memoryUsage += ((HeapData) record.getValueInternal()).getHeapCost(); <-- exception
}
which is line 85 of com.hazelcast.replicatedmap.impl.LocalReplicatedMapStats class. The condition being checked is as the following:
boolean isBinary = (replicatedMapConfig.getInMemoryFormat() == InMemoryFormat.BINARY);
so basically, it is related to the format you are saving the data (from the config above you have chosen BINARY).
However, I don't think you are following it correctly since you do the following: ReplicatedMap<String, String> map = hz.getReplicatedMap("rogueUsers"); in your config.
From the Javadoc of com.hazelcast.internal.serialization.Data class:
Data is basic unit of serialization. It stores binary form of an object serialized by SerializationService.toData(Object).
Therefore, try editing your config to this:
ReplicatedMap<Data, Data> map = hz.getReplicatedMap("rogueUsers");

How to create a Spark UDF in Java which accepts array of Strings?

This question has been asked here for Scala, and it does not help me as I am working with Java API. I have been literally throwing everything and the kitchen sink at it, so this was my approach:
List<String> sourceClasses = new ArrayList<String>();
//Add elements
List<String> targetClasses = new ArrayList<String>();
//Add elements
dataset = dataset.withColumn("Transformer", callUDF(
"Transformer",
lit((String[])sourceClasses.toArray())
.cast(DataTypes.createArrayType(DataTypes.StringType)),
lit((String[])targetClasses.toArray())
.cast(DataTypes.createArrayType(DataTypes.StringType))
));
And for my UDF declaration:
public class Transformer implements UDF2<Seq<String>, Seq<String>, String> {
// #SuppressWarnings("deprecation")
public String call(Seq<String> sourceClasses, Seq<String> targetClasses)
throws Exception {
When I run the code, the execution does not proceed past the UDF call, which is expected because I am not being able to match up the types. Please help me in this regard.
EDIT
I tried the solution suggested by #Oli. However, I got the following exception:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$261: (array<string>, array<string>) => string)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.immutable.Seq
at com.esrx.dqm.uuid.UUIDTransformerEngine$1.call(UUIDTransformerEngine.java:1)
at org.apache.spark.sql.UDFRegistration$$anonfun$261.apply(UDFRegistration.scala:774)
... 22 more
This line specifically seems to be indicative of a problem:
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.immutable.Seq
From what I understand from the type of your UDF, you are trying to create a UDF that takes two arrays as inputs and returns a string.
In java, that's a bit painful but manageable.
Let's say that you want to join both arrays and link them with the word AND. You could define the UDF as follows:
UDF2 my_udf2 = new UDF2<WrappedArray<String>, WrappedArray<String>, String>() {
public String call(WrappedArray<String> a1, WrappedArray a2) throws Exception {
ArrayList<String> l1 = new ArrayList(JavaConverters
.asJavaCollectionConverter(a1)
.asJavaCollection());
ArrayList<String> l2 = new ArrayList(JavaConverters
.asJavaCollectionConverter(a2)
.asJavaCollection());
return l1.stream().collect(Collectors.joining(",")) +
" AND " +
l2.stream().collect(Collectors.joining(","));
}
};
Note that you need to use scala WrappedArray in the signature in the method and transform them in the body of the method with JavaConverters to be able to manipulate them in Java. Here are the required import just in case.
import scala.collection.mutable.WrappedArray;
import scala.collection.JavaConverters;
Then you can register the UDF can use it with Spark. To be able to use it, I created a sample dataframe and two dummy arrays from the 'id' column. Note that it can also work with the lit function as you were trying to do in your question.
spark.udf().register("my_udf2", my_udf2, DataTypes.StringType);
String[] data = {"abcd", "efgh", "ijkl"};
spark.range(3)
.withColumn("id", col("id").cast("string"))
.withColumn("array", functions.array(col("id"), col("id")))
.withColumn("string_of_arrays",
functions.callUDF("my_udf2", col("array"), lit(data)))
.show(false);
which yields:
+---+------+----------------------+
|id |array |string_of_arrays |
+---+------+----------------------+
|0 |[0, 0]|0,0 AND abcd,efgh,ijkl|
|1 |[1, 1]|1,1 AND abcd,efgh,ijkl|
|2 |[2, 2]|2,2 AND abcd,efgh,ijkl|
+---+------+----------------------+
In Spark >= 2.3, you could also do it like this:
UserDefinedFunction my_udf2 = udf(
(WrappedArray<String> s1, WrappedArray<String> s2) -> "some_string",
DataTypes.StringType
);
df.select(my_udf2.apply(col("a1"), col("a2")).show(false);

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due to data type mismatch:

I ran the spark application where i joined two datsets and formed one dataset,and using an Encoder I converted Dataset<Row> into Dataset<T> format.
Encoder looks as follows:
Encoder<RuleParamsBean> encoder = Encoders.bean(RuleParamsBean.class);
Dataset<RuleParamsBean> ds = new Dataset<RuleParamsBean>(sparkSession, finalJoined.logicalPlan(), encoder);
Dataset<RuleParamsBean> validateDataset = ds.map(rulesParamBean -> validateTransaction(rulesParamBean),encoder);
validateDataset.show();
And after the map operation over the dataset i am getting the error as follows:
Dataset<RuleParamsBean> ds = new Dataset<RuleParamsBean>(sparkSession, finalJoined.logicalPlan(), encoder);
Error Log
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due to data type mismatch: input to function named_struct requires at least one argument;;
Relation[TXN_DETAIL_ID#0,TXN_HEADER_ID#1,TXN_SOURCE_CD#2,TXN_REC_TYPE_CD#3,TXN_DTTM#4,EXT_TXN_NBR#5,CUST_REF_NBR#6,CIS_DIVISION#7,ACCT_ID#8,TXN_VOL#9,TXN_AMT#10,CURRENCY_CD#11,MANUAL_SW#12,USER_ID#13,HOW_TO_USE_TXN_FLG#14,MESSAGE_CAT_NBR#15,MESSAGE_NBR#16,UDF_CHAR_1#17,UDF_CHAR_2#18,UDF_CHAR_3#19,UDF_CHAR_4#20,UDF_CHAR_5#21,UDF_CHAR_6#22,UDF_CHAR_7#23,... 102 more fields] JDBCRelation(CI_TXN_DETAIL_STG_DUMMY) [numPartitions=1]
Relation[ACCT_ID#377,ACCT_NBR_TYPE_CD#378,ACCT_NBR#379,VERSION#380,PRIM_SW#381] JDBCRelation(CI_ACCT_NBR_DUMMY) [numPartitions=1]
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
at org.apache.spark.sql.Dataset.map(Dataset.scala:2569)
at com.sample.Transformation.main(Transformation.java:100)
For me the issue was because of unsupported type. I was using LocalDate which was not supported in Spark 2.X I guess.(I think they included support for it in version 3)
I simply changed it from LocalDate to TimeStamp and it worked. Look if this is the case for you as well? Any type which is in your POJO which is not supported?

Java Spark : Spark Bug Workaround for Datasets Joining with unknow Join Column Names

I am using Spark 2.3.1 with Java.
I have encountered what (I think), is this known bug of Spark.
Here is my code :
public Dataset<Row> compute(Dataset<Row> df1, Dataset<Row> df2, List<String> columns){
Seq<String> columns_seq = JavaConverters.asScalaIteratorConverter(columns.iterator()).asScala().toSeq();
final Dataset<Row> join = df1.join(df2, columns_seq);
join.show()
join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
return join;
}
I call my code like this :
Dataset<Row> myNewDF = compute(MyDataset1, MyDataset2, Arrays.asList("field1","field2","field3","field4"));
Note : MyDataset1 and MyDataset2 are two datasets that come from the same Dataset MyDataset0 with multiple different transformations.
On the join.show() line, I get the following error :
2018-08-03 18:48:43 - ERROR main Logging$class - - - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
...
2018-08-03 18:48:47 - WARN main Logging$class - - - Whole-stage codegen disabled for plan (id=7):
But it does not stop the execution and still displays the content of the dataset.
Then, on the line join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
I get the error :
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) 'value2,'value1 missing from field6#16,field7#3,field8#108,field5#0,field9#4,field10#28,field11#323,value1#298,field12#131,day#52,field3#119,value2#22,field2#35,field1#43,field4#144 in operator 'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]. Attribute(s) with the same name appear in the operation: value2,value1. Please check if the right attribute(s) are used.;;
'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]
+- AnalysisBarrier
...
This error end the program.
The workaround proposed Mijung Kim on the Jira Issue is to create a Dataset clone thanks to toDF(Columns). But in my case, where the column names used for the join are not known in advance (I only have a List), I can't use this workaround.
Is there another way to get around this very annoying bug ?
Try to call this method:
private static Dataset<Row> cloneDataset(Dataset<Row> ds) {
List<Column> filterColumns = new ArrayList<>();
List<String> filterColumnsNames = new ArrayList<>();
scala.collection.Iterator<StructField> it = ds.exprEnc().schema().toIterator();
while (it.hasNext()) {
String columnName = it.next().name();
filterColumns.add(ds.col(columnName));
filterColumnsNames.add(columnName);
}
ds = ds.select(JavaConversions.asScalaBuffer(filterColumns).seq()).toDF(scala.collection.JavaConverters.asScalaIteratorConverter(filterColumnsNames.iterator()).asScala().toSeq());
return ds;
}
on both datasets just before the join like this :
df1 = cloneDataset(df1);
df2 = cloneDataset(df2);
final Dataset<Row> join = df1.join(df2, columns_seq);
// or ( based on Nakeuh comment )
final Dataset<Row> join = cloneDataset(df1.join(df2, columns_seq));

java lang Class Cast Exception

I have written a code which can reduce the grammatical boundaries for a text, but when I run the program this exception comes up
java.lang.ClassCastException
here is the class that i run,
public class paerser {
public static void main (String [] arg){
LexicalizedParser lp = new LexicalizedParser("grammar/englishPCFG.ser.gz");
lp.setOptionFlags("-maxLength", "500", "-retainTmpSubcategories");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
String text = "John, who was the CEO of a company, played golf.";
edu.stanford.nlp.trees.Tree parse = lp.apply(Arrays.asList(text));
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
}
}
Updated,
here is the full stack trace ...
Loading parser from serialized file grammar/englishPCFG.ser.gz ... done [1.5 sec].
Following exception caught during parsing:
java.lang.ClassCastException: java.lang.String cannot be cast to edu.stanford.nlp.ling.HasWord
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:346)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:386)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.apply(LexicalizedParser.java:304)
at paerser.main(paerser.java:19)
Recovering using fall through strategy: will construct an (X ...) tree.
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to edu.stanford.nlp.ling.HasWord
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.apply(LexicalizedParser.java:317)
at paerser.main(paerser.java:19)
Stacktrace shows that ExhaustivePCFGParser's parse method is being used. It expects a List of HasWord objects. You are passing a list of String. Hence, the exception.
public boolean parse(List<? extends HasWord> sentence) { // ExhaustivePCFGParser

Categories

Resources