I am building a project modeled on this project. The key difference is that I want to conditionally output a message built from the messages of the joined topics, whereas the example project performs an aggregation. I am struggling to use Serde for JSON messages, so I have simplified the message structure as follows.
t1 (KStream) - a plain text value.
t2 (KTable) - a plain text value separated by a ;.
t3 (KStream) - a CSV string.
I am publishing messages using kafkacat with the -k option to set a key, e.g. k1. The problem I am facing is that I don't see any output in t3.
This is my TopologyProducer.java.
    @Produces
    public Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // JSON serdes for the input stream, the lookup table, and the output
        ObjectMapperSerde<stream1> stream1Serde = new ObjectMapperSerde<>(stream1.class);
        ObjectMapperSerde<topic1> topic1Serde = new ObjectMapperSerde<>(topic1.class);
        ObjectMapperSerde<output1> output1Serde = new ObjectMapperSerde<>(output1.class);

        // t2 is read as a global table so it can be joined by key
        GlobalKTable<String, topic1> topic1Table = builder.globalTable(
                t2,
                Consumed.with(Serdes.String(), topic1Serde));

        builder.stream(t1,
                        Consumed.with(Serdes.String(), stream1Serde))
                // join against the table itself, not the topic name
                .join(topic1Table,
                        (paramName, paramValue) -> paramName,
                        (paramValue, paramLimits) -> {
                            // Add some logic to return conditionally
                            return new output1("paramName", 0.0, 0.0, true);
                        })
                .to(t3,
                        Produced.with(Serdes.String(), output1Serde));

        return builder.build();
    }
}
The Java version I had in my Dockerfile was wrong.
When I inspected the container logs, I saw an error about a mismatch between the Java version used to compile (higher) and the one used to run (lower). I chose the simpler of the two fixes, i.e. running the application on a more recent version of Java rather than adjusting the Java version of the local mvn build. The error can be traced to the following instruction, as documented here.
The Dockerfile created by Quarkus by default needs one adjustment for the aggregator application in order to run the Kafka Streams pipeline. To do so, edit the file aggregator/src/main/docker/Dockerfile.jvm and replace the line FROM fabric8/java-alpine-openjdk8-jre with FROM fabric8/java-centos-openjdk8-jdk.
I edited my Dockerfile to use FROM registry.access.redhat.com/ubi8/openjdk-17:1.11 and have the application running.
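For anyone hitting the same kind of mismatch, here is a minimal sketch of how the two versions relate. It assumes the error reports a class-file major version (52 for Java 8, 55 for Java 11, 61 for Java 17); the class and method names are made up for illustration and are not part of Quarkus or Kafka Streams.

```java
class ClassVersionCheck {

    // Class-file major version to Java release: 52 -> 8, 55 -> 11, 61 -> 17.
    static int javaReleaseForMajor(int classFileMajor) {
        return classFileMajor - 44;
    }

    // Java release of the JVM running this code, e.g. 8 or 17.
    static int runtimeRelease() {
        // "1.8" on Java 8, "11", "17", ... on later releases
        String spec = System.getProperty("java.specification.version");
        return spec.startsWith("1.")
                ? Integer.parseInt(spec.substring(2))
                : Integer.parseInt(spec);
    }

    // True if classes compiled to the given major version can be loaded here.
    static boolean canRun(int classFileMajor) {
        return runtimeRelease() >= javaReleaseForMajor(classFileMajor);
    }

    public static void main(String[] args) {
        int compiledMajor = 61; // hypothetical: classes built with Java 17
        System.out.println("Runtime is Java " + runtimeRelease());
        System.out.println("Classes need Java " + javaReleaseForMajor(compiledMajor));
        System.out.println("Can run: " + canRun(compiledMajor));
    }
}
```

Running this inside the container image shows quickly whether the base image is older than the build, which is the situation described above.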
Currently I am running into an issue but do not understand why it is happening. I have implemented a Java function which uses the Databricks Autoloader to readStream all parquet files from an Azure Blob Storage and "write" them into a dataframe (a Dataset, because it is written in Java). The code is executed from a Jar which I build in Java and run as a Job on a Shared Cluster.
Code:
Dataset<Row> newdata= spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
// .option("cloudFiles.useNotifications", "true")
.schema(dfsample.schema()).option("cloudFiles.includeExistingFiles", "true").load(filePath);
newdata.show();
But unfortunately I get the following exception:
WARN SQLExecution: Error executing delta metering
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
cloudFiles
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:447)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
What puzzles me is that exactly the same code runs fine inside a Databricks notebook written in Scala:
val df1 = spark.readStream.format("cloudFiles").option("cloudFiles.useNotifications", "true").option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
.option("cloudFiles.useNotifications", "true")
.option("cloudFiles.subscriptionId", storagesubscriptionid).schema(df_schema).option("cloudFiles.includeExistingFiles", "false").load(filePath);
display(df1);
I expect a Dataset object containing all the new data from the blobstorage parquet files in schema: id1:int, id2:int, content:binary
So finally, I have found a way to get Autoloader working inside my Java Jar.
As Vincent already commented, you have to combine readStream with a writeStream.
So I am simply writing the files which have been detected by the Autoloader to an Azure Data Lake.
spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", STORAGE_SUBSCRIPTION_ID)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", SP_TENANT_ID)
.option("cloudFiles.clientId", SP_APPLICATION_ID)
.option("cloudFiles.clientSecret", SP_CLIENT_SECRET)
.option("cloudFiles.resourceGroup", STORAGE_RESOURCE_GROUP)
.option("cloudFiles.connectionString", STORAGE_SAS_CONNECTION_STRING)
.option("cloudFiles.includeExistingFiles", "true")
.option("cloudFiles.useNotifications", "true")
.schema(DF_SCHEMA)
.load(BLOB_STORAGE_LANDING_ZONE_PATH)
.writeStream()
.format("delta")
.option("checkpointLocation", DELTA_TABLE_RAW_DATA_CHECKPOINT_PATH)
.option("mergeSchema", "true")
.trigger(Trigger.Once())
.outputMode("append")
.start(DELTA_TABLE_RAW_DATA_PATH).awaitTermination();
This works fine with Java when you need to run a Jar as a Databricks Job.
But to be honest, I am still wondering why, from inside a notebook, I don't have to use writeStream in Scala to receive new files from the Autoloader.
I was trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a standalone mongod instance. I was looking at the MongoDB Spark docs, which mention various partitioner classes. I was trying to use the MongoSamplePartitioner class, but it reads into just 1 partition. The MongoPaginateByCountPartitioner class likewise always partitions into a fixed 66 partitions. This happens even when I configure "samplesPerPartition" and "numberOfPartitions" respectively in these two cases. I need to use a readConfig created via a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
.config("spark.driver.memory", "2g")
.config("spark.driver.host", "127.0.0.1")
.master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); //24576
System.out.println(dataset.rdd().getNumPartitions()); //66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS: I am reading 24576 records, with mongod v4.0.10, mongo spark connector 2.3.1, and Java 8.
Edit:
I got it to work; I needed to give the properties in the map in the form partitionerOptions.samplesPerPartition. But I am still facing an issue: with partitionerOptions.samplesPerPartition set to "1000", MongoSamplePartitioner only returns 1 partition. Any suggestions?
The number of partitions can be configured for MongoPaginateByCountPartitioner.
Suppose we need to set the target number of partitions to 16.
Add partitionerOptions.numberOfPartitions -> 16 to the properties, rather than only numberOfPartitions -> 16.
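As a minimal sketch of the prefixed key, the following builds the override map using only java.util collections. The class and method names (PartitionerOverrides, paginateByCount) are hypothetical and not part of the connector; the resulting map is what would be handed to ReadConfig.create(...) in the question's code.

```java
import java.util.HashMap;
import java.util.Map;

class PartitionerOverrides {

    // Partitioner-specific settings go under this prefix, as described above.
    static final String PREFIX = "partitionerOptions.";

    // Hypothetical helper: read overrides for MongoPaginateByCountPartitioner.
    static Map<String, String> paginateByCount(String uri, int numberOfPartitions) {
        Map<String, String> overrides = new HashMap<>();
        overrides.put("uri", uri);
        overrides.put("partitioner", "MongoPaginateByCountPartitioner");
        // Prefixed key: a bare "numberOfPartitions" had no effect in the question.
        overrides.put(PREFIX + "numberOfPartitions", String.valueOf(numberOfPartitions));
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, String> overrides =
                paginateByCount("mongodb://127.0.0.1:27017/importedDb.myNewCollection", 16);
        overrides.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The same prefix applies to the samplesPerPartition setting mentioned in the question's edit.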
I am getting started with Spark and ran into an issue trying to implement the simple example for the map function. The issue is with the definition of 'parallelize' in the new version of Spark. Can someone share an example of how to use it, since the following way gives an error about insufficient arguments?
Spark Version : 2.3.2
Java : 1.8
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers").config("spark.master","local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
JavaRDD<Integer> numRDD = context.parallelize(seqNumList, 2);
Compile-time error message: The method expects 3 arguments
I do not get what the 3rd argument should look like. As per the documentation, it's supposed to be
scala.reflect.ClassTag<T>
But how do I even define or use it?
Please do not suggest using JavaSparkContext, as I want to know how to get this approach to work with the generic SparkContext.
Ref : https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#parallelize-scala.collection.Seq-int-scala.reflect.ClassTag-
Here is the code which finally worked for me. It is not the best way to achieve the result, but it was a way for me to explore the API.
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers")
.config("spark.master", "local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
RDD<Integer> numRDD = context
.parallelize(JavaConverters.asScalaIteratorConverter(seqNumList.iterator()).asScala()
.toSeq(), 2, scala.reflect.ClassTag$.MODULE$.apply(Integer.class));
numRDD.toJavaRDD().foreach(x -> System.out.println(x));
session.stop();
If you don't want to deal with providing the two extra parameters when using SparkContext, you can also use JavaSparkContext.parallelize(), which only needs an input list:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
final RDD<Integer> rdd = jsc.parallelize(seqNumList).map(num -> {
    // your implementation; the lambda must return a value, e.g. the square
    return num * num;
}).rdd();
I'm quite new to Spark and was looking for some guidance :-)
What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via GET request to my server.
My initial approach was to open the context and implement the transformations/ computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core. The problem is that:
The uber jar weighs 80 MB.
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I can have a go with shade to solve it but have the feeling this is not the way to go.
Maybe the "provided" scope in Maven would help me but I'm using ant.
Should my application, as suggested in the page, already have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would leave the results somewhere.
Am I missing any middle-of-the-road approach?
Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions and a lot of other things to consider.
I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.
Mist supports Scala and Python jobs execution.
The quick start is as follows:
Add Mist wrapper into your Spark job:
Scala example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Python example:
import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
Run Mist service:
Build the Mist
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark
Create configuration file
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003
mist.mqtt.on = false
mist.recovery.on = false
mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false
mist.contextDefaults.sparkConf = {
spark.default.parallelism = 128
spark.driver.memory = "10g"
spark.scheduler.mode = "FAIR"
}
Run
spark-submit --class io.hydrosphere.mist.Mist \
    --driver-java-options "-Dconfig.file=/path/to/application.conf" \
    target/scala-2.10/mist-assembly-0.2.0.jar
Try curl from terminal:
curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'
I'm trying to write a computation in Flink which requires two phases.
In the first phase I start from a text file, and perform some parameter estimation, obtaining as a result a Java object representing a statistical model of the data.
In the second phase, I'd like to use this object to generate data for a simulation.
I'm unsure how to do this. I tried with a LocalCollectionOutputFormat, and it works locally, but when I deploy the job on a cluster, I get a NullPointerException - which is not really surprising.
What is the Flink way of doing this?
Here is my code:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
GlobalConfiguration.includeConfiguration(configuration);
// Phase 1: read file and estimate model
DataSource<Tuple4<String, String, String, String>> source = env
.readCsvFile(args[0])
.types(String.class, String.class, String.class, String.class);
List<Tuple4<Bayes, Bayes, Bayes, Bayes>> bayesResult = new ArrayList<>();
// Processing here...
....output(new LocalCollectionOutputFormat<>(bayesResult));
env.execute("Bayes");
DataSet<BTP> btp = env
.createInput(new BayesInputFormat(bayesResult.get(0)))
// Phase 2: BayesInputFormat generates data for further calculations
// ....
This is the exception I get:
Error: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
at org.apache.flink.client.program.Client.run(Client.java:328)
at org.apache.flink.client.program.Client.run(Client.java:294)
at org.apache.flink.client.program.Client.run(Client.java:288)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55)
at it.list.flink.test.Test01.main(Test01.java:62)
...
With the latest release (0.9-milestone-1) a collect() method was added to Flink
public List<T> collect()
which fetches a DataSet<T> as a List<T> to the driver program. collect() also triggers an immediate execution of the program, so you don't need to call ExecutionEnvironment.execute(). Right now, there is a size limitation of about 10 MB for data sets.
If you do not evaluate the models in the driver program, you can also chain both programs together and emit the model to the side by attaching a data sink. This will be more efficient, because the data won't do the round-trip over the client machine.
If you're using Flink prior to 0.9 you may use the following snippet to collect your dataset to a local collection:
val dataJavaList = new ArrayList[K]
val outputFormat = new LocalCollectionOutputFormat[K](dataJavaList)
dataset.output(outputFormat)
env.execute("collect()")
where K is the type of object you want to collect.