I'm trying to write a computation in Flink which requires two phases.
In the first phase I start from a text file, and perform some parameter estimation, obtaining as a result a Java object representing a statistical model of the data.
In the second phase, I'd like to use this object to generate data for a simulation.
I'm unsure how to do this. I tried using a LocalCollectionOutputFormat, and it works locally, but when I deploy the job on a cluster I get a NullPointerException, which is not really surprising.
What is the Flink way of doing this?
Here is my code:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
GlobalConfiguration.includeConfiguration(configuration);
// Phase 1: read file and estimate model
DataSource<Tuple4<String, String, String, String>> source = env
.readCsvFile(args[0])
.types(String.class, String.class, String.class, String.class);
List<Tuple4<Bayes, Bayes, Bayes, Bayes>> bayesResult = new ArrayList<>();
// Processing here...
....output(new LocalCollectionOutputFormat<>(bayesResult));
env.execute("Bayes");
DataSet<BTP> btp = env
.createInput(new BayesInputFormat(bayesResult.get(0)))
// Phase 2: BayesInputFormat generates data for further calculations
// ....
This is the exception I get:
Error: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
at org.apache.flink.client.program.Client.run(Client.java:328)
at org.apache.flink.client.program.Client.run(Client.java:294)
at org.apache.flink.client.program.Client.run(Client.java:288)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55)
at it.list.flink.test.Test01.main(Test01.java:62)
...
With the latest release (0.9-milestone-1), a collect() method was added to Flink:
public List<T> collect()
which fetches a DataSet<T> as a List<T> to the driver program. collect() also triggers immediate execution of the program (there is no need to call ExecutionEnvironment.execute()). Right now, there is a size limitation of about 10 MB for collected data sets.
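For example, here is a sketch of how the two phases could be wired together with collect(); Bayes, BayesInputFormat and BTP are the question's own types, and estimatedModels is a placeholder for the DataSet produced by the question's elided processing:
// Phase 1: collect() ships the model to the driver and triggers execution,
// replacing ....output(new LocalCollectionOutputFormat<>(bayesResult)) and env.execute("Bayes").
List<Tuple4<Bayes, Bayes, Bayes, Bayes>> bayesResult = estimatedModels.collect();
// Phase 2: build the simulation from the collected model and run it as a second program.
DataSet<BTP> btp = env.createInput(new BayesInputFormat(bayesResult.get(0)));
// ... further transformations and a sink ...
env.execute("Simulation");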
If you do not evaluate the models in the driver program, you can also chain both programs together and emit the model to the side by attaching a data sink. This will be more efficient, because the data won't do a round trip through the client machine.
If you're using Flink prior to 0.9, you can use the following snippet to collect your DataSet into a local collection:
val dataJavaList = new ArrayList[K]
val outputFormat = new LocalCollectionOutputFormat[K](dataJavaList)
dataset.output(outputFormat)
env.execute("collect()")
where K is the type of the objects you want to collect.
I am building a project modeled on this project. The key difference is that I want to conditionally output a message based on the messages from the joined topics, as opposed to the example project, where an aggregation is performed. I am struggling to use Serde for JSON messages, so I have simplified the message structure as follows.
t1 (KStream) - a plain text value.
t2 (KTable) - a plain text value separated by a ;.
t3 (KStream) - a CSV string.
I am publishing messages using kafkacat with the -k option to set a key, e.g. k1. The problem I am facing is that I don't see any output in t3.
This is my TopologyProducer.java.
@Produces
public Topology buildTopology() {
    StreamsBuilder builder = new StreamsBuilder();

    ObjectMapperSerde<stream1> stream1Serde = new ObjectMapperSerde<>(stream1.class);
    ObjectMapperSerde<topic1> topic1Serde = new ObjectMapperSerde<>(topic1.class);
    ObjectMapperSerde<output1> output1Serde = new ObjectMapperSerde<>(output1.class);

    GlobalKTable<String, topic1> table = builder.globalTable(
            t2,
            Consumed.with(Serdes.String(), topic1Serde));

    builder.stream(t1, Consumed.with(Serdes.String(), stream1Serde))
            .join(table,
                    (paramName, paramValue) -> paramName,
                    (paramValue, paramLimits) -> {
                        // Add some logic to return conditionally
                        return new output1("paramName", 0.0, 0.0, true);
                    })
            .to(t3, Produced.with(Serdes.String(), output1Serde));

    return builder.build();
}
}
The Java version I had in my Dockerfile was wrong.
When I inspected the container logs, I saw an error about the mismatch between the Java version used to compile the application (higher) and the one used to run it (lower). I chose the simpler of the two fixes, i.e. I used a more recent Java version to run the application (rather than adjusting the Java version for the local mvn build). The error can be traced to the following instruction, as documented here:
The Dockerfile created by Quarkus by default needs one adjustment for the aggregator application in order to run the Kafka Streams pipeline. To do so, edit the file aggregator/src/main/docker/Dockerfile.jvm and replace the line FROM fabric8/java-alpine-openjdk8-jre with FROM fabric8/java-centos-openjdk8-jdk.
I edited my Dockerfile to use FROM registry.access.redhat.com/ubi8/openjdk-17:1.11 and now have the application running.
Using a standalone MongoDB instance in version 4.4.1 with a Java client that connects using the latest driver (org.mongodb:mongodb-driver-sync:4.1.1), I am getting an error when calling findOneAndUpdate with the $setOnInsert operator.
Here is the query used:
final List<Bson> updates = new ArrayList<>();
updates.add(Updates.set("data", "test"));
updates.add(Updates.setOnInsert("firstSeenTime", new Date()));
final Document updatedDocument = this.visitorsCollection.findOneAndUpdate(
        eq("userId", "u1"),
        updates,
        new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER).upsert(true));
The error:
Exception in thread "main" com.mongodb.MongoCommandException: Command failed with error 40324 (Location40324): 'Unrecognized pipeline stage name: '$setOnInsert'' on server A.B.C.D:XXXXX. The full response is {"ok": 0.0, "errmsg": "Unrecognized pipeline stage name: '$setOnInsert'", "code": 40324, "codeName": "Location40324"}
    at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:175)
    at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:359)
    at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:280)
    at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:100)
    at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:490)
    at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:71)
    at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:255)
    at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:202)
    at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:118)
    at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:110)
    at com.mongodb.internal.operation.CommandOperationHelper$13.call(CommandOperationHelper.java:712)
    at com.mongodb.internal.operation.OperationHelper.withReleasableConnection(OperationHelper.java:620)
    at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableCommand(CommandOperationHelper.java:705)
    at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableCommand(CommandOperationHelper.java:697)
    at com.mongodb.internal.operation.BaseFindAndModifyOperation.execute(BaseFindAndModifyOperation.java:69)
    at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:195)
    at com.mongodb.client.internal.MongoCollectionImpl.executeFindOneAndUpdate(MongoCollectionImpl.java:785)
    at com.mongodb.client.internal.MongoCollectionImpl.findOneAndUpdate(MongoCollectionImpl.java:765)
If I get rid of the Updates.setOnInsert(...) call, then the update works but not as I would like. My purpose is to set some fields based on whether the document to update exists or not. Looking at the documentation, $setOnInsert should be supported:
https://docs.mongodb.com/manual/reference/operator/update/#id1
Any idea about what is wrong?
The problem here is that there are two forms of findOneAndUpdate. The second argument can be either:
a document containing update operator expressions
an array containing $set, $unset, and $replaceRoot aggregation stages
Since you are creating updates as an ArrayList, findOneAndUpdate tries to process it as an aggregation pipeline, which does not recognize a $setOnInsert stage.
You need to build the update as a Document for the update operators to be recognized. Following your example, you can simply wrap the list with Updates.combine(updates) and pass the result to findOneAndUpdate as the second parameter.
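A minimal sketch of that fix, reusing the question's visitorsCollection and filter (Updates.combine returns a single Bson update document, so the driver treats it as update operators rather than as a pipeline):
final Bson update = Updates.combine(
        Updates.set("data", "test"),
        Updates.setOnInsert("firstSeenTime", new Date()));

final Document updatedDocument = this.visitorsCollection.findOneAndUpdate(
        eq("userId", "u1"),
        update,
        new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER).upsert(true));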
I'm trying to create a pipeline that streams data from a Kafka topic to Google's BigQuery. The data in the topic is in Avro.
I call the apply function three times: once to read from Kafka, once to extract the record, and once to write to BigQuery. Here is the main part of the code:
pipeline
.apply("Read from Kafka",
KafkaIO
.<byte[], GenericRecord>read()
.withBootstrapServers(options.getKafkaBrokers().get())
.withTopics(Utils.getListFromString(options.getKafkaTopics()))
.withKeyDeserializer(
ConfluentSchemaRegistryDeserializerProvider.of(
options.getSchemaRegistryUrl().get(),
options.getSubject().get())
)
.withValueDeserializer(
ConfluentSchemaRegistryDeserializerProvider.of(
options.getSchemaRegistryUrl().get(),
options.getSubject().get()))
.withoutMetadata()
)
.apply("Extract GenericRecord",
MapElements.into(TypeDescriptor.of(GenericRecord.class)).via(KV::getValue)
)
.apply(
"Write data to BQ",
BigQueryIO
.<GenericRecord>write()
.optimizedWrites()
.useBeamSchema()
.useAvroLogicalTypes()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
//Temporary location to save files in GCS before loading to BQ
.withCustomGcsTempLocation(options.getGcsTempLocation())
.withNumFileShards(options.getNumShards().get())
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.withMethod(FILE_LOADS)
.withTriggeringFrequency(Utils.parseDuration(options.getWindowDuration().get()))
.to(new TableReference()
.setProjectId(options.getGcpProjectId().get())
.setDatasetId(options.getGcpDatasetId().get())
.setTableId(options.getGcpTableId().get()))
);
When running, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Extract GenericRecord/Map/ParMultiDo(Anonymous).output [PCollection]. Correct one of the following root causes: No Coder has been manually specified; you may do so using .setCoder().
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.avro.generic.GenericRecord.
Building a Coder using a registered CoderProvider failed.
How do I set the coder to properly read Avro?
There are at least three approaches to this:
Set the coder inline:
pipeline.apply("Read from Kafka", ....)
        .apply("Dropping key", Values.create())
        .setCoder(AvroCoder.of(schemaOfGenericRecord)) // the Avro Schema of the records
        .apply("Write data to BQ", ....);
Note that the key is dropped because it is unused; with this, you won't need MapElements any more.
Register the coder in the pipeline's instance of CoderRegistry:
pipeline.getCoderRegistry().registerCoderForClass(GenericRecord.class, AvroCoder.of(genericSchema)); // genericSchema is the records' Avro Schema
Get the coder from the schema registry via:
ConfluentSchemaRegistryDeserializerProvider.getCoder(CoderRegistry registry)
https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/kafka/ConfluentSchemaRegistryDeserializerProvider.html#getCoder-org.apache.beam.sdk.coders.CoderRegistry-
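For the third approach, here is a minimal sketch of how it could look with a KafkaIO read similar to the question's (the class name, broker address, registry URL, subject and topic are placeholders; the value DeserializerProvider is built once and reused to derive the GenericRecord coder):
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.ConfluentSchemaRegistryDeserializerProvider;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class KafkaAvroCoderExample {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        ConfluentSchemaRegistryDeserializerProvider<GenericRecord> valueProvider =
                ConfluentSchemaRegistryDeserializerProvider.of(
                        "http://schema-registry:8081", "my-topic-value");

        PCollection<GenericRecord> records = pipeline
                .apply("Read from Kafka",
                        KafkaIO.<byte[], GenericRecord>read()
                                .withBootstrapServers("broker:9092")
                                .withTopic("my-topic")
                                .withKeyDeserializer(ByteArrayDeserializer.class)
                                .withValueDeserializer(valueProvider)
                                .withoutMetadata())
                .apply("Dropping key", Values.<GenericRecord>create())
                // Derive the GenericRecord coder from the schema registry provider.
                .setCoder(valueProvider.getCoder(pipeline.getCoderRegistry()));

        // ... continue with BigQueryIO.write() as in the question ...
        pipeline.run();
    }
}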
I am getting started with Spark and ran into an issue trying to implement the simple map example. The issue is with the definition of parallelize in the newer version of Spark. Can someone share an example of how to use it, since the following code gives an error about insufficient arguments?
Spark Version : 2.3.2
Java : 1.8
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers").config("spark.master","local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
JavaRDD<Integer> numRDD = context.parallelize(seqNumList, 2);
Compile-time error message: The method expects 3 arguments
I do not understand what the third argument should be. As per the documentation, it is supposed to be
scala.reflect.ClassTag<T>
But how do I even define or use it?
Please do not suggest using JavaSparkContext, as I wanted to know how to get this approach to work using the generic SparkContext.
Ref : https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#parallelize-scala.collection.Seq-int-scala.reflect.ClassTag-
Here is the code that finally worked for me. It is not the best way to achieve the result, but it was a way for me to explore the API.
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers")
.config("spark.master", "local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
RDD<Integer> numRDD = context
.parallelize(JavaConverters.asScalaIteratorConverter(seqNumList.iterator()).asScala()
.toSeq(), 2, scala.reflect.ClassTag$.MODULE$.apply(Integer.class));
numRDD.toJavaRDD().foreach(x -> System.out.println(x));
session.stop();
If you don't want to deal with providing the extra two parameters to SparkContext.parallelize, you can also use JavaSparkContext.parallelize(), which only needs an input list:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
final RDD<Integer> rdd = jsc.parallelize(seqNumList).map(num -> {
// your implementation
}).rdd();
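For completeness, here is a minimal end-to-end sketch of the squares example using JavaSparkContext (the class name and the num * num mapping are illustrative; JavaSparkContext supplies the ClassTag internally, so no extra arguments are needed):
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SquareNumbers {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("Compute Square of Numbers")
                .config("spark.master", "local")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

        List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());

        // No ClassTag needed: JavaSparkContext.parallelize only takes the list and an optional slice count.
        JavaRDD<Integer> squares = jsc.parallelize(seqNumList, 2).map(num -> num * num);
        squares.foreach(x -> System.out.println(x));

        session.stop();
    }
}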
In a Hadoop program I tried to compress the map output, so I wrote the following code:
conf.setBoolean("mapred.compress.map.output",true);
conf.setClass("mapred.map.output.compression.codec",GzipCodec.class,CompressionCodec.class);
When I ran it, I got the exception below. Does anybody know the reason?
WARN mapred.LocalJobRunner: job_local1149103367_0001
java.io.IOException: not a gzip file
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.processBasicHeader(BuiltInGzipDecompressor.java:495)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeHeaderState(BuiltInGzipDecompressor.java:256)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:185)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:72)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
at org.apache.hadoop.mapred.IFile$Reader.positionToNextRecord(IFile.java:400)
at org.apache.hadoop.mapred.IFile$Reader.nextRawKey(IFile.java:425)
at org.apache.hadoop.mapred.Merger$Segment.nextRawKey(Merger.java:323)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:613)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:558)
at org.apache.hadoop.mapred.Merger.merge(Merger.java:70)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:385)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)
Today I tested it again, and I found that if I put the two configuration lines before the Job object is created,
Job job = new Job(conf, "MyCounter");
the error happens; if I put them after it, no error occurs. Why does this happen?
Are you using MRv1 or MRv2? If you are using MRv2, then use the following job config.
config.setBoolean("mapreduce.output.fileoutputformat.compress", true);
config.setClass("mapreduce.output.fileoutputformat.compress.codec",GzipCodec.class,CompressionCodec.class);
Additionally, you can set
config.set("mapreduce.output.fileoutputformat.compress.type",CompressionType.NONE.toString());
BLOCK, NONE, and RECORD are the three compression types.
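If the goal is still to compress the intermediate map output (as in the original question) rather than the job's final output, the MRv2 property names for that are mapreduce.map.output.compress and mapreduce.map.output.compress.codec. A minimal sketch (the class name is illustrative; the job name and codec are taken from the question):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompressionExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Compress the intermediate map output (MRv2 property names).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);

        // Create the Job only after the configuration is complete: the Job constructor copies
        // the Configuration, so properties set afterwards are silently ignored, which would
        // explain why setting them after new Job(conf, ...) appeared to make the error go away.
        Job job = Job.getInstance(conf, "MyCounter");
        // ... set mapper, reducer, input/output paths and submit as usual ...
    }
}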