Apache Flink Wikipedia edit analytics with Scala - Java

I'm trying to rewrite the Wikipedia edit stream analytics example in Scala, from the Apache Flink tutorial at https://ci.apache.org/projects/flink/flink-docs-release-1.2/quickstart/run_example_quickstart.html
The code from the tutorial is:
import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;
public class WikipediaAnalysis {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());
KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
.keyBy(new KeySelector<WikipediaEditEvent, String>() {
@Override
public String getKey(WikipediaEditEvent event) {
return event.getUser();
}
});
DataStream<Tuple2<String, Long>> result = keyedEdits
.timeWindow(Time.seconds(5))
.fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
acc.f0 = event.getUser();
acc.f1 += event.getByteDiff();
return acc;
}
});
result.print();
see.execute();
}
}
Below is my attempt in Scala:
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.wikiedits.{WikipediaEditEvent, WikipediaEditsSource}
import org.apache.flink.streaming.api.windowing.time.Time
object WikipediaAnalytics extends App {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val edits = env.addSource(new WikipediaEditsSource());
val keyedEdits = edits.keyBy(event => event.getUser)
val result = keyedEdits.timeWindow(Time.seconds(5)).fold(("", 0L), (we: WikipediaEditEvent, t: (String, Long)) =>
(we.getUser, t._2 + we.getByteDiff))
}
This is more or less a word-for-word conversion to Scala, so the type of the val result should be DataStream[(String, Long)], but the actual type inferred after fold() is nowhere close.
Please help me identify what is wrong with the Scala code.
EDIT 1: I made the changes below, using the curried form of fold[R], and the type now conforms to the expected type, but I still can't work out the reason.
val result_1: (((String, Long), WikipediaEditEvent) => (String, Long)) => DataStream[(String, Long)] =
keyedEdits.timeWindow(Time.seconds(5)).fold(("", 0L))
val result_2: DataStream[(String, Long)] = result_1((t: (String, Long), we: WikipediaEditEvent ) =>
(we.getUser, t._2 + we.getByteDiff))

The problem seems to be with the fold: the Scala API's fold is curried, so you need a closing bracket after the accumulator's initial value before passing the fold function. When you fix that, the code will still fail to compile because no TypeInformation is available for WikipediaEditEvent. The easiest way to resolve that is to import more of the Flink Scala API. See below for a full example:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource
import org.apache.flink.streaming.api.windowing.time.Time
object WikipediaAnalytics extends App {
val see: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val edits = see.addSource(new WikipediaEditsSource())
val userEditsVolume: DataStream[(String, Int)] = edits
.keyBy(_.getUser)
.timeWindow(Time.seconds(5))
.fold(("", 0))((acc, event) => (event.getUser, acc._2 + event.getByteDiff))
userEditsVolume.print()
see.execute("Wikipedia User Edit Volume")
}


Azure Spring Boot function - Exception: UnsupportedOperationException: At the moment only Tuple-based function are supporting multiple arguments

I'm getting the error below while using Tuple-based input and output arguments in an Azure Spring Boot function:
[2021-08-19T10:22:51.771Z] Caused by: java.lang.UnsupportedOperationException: At the moment only Tuple-based function are supporting multiple arguments
[2021-08-19T10:22:51.776Z] at org.springframework.cloud.function.context.catalog.SimpleFunctionRegistry$FunctionInvocationWrapper.parseMultipleValueArguments(SimpleFunctionRegistry.java:879)
[2021-08-19T10:22:51.778Z] at org.springframework.cloud.function.context.catalog.SimpleFunctionRegistry$FunctionInvocationWrapper.convertMultipleOutputArgumentTypeIfNecesary(SimpleFunctionRegistry.java:1180)
[2021-08-19T10:22:51.780Z] at org.springframework.cloud.function.context.catalog.SimpleFunctionRegistry$FunctionInvocationWrapper.convertOutputIfNecessary(SimpleFunctionRegistry.java:1012)
[2021-08-19T10:22:51.781Z] at org.springframework.cloud.function.context.catalog.SimpleFunctionRegistry$FunctionInvocationWrapper.apply(SimpleFunctionRegistry.java:492)
[2021-08-19T10:22:51.784Z] at org.springframework.cloud.function.adapter.azure.FunctionInvoker.handleRequest(FunctionInvoker.java:122)
[2021-08-19T10:22:51.785Z] at com.att.trace.function.HelloHandler.execute(HelloHandler.java:49)
[2021-08-19T10:22:51.786Z] ... 16 more
Source code:
package com.att.trace.function;
import java.util.Optional;
import com.att.trace.function.model.User;
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.HttpMethod;
import com.microsoft.azure.functions.HttpRequestMessage;
import com.microsoft.azure.functions.HttpResponseMessage;
import com.microsoft.azure.functions.HttpStatus;
import com.microsoft.azure.functions.annotation.AuthorizationLevel;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.HttpTrigger;
import org.springframework.cloud.function.adapter.azure.FunctionInvoker;
import reactor.util.function.Tuple2;
import reactor.util.function.Tuple3;
import reactor.util.function.Tuples;
public class HelloHandler extends FunctionInvoker<Tuple2<String, Integer>, Tuple3<String, Boolean, Integer>> {
@FunctionName("hello")
public HttpResponseMessage execute(@HttpTrigger(name = "request", methods = { HttpMethod.GET,
HttpMethod.POST }, authLevel = AuthorizationLevel.ANONYMOUS) HttpRequestMessage<Optional<User>> request,
ExecutionContext context) {
User user = request.getBody().filter((u -> u.getName() != null))
.orElseGet(() -> new User(request.getQueryParameters().getOrDefault("name", "world")));
context.getLogger().info("Greeting user name: " + user.getName());
Tuple2<String, Integer> base = Tuples.of("one", 1);
Tuple3<String, Boolean, Integer> push_to = handleRequest(base, context);
System.out.println("push_to=" + push_to);
return request.createResponseBuilder(HttpStatus.OK).body("ok")
.header("Content-Type", "application/json").build();
}
}
package com.att.trace.function;
import java.util.function.Function;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import reactor.util.function.Tuple2;
import reactor.util.function.Tuple3;
import reactor.util.function.Tuples;
@Component
public class Hello implements Function<Tuple2<String, Integer>, Tuple3<String, Boolean, Integer>>{
private static Logger log = LoggerFactory.getLogger(Hello.class);
public Tuple3<String, Boolean, Integer> apply(Tuple2<String, Integer> objects) {
System.out.println("objects.getT1()=" + objects.getT1());
System.out.println("objects.getT2()=" + objects.getT2());
Tuple3<String, Boolean, Integer> output = Tuples.of("one",false,1);
return output;
// return mono.map(user -> new Greeting("Hello, " + user.getName() + "!\n"));
}
}
FYI, I'm using the Spring Boot starter below; it looks like I'm already on the latest version:
https://learn.microsoft.com/en-us/azure/developer/java/spring-framework/getting-started-with-spring-cloud-function-in-azure
You are probably using Tuples from the wrong package:
public class HelloHandler extends FunctionInvoker<Tuple2<String, Integer>, Tuple3<String, Boolean, Integer>> {
Make sure that you use the Tuple classes from reactor.util.function.
The other thing that doesn't make sense to me is FunctionInvoker<Tuple2<String, Integer>, Tuple3<String, Boolean, Integer>>: in the function invoker you are saying that you expect a Tuple2, but in your execute method you map a User:
execute(@HttpTrigger(name = "request", methods = { HttpMethod.GET,
HttpMethod.POST }, authLevel = AuthorizationLevel.ANONYMOUS) HttpRequestMessage<Optional<User>> request
They should match.
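For illustration only, here is a minimal single-argument sketch of what a matching pairing could look like, assuming the User model from the question (a String constructor and a getName() getter) and a plain String result. It sidesteps Tuple-based multi-argument binding altogether rather than configuring it, and the class names are just placeholders:
package com.att.trace.function;
import java.util.Optional;
import com.att.trace.function.model.User;
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.HttpMethod;
import com.microsoft.azure.functions.HttpRequestMessage;
import com.microsoft.azure.functions.HttpResponseMessage;
import com.microsoft.azure.functions.HttpStatus;
import com.microsoft.azure.functions.annotation.AuthorizationLevel;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.HttpTrigger;
import org.springframework.cloud.function.adapter.azure.FunctionInvoker;
// The FunctionInvoker type parameters line up with what handleRequest is called with.
public class HelloHandler extends FunctionInvoker<User, String> {
    @FunctionName("hello")
    public HttpResponseMessage execute(
            @HttpTrigger(name = "request", methods = { HttpMethod.GET, HttpMethod.POST },
                    authLevel = AuthorizationLevel.ANONYMOUS) HttpRequestMessage<Optional<User>> request,
            ExecutionContext context) {
        User user = request.getBody().filter(u -> u.getName() != null)
                .orElseGet(() -> new User(request.getQueryParameters().getOrDefault("name", "world")));
        // Single value in, single value out: no Tuple handling involved.
        String greeting = handleRequest(user, context);
        return request.createResponseBuilder(HttpStatus.OK).body(greeting)
                .header("Content-Type", "application/json").build();
    }
}
package com.att.trace.function;
import java.util.function.Function;
import com.att.trace.function.model.User;
import org.springframework.stereotype.Component;
@Component
public class Hello implements Function<User, String> {
    @Override
    public String apply(User user) {
        return "Hello, " + user.getName() + "!";
    }
}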

Kinesis Firehose data transformation using Java

When using a Java Lambda function to do a Kinesis Data Firehose transformation, I'm getting the error below. This is what my transformed JSON looks like:
{
"records": [
{
"recordId": "49586022990098427206724983301551059982279766660054253570000000",
"result": "Ok",
"data": "ZXlKMGFXTnJaWEpmYzNsdFltOXNJam9pVkVWVFZEY2lMQ0FpYzJWamRHOXlJam9pU0VWQlRGUklRMEZTUlNJc0lDSmphR0Z1WjJVaQ0KT2kwd0xqQTFMQ0FpY0hKcFkyVWlPamcwTGpVeGZRbz0="
}
]
}
The error in the Kinesis console is:
Invalid output structure: Please check your function and make sure the processed records contain valid result status of Dropped, Ok, or ProcessingFailed
Does anyone have an idea about this? I could not find example code in Java for Kinesis data transformation.
https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html
This document describes the required output structure.
I just got done struggling through this in Scala (Java compatible). The key is to use the return type com.amazonaws.services.lambda.runtime.events.KinesisAnalyticsInputPreprocessingResponse:
import java.nio.ByteBuffer
import com.amazonaws.services.lambda.runtime.events.KinesisAnalyticsInputPreprocessingResponse._
import com.amazonaws.services.lambda.runtime.events.{KinesisAnalyticsInputPreprocessingResponse, KinesisFirehoseEvent}
import com.amazonaws.services.lambda.runtime.{Context, LambdaLogger, RequestHandler}
import scala.collection.JavaConversions._
import scala.language.implicitConversions
class Handler extends RequestHandler[KinesisFirehoseEvent, KinesisAnalyticsInputPreprocessingResponse] {
override def handleRequest(in: KinesisFirehoseEvent, context: Context): KinesisAnalyticsInputPreprocessingResponse = {
val logger: LambdaLogger = context.getLogger
val records = in.getRecords
val transformed = records.flatMap(record => {
try {
val changed = record.getData.array()
//do some sort of transform
val rec = new Record(record.getRecordId, Result.Ok, ByteBuffer.wrap(changed))
Some(rec)
} catch {
case e: Exception => {
logger.log(e.toString)
Some(new Record(record.getRecordId, Result.Dropped, record.getData))
}
}
})
val response = new KinesisAnalyticsInputPreprocessingResponse()
response.setRecords(transformed.toList)
response
}
}
A Java example:
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisAnalyticsInputPreprocessingResponse;
import com.amazonaws.services.lambda.runtime.events.KinesisFirehoseEvent;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.RequiredArgsConstructor;
import lombok.extern.log4j.Log4j2;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import java.nio.charset.StandardCharsets;
@Log4j2
@RequiredArgsConstructor
public class FirehoseHandler implements RequestHandler<KinesisFirehoseEvent, KinesisAnalyticsInputPreprocessingResponse> {
private final ObjectMapper mapper;
@Override
public KinesisAnalyticsInputPreprocessingResponse handleRequest(KinesisFirehoseEvent kinesisFirehoseEvent, Context context) {
return Flux.fromIterable(kinesisFirehoseEvent.getRecords())
.flatMap(this::transformRecord)
.collectList()
.map(KinesisAnalyticsInputPreprocessingResponse::new)
.block();
}
private Mono<KinesisAnalyticsInputPreprocessingResponse.Record> transformRecord(KinesisFirehoseEvent.Record record) {
return Mono.just(record.getData())
.map(StandardCharsets.UTF_8::decode)
.flatMap(data -> Mono.fromCallable(() -> doYourOwnThing(data)))
.map(StandardCharsets.UTF_8::encode)
.map(data -> new KinesisAnalyticsInputPreprocessingResponse.Record(record.getRecordId(), KinesisAnalyticsInputPreprocessingResponse.Result.Ok, data))
.onErrorResume(e -> Mono.just(new KinesisAnalyticsInputPreprocessingResponse.Record(record.getRecordId(), KinesisAnalyticsInputPreprocessingResponse.Result.ProcessingFailed, record.getData())));
}
}
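If the Reactor pipeline feels like a lot of machinery for what is a per-record loop, the same response shape can be built with a plain loop. This is only a sketch against the same aws-lambda-java-events classes used above; the transform body is a pass-through placeholder and the class name is made up:
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisAnalyticsInputPreprocessingResponse;
import com.amazonaws.services.lambda.runtime.events.KinesisAnalyticsInputPreprocessingResponse.Record;
import com.amazonaws.services.lambda.runtime.events.KinesisAnalyticsInputPreprocessingResponse.Result;
import com.amazonaws.services.lambda.runtime.events.KinesisFirehoseEvent;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
public class PlainFirehoseHandler implements RequestHandler<KinesisFirehoseEvent, KinesisAnalyticsInputPreprocessingResponse> {
    @Override
    public KinesisAnalyticsInputPreprocessingResponse handleRequest(KinesisFirehoseEvent event, Context context) {
        List<Record> out = new ArrayList<>();
        for (KinesisFirehoseEvent.Record record : event.getRecords()) {
            try {
                // Decode, transform, re-encode; this sketch just passes the payload through.
                String data = StandardCharsets.UTF_8.decode(record.getData()).toString();
                ByteBuffer transformed = ByteBuffer.wrap(data.getBytes(StandardCharsets.UTF_8));
                out.add(new Record(record.getRecordId(), Result.Ok, transformed));
            } catch (Exception e) {
                context.getLogger().log(e.toString());
                out.add(new Record(record.getRecordId(), Result.ProcessingFailed, record.getData()));
            }
        }
        // Every incoming record must come back with recordId, result and data,
        // otherwise Firehose reports "Invalid output structure".
        KinesisAnalyticsInputPreprocessingResponse response = new KinesisAnalyticsInputPreprocessingResponse();
        response.setRecords(out);
        return response;
    }
}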

How to solve the following in Apache Spark

Consider a retail scenario where an array of (K,V) input pairs holds (product name, price) as shown below. 500 needs to be subtracted from the value of every key for a discount offer.
Use Spark logic to achieve the above requirement.
Input
{(Jeans,2000),(Smart phone,10000),(Watch,3000)}
Expected Output
{(Jeans,1500),(Smart phone,9500),(Watch,2500)}
I have tried the code below, but I'm getting errors. Please help me fix them.
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
public class PairRDDAgg {
public static void main(String[] args) {
// TODO Auto-generated method stub
SparkConf conf = new SparkConf().setAppName("Line_Count").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> input = sc.textFile("C:/Users/xxxx/Documents/retail.txt");
JavaPairRDD<String, Integer> counts = input.mapValues(new Function() {
/**
*
*/
private static final long serialVersionUID = 1L;
public Integer call(Integer i) {
return (i-500);
}
});
System.out.println(counts.collect());
sc.close();
}
}
Use the mapValues() function. An example for your scenario would be:
rdd.mapValues(x => x-500);
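In the Java API, mapValues lives on JavaPairRDD, not on the JavaRDD<String> that textFile returns, so the lines have to be turned into pairs first. Here is a minimal Java sketch, assuming Java 8 lambdas are available and that each line of retail.txt looks like Jeans,2000 (class name and file path are placeholders taken from the question):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class DiscountOffer {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Discount_Offer").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Assumed input format: one "product,price" pair per line, e.g. "Jeans,2000"
        JavaRDD<String> lines = sc.textFile("C:/Users/xxxx/Documents/retail.txt");
        // Build (product, price) pairs, then mapValues only touches the price
        JavaPairRDD<String, Integer> pairs = lines.mapToPair(line -> {
            String[] parts = line.split(",");
            return new Tuple2<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
        });
        JavaPairRDD<String, Integer> discounted = pairs.mapValues(price -> price - 500);
        System.out.println(discounted.collect());
        sc.close();
    }
}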
You can try this:
scala> val dataset = spark.createDataset(Seq(("Jeans",2000),("Smart phone",10000),("Watch",3000)))
dataset: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
scala> dataset.map ( x => (x._1, x._2 - 500) ).show
+-----------+----+
| _1| _2|
+-----------+----+
| Jeans|1500|
|Smart phone|9500|
| Watch|2500|
+-----------+----+

Load CSV files with Apache Spark - Java API

I am new to Spark and working my way through the Java and Scala APIs, and I was interested in putting together two examples with a view to comparing the two languages in terms of conciseness and readability.
Here is my Scala version:
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader
import org.apache.spark.{SparkConf, SparkContext}
object LoadCSVScalaExample {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("MyLoadCSVScalaExampleApplication").setMaster("local[*]")
val sc = new SparkContext(conf)
val input = sc.textFile("D:\\MOCK_DATA_spark.csv")
val result = input.map { line => val reader = new CSVReader(new StringReader(line));
reader.readNext()
}
print("This is the total count " + result.count())
}
}
And this is the Java counterpart:
import au.com.bytecode.opencsv.CSVReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import java.io.StringReader;
public class LoadCSVJavaExample implements Function<String, String[]> {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("MyLoadCSVJavaExampleApp").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> csvFile = sc.textFile("D:\\MOCK_DATA_spark.csv");
JavaRDD<String[]> csvData = csvFile.map(new LoadCSVJavaExample());
System.out.println("This prints the total count " + csvData.count());
}
public String[] call(String line) throws Exception {
CSVReader reader = new CSVReader(new StringReader(line));
return reader.readNext();
}
}
I am not sure, however, whether the Java example is actually correct. Can I pick your brains on it? I know I could use the Databricks Spark CSV library instead, but I was wondering whether the current example is correct and how it could be further improved.
Thank you for your help,
I.
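Your Java version looks functionally equivalent to the Scala one: each line is handed to opencsv and readNext() returns that line's fields. One possible tightening, sketched here on the assumption that you are on Java 8 and a Spark version that accepts lambdas, is to drop the separate Function implementation:
import au.com.bytecode.opencsv.CSVReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.io.StringReader;
public class LoadCSVJavaLambdaExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MyLoadCSVJavaExampleApp").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> csvFile = sc.textFile("D:\\MOCK_DATA_spark.csv");
        // Same per-line parsing as the Function-based version, inlined as a lambda
        JavaRDD<String[]> csvData = csvFile.map(line -> new CSVReader(new StringReader(line)).readNext());
        System.out.println("This prints the total count " + csvData.count());
        sc.close();
    }
}
Like the Scala version, this parses line by line, so quoted fields that contain embedded newlines will not be handled.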

Put and Get JSON object from Nashorn to MongoDB

I want to store JSON objects from JS scripts running in Nashorn in MongoDB and get them back again.
The provided API functions look like:
db.put("key", {"mykey":[1,2,3]})
var result = db.get("key")
There are two issues I don't know how to deal with:
On the Java side I'm getting a ScriptObjectMirror that implements Map. So if I get a JSON object with an array inside, it's already broken here,
e.g. {"key":[1,2,3]} -> {"key": {"0":1, "1":2, "2": 3}}
When the JSON object is read from the DB, it's not possible to JSON.stringify the object; it just returns undefined. Isn't there any possibility to inject a JSON object from Java into Nashorn so that it is compatible with JSON.stringify?
Do you have any suggestions for my problem?
Thanks
From JS, you can use var jsonResult = Java.asJSONCompatible(result) to get back a custom wrapper where JS Arrays are exposed as Java Lists.
From Java, you can use ScriptObjectMirror.wrapAsJSONCompatible(obj).
Hope that helps.
Your first solution, from JS to Java, works as expected. The problem is that we are providing an API for other developers and we don't want to tell them to use Java.asJSONCompatible in the JS code.
I can't get the second solution, from Java to JS, to work.
Here's my test:
import jdk.nashorn.api.scripting.ScriptObjectMirror;
import org.junit.Assert;
import org.junit.Test;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class NashornJsonConversionTest {
ScriptEngineManager scriptEngineManager = new ScriptEngineManager();
ScriptEngine scriptEngine = scriptEngineManager.getEngineByName("nashorn");
private class MyJsonWrapper extends HashMap<String, Object>{
}
@Test
public void shouldStringifyCorrectly() throws ScriptException {
String exepectedStringified = "{\"value\":1}";
MyJsonWrapper myJsonWrapper = new MyJsonWrapper();
myJsonWrapper.put("value", 1);
scriptEngine.put("jsonObject", ScriptObjectMirror.wrapAsJSONCompatible(myJsonWrapper, null));
Assert.assertEquals(1, scriptEngine.eval("jsonObject.value"));
String result = (String) scriptEngine.eval("JSON.stringify(jsonObject)");
Assert.assertEquals(exepectedStringified, result);
//Expected :{"value":1}
//Actual :null
}
@Test
public void shouldGetJSONWithArrayAsList() throws ScriptException {
Map<String, Object> result = (Map<String, Object>) scriptEngine.eval("Java.asJSONCompatible({value:[1,2,3]})");
List<Integer> values = (List<Integer>) result.get("value");
Assert.assertEquals(values.size(), 3);
// works as expected
}
}
You have to convert the object to a "plain" object in Nashorn. Currently your script engine sees a mirror.
Here is a jsfiddle: http://jsfiddle.net/Bernicc/00g6acp1/
/**
* This function converts an object coming from Nashorn to a plain object.
* Instead of using <code>JSON.parse(value);</code> this method does only create new instances for objects.
* There is no need to parse a number, string or boolean.
*
* @param value the object which should be returned as a clean object
* @returns {Object} the clean object
*/
toCleanObject: function (value) {
switch (typeof value) {
case "object":
var ret = {};
for (var key in value) {
ret[key] = this.toCleanObject(value[key]);
}
return ret;
default: //number, string, boolean, null, undefined
return value;
}
}
Below is your test with a minified version of the function. For this example wrapAsJSONCompatible is not necessary.
import jdk.nashorn.api.scripting.ScriptObjectMirror;
import org.junit.Assert;
import org.junit.Test;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class NashornJsonConversionTest {
ScriptEngineManager scriptEngineManager = new ScriptEngineManager();
ScriptEngine scriptEngine = scriptEngineManager.getEngineByName("nashorn");
private class MyJsonWrapper extends HashMap<String, Object>{
}
@Test
public void shouldStringifyCorrectly() throws ScriptException {
String exepectedStringified = "{\"value\":1}";
MyJsonWrapper myJsonWrapper = new MyJsonWrapper();
myJsonWrapper.put("value", 1);
// for this example wrapAsJSONCompatible is not necessary
//scriptEngine.put("jsonObject", ScriptObjectMirror.wrapAsJSONCompatible(myJsonWrapper, null));
scriptEngine.put("jsonObject", myJsonWrapper);
scriptEngine.eval("function toCleanObject(t){switch(typeof t){case\"object\":var e={};for(var n in t)e[n]=toCleanObject(t[n]);return e;default:return t}}");
scriptEngine.eval("jsonObject = toCleanObject(jsonObject);");
Assert.assertEquals(1, scriptEngine.eval("jsonObject.value"));
String result = (String) scriptEngine.eval("JSON.stringify(jsonObject)");
Assert.assertEquals(exepectedStringified, result);
//Expected :{"value":1}
//Actual :null
}
@Test
public void shouldGetJSONWithArrayAsList() throws ScriptException {
Map<String, Object> result = (Map<String, Object>) scriptEngine.eval("Java.asJSONCompatible({value:[1,2,3]})");
List<Integer> values = (List<Integer>) result.get("value");
Assert.assertEquals(values.size(), 3);
// works as expected
}
}
Kind regards from Berlin, Bernard
