How to properly construct an Akka Graph - Java

I want to create a graph with a source that is linked to a broadcast, which fans out through two flows; the outputs are then zipped and fed into a sink.
I have most of it in place, but I have two problems:
The builder is not accepting my fan-in shape.
I am providing a Sink, but a SinkShape is required and I don't know how to get one.
public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("test");
    ActorMaterializer materializer = ActorMaterializer.create(system);
    Source<Integer, NotUsed> source = Source.range(1, 100);
    Flow<Integer, Integer, NotUsed> flow1 = Flow.of(Integer.class).map(i -> i + 1);
    Flow<Integer, Integer, NotUsed> flow2 = Flow.of(Integer.class).map(i -> i * 2);
    Sink<List<Integer>, CompletionStage<Integer>> sink = Sink.fold(0, ((arg1, arg2) -> {
        int value = arg1.intValue();
        for (Integer i : arg2) {
            value += i.intValue();
        }
        return value;
    }));
    RunnableGraph<Integer> graph = RunnableGraph.fromGraph(GraphDSL.create(
            (builder) -> {
                UniformFanOutShape fanOutShape = builder.add(Broadcast.create(2));
                UniformFanInShape fanInShape = builder.add(Zip.create());
                return builder.from(builder.add(source))
                        .viaFanOut(fanOutShape)
                        .via(builder.add(flow1))
                        .via(builder.add(flow2))
                        .viaFanIn(fanInShape)
                        .to(sink);
            }
    ));
}
Any help is appreciated.

You are failing to map the out ports of the broadcast to the specific sub-flows (flow1 and flow2), and similarly you need to connect those flows to the specific in ports of the zip stage.
Also, I think it is not clear what you expect from the graph you are writing. The zip stage emits a pair (Integer, Integer), so the output of zip is a stream of pairs, but the sink you attach after zip does not accept a stream of pairs.
public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("test");
    ActorMaterializer materializer = ActorMaterializer.create(system);
    Source<Integer, NotUsed> source = Source.range(1, 100);
    Flow<Integer, Integer, NotUsed> flow1 = Flow.of(Integer.class).map(i -> i + 1);
    Flow<Integer, Integer, NotUsed> flow2 = Flow.of(Integer.class).map(i -> i * 2);
    // Zip emits Pair<Integer, Integer>, so the sink has to fold over pairs,
    // not over List<Integer> as in the question.
    Sink<Pair<Integer, Integer>, CompletionStage<Integer>> sink =
            Sink.fold(0, (acc, pair) -> acc + pair.first() + pair.second());
    // Only the sink is passed to GraphDSL.create, so its materialized value
    // becomes the materialized value of the whole graph.
    RunnableGraph<CompletionStage<Integer>> graph = RunnableGraph.fromGraph(
            GraphDSL.create(sink, (builder, sinkShape) -> {
                UniformFanOutShape<Integer, Integer> broadcast =
                        builder.add(Broadcast.<Integer>create(2));
                // A zip stage that accepts two Integer inputs and emits pairs.
                FanInShape2<Integer, Integer, Pair<Integer, Integer>> zip =
                        builder.add(Zip.<Integer, Integer>create());
                // Wire the source into the broadcast.
                builder.from(builder.add(source)).viaFanOut(broadcast);
                // Map each broadcast out port to its own flow, then to a zip in port.
                builder.from(broadcast.out(0)).via(builder.add(flow1)).toInlet(zip.in0());
                builder.from(broadcast.out(1)).via(builder.add(flow2)).toInlet(zip.in1());
                // Connect the zip output to the sink.
                builder.from(zip.out()).to(sinkShape);
                return ClosedShape.getInstance();
            }));
    graph.run(materializer);
}
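With this wiring, the materialized value of the graph is the sink's CompletionStage<Integer>, so graph.run(materializer) hands back a future that completes with the final sum once the stream finishes.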
You can check the link below for more examples.
https://github.com/Cs4r/akka-examples/blob/master/src/main/java/cs4r/labs/akka/examples/ConstructingGraphs.java

Related

Flink S3 StreamingFileSink not writing files to S3

I am doing a POC for writing data to S3 using Flink. The program does not give an error, but I do not see any files being written to S3 either.
Below is the code:
public class StreamingJob {
    public static void main(String[] args) throws Exception {
        // set up the streaming execution environment
        final String outputPath = "s3a://testbucket-s3-flink/data/";
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Enable checkpointing
        env.enableCheckpointing();
        // S3 sink
        final StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
                .build();
        // Source is a local Kafka
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kafka:9094");
        properties.setProperty("group.id", "test");
        DataStream<String> input = env.addSource(new FlinkKafkaConsumer<String>("queueing.transactions", new SimpleStringSchema(), properties));
        input.flatMap(new Tokenizer())        // Tokenizer for generating words
                .keyBy(0)                     // Logically partition the stream for each word
                .timeWindow(Time.minutes(1))  // Tumbling window definition
                .sum(1)                       // Sum the number of words per partition
                .map(value -> value.f0 + " count: " + value.f1.toString() + "\n")
                .addSink(sink);
        // execute program
        env.execute("Flink Streaming Java API Skeleton");
    }

    public static final class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}
Note that I have set the s3.access-key and s3.secret-key values in the configuration and tested by changing them to incorrect values (I got an error with incorrect values).
Any pointers on what may be going wrong?
Could it be that you are running into this issue?
Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the “finished” state.
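Related to this, StreamingFileSink only moves in-progress part files to the "finished" state when a checkpoint completes, so an explicit checkpoint interval is usually configured. A minimal sketch (not from the original answer; the 60-second interval is just an example value):
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Part files are finalized on checkpoint completion; pick an interval that
// matches how quickly you want data to become visible in S3.
env.enableCheckpointing(60_000L);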

Resolve a type in a context

I have a Spring Boot project and I want to parse it and find the dependencies between classes. I am using JavaSymbolSolver to find out the class names.
public static void main(String[] args) throws Exception {
    Set<Map<String, Set<String>>> entries = new HashSet<>();
    String jdkPath = "/usr/lib/jvm/java-11-openjdk-amd64/";
    List<File> projectFiles = FileHandler.readJavaFiles(new File("/home/dell/MySpace/Tekit/soon-back/src/main"));
    CombinedTypeSolver combinedSolver = new CombinedTypeSolver(
            new JavaParserTypeSolver(new File("/home/dell/MySpace/Tekit/soon-back/src/main/java/")),
            new JavaParserTypeSolver(new File(jdkPath)),
            new ReflectionTypeSolver()
    );
    JavaSymbolSolver symbolSolver = new JavaSymbolSolver(combinedSolver);
    StaticJavaParser.getConfiguration().setSymbolResolver(symbolSolver);
    CompilationUnit cu = null;
    try {
        cu = StaticJavaParser.parse(projectFiles.get(7));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    List<ClassOrInterfaceDeclaration> classes = new ArrayList<>();
    TypeDeclarationImp typeDeclarationImp = new TypeDeclarationImp();
    typeDeclarationImp.visit(cu, classes);
    Set<String> collect = classes.stream()
            .map(classOrInterfaceDeclaration -> {
                List<MethodCallExpr> collection = new ArrayList<>();
                MethodInvocationImp methodInvocationImp = new MethodInvocationImp();
                classOrInterfaceDeclaration.accept(methodInvocationImp, collection);
                return collection;
            })
            .flatMap(Collection::stream)
            .map(methodCallExpr -> {
                return methodCallExpr
                        .getScope()
                        .stream()
                        .filter(Expression::isNameExpr)
                        .map(Expression::calculateResolvedType)
                        .map(ResolvedType::asReferenceType)
                        .map(ResolvedReferenceType::getQualifiedName)
                        .map(s -> s.split("\\."))
                        .map(strings -> strings[strings.length - 1])
                        .collect(Collectors.toSet());
            })
            .filter(expressions -> expressions.size() != 0)
            .flatMap(Collection::stream)
            .collect(Collectors.toSet());
    collect.forEach(System.out::println);
}
I am facing this issue
Exception in thread "main" UnsolvedSymbolException{context='SecurityContextHolder', name='Solving SecurityContextHolder', cause='null'}
Could you tell me whether it is necessary to declare all the libraries used by the project in order to parse it, or whether there is another way to do that?
It's not entirely correct. If you only want to traverse the AST, you don't need to provide the project's dependencies, but if you want, for example, to know the type of a variable, you must use the symbol solver and declare all of the project's dependencies to it.
Furthermore, JavaParser can recover from parsing errors (see https://matozoid.github.io/2017/06/11/parse-error-recovery.html).
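As a sketch of what declaring a dependency to the symbol solver can look like, you can register a JarTypeSolver for each library the parsed code refers to, for example the Spring Security jar that declares SecurityContextHolder. The jar path below is a placeholder, not taken from the question, and JarTypeSolver's constructor can throw IOException:
CombinedTypeSolver combinedSolver = new CombinedTypeSolver(
        new ReflectionTypeSolver(),
        new JavaParserTypeSolver(new File("/home/dell/MySpace/Tekit/soon-back/src/main/java/")),
        // one JarTypeSolver per dependency jar; the path below is illustrative only
        new JarTypeSolver("/path/to/spring-security-core.jar"));
JavaSymbolSolver symbolSolver = new JavaSymbolSolver(combinedSolver);
StaticJavaParser.getConfiguration().setSymbolResolver(symbolSolver);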

Java Iterators and Streams

I am trying to convert a loop I have written into Java streams, but the code uses iterators and I am finding it hard to turn it into readable code.
private void printKeys() throws IOException {
    ClassLoader classLoader = getClass().getClassLoader();
    // read a json file
    ObjectMapper objectMapper = new ObjectMapper();
    JsonNode root = objectMapper.readTree(classLoader.getResource("AllSets.json"));
    Set<String> names = new HashSet<>();
    // loop through each sub node and store the keys
    for (JsonNode node : root) {
        for (JsonNode cards : node.get("cards")) {
            Iterator<String> i = cards.fieldNames();
            while (i.hasNext()) {
                String name = i.next();
                names.add(name);
            }
        }
    }
    // print each value
    for (String name : names) {
        System.out.println(name);
    }
}
I have tried the following, though I feel like it's not going the right way.
List<JsonNode> nodes = new ArrayList<>();
root.iterator().forEachRemaining(nodes::add);
Set<JsonNode> cards = new HashSet<>();
nodes.stream().map(node -> node.get("cards")).forEach(cards::add);
Stream s = StreamSupport.stream(cards.spliterator(), false);
//.. unfinished and unhappy
You can find the JSON file I used here: https://mtgjson.com/json/AllSets.json.zip
Be warned, it's quite large.
You can do most of this in one go, but it's a shame this JSON API does not support streams better.
List<JsonNode> nodes = new ArrayList<>();
root.iterator().forEachRemaining(nodes::add);
Set<String> names = nodes.stream()
        .flatMap(node -> StreamSupport.stream(
                node.get("cards").spliterator(), false))
        .flatMap(node -> StreamSupport.stream(
                ((Iterable<String>) () -> node.fieldNames()).spliterator(), false))
        .collect(Collectors.toSet());
Or with Patrick's helper method (from the comments):
Set<String> names = stream(root)
        .flatMap(node -> stream(node.get("cards")))
        .flatMap(node -> stream(() -> node.fieldNames()))
        .collect(Collectors.toSet());
...
public static <T> Stream<T> stream(Iterable<T> itor) {
    return StreamSupport.stream(itor.spliterator(), false);
}
And printing:
names.stream().forEach(System.out::println);
If you can provide the JSON file, that would be very useful.
For now, I can make a few suggestions:
Set<JsonNode> cards = new HashSet<>();
nodes.stream().map(node -> node.get("cards")).forEach(cards::add);
Replace with:
Set<JsonNode> cards = nodes.stream().map(node -> node.get("cards")).collect(Collectors.toSet());
for (String name : names) {
System.out.println(name);
}
Replace with:
names.forEach(System.out::println);
Replace
Set<JsonNode> cards = new HashSet<>();
with
List<JsonNode> cards = new ArrayList<>();
Remove
Stream s = StreamSupport.stream(cards.spliterator(), false);
Then add the lines below:
// each element of cards is a "cards" array node, so iterate its card objects
// before reading the field names (as the original loop does)
cards.stream()
        .flatMap(cardsArray -> StreamSupport.stream(cardsArray.spliterator(), false))
        .forEach(card -> {
            Iterable<String> iterable = () -> card.fieldNames();
            Stream<String> targetStream = StreamSupport.stream(iterable.spliterator(), false);
            targetStream.forEach(names::add);
        });
names.forEach(System.out::println);

Transforming JavaPairDStream into RDD

I'm using Spark Streaming with Spark MLlib to evaluate a naive Bayes model. I have reached a point where I can't go further, because I could not transform the JavaPairDStream object into an RDD to calculate the accuracy. The predictions and labels are stored in this JavaPairDStream, but I want to go through each pair and compare them to calculate the accuracy.
I will post my code to make my question clearer. The code raises an exception in the part that calculates the accuracy ("The operator / is undefined for the argument type(s) JavaDStream, double"), because that approach only works with a JavaPairRDD. I need help calculating the accuracy for a JavaPairDStream.
Edit: I edited the code; my problem now is how to read the accuracy value, which is a JavaDStream, and then accumulate this value for each batch of data to calculate the accuracy over all the data.
public static JSONArray testSparkStreaming() {
    SparkConf sparkConf = new SparkConf().setAppName("My app").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(500));
    String savePath = "path to saved model";
    final NaiveBayesModel savedModel = NaiveBayesModel.load(jssc.sparkContext().sc(), savePath);
    JavaDStream<String> data = jssc.textFileStream("path to CSV file");
    JavaDStream<LabeledPoint> testData = data.map(new Function<String, LabeledPoint>() {
        public LabeledPoint call(String line) throws Exception {
            List<String> featureList = Arrays.asList(line.trim().split(","));
            double[] points = new double[featureList.size() - 1];
            double classLabel = Double.parseDouble(featureList.get(featureList.size() - 1));
            for (int i = 0; i < featureList.size() - 1; i++) {
                points[i] = Double.parseDouble(featureList.get(i));
            }
            return new LabeledPoint(classLabel, Vectors.dense(points));
        }
    });
    JavaPairDStream<Double, Double> predictionAndLabel = testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
        public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<Double, Double>(savedModel.predict(p.features()), p.label());
        }
    });
    JavaDStream<Long> accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
        public Boolean call(Tuple2<Double, Double> pl) throws JSONException {
            return pl._1().equals(pl._2());
        }
    }).count();
    jssc.start();
    jssc.awaitTermination();
    System.out.println("*************");
    JSONArray jsonArray = new JSONArray();
    JSONObject obj = new JSONObject();
    jsonArray.put(obj);
    obj = new JSONObject();
    obj.put("Accuracy", accuracy*100 + "%");
    jsonArray.put(obj);
    return jsonArray;
}
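One common way to handle the accumulation described in the edit (a sketch, not from this thread; it reuses jssc and predictionAndLabel from the code above, and the variable names are illustrative) is to register a foreachRDD action before jssc.start() and keep running totals in accumulators:
// register before jssc.start(); this runs once per micro-batch on the driver
LongAccumulator total = jssc.sparkContext().sc().longAccumulator("total");
LongAccumulator correct = jssc.sparkContext().sc().longAccumulator("correct");
predictionAndLabel.foreachRDD(rdd -> {
    total.add(rdd.count());
    correct.add(rdd.filter(pl -> pl._1().equals(pl._2())).count());
    if (total.value() > 0) {
        System.out.println("Accuracy so far: " + (100.0 * correct.value() / total.value()) + "%");
    }
});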

How to get the file name for a record in a Spark RDD (JavaRDD)

I am loading multiple files into a JavaRDD using
JavaRDD<String> allLines = sc.textFile("hdfs://path/*.csv");
After loading the files I modify each record and want to save them. However, I also need to save the original file name (ID) with each record for future reference. Is there any way I can get the original file name for an individual record in the RDD?
Thanks.
You can try something like the following snippet:
JavaPairRDD<LongWritable, Text> javaPairRDD = sc.newAPIHadoopFile(
    "hdfs://path/*.csv",
    TextInputFormat.class,
    LongWritable.class,
    Text.class,
    new Configuration()
);
JavaNewHadoopRDD<LongWritable, Text> hadoopRDD = (JavaNewHadoopRDD) javaPairRDD;
JavaRDD<Tuple2<String, String>> namedLinesRDD = hadoopRDD.mapPartitionsWithInputSplit((inputSplit, lines) -> {
    FileSplit fileSplit = (FileSplit) inputSplit;
    String fileName = fileSplit.getPath().getName();
    Stream<Tuple2<String, String>> stream =
        StreamSupport.stream(Spliterators.spliteratorUnknownSize(lines, Spliterator.ORDERED), false)
            .map(line -> {
                String lineText = line._2().toString();
                // emit file name as key and line as a value
                return new Tuple2(fileName, lineText);
            });
    return stream.iterator();
}, true);
Update (for Java 7):
JavaRDD<Tuple2<String, String>> namedLinesRDD = hadoopRDD.mapPartitionsWithInputSplit(
    new Function2<InputSplit, Iterator<Tuple2<LongWritable, Text>>, Iterator<Tuple2<String, String>>>() {
        @Override
        public Iterator<Tuple2<String, String>> call(InputSplit inputSplit, final Iterator<Tuple2<LongWritable, Text>> lines) throws Exception {
            FileSplit fileSplit = (FileSplit) inputSplit;
            final String fileName = fileSplit.getPath().getName();
            return new Iterator<Tuple2<String, String>>() {
                @Override
                public boolean hasNext() {
                    return lines.hasNext();
                }
                @Override
                public Tuple2<String, String> next() {
                    Tuple2<LongWritable, Text> entry = lines.next();
                    return new Tuple2<String, String>(fileName, entry._2().toString());
                }
            };
        }
    },
    true
);
You want Spark's wholeTextFiles function. From the documentation:
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"),
then rdd contains
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
It returns an RDD of tuples where the left element is the file name and the right element is the content.
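Since the question uses the Java API, the equivalent call there is JavaSparkContext.wholeTextFiles. A minimal sketch (the path and the line splitting are illustrative, not from the answer, and it reuses the question's sc):
// each element is (fileName, fileContent)
JavaPairRDD<String, String> filesWithContent = sc.wholeTextFiles("hdfs://path/*.csv");
// re-split each file's content into lines, keeping the file name with every line
JavaRDD<Tuple2<String, String>> namedLines = filesWithContent.flatMap(pair ->
        Arrays.stream(pair._2().split("\n"))
                .map(line -> new Tuple2<String, String>(pair._1(), line))
                .iterator());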
You should be able to use toDebugString. Using wholeTextFiles will read in the entire content of a file as one element, whereas sc.textFile creates an RDD with each line as an individual element, as described here.
for example:
val file = sc.textFile("/user/user01/whatever.txt").cache()
val wordcount = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordcount.toDebugString
// res0: String =
// (2) ShuffledRDD[4] at reduceByKey at <console>:23 []
// +-(2) MapPartitionsRDD[3] at map at <console>:23 []
// | MapPartitionsRDD[2] at flatMap at <console>:23 []
// | /user/user01/whatever.txt MapPartitionsRDD[1] at textFile at <console>:21 []
// | /user/user01/whatever.txt HadoopRDD[0] at textFile at <console>:21 []
