I'm building a benchmarking tool for some distributed processing frameworks at the moment and am having some trouble with Apache Flink.
The setup is simple: LogPojo is a simple POJO with three fields (long date, double value, String data). Out of a List<LogPojo> I'm looking for the one LogPojo with the minimum "value" field. Basically the equivalent of:
pojoList.stream().min(new LogPojo.Comp()).get().getValue();
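For reference, a minimal sketch of what such a POJO looks like (the getters and the Comp comparator used above are assumed; Flink POJOs need a public no-argument constructor and public fields or getters/setters):
// Minimal sketch of the POJO described above (assumed shape, not the exact class).
public class LogPojo {
    public long date;
    public double value;
    public String data;

    public LogPojo() {} // Flink requires a public no-arg constructor for POJO types

    public long getDate() { return date; }
    public double getValue() { return value; }
    public String getData() { return data; }

    // Comparator used in the stream-based one-liner above
    public static class Comp implements java.util.Comparator<LogPojo> {
        @Override
        public int compare(LogPojo a, LogPojo b) {
            return Double.compare(a.getValue(), b.getValue());
        }
    }
}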
My flink setup looks like:
public double processLogs(List<LogPojo> logs) {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<LogPojo> logSet = env.fromCollection(logs);
double result = 0.0;
try {
ReduceOperator<LogPojo> ro = logSet.reduce(new LogReducer());
List<LogPojo> c = ro.collect();
result = c.get(0).getValue();
} catch (Exception ex) {
System.out.println("Exception caught" + ex);
}
return result;
}
public class LogReducer implements ReduceFunction<LogPojo> {
@Override
public LogPojo reduce(LogPojo o1, LogPojo o2) {
return (o1.getValue() < o2.getValue()) ? o1 : o2;
}
}
It stops with:
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
So somehow it seems to be unable to apply the reduce function. I just can't figure out why. Any hints?
First of all, you should check your imports. You get an exception from a Scala class, but your program is implemented in Java. You might have accidentally imported the Scala DataSet API. Using the Java API should not result in a Scala exception (unless you are using classes which depend on Scala).
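For example, these are the imports to double-check; the Scala and Java APIs use the same class names but live in different packages:
// Java API - use these from a Java program:
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.DataSet;

// Scala API - do NOT use these from Java:
// import org.apache.flink.api.scala.ExecutionEnvironment;
// import org.apache.flink.api.scala.DataSet;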
Regardless of that, Flink has built-in aggregation methods for min, max, etc.
DataSet<LogPojo> logSet = env.fromCollection(logs);
// map LogPojo to a Tuple1<Double>
// (Flink's built-in aggregation functions work only on Tuple types)
DataSet<Tuple1<Double>> values = logSet.map(new MapFunction<LogPojo, Tuple1<Double>>() {
@Override
public Tuple1<Double> map(LogPojo l) throws Exception {
return new Tuple1<>(l.value);
}
});
// fetch the min value (at position 0 in the Tuple)
List<Tuple1<Double>> c = values.min(0).collect();
// get the first field of the Tuple
Double minVal = c.get(0).f0;
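If you need the whole LogPojo rather than just its value, the reduce approach from your question should also work once the imports are sorted out; a minimal sketch using the Java API:
// Reusing the logSet from above; the lambda is an inline ReduceFunction<LogPojo>.
LogPojo minPojo = logSet
        .reduce((a, b) -> a.getValue() < b.getValue() ? a : b)
        .collect()
        .get(0);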
I need to calculate the execution time of some methods. These are private methods in the class, so Spring AOP is not appropriate. Now the code looks like this.
public void method() {
StopWatch sw = new StopWatch();
sw.start();
innerMethod1();
sw.stop();
Monitoring.add("eventType1", sw.getLastTaskTimeMillis());
sw.start();
innerMethod2("abs");
sw.stop();
Monitoring.add("eventType2", sw.getLastTaskTimeMillis());
sw.start();
innerMethod3(5, 29);
sw.stop();
Monitoring.add("eventType3", sw.getLastTaskTimeMillis());
}
But these time-measurement insertions clutter the business logic. Are there any cleaner solutions? The data will then be recorded in a database for Grafana. I'm looking towards AspectJ, but I can't pass extra keys (JVM arguments) when starting the app.
When class instrumentation is required in environments that do not support or are not supported by the existing LoadTimeWeaver implementations, a JDK agent can be the only solution. For such cases, Spring provides InstrumentationLoadTimeWeaver, which requires a Spring-specific (but very general) VM agent, org.springframework.instrument-{version}.jar (previously named spring-agent.jar).
To use it, you must start the virtual machine with the Spring agent, by supplying the following JVM options:
-javaagent:/path/to/org.springframework.instrument-{version}.jar
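With the agent in place, load-time weaving can then be switched on in the Spring configuration, for example (a sketch; AppConfig is just a placeholder name):
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.EnableLoadTimeWeaving;

@Configuration
@EnableLoadTimeWeaving // picks up the InstrumentationLoadTimeWeaver provided by the agent
public class AppConfig {
}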
To Mark Bramnik:
If I understand you correctly, then for methods
private List<String> innerMethod3(int value, int count) {
//
}
private String innerMethod2(String event) {
//
}
I need methods like:
public <T, R, U> U timed(T value, R count, BiFunction<T, R, U> function) {
long start = System.currentTimeMillis();
U result = function.apply(value, count);
Monitoring.add("method", System.currentTimeMillis() - start);
return result;
}
public <T, R> R timed(T value, Function<T, R> function) {
long start = System.currentTimeMillis();
R result = function.apply(value);
Monitoring.add("method", System.currentTimeMillis() - start);
return result;
}
And the calling code:
List<String> timed = timed(5, 5, this::innerMethod3);
String string = timed("string", this::innerMethod2);
But if some method4 has 4 parameters, then I need yet another timing method and a new functional interface.
There are many approaches you can take but all will boil down to refactoring.
Approach 1:
class Timed {
public static void timed(String name, Runnable codeBlock) {
long from = System.currentTimeMillis();
codeBlock.run();
long to = System.currentTimeMillis();
System.out.println("Monitored: " + name + " : " + (to - from) + " ms");
}
public static <T> T timed(String name, Supplier<T> codeBlock) {
long from = System.currentTimeMillis();
T result = codeBlock.get();
long to = System.currentTimeMillis();
System.out.println("Monitored: " + name + " : " + (to - from) + " ms");
return result;
}
}
Notes:
I've used the Runnable / Supplier interfaces for simplicity; you might want to create your own functional interfaces for this.
I've used System.out - you'll use the existing Monitoring.add call instead.
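For example, the Runnable overload with that substitution made (a sketch reusing the Monitoring.add signature from your question):
public static void timed(String name, Runnable codeBlock) {
    long from = System.currentTimeMillis();
    codeBlock.run();
    // Report to the existing monitoring facility instead of printing
    Monitoring.add(name, System.currentTimeMillis() - from);
}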
The aforementioned code can be used like this:
Timed.timed("sample.runnable", ()-> { // Timed. can be statically imported for even further brevity
// some code block here
});
// will measure
int result = Timed.timed("sample.callable", () -> 42);
// will measure and result will be 42
Another approach.
Refactor the code into public methods and integrate with Micrometer, which already has annotation support (see @Timed).
I don't know what Monitoring is, but Micrometer already ships integrations with Prometheus (and other similar products that can store the metrics and later be queried from Grafana), and it keeps an in-memory mathematical model of your measurements rather than keeping every single measurement in memory. In a custom implementation that is complicated code to maintain.
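To illustrate, a hedged sketch of what that could look like (metric and class names here are placeholders; @Timed only applies to public methods called through a Spring proxy, and it requires a TimedAspect bean to be registered):
import io.micrometer.core.annotation.Timed;
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Service
class InnerService {
    // Micrometer records the call count and latency under this metric name.
    @Timed(value = "event.type1")
    public void innerMethod1() {
        // business logic
    }
}

@Configuration
class MetricsConfig {
    // Required so that @Timed annotations are actually picked up.
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}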
Update 1
No, you got it wrong: you don't need to maintain different versions of timed - you only need the two versions that I've provided in the solution. For the case you've presented in the question, you won't even need the second version of timed.
Your code will become:
public void method() {
Timed.timed("eventType1", () -> {
innerMethod1();
});
Timed.timed("eventType2", () -> {
innerMethod2("abs");
});
Timed.timed("eventType3", () -> {
innerMethod3(5, 29);
});
}
The second version is required for the cases where you actually return some value from the "timed" code:
Example:
Let's say you have an innerMethod4 that returns a String, so you'll write the following code:
String result = Timed.timed("eventType3", () -> {
return innerMethod4(5, 29);
});
So, I have the following method which is faking a database locally:
public class TestClassDao implements ClassDao {
// ...
private static List<ClassDto> classes = new ArrayList<>();
@Override
public List<ClassDto> getClassesByIds(List<Long> classIds) {
List<ClassDto> results = new ArrayList<>();
for (ClassDto classInstance : classes) {
if (classIds.contains(classInstance.getId())) {
results.add(classInstance);
}
}
return cloner.deepClone(results);
}
//...
}
I was puzzled, because the results were always coming back empty. I stepped through the debugger in Android Studio, and found that the contains check is always returning false even when the right ID is known to be present.
Tracing that back with the debugger, I found what I suspect to be the culprit: according to the debugger, List<Long> classIds contains *Integer* objects. What gives? I'm not sure how to debug this any further.
EDIT:
Here's the debugger output the question is based on:
EDIT 2:
Here's how the test data is being loaded into the data store, you can see I am correctly passing Long values:
The below method is called by a method which does a similar thing for schools, and then persisted via a method in the test DAO.
public static ClassDto getClassTestData(int classId) {
ClassDto classDto = new ClassDto();
switch (classId) {
case 1:
classDto.setId(1L);
classDto.setName("207E - Mrs. Randolph");
classDto.setTeacher(getTeacherTestData());
classDto.setStudents(getStudentsTestData());
return classDto;
case 2:
classDto.setId(2L);
classDto.setName("209W - Mr. Burns");
classDto.setTeacher(getTeacherTestData());
return classDto;
case 3:
classDto.setId(3L);
classDto.setName("249E - Mr. Sorola");
classDto.setTeacher(getTeacherTestData());
return classDto;
default:
return null;
}
}
EDIT 3:
Here is the DAO where the school information is being persisted to/retrieved from. The problem is occurring somewhere between the time the data is inserted and the time it is retrieved: it goes in with type Long and comes out with type Integer.
@Dao
public interface SchoolDao {
#Query("SELECT * FROM schools")
List<SchoolDto> getAllSchools();
@Insert
void insertSchool(SchoolDto schoolDto);
}
Wow, what a nightmare. I have found the culprit.
I had created a TypeConverter to turn a List<Integer> into a String (and back) so that it could be stored in a single column in the Room DB without having to modify the existing DTOs. However, when I switched over to using Long types as IDs, I failed to update a single generic type argument in the converter; look carefully at the following code:
public class IdsListConverter {
@TypeConverter
public List<Long> idsFromString(String value) {
Gson gson = new Gson();
if (value == null || value.isEmpty()) {
return null;
} else {
Type resultType = new TypeToken<List<Integer>>(){}.getType();
return gson.fromJson(value, resultType);
}
}
@TypeConverter
public String idsToString(List<Long> ids) {
if (ids == null) {
return null;
} else {
Gson gson = new Gson();
return gson.toJson(ids);
}
}
}
It looks like you found your problem:
Type resultType = new TypeToken<List<Integer>>(){}.getType();
return gson.fromJson(value, resultType);
(in a method returning List<Long>) whereas it should have been:
Type resultType = new TypeToken<List<Long>>(){}.getType();
There is a type-safe way to write this which would have picked up the problem at compile time:
TypeToken<List<Integer>> resultTypeToken = new TypeToken<List<Integer>>() {};
return gson.getAdapter(resultTypeToken).fromJson(value);
This wouldn't have compiled, because the return statement's type is incompatible with the method's return type.
It might be worth looking for other occurrences of fromJson so you can migrate them and see if there are other problems you haven't found yet!
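Applied to the converter, the corrected method could look like this (a sketch for the idsFromString method inside IdsListConverter; it additionally needs java.io.IOException and com.google.gson.JsonSyntaxException imports, and wrapping the IOException that TypeAdapter.fromJson(String) declares is just one reasonable choice):
@TypeConverter
public List<Long> idsFromString(String value) {
    if (value == null || value.isEmpty()) {
        return null;
    }
    // The token's type argument now matches the declared return type,
    // so the Integer/Long mismatch can no longer slip through.
    TypeToken<List<Long>> resultTypeToken = new TypeToken<List<Long>>() {};
    try {
        return new Gson().getAdapter(resultTypeToken).fromJson(value);
    } catch (IOException e) {
        throw new JsonSyntaxException(e);
    }
}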
You are looking at the wrong variables. The ClassDao instance is shown below it; there you can see {Long@6495} "1". The Integer "1" you highlighted is an element of classIds, whose construction is omitted from your code. If you want classIds to really be a List<Long>, then when adding elements you should do classIds.add(new Long(1)) (or simply classIds.add(1L)).
For future reference, this list of casting rules will help you. In essence, I believe there is/was an implicit casting conflict.
byte -> short -> int -> long -> float -> double
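A quick illustration of the symptom itself: contains relies on equals, and a boxed Long is never equal to a boxed Integer, so no widening happens for wrapper objects:
import java.util.ArrayList;
import java.util.List;

public class BoxedContainsDemo {
    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        ids.add(1L);
        System.out.println(ids.contains(1));   // false: the literal 1 autoboxes to Integer
        System.out.println(ids.contains(1L));  // true:  the literal 1L autoboxes to Long
    }
}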
I am trying to enhance data in a pipeline by querying Datastore in a DoFn step.
A field from an object of class CustomClass is used to query a Datastore table, and the returned values are used to enhance the object.
The code looks like this:
public class EnhanceWithDataStore extends DoFn<CustomClass, CustomClass> {
private static Datastore datastore = DatastoreOptions.defaultInstance().service();
private static KeyFactory articleKeyFactory = datastore.newKeyFactory().kind("article");
@Override
public void processElement(ProcessContext c) throws Exception {
CustomClass event = c.element();
Entity article = datastore.get(articleKeyFactory.newKey(event.getArticleId()));
String articleName = "";
try{
articleName = article.getString("articleName");
} catch(Exception e) {}
CustomClass enhanced = new CustomClass(event);
enhanced.setArticleName(articleName);
c.output(enhanced);
}
}
When it is run locally, this is fast, but when it is run in the cloud, this step slows down the pipeline significantly. What's causing this? Is there any workaround or better way to do this?
A picture of the pipeline can be found here (the last step is the enhancing step):
pipeline architecture
What you are doing here is a join between your input PCollection<CustomClass> and the enhancements in Datastore.
For each partition of your PCollection, the calls to Datastore are going to be single-threaded, hence incur a lot of latency. I would expect this to be slow in the DirectPipelineRunner and InProcessPipelineRunner as well. With autoscaling and dynamic work rebalancing, you should see parallelism when running on the Dataflow service unless something about the structure of your pipeline causes us to optimize it poorly, so you can try increasing --maxNumWorkers. But you still won't benefit from bulk operations.
It is probably better to express this join within your pipeline, using DatastoreIO.readFrom(...) followed by a CoGroupByKey transform. In this way, Dataflow will do a bulk parallel read of all the enhancements and use the efficient GroupByKey machinery to line them up with the events.
// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);
// Key them both by the common id
PCollection<KV<Long, CustomClass>> keyedEvents =
events.apply(WithKeys.of(event -> event.getArticleId()));
PCollection<KV<Long, Entity>> keyedArticles =
articles.apply(WithKeys.of(article -> article.getKey().getId()));
// Set up the join by giving tags to each collection
TupleTag<CustomClass> eventTag = new TupleTag<CustomClass>() {};
TupleTag<Entity> articleTag = new TupleTag<Entity>() {};
KeyedPCollectionTuple<Long> coGbkInput =
KeyedPCollectionTuple
.of(eventTag, keyedEvents)
.and(articleTag, keyedArticles);
PCollection<CustomClass> enhancedEvents = coGbkInput
.apply(CoGroupByKey.create())
.apply(MapElements.via((KV<Long, CoGbkResult> joinRecord) -> {
CoGbkResult joinResult = joinRecord.getValue();
for (CustomClass event : joinResult.getAll(eventTag)) {
String articleName;
try {
articleName = joinResult.getOnly(articleTag).getString("articleName");
} catch(Exception e) {
articleName = "";
}
CustomClass enhanced = new CustomClass(event);
enhanced.setArticleName(articleName);
return enhanced;
}
});
Another possibility, if there are very few enough articles to store the lookup in memory, is to use DatastoreIO.readFrom(...) and then read them all as a map side input via View.asMap() and look them up in a local table.
// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);
// Key the articles and create a map view
PCollectionView<Map<Long, Entity>> articleView = articles
.apply(WithKeys.of(article -> article.getKey().getId()))
.apply(View.asMap());
// Do a lookup join by side input to a ParDo
PCollection<CustomClass> enhanced = events
.apply(ParDo.withSideInputs(articleView).of(new DoFn<CustomClass, CustomClass>() {
@Override
public void processElement(ProcessContext c) {
CustomClass event = c.element();
Map<Long, Entity> articleLookup = c.sideInput(articleView);
String articleName;
try {
articleName =
articleLookup.get(event.getArticleId()).getString("articleName");
} catch(Exception e) {
articleName = "";
}
CustomClass enhanced = new CustomClass(event);
enhanced.setArticleName(articleName);
c.output(enhanced);
}
});
Depending on your data, either of these may be a better choice.
After some checking I managed to pinpoint the problem: the project is located in the EU (and as such, the Datastore is located in the EU zone, same as the App Engine zone), while the Dataflow jobs themselves (and thus the workers) are hosted in the US by default (when not overriding the zone option).
The difference in performance is 25-30 fold: ~40 elements/s compared to ~1200 elements/s for 15 workers.
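For reference, the worker location can be pinned when creating the pipeline; a sketch against the old Dataflow SDK options (the exact option name may differ between SDK versions, and the zone value here is just an example):
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class CreateEuPipeline {
    public static Pipeline create(String[] args) {
        DataflowPipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        // Keep the workers in the same region as the Datastore / App Engine app.
        options.setZone("europe-west1-b");
        return Pipeline.create(options);
    }
}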
For our uni project we were asked to build an application that also provides an info class containing information such as the total number of lines of code and the number of methods in the whole project.
We were asked to compute the total number of methods using Reflection & RTTI, and obviously without using any external libraries.
How should I go about this?
The simplest approach you can use is:
Create a class to store the info you need (number of methods, lines, or whatever) with setters/getters.
Use a static block in each of your application classes to calculate the methods, lines, etc. for that class and update the info class; a rough sketch follows below.
At any given instant this can give you the info about the number of methods/lines of the classes loaded so far.
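A rough sketch of that static-block idea (ProjectInfo is a hypothetical info class with static counters):
public class SomeAppClass {

    static {
        // Runs once, when this class is loaded: report its method count to the info class.
        ProjectInfo.addMethods(SomeAppClass.class.getDeclaredMethods().length);
    }

    // ... the class's normal methods ...
}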
As @Jägermeister rightly said, the aim of this project is for you to try out things yourself. So I gave some insights which you can follow and try out yourself.
Eventually I came to a solution, thank you all.
Here's the code:
private int getNumMethods() {
java.io.File src = new java.io.File("src/APManager2016");
int result = 0;
if (src.isDirectory()) {
String[] list = src.list((java.io.File dir, String name) -> name.toLowerCase().endsWith(".java"));
try {
for (String x : list) {
Class<?> c = Class.forName("APManager2016." + x.replace(".java", ""));
result += c.getDeclaredMethods().length;
}
} catch (ClassNotFoundException ex) {
System.err.println(ex.getMessage());
result = 0;
}
}
if (result == 0)
{
result = 111;
}
return result;
}
I have real time streaming data coming into spark and I would like to do a moving average forecasting on that time-series data. Is there any way to implement this using spark in Java?
I've already referred to : https://gist.github.com/samklr/27411098f04fc46dcd05/revisions
and
Apache Spark Moving Average
but both of these are written in Scala. Since I'm not familiar with Scala, I'm not able to judge whether I'll find them useful, or to convert the code to Java.
Is there any direct implementation of forecasting in Spark Java?
I took the question you were referring to and struggled for a couple of hours to translate the Scala code into Java:
// Read a file containing the Stock Quotations
// You can also parallelize a collection of objects to create an RDD
JavaRDD<String> linesRDD = sc.textFile("some sample file containing stock prices");
// Convert the lines into our business objects
JavaRDD<StockQuotation> quotationsRDD = linesRDD.flatMap(new ConvertLineToStockQuotation());
// We need these two objects in order to use the MLLib RDDFunctions object
ClassTag<StockQuotation> classTag = scala.reflect.ClassManifestFactory.fromClass(StockQuotation.class);
RDD<StockQuotation> rdd = JavaRDD.toRDD(quotationsRDD);
// Instantiate a RDDFunctions object to work with
RDDFunctions<StockQuotation> rddFs = RDDFunctions.fromRDD(rdd, classTag);
// This applies the sliding function and return the (DATE,SMA) tuple
JavaPairRDD<Date, Double> smaPerDate = rddFs.sliding(slidingWindow).toJavaRDD().mapToPair(new MovingAvgByDateFunction());
List<Tuple2<Date, Double>> smaPerDateList = smaPerDate.collect();
Then you have to use a new Function Class to do the actual calculation of each data window:
public class MovingAvgByDateFunction implements PairFunction<Object,Date,Double> {
/**
*
*/
private static final long serialVersionUID = 9220435667459839141L;
@Override
public Tuple2<Date, Double> call(Object t) throws Exception {
StockQuotation[] stocks = (StockQuotation[]) t;
List<StockQuotation> stockList = Arrays.asList(stocks);
Double result = stockList.stream().collect(Collectors.summingDouble(new ToDoubleFunction<StockQuotation>() {
@Override
public double applyAsDouble(StockQuotation value) {
return value.getValue();
}
}));
result = result / stockList.size();
return new Tuple2<Date, Double>(stockList.get(0).getTimestamp(),result);
}
}
If you want more detail on this, I wrote about Simple Moving Averages here:
https://t.co/gmWltdANd3