I have real time streaming data coming into spark and I would like to do a moving average forecasting on that time-series data. Is there any way to implement this using spark in Java?
I've already referred to : https://gist.github.com/samklr/27411098f04fc46dcd05/revisions
and
Apache Spark Moving Average
but both these codes are written in Scala. Since I'm not familiar with Scala, I'm not able to judge if I'll find it useful or even convert the code to Java.
Is there any direct implementation of forecasting in Spark Java?
I took the question you were referring and struggled for a couple of hours in order to translate the Scala code into Java:
// Read a file containing the Stock Quotations
// You can also paralelize a collection of objects to create a RDD
JavaRDD<String> linesRDD = sc.textFile("some sample file containing stock prices");
// Convert the lines into our business objects
JavaRDD<StockQuotation> quotationsRDD = linesRDD.flatMap(new ConvertLineToStockQuotation());
// We need these two objects in order to use the MLLib RDDFunctions object
ClassTag<StockQuotation> classTag = scala.reflect.ClassManifestFactory.fromClass(StockQuotation.class);
RDD<StockQuotation> rdd = JavaRDD.toRDD(quotationsRDD);
// Instantiate a RDDFunctions object to work with
RDDFunctions<StockQuotation> rddFs = RDDFunctions.fromRDD(rdd, classTag);
// This applies the sliding function and return the (DATE,SMA) tuple
JavaPairRDD<Date, Double> smaPerDate = rddFs.sliding(slidingWindow).toJavaRDD().mapToPair(new MovingAvgByDateFunction());
List<Tuple2<Date, Double>> smaPerDateList = smaPerDate.collect();
Then you have to use a new Function Class to do the actual calculation of each data window:
public class MovingAvgByDateFunction implements PairFunction<Object,Date,Double> {
/**
*
*/
private static final long serialVersionUID = 9220435667459839141L;
#Override
public Tuple2<Date, Double> call(Object t) throws Exception {
StockQuotation[] stocks = (StockQuotation[]) t;
List<StockQuotation> stockList = Arrays.asList(stocks);
Double result = stockList.stream().collect(Collectors.summingDouble(new ToDoubleFunction<StockQuotation>() {
#Override
public double applyAsDouble(StockQuotation value) {
return value.getValue();
}
}));
result = result / stockList.size();
return new Tuple2<Date, Double>(stockList.get(0).getTimestamp(),result);
}
}
If you want more detail on this, I wrote about Simple Moving Averages here:
https://t.co/gmWltdANd3
Related
I'm building a benchmarking tool for some distributed processing tools at the moment, and have some trouble with Apache Flink.
The setup is simple: LogPojo is a simple Pojo with three fields (long date, double value, String data). Out of a List I'm looking for the one LogPojo with the minimum "value" field. Basically the equivalent to:
pojoList.stream().min(new LogPojo.Comp()).get().getValue();
My flink setup looks like:
public double processLogs(List<LogPojo> logs) {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<LogPojo> logSet = env.fromCollection(logs);
double result = 0.0;
try {
ReduceOperator ro = logSet.reduce(new LogReducer());
List<LogPojo> c = ro.collect();
result = c.get(0).getValue();
} catch (Exception ex) {
System.out.println("Exception caught" + ex);
}
return result;
}
public class LogReducer implements ReduceFunction<LogPojo> {
#Override
public LogPojo reduce(LogPojo o1, LogPojo o2) {
return (o1.getValue() < o2.getValue()) ? o1 : o2;
}
}
It stops with:
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
So somehow it seems to be unable to apply the reduce function. I just can't find, why. Any hints?
First of all you should check your imports. You get an exception from a Scala class but your program is implemented in Java. You might have accidentally imported the Scala DataSet API. Using the Java API should not result in a Scala exception (unless you are using classes which depend on Scala).
Regardless of that, Flink has a built-in aggregation methods for min, max, etc.
DataSet<LogPojo> logSet = env.fromCollection(logs);
// map LogPojo to a Tuple1<Double>
// (Flink's built-in aggregation functions work only on Tuple types)
DataSet<Tuple1<Double>> values = logSet.map(new MapFunction<LogPojo, Tuple1<Double>>() {
#Override
public Tuple1<Double> map(LogPojo l) throws Exception {
return new Tuple1<>(l.value);
}
});
// fetch the min value (at position 0 in the Tuple)
List<Tuple1<Double>> c = values.min(0).collect();
// get the first field of the Tuple
Double minVal = c.get(0).f0;
I have a stream of objects which I would like to collect the following way.
Let's say we are handling forum posts:
class Post {
private Date time;
private Data data
}
I want to create a list which groups posts by a period. If there were no posts for X minutes, create a new group.
class PostsGroup{
List<Post> posts = new ArrayList<> ();
}
I want to get a List<PostGroups> containing the posts grouped by the interval.
Example: interval of 10 minutes.
Posts:
[{time:x, data:{}}, {time:x + 3, data:{}} , {time:x + 12, data:{}, {time:x + 45, data:{}}}]
I want to get a list of posts group:
[
{posts : [{time:x, data:{}}, {time:x + 3, data:{}}, {time:x + 12, data:{}]]},
{posts : [{time:x + 45, data:{}]}
]
notice that the first group lasted till X + 22. Then a new post was received at X + 45.
Is this possible?
This problem could be easily solved using the groupRuns method of my StreamEx library:
long MAX_INTERVAL = TimeUnit.MINUTES.toMillis(10);
StreamEx.of(posts)
.groupRuns((p1, p2) -> p2.time.getTime() - p1.time.getTime() <= MAX_INTERVAL)
.map(PostsGroup::new)
.toList();
I assume that you have a constructor
class PostsGroup {
private List<Post> posts;
public PostsGroup(List<Post> posts) {
this.posts = posts;
}
}
The StreamEx.groupRuns method takes a BiPredicate which is applied to two adjacent input elements and returns true if they must be grouped together. This method creates the stream of lists where each list represents the group. This method is lazy and works fine with parallel streams.
You need to retain state between stream entries and write yourself a grouping classifier. Something like this would be a good start.
class Post {
private final long time;
private final String data;
public Post(long time, String data) {
this.time = time;
this.data = data;
}
#Override
public String toString() {
return "Post{" + "time=" + time + ", data=" + data + '}';
}
}
public void test() {
System.out.println("Hello");
long t = 0;
List<Post> posts = Arrays.asList(
new Post(t, "One"),
new Post(t + 1000, "Two"),
new Post(t + 10000, "Three")
);
// Group every 5 seconds.
Map<Long, List<Post>> gouped = posts
.stream()
.collect(Collectors.groupingBy(new ClassifyByTimeBetween(5000)));
gouped.entrySet().stream().forEach((e) -> {
System.out.println(e.getKey() + " -> " + e.getValue());
});
}
class ClassifyByTimeBetween implements Function<Post, Long> {
final long delay;
long currentGroupBy = -1;
long lastDateSeen = -1;
public ClassifyByTimeBetween(long delay) {
this.delay = delay;
}
#Override
public Long apply(Post p) {
if (lastDateSeen >= 0) {
if (p.time > lastDateSeen + delay) {
// Grab this one.
currentGroupBy = p.time;
}
} else {
// First time - start there.
currentGroupBy = p.time;
}
lastDateSeen = p.time;
return currentGroupBy;
}
}
Since no one has provided a solution with a custom collector as it was required in the original problem statement, here is a collector-implementation that groups Post objects based on the provided time-interval.
Date class mentioned in the question is obsolete since Java 8 and not recommended to be used in new projects. Hence, LocalDateTime will be utilized instead.
Post & PostGroup
For testing purposes, I've used Post implemented as a Java 16 record (if you substitute it with a class, the overall solution will be fully compliant with Java 8):
public record Post(LocalDateTime dateTime) {}
Also, I've enhanced the PostGroup object. My idea is that it should be capable to decide whether the offered Post should be added to the list of posts or rejected as the Information expert principle suggests (in short: all manipulations with the data should happen only inside a class to which that data belongs).
To facilitate this functionality, two extra fields were added: interval of type Duration from the java.time package to represent the maximum interval between the earliest post and the latest post in a group, and intervalBound of type LocalDateTime which gets initialized after the first post will be added a later on will be used internally by the method isWithinInterval() to check whether the offered post fits into the interval.
public class PostsGroup {
private Duration interval;
private LocalDateTime intervalBound;
private List<Post> posts = new ArrayList<>();
public PostsGroup(Duration interval) {
this.interval = interval;
}
public boolean tryAdd(Post post) {
if (posts.isEmpty()) {
intervalBound = post.dateTime().plus(interval);
return posts.add(post);
} else if (isWithinInterval(post)) {
return posts.add(post);
}
return false;
}
public boolean isWithinInterval(Post post) {
return post.dateTime().isBefore(intervalBound);
}
#Override
public String toString() {
return "PostsGroup{" + posts + '}';
}
}
I'm making two assumptions:
All posts in the source are sorted by time (if it is not the case, you should introduce sorted() operation in the pipeline before collecting the results);
Posts need to be collected into the minimum number of groups, as a consequence of this it's not possible to split this task and execute stream in parallel.
Building a Custom Collector
We can create a custom collector either inline by using one of the versions of the static method Collector.of() or by defining a class that implements the Collector interface.
These parameters have to be provided while creating a custom collector:
Supplier Supplier<A> is meant to provide a mutable container which store elements of the stream. In this case, ArrayDeque (as an implementation of the Deque interface) will be handy as a container to facilitate the convenient access to the most recently added element, i.e. the latest PostGroup.
Accumulator BiConsumer<A,T> defines how to add elements into the container provided by the supplier. For this task, we need to provide the logic on that will allow determining whether the next element from the stream (i.e. the next Post) should go into the last PostGroup in the Deque, or a new PostGroup needs to be allocated for it.
Combiner BinaryOperator<A> combiner() establishes a rule on how to merge two containers obtained while executing stream in parallel. Since this operation is treated as not parallelizable, the combiner is implemented to throw an AssertionError in case of parallel execution.
Finisher Function<A,R> is meant to produce the final result by transforming the mutable container. The finisher function in the code below turns the container, a deque containing the result, into an immutable list.
Note: Java 16 method toList() is used inside the finisher function, for Java 8 it can be replaced with collect(Collectors.toUnmodifiableList()) or collect(Collectors.toList()).
Characteristics allow providing additional information, for instance Collector.Characteristics.UNORDERED which is used in this case denotes that the order in which partial results of the reduction produced while executing in parallel is not significant. In this case, collector doesn't require any characteristics.
The method below is responsible for generating the collector based on the provided interval.
public static Collector<Post, ?, List<PostsGroup>> groupPostsByInterval(Duration interval) {
return Collector.of(
ArrayDeque::new,
(Deque<PostsGroup> deque, Post post) -> {
if (deque.isEmpty() || !deque.getLast().tryAdd(post)) { // if no groups have been created yet or if adding the post into the most recent group fails
PostsGroup postsGroup = new PostsGroup(interval);
postsGroup.tryAdd(post);
deque.addLast(postsGroup);
}
},
(Deque<PostsGroup> left, Deque<PostsGroup> right) -> { throw new AssertionError("should not be used in parallel"); },
(Deque<PostsGroup> deque) -> deque.stream().collect(Collectors.collectingAndThen(Collectors.toUnmodifiableList())));
}
main() - demo
public static void main(String[] args) {
List<Post> posts =
List.of(new Post(LocalDateTime.of(2022,4,28,15,0)),
new Post(LocalDateTime.of(2022,4,28,15,3)),
new Post(LocalDateTime.of(2022,4,28,15,5)),
new Post(LocalDateTime.of(2022,4,28,15,8)),
new Post(LocalDateTime.of(2022,4,28,15,12)),
new Post(LocalDateTime.of(2022,4,28,15,15)),
new Post(LocalDateTime.of(2022,4,28,15,18)),
new Post(LocalDateTime.of(2022,4,28,15,27)),
new Post(LocalDateTime.of(2022,4,28,15,48)),
new Post(LocalDateTime.of(2022,4,28,15,54)));
Duration interval = Duration.ofMinutes(10);
List<PostsGroup> postsGroups = posts.stream()
.collect(groupPostsByInterval(interval));
postsGroups.forEach(System.out::println);
}
Output:
PostsGroup{[Post[dateTime=2022-04-28T15:00], Post[dateTime=2022-04-28T15:03], Post[dateTime=2022-04-28T15:05], Post[dateTime=2022-04-28T15:08]]}
PostsGroup{[Post[dateTime=2022-04-28T15:12], Post[dateTime=2022-04-28T15:15], Post[dateTime=2022-04-28T15:18]]}
PostsGroup{[Post[dateTime=2022-04-28T15:27]]}
PostsGroup{[Post[dateTime=2022-04-28T15:48], Post[dateTime=2022-04-28T15:54]]}
You can also play around with this Online Demo
I want to iterate through values of KV pCollection on perKey basis. I used below code to combine using custom class,
PCollection<KV<String, String>> combinesAttributes =
valExtract.get(extAttUsers).apply(Combine.<String, String>perKey(
new CombineAttributes()));
And below is my custom combine class,
public static class CombineAttributes implements SerializableFunction<Iterable<String>, String> {
#Override
public String apply(Iterable<String> input) {...}..}
This was working fine for small inputs but for large inputs the combine was not as expected. The output had combined only few values for a key, others were missing. I was assuming that the output had only combined data from one node.
The documentation in https://cloud.google.com/dataflow/model/combine mentions to use CombineFn in order to combine full collection-of-values per key in all nodes.
But when I changed the custom combine function as below, I am getting following error,
incompatible types: CombineAttributes cannot be converted to com.google.cloud.dataflow.sdk.transforms.SerializableFunction<java.lang.Iterable<java.lang.String>,java.lang.String>
Combine function
public static class CombineAttributes extends CombineFn<Iterable<String>, CombineAttributes.Accum, String> {
public static class Accum {
List<String> inputList = new ArrayList<String>();
}
public Accum createAccumulator() { return new Accum(); }
public Accum addInput(Accum accum, Iterable<String> input) {
for (String item : input) {
accum.inputList.add(item);
}
return accum;
}
public Accum mergeAccumulators(Iterable<Accum> accums) {
Accum merged = createAccumulator();
for (Accum accum : accums) {
for (String item : accum.inputList) {
merged.inputList.add(item);
}
}
return merged;
}
public String extractOutput(Accum accum) {
return "";
}
}
There was no sample code available for combine perKey extending CombineFn. Please let me know what is wrong with the code above.
If you just want to iterate through all the values you can use GroupByKey to turn a PCollection<KV<K, V>> into PCollection<KV<K, Iterable<V>>. Then you can write a DoFn that processes each element of that, and inside iterate over the Iterable<V>.
Note that you'll only receive all values associated with a key in the same window. If you're using the default global window, that will be all values.
Combine and CombineFn are most useful when you want to combine all the values into a smaller output. For instance, if you want to take the sum or mean of all the values it will be more efficient to do so using Sum.perKey() or Mean.perKey(). The efficiency comes from being able to pass around (and merge) accumulators. In the case of Sum, this corresponds to a partial sum.
As an example, say the pipeline runs on two machines. The first machine processes KV<user1, attr1a>, KV<user1, attr1b>, KV<user2, attr2a> and the second machine processes KV<user1, attr1c>, KV<user2, attr2b>.
The CombineAttributes (either way it was implemented) would first be invoked on each machine. So it could combine [attr1a, attr1b] into a single string or accumulator (say attr1a+attr1b). Then it would run on the other machine to combine [attr1c] to attr1c. Then it would merge all of these partial results to get a final accumulator -- attr1a+attr1b+attr1c. In the case of the original implementation, that would be the final answer. In the latter, extractOutput would be called on this accumulator.
Basically I want something like this,
int count = 100;
Java<String> myRandomRDD = generate(count, new Function<String, String>() {
#Override
public String call(String arg0) throws Exception {
return RandomStringUtils.randomAlphabetic(42);
}
});
Theoretically I could use Spark RandomRDD, but I can't get it working right. I'm overwhelmed by the choices. Should I use RandomRDDs::randomRDD or RandomRDDs::randomRDDVector? Or should I use RandomVectorRDD?
I have tried the following, but I can't even get the syntax to be correct.
RandomRDDs.randomRDD(jsc, new RandomDataGenerator<String>() {
#Override
public void setSeed(long arg0) {
// TODO Auto-generated method stub
}
#Override
public org.apache.spark.mllib.random.RandomDataGenerator<String> copy() {
// TODO Auto-generated method stub
return null;
}
#Override
public String nextValue() {
RandomStringUtils.randomAlphabetic(42);
}
}, count, ??);
The documentation is sparse, I'm confused, and I would appreciate any help.
Thanks!
The simplest solution I can think of is:
JavaRDD<String> randomStringRDD = RandomRDDs.uniformJavaRDD(jsc, numRows).map((Double d) -> RandomStringUtils.randomAlphabetic(42));
Here is a more complete example to test locally:
SparkConf conf = new SparkConf().setAppName("Test random").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
int numRows= 10;//put here how many rows you want
JavaRDD<String> randomStringRDD = RandomRDDs.uniformJavaRDD(jsc, rows).map((Double d) -> RandomStringUtils.randomAlphabetic(42));
//display (to use only on small dataset)
for(String row:randomStringRDD.collect()){
System.out.println(numRows);
}
There is a small CPU overhead because there is no need to generate the initial set of random numbers, but it takes care of creating the partitions etc.
If avoiding that small overhead is important to you, and you want to generate 1 million rows in 10 partitions, you could try the following:
Create an empty rdd via jsc.emptyRDD()
Set its partitioning via repartition to create 10 partitions
use a mapPartition function to create 1milion/10 partition = 100000 rows per partition. Your RDD is ready.
Side notes:
Having the RandomRDDs.randomRDD() class exposed would make it simpler, but it is unfortunately not.
However, RandomRDDs.randomVectorRDD() is exposed, so you could use that one if you need to generate randomized vectors. (but you asked for Strings here so that does not apply).
The RandomRDD class is private to Spark, but we can access the RandomRDDs class and to create these. There are some examples in JavaRandomRDDsSuite.java (see https://github.com/apache/spark/blob/master/mllib/src/test/java/org/apache/spark/mllib/random/JavaRandomRDDsSuite.java ). It seems that the java examples all make Double's and the like but we can use this and turn it into strings like so:
import static org.apache.spark.mllib.random.RandomRDDs.*;
...
JavaDoubleRDD rdd1 = normalJavaRDD(sc, size, numPartitions);
JavaRDD<String> rdd = rdd1.map(e -> Double.toString(e));
That being siad we could use the randomRDD function, but it uses class tags which are a bit frustrating to use with Java. (I've created a JIRA https://issues.apache.org/jira/browse/SPARK-10626 to make an easy Java API for accessing this).
What is the simplest way to implement a parallel computation (e.g. on a multiple core processor) using Java.
I.E. the java equivalent to this Scala code
val list = aLargeList
list.par.map(_*2)
There is this library, but it seems overwhelming.
http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/
Don't give up so fast, snappy! ))
From the javadocs (with changes to map to your f) the essential matter is really just this:
ParallelLongArray a = ... // you provide
a.replaceWithMapping (new LongOp() { public long op(long a){return a*2L;}};);
is pretty much this, right?
val list = aLargeList
list.par.map(_*2)
& If you are willing to live with a bit less terseness, the above can be a reasonably clean and clear 3 liner (and of course, if you reuse functions, then its the same exact thing as Scala - inline functions.):
ParallelLongArray a = ... // you provide
LongOp f = new LongOp() { public long op(long a){return a*2L;}};
a.replaceWithMapping (f);
[edited above to show concise complete form ala OP's Scala variant]
and here it is in maximal verbose form where we start from scratch for demo:
import java.util.Random;
import jsr166y.ForkJoinPool;
import extra166y.Ops.LongGenerator;
import extra166y.Ops.LongOp;
import extra166y.ParallelLongArray;
public class ListParUnaryFunc {
public static void main(String[] args) {
int n = Integer.parseInt(args[0]);
// create a parallel long array
// with random long values
ParallelLongArray a = ParallelLongArray.create(n-1, new ForkJoinPool());
a.replaceWithGeneratedValue(generator);
// use it: apply unaryLongFuncOp in parallel
// to all values in array
a.replaceWithMapping(unaryLongFuncOp);
// examine it
for(Long v : a.asList()){
System.out.format("%d\n", v);
}
}
static final Random rand = new Random(System.nanoTime());
static LongGenerator generator = new LongGenerator() {
#Override final
public long op() { return rand.nextLong(); }
};
static LongOp unaryLongFuncOp = new LongOp() {
#Override final public long op(long a) { return a * 2L; }
};
}
Final edit and notes:
Also note that a simple class such as the following (which you can reuse across your projects):
/**
* The very basic form w/ TODOs on checks, concurrency issues, init, etc.
*/
final public static class ParArray {
private ParallelLongArray parr;
private final long[] arr;
public ParArray (long[] arr){
this.arr = arr;
}
public final ParArray par() {
if(parr == null)
parr = ParallelLongArray.createFromCopy(arr, new ForkJoinPool()) ;
return this;
}
public final ParallelLongArray map(LongOp op) {
return parr.replaceWithMapping(op);
}
public final long[] values() { return parr.getArray(); }
}
and something like that will allow you to write more fluid Java code (if terseness matters to you):
long[] arr = ... // you provide
LongOp f = ... // you provide
ParArray list = new ParArray(arr);
list.par().map(f);
And the above approach can certainly be pushed to make it even cleaner.
Doing that on one machine is pretty easy, but not as easy as Scala makes it. That library you posted is already apart of Java 5 and beyond. Probably the simplest thing to use is a ExecutorService. That represents a series of threads that can be run on any processor. You send it tasks and those things return results.
http://download.oracle.com/javase/1,5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
http://www.fromdev.com/2009/06/how-can-i-leverage-javautilconcurrent.html
I'd suggest using ExecutorService.invokeAll() which will return a list of Futures. Then you can check them to see if their done.
If you're using Java7 then you could use the fork/join framework which might save you some work. With all of these you can build something very similar to Scala parallel arrays so using it is fairly concise.
Using threads, Java doesn't have this sort of thing built-in.
There will be an equivalent in Java 8: http://www.infoq.com/articles/java-8-vs-scala