How to effectively parse text files with Java Stream API

I understand how to get specific data from a file with Java 8 Streams. For example, if we need to get the Loaded package names from a file like this:
2015-01-06 11:33:03 b.s.d.task [INFO] Emitting: eVentToRequestsBolt __ack_ack
2015-01-06 11:33:03 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package com.foo.bar
2015-01-06 11:33:04 b.s.d.executor [INFO] Processing received message source: eventToManageBolt:2, stream: __ack_ack, id: {}, [-6722594615019711369 -1335723027906100557]
2015-01-06 11:33:04 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package co.il.boo
2015-01-06 11:33:04 c.s.p.d.PackagesProvider [INFO] ===---> Loaded package dot.org.biz
we can do
List<String> packageList = Files.lines(Paths.get(args[1]))
        .filter(line -> line.contains("===---> Loaded package"))
        .map(line -> line.split(" "))
        .map(arr -> arr[arr.length - 1])
        .collect(Collectors.toList());
I took (and slightly modified) the code from Parsing File Example.
But what if we also need to get all the dates (and times) of the Emitting: events from the same log file? How can we do this while working with the same Stream?
I can only imagine using collect(groupingBy(...)) to group the lines with Loaded packages and the lines with Emitting: before parsing, and then parsing each group (each map entry) separately. But that would create a map holding all the raw data from the log file, which is very memory-consuming.
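For illustration, the grouping approach I have in mind would look roughly like this (just a sketch; the group names are made up):
Map<String, List<String>> grouped = Files.lines(Paths.get(args[1]))
        .filter(line -> line.contains("===---> Loaded package") || line.contains("Emitting:"))
        .collect(Collectors.groupingBy(
                line -> line.contains("Emitting:") ? "emitting" : "loadedPackage"));
// grouped keeps every matching raw line in memory before any parsing happens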
Is there a similar way to effectively extract multiple types of data from Java 8 Streams?

You can solve this in a more imperative style, without defining new collectors or using third-party libraries. First define a class that represents the parsing result. It needs two methods: one to accept an input line and one to combine two partial results:
class Data {
    List<String> packageDates = new ArrayList<>();
    List<String> emittingDates = new ArrayList<>();

    // Consume a single input line
    void accept(String line) {
        if (line.contains("===---> Loaded package"))
            packageDates.add(line.substring(0, "XXXX-XX-XX".length()));
        if (line.contains("Emitting"))
            emittingDates.add(line.substring(0, "XXXX-XX-XX XX:XX:XX".length()));
    }

    // Combine two partial results
    void combine(Data other) {
        packageDates.addAll(other.packageDates);
        emittingDates.addAll(other.emittingDates);
    }
}
Now you can collect in a quite straightforward way:
Data result = Files.lines(Paths.get(args[1]))
.collect(Data::new, Data::accept, Data::combine);
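For the sample log above, result.packageDates should then hold three dates and result.emittingDates two timestamps, so a quick check could be (just a usage sketch):
result.packageDates.forEach(System.out::println);   // prints 2015-01-06 three times
result.emittingDates.forEach(System.out::println);  // prints 2015-01-06 11:33:03 and 2015-01-06 11:33:04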

You may use the pairing collector which I wrote in this answer and which is available in my StreamEx library. For your concrete problem you will also need a filtering collector, which is available in the JDK 9 early-access builds and also in my StreamEx library. If you don't like using a third-party library, you may copy it from this answer.
You will also need to store everything in some data structure. I declared the Data class for this purpose:
class Data {
    List<String> packageDates;
    List<String> emittingDates;

    public Data(List<String> packageDates, List<String> emittingDates) {
        this.packageDates = packageDates;
        this.emittingDates = emittingDates;
    }
}
Putting everything together you can define a parsingCollector:
Collector<String, ?, List<String>> packageDatesCollector =
        filtering(line -> line.contains("===---> Loaded package"),
                mapping(line -> line.substring(0, "XXXX-XX-XX".length()), toList()));
Collector<String, ?, List<String>> emittingDatesCollector =
        filtering(line -> line.contains("Emitting"),
                mapping(line -> line.substring(0, "XXXX-XX-XX XX:XX:XX".length()), toList()));
Collector<String, ?, Data> parsingCollector = pairing(
        packageDatesCollector, emittingDatesCollector, Data::new);
And use it like this:
Data data = Files.lines(Paths.get(args[1])).collect(parsingCollector);
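If you are on Java 8 and would rather not copy from the linked answers, a minimal filtering collector can be sketched on top of Collector.of like this (a simplified version, not the exact code from StreamEx or JDK 9):
import java.util.function.BiConsumer;
import java.util.function.Predicate;
import java.util.stream.Collector;

static <T, A, R> Collector<T, A, R> filtering(Predicate<? super T> predicate,
                                              Collector<T, A, R> downstream) {
    BiConsumer<A, T> downstreamAccumulator = downstream.accumulator();
    return Collector.of(
            downstream.supplier(),
            // only pass elements matching the predicate to the downstream collector
            (acc, t) -> { if (predicate.test(t)) downstreamAccumulator.accept(acc, t); },
            downstream.combiner(),
            downstream.finisher(),
            downstream.characteristics().toArray(new Collector.Characteristics[0]));
}
The pairing collector itself still comes from StreamEx (or the linked answer); it feeds every element to both downstream collectors and merges their two results with the supplied finisher, Data::new here.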

Related

Quarkus - KStream and KTable join does not output messages

I am building a project modeled on this project. The key difference is that I want to conditionally output a message built from the messages of the joined topics, as opposed to the example project, where an aggregation is performed. I was struggling to use a Serde for JSON messages, so I have simplified the message structure as follows:
t1 (KStream) - a plain text value.
t2 (KTable) - a plain text value separated by a ;.
t3 (KStream) - a CSV string.
I am publishing messages using kafkacat with the -k option to set a key, e.g. k1. The problem I am facing is that I don't see any output in t3.
This is my TopologyProducer.java.
@Produces
public Topology buildTopology() {
    StreamsBuilder builder = new StreamsBuilder();

    // Serdes for the simplified JSON payload classes
    ObjectMapperSerde<stream1> stream1Serde = new ObjectMapperSerde<>(stream1.class);
    ObjectMapperSerde<topic1> topic1Serde = new ObjectMapperSerde<>(topic1.class);
    ObjectMapperSerde<output1> output1Serde = new ObjectMapperSerde<>(output1.class);

    GlobalKTable<String, topic1> table = builder.globalTable(
            t2,
            Consumed.with(Serdes.String(), topic1Serde));

    builder.stream(t1, Consumed.with(Serdes.String(), stream1Serde))
            .join(table,
                    (paramName, paramValue) -> paramName,
                    (paramValue, paramLimits) -> {
                        // Add some logic to return conditionally
                        return new output1("paramName", 0.0, 0.0, true);
                    })
            .to(t3, Produced.with(Serdes.String(), output1Serde));

    return builder.build();
}
The Java version I had in my Dockerfile was wrong.
When I inspected the container logs, I saw an error about the difference between the Java version used to compile (higher) and the one used to run (lower). I chose the simpler of the two fixes, i.e. I used a more recent version of Java to run the application (rather than adjusting the Java version for the local mvn build). The error can be traced to the following instruction, as documented here:
The Dockerfile created by Quarkus by default needs one adjustment for the aggregator application in order to run the Kafka Streams pipeline. To do so, edit the file aggregator/src/main/docker/Dockerfile.jvm and replace the line FROM fabric8/java-alpine-openjdk8-jre with FROM fabric8/java-centos-openjdk8-jdk.
I edited my Dockerfile to use FROM registry.access.redhat.com/ubi8/openjdk-17:1.11, and the application now runs.

SparkContext parallelize invocation example in java

I am getting started with Spark and ran into an issue trying to implement the simple example for the map function. The issue is with the definition of parallelize in the new version of Spark. Can someone share an example of how to use it, since the following approach gives an error about insufficient arguments?
Spark Version : 2.3.2
Java : 1.8
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers").config("spark.master","local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
JavaRDD<Integer> numRDD = context.parallelize(seqNumList, 2);
Compile-time error message: The method expects 3 arguments
I do not get what the 3rd argument should look like. As per the documentation, it's supposed to be
scala.reflect.ClassTag<T>
But how do I even define or use it?
Please do not suggest using JavaSparkContext, as I want to know how to get this approach working with the generic SparkContext.
Ref : https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#parallelize-scala.collection.Seq-int-scala.reflect.ClassTag-
Here is the code which finally worked for me. It is not the best way to achieve the result, but it was a way for me to explore the API:
SparkSession session = SparkSession.builder()
        .appName("Compute Square of Numbers")
        .config("spark.master", "local")
        .getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());

RDD<Integer> numRDD = context.parallelize(
        JavaConverters.asScalaIteratorConverter(seqNumList.iterator()).asScala().toSeq(),
        2,
        scala.reflect.ClassTag$.MODULE$.apply(Integer.class));
numRDD.toJavaRDD().foreach(x -> System.out.println(x));
session.stop();
If you don't want to deal with providing the extra two parameters to SparkContext, you can also use JavaSparkContext.parallelize(), which only needs an input list:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

final RDD<Integer> rdd = jsc.parallelize(seqNumList).map(num -> {
    // your implementation
}).rdd();
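For completeness, here is a self-contained sketch of that JavaSparkContext route; the squaring logic is just a guess based on the app name in the question, and no ClassTag is needed because parallelize(List) handles it internally:
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SquareNumbers {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Compute Square of Numbers")
                .config("spark.master", "local")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());

        // map each number to its square and print the results
        JavaRDD<Integer> squares = jsc.parallelize(seqNumList, 2).map(num -> num * num);
        squares.foreach(x -> System.out.println(x));

        spark.stop();
    }
}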

Cannot resolve method 'flatMap(<lambdaexpression>)' error

I am new to Apache Spark and am trying to run the wordcount example, but the IntelliJ editor gives me the error Cannot resolve method 'flatMap(<lambda expression>)' at line 47.
Edit: this is the line where I am getting the error:
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
It looks like you're using an older version of Spark that expects Iterable rather than Iterator from the flatMap() function. Try this:
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)));
See also Spark 2.0.0 Arrays.asList not working - incompatible types
Stream#flatMap is used for combining multiple streams into one, so the mapper function you provide must return a Stream. You can try it like this:
lines.stream().flatMap(line -> Stream.of(SPACE.split(line)))
        .map(word -> ...) // map to JavaRDD
The flatMap method takes a FlatMapFunction as a parameter, which is not annotated with @FunctionalInterface. So indeed you cannot use it as a lambda.
Just build a real FlatMapFunction object as the parameter and you will be sure of it.
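For example, with the lines RDD and SPACE pattern from the question, an explicit FlatMapFunction object would look roughly like this (assuming Spark 2.x, where call() returns an Iterator):
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String s) {
        // split each line on the SPACE pattern and emit the individual words
        return Arrays.asList(SPACE.split(s)).iterator();
    }
});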
flatMap() is part of the Java 8 Stream API. I think you should check the compile Java version configured in IDEA.

ClosureCompiler: create_name_map_files from Java API

From the command line I can get an alias list of the function renaming from compiler.jar.
Help says:
java -jar compiler.jar --help
[...]
--create_name_map_files : If true, variable renaming and
property renaming map files will be
produced as {binary name}_vars_map.out
and {binary name}_props_map.out. Note
that this flag cannot be used in
conjunction with either variableMapOut
putFile or property_map_output_file
--create_source_map VAL : If specified, a source map file
mapping the generated source files
back to the original source file will
be output to the specified path. The
%outname% placeholder will expand to
the name of the output file that the
source map corresponds to.
[...]
So, how can I get "create_name_map_files" from inline Java? I took a look into AbstractCommandLineRunner.java, but all the classes/methods related to this command-line option are private and not reachable from my code.
My Code:
CompilerOptions opt = new CompilerOptions();
// decide mode
CompilationLevel.ADVANCED_OPTIMIZATIONS.setOptionsForCompilationLevel(opt);
opt.prettyPrint = false;
Compiler.setLoggingLevel(Level.OFF);

Compiler compressor = new Compiler();
compressor.disableThreads();

List<SourceFile> inputs = ...;
List<SourceFile> externs = ...;
compressor.compile(externs, inputs, opt);
You can just use the option variable_map_output_file filename, and similarly for props.
Note that the flags variable_map_output_file and create_name_map_files cannot both be used simultaneously.
From CommandLineRunner.java, I would say
opt.setCreateNameMapFiles(true)
The "compile" function returns a Result object that contains the variable (variableMap) and property (propertyMap) renaming maps. These properties contain VariableMap objects that can be serialized:
Result result = compiler.compile(...);
result.variableMap.save(varmapPath);
result.propertyMap.save(propmapPath);
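Putting the question's setup and this answer together, an end-to-end sketch might look like the following (the input files and output paths are made up for illustration):
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;

import com.google.javascript.jscomp.CompilationLevel;
import com.google.javascript.jscomp.Compiler;
import com.google.javascript.jscomp.CompilerOptions;
import com.google.javascript.jscomp.Result;
import com.google.javascript.jscomp.SourceFile;

static void compressAndSaveMaps() throws IOException {
    CompilerOptions opt = new CompilerOptions();
    CompilationLevel.ADVANCED_OPTIMIZATIONS.setOptionsForCompilationLevel(opt);
    opt.prettyPrint = false;
    Compiler.setLoggingLevel(Level.OFF);

    Compiler compressor = new Compiler();
    compressor.disableThreads();

    // hypothetical input files, standing in for the question's inputs/externs lists
    List<SourceFile> externs = Arrays.asList(SourceFile.fromFile("externs.js"));
    List<SourceFile> inputs = Arrays.asList(SourceFile.fromFile("app.js"));

    Result result = compressor.compile(externs, inputs, opt);
    if (result.success) {
        result.variableMap.save("app_vars_map.out");   // hypothetical output paths
        result.propertyMap.save("app_props_map.out");
    }
}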

Java nested Map to Scala nested sequence

I'm new to Scala and our project mixes Java and Scala code together (using the Play Framework). I'm trying to write a Scala method that can take a nested Java Map such as:
LinkedHashMap<String, LinkedHashMap<String, String>> groupingA = new LinkedHashMap<String, LinkedHashMap<String,String>>();
And have that Java object passed to a Scala function that can loop through it. I have the following Scala type definition to try to support the above Java nested map:
Seq[(String, Seq[(String,String)])]
Both the Java file and the Scala file compile fine individually, but when my Java object tries to create a new instance of my Scala class and pass in the nested map, I get a compiler error with the following details:
[error] ..... overloaded method value apply with alternatives:
[error] (options: java.util.List[String])scala.collection.mutable.Buffer[(String, String)] <and>
[error] (options: scala.collection.immutable.List[String])List[(String, String)] <and>
[error] (options: java.util.Map[String,String])Seq[(String, String)] <and>
[error] (options: scala.collection.immutable.Map[String,String])Seq[(String, String)] <and>
[error] (options: (String, String)*)Seq[(String, String)]
[error] cannot be applied to (java.util.LinkedHashMap[java.lang.String,java.util.LinkedHashMap[java.lang.String,java.lang.String]])
Any ideas on how I can pass a nested Java LinkedHashMap such as the above into a Scala function where I can generically iterate over the nested collection? I'm trying to write this generically enough that it would also work for a nested Scala collection, in case we ever switch to writing our Play Framework controllers in Scala instead of Java.
Seq is a base trait defined in the Scala collections hierarchy. While Java and Scala offer bytecode compatibility, Scala defines a number of its own types, including its own collection library. The rub here is that if you want to write idiomatic Scala, you need to convert your Java data to Scala data. The way I see it, you have a few options.
You can use Richard's solution and convert the Java types to Scala types in your Scala code. I think this is ugly because it assumes your input will always be coming from Java land.
You can write a beautiful, perfect Scala handler and provide a companion object that offers the ugly Java conversion behavior. This disentangles your Scala implementation from the Java details.
Or you could write an implicit def like the one below, genericizing it to your heart's content.
import java.util.LinkedHashMap
import scala.collection.JavaConversions.mapAsScalaMap

object App {
  implicit def wrapLhm[K, V, G](i: LinkedHashMap[K, LinkedHashMap[G, V]]): LHMWrapper[K, V, G] =
    new LHMWrapper[K, V, G](i)

  def main(args: Array[String]) {
    println("Hello World!")
    val lhm = new LinkedHashMap[String, LinkedHashMap[String, String]]()
    val inner = new LinkedHashMap[String, String]()
    inner.put("one", "one")
    lhm.put("outer", inner)
    val s = lhm.getSeq()
    println(s.toString())
  }

  class LHMWrapper[K, V, G](value: LinkedHashMap[K, LinkedHashMap[G, V]]) {
    def getSeq(): Seq[(K, Seq[(G, V)])] =
      mapAsScalaMap(value).mapValues(mapAsScalaMap(_).toSeq).toSeq
  }
}
Try this:
import scala.collection.JavaConversions.mapAsScalaMap
val lhm: LinkedHashMap[String, LinkedHashMap[String, String]] = getLHM()
val scalaMap = mapAsScalaMap(lhm).mapValues(mapAsScalaMap(_).toSeq).toSeq
I tested this, and got a result of type Seq[(String, Seq[(String, String)])].
(The conversions will wrap the original Java object, rather than actually creating a Scala object with a copy of the values. So the conversions to Seq aren't necessary, you could leave it as a Map, the iteration order will be the same).
Let me guess, are you processing query parameters?
