In the code below, after GroupByKey I get a PCollection<KV<String, Iterable<String>>>. How do I flatten the Iterable in the value before sending it to FileIO?
.apply(GroupByKey.<String, String>create())
.apply("Write file to output",FileIO.< String, KV<String,String>>writeDynamic()
.by(KV::getKey)
.withDestinationCoder(StringUtf8Coder.of())
.via(Contextful.fn(KV::getValue), TextIO.sink())
.to("Out")
.withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));
Thanks for your kind help.
You need to use a ParDo to flatten the Iterable portion of the PCollection, as shown below:
PCollection<KV<String, Doc>> urlDocPairs = ...;
PCollection<KV<String, Iterable<Doc>>> urlToDocs =
urlDocPairs.apply(GroupByKey.<String, Doc>create());
PCollection<R> results =
urlToDocs.apply(ParDo.of(new DoFn<KV<String, Iterable<Doc>>, R>() {
{#literal #}ProcessElement
public void processElement(ProcessContext c) {
String url = c.element().getKey();
for <String,Doc> docsWithThatUrl : c.element().getValue();
c.output(docsWithThatUrl)
}}));
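Applied to your pipeline, a minimal sketch (assuming the result of your GroupByKey is bound to a variable named grouped; untested against any specific Beam version) would be:
PCollection<KV<String, String>> flattened = grouped.apply(
    "Flatten grouped values",
    ParDo.of(new DoFn<KV<String, Iterable<String>>, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String key = c.element().getKey();
        // Re-emit one KV per value so FileIO sees plain KV<String, String> again
        for (String value : c.element().getValue()) {
          c.output(KV.of(key, value));
        }
      }
    }));
flattened.apply("Write file to output",
    FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to("Out")
        .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));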
I have this method that takes a JSON payload as input. I want to create a KTable from this payload, read the data out of it, and print it.
So far I am able to create the KTable, but when I iterate over it, control skips over it.
Can anyone help me or point out where I am going wrong or what I am missing?
Thanks.
public void twilioKTable(JsonNode payloadJson) throws JsonProcessingException {
String payload = payloadJson.toString();
final StreamsBuilder builder = new StreamsBuilder();
final KTable<String, String> kTable = builder.table("testktable");
kTable.toStream().map((key, value) -> new KeyValue<>(key, value))
.foreach(new ForeachAction<String, String>() {
public void apply(String key, String value) {
System.out.println("From MAP " + key + ": Value " +
value);
}
});
}
You could also use the .peek() method on the KStream to print every key with its value.
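For example, a minimal sketch (the application id, broker address, and string serdes here are assumptions). Note that the callbacks only fire once the topology is actually started with KafkaStreams.start(), which may be why control appears to skip over the foreach in your code:
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-printer");       // hypothetical app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // hypothetical broker
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> kTable = builder.table("testktable");
kTable.toStream()
      .peek((key, value) -> System.out.println("From PEEK " + key + ": Value " + value));
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();  // without this the topology never runs, so nothing prints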
I'm trying to use a Scala library in my Java program and I'm having some difficulty converting a complex Scala Map to Java.
The Scala object method I use has the following return type: scala.collection.mutable.Map<String, Map<Object, Seq<Object>>>
How do I convert that to the Java equivalent, Map<String, Map<Object, List<Object>>>?
I've already played around with the JavaConversions and JavaConverters packages, but no luck :(
public void getPartitionAssignmentForTopics(final List<String> topics) {
    final Seq<String> seqTopics = scala.collection.JavaConversions.asScalaBuffer(topics).toList();
    scala.collection.mutable.Map<String, Map<Object, Seq<Object>>> map2 = zkUtils
            .getPartitionAssignmentForTopics(seqTopics);
}
In the Scala REPL the conversion itself works:
val map: scala.collection.mutable.Map[String, Map[Object, Seq[Object]]] = scala.collection.mutable.Map()
map: collection.mutable.Map[String, Map[Object, Seq[Object]]] = Map()
map.mapValues(_.mapValues(_.asJava).asJava).asJava
res2: java.util.Map[String, java.util.Map[Object, java.util.List[Object]]] = {}
but the equivalent does not compile from Java :)
By "playing around" I mean that I used the following code to convert from a Scala Seq to a Java List:
scala.collection.JavaConversions.seqAsJavaList(zkUtils.getAllTopics());
I ended up with the following code. Not really nice :D
public java.util.Map<String, java.util.Map<Integer, java.util.List<Integer>>> getPartitionAssignmentForTopics(final List<String> topics) {
final scala.collection.Seq<String> seqTopics = scala.collection.JavaConversions.asScalaBuffer(topics).toList();
scala.collection.mutable.Map<String, scala.collection.Map<Object, scala.collection.Seq<Object>>> tmpMap1 =
zkUtils.getPartitionAssignmentForTopics(seqTopics);
final java.util.Map<String, java.util.Map<Integer, java.util.List<Integer>>> result = new HashMap<>();
java.util.Map<String, Map<Object, Seq<Object>>> tmpMap2 = JavaConversions.mapAsJavaMap(tmpMap1);
tmpMap2.forEach((k1, v1) -> {
String topic = (String)k1;
java.util.Map<Object, Seq<Object>> objectSeqMap = JavaConversions.mapAsJavaMap(v1);
java.util.Map<Integer, List<Integer>> tmpResultMap = new HashMap<>();
objectSeqMap.forEach((k2, v2) -> {
Integer tmpInt = (Integer)k2;
List<Integer> tmpList = (List<Integer>)(Object)JavaConversions.seqAsJavaList(v2);
tmpResultMap.put(tmpInt, tmpList);
});
result.put(topic, tmpResultMap);
});
return result;
}
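For completeness, a hypothetical call site (the topic names here are made up; zkUtils is assumed to be initialized elsewhere):
java.util.Map<String, java.util.Map<Integer, java.util.List<Integer>>> assignment =
        getPartitionAssignmentForTopics(java.util.Arrays.asList("topic-a", "topic-b"));
assignment.forEach((topic, partitions) ->
        partitions.forEach((partition, replicas) ->
                System.out.println(topic + "/" + partition + " -> " + replicas)));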
JavaRDD<List<String>> documents = StopWordsRemover.Execute(lemmatizedTwits).toJavaRDD().map(new Function<Row, List<String>>() {
@Override
public List<String> call(Row row) throws Exception {
List<String> document = new LinkedList<String>();
for(int i = 0; i<row.length(); i++){
document.add(row.get(i).toString());
}
return document;
}
});
I tried to make it with this code, but I get a WrappedArray:
[[WrappedArray(happy, holiday, beth, hope, wonderful, christmas, wish, best)], [WrappedArray(light, shin, meeeeeeeee, like, diamond)]]
How do I do it correctly?
You can use the getList method:
Dataset<Row> lemmas = StopWordsRemover.Execute(lemmatizedTwits).select("lemmas");
JavaRDD<List<String>> documents = lemmas.toJavaRDD().map(row -> row.getList(0));
where lemmas is the name of the column with the lemmatized text. If there is only one column (it looks like this is the case) you can skip the select. If you know the index of the column, you can skip the select as well and pass the index to getList, but that is error prone.
Your current code iterates over the Row, not over the field you're trying to extract.
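For example, a minimal sketch of the name-lookup variant without the select (assuming the column is called lemmas, as above):
Dataset<Row> df = StopWordsRemover.Execute(lemmatizedTwits);
JavaRDD<List<String>> documents = df.toJavaRDD()
        .map(row -> row.<String>getList(row.fieldIndex("lemmas")));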
Here's an example using a semicolon-delimited text file (e.g. a CSV exported from Excel):
JavaRDD<String> data = sc.textFile(yourPath);
String header = data.first();
JavaRDD<String> dataWithoutHeader = data.filter(line -> !line.equalsIgnoreCase(header) && !line.isEmpty());
JavaRDD<List<String>> dataAsList = dataWithoutHeader.map(line -> Arrays.asList(line.split(";")));
Hope this piece of code helps you.
I have an input RDD (JavaRDD<List<String>>) and I want to convert it to a JavaRDD<String> as output.
Each element of each list in the input RDD should become an individual element in the output RDD.
How can I achieve this in Java?
JavaRDD<List<String>> input; //suppose rdd length is 2
input.saveAsTextFile(...)
output:
[a,b] [c,d]
what i want:
a b c d
Convert it into a DataFrame and use the explode function.
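A minimal sketch of that route (Spark 2.x assumed; the SparkSession variable spark and the column names "words"/"word" are made up for the example):
// assumes: import static org.apache.spark.sql.functions.col / explode
JavaRDD<Row> rows = input.map(list -> RowFactory.create(list));
StructType schema = new StructType()
        .add("words", DataTypes.createArrayType(DataTypes.StringType));
Dataset<Row> df = spark.createDataFrame(rows, schema);
JavaRDD<String> output = df
        .select(explode(col("words")).as("word"))   // one row per list element
        .javaRDD()
        .map(row -> row.getString(0));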
I did a workaround using the code snippet below:
concatenate each element of the list with the separator '\n', then save the RDD using the standard Spark API.
inputRdd.map(new Function<List<String>, String>() {
@Override
public String call(List<String> scores) throws Exception {
int size = scores.size();
StringBuffer sb = new StringBuffer();
for (int i=0; i <size;i++){
sb.append(scores.get(i));
if(i!=size-1){
sb.append("\n");
}
}
return sb.toString();
}
}).saveAsTextFile("/tmp/data");
If the rdd type is RDD[List[String]], you can just do this:
val newrdd = rdd.flatMap(line => line)
Each of the elements will be a new line in the new rdd.
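If you're staying in the Java API, a one-liner along the same lines (Spark 2.x, where flatMap takes a function returning an Iterator) would be:
JavaRDD<String> output = input.flatMap(List::iterator);
On Spark 1.x, where FlatMapFunction returns an Iterable instead, input.flatMap(list -> list) does the same job.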
The following will also solve your problem:
val conf = new SparkConf().setAppName("test")
  .setMaster("local[1]")
  .setExecutorEnv("executor-cores", "2")
val sc = new SparkContext(conf)
val a = sc.parallelize(Array(List("a", "b"), List("c", "d")))
a.flatMap(x => x).foreach(println)
Output:
a
b
c
d
I am trying to convert my Java Spark code to Scala, but I am finding it very complicated. Is it possible to convert the following Java code to Scala? Thanks!
JavaPairRDD<String,Tuple2<String,String>> newDataPair = newRecords.mapToPair(new PairFunction<String, String, Tuple2<String, String>>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, Tuple2<String, String>> call(String t) throws Exception {
MyPerson p = (new Gson()).fromJson(t, MyPerson.class);
String nameAgeKey = p.getName() + "_" + p.getAge() ;
Tuple2<String, String> value = new Tuple2<String, String>(p.getNationality(), t);
Tuple2<String, Tuple2<String, String>> kvp =
new Tuple2<String, Tuple2<String, String>>(nameAgeKey.toLowerCase(), value);
return kvp;
}
});
I tried the following, but I am sure I have missed many things. And actually it is not clear to me how to override the function in Scala ... Please suggest or share some examples. Thank you!
val newDataPair = newRecords.mapToPair(new PairFunction<String, String, Tuple2<String, String>>() {
@Override
public val call(String t) throws Exception {
val p = (new Gson()).fromJson(t, MyPerson.class);
val nameAgeKey = p.getName() + "_" + p.getAge() ;
val value = new Tuple2<String, String>(p.getNationality(), t);
val kvp =
new Tuple2<String, Tuple2<String, String>>(nameAgeKey.toLowerCase(), value);
return kvp;
}
});
Literal translations from Spark-Java to Spark-Scala typically don't work, because the Java API introduces many artifacts to cope with Java's more limited type system. Examples in this case: mapToPair in Java is just map in Scala, and Tuple2 has the terser syntax (a, b).
Applying that (and some more) to the snippet:
val newDataPair = newRecords.map{t =>
val p = (new Gson()).fromJson(t, classOf[MyPerson])
val nameAgeKey = p.getName + "_" + p.getAge
val value = (p.getNationality, t)
(nameAgeKey.toLowerCase(), value)
}
It could be made a bit more concise, but I wanted to keep the same structure as the Java counterpart to make it easier to follow.