String transformation using Spark - java

I'm learning Spark and trying to write a fairly simple app.
As input I have a log string, which looks like
INFO - {timestamp} - {path} - {json message}
INFO - 124534234534534 - test.class - {"message": "something happened"}
I want to pass it to Elasticsearch, so I need to take {timestamp} and add it as a new field in {json message}, so that it looks like
{"timestamp": "124534234534534", "message": "something happened"}
Can someone help me with this transformation using Java?

1. Create a Function<String, String> which takes a line of the log and returns a JSON string.
    Function<String, String> f = new Function<String, String>() {
        public String call(String s) { return ...; }
    };
2. Read the data using SparkContext.textFile.
    JavaSparkContext sc = ...;
    JavaRDD<String> rdd = sc.textFile(...);
3. Map the created RDD using the function defined in point 1.
    rdd.map(f);
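The body of the function from point 1 is left open above. A minimal sketch of what it might do, using only the JDK, is shown here. The class name LogLineToJson and the method transform are hypothetical, and the sketch assumes that the fields are separated by " - ", that the message is a JSON object, and that the timestamp contains no characters that need JSON escaping.

```java
class LogLineToJson {

    // Hypothetical helper: turns "LEVEL - timestamp - path - {json}" into the
    // JSON message with an extra "timestamp" field placed first.
    static String transform(String line) {
        // Limit of 4 so that any " - " inside the JSON message is left untouched.
        String[] parts = line.split(" - ", 4);
        if (parts.length < 4) {
            throw new IllegalArgumentException("Unexpected log line: " + line);
        }
        String timestamp = parts[1].trim();
        String json = parts[3].trim();
        if (!json.startsWith("{") || !json.endsWith("}")) {
            throw new IllegalArgumentException("Message is not a JSON object: " + json);
        }
        String body = json.substring(1, json.length() - 1).trim();
        // Assumption: the timestamp needs no JSON escaping (digits only).
        String field = "\"timestamp\": \"" + timestamp + "\"";
        // An empty object needs no separating comma.
        return body.isEmpty() ? "{" + field + "}" : "{" + field + ", " + body + "}";
    }

    public static void main(String[] args) {
        String line = "INFO - 124534234534534 - test.class - {\"message\": \"something happened\"}";
        System.out.println(transform(line));
    }
}
```

Inside call(String s) this would be used as return LogLineToJson.transform(s);. For messages that are not guaranteed to be flat JSON objects, parsing them with a JSON library is the safer route.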

Related

How to parse an application/x-www-form-urlencoded body with an associative array?

I'm trying to parse the following body:
event=invoice.created&data%5Bid%5D=1757E1D7FD5E410A9C563024250015BF&
data%5Bstatus%5D=pending&data%5Baccount_id%5D=70CA234077134ED0BF2E0E46B0EDC36F&
data%5Bsubscription_id%5D=F4115E5E28AE4CCA941FCCCCCABE9A0A
Which translates to:
event = invoice.created
data[id] = 1757E1D7FD5E410A9C563024250015BF
data[status] = pending
data[account_id] = 70CA234077134ED0BF2E0E46B0EDC36F
data[subscription_id] = F4115E5E28AE4CCA941FCCCCCABE9A0A
Code:
    @PostMapping(consumes = [MediaType.APPLICATION_FORM_URLENCODED_VALUE])
    fun cb(event: SubscriptionRenewed) {
        println(event)
    }
    data class SubscriptionRenewed(
        val event: String,
        val data: Data
    )
    data class Data(
        val id: String,
        val status: String,
        val account_id: String,
        val subscription_id: String
    )
Normally you just create a POJO representation of the incoming body and Spring translates it into an object.
I learned that I could add all the parameters to the function declaration as @RequestParam("data[id]") id: String, but that would make things really verbose.
The issue is with parsing data[*]. Any ideas on how to make it work?
Edit:
I discovered that if I change val data: Data to val data: Map<String, String> = HashMap(), the associative array is correctly inserted into the map. Any ideas on how to map it to an object instead?
Note: IDs/tokens are not real. They are from a documentation snippet.
Deserialize to a Map, then use JSON serialization/deserialization to get the object:
1. Initially deserialize the input to a Map<String, String>.
2. Use a JSON processor (such as ObjectMapper or Gson) to serialize the Map constructed in the previous step.
3. Use the JSON processor to deserialize the JSON output of the previous step to a custom object.
    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;
    import com.google.gson.reflect.TypeToken;
    import java.util.Map;

    static class Data {
        private String one;
        private String a;
        @Override
        public String toString() {
            return "Data{one=" + one + ", a=" + a + "}";
        }
    }
    public static void main(String[] args) {
        String input = "{\"one\":1, \"a\":\"B\"}";
        Gson gson = new GsonBuilder().create();
        // Step 1: input to Map
        Map<String, String> map = gson.fromJson(input, new TypeToken<Map<String, String>>(){}.getType());
        // Steps 2 and 3: Map to JSON, then JSON to the custom object
        Data data = gson.fromJson(gson.toJson(map), Data.class);
        System.out.println(data);
    }
This is admittedly a roundabout approach, and I am not aware of any optimization.
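For illustration, the first step (getting the bracketed keys of the form body into a map) can also be sketched without any framework. This is a minimal sketch using only the JDK (Java 10 or later for URLDecoder.decode with a Charset). The class FormBodyParser and its parse method are hypothetical names, and only a single level of brackets (key[sub]) is handled.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyParser {

    // Hypothetical helper: parses "a=1&data%5Bid%5D=2" into {a=1, data={id=2}}.
    // Assumption: a name is used either as a plain key or as a bracketed key,
    // never both, and brackets are nested at most one level deep.
    static Map<String, Object> parse(String body) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate a trailing or doubled '&'
            }
            int eq = pair.indexOf('=');
            String rawKey = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            String key = URLDecoder.decode(rawKey, StandardCharsets.UTF_8);
            String value = URLDecoder.decode(rawValue, StandardCharsets.UTF_8);

            int open = key.indexOf('[');
            if (open > 0 && key.endsWith("]")) {
                // "data[id]" becomes outer = "data", inner = "id"
                String outer = key.substring(0, open);
                String inner = key.substring(open + 1, key.length() - 1);
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>)
                        result.computeIfAbsent(outer, k -> new LinkedHashMap<String, Object>());
                nested.put(inner, value);
            } else {
                result.put(key, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String body = "event=invoice.created&data%5Bid%5D=1757E1D7FD5E410A9C563024250015BF"
                + "&data%5Bstatus%5D=pending";
        System.out.println(parse(body));
    }
}
```

The nested map under "data" can then be handed to a JSON processor and deserialized into the target class, as in steps 2 and 3 above.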

How to Convert Edn string to Json

I have to retrieve data from a site that sends back responses with EDN bodies. I am trying to convert the returned EDN to JSON so I can parse it with Jsoup.
I found a website that was able to do the conversion, but how do I implement something like this in Java?
I tried something like this, but it didn't do the full job:
    public static String edmToJson(String edm) {
        String json = edm;
        json = json.replaceFirst("(\\(\\{).*?(}\\))", "1").replace("(", "").replace("})", "").replace("} {", "},{");
        return json;
    }
Is there a way to do it without using Clojure?
You can parse EDN data in Java by using a library like edn-java.
Sample usage:
    @Test
    public void simpleUsageExample() throws IOException {
        Parseable pbr = Parsers.newParseable("{:x 1, :y 2}");
        Parser p = Parsers.newParser(defaultConfiguration());
        Map<?, ?> m = (Map<?, ?>) p.nextValue(pbr);
        assertEquals(m.get(newKeyword("x")), 1L);
        assertEquals(m.get(newKeyword("y")), 2L);
        assertEquals(Parser.END_OF_INPUT, p.nextValue(pbr));
    }
Complete docs are available at edn-java.
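Parsing gives you Java maps and lists, which still have to be written out as JSON. A rough sketch of that second half, using only the JDK, might look like the following. The class EdnValueToJson is a hypothetical name. The sketch assumes that keyword objects print as ":name" from toString() (check this against the parser you use), and it does not handle EDN-specific types such as sets of mixed keys, tagged values, or non-finite numbers in any special way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EdnValueToJson {

    // Hypothetical helper: serializes maps, iterables, strings, numbers,
    // booleans and null to JSON text. Any other object (for example a keyword)
    // is written as a string via toString(), with one leading ':' removed.
    static String toJson(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        if (value instanceof Map) {
            List<String> entries = new ArrayList<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                entries.add(quote(name(e.getKey())) + ":" + toJson(e.getValue()));
            }
            return "{" + String.join(",", entries) + "}";
        }
        if (value instanceof Iterable) {
            List<String> items = new ArrayList<>();
            for (Object item : (Iterable<?>) value) {
                items.add(toJson(item));
            }
            return "[" + String.join(",", items) + "]";
        }
        return quote(name(value));
    }

    // Real strings are kept as-is; other objects lose one leading ':'.
    // Assumption: keyword-like objects print as ":name".
    static String name(Object o) {
        if (o instanceof String) {
            return (String) o;
        }
        String s = String.valueOf(o);
        return s.startsWith(":") ? s.substring(1) : s;
    }

    // Minimal JSON string escaping.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

With the parser from the sample above, the call would be EdnValueToJson.toJson(p.nextValue(pbr)). A JSON library can replace the hand-written serializer once the keys have been converted to plain strings.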

Apache Beam Flatten Iterable<String>

In the code below, after GroupByKey I get a PCollection<KV<String, Iterable<String>>>. How do I flatten the Iterable in the value before sending it to FileIO?
    .apply(GroupByKey.<String, String>create())
    .apply("Write file to output", FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to("Out")
        .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));
Thanks for the kind help.
You need to use a ParDo to flatten the Iterable portion of the PCollection, as shown below:
    PCollection<KV<String, Doc>> urlDocPairs = ...;
    PCollection<KV<String, Iterable<Doc>>> urlToDocs =
        urlDocPairs.apply(GroupByKey.<String, Doc>create());
    PCollection<KV<String, Doc>> results =
        urlToDocs.apply(ParDo.of(new DoFn<KV<String, Iterable<Doc>>, KV<String, Doc>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String url = c.element().getKey();
                // Emit one KV per grouped value so the output is flat again.
                for (Doc doc : c.element().getValue()) {
                    c.output(KV.of(url, doc));
                }
            }
        }));

Get a JSON from a Dataset in Spark SQL (java)

I have a Spark SQL application running on a server. It takes data from .parquet files and in each request performs an SQL query on those data. I need to send the JSON corresponding to the output of the query in the response.
This is what I do
Dataset<Row> sqlDF = spark.sql(query);
sqlDF.show();
So I know that the query works.
I tried returning sqlDF.toJSON().collect(), but on the other end I only receive [Ljava.lang.String;@1cd86ff9.
I tried writing sqlDF as a JSON file, but then I don't know how to add its content to the response, and it saves a structure of files that have nothing to do with a JSON file.
Any idea/suggestion?
You can return the JSON strings using the code below.
    List<String> stringDataset = sqlDF.toJSON().collectAsList();
    return stringDataset;
Jackson will return the JSON string in this case.
If you want to return proper JSON objects, then you can use the code below:
    ObjectMapper mapper = new ObjectMapper();
    List<Map<String, Object>> result = new ArrayList<>();
    List<String> stringDataset = sqlDF.toJSON().collectAsList();
    for (String s : stringDataset) {
        // readValue throws a checked exception, so the enclosing method must handle or declare it.
        Map<String, Object> map = mapper.readValue(s, new TypeReference<Map<String, Object>>() {});
        result.add(map);
    }
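The [Ljava.lang.String;@... text in the question is what Java prints when toString() is called on a String array: the type and an identity hash, not the contents. If the response has to be a single JSON array string, the rows can be joined by hand. This is a minimal sketch using only the JDK. The class JsonRowsToArray is a hypothetical name, and the sketch assumes every row is already a valid JSON object string, as the rows returned by toJSON() are.

```java
import java.util.Arrays;
import java.util.List;

class JsonRowsToArray {

    // Hypothetical helper: wraps already-serialized JSON rows in a JSON array.
    // Assumption: each element of rows is valid JSON on its own.
    static String toJsonArray(List<String> rows) {
        return "[" + String.join(",", rows) + "]";
    }

    public static void main(String[] args) {
        String[] raw = {"{\"id\":1}", "{\"id\":2}"};
        // Prints the array's type and identity hash, not its contents.
        System.out.println(raw.toString());
        // Prints the rows as one JSON array.
        System.out.println(toJsonArray(Arrays.asList(raw)));
    }
}
```

With the Dataset from the question this would be called as toJsonArray(sqlDF.toJSON().collectAsList()). Keep in mind that collecting pulls the whole result to the driver, so it only suits results that fit in memory.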

How to convert from Json to Protobuf?

I'm new to using protobuf, and was wondering if there is a simple way to convert a json stream/string to a protobuf stream/string in Java?
For example,
protoString = convertToProto(jsonString)
I have a json string that I want to parse into a protobuf message. So, I want to first convert the json string to protobuf, and then call Message.parseFrom() on it.
With proto3 you can do this using JsonFormat. It parses directly from the JSON representation, so there is no need for separately calling MyMessage.parseFrom(...). Something like this should work:
JsonFormat.parser().merge(json_string, builder);
//You can use this for converting your input json to a Struct / any other Protobuf Class
import com.google.protobuf.Struct.Builder;
import com.google.protobuf.Struct;
import com.google.protobuf.util.JsonFormat;
import org.json.JSONObject;
JSONObject parameters = new JSONObject();
Builder structBuilder = Struct.newBuilder();
JsonFormat.parser().merge(parameters.toString(), structBuilder);
// Now use the structBuilder to pass below (I used it for Dialog Flow V2 Context Management)
Since someone asked about getting the exception "com.google.protobuf.InvalidProtocolBufferException: JsonObject" when following Adam's advice: I ran into the same issue. It turned out to be due to the Google protobuf timestamps. They were being serialized as an object containing two fields, "seconds" and "nanos". Since this isn't production code, I got around it by parsing the JSON using Jackson, going through the JSON object recursively, and changing every timestamp from an object to a string formatted as per RFC 3339. I then serialized it back out and used the protobuf JSON parser as Adam has shown, which fixed the issue. This is some throwaway code I wrote (in my case all timestamp fields contain the word "timestamp"; this could be more robust, but I don't care):
    @SuppressWarnings("unchecked")
    public Map<String, Object> fixJsonTimestamps(Map<String, Object> inMap) {
        Map<String, Object> outMap = new HashMap<>();
        for (String key : inMap.keySet()) {
            Object val = inMap.get(key);
            if (val instanceof Map) {
                Map<String, Object> valMap = (Map<String, Object>) val;
                if (key.toLowerCase().contains("timestamp") &&
                        valMap.containsKey("seconds") && valMap.containsKey("nanos")) {
                    if (valMap.get("seconds") != null) {
                        // The parsed values may be Integer or Long, so go through Number.
                        long seconds = ((Number) valMap.get("seconds")).longValue();
                        Object nanosVal = valMap.get("nanos");
                        long nanos = nanosVal == null ? 0L : ((Number) nanosVal).longValue();
                        ZonedDateTime d = ZonedDateTime.ofInstant(
                                Instant.ofEpochSecond(seconds).plusNanos(nanos), ZoneId.of("UTC"));
                        val = d.format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"));
                    }
                } else {
                    val = fixJsonTimestamps(valMap);
                }
            } else if (val instanceof List && ((List<?>) val).size() > 0 &&
                    ((List<?>) val).get(0) instanceof Map) {
                List<Map<String, Object>> outputList = new ArrayList<>();
                for (Map<String, Object> item : (List<Map<String, Object>>) val) {
                    outputList.add(fixJsonTimestamps(item));
                }
                val = outputList;
            }
            outMap.put(key, val);
        }
        return outMap;
    }
Not the most ideal solution, but it works for what I am doing. I think I saw someone recommend using a different timestamp class.
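The conversion of a seconds/nanos pair to an RFC 3339 string can also be done without a hand-written pattern, since java.time.Instant prints in ISO-8601 UTC form. A minimal sketch, where the class TimestampToRfc3339 and its format method are hypothetical names:

```java
import java.time.Instant;

class TimestampToRfc3339 {

    // Hypothetical helper: formats epoch seconds plus a nanosecond adjustment
    // as a UTC timestamp such as 1970-01-01T00:00:01.500Z.
    static String format(long seconds, long nanos) {
        return Instant.ofEpochSecond(seconds, nanos).toString();
    }

    public static void main(String[] args) {
        System.out.println(format(0L, 0L));           // 1970-01-01T00:00:00Z
        System.out.println(format(1L, 500_000_000L)); // 1970-01-01T00:00:01.500Z
    }
}
```

Unlike the fixed .SSS pattern, this keeps sub-millisecond precision when the nanos field carries it.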
You can convert a JSON string to a proto using a builder and the JSON string.
Example:
    YourProto.Builder protoBuilder = YourProto.newBuilder();
    JsonFormat.parser().merge(JsonString, protoBuilder);
If you want to ignore unknown JSON fields, then:
    YourProto.Builder protoBuilder = YourProto.newBuilder();
    JsonFormat.parser().ignoringUnknownFields().merge(JsonString, protoBuilder);
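Another way is to use the mergeFrom method from Protocol Buffers. Note that mergeFrom(byte[]) parses the protobuf binary wire format rather than JSON text, so it applies only when the bytes are an already serialized message (serializedBytes below stands for such a byte array); for JSON input, use JsonFormat as above.
Example:
    YourProto.Builder protoBuilder = YourProto.newBuilder();
    protoBuilder.mergeFrom(serializedBytes);
Once it executes, protoBuilder holds all the data from the serialized message.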
online service:
https://json-to-proto.github.io/
This tool instantly converts JSON into a Protobuf. Paste a JSON structure on the left and the equivalent Protobuf will be generated to the right, which you can paste into your program. The script has to make some assumptions, so double-check the output!
