XML parse with Storm streaming and spark streaming

How can I parse XML data in Storm and Spark Streaming? For example, in Spark Streaming:
// Define spark streaming MAP function.
private static final Function<XML_DOCUMENT_TYPE, MY_JAVA_CLASS> parsingXMLFunc = (doc -> {
// create my java object
MY_JAVA_CLASS mjc = new MY_JAVA_CLASS();
// classic xml parsing
List<String> parsed_doc = doc.parse(); // etc
mjc.temperature = parsed_doc.get(0);
mjc.accelerometer = parsed_doc.get(1);
return mjc;
});
In this example, can Spark parse the XML in parallel?
Or, as a Storm example:
@Override
public void execute(Tuple tuple) {
// create my java object
MY_JAVA_CLASS mjc = new MY_JAVA_CLASS();
// classic xml parsing
Document doc = (Document) tuple.getValue(0);
List<String> parsed_doc = doc.parse(); // etc
mjc.temperature = parsed_doc.get(0);
mjc.accelerometer = parsed_doc.get(1);
_collector.emit(new Values(mjc));
}
In the above examples, is the XML parsing done in parallel? Or is there a better approach?

I haven't worked with Spark. For Storm, you can write a function that parses the XML (using whichever common Java XML parser you prefer) and call it inside the "execute" method. It will run in parallel, depending on the number of workers and executors you configure for your application.
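As a rough illustration of such a parsing function, here is a minimal sketch that uses only the JDK's built-in DOM parser. The class name SensorXmlParser, the Reading holder, and the temperature/accelerometer element names are assumptions made up for this example, since the question does not show the actual XML layout. A new DocumentBuilder is created on every call because DocumentBuilder instances are not guaranteed to be thread-safe, which matters once several executors call the function concurrently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: the name and the XML layout are assumptions for this sketch.
class SensorXmlParser {

    // Simple immutable holder for the two values the question extracts.
    static final class Reading {
        final String temperature;
        final String accelerometer;

        Reading(String temperature, String accelerometer) {
            this.temperature = temperature;
            this.accelerometer = accelerometer;
        }
    }

    // Parses one XML document given as a string. No state is shared between calls,
    // so the method can be invoked from many threads at once.
    static Reading parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return new Reading(textOf(doc, "temperature"), textOf(doc, "accelerometer"));
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            // Wrap parser configuration and I/O problems in an unchecked exception.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the trimmed text of the first element with the given tag name.
    private static String textOf(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("Missing element: " + tag);
        }
        return nodes.item(0).getTextContent().trim();
    }
}
```

Inside execute you would then call something like SensorXmlParser.parse(tuple.getString(0)), assuming the first tuple field carries the XML as a string, and emit the resulting object.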

Related

Streaming data from Kinesis to S3 fails with Illegal Character that KPL itself writes

I have a relatively straightforward use case:
1) Read Avro data from a Kafka topic.
2) Use KPL (v0.14.12) to send this data to Kinesis Data Streams.
3) Use Kinesis Firehose to transform this data into Parquet and transfer it to S3.
The Kafka topic was written to by Kafka Streams using the following producer configuration:
private void addAwsGlueSpecificProperties(Map<String, Object> props) {
props.put(AWSSchemaRegistryConstants.AWS_REGION, "eu-central-1");
props.put(AWSSchemaRegistryConstants.DATA_FORMAT, DataFormat.AVRO.name());
props.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
props.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "Kinesis_Schema_Registry");
props.put(AWSSchemaRegistryConstants.COMPRESSION_TYPE, AWSSchemaRegistryConstants.COMPRESSION.ZLIB.name());
props.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, GlueSchemaRegistryKafkaStreamsSerde.class.getName());
}
Most notably, I've set SCHEMA_AUTO_REGISTRATION_SETTING to true to try and rule out problems with my schema definition. The auto-registration itself worked without any issues.
I have a very simple loop running for test purposes, which does steps 1 and 2 of the above. It looks as follows:
KinesisProducer kinesisProducer = new KinesisProducer(getKinesisConfig());
try (final KafkaConsumer<String, AvroEvent> consumer = new KafkaConsumer<>(properties)) {
consumer.subscribe(Collections.singletonList(TOPIC));
while (true) {
log.info("Polling...");
final ConsumerRecords<String, AvroEvent> records = consumer.poll(Duration.ofMillis(100));
for (final ConsumerRecord<String, AvroEvent> record : records) {
final String key = record.key();
final AvroEvent value = record.value();
ListenableFuture<UserRecordResult> request = kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), value.toByteBuffer(), gsrSchema);
Futures.addCallback(request, CALLBACK, executor);
}
Thread.sleep(Duration.ofSeconds(10).toMillis());
}
}
The callback just does a bit of logging on success/failure.
My Kinesis Config looks as follows:
private static KinesisProducerConfiguration getKinesisConfig() {
KinesisProducerConfiguration config = new KinesisProducerConfiguration();
GlueSchemaRegistryConfiguration schemaRegistryConfiguration = getGlueSchemaRegistryConfiguration();
config.setGlueSchemaRegistryConfiguration(schemaRegistryConfiguration);
config.setRegion("eu-central-1");
config.setCredentialsProvider(new DefaultAWSCredentialsProviderChain());
config.setMaxConnections(2);
config.setThreadingModel(KinesisProducerConfiguration.ThreadingModel.POOLED);
config.setThreadPoolSize(2);
config.setRateLimit(100L);
return config;
}
private static GlueSchemaRegistryConfiguration getGlueSchemaRegistryConfiguration() {
GlueSchemaRegistryConfiguration gsrConfig = new GlueSchemaRegistryConfiguration("eu-central-1");
gsrConfig.setAvroRecordType(AvroRecordType.GENERIC_RECORD ); // have also tried SPECIFIC_RECORD
gsrConfig.setRegistryName("Kinesis_Schema_Registry");
gsrConfig.setCompressionType(AWSSchemaRegistryConstants.COMPRESSION.ZLIB);
return gsrConfig;
}
This setup allows me to read Specific Avro records from Kafka and send them to Kinesis. I have also verified that the correct schema version ID is queried from GSR by my code. However, when my data gets to Firehose, I receive only the following error message for all my records (one per record):
{
"attemptsMade": 1,
"arrivalTimestamp": 1659622848304,
"lastErrorCode": "DataFormatConversion.ParseError",
"lastErrorMessage": "Encountered malformed JSON. Illegal character ((CTRL-CHAR, code 3)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream#6252e7eb; line: 1, column: 2]",
"attemptEndingTimestamp": 1659623152452,
"rawData": "<base64EncodedData>",
"sequenceNumber": "<seqNum>",
"dataCatalogTable": {
"databaseName": "<Glue database name>",
"tableName": "<Glue table name>",
"region": "eu-central-1",
"versionId": "LATEST",
"roleArn": "<arn>"
}
}
Unfortunately I can't post the entirety of the data as it is sensitive. However, the relevant part is that it always starts with the above control character that is causing the problem:
0x03 0x05 <schemaVersionId> <data>
My original data does not contain these control characters. After some debugging, I've found that KPL explicitly adds these bytes to the beginning of a UserRecord. In com.amazonaws.services.schemaregistry.serializers.SerializationDataEncoder#write:
public byte[] write(final byte[] objectBytes, UUID schemaVersionId) {
byte[] bytes;
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
writeHeaderVersionBytes(out);
writeCompressionBytes(out);
writeSchemaVersionId(out, schemaVersionId);
boolean shouldCompress = this.compressionHandler != null;
bytes = writeToExistingStream(out, shouldCompress ? compressData(objectBytes) : objectBytes);
} catch (Exception e) {
throw new AWSSchemaRegistryException(e.getMessage(), e);
}
return bytes;
}
With writeHeaderVersionBytes(out) and writeCompressionBytes(out) writing to the front of the stream, respectively:
// byte HEADER_VERSION_BYTE = (byte) 3;
private void writeHeaderVersionBytes(ByteArrayOutputStream out) {
out.write(AWSSchemaRegistryConstants.HEADER_VERSION_BYTE);
}
// byte COMPRESSION_BYTE = (byte) 5
// byte COMPRESSION_DEFAULT_BYTE = (byte) 0
private void writeCompressionBytes(ByteArrayOutputStream out) {
out.write(compressionHandler != null ? AWSSchemaRegistryConstants.COMPRESSION_BYTE
: AWSSchemaRegistryConstants.COMPRESSION_DEFAULT_BYTE);
}
Why is Kinesis unable to parse a message that is produced by the library that is supposed to be best suited for writing to it? What am I missing?

I've finally figured out the problem and it's quite dumb.
What it boils down to is that the transformer that converts data to Parquet in Firehose expects a pure JSON payload. It expects records in the form:
{"itemId": 1, "itemName": "someItem"}{"itemId": 2, "itemName": "otherItem"}
It seemingly does not accept the same data in a different format.
This means that Avro-compatible JSON (where the above itemId would look like "itemId": {"long": 1}) or binary Avro data is not compatible with the Kinesis Firehose Parquet transformer, even though my schema definition in the Glue Schema Registry is explicitly registered as being in Avro format.
In addition, the Firehose Parquet transformer requires a Glue table. Creating this table from an imported Avro schema simply does not work (see this answer), so it had to be created manually. Luckily, even though Firehose can't use the table that is based on an existing schema, the table definition was the same (with the exception of the Serde it needs to use), so it was relatively easy to fix.
To sum up, to get the above code to work I had to:
1) Create a Glue table for the schema manually (you can use the first table created from the existing schema as a template for creating this second table, but you can't have Firehose link to the first table).
2) Change the above code from:
kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), value.toByteBuffer(), gsrSchema);
to:
ByteBuffer data = ByteBuffer.wrap(value.toString().getBytes(StandardCharsets.UTF_8));
kinesisProducer.addUserRecord("my-data-stream", key, randomExplicitHashKey(), data);
Note that I am now using the overloaded addUserRecord function that does not take a Schema parameter, which internally invokes the previous function with a null schema. This prevents the KPL from encoding my payload, so it sends the 'plain' JSON over to KDS instead.
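To double-check which of the two forms a record actually has before it reaches Firehose, a small inspection helper can be useful. The sketch below is only an illustration: PayloadInspector and its method names are hypothetical, and the byte values are simply the ones quoted in the question from AWSSchemaRegistryConstants. It looks at the first bytes of a payload without consuming the buffer.

```java
import java.nio.ByteBuffer;

// Hypothetical debugging helper, not part of the KPL or the Glue Schema Registry library.
class PayloadInspector {

    // Header values as quoted in the question.
    static final byte HEADER_VERSION_BYTE = 3;
    static final byte COMPRESSION_BYTE = 5;
    static final byte COMPRESSION_DEFAULT_BYTE = 0;

    // True if the payload starts with the schema registry header:
    // header version byte followed by one of the two compression bytes.
    static boolean looksGsrEncoded(ByteBuffer payload) {
        if (payload.remaining() < 2) {
            return false;
        }
        int start = payload.position();
        byte first = payload.get(start);       // absolute read, position is unchanged
        byte second = payload.get(start + 1);
        return first == HEADER_VERSION_BYTE
                && (second == COMPRESSION_BYTE || second == COMPRESSION_DEFAULT_BYTE);
    }

    // True if the payload starts with '{', i.e. it at least looks like a plain JSON object.
    static boolean looksLikeJsonObject(ByteBuffer payload) {
        return payload.hasRemaining() && payload.get(payload.position()) == '{';
    }
}
```

Logging the result of both checks right before addUserRecord would have shown immediately that the schema-aware overload produced a payload starting with 0x03 rather than with an opening brace.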
This is contrary to the only AWS Docs example that I could find on the topic, which likely is meant for a Firehose stream which does not convert the data prior to sending it to its destination.
I can't quite understand the reasons for all these undocumented limitations, and it was a pain to debug, since neither the KPL functions nor KDS explicitly mention anywhere that I can find that this is the expected behaviour. It doesn't feel worth opening an issue/PR over at the KPL repo, since Amazon doesn't seem to care much about maintaining it.
I'll probably switch over to the plain Kinesis Client + Kinesis Aggregation for a more robust solution in the future, but hey, at least it works.

Apache Flink Dynamic Pipeline

I'm working on a framework that lets customers create their own plugins for my software, which is built on Apache Flink. The snippet below outlines what I'm trying to get working (just as a proof of concept), but I'm getting an org.apache.flink.client.program.ProgramInvocationException ("The main method caused an error.") when trying to upload it.
I want to be able to branch the input stream into x different pipelines and then have those combine into a single output. What I have below is just the simplified version I'm starting with.
public class ContentBase {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kf-service:9092");
properties.setProperty("group.id", "varnost-content");
// Setup up execution environment and get stream from Kafka
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
.map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));
// Create a new List of Streams, one for each "rule" that is being executed
// For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
List<String> codes = Arrays.asList("404", "200", "500");
List<DataStream<ObjectNode>> outputs = new ArrayList<>();
for (String code : codes) {
outputs.add(MyClass.filter(logs, "response", code));
}
// It seemed as though I needed a seed DataStream to union all others on
ObjectMapper mapper = new ObjectMapper();
ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"");
DataStream<ObjectNode> alerts = see.fromElements(seedObject);
// Union the output of each "rule" above with the seed object to then output
for (DataStream<ObjectNode> output : outputs) {
alerts.union(output);
}
// Convert to string and sink to Kafka
alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
.addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();
}
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here.

There were a few errors I found.
1) I assumed a StreamExecutionEnvironment can only have one input (I could be wrong about that), so adding the .fromElements input was not a good idea.
2) I forgot that DataStreams are immutable, so the .union operation returns a new DataStream rather than modifying the one it is called on.
The final result ended up being much simpler:
public class ContentBase {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kf-service:9092");
properties.setProperty("group.id", "varnost-content");
// Setup up execution environment and get stream from Kafka
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
.map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));
// Create a new List of Streams, one for each "rule" that is being executed
// For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
List<String> codes = Arrays.asList("404", "200", "500");
List<DataStream<ObjectNode>> outputs = new ArrayList<>();
for (String code : codes) {
outputs.add(MyClass.filter(logs, "response", code));
}
Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);
// Convert to string and sink to Kafka
alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
.addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();
}
}

The code you posted does not compile because of the last part (converting to string). You mixed up the map of Java's Optional, which is what Stream.reduce returns, with Flink's DataStream.map. Changing it to
alerts.get().map(ObjectNode::toString);
fixes it.
Good luck.
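Both points from this thread, that union returns a new object and that Stream.reduce hands back an Optional, can be reproduced without a Flink cluster. The sketch below uses FakeStream, a made-up immutable stand-in for DataStream, so it only mirrors the shape of the problem and says nothing about Flink internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

class UnionDemo {

    // Hypothetical immutable stand-in for a Flink DataStream.
    static final class FakeStream {
        private final List<String> elements;

        FakeStream(String... elements) {
            this.elements = Collections.unmodifiableList(Arrays.asList(elements));
        }

        private FakeStream(List<String> elements) {
            this.elements = Collections.unmodifiableList(elements);
        }

        // Like DataStream.union: leaves both inputs untouched and returns a new object.
        FakeStream union(FakeStream other) {
            List<String> merged = new ArrayList<>(elements);
            merged.addAll(other.elements);
            return new FakeStream(merged);
        }

        List<String> elements() {
            return elements;
        }
    }

    // Reducing without an identity value yields an Optional,
    // which is empty when the input list is empty.
    static Optional<FakeStream> unionAll(List<FakeStream> streams) {
        return streams.stream().reduce(FakeStream::union);
    }
}
```

Calling a.union(b) and ignoring the return value leaves a unchanged, which is exactly what happened with alerts.union(output) in the question. The Optional returned by unionAll has to be unwrapped, for example with get() or orElseThrow(), before calling stream operations on the result.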

Java | GSON | Add JSON objects to existing JSON-File

I have currently started a kind of diary project to teach myself how to code, which I write in Java. The project has a graphical interface which I realized with JavaFX.
I want to write data into a JSON file, which I enter into two text fields and a slider. Such a JSON entry should look like this:
{
"2019-01-13": {
"textfield1": "test1",
"textfield2": "test2",
"Slider": 2
}
}
I have already created a class in which the values can be passed and retrieved by the JSONWriter.
The class looks like this:
public class Entry {
private String date, textfield1, textfield2;
private Integer slider;
public String getDate() {
return date;
}
public void setDate(String date) {
this.date = date;
}
public String getTextfield1() {
return textfield1;
}
public void setTextfield1(String textfield1) {
this.textfield1 = textfield1;
}
public String getTextfield2() {
return textfield2;
}
public void setTextfield2(String textfield2) {
this.textfield2 = textfield2;
}
public Integer getSlider() {
return slider;
}
public void setSlider(Integer slider) {
this.slider= slider;
}
}
The code of the JSONWriter looks like this:
void json() throws IOException {
Gson gson = new GsonBuilder().setPrettyPrinting().create();
JsonWriter writer = new JsonWriter(new FileWriter("test.json",true));
JsonParser parser = new JsonParser();
Object obj = parser.parse(new FileReader("test.json"));
JsonObject jsonObject = (JsonObject) obj;
System.out.println(jsonObject);
writer.beginObject();
writer.name(entry.getDate());
writer.beginObject();
writer.name("textfield1").value(entry.getTextfield1());
writer.name("textfield2").value(entry.getTextfield2());
writer.name("Slider").value(entry.getSlider());
writer.endObject();
writer.endObject();
writer.close();
}
The date is obtained from the date picker. Later I want to filter the data from the JSON file by date and transfer the contained values (textfield1, textfield2, slider) into the corresponding fields.
If possible, I would also like to overwrite the entry for a date. That is, if an entry for the date already exists and I change something in it, it should be replaced in the JSON file so I can retrieve it later.
If you can recommend a better storage format for this kind of application, I am open to it. But it should be compatible with databases, which I would like to work with later on.
So far I have no idea how to do this because I am still at the beginning of programming. I've been looking for posts that could cover the topic, but I haven't really found anything I understand.

You could start without JsonParser and JsonWriter and use Gson's fromJson(..) and toJson(..), because your current JSON format maps easily to a map of entry POJOs.
A more complex implementation with JsonParser and JsonWriter might be more efficient for large amounts of data, but by that point you should already have looked into persisting to a database anyway.
POJOs are easy to manipulate and can later be persisted to a database easily, for example with a technology like JPA and only a few annotations.
See the simple example below:
@Test
public void test() throws IOException {
Gson gson = new GsonBuilder().setPrettyPrinting().create();
// Your current Json seems to be a map with date string as a key
// Create a corresponding type for gson to deserialize to
// correct generic types
Type type = new TypeToken<Map<String, Entry>>() {}.getType();
// Check this file name for your environment
String fileName = "src/test/java/org/example/diary/test.json";
Reader reader = new FileReader(new File(fileName));
// Read the whole diary to memory as java objects
Map<String, Entry> diary = gson.fromJson(reader, type);
// Modify one field
diary.get("2019-01-13").setTextfield1("modified field");
// Add a new date entry
Entry e = new Entry();
e.setDate("2019-01-14");
e.setScale(3);
e.setTextfield1("Dear Diary");
e.setTextfield2("I met a ...");
diary.put(e.getDate(), e);
// Store the new diary contents. Note that this does not overwrite the
// original file but appends ".out.json" to the file name to preserve the original
FileWriter fw = new FileWriter(new File(fileName + ".out.json"));
gson.toJson(diary, fw);
fw.close();
}
This should produce a test.json.out.json like:
{
"2019-01-13": {
"textfield1": "modified field",
"textfield2": "test2",
"Slider": 2
},
"2019-01-14": {
"date": "2019-01-14",
"textfield1": "Dear Diary",
"textfield2": "I met a ...",
"Slider": 3
}
}
Note that I also made a small assumption here:
// Just in case you meant to map "Slider" in Json as "scale"
@SerializedName("Slider")
private Integer scale;
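Keeping the diary as a map keyed by date also covers the overwrite requirement from the question, because putting a value under an existing key replaces the old one. Here is a minimal sketch using only the JDK; DiaryStore and DiaryEntry are hypothetical names for this illustration, and reading or writing the file with Gson is deliberately left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory store: one entry per date, later saves replace earlier ones.
class DiaryStore {

    // Minimal immutable entry, mirroring the fields from the question.
    static final class DiaryEntry {
        final String textfield1;
        final String textfield2;
        final int slider;

        DiaryEntry(String textfield1, String textfield2, int slider) {
            this.textfield1 = textfield1;
            this.textfield2 = textfield2;
            this.slider = slider;
        }
    }

    // TreeMap keeps ISO dates such as "2019-01-13" in chronological order.
    private final Map<String, DiaryEntry> entriesByDate = new TreeMap<>();

    // Stores the entry under its date and returns the entry it replaced, or null.
    DiaryEntry save(String isoDate, DiaryEntry entry) {
        return entriesByDate.put(isoDate, entry);
    }

    // Looks up the entry for a date, or null if there is none.
    DiaryEntry find(String isoDate) {
        return entriesByDate.get(isoDate);
    }

    int size() {
        return entriesByDate.size();
    }
}
```

Saving twice for "2019-01-13" leaves a single entry holding the second set of values, so writing the whole map back out with toJson gives the replace-on-edit behaviour without any manual JSON manipulation.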

I will give you some general tips; it is up to you to go deeper.
First of all, I recommend this architecture, which is common in web applications and even desktop apps, to keep the front-end layer separate from the back-end server:
Front-end (use JavaFX if you want). Tutorial: http://www.mastertheboss.com/jboss-frameworks/resteasy/rest-services-using-javafx-tutorial
Back-end (Java 1.8, Spring Boot, MySQL database). There are tons of examples and tutorials for this stack; I recommend the Mkyong or Baeldung blogs.
The front end communicates with the server through HTTP requests to the back-end REST API, using JSON or XML as the message format. In real life the two are physically separated, but here you can just create two Java projects running on different ports.
For the back end, just follow a tutorial to get a REST API server up and running. Set up the MVC pattern: controller layer, service layer, repository layer, model layer, DTO layer, etc. For your specific model I recommend the following:
selected_date: Date
inputs: Map of strings
size: Integer
In the front-end project with JavaFX, just reuse the code you already wrote and add some CSS if you want. Use the components' actions to call the back-end REST API to create, retrieve, update, and delete your data from the date picker, or whatever operation you want to do.
You will be converting Java objects to JSON strings all the time, so I recommend the Gson or Jackson library, which do this directly without you having to build the JsonObject manually. If you still want to write the JSON to a file, convert the Java object to a string (a string holding the object in JSON format) using one of those libraries, and then write the string to the file. But I strongly believe it will be more practical to use a database.
Hope it helps.

Using static Datasets inside ForeachRDD in Java Spark Streaming for DStreams RDD parallel processing

We have a DStream that consumes JSON messages using a custom receiver. Each JSON message is simply a user request in the form of some input parameters.
JavaReceiverInputDStream<String> msgDStream = ssc.receiverStream(receiver);
I also have a static, preloaded Dataset, e.g.:
Dataset<Row> loanDS = spark.read().parquet("/path");
In my use case, I want to process the DStream's RDD data (the JSON messages) in parallel:
msgDStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> stringJavaRDD) throws Exception {
if(!stringJavaRDD.isEmpty()) {
System.out.println("Json string: " + requestJSON);
stringJavaRDD.foreach(new VoidFunction<String>(){
public void call(String s) throws Exception{
parseJSON(s); // external utility to parse the JSON messages
// here I want to build the aggregate and select clauses from the static dataset loanDS and the JSON message
}
});
}}});
When I use loanDS inside stringJavaRDD.foreach, it does not contain anything, because foreach executes on the worker nodes while loanDS lives on the driver.
How can I achieve this inside foreach? I want to process the JSON messages inside the DStream's RDDs in parallel.

Streaming a json element

Let's say I have a json that looks like this:
{"body":"abcdef","field":"fgh"}
Now suppose the value of the 'body' element is huge (~100 MB or more). I would like to stream out the value of the body element instead of storing it in a String.
How can I do this? Is there any Java library I could use for this?
This is the line of code that fails with an OutOfMemoryError when a large JSON value comes in:
String inputStreamString = (String) JsonPath.read(textValue.toString(), "$.body");
'textValue' here is a hadoop.io.Text object.
I'm assuming that the OutOfMemory error occurs because we try to do method calls like toString() (which creates a new object), and JsonPath.read(), all of which are done in-memory. I need to know if there is an approach I could take while handling large-sized textValue objects.
Please let me know if you need additional info.

JsonSurfer is good for processing very large JSON data with selective extraction.
Here is an example of surfing JSON data and collecting the matched values in a listener:
BufferedReader reader = new BufferedReader(new FileReader(jsonFile));
JsonSurfer surfer = new JsonSurfer(GsonParser.INSTANCE, GsonProvider.INSTANCE);
SurfingConfiguration config = surfer.configBuilder().bind("$.store.book[*]", new JsonPathListener() {
@Override
public void onValue(Object value, ParsingContext context) throws Exception {
JsonObject book = (JsonObject) value;
}
}).build();
surfer.surf(reader, config);

Jackson offers a streaming API for generating and processing JSON data.
