Writing a custom unbounded sink for Dataflow v2.1 - java

I'm using the mvn dependency google-cloud-dataflow-java-sdk-all version 2.1.0 and I'm trying to add a custom Sink for my pipeline.
In the pipeline, I'm retrieving Pubsub messages and am eventually transforming these to a PCollection of Strings.
This is a simplified version of the pipeline I've set up:
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(PubsubIO.readMessages())
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
//transformations
.apply(//Write to custom sink)
The reason I need a custom Sink is because someone else on the team has already written the code to write out this data to BigQuery and provided a REST API for this. So, my Sink would be calling this REST API with the relevant data. I'm not keen on using BigQueryIO since that would involve duplicating parts of the code that was already written.
The problem is that I can not find any documentation on the Apache Beam website about writing custom Sinks using the Java SDK, so if someone could give me a nod in the right direction, it'd be much appreciated.
I've also considered just using a ParDo to send the data to the REST API, but then I technically would not have a Sink anymore and I wouldn't be doing it the "Dataflow way".

For unbounded sinks, there is no sink specific API in Beam. All the IO transforms essentially implement a ParDo. There are a few techniques to provide specific guarantees (e.g. using a GroupByKey to provide a checkpoint barrier in Dataflow) it depends on your interaction with external system (REST API in this case). Looks like writing a ParDo is the way to go in your case.

Related

Does Streams API functionality overlap with Spring Integration?

When introducing parallel processing to an application where multiple save entity calls are being made, I see prior dev has chosen to do it via Spring Integration using split().channel(MessageChannels.executor(Executors.newfixedThreadPool(10))).handle("saveClass","saveMethod").aggregate().get() - where this method is mapped to a requestChannel using #Gateway annotation. My question is this task seems to be simpler to do using the parallelStream() and forEach() method. Does IntergrationFlow provide any benefit in this case?
If you really do a plain in-memory data processing where Java's Stream API is enough, then indeed you don't need the whole messaging solution like Spring Integration. But if you deal with distributed requirements to process data from different systems, like from HTTP to Apache Kafka or DB, then it is better to use some tool which allows you smoothly to connect everything together. Also: no one stops you to use Stream API in the Spring Integration application. In the end all your code is Java anyway. Please, learn more what is EIP and why we would need a special framework to implement this messaging-based solutions: https://www.enterpriseintegrationpatterns.com/

Call a Rest API from Apache Beam

Context: I'm creating objects that contain some data, and in my case, data cannot find from one source. So I'm consuming data coming from a Kinesis stream and need to call a Rest API providing some data that came from the stream to get some other fields required to create the final object.
Question: I know that Apache beam is not yet released as an official way to communicate with Rest APIs. But looks like it is still in progress - https://issues.apache.org/jira/browse/BEAM-1946
What I want to know is the best and proper way to call a Rest API inside the beam pipeline. Highly appreciate it if you can share any resources/examples along with your feedback's.
Tried the standard(HttpClient API) java way of calling Rest API, but I want to know whether there is a better way to achieve this. Am expecting answers and some resources if possible.
TIA
Apache Beam offers a way to define arbitrary computations in the form of DoFn UDF implementation used with ParDo transforms. This should allow you to add logic that connects to any REST API. Currently there's no official REST Beam connector as you mentioned but this should not block you from connecting to any custom API within your pipeline.

Dynamic Streams Topology in Kafka

While creating Kafka Streams using Kafka Streams DSL
https://kafka.apache.org/0110/documentation/streams/developer-guide
we have encountered a scenario where we need to update the Kafka Streams with new topology definition.
For example:
When we started, we have a topology defined to read from one topic (Source) and a destination topic (Sink).
However, on a configuration change we now need to read from 2 different topics (2 sources if you will) and write to a single destination topic.
From what we have built right now, the topology definition is hard coded, something like defined in processor topology.
Questions:
Is it possible to define topology in a declarative way (say in a Json or something else), which doesn't require a codification of the topology?
Is it possible to reload an existing Kafka Stream to use a new definition of the Kafka Streams Topology?
For #2 mentioned above, does Kafka Streams DSL provide a way to "reload" new topology definitions by way of an external trigger or system call?
We are using JDK 1.8 and Kafka DSL 2.2.0
Thanks,
Ayusman
Is it possible to define topology in a declarative way (say in a Json or something else), which doesn't require a codification of the topology?
The KStreams DSL is declarative, but I assume you mean something other than the DSL?
If so, the answer is No. You may want to look at KSQL, however.
Is it possible to reload an existing Kafka Stream to use a new definition of the Kafka Streams Topology?
You mean if an existing Kafka Streams application can reload a new definition of a processing topology? If so, the answer is No. In such cases, you'd deploy a new version of your application.
Depending on how the old/new topologies are defined, a simple rolling upgrade of your application may suffice (roughly: if the topology change was minimal), but probably you will need to deploy the new application separately and then, once the new one is vetted, decommission your old application.
Note: KStreams is a Java library and, by design, does not include functionality to operate/manage the Java applications that use the KStreams library.
For #2 mentioned above, does Kafka Streams DSL provide a way to "reload" new topology definitions by way of an external trigger or system call?
No.

Elastic Stack - REST API logging with full JSON request and response

Background
We have a web server written in Java that communicates with thousands of mobile apps via HTTPS REST APIs.
For investigation purposes we have to log all API calls - currently this is implemented as a programming #Aspect, and for each API call we save an api_call_log object into a MySQL table with the following attributes:
tenant_id
username
device_uuid
api_method
api_version
api_start_time
api_processing_duration
request_parameters
full_request (JSON)
full_response (JSON)
response_code
Problem
As you can imagine after reaching a certain throughput this solution doesn't scale well, and also querying this table is very slow even with the use of the right MySQL indices.
Approach
That's why we want to use the Elastic Stack to re-implement this solution, however I am a bit stuck at the moment.
Question
I couldn't found any Logstash plugins yet that would suit my needs - should I output this api_call_log object into a log file instead and use Logstash to parse, filter and transform that file?
Exactly this is what I would do in this case. Write your log to a file using a framework like logback, rotate it. If you want easy parsing use json as logging format (also available in logback). Then use Filebeat in order to ingest the logfile as it gets written. If you need to transform/parse the messages in elasticsearch ingest nodes using pipelines.
Consider tagging/enriching the logfiles read by filebeat with machine or enviroment specific informations in order to ask for them in your visualisation or report etc.
The filebeat-to-elastic approach is the simplest one. Try this first. If you can't get your parsing done in elasticsearch pipelines, put a logstash in between.
Using filebeat you'll get many stuff for free like backpressure handling and daily indicies what comes very handy in the logging scenario we are discussing here.
When you need a visualization or search ui, have a look on kibana or grafana.
And if you have more questions, raise a new question here.
Have Fun!
https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-installation.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html

Preprocessing input text before calling ElasticSearch API

I have a Java client that allows indexing documents on a local ElasticSearch server.
I now want to build a simple Web UI that allows users to query the ES index by typing in some text in a form.
My problem is that, before calling ES APIs to issue the query, I want to preprocess the user input by calling some Java code.
What is the easiest and "cleanest" way to achieve this?
Should I create my own APIs so that the UI can access my Java code?
Should I build the UI with JSP so that I can directly call my Java
code?
Can I somehow make ElasticSearch execute my Java code before
the query is executed? (Perhaps by creating my own ElasticSearch plugin?)
In the end, I opted for the simple solution of using Json-based RESTful APIs. Time proved this to be quite flexible and effective for my case, so I thought I should share it:
My Java code exposes its ability to query an ElasticSearch index, by running an HTTP server and responding to client requests with Json-formatted ES results. I created the HTTP server with a few lines of code, by using sun.net.HttpServer. There are more serious/complex HTTP servers out there (such as Tomcat), but this was very quick to adopt and required zero configuration headaches.
My Web UI makes HTTP GET requests to the Java server, receives Json-formatted data and consumes it happily. My UI is implemented in PHP, but any web language does the job, as long as you can issue HTTP requests.
This solution works really well in my case, because it allows to have no dependencies on ES plugins. I can do any sort of pre-processing before calling ES, and even post-process ES output before sending the results back to the UI.
Depending on the type of pre-processing, you can create an Elasticsearch plugin as custom analyser or custom filter: you essentially extend the appropriate Lucene class(es) and wrap everything into an Elasticsearch plugin. Once the plugin is loaded, you can configure the custom analyser and apply it to the related fields. There are a lot of analysers and filters already available in Elasticsearch, so you might want to have a look at those before writing your own.
Elasticsearch plugins: https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-plugins.html (a list of known plugins at the end)
Defining custom analysers: https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html

Categories

Resources