Call a REST API from Apache Beam - Java

Context: I'm creating objects that contain some data, and in my case the data can't all be found in a single source. So I'm consuming data coming from a Kinesis stream and need to call a REST API, passing some data that came from the stream, to get the other fields required to create the final object.
Question: I know Apache Beam hasn't yet released an official way to communicate with REST APIs, but it looks like one is in progress - https://issues.apache.org/jira/browse/BEAM-1946
What I want to know is the best and proper way to call a REST API inside a Beam pipeline. I'd highly appreciate it if you could share any resources/examples along with your feedback.
I've tried the standard Java way (the HttpClient API) of calling a REST API, but I want to know whether there is a better way to achieve this. I'm expecting answers and some resources if possible.
TIA

Apache Beam offers a way to define arbitrary computations in the form of a DoFn UDF implementation used with ParDo transforms. This should allow you to add logic that connects to any REST API. Currently there's no official REST Beam connector, as you mentioned, but this shouldn't block you from connecting to any custom API within your pipeline.
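For illustration, here is a minimal sketch of such a DoFn that enriches each stream element with fields fetched over HTTP. The class name, the endpoint URL, and the string element type are all hypothetical, and it assumes Java 11's built-in java.net.http.HttpClient; substitute your own types and HTTP library.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical enrichment DoFn: for each id read from the Kinesis stream,
// call a REST API and emit the element combined with the extra fields.
public class EnrichViaRestFn extends DoFn<String, String> {

    private transient HttpClient httpClient;

    @Setup
    public void setup() {
        // HttpClient isn't serializable, so create it once per DoFn instance.
        httpClient = HttpClient.newHttpClient();
    }

    @ProcessElement
    public void processElement(@Element String id, OutputReceiver<String> out)
            throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/api/details/" + id)) // hypothetical endpoint
                .GET()
                .build();
        HttpResponse<String> response =
                httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        // Merge the stream element with the API response however your final
        // object requires; here they are just concatenated.
        out.output(id + ":" + response.body());
    }
}

You would then apply it like any other transform, e.g. kinesisElements.apply(ParDo.of(new EnrichViaRestFn())).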

Related

Does Streams API functionality overlap with Spring Integration?

When introducing parallel processing to an application where multiple save-entity calls are being made, I see a prior dev has chosen to do it via Spring Integration using split().channel(MessageChannels.executor(Executors.newFixedThreadPool(10))).handle("saveClass","saveMethod").aggregate().get() - where this method is mapped to a requestChannel using the @Gateway annotation. My question is that this task seems simpler to do using the parallelStream() and forEach() methods. Does IntegrationFlow provide any benefit in this case?
If you're really doing plain in-memory data processing where Java's Stream API is enough, then indeed you don't need a whole messaging solution like Spring Integration. But if you deal with distributed requirements, processing data across different systems (say, from HTTP to Apache Kafka or a DB), then it is better to use a tool that lets you smoothly connect everything together. Also: nothing stops you from using the Stream API in a Spring Integration application; in the end, all your code is Java anyway. Please learn more about what EIP is and why we need a special framework to implement these messaging-based solutions: https://www.enterpriseintegrationpatterns.com/
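As a hedged sketch of the plain-Java alternative (Entity and EntityRepository below are made-up stand-ins for the real save call):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Made-up types standing in for the real entity and save method.
public class ParallelSaveExample {

    interface EntityRepository { void save(Object entity); }

    // Simplest form: parallelStream() runs the saves on the common ForkJoinPool.
    static void saveAll(List<Object> entities, EntityRepository repository) {
        entities.parallelStream().forEach(repository::save);
    }

    // If you specifically want a fixed pool of 10 threads, as in the
    // IntegrationFlow above, an ExecutorService is the plain-Java equivalent.
    static void saveAllOnFixedPool(List<Object> entities, EntityRepository repository)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        entities.forEach(e -> pool.submit(() -> repository.save(e)));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}

Note that parallelStream() gives you no control over pool size, retries, or error channels; those concerns are where the messaging framework starts to pay off.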

How to send data from a Java application to Elasticsearch without using the Elasticsearch libraries

I've seen the already-answered questions, but they are old enough that I couldn't use them. I tried the example given at https://www.elastic.co/blog/found-java-clients-for-elasticsearch, which has the code written, but not in an organized manner that would help me; the libraries are old and the code gives me errors.
I looked at the Spring Data project, but that only allows a specific type of document/class to be indexed and needs the model to be predefined, which is not my use case. My goal is to build a Java web application that can ingest any data documents fed to Elasticsearch so we can analyze them with Kibana. I need to know how I can fire a REST call or curl for bulk data. Can anyone show an example with complete parts, please?
Use the REST client.
The Java High Level REST Client works on top of the Java Low Level REST client. Its main goal is to expose API-specific methods that accept request objects as an argument and return response objects, so that request marshalling and response un-marshalling is handled by the client itself.
To upload data from a Java application to ES, you can use the Bulk API.
To check the list of APIs, check the link.
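A hedged sketch using the High Level REST Client's Bulk API (the index name and documents are placeholders, and this is the 7.x flavour of the client, which was later deprecated in favour of the newer Java API Client):

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class BulkIndexExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Arbitrary JSON documents: no predefined model class is required,
            // which fits the "ingest any document" use case above.
            BulkRequest bulk = new BulkRequest();
            bulk.add(new IndexRequest("my-index")
                    .source("{\"user\":\"alice\",\"age\":30}", XContentType.JSON));
            bulk.add(new IndexRequest("my-index")
                    .source("{\"user\":\"bob\",\"age\":25}", XContentType.JSON));

            BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        }
    }
}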

Writing a custom unbounded sink for Dataflow v2.1

I'm using the Maven dependency google-cloud-dataflow-java-sdk-all version 2.1.0 and I'm trying to add a custom sink to my pipeline.
In the pipeline, I'm retrieving Pubsub messages and eventually transforming these into a PCollection of Strings.
This is a simplified version of the pipeline I've set up:
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(PubsubIO.readMessages())
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
//transformations
.apply(//Write to custom sink)
The reason I need a custom sink is that someone else on the team has already written the code to write this data out to BigQuery and provided a REST API for it. So my sink would be calling this REST API with the relevant data. I'm not keen on using BigQueryIO since that would involve duplicating parts of the code that were already written.
The problem is that I cannot find any documentation on the Apache Beam website about writing custom sinks using the Java SDK, so if someone could give me a nod in the right direction, it'd be much appreciated.
I've also considered just using a ParDo to send the data to the REST API, but then I technically would not have a sink anymore, and I wouldn't be doing it the "Dataflow way".
For unbounded sinks, there is no sink-specific API in Beam; all the IO transforms essentially implement a ParDo. There are a few techniques to provide specific guarantees (e.g. using a GroupByKey to provide a checkpoint barrier in Dataflow), but it depends on your interaction with the external system (the REST API in this case). It looks like writing a ParDo is the way to go in your case.
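A hedged sketch of such a write ParDo (the endpoint, payload format, and bundle-level batching are assumptions; you may still want the GroupByKey technique mentioned above if you need stronger checkpointing guarantees):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical "sink" DoFn: buffers elements per bundle and POSTs them to the
// team's REST API in one batch when the bundle finishes.
public class RestApiWriteFn extends DoFn<String, Void> {

    private transient HttpClient httpClient;
    private transient List<String> batch;

    @Setup
    public void setup() {
        httpClient = HttpClient.newHttpClient();
    }

    @StartBundle
    public void startBundle() {
        batch = new ArrayList<>();
    }

    @ProcessElement
    public void processElement(@Element String row) {
        batch.add(row);
    }

    @FinishBundle
    public void finishBundle() throws Exception {
        if (batch.isEmpty()) {
            return;
        }
        // Hypothetical endpoint; the payload format depends on the API contract.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://internal.example.com/bigquery-writer"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(String.join("\n", batch)))
                .build();
        HttpResponse<String> response =
                httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        // Failing the bundle makes the runner retry it, which is the usual way
        // to avoid silently dropping data.
        if (response.statusCode() >= 300) {
            throw new RuntimeException("Write failed: " + response.statusCode());
        }
    }
}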

Add users in batches

I tried adding members using the ecwid Java library and using a POST request, and I'm able to do that, but I don't see how to add members in batches using the API.
V3.0 of the API has not implemented batch operations yet. What about V2.0 - has anybody been able to do it?
In v2.0, the call you're looking for is batchSubscribe().
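batchSubscribe() presumably wraps the raw v2.0 lists/batch-subscribe endpoint. For reference, a hedged sketch of calling that endpoint directly over HTTP; the data-center prefix, API key, and exact payload shape are assumptions to verify against the v2.0 docs, and it assumes Java 11's HttpClient:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BatchSubscribeExample {
    public static void main(String[] args) throws Exception {
        String dc = "us1"; // your data-center prefix, taken from the API key suffix
        // Payload shape as recalled from the v2.0 docs; verify before use.
        String body = "{"
                + "\"apikey\":\"YOUR_API_KEY\","
                + "\"id\":\"YOUR_LIST_ID\","
                + "\"batch\":["
                + "{\"email\":{\"email\":\"alice@example.com\"}},"
                + "{\"email\":{\"email\":\"bob@example.com\"}}"
                + "]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://" + dc + ".api.mailchimp.com/2.0/lists/batch-subscribe.json"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}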

Preprocessing input text before calling ElasticSearch API

I have a Java client that allows indexing documents on a local ElasticSearch server.
I now want to build a simple Web UI that allows users to query the ES index by typing in some text in a form.
My problem is that, before calling ES APIs to issue the query, I want to preprocess the user input by calling some Java code.
What is the easiest and "cleanest" way to achieve this?
Should I create my own APIs so that the UI can access my Java code?
Should I build the UI with JSP so that I can directly call my Java code?
Can I somehow make ElasticSearch execute my Java code before the query is executed? (Perhaps by creating my own ElasticSearch plugin?)
In the end, I opted for the simple solution of using JSON-based RESTful APIs. Time proved this to be quite flexible and effective for my case, so I thought I should share it:
My Java code exposes its ability to query an ElasticSearch index by running an HTTP server and responding to client requests with JSON-formatted ES results. I created the HTTP server with a few lines of code, using the JDK's built-in com.sun.net.httpserver.HttpServer. There are more serious/complex HTTP servers out there (such as Tomcat), but this was very quick to adopt and required zero configuration headaches.
My Web UI makes HTTP GET requests to the Java server, receives JSON-formatted data and consumes it happily. My UI is implemented in PHP, but any web language does the job, as long as you can issue HTTP requests.
This solution works really well in my case, because it allows me to have no dependencies on ES plugins. I can do any sort of pre-processing before calling ES, and even post-process ES output before sending the results back to the UI.
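A minimal sketch of that setup using the JDK's built-in server (the /search path and the preprocess() logic are placeholders, and forwarding to ES is left as a stub):

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;

// Tiny HTTP front-end: receive the user's query, preprocess it in Java,
// then (in real code) forward it to Elasticsearch and relay the results.
public class SearchProxy {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/search", SearchProxy::handle);
        server.start();
    }

    static void handle(HttpExchange exchange) throws IOException {
        // e.g. GET /search?q=some+text - take the raw query string as user input.
        String rawQuery = exchange.getRequestURI().getQuery(); // may be null
        String cleaned = preprocess(rawQuery == null ? "" : rawQuery);
        // A real handler would send `cleaned` to the ES _search endpoint and
        // relay the JSON response; this stub just echoes the processed input.
        byte[] body = ("{\"query\":\"" + cleaned + "\"}").getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "application/json");
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream os = exchange.getResponseBody()) {
            os.write(body);
        }
    }

    static String preprocess(String q) {
        // Placeholder preprocessing: trim and lower-case the user input.
        return q.trim().toLowerCase();
    }
}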
Depending on the type of pre-processing, you can create an Elasticsearch plugin as a custom analyser or custom filter: you essentially extend the appropriate Lucene class(es) and wrap everything into an Elasticsearch plugin. Once the plugin is loaded, you can configure the custom analyser and apply it to the related fields. There are a lot of analysers and filters already available in Elasticsearch, so you might want to have a look at those before writing your own.
Elasticsearch plugins: https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-plugins.html (a list of known plugins at the end)
Defining custom analysers: https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-analyzers.html
