Apache Pulsar direct approach with Apache Spark - Java

There are two approaches to integrating Apache Kafka with Apache Spark, as described here, each with its own pros and cons.
From what I have read, there are two Spark connectors for Apache Pulsar:
org.apache.pulsar:pulsar-spark
io.streamnative.connectors:pulsar-spark-connector_2.12
Can someone please help me understand:
Do they follow the receiver-based approach or the direct approach?
What is the core difference between the two connectors?
Is there a possibility of data loss if my job fails while it is processing a batch?
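No answer is recorded here, but for orientation, below is a minimal Java sketch of reading a Pulsar topic with Spark Structured Streaming through the StreamNative connector. The option names (service.url, admin.url, topic) are taken from that connector's documentation and should be verified against the version you use; the URLs, topic, and checkpoint path are placeholders. Structured Streaming sources track progress via checkpoints rather than receivers.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class PulsarStructuredStreamingExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("pulsar-structured-streaming")
                .master("local[*]")   // local master only for this sketch
                .getOrCreate();

        // The connector registers itself as the "pulsar" source for Structured Streaming.
        Dataset<Row> input = spark.readStream()
                .format("pulsar")
                .option("service.url", "pulsar://localhost:6650")
                .option("admin.url", "http://localhost:8080")
                .option("topic", "persistent://public/default/my-topic")
                .load();

        // Progress is recorded in the checkpoint directory, which is what allows a
        // failed batch to be reprocessed instead of silently dropped.
        StreamingQuery query = input.writeStream()
                .format("console")
                .option("checkpointLocation", "/tmp/pulsar-checkpoint")
                .start();
        query.awaitTermination();
    }
}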

Related

Enabling connection pooling in the Apache Camel Kafka component

I have an Apache Camel route that calls an Apache Kafka topic (producer), as shown below, in my Spring Boot application (deployed on a Tomcat server):
from("timer://foo?period=1000").to("kafka:myTopic?brokers=localhost:9092");
This Spring Boot app is a REST API that is expected to handle around 300 TPS.
Q1) I know that each request coming into my Spring Boot app is served by a single Tomcat server thread. Will that same thread be used by Apache Camel in the line above to invoke myTopic? Or does Apache Camel use some connection pooling internally, like RestTemplate does?
Q2) Because the TPS will increase to 500 in the near future, does it make sense to introduce pooling for the line above? I believe connection pooling would improve my application's performance, but I cannot find the code to enable it for this line.
If anyone has any idea, please let me know. Note that I am not looking for parallel processing, so seda or multicast in Camel is not an option. I have only the single call to the Kafka topic shown above, so I am only looking for how to enable connection pooling for that line of code.
Thanks.
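There is no answer recorded here, but as an illustrative sketch only: the camel-kafka component backs each endpoint with a single, thread-safe KafkaProducer that is shared across calling threads, and its workerPool* options (names taken from the camel-kafka component documentation; verify them for your Camel version) size the pool that services asynchronous send callbacks rather than a pool of connections.

import org.apache.camel.builder.RouteBuilder;
import org.springframework.stereotype.Component;

@Component
public class KafkaPublishRoute extends RouteBuilder {
    @Override
    public void configure() {
        // One KafkaProducer instance backs this endpoint and is reused for every
        // exchange; the worker pool options below only size the callback pool.
        from("timer://foo?period=1000")
            .to("kafka:myTopic?brokers=localhost:9092"
                + "&workerPoolCoreSize=10"
                + "&workerPoolMaxSize=20");
    }
}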

Azure Blob support in Apache Beam?

I'm wondering whether Apache Beam supports IO for Windows Azure Storage Blob (wasb) files. Is there any support yet?
I'm asking because I've deployed an Apache Beam application to run a job on an Azure Spark cluster, and it is basically impossible to do IO on wasb files from the storage container associated with that Spark cluster. Is there any alternative solution?
Context: I'm trying to run the WordCount example on my Azure Spark cluster. I had already set some components as stated here, believing that would help. Below is the part of my code where I'm setting the Hadoop configuration:
final SparkPipelineOptions options = PipelineOptionsFactory.create().as(SparkPipelineOptions.class);
options.setAppName("WordCountExample");
options.setRunner(SparkRunner.class);
options.setSparkMaster("yarn");
JavaSparkContext context = new JavaSparkContext();
Configuration conf = context.hadoopConfiguration();
conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
conf.set("fs.azure.account.key.<storage-account>.blob.core.windows.net",
"<key>");
options.setProvidedSparkContext(context);
Pipeline pipeline = Pipeline.create(options);
But unfortunately I keep ending up with the following error:
java.lang.IllegalStateException: Failed to validate wasb://<storage-container>#<storage-account>.blob.core.windows.net/user/spark/kinglear.txt
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:288)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:195)
at org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:129)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
at spark.example.WordCount.main(WordCount.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: java.io.IOException: Unable to find handler for wasb://<storage-container>#<storage-account>.blob.core.windows.net/user/spark/kinglear.txt
at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:187)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:283)
... 13 more
I was thinking about implementing custom IO for Apache Beam for Azure Storage Blobs if it comes to that; I'd like to check with the community whether that is a viable alternative.
Apache Beam doesn't have a built-in connector for Windows Azure Storage Blob (WASB) at this moment.
There's an active effort in the Apache Beam project to add support for the HadoopFileSystem. I believe WASB has a connector for HadoopFileSystem in the hadoop-azure module. This would make WASB available with Beam indirectly -- this is likely the easiest path forward, and it should be ready very soon.
Now, it would be great to have native support for WASB in Beam. It would likely enable another level of performance, and should be relatively straightforward to implement. As far as I'm aware, nobody is actively working on it, but this would be an awesome contribution to the project! (If you are personally interested in contributing, please reach out!)
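To make the "indirect" path concrete, here is a hedged sketch (not from the answer) of reading wasb:// through Beam's Hadoop filesystem support, assuming a Beam version that provides beam-sdks-java-io-hadoop-file-system plus hadoop-azure on the classpath; the newer TextIO.read() API is used and the account/container names are placeholders.

import java.util.Collections;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class WasbReadSketch {
    public static void main(String[] args) {
        // Same Hadoop settings as in the question, handed to Beam's filesystem layer
        // so it can resolve the wasb:// scheme via hadoop-azure.
        Configuration conf = new Configuration();
        conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        conf.set("fs.azure.account.key.<storage-account>.blob.core.windows.net", "<key>");

        HadoopFileSystemOptions options =
                PipelineOptionsFactory.create().as(HadoopFileSystemOptions.class);
        options.setHdfsConfiguration(Collections.singletonList(conf));

        Pipeline pipeline = Pipeline.create(options);
        // Note the usual container@account form of the wasb URI.
        pipeline.apply(TextIO.read().from(
                "wasb://<storage-container>@<storage-account>.blob.core.windows.net/user/spark/kinglear.txt"))
                .apply(TextIO.write().to(
                "wasb://<storage-container>@<storage-account>.blob.core.windows.net/output/kinglear"));
        pipeline.run().waitUntilFinish();
    }
}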

How to automate Kafka Testing

We have developed a system that uses Kafka to queue data and later consume that data to place orders for users.
We have tested certain things manually, but now our aim is to automate the process.
Is there any client available to test it? I found ways to unit test it using the Kafka client itself, but my aim is to test the system as a whole.
EDIT: our purpose is just API testing, i.e. just the back end, not the UI.
You can start Kafka programmatically in your integration test. Kafka uses ZooKeeper, so first look at ZooKeeper's TestingServer - an instance of this class creates and starts a ZK server on the given port.
Next, look at KafkaServerStartable.scala. You have to provide a configuration that points to your in-memory ZK server and invoke the startup() method. Here is some code:
import kafka.server.KafkaConfig;
import kafka.server.KafkaServerStartable;
import java.util.Properties;

public class KafkaTest {
    public KafkaTest() {
        // createProperties() must at least supply zookeeper.connect pointing at the
        // in-memory ZooKeeper started above; a possible sketch is shown below.
        Properties properties = createProperties();
        KafkaConfig kafkaConfig = new KafkaConfig(properties);
        KafkaServerStartable kafka = new KafkaServerStartable(kafkaConfig);
        kafka.startup();
    }
}
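createProperties() is not shown in the answer; one possible implementation, a sketch assuming Curator's TestingServer (org.apache.curator:curator-test) and classic broker properties, could be added to the KafkaTest class above:

// Hypothetical helper, not part of the original answer: starts an in-memory
// ZooKeeper and returns minimal broker properties pointing at it.
private Properties createProperties() {
    try {
        org.apache.curator.test.TestingServer zk = new org.apache.curator.test.TestingServer(); // random free port
        Properties props = new Properties();
        props.put("zookeeper.connect", zk.getConnectString());
        props.put("broker.id", "0");
        props.put("log.dirs", java.nio.file.Files.createTempDirectory("kafka-logs").toString());
        return props;
    } catch (Exception e) {
        throw new RuntimeException("Could not start embedded ZooKeeper", e);
    }
}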
Hope this helps :)
You can go for integration testing or end-to-end testing by bringing up Kafka in a Docker container. If you use Apache kafka-clients:2.1.0, you don't need to deal with ZooKeeper at the API level while producing or consuming records.
Dockerizing Kafka and testing against it helps cover scenarios on a single node as well as a multi-node Kafka cluster. This way you don't have to test against a mock/in-memory Kafka first and real Kafka later. This can be done using Testcontainers.
If you have many test scenarios to cover, you can go for declarative Kafka testing, docker-compose style, which lets you eliminate the Kafka client API coding.
Check out some handy examples here for validating produce and consume.
The Testcontainers project also supports docker-compose.
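As a concrete illustration of the Testcontainers approach (not taken from the answer above), a JUnit 5 test can start a throwaway broker and produce against it; the image tag and the "orders" topic below are placeholders.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

public class KafkaContainerTest {

    @Test
    void producesAgainstRealBroker() {
        try (KafkaContainer kafka =
                 new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))) {
            kafka.start();

            // Point a plain Kafka producer at the container's advertised listeners.
            Properties props = new Properties();
            props.put("bootstrap.servers", kafka.getBootstrapServers());
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "key", "value"));
                producer.flush();
                // ...consume and assert on the order-placement side here
            }
        }
    }
}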
As I understand it, you want to implement end-to-end tests starting from messages. Some people and I recently researched libraries, tools, and frameworks for testing event-driven systems that use Kafka.
We found Zerocode, an automated API testing tool that uses a declarative language such as JSON or YAML. It supports REST, SOAP and, what we were interested in, messaging. It sends messages to and consumes messages from topics and makes assertions at the end; it is easy to learn and use. Here is the link for more details: Zerocode. It seems like a good option, although we are only just starting to use it.
You will need the Kafka brokers and their dependencies running for this solution to work, but that takes nothing more than a docker-compose file and/or some scripts to bring up a test environment.
Another way is to set up your own project with the Kafka libraries and use them to send and receive messages in the tests.
Unfortunately, we couldn't find many more options out there. Kafka has a proposal to create a test kit, but it's not in progress yet.
Unfortunately, the approach described by Pavel no longer works for Kafka 2.8+. However, I was able to make our end-to-end tests work with Kafka 3.2 using the approach taken by KarelDB:
Properties props = TestUtils.createBrokerConfig(
brokerId,
zkConnect,
false,
false,
TestUtils.RandomPort(),
noInterBrokerSecurityProtocol,
noFile,
EMPTY_SASL_PROPERTIES,
true,
false,
TestUtils.RandomPort(),
false,
TestUtils.RandomPort(),
false,
TestUtils.RandomPort(),
Option.<String>empty(),
1,
false,
1,
(short) 1
);
KafkaConfig config = KafkaConfig.fromProps(props);
KafkaServer server = TestUtils.createServer(config, Time.SYSTEM);
// `createServer` will also start your Kafka server.
// To shutdown:
server.shutdown();

JBoss configuration: Where should the JMS bridge configuration be?

I'm new to JMS programming (Java).
I have a machine M1 in a domain D1 and a machine M2 in another domain D2.
On M1 I have a JMS producer, and on M2 a JMS consumer. Both run JBoss 7.2 as their server.
So it seems the only solution is to create a JMS bridge.
I'm reading the official documentation. First, I wonder whether creating an SSH tunnel is necessary.
Second, in which hornetq-configuration.xml file should I put the following configuration?
<bridge name="my-bridge">
   <queue-name>jms.queue.sausage-factory</queue-name>
   <forwarding-address>jms.queue.mincing-machine</forwarding-address>
   <filter string="name='aardvark'"/>
   <transformer-class-name>
      org.hornetq.jms.example.HatColourChangeTransformer
   </transformer-class-name>
   <retry-interval>1000</retry-interval>
   <ha>true</ha>
   <retry-interval-multiplier>1.0</retry-interval-multiplier>
   <reconnect-attempts>-1</reconnect-attempts>
   <failover-on-server-shutdown>false</failover-on-server-shutdown>
   <use-duplicate-detection>true</use-duplicate-detection>
   <confirmation-window-size>10000000</confirmation-window-size>
   <user>foouser</user>
   <password>foopassword</password>
   <static-connectors>
      <connector-ref>remote-connector</connector-ref>
   </static-connectors>
   <!-- alternative to static-connectors
   <discovery-group-ref discovery-group-name="bridge-discovery-group"/>
   -->
</bridge>
Should it go in the JBoss server of the JMS producer machine or of the consumer machine?
My third question is: is there a difference in settings between a JMS bridge and a core bridge?
I would be thankful for any additional information and explanations.
Thank you a lot!
I know this is a little late for the OP; maybe this information helps someone else.
Firstly, the difference between a core bridge and a JMS bridge.
Read the doc here:
Core bridges are for linking a HornetQ node with another HornetQ node and do not use the JMS API. A JMS bridge is used for linking any two JMS 1.1 compliant JMS providers.
The core bridge uses the proprietary HornetQ core API, so it can only connect two HornetQ servers, whereas a JMS bridge uses the JMS API and can therefore connect any JMS 1.1 compliant servers, e.g. HornetQ to ActiveMQ.
The configuration mentioned in the question is for a core bridge and should be configured on the source server. Since you seem to be connecting two HornetQ servers, a core bridge is the way forward. That said, in your case you could use a JMS bridge as well, since both servers are JMS compliant, but the recommended approach is the core bridge because of its performance advantage.
Finally, the JBoss installation comes with some handy examples. You can find a core bridge example under [JBOSS_HOME]\jboss-as\extras\hornetq\examples\jms\bridge.

Apache Camel - transaction in routes

I have a general question about Apache Camel. I wasn't able to find out whether the aggregator is transacted. If it is, how are the transactions implemented, and how fast is the aggregation?
Sending messages into the aggregator can run in a transaction.
You would need a persistent store with the aggregator to let the outgoing messages act as part of a transaction. See the documentation about persistence:
http://camel.apache.org/aggregator2
For example, there is JDBC-based and HawtDB (file-based) persistence support out of the box. It's pluggable, so you can also build your own.
Chapters 8 and 9 of the Camel in Action book cover this in much more detail.
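For illustration, here is a hedged sketch of a transacted route feeding an aggregator backed by the JDBC repository from camel-sql; the transaction manager, DataSource, endpoints, repository name, and completion size are assumptions, and the repository's backing tables must already exist.

import javax.sql.DataSource;

import org.apache.camel.builder.AggregationStrategies;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.jdbc.JdbcAggregationRepository;
import org.springframework.transaction.PlatformTransactionManager;

public class PersistentAggregatorRoute extends RouteBuilder {

    private final PlatformTransactionManager txManager; // injected
    private final DataSource dataSource;                // injected

    public PersistentAggregatorRoute(PlatformTransactionManager txManager, DataSource dataSource) {
        this.txManager = txManager;
        this.dataSource = dataSource;
    }

    @Override
    public void configure() {
        // Aggregation state is written to the "aggregation" tables, so in-flight
        // groups can be recovered after a crash instead of being lost.
        JdbcAggregationRepository repo =
                new JdbcAggregationRepository(txManager, "aggregation", dataSource);
        repo.setUseRecovery(true);

        from("jms:queue:orders")
            .transacted()                       // incoming consume runs in a transaction
            .aggregate(header("orderId"), AggregationStrategies.groupedExchange())
                .aggregationRepository(repo)
                .completionSize(10)
            .to("jms:queue:orders.batched");
    }
}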
