I'm wondering whether Apache Beam supports IO for Windows Azure Storage Blob (wasb) files. Is there any support yet?
I'm asking because I've deployed an Apache Beam application to run a job on an Azure Spark cluster, and it seems impossible to read or write wasb files from the storage container associated with that Spark cluster. Is there an alternative solution?
Context: I'm trying to run the WordCount example on my Azure Spark cluster. I had already set some components as stated here, believing that would help. Below is the part of my code where I set the Hadoop configuration:
// Configure the pipeline to run on Spark via YARN
final SparkPipelineOptions options = PipelineOptionsFactory.create().as(SparkPipelineOptions.class);
options.setAppName("WordCountExample");
options.setRunner(SparkRunner.class);
options.setSparkMaster("yarn");
// Register the wasb filesystem and the storage account key on the Hadoop configuration
JavaSparkContext context = new JavaSparkContext();
Configuration conf = context.hadoopConfiguration();
conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
conf.set("fs.azure.account.key.<storage-account>.blob.core.windows.net", "<key>");
// Hand the pre-configured Spark context to the Beam Spark runner
options.setProvidedSparkContext(context);
Pipeline pipeline = Pipeline.create(options);
But unfortunately I keep ending up with the following error:
java.lang.IllegalStateException: Failed to validate wasb://<storage-container>#<storage-account>.blob.core.windows.net/user/spark/kinglear.txt
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:288)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:195)
at org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:129)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
at spark.example.WordCount.main(WordCount.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: java.io.IOException: Unable to find handler for wasb://<storage-container>#<storage-account>.blob.core.windows.net/user/spark/kinglear.txt
at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:187)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:283)
... 13 more
I was thinking about implementing custom IO for Azure Storage Blobs in Apache Beam if it comes to that; I'd like to check with the community whether that would be a viable alternative.
Apache Beam doesn't have a built-in connector for Windows Azure Storage Blob (WASB) at this moment.
There's an active effort in the Apache Beam project to add support for the HadoopFileSystem. I believe WASB has a connector for HadoopFileSystem in the hadoop-azure module. This would make WASB available with Beam indirectly -- this is likely the easiest path forward, and it should be ready very soon.
Now, it would be great to have native support for WASB in Beam. It would likely enable another level of performance, and should be relatively straightforward to implement. As far as I'm aware, nobody is actively working on it, but this would be an awesome contribution to the project! (If you are personally interested in contributing, please reach out!)
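For the indirect route, here is a rough sketch of what the wiring might look like once the HadoopFileSystem work lands. The HadoopFileSystemOptions class and its setHdfsConfiguration method are assumptions (the feature is still in progress), the hadoop-azure and azure-storage jars must be on the classpath, and note that the usual wasb URI form is wasb://<container>@<account>.blob.core.windows.net/<path>:
import java.util.Collections;
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
// Assumed name/location of the options class from the in-progress HadoopFileSystem work
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class WasbWordCountSketch {
  public static void main(String[] args) {
    SparkPipelineOptions options =
        PipelineOptionsFactory.create().as(SparkPipelineOptions.class);
    options.setAppName("WordCountExample");
    options.setRunner(SparkRunner.class);

    // The same wasb settings you already pass to Spark, on a plain Hadoop Configuration
    Configuration conf = new Configuration();
    conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
    conf.set("fs.azure.account.key.<storage-account>.blob.core.windows.net", "<key>");

    // Assumed API: hand the Hadoop configuration to Beam's filesystem registrar
    options.as(HadoopFileSystemOptions.class)
        .setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline pipeline = Pipeline.create(options);
    // TextIO.Read matches the SDK version shown in your stack trace
    pipeline.apply(TextIO.Read.from(
        "wasb://<storage-container>@<storage-account>.blob.core.windows.net/user/spark/kinglear.txt"));
    pipeline.run();
  }
}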
Related
I'm trying to use MongoDB in a Spring Boot project.
I tried a number of unusable tutorials and ended up at the official documentation.
I created a cluster and am now ready to use MongoDB in my project.
The documentation says:
As I understand it, pwd is the password, so my application.properties looks like this:
spring.data.mongodb.uri=mongodb+srv://root:pass#88.155.XX.XXX.mongodb.net/mygrocerylist
spring.data.mongodb.database=mygrocerylist
This is all I added to application.properties - there is no other information in the tutorial.
After I start the project, I receive an error that repeats every few seconds (the project still launches, however - the error keeps appearing in the log):
com.mongodb.MongoConfigurationException: No SRV records available for host 88.155.21.126
    at com.mongodb.internal.dns.DefaultDnsResolver.resolveHostFromSrvRecords(DefaultDnsResolver.java:64) ~[mongodb-driver-core-4.4.2.jar:na]
    at com.mongodb.internal.connection.DefaultDnsSrvRecordMonitor$DnsSrvRecordMonitorRunnable.run(DefaultDnsSrvRecordMonitor.java:78) ~[mongodb-driver-core-4.4.2.jar:na]
    at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Please, can anyone explain exactly how to add MongoDB to a Spring project?
I'd prefer a solution like this, since I need to complete the task and launch the project on another PC. In any case, I tried running MongoDB in Docker and that didn't work for me either, and I'm not sure the other party uses MongoDB locally.
There are two approaches to integrating Apache Kafka with Apache Spark, as mentioned here, each with its pros and cons.
From what I have read, there are two Spark connectors for Apache Pulsar:
org.apache.pulsar:pulsar-spark
io.streamnative.connectors:pulsar-spark-connector_2.12
Can someone please help me understand:
Are they following the receiver-based approach or the direct approach?
What is the core difference between the two connectors?
Is there a possibility of data loss if my job fails while it is processing a batch?
I need to design and configure a Kafka JDBC Connect project where both the source and the sink are Postgres databases, and I am using Apache Kafka 2.8.
I have prepared a POC for standalone mode, but I need to design it for distributed mode, and the data volume would be several million records.
Can you share any reference for setting this up in distributed mode, along with parameter tuning and best practices?
I have gone through several documents but can't find a precise document covering just Apache Kafka with the JDBC connector.
Also, please let me know how I can dockerize this solution.
Thanks,
Suvendu
reference to setup for distributed mode
This is in the Kafka documentation. Run connect-distributed.sh along with its config file.
parameters tuning and best practices?
The config has reasonable defaults, but you're welcome to inspect the file for any changes. The only other thing would be heap settings; 2G is the default Xmx, and it can be set with the KAFKA_HEAP_OPTS env var.
This starts an HTTP server, and you POST JSON to it with the same key/value pairs as the standalone JDBC worker file; a sketch follows below.
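As a rough illustration of that step, the sketch below registers a hypothetical Postgres source against a distributed worker started with connect-distributed.sh connect-distributed.properties. The connector class and config keys follow the Confluent JDBC source connector; the host, database, credentials, column name, and topic prefix are placeholders, and the text block needs Java 15+:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSource {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector config; adjust the connection URL, credentials, mode, and topic prefix
        String payload = """
            {
              "name": "postgres-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://postgres-host:5432/mydb",
                "connection.user": "postgres",
                "connection.password": "postgres",
                "mode": "incrementing",
                "incrementing.column.name": "id",
                "topic.prefix": "pg-"
              }
            }
            """;

        // POST the connector definition to the Connect REST API (default port 8083)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
The same config, minus the JSON wrapper, is what you'd put in the standalone worker's properties file as key=value pairs.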
precise document only for apache Kafka with jdbc connector
There's the official configuration page and a handful of blogs (by Confluent) about it.
how can I make this solution dockerized?
The Confluent Docker images would be best for this, though you may have to confluent-hub install the JDBC connector into an image of your own.
I'd recommend Debezium as the source, though.
I have created a database pool on WASCE 3.0.0.3 (WebSphere Application Server Community Edition), which I am using through JNDI. I want to set Oracle network data encryption and integrity properties for this database pool. The properties I want to set in particular are oracle.net.encryption_client and oracle.net.encryption_types_client.
How can I set these properties? I do not see any option to set them while creating the connection pool, and I cannot find any documentation related to this.
You probably cannot find any documentation on how to do this because WAS 3.0 went out of service in 2003, so any documentation for it is long gone.
If you upgrade to a newer version of WAS traditional (or Liberty), you will find much more documentation and people willing to help you. Additionally, in WAS 6.1 an admin console (UI) was added, which will probably walk you through what you are trying to do.
I have been working on a project in which I have to send files from the local system to my FTP server. For this purpose I thought of using Apache MINA.
Can Apache MINA be used in this situation? Any suggestion or help will be useful. Thanks.
I know Apache Commons Net is a convenient and efficient library for writing FTP clients.
They also provide an FTP client example: FTPClientExample.java
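If you go the Commons Net route, a minimal upload sketch might look like the following (the host, credentials, and paths are placeholders, and error handling is kept to a minimum):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

public class FtpUpload {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");
        ftp.login("user", "password");

        // Binary mode and local passive mode are the usual choices for file transfers
        ftp.setFileType(FTP.BINARY_FILE_TYPE);
        ftp.enterLocalPassiveMode();

        // Stream a local file up to the server
        try (InputStream in = Files.newInputStream(Paths.get("/local/path/report.csv"))) {
            boolean done = ftp.storeFile("/remote/path/report.csv", in);
            System.out.println("Upload " + (done ? "succeeded" : "failed"));
        }

        ftp.logout();
        ftp.disconnect();
    }
}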
Yes, you can use Apache MINA for this purpose. Look for the following JARs/references:
mina-core-2.0.19.jar - for authentication purposes
slf4j-api-1.7.25.jar - for logging purposes
sshd-common-2.1.0.jar - common-function dependency
sshd-core-2.1.0.jar - common-function dependency
sshd-sftp-2.1.0.jar - for SFTP file transfers and for creating clients and connections
An example:
mSshClient = SshClient.setUpDefaultClient();
mSshClient.start();
// connect(username, host, port) returns a ConnectFuture
mConnectFuture = mSshClient.connect(mUsername, mServerAddress.getHostAddress(), mServerPort);
mClientSession = mConnectFuture.verify().getSession();
// Authenticate before opening the SFTP channel (mPassword is assumed to be a field like mUsername)
mClientSession.addPasswordIdentity(mPassword);
mClientSession.auth().verify();
mSftpClient = new DefaultSftpClient(mClientSession);
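From there, a minimal upload sketch continuing the snippet above might look like this (the paths are placeholders, and java.io.OutputStream plus java.nio.file.Files/Paths imports are assumed):
// Stream a local file to the remote server over SFTP
try (OutputStream out = mSftpClient.write("/remote/path/report.csv")) {
    Files.copy(Paths.get("/local/path/report.csv"), out);
}
// Clean up the SFTP channel, session, and client when done
mSftpClient.close();
mClientSession.close();
mSshClient.stop();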