I am configuring a Kafka MongoDB sink connector on my Windows machine.
My connect-standalone.properties file has
plugin.path=E:/Tools/kafka_2.12-2.4.0/plugins
My MongoSinkConnector.properties file has
name=mongo-sink
topics=first_topic
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
tasks.max=1
key.ignore=true
# Specific global MongoDB Sink Connector configuration
connection.uri=mongodb://localhost:27017,mongo1:27017,mongo2:27017,mongo3:27017
database=test_kafka
collection=transactions
max.num.retries=3
retries.defer.timeout=5000
type.name=kafka-connect
In the E:/Tools/kafka_2.12-2.4.0/plugins folder I have the mongo-kafka-connect-1.0.1.jar file.
The command I run is:
bin\windows\connect-standalone config\connect-standalone.properties config\MongoSinkConnector.properties
The error I get is
[2020-03-23 04:04:12,376] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone)
java.lang.NoClassDefFoundError: com/mongodb/ConnectionString
at com.mongodb.kafka.connect.sink.MongoSinkConfig.createConfigDef(MongoSinkConfig.java:140)
at com.mongodb.kafka.connect.sink.MongoSinkConfig.<clinit>(MongoSinkConfig.java:78)
at com.mongodb.kafka.connect.MongoSinkConnector.config(MongoSinkConnector.java:62)
at org.apache.kafka.connect.connector.Connector.validate(Connector.java:129)
at org.apache.kafka.connect.runtime.AbstractHerder.validateConnectorConfig(AbstractHerder.java:313)
at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:194)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:115)
Caused by: java.lang.ClassNotFoundException: com.mongodb.ConnectionString
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
Which other jar files should I place in the plugins folder, and/or what config changes do I have to make?
UPDATE 1
I have also placed the mongodb-driver-core-4.0.1 and bson-4.0.1 jar files in the plugins folder, but I still get the same error.
I finally got the mongo-kafka connector working on Windows.
Here is what worked for me:
Kafka installation folder is E:\Tools\kafka_2.12-2.4.0
E:\Tools\kafka_2.12-2.4.0\plugins has the mongo-kafka-1.0.1-all.jar file.
I downloaded this from https://www.confluent.io/hub/mongodb/kafka-connect-mongodb
Click the blue Download button on the left to get the mongodb-kafka-connect-mongodb-1.0.1.zip file.
There is also a MongoSinkConnector.properties file in the etc folder inside the zip.
Move it to kafka_installation_folder\config (the run command below loads it from there).
My connect-standalone.properties file has the following entries:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
plugin.path=E:/Tools/kafka_2.12-2.4.0/plugins/mongo-kafka-1.0.1-all.jar
My MongoSinkConnector.properties file has the following entries
name=mongo-sink
topics=topic1,topic2
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
tasks.max=1
connection.uri=mongodb://localhost:27017,localhost:27017,localhost:27017
database=test_kafka
collection=transactions
max.num.retries=3
retries.defer.timeout=5000
field.renamer.mapping=[]
field.renamer.regex=[]
max.batch.size = 0
rate.limiting.timeout=0
rate.limiting.every.n=0
How To Run
Start MongoDB, ZooKeeper, and the Kafka server in three separate consoles.
In a 4th console, start Kafka Connect:
bin\windows\connect-standalone config\connect-standalone.properties config\MongoSinkConnector.properties
In a 5th console, send messages to a topic (I used topic1):
bin\windows\kafka-console-producer --broker-list localhost:9092 --topic topic1
>{"Hello":1}
>{"Mongo":2}
>{"World":3}
Open a mongo client and check your database/collections. You will see these three messages.
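To verify from the command line instead, something like this should work (using the database and collection configured above; it assumes the mongo shell is on your PATH):
mongo --quiet --eval "db.getSiblingDB('test_kafka').transactions.find().pretty()"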
Related
I am trying MySQL to HDFS data ingestion using Gobblin. I am running mysql-to-gobblin.pull using the steps below:
1) start hadoop:
sbin\start-all.cmd
2) start mysql service:
sudo service mysql start
3) set GOBBLIN_WORK_DIR:
export GOBBLIN_WORK_DIR=/mnt/c/users/name/incubator-gobblin/GOBBLIN_WORK_DIR
4) set GOBBLIN_JOB_CONFIG_DIR
export GOBBLIN_JOB_CONFIG_DIR=/mnt/c/users/name/incubator-gobblin/GOBBLIN_JOB_CONFIG_DIR
5) Start standalone
bin/gobblin.sh service standalone start --jars /mnt/C/Users/name/incubator-gobblin/build/gobblin-sql/libs/gobblin-sql-0.15.0.jar
This gives the error below:
ERROR [JobScheduler-0] org.apache.gobblin.scheduler.JobScheduler$NonScheduledJobRunner 637 - Failed to run job GobblinMySql
org.apache.gobblin.runtime.JobException: Failed to run job GobblinMySql
Caused by: java.lang.ClassNotFoundException: org.apache.gobblin.source.extractor.extract.jdbc.MysqlSource
Below is the mysql-to-gobblin.pull file:
# Job properties
job.name=GobblinMySql
job.group=MySql
job.description=Data pull from MySql
# Extract properties
extract.table.type=snapshot_only
extract.table.name=user
# Property to consider the extract as full dump
extract.is.full=true
# Source properties
# Source properties - source class to extract data from Mysql Source
source.class=org.apache.gobblin.source.extractor.extract.jdbc.MysqlSource
# Source properties
source.max.number.of.partitions=1
source.querybased.partition.interval=1
source.querybased.is.compression=true
source.querybased.watermark.type=timestamp
# Converter properties - Record from mysql source will be processed by the below series of converters
converter.classes=gobblin.converter.avro.JsonIntermediateToAvroConverter
# date columns format
converter.avro.timestamp.format=yyyy-MM-dd HH:mm:ss'.0'
converter.avro.date.format=yyyy-MM-dd
converter.avro.time.format=HH:mm:ss
# Qualitychecker properties
qualitychecker.task.policies=gobblin.policies.count.RowCountPolicy,gobblin.policies.schema.SchemaCompatibilityPolicy
qualitychecker.task.policy.types=OPTIONAL,OPTIONAL
# Publisher properties
data.publisher.type=gobblin.publisher.BaseDataPublisher
source.querybased.schema=praveen_schema
source.entity=user
source.querybased.extract.type=snapshot
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
mr.job.max.mappers=1
metrics.reporting.file.enabled=true
metrics.log.dir=/gobblin-kafka/metrics
metrics.reporting.file.suffix=txt
bootstrap.with.offset=earliest
fs.uri=hdfs://localhost:9000
writer.fs.uri=hdfs://localhost:9000
state.store.fs.uri=hdfs://localhost:9000
mr.job.root.dir=/gobblin-kafka/working
state.store.dir=/gobblin-kafka/state-store
task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
data.publisher.final.dir=/gobblintest/job-output
I am running this command from the /mnt/c/users/name/incubator-gobblin/build/gobblin-distribution/distributions/gobblin-dist directory.
What changes do I need to make here? How can I solve it?
The solution is to add this jar or dependency to get rid of Caused by: java.lang.ClassNotFoundException: org.apache.gobblin.source.extractor.extract.jdbc.MysqlSource:
<dependency>
    <groupId>com.linkedin.gobblin</groupId>
    <artifactId>gobblin-core</artifactId>
    <version>0.8.0</version>
</dependency>
Download the jar from the Maven repository website.
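For example, one way to get the class onto the job classpath (a sketch; the gobblin-dist path matches the directory mentioned above, and the jar name assumes the 0.8.0 download) is to drop the downloaded jar into the distribution's lib directory:
# copy the downloaded jar into gobblin-dist/lib so standalone mode picks it up
cp gobblin-core-0.8.0.jar /mnt/c/users/name/incubator-gobblin/build/gobblin-distribution/distributions/gobblin-dist/lib/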
Hope this helps.
I have created a Spark Cluster with 3 workers on Kubernetes and a JupyterHub deployment to attach to it so I can run huge queries.
My parquet files are stored into IBM Cloud Object Storage (COS) and when I run a simple code to read from COS, I'm getting the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/path/myfile.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} at parquet.hadoop.ParquetFileReader.readAllFootersInParallel
I have added all the required libraries to the /jars directory under SPARK_HOME on the driver.
This is the code I'm using to connect:
# Initial Setup - Once
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# sc is normally provided by the PySpark kernel; create it here if it is missing
sc = SparkContext.getOrCreate()
spark_session = SparkSession(sc)

credentials_staging_parquet = {
    'bucket_dm': 'mybucket1',
    'bucket_eid': 'bucket2',
    'secret_key': 'XXXXXXXX',
    'iam_url': 'https://iam.ng.bluemix.net/oidc/token',
    'api_key': 'XXXXXXXX',
    'resource_instance_id': 'crn:v1:bluemix:public:cloud-object-storage:global:a/XXXXX:XXXXX::',
    'access_key': 'XXXXX',
    'url': 'https://s3-api.us-geo.objectstorage.softlayer.net'
}

# Stocator settings so Spark can resolve cos:// URLs against IBM COS
conf = {
    'fs.cos.service.access.key': credentials_staging_parquet.get('access_key'),
    'fs.cos.service.endpoint': credentials_staging_parquet.get('url'),
    'fs.cos.service.secret.key': credentials_staging_parquet.get('secret_key'),
    'fs.cos.service.iam.endpoint': credentials_staging_parquet.get('iam_url'),
    'fs.cos.service.iam.service.id': credentials_staging_parquet.get('resource_instance_id'),
    'fs.stocator.scheme.list': 'cos',
    'fs.cos.impl': 'com.ibm.stocator.fs.ObjectStoreFileSystem',
    'fs.stocator.cos.impl': 'com.ibm.stocator.fs.cos.COSAPIClient',
    'fs.stocator.cos.scheme': 'cos',
    'fs.cos.client.execution.timeout': '18000000',
    'fs.stocator.glob.bracket.support': 'true'
}

# apply the settings to the Hadoop configuration Spark uses for file I/O
hadoop_conf = sc._jsc.hadoopConfiguration()
for key in conf:
    hadoop_conf.set(key, conf.get(key))

parquet_path = 'store/MY_FILE/*'
cos_url = 'cos://{bucket}.service/{parquet_path}'.format(
    bucket=credentials_staging_parquet.get('bucket_eid'),
    parquet_path=parquet_path)
df2 = spark_session.read.parquet(cos_url)
I got a similar error and found this post via Google. I then realized I had a file format issue: the saved file was Avro but the reader expected ORC. So check that the format of the saved file and the format your reader expects are aligned.
Found the problem to my issue, the required libraries were not available for all workers in the cluster.
There are 2 ways to fix that:
Make sure you add the dependencies to the spark-submit command so they are distributed to the whole cluster; in this case it should be done in the kernel.json file for JupyterHub, located at /usr/local/share/jupyter/kernels/pyspark/kernel.json (assuming you created that).
OR
Add the dependencies to the /jars directory under SPARK_HOME on each worker in the cluster and on the driver (if you didn't already do so).
I used the second approach: during my Docker image creation I added the libs, so when I start my cluster all containers already have the required libraries.
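For example, a step like this at image build time (a sketch; the source path and jar names are illustrative, not the exact libraries I used):
# run while building the Spark worker/driver image so every container ships the COS/Stocator client jars
cp /tmp/downloaded-jars/stocator-*.jar "$SPARK_HOME/jars/"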
Try restarting your system or server; it should work after that.
I faced the same problem. It generally happens when you upgrade your Java version but the Spark libs still point to the old Java version. Rebooting your server/system resolves the problem.
I'm working on a couple of Kafka connectors and I don't see any errors in their creation/deployment in the console output; however, I am not getting the results I'm looking for (no results whatsoever, for that matter, desired or otherwise). I based these connectors on Kafka's example FileStream connectors, so my debugging technique was based on the SLF4J Logger used in the example. I've searched for the log messages I thought would be produced in the console output, but to no avail. Am I looking in the wrong place for these messages? Or is there perhaps a better way of debugging these connectors?
Example uses of the SLF4J Logger that I referenced for my implementation:
Kafka FileStreamSinkTask
Kafka FileStreamSourceTask
I will try to reply to your question in a broad way. A simple way to do Connector development could be as follows:
Structure and build your connector source code by looking at one of the many Kafka Connectors available publicly (you'll find an extensive list available here: https://www.confluent.io/product/connectors/ )
Download the latest Confluent Open Source edition (>= 3.3.0) from https://www.confluent.io/download/
Make your connector package available to Kafka Connect in one of the following ways:
Store all your connector jar files (the connector jar plus dependency jars, excluding the Connect API jars) in a location on your filesystem and enable plugin isolation by adding this location to the plugin.path property in the Connect worker properties. For instance, if your connector jars are stored in /opt/connectors/my-first-connector, you will set plugin.path=/opt/connectors in your worker's properties (see below).
Store all your connector jar files in a folder under ${CONFLUENT_HOME}/share/java. For example: ${CONFLUENT_HOME}/share/java/kafka-connect-my-first-connector. (It needs to start with the kafka-connect- prefix to be picked up by the startup scripts.) $CONFLUENT_HOME is where you've installed Confluent Platform.
Optionally, increase your logging by changing the log level for Connect in ${CONFLUENT_HOME}/etc/kafka/connect-log4j.properties to DEBUG or even TRACE.
Use Confluent CLI to start all the services, including Kafka Connect. Details here: http://docs.confluent.io/current/connect/quickstart.html
Briefly: confluent start
Note: The Connect worker's properties file currently loaded by the CLI is ${CONFLUENT_HOME}/etc/schema-registry/connect-avro-distributed.properties. That's the file you should edit if you choose to enable classloading isolation but also if you need to change your Connect worker's properties.
Once you have Connect worker running, start your connector by running:
confluent load <connector_name> -d <connector_config.properties>
or
confluent load <connector_name> -d <connector_config.json>
The connector configuration can be in either Java properties or JSON format.
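For instance, a minimal properties-format connector config could look like this (name, connector.class, tasks.max and topics are standard Connect settings; the class and topic names here are placeholders for your own connector):
name=my-first-connector
connector.class=com.example.MyFirstSinkConnector
tasks.max=1
topics=my-topic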
Run
confluent log connect to open the Connect worker's log file, or navigate directly to where your logs and data are stored by running
cd "$( confluent current )"
Note: change where your logs and data are stored during a session of the Confluent CLI by setting the environment variable CONFLUENT_CURRENT appropriately. E.g. given that /opt/confluent exists and is where you want to store your data, run:
export CONFLUENT_CURRENT=/opt/confluent
confluent current
Finally, to interactively debug your connector, a possible way is to apply the following before starting Connect with the Confluent CLI:
confluent stop connect
export CONNECT_DEBUG=y; export DEBUG_SUSPEND_FLAG=y;
confluent start connect
and then connect with your debugger (for instance, remotely to the Connect worker on the default port 5005). To stop running Connect in debug mode, just run unset CONNECT_DEBUG; unset DEBUG_SUSPEND_FLAG; when you are done.
I hope the above will make your connector development easier and ... more fun!
I love the accepted answer. One thing: the environment variables didn't work for me... I'm using Confluent Community Edition 5.3.1.
Here's what I did that worked:
I installed the Confluent CLI from here:
https://docs.confluent.io/current/cli/installing.html#tarball-installation
I ran Confluent using the command confluent local start
I got the Connect app details using the command ps -ef | grep connect
I copied the resulting command to an editor and added the arg (right after java):
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Then I stopped Connect using the command confluent local stop connect
Then I ran the connect command with the arg added.
Brief intermission ---
VS Code development is led by Erich Gamma, of Gang of Four fame, who also led Eclipse development. VS Code is becoming a first-class Java IDE; see https://en.wikipedia.org/wiki/Erich_Gamma
Intermission over ---
Next I launched VS Code and opened the Debezium Oracle connector folder (cloned from https://github.com/debezium/debezium-incubator).
Then I chose Debug - Open Configurations
and entered a remote-attach debugging configuration (see the sketch below),
and then ran the debugger: it hits your breakpoints!
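A minimal launch.json attach entry for this (a sketch; it assumes the VS Code Java debugger extension, and the name is arbitrary) looks roughly like:
{
    "type": "java",
    "name": "Attach to Kafka Connect",
    "request": "attach",
    "hostName": "localhost",
    "port": 5005
}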
The connect command should look something like this:
/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home/bin/java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/folders/yn/4k6t1qzn5kg3zwgbnf9qq_v40000gn/T/confluent.CYZjfRLm/connect/logs -Dlog4j.configuration=file:/Users/myuserid/confluent-5.3.1/bin/../etc/kafka/connect-log4j.properties -cp /Users/myuserid/confluent-5.3.1/share/java/kafka/*:/Users/myuserid/confluent-5.3.1/share/java/confluent-common/*:/Users/myuserid/confluent-5.3.1/share/java/kafka-serde-tools/*:/Users/myuserid/confluent-5.3.1/bin/../share/java/kafka/*:/Users/myuserid/confluent-5.3.1/bin/../support-metrics-client/build/dependant-libs-2.12.8/*:/Users/myuserid/confluent-5.3.1/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* org.apache.kafka.connect.cli.ConnectDistributed /var/folders/yn/4k6t1qzn5kg3zwgbnf9qq_v40000gn/T/confluent.CYZjfRLm/connect/connect.properties
The connector module is executed by the Kafka Connect framework. For debugging, we can use standalone mode: configure the IDE to use the ConnectStandalone main function as the entry point.
Create a debug configuration along these lines. Remember to tick "Include dependencies with 'Provided' scope" if it is a Maven project.
The connector properties file needs to specify the connector class name via "connector.class" for debugging.
The worker properties file can be copied from the Kafka folder, e.g. /usr/local/etc/kafka/connect-standalone.properties.
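The IDE run configuration mirrors a command line like the following (a sketch; the classpath and properties file paths are illustrative, while the main class is the standard Connect entry point):
java -cp "<kafka-libs>/*:target/my-connector.jar" \
  org.apache.kafka.connect.cli.ConnectStandalone \
  /usr/local/etc/kafka/connect-standalone.properties my-connector.properties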
Has anyone run into problems running the HelloWorld Twill example? My Application gets accepted but then transitions to the "FAILED" state.
Yarn application HelloWorldRunnable application_1406337868863_0013 completed with status FAILED
The YARN Web UI shows this as the error:
Application application_1406337868863_0013 failed 2 times due to AM Container for appattempt_1406337868863_0013_000002 exited with exitCode: -1000 due to: File file:/twill/HelloWorldRunnable/2ba08d9f-ca23-4363-a7be-426b93c88de2/appMaster.775a1137-6134-46e2-b270-fc466ce7fe91.jar does not exist
.Failing this attempt.. Failing the application.
Does YARN expect to find this jar on HDFS at the location above? It seems like the jar gets copied to my local FS at the location specified above but not to HDFS.
It looks like you don't have the Hadoop conf directory (e.g. /etc/hadoop/conf) on the classpath, so the local file system (file:/twill/...) is used instead of HDFS.
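A minimal sketch of launching with the Hadoop config on the classpath (the jar and main class names are placeholders for your Twill HelloWorld build):
# put core-site.xml/hdfs-site.xml on the classpath so fs.defaultFS resolves to HDFS rather than file:/
export HADOOP_CONF_DIR=/etc/hadoop/conf
java -cp "$HADOOP_CONF_DIR:$(hadoop classpath):hello-world-twill.jar" com.example.HelloWorldMain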
I have a Solr 4.2.0 server running under a Tomcat 7.0 container. I'm trying to wire it up with my external ZooKeeper (actually, it doesn't work with the embedded ZooKeeper either).
I tried these Java opts:
-Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf
-DzkRun
-DnumShards=2
for running the embedded ZooKeeper.
And also these Java opts:
-Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf
-DzkHost=localhost:2181
-DnumShards=2
for connecting to the external ZooKeeper.
In both cases I continue to get the same exception:
java.io.FileNotFoundException: File '.\solr\collection1\conf \admin-extra.html' does not exist
But the problem is that the file admin-extra.html exists, right there. I can't figure out what the problem is.
From your exception, it seems your path has a whitespace after the config directory name.
Try defining your bootstrap_confdir between double quotes, like:
-Dbootstrap_confdir="./solr/collection1/conf"
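For reference, the full set of external-ZooKeeper opts from above with the quoted path would be:
-Dbootstrap_confdir="./solr/collection1/conf" -Dcollection.configName=myconf -DzkHost=localhost:2181 -DnumShards=2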