Hey, I have an EMR cluster with Spark 1.5.0 installed on it.
I'm trying to connect to one of our tables in RDS and pull some data.
I downloaded the latest connector JAR from the official MySQL site (http://dev.mysql.com/downloads/connector/j/), untarred it, put it in /home/hadoop/connectors, and added this path to the spark-defaults.conf file.
I managed to create a DataFrame with this connection:
df = sqlContext.read.format('jdbc') \
       .options(url="jdbc:mysql://dw-mysql-replica.gtforge.com:3306/dwh?user=<usr>&password=<pass>",
                dbtable='<table>') \
       .load()
and managed to print the schema:
df.printSchema()
but when I try to materialize this DataFrame (e.g. df.take(1) or df.collect()), it throws the following error:
"java.sql.SQLException: No suitable driver found for jdbc:mysql://dw-mysql-
replica.gtforge.com:3306...."
Thanks
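One common workaround (not from the original post; the paths and connector version are placeholders) is to ship the connector JAR explicitly, for example via --jars on spark-submit or spark.driver.extraClassPath / spark.executor.extraClassPath in spark-defaults.conf, and to name the driver class in the JDBC options so DriverManager can find it. A minimal sketch:

# Sketch: assume the connector was shipped with
#   spark-submit --jars /home/hadoop/connectors/mysql-connector-java-<version>.jar ...
# and name the driver class explicitly for the JDBC data source.
df = sqlContext.read.format('jdbc') \
       .options(url="jdbc:mysql://dw-mysql-replica.gtforge.com:3306/dwh",
                user="<usr>",
                password="<pass>",
                dbtable="<table>",
                driver="com.mysql.jdbc.Driver") \
       .load()
df.take(1)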
Affects Version/s: 2.3.2
Component/s: PySpark, Spark Core, Spark Shell
Labels: JSON external-tables hive spark
Environment: HDP 3.1.4
hive-hcatalog-core-3.1.0.3.1.4.0-315.jar and hive-hcatalog-core-3.1.2 (I've tried both)
Description
I created an external Hive table over an HDFS file whose records are formatted as JSON strings.
I can read the data fields of this Hive table in the Hive shell with the help of org.apache.hive.hcatalog.data.JsonSerDe, which is packaged in hive-hcatalog-core.jar.
But when I try to use Spark (pyspark, spark-shell, or whatever), I just can't read it.
It gives me the error: Table: Unable to get field from serde: org.apache.hive.hcatalog.data.JsonSerDe
I've copied the jar (hive-hcatalog-core.jar) to $SPARK_HOME/jars and the YARN lib directories and rerun, with no effect, even when using --jars $jar_path/hive-hcatalog-core.jar. But when I browse the Spark task web page, I can actually find the jar in the environment list.
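For reference, a minimal PySpark sketch of how the SerDe jar is usually made visible to Spark's Hive session; this is not taken from the report above, and the jar path and table name are placeholders:

# Sketch: register the HCatalog JSON SerDe with Spark's Hive support.
# /path/to/hive-hcatalog-core.jar and db.json_table are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-jsonserde-table")
         .config("spark.jars", "/path/to/hive-hcatalog-core.jar")  # ship the jar to driver and executors
         .enableHiveSupport()
         .getOrCreate())

spark.sql("ADD JAR /path/to/hive-hcatalog-core.jar")  # also register it with the Hive session
spark.table("db.json_table").show(5)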
There are several .txt files in my Java Gradle project, which I use to populate a database via the MySQL statement LOAD DATA LOCAL INFILE in SQL scripts.
There is a MySQL server running on my PC (hostname 127.0.0.1, port 3306, DB name test), so in the project's build.gradle file I configure Flyway as follows:
flyway {
    url = 'jdbc:mysql://127.0.0.1:3306/test'
    user = 'root'
    password = '111111'
}
An example .sql file is
LOAD DATA LOCAL INFILE 'data/categories.txt' INTO TABLE category
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';
but when I run the migration with gradle flywayMigrate -i, I get the following error:
Loading local data is disabled; this must be enabled on both the client and server sides
I think I have enabled it on the server side as shown below, which is a MySQL Command Line Client screenshot. Hope I did it correctly.
After this step the error still shows up, so I think I also need to enable it on the client side, i.e. in the Java Gradle project. Is that correct?
Based on https://dev.mysql.com/doc/refman/8.0/en/load-data-local-security.html, I should set ENABLE_LOCAL_INFILE = 1.
But how do I do that? Where should I add this ENABLE_LOCAL_INFILE = 1 in the connection string?
Hope someone can help me. Thanks in advance!
Use the JDBC connection string property allowLoadLocalInfile=true
See https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-connp-props-security.html#cj-conn-prop_allowLoadLocalInfile
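For example (not from the original answer, just applying it to the build.gradle above), the property is appended to the JDBC URL:
url = 'jdbc:mysql://127.0.0.1:3306/test?allowLoadLocalInfile=true'
with the rest of the flyway block unchanged.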
I have created a Spark cluster with 3 workers on Kubernetes and a JupyterHub deployment attached to it so I can run huge queries.
My Parquet files are stored in IBM Cloud Object Storage (COS), and when I run a simple piece of code to read from COS, I get the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/path/myfile.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} at parquet.hadoop.ParquetFileReader.readAllFootersInParallel
I have added all the required libraries to the /jars directory under SPARK_HOME on the driver.
This is the code I'm using to connect:
# Initial Setup - Once
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# sc is usually provided by the notebook kernel; create it if it isn't
sc = SparkContext.getOrCreate(SparkConf())
spark_session = SparkSession(sc)

credentials_staging_parquet = {
    'bucket_dm': 'mybucket1',
    'bucket_eid': 'bucket2',
    'secret_key': 'XXXXXXXX',
    'iam_url': 'https://iam.ng.bluemix.net/oidc/token',
    'api_key': 'XXXXXXXX',
    'resource_instance_id': 'crn:v1:bluemix:public:cloud-object-storage:global:a/XXXXX:XXXXX::',
    'access_key': 'XXXXX',
    'url': 'https://s3-api.us-geo.objectstorage.softlayer.net'
}

# Stocator settings for the cos:// filesystem
conf = {
    'fs.cos.service.access.key': credentials_staging_parquet.get('access_key'),
    'fs.cos.service.endpoint': credentials_staging_parquet.get('url'),
    'fs.cos.service.secret.key': credentials_staging_parquet.get('secret_key'),
    'fs.cos.service.iam.endpoint': credentials_staging_parquet.get('iam_url'),
    'fs.cos.service.iam.service.id': credentials_staging_parquet.get('resource_instance_id'),
    'fs.stocator.scheme.list': 'cos',
    'fs.cos.impl': 'com.ibm.stocator.fs.ObjectStoreFileSystem',
    'fs.stocator.cos.impl': 'com.ibm.stocator.fs.cos.COSAPIClient',
    'fs.stocator.cos.scheme': 'cos',
    'fs.cos.client.execution.timeout': '18000000',
    'fs.stocator.glob.bracket.support': 'true'
}

# Push the settings into the Hadoop configuration Spark uses
hadoop_conf = sc._jsc.hadoopConfiguration()
for key in conf:
    hadoop_conf.set(key, conf.get(key))

parquet_path = 'store/MY_FILE/*'
cos_url = 'cos://{bucket}.service/{parquet_path}'.format(
    bucket=credentials_staging_parquet.get('bucket_eid'),
    parquet_path=parquet_path)

df2 = spark_session.read.parquet(cos_url)
I got a similar error and Googling led me to this post. Then I realized I had a file format issue: the saved file was Avro while the reader was ORC. So check that your saved file format and your reader format match.
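As a hedged illustration of that point (not from the original answer; the path is a placeholder), the reader has to match the format the data was actually written in:

# Sketch: choose the reader that matches how the files were written.
# Reading Avro requires the spark-avro package on the classpath.
df = spark_session.read.format("avro").load("cos://bucket2.service/store/MY_FILE/")  # Avro data
# df = spark_session.read.orc("cos://bucket2.service/store/MY_FILE/")                # only if it is really ORC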
Found the problem to my issue: the required libraries were not available to all workers in the cluster.
There are 2 ways to fix that:
Make sure you add the dependencies on the spark-submit command so they are distributed to the whole cluster; in this case it should be done in the kernel.json file for JupyterHub, located at /usr/local/share/jupyter/kernels/pyspark/kernel.json (assuming you created that), as sketched below.
OR
Add the dependencies to the /jars directory under SPARK_HOME on each worker in the cluster and on the driver (if you didn't do so already).
I used the second approach: during my Docker image creation I added the libs, so when I start my cluster all containers already have the required libraries.
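A minimal sketch of what the first option might look like in that kernel.json; this is not from the original answer, and the master URL, SPARK_HOME path, and jar names are placeholders:

{
  "display_name": "PySpark",
  "language": "python",
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/opt/spark",
    "PYSPARK_SUBMIT_ARGS": "--master k8s://https://<api-server>:443 --jars /opt/jars/stocator.jar,/opt/jars/cos-sdk.jar pyspark-shell"
  }
}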
I faced the same problem. It generally happens when you upgrade your Java version while the Spark libs still point to the old Java version. Try restarting your system or server; rebooting resolved the problem for me.
I have created a database with my own program and it appeared as the file mydatabase.mv.db.
But when I tried to access the same database with DbVisualizer, with apparently the same parameters, it created two files, mydatabase.lock.db and celebrity.h2.db, and didn't see the tables created in the program.
What is the incompatibility?
UPDATE
Both setups are as follows:
In H2 version 1.3.x, the database file <databaseName>.h2.db is the default (the "PageStore" storage engine is used).
In H2 version 1.4.x, the database file <databaseName>.mv.db is the default (the "MVStore" storage engine is used). The MVStore is still beta right now (November 2014), but you can disable the MVStore by appending ;mv_store=false to the database URL.
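For example (the database name here is hypothetical), a 1.4.x URL such as jdbc:h2:~/mydatabase;mv_store=false makes H2 use the PageStore and create or open <databaseName>.h2.db instead of <databaseName>.mv.db.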
The accepted answer is now several years old, and since others may be looking for a more "current" solution...
To get it to work, just update the H2 JDBC driver that DbVisualizer uses. Basically, download the "Platform-Independent Zip" from http://www.h2database.com/html/download.html and copy the h2/bin/h2-X.X.X.jar file to ~/.dbvis/jdbc/, then restart DbVisualizer so it can pick up the updated driver.
Also, make sure you remove .mv.db from the file name when setting the database file name in DbVisualizer.
For Windows users:
A good way to read a *.mv.db file is to install the H2 database locally and then run it with the java command.
Pointing it at your file path will then show the data from your table, unless an error occurs.
You can download the H2 database from:
http://www.h2database.com/html/download-archive.html
Note: choose the H2 version that supports your file.
Install H2 by running the downloaded .exe file (around 7 MB).
Then open a command prompt in the bin directory of H2 and run java -jar <h2 jar>; in my case:
java -jar h2-1.4.200.jar
It will open the H2 database console in the browser.
Provide the database details:
Driver Class: org.h2.Driver
JDBC URL: jdbc:h2:~/h2 (your file path)
User Name: blank by default
Password: blank by default
I have installed Hadoop and connected to it locally successfully. I can connect to Sqoop via the REST API and via the CLI.
But once I want to start creating a job for importing data from MySQL, it shows:
Connection configuration Warning message: Can't connect to the database with given credentials: No suitable driver found for jdbc:mysql://127.0.0.1:3306/for
Error message: Can't load specified driver
After googling for a solution, I have already:
put the mysql-connector.jar into the Sqoop web lib folder
created a lib folder under the Sqoop folder and put the mysql-connector.jar in it
I also restarted and even rebooted my VM. It still says it cannot load the specified driver.
Are there any config files I have missed setting? Thank you!
My ENV:
VirtualBox + Vagrant + Ubuntu 12.04
JDK (Sun distribution 1.7_update 51)
Hadoop 2.2.0 (compiled version)
Sqoop 1.99.3 (compiled version)
Thanks again!
Check the 'JDBC Driver Class' of the connection you just created in Sqoop; it should be set to com.mysql.jdbc.Driver.
If it still doesn't work, put mysql-connector-java-3.1.12-bin.jar into $SQOOP_HOME/server/lib.