Can't read external Hive table on Spark - Java

Affects Version/s: 2.3.2
Component/s: PySpark, Spark Core, Spark Shell
Labels: JSON external-tables hive spark
Environment: HDP 3.1.4
I've tried both hive-hcatalog-core-3.1.0.3.1.4.0-315.jar and hive-hcatalog-core-3.1.2.
Description
I created an external Hive table over an HDFS file whose records are JSON strings.
I can read the data fields of this table from the Hive shell with the help of org.apache.hive.hcatalog.data.JsonSerDe, which is packaged in hive-hcatalog-core.jar.
But when I try to use Spark (pyspark, spark-shell, or anything else), I just can't read it.
It gives me the error: Table: Unable to get field from serde: org.apache.hive.hcatalog.data.JsonSerDe
I've copied the jar (hive-hcatalog-core.jar) to $SPARK_HOME/jars and to the YARN library directories and rerun the job with no effect, even when passing --jars $jar_path/hive-hcatalog-core.jar. Yet when I browse the Spark application's web UI, I can see the jar in the environment list.
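For reference, one way this is usually approached (a sketch only; the jar path and table name below are placeholders, not taken from this report) is to make the SerDe jar visible to the session before it starts: spark.jars for shipping it to executors, and spark.driver.extraClassPath set in spark-defaults.conf or on spark-submit for the driver classpath in client mode.

from pyspark.sql import SparkSession

# Placeholder path; point this at the hive-hcatalog-core jar of your install.
SERDE_JAR = "/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar"

# spark.jars ships the jar to the executors at session start.
# spark.driver.extraClassPath usually has to be set before the driver JVM
# launches (spark-defaults.conf or --conf on spark-submit), not here.
spark = (SparkSession.builder
         .appName("read-json-serde-table")
         .config("spark.jars", SERDE_JAR)
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT * FROM my_json_table LIMIT 10").show()  # placeholder table name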

Related

Corrupt H2 database: failed to recover using the Recovery Tool

Today one of my H2 databases failed to connect and presented the following error message:
Unable to obtain connection from database (jdbc:h2:file:C:\Users\Username\.appfiles\db\appdb) for user 'sa': File corrupted while reading record: null. Possible solution: use the recovery tool [90030-200]
SQL State : 90030
Error Code : 90030
Message : File corrupted while reading record: null. Possible solution: use the recovery tool [90030-200]
As suggested, I tried to use the recovery tool as described in the documentation. The steps I executed were the following:
Go to your H2 data file directory
java -cp h2-1.4.200.jar org.h2.tools.Recover
Use SQL file generated by the recovery tool to recreate the database
The steps created two files, a .sql and a .txt file, but the SQL generated by the tool didn't contain any data or DDL from the database, just some aliases and a bunch of comments. The contents of both files are linked below, in case they help shed light on what went wrong during the process.
This is the .sql file output: https://pastebin.com/DFfwPemP
This is the .txt file output: https://pastebin.com/6zwCgqN3
Is there any step I'm doing wrong, or anything else I can try to recover this database? Any suggestion is welcome.
Run the generated file with
java -cp h2-1.4.200.jar org.h2.tools.RunScript -url jdbc:h2:[path to destination db file]/[db name] -user [user] -password [password] -script [text file/sql file]
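For example, with the database from the question (illustrative values only; Recover names its output script <dbName>.h2.sql by default, and the URL below targets a fresh database so the corrupted files are left untouched):
java -cp h2-1.4.200.jar org.h2.tools.RunScript -url "jdbc:h2:C:\Users\Username\.appfiles\db\appdb_recovered" -user sa -password <password> -script appdb.h2.sql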

Spark in Kubernetes container does not see local file

I have a trivially small Spark application written in Java that I am trying to run in a K8s cluster using spark-submit. I built an image with Spark binaries, my uber-JAR file with all necessary dependencies (in /opt/spark/jars/my.jar), and a config file (in /opt/spark/conf/some.json).
In my code, I start with
SparkSession session = SparkSession.builder()
        .appName("myapp")
        .config("spark.logConf", "true")
        .getOrCreate();
Path someFilePath = FileSystems.getDefault().getPath("/opt/spark/conf/some.json");
String someString = new String(Files.readAllBytes(someFilePath));
and get this exception at readAllBytes from the Spark driver:
java.nio.file.NoSuchFileException: /opt/spark/conf/some.json
If I run my Docker image manually I can definitely see the file /opt/spark/conf/some.json as I expect. My Spark job runs as root so file permissions should not be a problem.
I have been assuming that, since the same Docker image, with the file indeed present, will be used to start the driver (and executors, but I don't even get to that point), the file should be available to my application. Is that not so? Why wouldn't it see the file?
You seem to get this exception from one of your worker nodes, not from the container you inspected.
Make sure that you've specified every file the job needs via the --files option of spark-submit.
spark-submit --master yarn --deploy-mode cluster --files <local file dependencies> ...
https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
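Files shipped with --files land in each container's working directory and are usually resolved through SparkFiles rather than an absolute image path. A PySpark sketch of that pattern (in Java the equivalent call is org.apache.spark.SparkFiles.get("some.json")):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-distributed-file").getOrCreate()

# After `spark-submit --files /opt/spark/conf/some.json ...`, SparkFiles.get
# returns the local path of the shipped copy on the driver or executor.
local_path = SparkFiles.get("some.json")
with open(local_path) as f:
    some_string = f.read()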

java.io.IOException: Could not read footer for file FileStatus when trying to read parquet file from Spark cluster from IBM Cloud Object Storage

I have created a Spark cluster with 3 workers on Kubernetes, plus a JupyterHub deployment attached to it so I can run large queries.
My parquet files are stored into IBM Cloud Object Storage (COS) and when I run a simple code to read from COS, I'm getting the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/path/myfile.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} at parquet.hadoop.ParquetFileReader.readAllFootersInParallel
I have added all the required libraries to the jars directory under SPARK_HOME on the driver.
This is the code I'm using to connect:
# Initial Setup - Once
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# sc is normally provided by the PySpark notebook kernel; create it if it is not.
sc = SparkContext.getOrCreate(SparkConf())
spark_session = SparkSession(sc)

credentials_staging_parquet = {
    'bucket_dm': 'mybucket1',
    'bucket_eid': 'bucket2',
    'secret_key': 'XXXXXXXX',
    'iam_url': 'https://iam.ng.bluemix.net/oidc/token',
    'api_key': 'XXXXXXXX',
    'resource_instance_id': 'crn:v1:bluemix:public:cloud-object-storage:global:a/XXXXX:XXXXX::',
    'access_key': 'XXXXX',
    'url': 'https://s3-api.us-geo.objectstorage.softlayer.net'
}

# Stocator/COS settings to push into the underlying Hadoop configuration
conf = {
    'fs.cos.service.access.key': credentials_staging_parquet.get('access_key'),
    'fs.cos.service.endpoint': credentials_staging_parquet.get('url'),
    'fs.cos.service.secret.key': credentials_staging_parquet.get('secret_key'),
    'fs.cos.service.iam.endpoint': credentials_staging_parquet.get('iam_url'),
    'fs.cos.service.iam.service.id': credentials_staging_parquet.get('resource_instance_id'),
    'fs.stocator.scheme.list': 'cos',
    'fs.cos.impl': 'com.ibm.stocator.fs.ObjectStoreFileSystem',
    'fs.stocator.cos.impl': 'com.ibm.stocator.fs.cos.COSAPIClient',
    'fs.stocator.cos.scheme': 'cos',
    'fs.cos.client.execution.timeout': '18000000',
    'fs.stocator.glob.bracket.support': 'true'
}

hadoop_conf = sc._jsc.hadoopConfiguration()
for key in conf:
    hadoop_conf.set(key, conf.get(key))

parquet_path = 'store/MY_FILE/*'
cos_url = 'cos://{bucket}.service/{parquet_path}'.format(
    bucket=credentials_staging_parquet.get('bucket_eid'),
    parquet_path=parquet_path)
df2 = spark_session.read.parquet(cos_url)
I got a similar error and found this post while Googling. I then realized I had a file format issue: the file had been saved as Avro but I was reading it as ORC. So check that the format the file was saved in and the format you read it with are aligned.
Found the cause of my issue: the required libraries were not available on all workers in the cluster.
There are 2 ways to fix that:
Add the dependencies on the spark-submit command so they are distributed to the whole cluster; in this setup that means the kernel.json file of the JupyterHub PySpark kernel, located at /usr/local/share/jupyter/kernels/pyspark/kernel.json (assuming you created it). A sketch of doing this at session startup is shown after this answer.
OR
Add the dependencies to the jars directory under SPARK_HOME on each worker in the cluster and on the driver (if you didn't do so already).
I used the second approach: during my Docker image creation I added the libs, so when I start my cluster all containers already have the required libraries.
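For the first option, a minimal PySpark sketch of pulling the Stocator dependency when the session is built rather than baking it into the image (the Maven coordinate and version are assumptions; check which release matches your Spark build):

from pyspark.sql import SparkSession

# spark.jars.packages resolves the artifact from Maven Central and ships it to
# the driver and executors; the coordinate/version below are illustrative.
spark = (SparkSession.builder
         .appName("cos-parquet")
         .config("spark.jars.packages", "com.ibm.stocator:stocator:1.1.3")
         .getOrCreate())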
Try restarting your system or server; it will work after that.
I faced the same problem. It generally happens when you upgrade your Java version but the Spark libraries still point to the old one; rebooting the server/system resolves the problem.

H2: generate insert scripts / initialization script

I have a full H2 database with lots of data in it. I want to run integration tests against that data.
Question 1: Is it possible to generate *.sql insert files/scripts from a full H2 database?
I've tried SCRIPT TO 'fileName' as described here, but it generates only CREATE/ALTER TABLE/CONSTRAINT statements, i.e. the schema without the data.
If the answer to the first question is "impossible", then:
Question 2: Are *.sql insert files the only way to insert an initial dataset into an H2 DB for integration tests?
Question 1: Is it possible to generate *.sql insert files/scripts from a full H2 database?
I have just tested this with one of my H2 file databases, and the export contains both structure and data.
I tested with version 1.4.193 of H2.
Both ways of exporting work:
The SCRIPT command from the H2 console
The org.h2.tools.Script tool from the command line
1) I tested the org.h2.tools.Script tool first, as I had already used it.
Here is the minimal command to export structure and data:
java -cp <whereFoundYourH2Jar> org.h2.tools.Script -url <url> -user <user> -password <password>
Where:
<whereFoundYourH2Jar> is the classpath where the h2.jar lib is located (I used the one in my m2 repo).
<url> is the URL of your database
<user> is the user of the database
<password> is the password of the database
You can find more details in the official help of the org.h2.tools.Script tool:
Creates a SQL script file by extracting the schema and data of a database.
Usage: java org.h2.tools.Script <options>
Options are case sensitive. Supported options are:
[-help] or [-?] Print the list of options
[-url "<url>"] The database URL (jdbc:...)
[-user <user>] The user name (default: sa)
[-password <pwd>] The password
[-script <file>] The target script file name (default: backup.sql)
[-options ...] A list of options (only for embedded H2, see SCRIPT)
[-quiet] Do not print progress information
See also http://h2database.com/javadoc/org/h2/tools/Script.html
2) I then tested the SCRIPT command from the H2 console. It also works.
Nevertheless, the result of the SCRIPT command may be misleading.
Look at the official documentation:
If no 'TO fileName' clause is specified, the script is returned as a
result set. This command can be used to create a backup of the
database. For long term storage, it is more portable than copying the
database files.
If a 'TO fileName' clause is specified, then the whole script
(including insert statements) is written to this file, and a result
set without the insert statements is returned.
You used the SCRIPT TO 'fileName' command. In this case the whole script (including insert statements) is written to that file, and what you get back in the H2 console is everything except the insert statements.
For example, run SCRIPT TO 'D:\yourBackup.sql' (or a Unix-friendly path if that is what you use), then open the file: you will see that the SQL insert statements are present.
As specified in the documentation, if you want both the structure and the insert statements in the output shown by the H2 console, don't specify the TO argument.
Just type: SCRIPT.
Question 2: Are *.sql insert files the only way to insert an initial dataset into an H2 DB for integration tests?
As discussed at length :) you can also do it with a DbUnit dataset (one solution among others).

Connect to MySQL via the Spark JDBC connector

Hey, I have an EMR cluster with Spark 1.5.0 installed on it.
I'm trying to connect to one of our tables in RDS and pull some data.
I downloaded the latest jar file from the official MySQL site (http://dev.mysql.com/downloads/connector/j/).
I downloaded and untarred the file, put it in /home/hadoop/connectors, and added this path to the spark-defaults.conf file.
I managed to create a DataFrame with this connection:
df = (sqlContext.read.format('jdbc')
      .options(url="jdbc:mysql://dw-mysql-replica.gtforge.com:3306/dwh?user=<usr>&password=<pass>",
               dbtable='<table>')
      .load())
and managed to print the schema:
df.printSchema()
but when I try to materialize this DataFrame (i.e. df.take(1) or df.collect()), it throws the following error:
"java.sql.SQLException: No suitable driver found for jdbc:mysql://dw-mysql-replica.gtforge.com:3306...."
Thanks
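A commonly suggested remedy for this error (a sketch with placeholder values, not taken from this thread): ship the connector jar to the executors as well, e.g. via --jars on spark-submit, and name the driver class explicitly in the JDBC options, since an entry in spark-defaults.conf may only cover the driver.

# e.g. spark-submit --jars /home/hadoop/connectors/mysql-connector-java-<version>-bin.jar my_job.py
df = (sqlContext.read.format('jdbc')
      .options(url="jdbc:mysql://dw-mysql-replica.gtforge.com:3306/dwh",
               user="<usr>", password="<pass>",
               dbtable="<table>",
               driver="com.mysql.jdbc.Driver")  # make the driver class explicit
      .load())
df.take(1)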
