I have a Java program which runs a Sqoop operation via the Sqoop.runTool() method. Until now, I could successfully run this operation to get data from Oracle into HDFS.
But when I add the parameter --hive-import, it reports a warning in Hive's log file at /tmp/root/hive.log:
hive-site.xml not found on CLASSPATH
Meanwhile I noticed that, when run via the Java API and via the plain sqoop command, the console messages have different output:
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH/jars/hive-exec-1.1.0!hive-log4j.properties
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH/jars/hive-common-1.1.0!hive-log4j.properties
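For reference, the invocation from Java is roughly along these lines (connection details and table name are placeholders):
String[] sqoopArgs = new String[] {
    "import",
    "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",   // placeholder Oracle connection
    "--username", "user",
    "--password", "secret",
    "--table", "MY_TABLE",
    "--hive-import"                                        // adding this flag produces the warning above
};
int ret = org.apache.sqoop.Sqoop.runTool(sqoopArgs);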
Will be grateful for any help.
I am trying to start an Apache Livy 0.8.0 server on my Windows 10 machine for Spark 3.1.2 and Hadoop 3.2.1, taking help from here. I have successfully built Apache Livy using Maven, but I am not able to run the Livy server. When I run it I get the following error:
> starting C:/AmazonJDK/jdk1.8.0_332/bin/java -cp /d/ApacheLivy/incubator-livy-master/incubator-livy-master/server/target/jars/*:/d/ApacheLivy/incubator-livy-master/incubator-livy-master/conf:D:/Program_files/spark/conf:D:/ApacheHadoop/hadoop-3.2.1/etc/hadoop: org.apache.livy.server.LivyServer, logging to D:/ApacheLivy/incubator-livy-master/incubator-livy-master/logs/livy--server.out
ps: unknown option -- o
Try `ps --help' for more information.
failed to launch C:/AmazonJDK/jdk1.8.0_332/bin/java -cp /d/ApacheLivy/incubator-livy-master/incubator-livy-master/server/target/jars/*:/d/ApacheLivy/incubator-livy-master/incubator-livy-master/conf:D:/Program_files/spark/conf:D:/ApacheHadoop/hadoop-3.2.1/etc/hadoop: org.apache.livy.server.LivyServer:
Error: Could not find or load main class org.apache.livy.server.LivyServer
full log in D:/ApacheLivy/incubator-livy-master/incubator-livy-master/logs/livy--server.out
I am using Git Bash. If you need more information I will provide it.
The error was resolved when I used Windows Subsystem for Linux (WSL) instead of Git Bash; the launch script relies on Unix tooling (for example the ps option that fails above with ps: unknown option -- o) that Git Bash does not fully provide.
I'm usually running stuff from JUnit but I've also tried running from main and it makes no difference.
I have read nearly two dozen SO questions, blog posts, and articles and tried almost everything to get Spark to stop logging so much.
Things I've tried:
log4j.properties in resources folder (in src and test)
Using spark-submit to add a log4j.properties which failed with "error: missing application resources"
Logger.getLogger("com").setLevel(Level.WARN);
Logger.getLogger("org").setLevel(Level.WARN);
Logger.getLogger("akka").setLevel(Level.WARN);Logger.getRootLogger().setLevel(Level.WARN);spark.sparkContext().setLogLevel("WARN");
In another project I got the logging to be quiet with:
Logger.getLogger("org").setLevel(Level.WARN);
Logger.getLogger("akka").setLevel(Level.WARN);
but it is not working here.
How I'm creating my SparkSession:
SparkSession spark = SparkSession
.builder()
.appName("RS-LDA")
.master("local")
.getOrCreate();
Let me know if you'd like to see more of my code.
Thanks
I'm using IntelliJ and Spark, and this works for me:
Logger.getRootLogger.setLevel(Level.ERROR)
You could change the Spark log configuration too.
$ cd SPARK_HOME/conf
$ gedit log4j.properties.template
# find these lines in the file
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
and change it to ERROR
log4j.rootCategory=ERROR, console
In this file you have other options to change too:
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
.....
And finally rename the log4j.properties.template file
$ mv log4j.properties.template log4j.properties
You can follow this link for further configuration:
Logging in Spark with Log4j
or this one too:
Logging in Spark with Log4j. How to customize the driver and executors for YARN cluster mode.
It might be an old question, but I just ran into the same problem.
To fix it, what I did was:
Adding private static Logger log = LoggerFactory.getLogger(Spark.class);
as a field of the class.
Adding spark.sparkContext().setLogLevel("WARN"); after creating the Spark session.
Step 2 will work only after step 1.
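Putting the two steps together, a minimal sketch (the class name Spark is just the one from step 1; app name and master are taken from the question):
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Spark {

    // Step 1: an SLF4J logger declared as a field of the class
    private static Logger log = LoggerFactory.getLogger(Spark.class);

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("RS-LDA")
                .master("local")
                .getOrCreate();

        // Step 2: only takes effect once step 1 is in place
        spark.sparkContext().setLogLevel("WARN");
    }
}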
I'm working on a couple of Kafka connectors and I don't see any errors in their creation/deployment in the console output; however, I am not getting the result that I'm looking for (no results whatsoever, for that matter, desired or otherwise). I made these connectors based on Kafka's example FileStream connectors, so my debug technique was based on the use of the SLF4J Logger that is used in the example. I've searched for the log messages that I thought would be produced in the console output, but to no avail. Am I looking in the wrong place for these messages? Or perhaps is there a better way of going about debugging these connectors?
Example uses of the SLF4J Logger that I referenced for my implementation:
Kafka FileStreamSinkTask
Kafka FileStreamSourceTask
I will try to reply to your question in a broad way. A simple way to do Connector development could be as follows:
Structure and build your connector source code by looking at one of the many Kafka Connectors available publicly (you'll find an extensive list available here: https://www.confluent.io/product/connectors/ )
Download the latest Confluent Open Source edition (>= 3.3.0) from https://www.confluent.io/download/
Make your connector package available to Kafka Connect in one of the following ways:
Store all your connector jar files (the connector jar plus dependency jars, excluding the Connect API jars) in a location on your filesystem and enable plugin isolation by adding that location to the plugin.path property in the Connect worker properties. For instance, if your connector jars are stored in /opt/connectors/my-first-connector, you will set plugin.path=/opt/connectors in your worker's properties (see below).
Store all your connector jar files in a folder under ${CONFLUENT_HOME}/share/java. For example: ${CONFLUENT_HOME}/share/java/kafka-connect-my-first-connector. (Needs to start with kafka-connect- prefix to be picked up by the startup scripts). $CONFLUENT_HOME is where you've installed Confluent Platform.
Optionally, increase your logging by changing the log level for Connect in ${CONFLUENT_HOME}/etc/kafka/connect-log4j.properties to DEBUG or even TRACE.
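For example, the root logger line in that file changes roughly like this (the exact appender name in your copy may differ):
log4j.rootLogger=DEBUG, stdout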
Use Confluent CLI to start all the services, including Kafka Connect. Details here: http://docs.confluent.io/current/connect/quickstart.html
Briefly: confluent start
Note: The Connect worker's properties file currently loaded by the CLI is ${CONFLUENT_HOME}/etc/schema-registry/connect-avro-distributed.properties. That's the file you should edit if you choose to enable classloading isolation but also if you need to change your Connect worker's properties.
Once you have Connect worker running, start your connector by running:
confluent load <connector_name> -d <connector_config.properties>
or
confluent load <connector_name> -d <connector_config.json>
The connector configuration can be either in java properties or JSON format.
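For instance, a minimal properties file for the stock FileStreamSource connector (the one the question's connectors are modeled on) looks roughly like this; the file and topic values are placeholders:
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/input.txt
topic=connect-test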
Run confluent log connect to open the Connect worker's log file, or navigate directly to where your logs and data are stored by running:
cd "$( confluent current )"
Note: you can change where your logs and data are stored during a session of the Confluent CLI by setting the environment variable CONFLUENT_CURRENT appropriately. E.g. given that /opt/confluent exists and is where you want to store your data, run:
export CONFLUENT_CURRENT=/opt/confluent
confluent current
Finally, to interactively debug your connector, a possible way is to apply the following before starting Connect with the Confluent CLI:
confluent stop connect
export CONNECT_DEBUG=y; export DEBUG_SUSPEND_FLAG=y;
confluent start connect
and then connect with your debugger (for instance, remotely to the Connect worker, default port: 5005). To stop running Connect in debug mode, just run unset CONNECT_DEBUG; unset DEBUG_SUSPEND_FLAG; when you are done.
I hope the above will make your connector development easier and ... more fun!
I love the accepted answer. One thing - the environment variables didn't work for me... I'm using Confluent Community Edition 5.3.1.
Here's what I did that worked:
I installed the Confluent CLI from here:
https://docs.confluent.io/current/cli/installing.html#tarball-installation
I ran Confluent using the command confluent local start.
I got the Connect app details using the command ps -ef | grep connect.
I copied the resulting command to an editor and added the arg (right after java):
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Then I stopped Connect using the command confluent local stop connect.
Then I ran the connect command with the arg.
Brief intermission: VS Code development is led by Erich Gamma, of Gang of Four fame, who also worked on Eclipse. VS Code is becoming a first-class Java IDE; see https://en.wikipedia.org/wiki/Erich_Gamma.
Intermission over.
Next I launched VS Code and opened the Debezium Oracle connector folder (cloned from here: https://github.com/debezium/debezium-incubator).
Then I chose Debug - Open Configurations and entered a remote-attach debugging configuration.
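For reference, a remote-attach entry along these lines in .vscode/launch.json works (this is a sketch; the name is arbitrary and 5005 matches the agentlib arg added above):
{
    "type": "java",
    "name": "Attach to Kafka Connect",
    "request": "attach",
    "hostName": "localhost",
    "port": 5005
}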
Then run the debugger and it will hit your breakpoints!
The connect command should look something like this:
/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home/bin/java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/folders/yn/4k6t1qzn5kg3zwgbnf9qq_v40000gn/T/confluent.CYZjfRLm/connect/logs -Dlog4j.configuration=file:/Users/myuserid/confluent-5.3.1/bin/../etc/kafka/connect-log4j.properties -cp /Users/myuserid/confluent-5.3.1/share/java/kafka/*:/Users/myuserid/confluent-5.3.1/share/java/confluent-common/*:/Users/myuserid/confluent-5.3.1/share/java/kafka-serde-tools/*:/Users/myuserid/confluent-5.3.1/bin/../share/java/kafka/*:/Users/myuserid/confluent-5.3.1/bin/../support-metrics-client/build/dependant-libs-2.12.8/*:/Users/myuserid/confluent-5.3.1/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* org.apache.kafka.connect.cli.ConnectDistributed /var/folders/yn/4k6t1qzn5kg3zwgbnf9qq_v40000gn/T/confluent.CYZjfRLm/connect/connect.properties
The connector module is executed by the Kafka Connect framework. For debugging, we can use standalone mode and configure the IDE to use the ConnectStandalone main function as the entry point.
Create a debug configuration as follows. Remember to tick "Include dependencies with 'Provided' scope" if it is a Maven project.
The connector properties file needs to specify the connector class name via "connector.class" for debugging.
The worker properties file can be copied from the Kafka folder, e.g. /usr/local/etc/kafka/connect-standalone.properties.
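Concretely, the run/debug configuration boils down to something like this (paths are illustrative):
Main class: org.apache.kafka.connect.cli.ConnectStandalone
Program arguments: /usr/local/etc/kafka/connect-standalone.properties /path/to/my-connector.properties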
I have several jar files listed in my hive-site.xml. I have a table that uses a special FileInputFormat.
When I run hive, I can do something like describe my-table. It works fine.
When I run HiveServer2 and connect from Beeline, I can see the table, but when I do describe my-table I get:
Error: Error while processing statement: FAILED: RuntimeException java.lang.ClassNotFoundException: package.file.input.format.class.name (state=42000,code=40000)
What do I need to do to make sure HiveServer2 has access to the jar files?
Have you done the three steps from user-defined-function with HiveServer2?
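In case it helps, one common way to make such jars visible to HiveServer2 is an auxiliary-jars entry in hive-site.xml, followed by a restart of HiveServer2 (the path below is a placeholder):
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///path/to/my-custom-inputformat.jar</value>
</property>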
I downloaded sqljdbc4.jar. I'm invoking Sqoop like so, from the folder where the jar is stored:
sqoop list-tables --driver com.microsoft.jdbc.sqlserver.SQLServerDriver --connect jdbc:sqlserver://localhost:1433;user=me;password=myPassword; -libjars=./sqljdbc4.jar
I'm getting the following warning & error:
13/10/25 18:38:13 WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.
13/10/25 18:38:13 INFO manager.SqlManager: Using default fetchSize of 1000
13/10/25 18:38:13 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.microsoft.jdbc.sqlserver.SQLServerDriver
java.lang.RuntimeException: Could not load db driver class: com.microsoft.jdbc.sqlserver.SQLServerDriver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:727)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.listTables(SqlManager.java:418)
at org.apache.sqoop.tool.ListTablesTool.run(ListTablesTool.java:49)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
UPDATE
I changed the command line to reflect the comments below, but I get the same error:
sqoop list-databases -libjars=<ABSOLUTE_PATH>/jars/sqljdbc4.jar --connect jdbc:sqlserver://localhost:1433;user=me;password=password
13/10/28 17:00:33 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.microsoft.sqlserver.jdbc.SQLServerDriver
java.lang.RuntimeException: Could not load db driver class: com.microsoft.sqlserver.jdbc.SQLServerDriver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:727)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.CatalogQueryManager.listDatabases(CatalogQueryManager.java:57)
at org.apache.sqoop.tool.ListDatabasesTool.run(ListDatabasesTool.java:49)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
When I look at the listing of sqljdbc4.jar, I do see the class at that path... Is it possible that the libjars option isn't doing what I think it is supposed to do?
In the vast majority of cases, using the parameter --driver is not required and will more often than not lead to undesirable behaviour. I would strongly recommend dropping this argument entirely from your command line. Check out the Connectors vs Drivers blog post for more details.
In addition, you are specifying a nonexistent JDBC driver class. The correct one is:
com.microsoft.sqlserver.jdbc.SQLServerDriver
You can see it in the official docs, whereas you are specifying:
com.microsoft.jdbc.sqlserver.SQLServerDriver
Notice the different order of the jdbc and sqlserver packages. This is one of the reasons why it's recommended not to use the --driver option at all.
You need to put sqljdbc4.jar in $SQOOP_HOME/lib, and also add sqoop-1.4.4.jar (or whatever version you are using), along with sqljdbc4.jar, to $HADOOP_HOME/lib.
I'm using Hadoop-2.2.0, so I put it inside the $HADOOP_HOME/share/hadoop/common/lib directory, and use the following command to do the import:
export HCAT_HOME=/home/Kuntal/BIG_DATA/hive-0.12.0/hcatalog
(Sometimes Hive's HCatalog needs to be exported or set.)
./sqoop-import --connect "jdbc:sqlserver://IP\INSTANCE;port=1433;username=USERNAME;password=PASSWORD;database=DATABASE_NAME" --table TABLE_NAME --target-dir hdfs://localhost:50315/sqoop --m 1
Sometimes you have to specify the port; otherwise the default works. Hope you find it useful.
According to this sqoop documentation, generic options like -libjars must come before tool-specific options:
Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
...
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
I recently came across this same problem. Even though the documentation says that -libjars will pick up additional jar files, the problem is, I believe, propagated from the hadoop jar command-line handling: -libjars is not a reliable option for setting additional jar file paths.
Instead, use the HADOOP_CLASSPATH variable to set up additional jar files.
In my case, I had multiple different versions of the driver JAR, and using -libjars was not correctly picking up the file for me.
To resolve this, I specified
export HADOOP_CLASSPATH=$SQOOP_HOME/<path_to_driver>.jar
This makes sure that the correct JAR file gets loaded.