I'm usually running stuff from JUnit but I've also tried running from main and it makes no difference.
I have read nearly two dozen SO questions, blog posts, and articles and tried almost everything to get Spark to stop logging so much.
Things I've tried:
log4j.properties in resources folder (in src and test)
Using spark-submit to add a log4j.properties which failed with "error: missing application resources"
Logger.getLogger("com").setLevel(Level.WARN);
Logger.getLogger("org").setLevel(Level.WARN);
Logger.getLogger("akka").setLevel(Level.WARN);Logger.getRootLogger().setLevel(Level.WARN);spark.sparkContext().setLogLevel("WARN");
In another project I got the logging to be quiet with:
Logger.getLogger("org").setLevel(Level.WARN);
Logger.getLogger("akka").setLevel(Level.WARN);
but it is not working here.
How I'm creating my SparkSession:
SparkSession spark = SparkSession
.builder()
.appName("RS-LDA")
.master("local")
.getOrCreate();
Let me know if you'd like to see more of my code.
Thanks
I'm using IntelliJ and Spark, and this works for me:
Logger.getRootLogger.setLevel(Level.ERROR)
You could also change Spark's log configuration.
$ cd SPARK_HOME/conf
$ gedit log4j.properties.template
# find these lines in the file
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
and change to ERROR
log4j.rootCategory=ERROR, console
In this file you have other options to change too:
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
.....
And finally rename the log4j.properties.template file
$ mv log4j.properties.template log4j.properties
You can follow this link for further configuration:
Logging in Spark with Log4j
or this one too:
Logging in Spark with Log4j. How to customize the driver and executors for YARN cluster mode.
It might be an old question, but I just ran into the same problem.
To fix it, what I did was:
1. Add private static Logger log = LoggerFactory.getLogger(Spark.class); as a field of the class.
2. Call spark.sparkContext().setLogLevel("WARN"); after creating the Spark session.
Step 2 will work only after step 1.
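Putting the two steps together with the session setup from the question, a minimal sketch (the class name Spark comes from step 1; this is illustrative, not tested here):

import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Spark {
    // Step 1: declare an SLF4J logger as a field of the class.
    private static Logger log = LoggerFactory.getLogger(Spark.class);

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("RS-LDA")
                .master("local")
                .getOrCreate();

        // Step 2: lower Spark's log level once the session exists.
        spark.sparkContext().setLogLevel("WARN");

        log.warn("Only WARN and above should appear from here on");
    }
}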
I have created a Spark Cluster with 3 workers on Kubernetes and a JupyterHub deployment to attach to it so I can run huge queries.
My parquet files are stored into IBM Cloud Object Storage (COS) and when I run a simple code to read from COS, I'm getting the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/path/myfile.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} at parquet.hadoop.ParquetFileReader.readAllFootersInParallel
I have added all the required libraries to the /jars directory under the SPARK_HOME directory on the driver.
This is the code I'm using to connect:
# Initial Setup - Once
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate(SparkConf())  # create (or reuse) the SparkContext
spark_session = SparkSession(sc)
credentials_staging_parquet = {
'bucket_dm':'mybucket1',
'bucket_eid':'bucket2',
'secret_key':'XXXXXXXX',
'iam_url':'https://iam.ng.bluemix.net/oidc/token',
'api_key':'XXXXXXXX',
'resource_instance_id':'crn:v1:bluemix:public:cloud-object-storage:global:a/XXXXX:XXXXX::',
'access_key':'XXXXX',
'url':'https://s3-api.us-geo.objectstorage.softlayer.net'
}
conf = {
'fs.cos.service.access.key': credentials_staging_parquet.get('access_key'),
'fs.cos.service.endpoint': credentials_staging_parquet.get('url'),
'fs.cos.service.secret.key': credentials_staging_parquet.get('secret_key'),
'fs.cos.service.iam.endpoint': credentials_staging_parquet.get('iam_url'),
'fs.cos.service.iam.service.id': credentials_staging_parquet.get('resource_instance_id'),
'fs.stocator.scheme.list': 'cos',
'fs.cos.impl': 'com.ibm.stocator.fs.ObjectStoreFileSystem',
'fs.stocator.cos.impl': 'com.ibm.stocator.fs.cos.COSAPIClient',
'fs.stocator.cos.scheme': 'cos',
'fs.cos.client.execution.timeout': '18000000',
'fs.stocator.glob.bracket.support': 'true'
}
hadoop_conf = sc._jsc.hadoopConfiguration()
for key in conf:
    hadoop_conf.set(key, conf.get(key))
parquet_path = 'store/MY_FILE/*'
cos_url = 'cos://{bucket}.service/{parquet_path}'.format(bucket=credentials_staging_parquet.get('bucket_eid'), parquet_path=parquet_path)
df2 = spark_session.read.parquet(cos_url)
I got a similar error and, after Googling, found this post. Then I realized I had a file format issue: the saved file was Avro but the reader expected ORC. So check that your saved file format and your reader format match.
Found the cause of my issue: the required libraries were not available to all workers in the cluster.
There are 2 ways to fix that:
Make sure you add the dependencies to the spark-submit command so they are distributed to the whole cluster; in this case it should be done in the kernel.json file on JupyterHub, located at /usr/local/share/jupyter/kernels/pyspark/kernel.json (assuming you created it).
OR
Add the dependencies to the /jars directory under SPARK_HOME on each worker in the cluster and on the driver (if you didn't do so already).
I used the second approach: during my Docker image creation I added the libs, so when I start my cluster, all containers already have the required libraries.
Try restarting your system or server and it will work after that.
I faced the same problem. It generally happens when you upgrade your Java version but the Spark libs still point to the old Java version. Rebooting your server/system resolves the problem.
I'm working on a couple of Kafka connectors and I don't see any errors in their creation/deployment in the console output; however, I am not getting the result that I'm looking for (no results whatsoever for that matter, desired or otherwise). I made these connectors based on Kafka's example FileStream connectors, so my debugging technique was based on the use of the SLF4J Logger that is used in the example. I've searched for the log messages that I thought would be produced in the console output, but to no avail. Am I looking in the wrong place for these messages? Or perhaps is there a better way of going about debugging these connectors?
Example uses of the SLF4J Logger that I referenced for my implementation:
Kafka FileStreamSinkTask
Kafka FileStreamSourceTask
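For reference, the pattern those example tasks use is plain SLF4J; a minimal sketch of how it might look in a custom sink task (the class name MySinkTask and the log messages here are illustrative, not taken from the Kafka examples):

import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySinkTask extends SinkTask {
    private static final Logger log = LoggerFactory.getLogger(MySinkTask.class);

    @Override
    public String version() {
        return "0.1.0";
    }

    @Override
    public void start(Map<String, String> props) {
        log.info("Starting MySinkTask with config {}", props);
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Only visible if the Connect worker's log4j level for this logger allows it.
        log.debug("Received {} records", records.size());
    }

    @Override
    public void stop() {
        log.info("Stopping MySinkTask");
    }
}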
I will try to reply to your question in a broad way. A simple way to do Connector development could be as follows:
Structure and build your connector source code by looking at one of the many Kafka connectors available publicly (you'll find an extensive list here: https://www.confluent.io/product/connectors/ ); a bare-bones skeleton is also sketched after these steps.
Download the latest Confluent Open Source edition (>= 3.3.0) from https://www.confluent.io/download/
Make your connector package available to Kafka Connect in one of the following ways:
Store all your connector jar files (connector jar plus dependency jars excluding Connect API jars) to a location in your filesystem and enable plugin isolation by adding this location to the
plugin.path property in the Connect worker properties. For instance, if your connector jars are stored in /opt/connectors/my-first-connector, you will set plugin.path=/opt/connectors in your worker's properties (see below).
Store all your connector jar files in a folder under ${CONFLUENT_HOME}/share/java. For example: ${CONFLUENT_HOME}/share/java/kafka-connect-my-first-connector. (Needs to start with kafka-connect- prefix to be picked up by the startup scripts). $CONFLUENT_HOME is where you've installed Confluent Platform.
Optionally, increase your logging by changing the log level for Connect in ${CONFLUENT_HOME}/etc/kafka/connect-log4j.properties to DEBUG or even TRACE.
Use Confluent CLI to start all the services, including Kafka Connect. Details here: http://docs.confluent.io/current/connect/quickstart.html
Briefly: confluent start
Note: The Connect worker's properties file currently loaded by the CLI is ${CONFLUENT_HOME}/etc/schema-registry/connect-avro-distributed.properties. That's the file you should edit if you choose to enable classloading isolation but also if you need to change your Connect worker's properties.
Once you have Connect worker running, start your connector by running:
confluent load <connector_name> -d <connector_config.properties>
or
confluent load <connector_name> -d <connector_config.json>
The connector configuration can be either in java properties or JSON format.
Run
confluent log connect to open the Connect worker's log file, or navigate directly to where your logs and data are stored by running
cd "$( confluent current )"
Note: change where your logs and data are stored during a session of the Confluent CLI by setting the environment variable CONFLUENT_CURRENT appropriately. E.g. given that /opt/confluent exists and is where you want to store your data, run:
export CONFLUENT_CURRENT=/opt/confluent
confluent current
Finally, to interactively debug your connector, a possible way is to apply the following before starting Connect with the Confluent CLI:
confluent stop connect
export CONNECT_DEBUG=y; export DEBUG_SUSPEND_FLAG=y;
confluent start connect
and then connect with your debugger (for instance, remotely to the Connect worker; default port: 5005). To stop running Connect in debug mode, just run unset CONNECT_DEBUG; unset DEBUG_SUSPEND_FLAG; when you are done.
I hope the above will make your connector development easier and ... more fun!
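To make the packaging step above concrete, here is a bare-bones, hypothetical source connector skeleton (all names, such as MyFirstSourceConnector, are made up; a real connector would add configuration validation and actual record production):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class MyFirstSourceConnector extends SourceConnector {
    private Map<String, String> configProps;

    @Override
    public String version() { return "0.1.0"; }

    @Override
    public void start(Map<String, String> props) { configProps = props; }

    @Override
    public Class<? extends Task> taskClass() { return MyFirstSourceTask.class; }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Single-task connector: hand the same config to the one task.
        List<Map<String, String>> configs = new ArrayList<>();
        configs.add(configProps);
        return configs;
    }

    @Override
    public void stop() { }

    @Override
    public ConfigDef config() { return new ConfigDef(); }

    // Could also be a top-level class; Connect instantiates it via its no-arg constructor.
    public static class MyFirstSourceTask extends SourceTask {
        @Override
        public String version() { return "0.1.0"; }

        @Override
        public void start(Map<String, String> props) { }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            // A real task would read from the external system and return records here.
            Thread.sleep(1000);
            return Collections.emptyList();
        }

        @Override
        public void stop() { }
    }
}

Bundle this class and its dependency jars (excluding the Connect API jars) under, for example, /opt/connectors/my-first-connector, matching the plugin.path layout described above.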
I love the accepted answer. One thing - the environment variables didn't work for me... I'm using Confluent Community Edition 5.3.1...
Here's what I did that worked...
I installed the Confluent CLI from here:
https://docs.confluent.io/current/cli/installing.html#tarball-installation
I ran Confluent using the command confluent local start
I got the Connect app details using the command ps -ef | grep connect
I copied the resulting command to an editor and added the arg (right after java):
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Then I stopped Connect using the command confluent local stop connect
Then I ran the connect command with the arg
brief intermission ---
VS Code development is led by Erich Gamma - of Gang of Four fame, who also led the development of Eclipse. VS Code is becoming a first-class Java IDE; see https://en.wikipedia.org/wiki/Erich_Gamma
intermission over ---
Next I launched VS Code and opened the Debezium Oracle connector folder (cloned from here: https://github.com/debezium/debezium-incubator)
Then I chose Debug - Open Configurations
and entered the highlighted debugging configuration
and then ran the debugger - it will hit your breakpoints!
The connect command should look something like this:
/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home/bin/java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/folders/yn/4k6t1qzn5kg3zwgbnf9qq_v40000gn/T/confluent.CYZjfRLm/connect/logs -Dlog4j.configuration=file:/Users/myuserid/confluent-5.3.1/bin/../etc/kafka/connect-log4j.properties -cp /Users/myuserid/confluent-5.3.1/share/java/kafka/*:/Users/myuserid/confluent-5.3.1/share/java/confluent-common/*:/Users/myuserid/confluent-5.3.1/share/java/kafka-serde-tools/*:/Users/myuserid/confluent-5.3.1/bin/../share/java/kafka/*:/Users/myuserid/confluent-5.3.1/bin/../support-metrics-client/build/dependant-libs-2.12.8/*:/Users/myuserid/confluent-5.3.1/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* org.apache.kafka.connect.cli.ConnectDistributed /var/folders/yn/4k6t1qzn5kg3zwgbnf9qq_v40000gn/T/confluent.CYZjfRLm/connect/connect.properties
The connector module is executed by the Kafka Connect framework. For debugging, we can use standalone mode and configure the IDE to use the ConnectStandalone main function as the entry point.
Create a debug configuration as follows. Remember to tick "Include dependencies with 'Provided' scope" if it is a Maven project.
The connector properties file needs to specify the connector class name via "connector.class" for debugging.
The worker properties file can be copied from the Kafka folder, e.g. /usr/local/etc/kafka/connect-standalone.properties
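If your IDE debug configuration needs a plain main-class entry point, a minimal sketch is a tiny wrapper that delegates to ConnectStandalone (assuming the Connect runtime is on the classpath; the connector properties file name below is a placeholder):

import org.apache.kafka.connect.cli.ConnectStandalone;

public class DebugConnectStandalone {
    public static void main(String[] args) throws Exception {
        // Same arguments the connect-standalone script passes:
        // first the worker properties file, then one or more connector properties files.
        ConnectStandalone.main(new String[] {
                "/usr/local/etc/kafka/connect-standalone.properties",
                "my-connector.properties"   // must contain connector.class=...
        });
    }
}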
I have a Java program which runs a Sqoop operation via the Sqoop.runTool() method. So far I can successfully run this operation to get data from Oracle to HDFS.
But when I add the parameter --hive-import, it reports a warning in Hive's log file in /tmp/root/hive.log:
hive-site.xml not found on CLASSPATH
Meanwhile I noticed that, when run via the Java API versus the plain sqoop command, the console messages have different output:
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH/jars/hive-exec-1.1.0!hive-log4j.properties
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH/jars/hive-common-1.1.0!hive-log4j.properties
Will be grateful for any help.
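For context, a minimal sketch of the kind of invocation described above (the connection string, credentials, and table name are placeholders, not taken from the question):

import org.apache.sqoop.Sqoop;

public class SqoopHiveImport {
    public static void main(String[] args) {
        String[] sqoopArgs = new String[] {
                "import",
                "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // placeholder
                "--username", "scott",                                // placeholder
                "--password", "tiger",                                // placeholder
                "--table", "MY_TABLE",                                // placeholder
                "--hive-import"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.out.println("Sqoop finished with exit code " + exitCode);
    }
}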
I am executing MapReduce code by calling the Driver class in an Oozie java action. The MapReduce job runs successfully and I get the output as expected. However, the log statements in my driver class are not shown in the Oozie job logs. I am using log4j for logging in my driver class.
Do I need to make some configuration changes to see the logs? Snippet of my workflow.xml:
<action name="MyAppDriver">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/home/hadoop/work/surjan/outpath/20160430" />
</prepare>
<main-class>com.surjan.driver.MyAppMainDriver</main-class>
<arg>/home/hadoop/work/surjan/PoC/wf-app-dir/MyApp.xml</arg>
<job-xml>/home/hadoop/work/surjan/PoC/wf-app-dir/AppSegmenter.xml</job-xml>
</java>
<ok to="sendEmailSuccess"/>
<error to="sendEmailKill"/>
</action>
The logs go into Yarn.
In my case I have a custom java action. If you look in the Yarn UI you have to dig into the mapper task that the java action runs in. So in my case the oozie wf item was 0070083-200420161025476-oozie-xxx-W, and oozie job -log ${wf_id} shows that the java action 0070083-200420161025476-oozie-xxx-W#java1 failed with a Java exception, but I cannot see any context for it. Looking at the Oozie web UI, only the "Job Error Log" is populated, matching what is shown on the command line; the actual logging isn't shown. The oozie job -info ${wf_id} status shows failed:
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID Status Ext ID Ext Status Err Code
------------------------------------------------------------------------------------------------------------------------------------
0070083-200420161025476-oozie-xxx-W#:start: OK - OK -
------------------------------------------------------------------------------------------------------------------------------------
0070083-200420161025476-oozie-xxx-W#java1 ERROR job_1587370152078_1090 FAILED/KILLED JA018
------------------------------------------------------------------------------------------------------------------------------------
0070083-200420161025476-oozie-xxx-W#fail OK - OK E0729
------------------------------------------------------------------------------------------------------------------------------------
You can search for the actual YARN application in the YARN Resource Manager web UI (not the "yarn logs" console, which shows YARN's own logs, not the logs of what it is hosting). You can easily grep for the correct id on the command line by looking for the oozie wf job id:
user#host:~/apidemo$ yarn application --list -appStates FINISHED | grep 0070083-200420161025476
20/04/22 20:42:12 INFO client.AHSProxy: Connecting to Application History server at your.host.com/130.178.58.221:10200
20/04/22 20:42:12 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
20/04/22 20:42:12 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
application_1587370152078_1090 oozie:launcher:T=java:W=java-whatever-sql-task-wf:A=java1:ID=0070083-200420161025476-oozie-xxx-W MAPREDUCE kerberos-id default FINISHED SUCCEEDED 100% https://your.host.com:8090/jobhistory/job/job_1587370152078_1090
user#host:~/apidemo$
Note that oozie says things failed, yet the YARN application state is "FINISHED" and its final status is "SUCCEEDED", which seems strange.
Helpfully, that command-line output also shows the URL to the job history. That opens the web page for the parent application that ran your Java. If you click on the little logs link on that page you see some logs. If you look closely, the page says it ran 1 operation of task type "map". If you click on the link in that row it takes you to the actual task, which in my case is task_1587370152078_1090_m_000000. You have to click into that to see the first attempt, which is attempt_1587370152078_1090_m_000000_0; then on the right-hand side you have a tiny logs link which shows some more specific logging.
You can also ask yarn for the logs once you know the application id:
yarn logs -applicationId application_1587370152078_1090
That showed me very detailed logs, including the custom Java logging that I couldn't easily see in the console, and there I could see what was really going on.
Note that if you are writing custom code you want to let yarn set the log4j properties file rather than supply your own version so that the yarn tools can find your logs. The code will be run with a flag:
-Dlog4j.configuration=container-log4j.properties
The detailed logs show all the jars that are added to the classpath. You should ensure that your custom code uses the same jars and log4j version.
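As an illustration, a minimal sketch of a driver that just obtains a log4j logger and leaves appender setup to the container-provided configuration (the class name is taken from the workflow above; the job-submission code is omitted):

import org.apache.log4j.Logger;

public class MyAppMainDriver {
    // No programmatic appender setup here: the container's
    // -Dlog4j.configuration=container-log4j.properties decides where output goes,
    // so "yarn logs -applicationId <appId>" can collect it.
    private static final Logger LOG = Logger.getLogger(MyAppMainDriver.class);

    public static void main(String[] args) throws Exception {
        LOG.info("Driver started with " + args.length + " argument(s)");
        // ... build and submit the MapReduce job as usual ...
    }
}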
I'm using Aspose.Words (Java) version 13.8.0.
I'm not able to forward the log output to Tomcat's (7) console. At least I think so:
com.aspose.words.Document word = new com.aspose.words.Document(content);
word.getMailMerge().setUseNonMergeFields(true);
org.w3c.dom.Document workObjectXml = createXml(root, "root", "MM.dd.yyyy");
word.getMailMerge().executeWithRegions(new XmlMailMergeDataSet(workObjectXml));
does not produce any log output with the following log4j.properties:
# Comment this line and uncomment the following to allow log writing to a local file
log4j.rootLogger=INFO, A
# log4j.rootLogger=INFO, A, local.file
log4j.appender.A=org.apache.log4j.ConsoleAppender
log4j.appender.A.layout=org.apache.log4j.PatternLayout
log4j.appender.A.layout.ConversionPattern=%d{ISO8601} %-5p %-85.85c - %m%n
## Project
log4j.logger.com.aspose.words=DEBUG
I found a similar issue for aspose.pdf here: http://www.aspose.com/community/forums/thread/495783/log4j-logging-package-issue-in-aspose.pdf.aspx
But according to the post, this was fixed in aspose.pdf before the release date of my library, so my assumption is that the issue I'm facing is not the same one, just in a different library.
Seems like a log4j configuration issue to me.
Try adding the following line at any initial stage of your program and see if it works.
BasicConfigurator.configure();
I did not test it in a web/Tomcat app, but I had the same issue in a console application: the log messages were written to the log file, but not to the console output. It worked once I called the configure() method.
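For illustration, a minimal sketch of that call at startup (the surrounding class and the explicit DEBUG level for com.aspose.words are assumptions, not part of the answer):

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class MailMergeApp {
    public static void main(String[] args) {
        // Attach a default ConsoleAppender to the root logger before anything logs.
        BasicConfigurator.configure();
        // Optional: keep the DEBUG level the question's log4j.properties asked for.
        Logger.getLogger("com.aspose.words").setLevel(Level.DEBUG);

        Logger.getLogger(MailMergeApp.class).info("Console logging configured");
        // ... create the com.aspose.words.Document and run the mail merge here ...
    }
}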