Calling a custom Java class from Python using JPype - java

Getting a class-not-found exception while trying to call a Java class from Python using JPype. Below are the version and path details:
JPype version: JPype1-py3
Python: 3.6
Java: 1.8.0_171
Java file path: /home/neha/Downloads/fontAttributes/PDFFontExtractor.java
Python file path: /home/neha/Downloads/call_java.py
Below is the Python code (call_java.py):
import jpype
from jpype import *
cpath="-Djava.class.path=%s" % ("/home/neha/Downloads")
startJVM(getDefaultJVMPath(), "-ea",cpath)
Test = JClass('fontAttributes.PDFFontExtractor')
Test.getFontAttributes()
java.lang.System.out.println(str)
shutdownJVM()
Output:
/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so
Traceback (most recent call last):
File "call_java.py", line 14, in <module>
Test = JClass('fontAttributes.PDFFontExtractor')
File "/usr/local/lib/python3.6/dist-packages/jpype/_jclass.py", line 55, in JClass
raise _RUNTIMEEXCEPTION.PYEXC("Class %s not found" % name)
jpype._jexception.ExceptionPyRaisable: java.lang.Exception: Class fontAttributes.PDFFontExtractor not found
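For anyone hitting the same error: a likely cause (an assumption, since the question does not say whether the class was compiled) is that only the .java source exists at that path. The JVM loads compiled .class files from the classpath, not .java sources, and the classpath entry must be the package root rather than the package directory itself. A minimal sketch, assuming PDFFontExtractor.java declares package fontAttributes; and exposes a static getFontAttributes() method:
# First compile the source (in a shell) so that
# /home/neha/Downloads/fontAttributes/PDFFontExtractor.class exists:
#   javac /home/neha/Downloads/fontAttributes/PDFFontExtractor.java
from jpype import startJVM, shutdownJVM, getDefaultJVMPath, JClass

# The classpath must point at the directory that *contains* fontAttributes/,
# i.e. the package root, which matches the cpath used in the question.
cpath = "-Djava.class.path=%s" % "/home/neha/Downloads"
startJVM(getDefaultJVMPath(), "-ea", cpath)
PDFFontExtractor = JClass("fontAttributes.PDFFontExtractor")
print(PDFFontExtractor.getFontAttributes())  # assumed to be static, as in the question
shutdownJVM()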

Related

pyspark hiveContext error while executing spark-submit application to yarn and remote CDH kerberized env

The error occurs while executing
airflow@41166b660d82:~$ spark-submit --master yarn --deploy-mode cluster --keytab keytab_name.keytab --principal keytab_name@REALM --jars /path/to/spark-hive_2.11-2.3.0.jar sranje.py
from an Airflow Docker container that is not in the CDH env (not managed by CDH CM). sranje.py is a simple select * from a Hive table.
The app is accepted on CDH YARN and executed twice, with this error:
...
2020-12-31 10:11:43 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "sranje.py", line 21, in <module>
source_df = hiveContext.table(hive_source).na.fill("")
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
2020-12-31 10:11:43 ERROR ApplicationMaster:70 - User application exited with status 1
2020-12-31 10:11:43 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
...
We assume that "some .jars and Java dependencies" are missing. Any ideas?
Details
there is a valid Kerberos ticket before executing the spark-submit command
if we omit --jars /path/to/spark-hive_2.11-2.3.0.jar, the Python error is different
...
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
...
versions of Spark (2.3.0), Hadoop (2.6.0) and Java are the same as in CDH
hive-site.xml, yarn-site.xml etc. are also provided and valid
the same spark-submit app executes OK from a node inside the CDH cluster
we tried adding additional --jars spark-hive_2.11-2.3.0.jar,spark-core_2.11-2.3.0.jar,spark-sql_2.11-2.3.0.jar,hive-hcatalog-core-2.3.0.jar,spark-hive-thriftserver_2.11-2.3.0.jar
The developers use this code as an example:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext, HiveContext, functions as F
from pyspark.sql.utils import AnalysisException
from datetime import datetime
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)
current_date = str(datetime.now().strftime('%Y-%m-%d'))
hive_source = "lnz_ch.lnz_cfg_codebook"
source_df = hiveContext.table(hive_source).na.fill("")
print("Number of records: {}".format(source_df.count()))
print("First 20 rows of the table:")
source_df.show(20)
A different script gives the same error:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("ZekoTest").enableHiveSupport().getOrCreate()
    data = spark.sql("SELECT * FROM lnz_ch.lnz_cfg_codebook")
    data.show(20)
    spark.close()
Thank you.
Hive dependencies were resolved with:
downloading hive.tar.gz with the exact version of CDH Hive
creating symlinks from hive/ to spark/
ln -s apache-hive-1.1.0-bin/lib/*.jar spark-2.3.0-bin-without-hadoop/jars/
additional jars downloaded from the Maven repo to spark/jars/
hive-hcatalog-core-2.3.0.jar
slf4j-api-1.7.26.jar
spark-hive_2.11-2.3.0.jar
spark-hive-thriftserver_2.11-2.3.0.jar
refreshing the env vars
HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' ' ':')
SPARK_DIST_CLASSPATH=$(hadoop classpath)
beeline works, but pyspark throws an error:
2021-01-07 15:02:20 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "sranje.py", line 21, in <module>
source_df = hiveContext.table(hive_source).na.fill("")
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.table.
: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
But that's another question. Thank you all.

Spark 1.6: How do convert an RDD generated from a Scala jar to a pyspark RDD?

I'm trying to create some POC code that demonstrates how a Scala function can be called from PySpark such that the result is a PySpark.RDD.
Here is the code on the Scala side:
object PySpark extends Logger {
  def getTestRDD(sc: SparkContext): RDD[Int] = {
    sc.parallelize(List.range(1, 10))
  }
}
and this is what I'm doing to access it on the PySpark side:
>>> foo = sc._jvm.com.clickfox.combinations.lab.PySpark
>>> jrdd = foo.getTestRDD(sc._jsc.sc())
>>> moo = RDD(jrdd, sc._jsc.sc())
>>> type(moo)
>>> <class 'pyspark.rdd.RDD'>
So far so good - what I get back appears to be an instance of PySpark.RDD. The problems arise when I attempt to use the RDD:
>>> moo.take(1)
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 1267, in take
totalParts = self.getNumPartitions()
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 356, in getNumPartitions
return self._jrdd.partitions().size()
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o25.size. Trace:
py4j.Py4JException: Method size([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
I also tried passing in the PySpark context instead of the Java one to see what would happen:
>>> moo = RDD(jrdd, sc)
>>> moo.collect()
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/local/spark-1.6.3-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o21.rdd. Trace:
py4j.Py4JException: Method rdd([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Still no go. Is there a way to convert, or at least access, the data inside the Java RDD from PySpark?
EDIT: I'm aware that I can convert the RDD to an Array on the Java side of things and iterate through the resulting JavaArray object, but I'd like to avoid that if possible.
what I get back appears to be an instance of PySpark.RDD.
Just because it is a valid PySpark RDD doesn't mean its content can be understood by Python. What you pass is an RDD of Java objects. For internal conversions, Spark uses Pyrolite to re-serialize objects between Python and the JVM.
This is an internal API, but you can:
from pyspark.ml.common import _java2py

rdd = _java2py(
    sc, sc._jvm.com.clickfox.combinations.lab.PySpark.getTestRDD(sc._jsc.sc()))
Note that this approach is fairly limited and supports only basic type conversions.
You can also replace the RDD with a DataFrame:
object PySpark {
  def getTestDataFrame(sqlContext: SQLContext): DataFrame = {
    sqlContext.range(1, 10)
  }
}
from pyspark.sql.dataframe import DataFrame

DataFrame(
    sc._jvm.com.clickfox.combinations.lab.PySpark.getTestDataFrame(
        sqlContext._jsqlContext),
    sqlContext)
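If a Python-side RDD is still needed afterwards, the DataFrame can be converted with .rdd. A small usage sketch; the column name id is what SQLContext.range normally produces and is an assumption here, not something stated above:
from pyspark.sql.dataframe import DataFrame

df = DataFrame(
    sc._jvm.com.clickfox.combinations.lab.PySpark.getTestDataFrame(
        sqlContext._jsqlContext),
    sqlContext)
# df.rdd yields Python Row objects, which Python can work with directly.
print(df.rdd.map(lambda row: row.id).collect())  # "id" column name is assumed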

JPype class not found, $1 with no inner class

I have these java files:
LDF1File.java
LDFFile.java <-- super class
which generate these class files:
LDF1File.class -- there is no inner class
LDF1File$1.class <-- no idea where this comes from
LDFFile.class
In my Python code, I can import LDF1File$1, but not LDF1File. I get:
>>> JClass('aero.blue.bdms.ldf.stream.LDF1File')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/eric/Software/anaconda3/envs/blue3/lib/python3.5/site-packages/JPype1-0.6.1-py3.5-linux-x86_64.egg/jpype/_jclass.py", line 55, in JClass
raise _RUNTIMEEXCEPTION.PYEXC("Class %s not found" % name)
jpype._jexception.RuntimeExceptionPyRaisable: java.lang.RuntimeException: Class aero.blue.bdms.ldf.stream.LDF1File not found
Here's the full code:
from jpype import *
startJVM(getDefaultJVMPath(), "-ea", '-Xms1024m', '-Xmx4096m', '-Djava.class.path=./jars/bdms-chunkjava-lib-1.0.9-SNAPSHOT.jar:./jars/bdms-ldfjava-lib-1.0.9-SNAPSHOT.jar')
LDF1File = JClass('aero.blue.bdms.ldf.stream.LDF1File')
shutdownJVM()
So I'm not sure why there is a class file with a dollar sign in its name, and I'm not sure why JPype can't locate LDF1File. Just to rule some possible suggestions out: there is no dependency injection, no AspectJ stuff, no Spring, no Guava. This is just plain Java.
I had only included in my classpath the jar of the package I was working with and none of its dependencies. Having added all dependency jars to the classpath, JPype was able to load LDF1File.
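For what it's worth, a ClassName$1.class file is normally just an anonymous inner class emitted by javac, so its presence is harmless. Below is a small sketch of assembling the classpath from every jar in the jars/ directory so that dependency jars are not missed; the glob-based assembly is a suggestion of mine, not part of the original answer:
import glob
import os
from jpype import startJVM, shutdownJVM, getDefaultJVMPath, JClass

# Pick up the application jar *and* all of its dependency jars.
jars = sorted(glob.glob(os.path.join("jars", "*.jar")))
classpath = os.pathsep.join(jars)

startJVM(getDefaultJVMPath(), "-ea", "-Xms1024m", "-Xmx4096m",
         "-Djava.class.path=" + classpath)
LDF1File = JClass("aero.blue.bdms.ldf.stream.LDF1File")
shutdownJVM()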

Using a functional java construct (Predicate) from jython

So, I'm attempting to use the Selenium Java libraries from Jython (yes, I know Selenium has a Python interface, but for good reasons of corporate teamwork, accessing the pure Java libraries makes more sense if it can be done well).
I'm just trying to do the example script here: http://seleniumhq.org/docs/03_webdriver.html#introducing-the-selenium-webdriver-api-by-example
Which I've implemented with the following Jython code:
from org.openqa.selenium.firefox import FirefoxDriver
from org.openqa.selenium import By
from org.openqa.selenium import WebDriver
from org.openqa.selenium import WebElement
from org.openqa.selenium.support.ui import ExpectedCondition
from org.openqa.selenium.support.ui import WebDriverWait
driver = FirefoxDriver()
driver.get('http://www.google.com')
element = driver.findElement(By.name('q'))
# The array wrapper around the string is the only weird thing I encountered
element.sendKeys(["Cheese!"])
print "Page title is: " + driver.getTitle()
class ExpectedConditionTitle(ExpectedCondition):
    def apply(d):
        print(type(d))
        return d.title.toLowerCase().startsWith(["cheese!"])
    def equals(d):
        pass
print(type(driver))
WebDriverWait(driver, 10).until(ExpectedConditionTitle().apply())
print driver.getTitle()
driver.quit()
And it's puking on the ExpectedCondition bit. I can't figure out how to make a subclass of the variety desired by until(). I've gotten the following errors with variations in my code:
Traceback (innermost last):
File "Example.py", line 24, in ?
File "Example.py", line 19, in apply
AttributeError: 'instance' object has no attribute 'title'
and
Traceback (innermost last):
File "Example.py", line 24, in ?
File "Example.py", line 19, in apply
AttributeError: getTitle
and
Traceback (innermost last):
File "Example.py", line 22, in ?
TypeError: until(): 1st arg can't be coerced to com.google.common.base.Function or com.google.common.base.Predicate
The Selenium ExpectedCondition interface is basically just a front for the Guava Predicate interface.
I'm not well versed enough in Python or Java to figure this out. Does anyone have any ideas how I might accomplish this?
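One direction that might work, offered as an untested sketch rather than a definitive answer: until() expects the condition object itself (something Jython can coerce to the ExpectedCondition/Function interface), not the result of calling apply(), and a Jython method implementing a Java interface still needs an explicit self parameter before the driver argument:
from org.openqa.selenium.support.ui import ExpectedCondition, WebDriverWait

class TitleStartsWithCheese(ExpectedCondition):
    # Jython maps this to ExpectedCondition.apply(WebDriver); note the
    # explicit self in addition to the driver argument.
    def apply(self, driver):
        # Jython coerces the returned java.lang.String to a Python string.
        return driver.getTitle().lower().startswith("cheese!")

# 'driver' is the FirefoxDriver created earlier in the question's script.
# Pass the condition instance, not the result of calling apply():
WebDriverWait(driver, 10).until(TitleStartsWithCheese())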

Jython missing functions in sys module

I have a Python script which I need to run in my Java application. I tried to execute it from Jython but I have a strange problem:
from sys import getdlopenflags
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name getdlopenflags
When I try to check the contents of sys:
import sys
dir(sys)
the output is:
['JYTHON_DEV_JAR', 'JYTHON_JAR', 'PYTHON_CACHEDIR', 'PYTHON_CACHEDIR_SKIP', 'PYTHON_CONSOLE_ENCODING', '__delattr__', '__dict__', '__displayhook__', '__excepthook__', '__findattr_ex__', '__name__', '__new__', '__rawdir__', '__setattr__', '__stderr__', '__stdin__', '__stdout__', '_getframe', '_jy_interpreter', '_systemRestart', 'add_classdir', 'add_extdir', 'add_package', 'argv', 'builtin_module_names', 'builtins', 'byteorder', 'classDictInit', 'classLoader', 'cleanup', 'copyright', 'currentWorkingDir', 'defaultencoding', 'determinePlatform', 'displayhook', 'doInitialize', 'exc_clear', 'exc_info', 'excepthook', 'exec_prefix', 'executable', 'exit', 'filesystemencoding', 'getBaseProperties', 'getBuiltin', 'getBuiltins', 'getClassLoader', 'getCurrentWorkingDir', 'getDefaultBuiltins', 'getPath', 'getPathLazy', 'getPlatform', 'getWarnoptions', 'getdefaultencoding', 'getfilesystemencoding', 'getrecursionlimit', 'hexversion', 'initialize', 'isPackageCacheEnabled', 'last_traceback', 'last_type', 'last_value', 'maxint', 'maxunicode', 'meta_path', 'minint', 'modules', 'packageManager', 'path', 'path_hooks', 'path_importer_cache', 'platform', 'prefix', 'ps1', 'ps2', 'recursionlimit', 'registerCloser', 'registry', 'setBuiltins', 'setClassLoader', 'setCurrentWorkingDir', 'setPlatform', 'setWarnoptions', 'setprofile', 'setrecursionlimit', 'settrace', 'shadow', 'stderr', 'stdin', 'stdout', 'subversion', 'toString', 'unregisterCloser', 'version', 'version_info', 'warnoptions']
Obviously getdlopenflags is missing. Is it possible to use this function in Jython (I have the newest, 2.5.2)? According to the documentation at http://jython.org/docs/library/sys.html, sys.getdlopenflags is present.
Thanks for the help.
It says "Availability: Unix" in the documentation. The Jython docs seem to have copied that unchanged from the CPython docs, so this function is only available on a Unix installation. Possibly Jython doesn't have it at all - I don't know Java well enough, but since it's supposedly platform-independent, it can't support system-specific functions.
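If the same script has to run under both CPython and Jython, one defensive option (a small sketch of mine, not part of the original answer) is to probe for the function and fall back when it is absent:
import sys

# getdlopenflags() only exists where dlopen() is meaningful (CPython on Unix);
# Jython does not provide it, so guard the call.
if hasattr(sys, "getdlopenflags"):
    flags = sys.getdlopenflags()
else:
    flags = None  # e.g. running on Jython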
