In Pig, you can pass a configuration value from your Pig script to a Pig UDF via UDFContext. For example:
// in Pig script
SET my.conf dummy-conf
// in UDF Java code
Configuration conf = UDFContext.getUDFContext().getJobConf();
String myConf = conf.get("my.conf");
So, is there a similar way to pass a configuration value from a Hive script to a Hive UDF? For example, if I have set MY_CONF='foobar' in a Hive script, how can I retrieve that value in a Java UDF which needs to consume it?
Instead of extending the UDF class, you can try subclassing GenericUDF. That class has the following method you can override:
/**
 * Additionally setup GenericUDF with MapredContext before initializing.
 * This is only called in runtime of MapRedTask.
 *
 * @param context context
 */
public void configure(MapredContext context) {
}
MapredContext has a method just like Pig's UDFContext to retrieve the job configuration. So you could do the following:
@Override
public void configure(MapredContext context) {
    Configuration conf = context.getJobConf();
}
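For illustration, here is a minimal, self-contained sketch of a GenericUDF built this way. The class name, the my.conf property, and the return behavior are assumptions for the example, not part of the original question:

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical UDF that returns the value of a job configuration property
// set in the script with: SET my.conf=dummy-conf;
public class ConfReaderUDF extends GenericUDF {

    private String myConf; // populated in configure(), only at MapRed runtime

    @Override
    public void configure(MapredContext context) {
        myConf = context.getJobConf().get("my.conf"); // assumed property name
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) {
        return myConf;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "conf_reader()";
    }
}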
As of Hive 1.2 there are two approaches.
1. Overriding the configure method of GenericUDF
@Override
public void configure(MapredContext context) {
    super.configure(context);
    someProp = context.getJobConf().get(HIVE_PROPERTY_NAME);
}
Approach (1) won't work in all cases: configure is only called when there is a MapredContext, i.e., when the query actually runs as a map/reduce job. To force every query into map/reduce jobs, set:
set hive.fetch.task.conversion=minimal;   -- or none
set hive.optimize.constant.propagation=false;
With these properties set, you will hit major performance problems, especially for smaller queries.
2. Using SessionState
SessionState ss = SessionState.get();
if (ss != null) {
    this.hiveConf = ss.getConf();
    someProp = this.hiveConf.get(HIVE_PROPERTY_NAME);
    LOG.info("Got someProp: " + someProp);
}
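As a rough sketch, the SessionState lookup could live in the UDF's initialize method, which runs even when no map/reduce job is launched (the class name and property name here are placeholders):

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.session.SessionState;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical UDF that reads a Hive property via SessionState.
public class SessionPropUDF extends GenericUDF {

    private String someProp;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // SessionState is present when the UDF is initialized in the same
        // process as the Hive session (e.g. for fetch tasks), where
        // configure(MapredContext) would never be called.
        SessionState ss = SessionState.get();
        if (ss != null) {
            someProp = ss.getConf().get("my.property.name"); // assumed name
        }
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) {
        return someProp;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "session_prop()";
    }
}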
Go to the Hive command line:
hive> set MY_CONF='foobar';
Your variable should be listed when you run the command:
hive> set;
Now, suppose you have the following:
Jar: MyUDF.jar
UDF class: MySampleUDF.java, which accepts a String value.
Table: employee
hive> ADD JAR /MyUDF.jar;
hive> CREATE TEMPORARY FUNCTION testUDF AS 'youpackage.MySampleUDF';
hive> SELECT testUDF(${hiveconf:MY_CONF}) FROM employee;
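For completeness, a minimal sketch of what MySampleUDF might look like; the package name youpackage is taken from the CREATE FUNCTION statement above, while the body is an assumption:

package youpackage;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF: echoes the String value passed from the script,
// e.g. the substituted ${hiveconf:MY_CONF} value.
public class MySampleUDF extends UDF {

    public Text evaluate(Text value) {
        if (value == null) {
            return null;
        }
        return new Text("Received: " + value.toString());
    }
}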
There are lots of examples shared online, so you can find all the required details with a quick search. A small example, as described in the shared links:
hive> ADD JAR assembled.jar;
hive> create temporary function hello as 'com.test.example.UDFExample';
hive> select hello(firstname) from people limit 10;
Please check these links for reference, which I normally use:
Link1
Link2
I'm developing a Neo4j procedure in Java. I can test it with the custom data below.
@Test
public void commonTargetTest2() {
    // This is in a try-block, to make sure we close the driver after the test
    try (Driver driver = GraphDatabase.driver(embeddedDatabaseServer.boltURI(), driverConfig);
         Session session = driver.session()) {
        // And given I have a node in the database
        session.run(
            "CREATE (n1:Person {name:'n1'}) CREATE (n2:Person {name:'n2'}) CREATE (n3:Person {name:'n3'}) CREATE (n4:Person {name:'n4'}) CREATE (n5:Person {name:'n5'}) "
            + "CREATE (n6:Person {name:'n6'}) CREATE (n7:Person {name:'n7'}) CREATE (n8:Person {name:'n8'}) CREATE (n9:Person {name:'n9'}) CREATE (n10:Person {name:'n10'}) "
            + "CREATE (n11:Person {name:'n11'}) CREATE (n12:Person {name:'n12'}) CREATE (n13:Person {name:'n13'}) "
            + "CREATE (n14:Person {name:'n14'}) CREATE "
            + "(n1)-[:KNOWS]->(n6),(n2)-[:KNOWS]->(n7),(n3)-[:KNOWS]->(n8),(n4)-[:KNOWS]->(n9),(n5)-[:KNOWS]->(n10),"
            + "(n7)-[:KNOWS]->(n11),(n8)-[:KNOWS]->(n12),(n9)-[:KNOWS]->(n13),"
            + "(n11)-[:KNOWS]->(n14),(n12)-[:KNOWS]->(n14),(n13)-[:KNOWS]->(n14);");
        // name of the procedure I defined is "p1", below I'm calling it in Cypher
        StatementResult result = session
            .run("CALL p1([1,3], [], 3, 0) YIELD nodes, edges RETURN nodes, edges");
        InternalNode n = (InternalNode) result.single().get("nodes").asList().get(0);
        assertThat(n.id()).isEqualTo(13);
    }
}
This works fine, but the data is newly generated with CREATE statements and it is very small. I want to test my procedure against an existing Neo4j database server, so that I can see the performance/results of my procedure with real/big data.
I can also achieve that with the code below, which connects to an up-and-running Neo4j database.
@Test
public void commonTargetTestOnImdb() {
    // This is in a try-block, to make sure we close the driver after the test
    try (Driver drv = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "123"));
         Session session = drv.session()) {
        // find 1 common downstream of 3 nodes
        StatementResult result = session.run(
            "CALL commonStream([1047255, 1049683, 1043696], [], 3, 2) YIELD nodes, edges RETURN nodes, edges");
        InternalNode n = (InternalNode) result.single().get("nodes").asList().get(0);
        assertThat(n.id()).isEqualTo(5);
    }
}
Now, my problem is that I can't debug the code of my procedure if I connect to an existing database. I package a JAR file and put it inside the plugins folder of my Neo4j database so that Neo4j can call my procedure. I think I should debug the JAR file. I'm using VS Code with the Java extensions to debug and run tests. How can I debug a JAR file with VS Code?
For the record, I found a way to debug my Neo4j stored procedure. I'm using Java 8 and IntelliJ IDEA. I added the config dbms.jvm.additional=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 to the neo4j.conf file.
Inside IntelliJ IDEA, I added a new run configuration for remote debugging.
Note that the syntax in Java 9+ is different: at the end, to give the port parameter, it uses address=*:5005. There is a post about it here: https://stackoverflow.com/a/62754503/3209523
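For reference, the two variants of that neo4j.conf line would look something like this (keeping port 5005 from above):

# Java 8
dbms.jvm.additional=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
# Java 9+
dbms.jvm.additional=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005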
In my application, I need a Java process to run via a trigger: whenever a change is made to the dateLastChanged field of TrackTable, that process should fire.
I followed these steps:
1.) compiled the following code:
import java.io.IOException;

public class UDFClass {

    public static int udf(String coCode, String tNo) {
        int result = 1;
        String command = "<the java command here>"; // line-K
        try {
            Runtime.getRuntime().exec(command);
        } catch (IOException e) {
            result = 0;
        }
        return result;
    }
}
2.) put UDFClass.class under the sqllib/function directory of the DB2 installation
3.) ran the following statement on DB2; it completed successfully:
create or replace function theresUpdate(cocode varchar(4), tnumber varchar(20))
returns integer
external name 'UDFClass.udf'
language Java
parameter style Java
not deterministic
no sql
external action
4.) successfully ran the following trigger on DB2:
create or replace trigger notify
after update of dateLastChanged
or insert
on TrackTable
REFERENCING new as newRow
for each row
not secured
begin
update performanceParams set thevalue = theresUpdate(newRow.cocode, newRow.tnumber) where thekey = 'theDate';
end
To test, I updated TrackTable as follows:
update TrackTable set dateLastChanged = dateLastChanged + 10 where tNumber = '21123'
This update query runs successfully without the trigger in (4) above. However, with the trigger in place, I get the following error:
An error occurred in a triggered SQL statement in trigger "LTR.NOTIFY". Information returned for the error includes SQLCODE "-4301", SQLSTATE "58004" and message tokens "1".. SQLCODE=-723, SQLSTATE=09000, DRIVER=4.18.60
This page indicates that it's a Java-related error. However, the command in line-K of the Java code in (1) runs fine when I invoke it from the Linux command line.
I tried some other variations of the function in (3): deterministic rather than not deterministic, and not fenced.
I'm using DB2 version 10.5.
Please help!
I wrote the following MyPythonGateway.java so that I can call my custom Java class from Python:
import py4j.GatewayServer;

public class MyPythonGateway {

    public String findMyNum(String input) {
        return MyUtility.parse(input).getMyNum();
    }

    public static void main(String[] args) {
        GatewayServer server = new GatewayServer(new MyPythonGateway());
        server.start();
    }
}
Here is how I used it in my Python code:
from py4j.java_gateway import JavaGateway

def main():
    gateway = JavaGateway()  # connect to the JVM
    myObj = gateway.entry_point.findMyNum("1234 GOOD DAY")
    print(myObj)

if __name__ == '__main__':
    main()
Now I want to use the MyPythonGateway.findMyNum() function from PySpark, not just in a standalone Python script. I did the following:
myNum = sparkcontext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
print(myNum)
However, I got the following error:
... line 43, in main:
myNum = sparkcontext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
File "/home/edamameQ/spark-1.5.2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__
py4j.protocol.Py4JError: Trying to call a package.
So what did I miss here? I don't know if I should run a separate Java application of MyPythonGateway to start a gateway server when using PySpark. Please advise. Thanks!
Below is exactly what I need:
input.map(f)

def f(row):
    # call MyUtility.java
    # x = MyUtility.parse(row).getMyNum()
    # return x

What would be the best way to approach this? Thanks!
First of all, the error you see usually means the class you're trying to use is not accessible, so it is most likely a CLASSPATH issue.
Regarding the general idea, there are two important issues:
you cannot access SparkContext inside an action or transformation, so using the PySpark gateway won't work (see How to use Java/Scala function from an action or a transformation? for some details). If you want to use Py4J from the workers, you'll have to start a separate gateway on each worker machine.
you really don't want to pass data between Python and the JVM this way. Py4J is not designed for data-intensive tasks.
In PySpark, before calling the method
myNum = sparkcontext._jvm.myPackage.MyPythonGateway.findMyNum("1234 GOOD DAY")
you have to import the MyPythonGateway Java class as follows:
from py4j.java_gateway import java_import

java_import(spark.sparkContext._jvm, "myPackage.MyPythonGateway")
myPythonGateway = spark.sparkContext._jvm.MyPythonGateway()
myPythonGateway.findMyNum("1234 GOOD DAY")
Also specify the jar containing myPackage.MyPythonGateway with the --jars option in spark-submit.
If input.map(f) has inputs as an RDD, for example, this might work, since you can't access the JVM variable (attached to the Spark context) inside the executor for a map function of an RDD (and to my knowledge there is no equivalent of @transient lazy val in PySpark).
import py4j.java_gateway

def pythonGatewayIterator(iterator):
    results = []
    # connect to the gateway running on this worker
    jvm = py4j.java_gateway.JavaGateway().jvm
    mygw = jvm.myPackage.MyPythonGateway()
    for value in iterator:
        results.append(mygw.findMyNum(value))
    return results

inputs.mapPartitions(pythonGatewayIterator)
All you need to do is compile the jar and add it to the PySpark classpath with the --jars or --driver-class-path spark-submit options. Then access the class and method with the code below:
sc._jvm.com.company.MyClass.func1()
where sc is the Spark context.
Tested with Spark 2.3. Keep in mind that you can call a JVM class method only from the driver program, not from an executor.
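For example, a submission along these lines should make the class visible to the driver (jar and script names are hypothetical):

spark-submit --jars /path/to/my-gateway.jar my_script.py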
I am currently implementing a shell with limited functionality using the Java programming language. The scope of the shell has restricted requirements too; the task is to model a Unix shell as closely as I can.
While implementing the cd command, I referenced a Basic Shell Commands page, which mentions that cd can go back to the last directory you were in with the command "cd -".
I am given only an interface with the method public String execute(File presentWorkingDirectory, String stdin).
I would like to know if there is a Java API call with which I can retrieve the previous working directory, or if there is any existing implementation of this command.
I know one simple implementation is to declare a variable to store the previous working directory. However, I currently have the shell itself (the one that takes in the command with options), and each time a command tool is executed, a new thread is created. Hence I do not think it is advisable for the "main" thread to store the previous working directory.
Update (6-Mar-'14): Thanks for the suggestions! I have now discussed this with the coder of the shell, and we have added an additional variable to store the previous working directory. Below is sample code for sharing:
public class CdTool extends ATool implements ICdTool {

    private static String previousDirectory;

    //Constructor
    /**
     * Create a new CdTool instance so that it represents an unexecuted cd command.
     *
     * @param arguments
     *            the arguments that are to be passed in to execute the command
     */
    public CdTool(final String[] arguments) {
        super(arguments);
    }

    /**
     * Executes the tool with arguments provided in the constructor
     *
     * @param workingDir
     *            the current working directory path
     * @param stdin
     *            the additional input from the stdin
     * @return the message to be shown on the shell, null if there is no error
     *         from the command
     */
    @Override
    public String execute(final File workingDir, final String stdin) {
        setStatusCode(0);
        String output = "";
        final String newDirectory;
        // use equals() rather than == for the string comparison
        if ("-".equals(this.args[0]) && previousDirectory != null) {
            newDirectory = previousDirectory;
        } else {
            newDirectory = this.args[0];
        }
        // compare against the path of the File, not the File object itself
        if (!newDirectory.equals(workingDir.getAbsolutePath())
                && changeDirectory(newDirectory) == null) {
            setStatusCode(DIRECTORY_ERROR_CODE);
            output = DIRECTORY_ERROR_MSG;
        } else {
            previousDirectory = workingDir.getAbsolutePath();
            output = changeDirectory(newDirectory).getAbsolutePath();
        }
        return output;
    }
}
P.S.: Please note that this is not the full implementation of the code, and this is not the full functionality of cd.
A real shell (at least Bash) stores the current working directory path in the PWD environment variable and the old working directory path in OLDPWD. Rewriting PWD does not change your working directory, but rewriting OLDPWD really does change where cd - will take you.
Try this:
cd /tmp
echo "$OLDPWD" # /home/palec
export OLDPWD='/home'
cd - # changes working directory to /home
I don't know how you implement the shell functionality (namely how you represent the current working directory; usually it's an inherent property of the process, implemented by the kernel), but I think that you really have to keep the old working directory in an extra variable.
By the way, the shell also forks for each command executed (except for the internal ones). The current working directory is a property of a process. When a command is started, it can change its own current working directory, but that does not affect the shell's. Only the cd command (which is internal) can change the shell's current working directory.
If you want to keep more than one working directory, just create a LinkedList where you add each new presentWorkingDirectory at the end; when you want to go back, use removeLast() to get the last working directory.
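A minimal sketch of that idea, assuming a single history object shared by all command invocations (the class and method names are made up for the example):

import java.io.File;
import java.util.LinkedList;

// Hypothetical helper for the shell: a shared history of working
// directories, so "cd -" (or a deeper "back") can be supported.
public class DirectoryHistory {

    private final LinkedList<File> history = new LinkedList<>();

    // Record the directory we are about to leave.
    public synchronized void add(final File presentWorkingDirectory) {
        history.addLast(presentWorkingDirectory);
    }

    // Return the most recently recorded directory, or null if none.
    public synchronized File back() {
        return history.isEmpty() ? null : history.removeLast();
    }
}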
What would be the preferable way to update the schema_version table and execute modified PL/SQL packages/procedures in Flyway without code duplication?
My example would require a class file to be created for each PL/SQL code modification:
public class V2_1__update_scripts extends AbstractMigration {
    // update package and procedures
}
The AbstractMigration class executes the files in the db/update folder:
public abstract class AbstractMigration implements SpringJdbcMigration {

    private static final Logger log = LoggerFactory.getLogger(AbstractMigration.class);

    @Override
    public void migrate(JdbcTemplate jdbcTemplate) throws Exception {
        Resource packageFolder = new ClassPathResource("db/update");
        Collection<File> files = FileUtils.listFiles(packageFolder.getFile(), new String[]{"sql"}, true);
        for (File file : files) {
            log.info("Executing [{}]", file.getAbsolutePath());
            String fileContents = FileUtils.readFileToString(file);
            jdbcTemplate.execute(fileContents);
        }
    }
}
Is there any better way of executing PL/SQL code?
I wonder if it's better to duplicate the code into the standard migrations folder. It seems like, with the given example, you wouldn't then be able to migrate up to version N of the db, as some prior version would execute the current version of all the PL/SQL. I'd be interested to see if you settled on a solution for this.
There is no built-in support or other command you have missed.
Off the top of my head, I would think about either the way you presented here or using a generator to produce new migration SQL files after an SCM commit.
Let's see if someone else has found a better solution.
The version of Flyway current at the time of this writing (v4.2.0) supports the notion of repeatable scripts, designed specifically for such situations. Basically, any script with "create or replace" semantics is a candidate.
Simply name your script R__mypackage_body.sql, or use whatever prefix you have configured for repeatable scripts. Please see SQL-based migrations and Repeatable migrations for further information.
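For example, a repeatable migration for a package body might look like the following (package and procedure names are hypothetical):

-- R__mypackage_body.sql: Flyway re-runs this script whenever its checksum changes
CREATE OR REPLACE PACKAGE BODY mypackage AS
  PROCEDURE do_work IS
  BEGIN
    NULL; -- real implementation goes here
  END do_work;
END mypackage;
/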