I am using Apache Pig from Hue to perform ETL operations on files using the script etl-op.pig. The output is stored into the specified folder in HDFS using the following line:
STORE outval INTO '/user/root/Pig-Output';
However, the next time the script is run, it says the output folder already exists and doesn't create a separate folder.
Is there any way to create a Java UDF in Pig using Hue so that a unique identifier can be generated and appended to the 'Pig-Output' folder name in the script?
You can do it without a UDF:
Define a variable like the current unix timestamp:
%default TS `date +%s`
And then use it, e.g., as a postfix of your folder name:
STORE outval INTO '/user/root/Pig-Output_$TS' ...
In a MySQL UDF, why does it say "Could not load jvm.dll" when I am trying to call a Java method using javaudflauncher.dll?
I want to call a Java method from MySQL via a MySQL trigger. I used the sample example provided by https://dejard.bitbucket.io
I have followed the steps as documented:
- Set up the environment variables for Java & MySQL
- Copied the DLL, JAR and class files to the plugin directory
- Created the function.
Up to here it works fine, but when I try to call the function it gives me the error below:
"ERROR 1123 (HY000): Can't initialize function 'call_java_method'; Could not load jvm.dll"
Can anyone help me figure out where I am going wrong?
I have a full H2 database with lots of data in it. I want to launch integration tests against that data.
Question 1: Is it possible to generate *.sql insert files/scripts from a full H2 database?
I've tried SCRIPT TO 'fileName' as described here, but it generates only CREATE/ALTER TABLE/CONSTRAINT statements, i.e. it creates the schema without the data.
If the answer to the first question is "impossible", then:
Question 2: Are *.sql insert files the only way to insert an initial dataset into an H2 DB for integration tests?
Question 1: Is it possible to generate *.sql insert files/scripts from a full H2 database?
I have just tested this with one of my H2 file databases, and the export produces both structure and data.
I tested with version 1.4.193 of H2.
Both ways of exporting work:
The SCRIPT command from the H2 console
The org.h2.tools.Script tool from the command line.
1) I tested the org.h2.tools.Script tool first, as I had already used it.
Here is the minimal command to export structure and data:
java -cp <whereFoundYourH2Jar> org.h2.tools.Script -url <url> -user <user> -password <password>
Where:
<whereFoundYourH2Jar> is the classpath where you have the h2.jar lib (I used the one in my m2 repo).
<url> is the URL of your database
<user> is the user of the database
<password> is the password of the database
You can find more details in the official help of the org.h2.tools.Script tool:
Creates a SQL script file by extracting the schema and data of a database.
Usage: java org.h2.tools.Script <options>
Options are case sensitive. Supported options are:
[-help] or [-?] Print the list of options
[-url "<url>"] The database URL (jdbc:...)
[-user <user>] The user name (default: sa)
[-password <pwd>] The password
[-script <file>] The target script file name (default: backup.sql)
[-options ...] A list of options (only for embedded H2, see SCRIPT)
[-quiet] Do not print progress information
See also http://h2database.com/javadoc/org/h2/tools/Script.html
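If you prefer to run the export from Java code rather than from the shell, here is a minimal sketch that drives the same tool through its command-line entry point; the URL, user, password and target file are placeholder values, and h2.jar is assumed to be on the classpath.
import org.h2.tools.Script;

public class ExportDatabase {
    public static void main(String[] args) throws Exception {
        // Equivalent to: java -cp h2.jar org.h2.tools.Script -url ... -user ... -password ... -script ...
        Script.main(
                "-url", "jdbc:h2:~/testdb",   // placeholder database URL
                "-user", "sa",                // placeholder user
                "-password", "",              // placeholder password
                "-script", "backup.sql");     // target script file
    }
}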
2) I also tested the SCRIPT command from the H2 console. It works as well.
Nevertheless, the result of the SCRIPT command may be misleading.
Look at the official documentation:
If no 'TO fileName' clause is specified, the script is returned as a
result set. This command can be used to create a backup of the
database. For long term storage, it is more portable than copying the
database files.
If a 'TO fileName' clause is specified, then the whole script
(including insert statements) is written to this file, and a result
set without the insert statements is returned.
You have used the SCRIPT TO 'fileName' command. In this case, the whole script (including insert statements) is written to the file, and the result set shown in the H2 console contains everything but the insert statements.
For example, enter the SCRIPT TO 'D:\yourBackup.sql' command (or a Unix-friendly path if that is what you use), then open the file: you will see that the SQL insert statements are present.
As specified in the documentation, if you want both the structure and the insert statements in the result shown in the H2 console, don't specify the TO clause.
Just type: SCRIPT.
Question 2: Are *.sql insert files the only way to insert an initial dataset into an H2 DB for integration tests?
No. As discussed at length :) you can also load the data with a DBUnit dataset (one solution among others).
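For illustration, here is a minimal sketch of that approach, assuming a hypothetical dataset.xml FlatXml file on the test classpath and placeholder connection settings; it loads the dataset into H2 with a clean insert before the tests run.
import java.sql.Connection;
import java.sql.DriverManager;
import org.dbunit.database.DatabaseConnection;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.operation.DatabaseOperation;

public class LoadInitialDataset {
    public static void main(String[] args) throws Exception {
        // Plain JDBC connection to the H2 database (URL, user and password are placeholders)
        Connection jdbc = DriverManager.getConnection("jdbc:h2:~/testdb", "sa", "");
        IDatabaseConnection connection = new DatabaseConnection(jdbc);

        // dataset.xml is a hypothetical FlatXml file describing the initial rows
        IDataSet dataSet = new FlatXmlDataSetBuilder()
                .build(LoadInitialDataset.class.getResourceAsStream("/dataset.xml"));

        // Empty the tables referenced by the dataset, then insert the rows
        DatabaseOperation.CLEAN_INSERT.execute(connection, dataSet);
    }
}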
I have one HDFS cluster and my local system, from which I'm executing a program to perform a copy within the same HDFS system.
Like: hadoop fs -cp /user/hadoop/SrcFile /user/hadoop/TgtFile
I'm using:
FileUtil.copy(FileSystem srcFS,
FileStatus srcStatus,
FileSystem dstFS,
Path dst,
boolean deleteSource,
boolean overwrite,
Configuration conf)
But something weird is happening: when I do the copy from the command line, it just takes a moment, but when I do it programmatically it takes 10-15 minutes to copy a 190 MB file.
To me it looks like it's streaming the data via my local system instead of copying directly, even though the destination is on the same filesystem as the source.
Correct me if I'm wrong, and please help me find the best solution.
You are right: with FileUtil.copy the data is streamed through your program (src --> your program --> dst). If Hadoop's filesystem shell (hadoop fs -cp) is faster, then you can run the same command through Runtime.exec(cmd).
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java
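As an illustration of that suggestion, here is a minimal sketch that shells out to the same hadoop fs -cp command from Java with Runtime.exec; the source and target paths are the placeholders from the question, and the hadoop binary is assumed to be on the PATH.
import java.io.IOException;

public class ShellCopy {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Same command as the shell example above; source and target paths are placeholders
        String[] cmd = {"hadoop", "fs", "-cp", "/user/hadoop/SrcFile", "/user/hadoop/TgtFile"};
        Process process = Runtime.getRuntime().exec(cmd);
        int exitCode = process.waitFor();   // wait for the copy to finish
        if (exitCode != 0) {
            throw new IOException("hadoop fs -cp exited with code " + exitCode);
        }
    }
}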
I use the following code in a views.py
def berkeleyParser(infile,outfile):
cmd="java -Xmx1024m -jar nlptools/BerkeleyParser/berkeleyParser-1.7.jar -gr nlptools/BerkeleyParser/chn_sm5.gr < "+infile+" > "+outfile
os.system(cmd)
and then call this function to run the Berkeley parser.
I think the file path is ok, because the jar can successfully create the output file.
Meanwhile, I used an independent .py script to run the code above (with the paths modified) and got the correct result in the output file.
So, I don't know what's wrong with it.
I get the following error:
org.dbpedia.spotlight.exceptions.ConfigurationException: Cannot find spotter file ../dist/src/deb/control/data/usr/share/dbpedia-spotlight/spotter.dict
at org.dbpedia.spotlight.model.SpotterConfiguration.<init>(SpotterConfiguration.java:54)
at org.dbpedia.spotlight.model.SpotlightConfiguration.<init>(SpotlightConfiguration.java:143)
at org.dbpedia.spotlight.web.rest.Server.main(Server.java:70)
Usage:
java -jar dbpedia-spotlight.jar org.dbpedia.spotlight.web.rest.Server [config file]
or:
mvn scala:run "-DaddArgs=[config file]"
Quick solution:
wget http://spotlight.dbpedia.org/download/release-0.5/dbpedia-spotlight-quickstart.zip
unzip dbpedia-spotlight-quickstart.zip
cd dbpedia-spotlight-quickstart/
./run.sh
Explanation:
DBpedia Spotlight looks for ~3.5M things of ~320 types in text and tries to disambiguate them to their global unique identifiers in DBpedia. Therefore it needs data files to accompany its jar. A minuscule example is distributed along with the source, but for real use cases you may need the larger files. After you've downloaded the files, you need to modify the configuration in server.properties with the correct path to the files. The error message you got tells you that one of the necessary files (spotter.dict) could not be found in the path you indicated in your server.properties.
More information available here:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Run-from-a-JAR