Hive UDF in Java fails when creating a table - java

What is the difference between those two queries:
SELECT my_fun(col_name) FROM my_table;
and
CREATE TABLE new_table AS SELECT my_fun(col_name) FROM my_table;
where my_fun is a Java UDF.
I'm asking because when I create a new table (the second query) I receive a Java error:
Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed
...
Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentException: Unable to instantiate UDF implementation class com.company_name.examples.ExampleUDF: java.lang.NullPointerException
I found that the source of the error is this line in my Java file:
encoded = Files.readAllBytes(Paths.get(configPath));
But the question is: why does it work when the table is not created, yet fail when the table is created?

The problem might be with the way you read the file. Try to pass the file path as the second argument of the UDF, then read it as follows:
private BufferedReader getReaderFor(String filePath) throws HiveException {
    try {
        Path fullFilePath = FileSystems.getDefault().getPath(filePath);
        Path fileName = fullFilePath.getFileName();
        if (Files.exists(fileName)) {
            return Files.newBufferedReader(fileName, Charset.defaultCharset());
        } else if (Files.exists(fullFilePath)) {
            return Files.newBufferedReader(fullFilePath, Charset.defaultCharset());
        } else {
            throw new HiveException("Could not find \"" + fileName + "\" or \"" + fullFilePath + "\" in intersect_file() UDF.");
        }
    } catch (IOException exception) {
        throw new HiveException(exception);
    }
}

private void loadFromFile(String filePath) throws HiveException {
    set = new HashSet<String>();
    try (BufferedReader reader = getReaderFor(filePath)) {
        String line;
        while ((line = reader.readLine()) != null) {
            set.add(line);
        }
    } catch (IOException e) {
        throw new HiveException(e);
    }
}
The full code for a different generic UDF that utilizes a file reader can be found here.

I think several points are unclear, so this answer is based on assumptions.
First of all, it is important to understand that Hive currently optimizes several simple queries, and depending on the size of your data, the query that is working for you, SELECT my_fun(col_name) FROM my_table;, is most likely running locally on the client where you are executing the job. That is why your UDF can access your config file, which is available locally; this "execution mode" is due to the size of your data. CTAS, on the other hand, triggers a job independent of the input data; that job runs distributed across the cluster, where each worker fails to access your config file.
It looks like you are trying to read your configuration file from the local file system, not from HDFS: Files.readAllBytes(Paths.get(configPath)). This means that your configuration has to either be replicated on all the worker nodes or be added beforehand to the distributed cache (you can use add file for this, doc here). You can find other questions here about accessing files from the distributed cache from UDFs.
One additional problem is that you are passing the location of your config file through an environment variable, which is not propagated to worker nodes as part of your Hive job. You should pass this configuration as a Hive config; there is an answer about accessing the Hive config from a UDF here, assuming that you are extending GenericUDF.
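For illustration, a minimal sketch of that idea, assuming the path is passed as a job property rather than an environment variable (the property name my.udf.config.path and the class name are made up for this example), so that each worker can read it from the MapredContext:
import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ConfigAwareExampleUDF extends GenericUDF {
    private String configPath;

    @Override
    public void configure(MapredContext context) {
        // Called on each task when the query runs as a distributed job
        // (it may not be called at all in local fetch mode).
        // Set the property before the query, e.g.: SET my.udf.config.path=/path/visible/to/workers/config.json;
        if (context != null && context.getJobConf() != null) {
            configPath = context.getJobConf().get("my.udf.config.path");
        }
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Load the file lazily here (from the distributed cache or HDFS) now that
        // configPath is known on the worker; for brevity the sketch just returns it.
        return configPath;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "config_aware_example_udf()";
    }
}
This is only a sketch of the mechanism, not the asker's original ExampleUDF; the key point is that the path arrives through the job configuration, which is shipped to every worker.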

Related

Save a variable when the server is off

In fact I am making a Minecraft plugin and I was wondering how some plugins (without using a DB) manage to keep information even when the server is off.
For example, if we make a ranks plugin and we create a different list for each rank, holding the players who belong to it, then when the server shuts down and restarts afterwards, the lists become empty again (as I initialized them).
So I wanted to know if anyone had any idea how to keep this information.
If a plugin wants to save information only for itself, and it doesn't need to make it accessible another way (from a PHP website, for example), you can use the YAML format.
Create the config file:
File usersFile = new File(plugin.getDataFolder(), "user-data.yml");
if (!usersFile.exists()) { // doesn't exist yet
    usersFile.createNewFile();
    // OR you can copy a default file, but then the plugin jar should contain that default file:
    /*
    try (InputStream in = plugin.getResource("user-data.yml");
         OutputStream out = new FileOutputStream(usersFile)) {
        ByteStreams.copy(in, out);
    } catch (Exception e) {
        e.printStackTrace();
    }
    */
}
Load the file as YAML content:
YamlConfiguration config = YamlConfiguration.loadConfiguration(usersFile);
Edit the content:
config.set(playerUUID, myVar);
Save the content:
config.save(usersFile);
Also, I suggest making the I/O (read & write) async with the scheduler, for example:
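A minimal sketch, assuming the same plugin, config and usersFile variables as above:
// Run the blocking disk write off the main server thread.
Bukkit.getScheduler().runTaskAsynchronously(plugin, () -> {
    try {
        config.save(usersFile);
    } catch (IOException e) {
        e.printStackTrace();
    }
});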
Bonus:
If you want to make ONE config file per user, with a default config, do it like this:
File oneUsersFile = new File(plugin.getDataFolder(), playerUUID + ".yml");
if (!oneUsersFile.exists()) { // doesn't exist yet
    try (InputStream in = plugin.getResource("my-def-file.yml");
         OutputStream out = new FileOutputStream(oneUsersFile)) {
        ByteStreams.copy(in, out); // copy the default to the new file
    } catch (Exception e) {
        e.printStackTrace();
    }
}
YamlConfiguration userConfig = YamlConfiguration.loadConfiguration(oneUsersFile);
PS: the variable plugin is the instance of your plugin, i.e. the class that extends JavaPlugin.
You can use PersistentDataContainers:
To read data from a player, use
PersistentDataContainer p = player.getPersistentDataContainer();
int blocksBroken = p.get(new NamespacedKey(plugin, "blocks_broken"), PersistentDataType.INTEGER); // You can also use DOUBLE, STRING, etc.
The NamespacedKey refers to the name of, or pointer to, the data being stored. The PersistentDataType refers to the type of data being stored, which can be any Java primitive type or a String. To write data to a player, use:
p.set(new NamespacedKey(plugin, "blocks_broken"), PersistentDataType.INTEGER, blocksBroken + 1);
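Note that get returns null for a player that has never had the key set, which would throw a NullPointerException when unboxing to int. A small guard, assuming the same plugin and key as above, can use getOrDefault instead:
// Fall back to 0 for players who do not have the key yet.
int blocksBroken = p.getOrDefault(new NamespacedKey(plugin, "blocks_broken"), PersistentDataType.INTEGER, 0);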

scriptella executor with xml as string instead of file

I am trying to use Scriptella in my project to copy data from one DB to another. The application has a frontend which users can use to create mappings between tables and build dynamic queries. Currently, once the user submits, the frontend queries are passed through a query engine and a Scriptella XML is created using a FreeMarker template.
However, to execute the XML, the executor expects a file instead of an XML string. Currently I am achieving this by creating an XML file in a temp directory and deleting it after the query has executed. Is there any way I can skip the file creation and execute the query as an XML string?
You can create a custom URLStreamHandler that will serve streams directly from memory. This is similar to what was done in AbstractTestCase. It can be registered by calling URL.setURLStreamHandlerFactory. See Registering and using a custom java.net.URL protocol or Is it possible to create an URL pointing to an in-memory object?
After that, use EtlExecutor.newExecutor(java.net.URL) with the new URL, e.g. new URL("memory://file").
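A rough sketch of what such a handler could look like, assuming a simple static map from a name to the XML string (MemoryUrlHandler and its registration are illustrative, not part of the Scriptella API):
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MemoryUrlHandler extends URLStreamHandler {
    // In-memory "files": memory://<name> -> XML content.
    private static final Map<String, String> FILES = new ConcurrentHashMap<>();

    public static void put(String name, String xml) {
        FILES.put(name, xml);
    }

    @Override
    protected URLConnection openConnection(URL u) throws IOException {
        final String xml = FILES.get(u.getHost());
        if (xml == null) {
            throw new IOException("No in-memory content registered for " + u);
        }
        return new URLConnection(u) {
            @Override
            public void connect() {
                // nothing to do, the content is already in memory
            }

            @Override
            public InputStream getInputStream() {
                return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
            }
        };
    }
}
Usage would then look roughly like this (setURLStreamHandlerFactory may only be called once per JVM, and xmlGeneratedByFreemarker stands in for your generated XML string):
// Register the "memory" protocol once at startup.
URL.setURLStreamHandlerFactory(protocol -> "memory".equals(protocol) ? new MemoryUrlHandler() : null);
MemoryUrlHandler.put("file", xmlGeneratedByFreemarker);
EtlExecutor.newExecutor(new URL("memory://file")).execute();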
I had a similar use case. I downloaded the code and made a small change in the core. Due to some private functions I had no choice.
In scriptella.configuration.ConfigurationFactory I added the following function:
public ConfigurationEl createConfigurationFromTxt(String xml, final ParametersCallback externalParameters) {
    try {
        DocumentBuilder db = DBF.newDocumentBuilder();
        db.setEntityResolver(ETL_ENTITY_RESOLVER);
        db.setErrorHandler(ETL_ERROR_HANDLER);
        final InputStream in = new ByteArrayInputStream(xml.getBytes());
        final Document document = db.parse(in);
        HierarchicalParametersCallback params = new HierarchicalParametersCallback(
                externalParameters == null ? NullParametersCallback.INSTANCE : externalParameters, null);
        PropertiesSubstitutor ps = new PropertiesSubstitutor(params);
        return new ConfigurationEl(new XmlElement(
                document.getDocumentElement(), resourceURL, ps), params);
    } catch (IOException e) {
        throw new ConfigurationException("Unable to load document: " + e, e);
    } catch (Exception e) {
        throw new ConfigurationException("Unable to parse document: " + e, e);
    }
}
Then from my code I can do something like this:
ConfigurationFactory cf = new ConfigurationFactory();
ConfigurationEl conf = cf.createConfigurationFromTxt(FETCH_ETLS, p);
EtlExecutor exec = new EtlExecutor(conf);

part file empty while running pig in Eclipse using libraries

I ran a sample pig script in mapreduce mode and it ran successfully.
My pigscript:
allsales = load 'sales' as (name,price,country);
bigsales = filter allsales by price >999;
sortedbigsales = order bigsales by price desc;
store sortedbigsales into 'topsales';
Now, I am trying to implement that in Eclipse (currently I am running it using libraries).
One doubt: does Pig local mode mean that we need a Hadoop installation by default?
IdLocal.java:
public class IdLocal {
    public static void main(String[] args) {
        try {
            PigServer pigServer = new PigServer("local");
            runIdQuery(pigServer, "/home/sreeveni/myfiles/pig/data/sales");
        } catch (Exception e) {
        }
    }

    public static void runIdQuery(PigServer pigServer, String inputFile)
            throws IOException {
        pigServer.registerQuery("allsales = load '" + inputFile + "' as (name,price,country);");
        pigServer.registerQuery("bigsales = filter allsales by price >999;");
        pigServer.registerQuery("sortedbigsales = order bigsales by price desc;");
        pigServer.store("sortedbigsales", "/home/sreeveni/myfiles/OUT/topsalesjava");
    }
}
The console is showing success for me, but my part file is empty.
Why is it so?
1) Local mode Pig does not mean that you have to have Hadoop installed. You can run it without Hadoop and HDFS. Everything will be performed single-threaded on your machine, and it should read/write from your local filesystem by default.
2) Regarding your empty output, ensure that your input file exists on your local filesystem and that it has records with a 'price' field greater than 999; you could be filtering them all out otherwise. Also, Pig defaults to tab-separated files. Is your inputFile tab-separated? If not, then your schema definition will have the 'name' field hold the entire row in the file, and 'price' and 'country' will always be null; see the sketch below for specifying a different delimiter.
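For instance, if the file were comma-separated, the load in runIdQuery could look like this (a hypothetical variant; adjust the delimiter to whatever your data actually uses):
// PigStorage(',') sets the field delimiter, and the explicit types avoid
// relying on implicit casts in the price > 999 filter.
pigServer.registerQuery("allsales = load '" + inputFile
        + "' using PigStorage(',') as (name:chararray, price:int, country:chararray);");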
hope that helps

changing log4j file name programmatically in osgi maven bundle not working

I'm developing a Maven OSGi bundle and deploying it in Karaf. In it, a piece of code should get .cfg files from karaf/etc, and I'm programmatically changing them at runtime. writeTrace() is invoked within a for loop from another class, so that I can create different files and the corresponding logging should go into each file.
public void writeLog(int i, String HostName) {
    StringBuilder sb = new StringBuilder();
    sb.append("\n HEADER : \n");
    ....
    String str = sb.toString();
    String logfile = ("/home/Dev/" + HostName + i);
    logger = LoggerFactory.getLogger("TracerLog");
    updateLog4jConfiguration(logfile);
    logger.error(str + i);
}

public void updateLog4jConfiguration(String logFile) {
    Properties props = new Properties();
    try {
        // InputStream configStream = getClass().getResourceAsStream(
        //         "/home/Temp-files/NumberGenerator/src/main/java/log4j.properties");
        InputStream configStream = new FileInputStream("etc/org.ops4j.pax.logging.cfg");
        props.load(configStream);
        System.out.println(configStream);
        configStream.close();
    } catch (IOException e) {
        System.out.println("Error: Cannot load configuration file ");
    }
    props.setProperty("log4j.appender.Tracer.File", logFile);
    LogManager.resetConfiguration();
    PropertyConfigurator.configure(props);
}
I am able to see the new files created with the hostname, such as hostname_1, hostname_2, etc., but logging happens only in the actual appender configured in karaf/etc, that is, log.txt:
log4j.logger.TracerLog=TRACE,Tracer
log4j.appender.Tracer=org.apache.log4j.RollingFileAppender
log4j.appender.Tracer.MaxBackupIndex=10
log4j.appender.Tracer.MaxFileSize=500KB
log4j.appender.Tracer.File=/home/Dev/log.txt
I got stuck on this error. I don't know whether it has something to do with Karaf or whether it is a problem with my code.
Why aren't you just using the ConfigurationAdmin service for this, instead of altering the file?
Just reference the ConfigurationAdmin service from the registry and take the configuration with the PID org.ops4j.pax.logging.
With this approach you will have all configuration properties available for your purpose, and it is up to your code to alter them. It's also possible for you to add new configuration entries. In the end, the combination of the ConfigurationAdmin service and the Felix FileInstall will even persist your changes back to the configuration file.
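A minimal sketch of that approach, assuming you already have a ConfigurationAdmin reference injected (for example via Declarative Services or a ServiceTracker); the appender property name is the one from your .cfg:
import java.util.Dictionary;
import java.util.Hashtable;
import org.osgi.service.cm.Configuration;
import org.osgi.service.cm.ConfigurationAdmin;

public void updateTracerFile(ConfigurationAdmin configAdmin, String logFile) throws Exception {
    // Look up the pax-logging configuration by its PID instead of parsing etc/org.ops4j.pax.logging.cfg.
    Configuration config = configAdmin.getConfiguration("org.ops4j.pax.logging", null);
    Dictionary<String, Object> props = config.getProperties();
    if (props == null) {
        props = new Hashtable<>();
    }
    props.put("log4j.appender.Tracer.File", logFile);
    // update() pushes the change to pax-logging; with Felix FileInstall it can also be
    // persisted back to the .cfg file.
    config.update(props);
}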
By the way, did you know that there is a shell command for managing configurations, so you can also alter the configuration for the org.ops4j.pax.logging service?
Just do a:
config:list
to retrieve all configurations available
and a
config:list "(service=org.ops4j.pax.logging)"
to retrieve just this information.

How Can I Read The Next Row From A CSV Data Set Config In JMeter?

I am in the process of creating a test plan in JMeter which visits a random number of pages (from 2 to 10), whose URLs are to be fetched from a CSV Data Set. I have created the CSV Data Set and the samplers, which are working fine, except that only one row is read from the Data Set per thread, which is not what I need - I want a new row to be read after the sampler has completed (or before, I'm not fussed).
I saw that this question is very similar and the solution was to use the Raw Data Source Pre-Processor, which does work but requires arduous alterations to the file in question (adding chunk sizes before each line), which is a bit of a pain when the file is about 500 lines long.
Is there a way I can set the CSV Data Set to advance to the next row on reading, or use some post- or pre-processor, such as BeanShell, in order to do this? I have seen people state that CSVRead can do this, but its access is per-thread, which would be no good for me.
As a side note - ultimately all I want to do is access a random line in the file which gets passed to an HTTP sampler; if there is an easier or better way to do this, I'm open to suggestions.
For this you can possibly use BeanShell (= Java) code executed from a BeanShell Sampler / BeanShell PostProcessor / BeanShell PreProcessor.
The following code will read all the lines from your file and then select a single random one:
import java.text.*;
import java.io.*;
import java.util.*;

String[] params = Parameters.split(",");
String csvTest = params[0];
String csvDir = params[1]; // the second comma-separated parameter is the directory
ArrayList strList = new ArrayList();
try {
    File file = new File(System.getProperty("user.dir") + File.separator + csvDir + File.separator + csvTest);
    if (!file.exists()) {
        throw new Exception("ERROR: file " + csvTest + " not found in " + csvDir + " directory.");
    }
    BufferedReader bufRdr = new BufferedReader(new FileReader(file));
    String line = null;
    while ((line = bufRdr.readLine()) != null) {
        strList.add(line);
    }
    bufRdr.close();
    Random rnd = new java.util.Random();
    vars.put("csvUrl", strList.get(rnd.nextInt(strList.size())));
}
catch (Exception ex) {
    IsSuccess = false;
    log.error(ex.getMessage());
    System.err.println(ex.getMessage());
}
catch (Throwable thex) {
    System.err.println(thex.getMessage());
}
Then you can access the extracted URL via a variable (${csvUrl} in this example).
My only doubt is whether reading the full file on each iteration (if you have to execute this in a loop) is a good solution from a performance point of view.
