Get fields by name in Pig? - java

Currently I have a simple Pig script which reads from a file on a Hadoop filesystem:
my_input = LOAD 'input_file' AS (A, B, C);
and then I have another line of code which needs to manipulate the fields, for instance to convert them to uppercase (as in the Pig UDF tutorial).
I do something like:
manipulated = FOREACH my_input GENERATE myudf.Upper(A, B, C);
Now in my Upper.java file I know that I can get the values of A, B, and C (assuming they are all Strings) like this:
public String exec(Tuple input) throws IOException
{
//yada yada yada
....
String A = (String) input.get(0);
String B = (String) input.get(1);
String C = (String) input.get(2);
//yada yada yada
....
}
Is there any way I can get the value of a field by its name? For instance, if I need to get 10 fields, is there no other way than to call input.get(i) for i from 0 to 9?
I am new to Pig, so I am interested in knowing why this is the case. Is there something like a tuple.getByFieldName('Field Name')?

This is not possible, nor would it be very good design to allow it. Pig field names are like variable names. They allow you to give a memorable name to something that gives you insight into what it means. If you use those names in your UDF, you are forcing every Pig script which uses the UDF to adhere to the same naming scheme. If you decide later that you want to think of your variables a little differently, you can't reflect that in their names because the UDF would not function anymore.
The code that reads data from the input tuple in your UDF is like a function declaration. It establishes how to treat each argument to the function.
If you really want to be able to do this, you can build a map easily enough using the TOMAP builtin function, and have your UDF read from the map. This greatly hurts the reusability of your UDF for the reasons mentioned above, but it is nevertheless a fairly simple workaround.

While I agree that function flexibility would be affected if you use field names, technically it is possible to access fields by names.
The trick is to use inputSchema available through getInputSchema() and get the mapping between field indexes and names from there. You can also override outputSchema and build the mapping there, using inputSchema parameter. Then you would be able to use this mapping in your exec method.
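The name-to-index mapping described above can be sketched in plain Java. This is only an illustration of the idea: no Pig classes are used, FieldIndex and getByName are hypothetical names, and the list of aliases is assumed to have been read from the input schema in field order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the idea only: no Pig classes are used here.
final class FieldIndex {
    private final Map<String, Integer> positions = new HashMap<>();

    // aliases: field names in schema order (in a UDF, taken from the input schema)
    FieldIndex(List<String> aliases) {
        for (int i = 0; i < aliases.size(); i++) {
            positions.put(aliases.get(i), i);
        }
    }

    // values: stand-in for the tuple's fields, in the same order as the aliases
    Object getByName(List<Object> values, String alias) {
        Integer pos = positions.get(alias);
        if (pos == null) {
            throw new IllegalArgumentException("Unknown field: " + alias);
        }
        return values.get(pos);
    }

    public static void main(String[] args) {
        FieldIndex index = new FieldIndex(Arrays.asList("A", "B", "C"));
        List<Object> row = Arrays.<Object>asList("foo", "bar", "baz");
        System.out.println(index.getByName(row, "B")); // prints "bar"
    }
}
```

Building the map once and reusing it in exec keeps the per-tuple cost to a single hash lookup, but the UDF is then tied to the field names used by the calling script.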

I don't think you can access a field by name. You need a map-like structure to achieve that. In Pig's context, even though you cannot do it by name, you can still rely on position if the schema of the input (load) is properly defined and consistent.
The most you can do is validate the types of the fields you are ingesting in the UDF.
On the other hand, you can implement outputSchema in your UDF to publish its output by name.
UDF Manual

Related

get a getter method from field name to avoid if-else

I have this code, which obviously doesn't look nice - it seems all the if-else can somehow be avoided.
if(sortBy.equals("firstName"))
personList.sort(Comparator.comparing(Person::getFirstName));
else if(sortBy.equals("lastName"))
personList.sort(Comparator.comparing(Person::getLastName));
else if(sortBy.equals("age"))
personList.sort(Comparator.comparing(Person::getAge));
else if(sortBy.equals("city"))
personList.sort(Comparator.comparing(Person::getCity));
else if(sortBy.equals("state"))
personList.sort(Comparator.comparing(Person::getState));
else if(sortBy.equals("zipCode"))
personList.sort(Comparator.comparing(Person::getZipCode));
The function takes sortBy, which is the name of one of the attributes of a Person, and sorts personList based on that field. How can I avoid the if-else and write better looking, possibly one-line code?
Currently I have found that I can use a HashMap to create a mapping between a field name and a corresponding comparator.
map.put("age", Comparator.comparing(Person::getAge));
map.put("firstName", Comparator.comparing(Person::getFirstName));
...
And use personList.sort(map.get(sortBy)).
But still felt like it can further be improved without an extra step, to the point where it follows the open-closed principle, and adding a new field to Person would not need us to modify the code. I'm looking for something like
personList.sort(Comparator.comparing(Person::getterOfField(sortBy)))
UPDATE-1
For now, I decided to stick with using a Map<String, Function<Person, Comparable<?>>> and I would rather not consider reflection-based solutions. But I am still searching for a similar way to this one where the sort key is a parameter.
UPDATE-2
I think a one-liner is not a good solution, because you wouldn't get a compile-time error if one of the fields does not implement Comparable.
In general Java doesn't want you to work with it this way [1]; it is not a structurally typed language, and unlike e.g. JavaScript or Python, objects aren't "hashmaps of strings to thingies".
Also, your request more fundamentally doesn't add up: You can't just go from "field name" to "sort on that". What if the field's type isn't inherently sortable (is not a subtype of Comparable<Self>)?
What if there is a column in whatever view we're talking about / config file that is 'generated'? Imagine you have a field LocalDate birthDate; but you have a column 'birth month' [2]. You can sort on birth month, no problem. However, given that it's a 'generated value' (not backed directly by a field, instead, derived from a calculation based on field(s)), you can't just sort on this. You can't even sort on the backing field (as that would sort by birth year first, not what you want), nor does 'backing field' make sense; what if the virtual column is based on multiple fields?
It is certainly possible that currently you aren't imagining either virtual columns or fields whose type isn't self-sortable and that therefore you want to deposit a rule that for this class, you close the door on these two notions until a pretty major refactor, but it goes to show perhaps why "java does not work that way" is in fact somewhat 'good' (closely meshes with real life concerns), and why your example isn't as boilerplatey as you may have initially thought: No, it is not, in fact, inevitable. Specifically, you seem to want:
There is an exact 1-to-1 match between 'column sort keys' and field names.
The strategy to deliver on the request to sort on a given column sort key is always the same: Take the column sort key. Find the field (it has the same name); now find its getter. Create a comparator based on comparing get calls; this getter returns a type that has a natural sorting order guaranteed.
Which are 2 non-obvious preconditions that seem to have gotten a bit lost. At any rate, a statement like:
if(sortBy.equals("firstName"))
personList.sort(Comparator.comparing(Person::getFirstName));
encodes these 2 non-obvious properties explicitly, and therefore trivially leaves room for virtual columns as well as sort keys that work differently (for example, one that sorts on birth month, or one that uses some explicit comparator you write for the purpose). Or even one that sorts case-insensitively; strings do not do that by default, so you'd have to sort with String.CASE_INSENSITIVE_ORDER instead.
It strikes me as a rather badly written app if a change request comes in with: "Hey, could you make the sort option that sorts on patient name be case insensitive?" and you go: "Hooo boy that'll be a personweek+ of refactoring work!", no?
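A minimal sketch of sort keys that do not map one-to-one onto fields, assuming a cut-down, hypothetical Patient class with only a name and a birth date. The class and method names (Patient, SortKeys, forKey) are illustrative and not from the question.

```java
import java.time.LocalDate;
import java.util.Comparator;

// Hypothetical, cut-down data class for illustration.
final class Patient {
    final String name;
    final LocalDate birthDate;

    Patient(String name, LocalDate birthDate) {
        this.name = name;
        this.birthDate = birthDate;
    }
}

final class SortKeys {
    // Each branch decides for itself how its sort key is computed.
    static Comparator<Patient> forKey(String sortBy) {
        if (sortBy.equals("name")) {
            // Case-insensitive, unlike String's natural order.
            return Comparator.comparing((Patient p) -> p.name, String.CASE_INSENSITIVE_ORDER);
        } else if (sortBy.equals("birthMonth")) {
            // Virtual column: derived from birthDate, not a field of its own.
            return Comparator.comparing((Patient p) -> p.birthDate.getMonth());
        }
        throw new IllegalArgumentException("Unknown sort key: " + sortBy);
    }
}
```

Because every key has its own branch, changing how one column sorts is a local edit rather than a refactor.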
But, if you insist, you have 2 broad options:
Reflection
Reflection lets you write code that programmatically gets a list of field names and method names, and it can also be used to call them programmatically. You can fetch a list of method names and filter out everything except:
instance methods
with no arguments
whose name starts with get
And do a simple-ish get-prefix-to-sort-key conversion (basically, .substring(3) to lop off the get, then lowercase the first character, though note that the rules for getter to field name get contradictory if the first 'word' of the field is a single letter, such as getXAxis, where half of the beanspec documents say the field name is definitely XAxis, as xAxis would have become getxAxis, and the other half say it is ambiguous and could mean the field name is XAxis or xAxis).
It looks something like this:
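// intentionally raw type!
Map comparators = new HashMap();
for (Method m : Person.class.getMethods()) {
    if (Modifier.isStatic(m.getModifiers())) continue;
    if (m.getParameterCount() != 0) continue;
    String n = m.getName();
    // skip non-getters, and getClass(), which every object inherits
    if (!n.startsWith("get") || n.length() < 4 || n.equals("getClass")) continue;
    n = Character.toLowerCase(n.charAt(3)) + n.substring(4);
    // the cast gives the lambda a target type; a raw Map.put only sees Object
    comparators.put(n, (Comparator) (a, b) -> {
        try {
            Object aa = m.invoke(a);
            Object bb = m.invoke(b);
            return ((Comparable) aa).compareTo(bb);
        } catch (ReflectiveOperationException e) {
            // invoke() throws checked exceptions; rethrow unchecked
            throw new IllegalStateException(e);
        }
    });
}
MyClass.COMPARATORS = (Map<String, Comparator<?>>) Collections.unmodifiableMap(comparators);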
Note how this causes a boatload of warnings because you just chucked type checking out the window: there is no actual way to ensure that any given getter type actually is an appropriate Comparable. The warnings are correct and you have to ignore them, no fixing that, if you go this route.
You also get a ton of checked exceptions issues that you'll have to deal with by catching them and rethrowing something appropriate; possibly RuntimeException or similar if you want to disregard the need to deal with them by callers (some RuntimeException is appropriate if you consider any attempt to add a field of a type that isn't naturally comparable 'a bug').
Annotation Processors
This is a lot more complicated: You can stick annotations on a method, and then have an annotation processor that sees these and generates a source file that does what you want. This is more flexible and more 'compile time checked', in that you can e.g. check that things are of an appropriate type, or add support for mentioning a class in the annotation that is an implementation of Comparable<T>, T being compatible with the type of the field you so annotate. You can also annotate methods themselves (e.g. a public Month getBirthMonth() method). I suggest you search the web for an annotation processor tutorial, it'd be a bit much to stuff an example in an SO answer. Expect to spend a few days learning and writing it, it won't be trivial.
[1] This is a largely objective statement. Falsifiable elements: There are no field-based 'lambda accessors'; no foo::fieldName support. Java does not support structural typing and there is no way to refer to things in the language by name alone, only by fully qualified name (you can let the compiler infer things, but the compiler always translates what you write to a fully "named" reference (package name, type name that the thing you are referring to is in, and finally the name of the method or field) and then sticks that in the class file).
[2] At least in the Netherlands it is somewhat common to split patient populations up by birth month (as a convenient way to split a population into 12 roughly equally sized, mostly arbitrary chunks) e.g. for inviting them in for a checkup or a flu shot or whatnot.
Assuming that the sortBy values and the corresponding getters are known at compile time, this would be a good place to use a string switch statement:
Function<Person, String> getter = null;
switch (sortBy) {
case "firstName":
getter = Person::getFirstName; break;
case "lastName":
getter = Person::getLastName; break;
...
}
personList.sort(Comparator.comparing(getter));
If you use a recent version of Java (Java 14 and later; it was a preview feature in Java 12 and 13) you could use a switch expression rather than a switch statement.
Function<Person, String> getter;
getter = switch (sortBy) {
case "firstName" -> Person::getFirstName;
case "lastName" -> Person::getLastName;
...
default -> null;
};
personList.sort(Comparator.comparing(getter));
Note: you should do a better job (than my dodgy code) of dealing with the case where the sortBy value is not recognized.
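One possible way to handle an unrecognized sortBy value is to fail fast in the default branch instead of passing null to Comparator.comparing. The sketch below uses a plain switch statement so it does not depend on switch expressions; the two-field Person class and the PersonSort.by helper are assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.function.Function;

// Hypothetical, cut-down Person with two String attributes.
final class Person {
    private final String firstName;
    private final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    String getFirstName() { return firstName; }
    String getLastName() { return lastName; }
}

final class PersonSort {
    static Comparator<Person> by(String sortBy) {
        Function<Person, String> getter;
        switch (sortBy) {
            case "firstName":
                getter = Person::getFirstName;
                break;
            case "lastName":
                getter = Person::getLastName;
                break;
            default:
                // Reject unknown keys here rather than failing later on a null getter.
                throw new IllegalArgumentException("Unknown sort key: " + sortBy);
        }
        return Comparator.comparing(getter);
    }
}
```

A call such as personList.sort(PersonSort.by(sortBy)) then either sorts or reports the bad key by name.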
As keshlam suggested, I think using the reflection API is the best fitting answer to your question, but keep in mind that using it in production code is generally discouraged.
Note: if you add a new Person-attribute which isn't itself Comparable, you'll have to resort to a custom Comparator anyway. With that in mind, you might want to keep the Map<String, Comparator<?>> solution you already have.

Dynamic Named SQL Fields

So I've got a bot that serves as a roleplaying manager, handling combat, skill points and the like. I'm trying to make my code a bit more general so I can have fewer pages, since they all do the same thing and just have different initializers. But I ran into a snag: I need to check whether the user has a minimum in a particular stat (Strength, Perception, Agility, etc.),
so I call
mainSPECIAL = rows[0].Strength;
Here's the rub: whether it's strength, perception, intelligence, luck, or whatever, I'm always going to be checking that same attribute on rows[0], i.e. rows[0].Luck for luck perks, and I already set earlier in my initializers
var PERKSPECIALName = "Strength";
But I can't call
mainSPECIAL = rows[0].PERKSPECIALName, but there should be a way to do that, right? So that when it sees "rows[0].PERKSPECIALName" it looks up "PERKSPECIALName" and then fetches the value of rows[0].Strength.
For this you need to use reflection:
Field f1 = rows[0].getClass().getField(PERKSPECIALName);
Integer attribute = (Integer) f1.get(rows[0]);
Where "Integer" is the type of the element you're pulling from the object (the type of strength).
The field must be declared public! I think there is a way to obtain fields that are not public, but it requires more code.
Seems like you have a set of integers that you need to identify with a constant identifier. You might find an EnumMap useful. Have a look at How to use enumMap in java.
Or if you want to only use a string to identify which perk you want to reference, just use a Map.
Java doesn't have reference-to-member like some other languages, so if you don't want to change your data structure, you are looking at using lambda functions or heavier language features to increase re-use, which seems like overkill for what you're trying to do.
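The EnumMap suggestion could look roughly like this. It is a sketch under assumptions: the Stat enum, the CharacterSheet class and its methods are hypothetical, and the stat names are taken from the ones mentioned in the question.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical set of stats, named after the ones in the question.
enum Stat { STRENGTH, PERCEPTION, AGILITY, INTELLIGENCE, LUCK }

final class CharacterSheet {
    private final Map<Stat, Integer> stats = new EnumMap<>(Stat.class);

    void set(Stat stat, int value) {
        stats.put(stat, value);
    }

    // Looks a stat up by its name, e.g. a name held in a variable.
    // Stat.valueOf throws IllegalArgumentException for an unknown name.
    int get(String statName) {
        Stat stat = Stat.valueOf(statName.toUpperCase(Locale.ROOT));
        return stats.getOrDefault(stat, 0); // stats that were never set count as 0
    }

    boolean meetsMinimum(String statName, int minimum) {
        return get(statName) >= minimum;
    }
}
```

With this shape, a perk only needs to carry the stat name and the minimum, and the check is the same call for every stat.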

Is there a Parameter Tree implementation in Java?

My Java program takes a long list of inputs (parameters), churns a bit and spits out some output.
I need a way to organize these parameters in a sane way so in the input txt file I want to write them like this:
parameter1 = 12
parameter2 = 10
strategy1.parameter1 = "goofy"
strategy2.parameter4 = 100.0
Then read this txt file, turn it into a Java object I can pass around to objects when I instantiate them.
I know pyqtgraph has ParameterTree, which is handy to use; is there something similar in Java? I am sure others must have had the same need, so I don't want to reinvent the wheel.
(other tree structures would also be fine, of course, I just wanted something easy to read)
One way is to turn input.txt into input.json:
{
"parameter1": 12,
"parameter2": 10,
"strategy1": {
"parameter1": "goofy"
},
"strategy2": {
"parameter4": 100.0
}
}
Then use Jackson to deserialize input.json into one of these:
(1) A Map<String, Object> instance, which you could navigate in depth to get all your parameters
(2) An instance of some class of your own that mimics input.json's structure, where your parameters would reside
(3) A JsonNode instance that would be the root of the tree
(1) has the advantage that it's easy and you don't have to create any class to read the parameters, however you'd need to traverse the map, downcast the values you get from it, and you'd need to know the keys in advance (keys match json object's attribute names).
(2) has the advantage that everything would be correctly typed upon deserialization; no need to downcast anything, since every type would be a field of your own classes which represent the structure of the parameters. However, if the structure of your input.json file changed, you would need to change the structure of your classes as well.
(3) is the most flexible of all, and I believe it's the option that is closest to what you have in mind, nonetheless is the most tedious to work with, since it's too low-level. Please refer to this article for further details.
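To illustrate the traversal and downcasting that the Map<String, Object> option involves, here is a small sketch that walks nested maps along a dotted path such as strategy1.parameter1. It uses only the standard library; the nested map is built by hand as a stand-in for what a JSON deserializer would produce, and Params.lookup is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.Map;

final class Params {
    // Walks nested maps along a dotted path such as "strategy1.parameter1".
    static Object lookup(Map<String, Object> root, String path) {
        Object current = root;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                throw new IllegalArgumentException("Not a nested object before '" + key + "' in " + path);
            }
            Map<?, ?> node = (Map<?, ?>) current;
            if (!node.containsKey(key)) {
                throw new IllegalArgumentException("Missing parameter: " + path);
            }
            current = node.get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the deserialized input.json.
        Map<String, Object> strategy1 = new HashMap<>();
        strategy1.put("parameter1", "goofy");
        Map<String, Object> root = new HashMap<>();
        root.put("parameter1", 12);
        root.put("strategy1", strategy1);

        // The caller still has to downcast to the type it expects.
        String name = (String) lookup(root, "strategy1.parameter1");
        System.out.println(name); // prints "goofy"
    }
}
```

The explicit cast at the call site is the downcasting cost mentioned for this option; a typed class of your own moves that check into deserialization instead.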

4 Key Value HashMap? Array? Best Approach?

I've got loads of the following to implement.
validateParameter(field_name, field_type, field_validationMessage, visibleBoolean);
Instead of having 50-60 of these in a row, is there some form of nested hashmap/4d array I can use to build it up and loop through them?
What's the best approach for doing something like that?
Thanks!
EDIT: Was 4 items.
What you could do is create a new Class that holds three values. (The type, the boolean, and name, or the fourth value (you didn't list it)). Then, when creating the HashMap, all you have to do is call the method to get your three values. It may seem like more work, but all you would have to do is create a simple loop to go through all of the values you need. Since I don't know exactly what it is that you're trying to do, all I can do is provide an example of what I'm trying to do. Hope it applies to your problem.
Anyways, creating the Class to hold the three(or four) values you need.
For example,
class Fields {
    String field_name;
    Integer field_type;
    Boolean validationMessageVisible;

    Fields(String name, Integer type, Boolean mv) {
        this.field_name = name;
        this.field_type = type;
        this.validationMessageVisible = mv;
    }
}
Then put them in a HashMap somewhat like this:
Map<String, Fields> map = new HashMap<>();
// name, type and visible stand for your own values
map.put(name, new Fields(name, type, visible));
NOTE: This is only going to work as long as these three or four values can all be stored together. For example if you need all of the values to be stored separately for whatever reason it may be, then this won't work. Only if they can be grouped together without it affecting the function of the program, that this will work.
This was a quick brainstorm. Not sure if it will work, but think along these lines and I believe it should work out for you.
You may have to make a few edits, but this should get you in the right direction
P.S. Sorry for it being so wordy, just tried to get as many details out as possible.
The other answer is close but you don't need a key in this case.
Just define a class to contain your three fields. Create a List or array of that class. Loop over the list or array calling the method for each combination.
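The list-and-loop idea might be sketched as follows. What validateParameter actually does is not shown in the question, so it is represented here by a hypothetical ParameterValidator interface with the same four arguments; FieldSpec and FieldSpecs are likewise illustrative names, and the field type is assumed to be a String.

```java
import java.util.List;

// One entry per call that used to be written out by hand.
final class FieldSpec {
    final String name;
    final String type;
    final String validationMessage;
    final boolean visible;

    FieldSpec(String name, String type, String validationMessage, boolean visible) {
        this.name = name;
        this.type = type;
        this.validationMessage = validationMessage;
        this.visible = visible;
    }
}

// Hypothetical stand-in for the question's validateParameter(...) method.
interface ParameterValidator {
    void validateParameter(String name, String type, String validationMessage, boolean visible);
}

final class FieldSpecs {
    // Replaces the 50-60 repeated calls with a single loop over the specs.
    static void validateAll(List<FieldSpec> specs, ParameterValidator validator) {
        for (FieldSpec spec : specs) {
            validator.validateParameter(spec.name, spec.type, spec.validationMessage, spec.visible);
        }
    }
}
```

Adding a field to validate then means adding one FieldSpec to the list rather than another call.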
The approach I'd use is to create a POJO (or some POJOs) to store the values as attributes and validate attribute by attribute.
Since you will often have the same validation per attribute type (e.g. dates and numbers can be validated by range, strings can be validated to ensure they're not null or empty, etc.), you could just iterate over these attributes using reflection (or even better, using annotations).
If you need to validate at the POJO level, you can still reuse these attribute-level validators via composition, while you add more specific validations as you go up in the abstraction level (going up means basic attributes -> POJOs -> POJOs that contain other POJOs -> etc.).
Passing several basic types as parameters of the same method is not good because the parameters themselves don't tell much and you can easily exchange two parameters of the same type by accident in the method call.

What's the best pattern to handle a table row datastructure?

The Facts
I have the following datastructure consisting of a table and a list of attributes (simplified):
class Table {
List<Attribute> m_attributes;
}
abstract class Attribute {}
class LongAttribute extends Attribute {}
class StringAttribute extends Attribute {}
class DateAttribute extends Attribute {}
...
Now I want to do different actions with this datastructure:
print it in XML notation
print it in textual form
create an SQL insert statement
create an SQL update statement
initialize it from a SQL result set
First Try
My first attempt was to put all these functionality inside the Attribute, but then the Attribute was overloaded with very different responsibilities.
Alternative
It feels like a visitor pattern could do the job very well instead, but on the other side it looks like overkill for this simple structure.
Question
What's the most elegant way to solve this?
I would look at using a combination of JAXB and Hibernate.
JAXB will let you marshal and unmarshal from XML. By default, properties are converted to elements with the same name as the property, but that can be controlled via @XmlElement and @XmlAttribute annotations.
Hibernate (or JPA) are the standard ways of moving data objects to and from a database.
The Command pattern comes to mind, or a small variation of it.
You have a bunch of classes, each of which is specialized to do a certain thing with your data class. You can keep these classes in a hashmap or some other structure where an external choice can pick one for execution. To do your thing, you call the selected Command's execute() method with your data as an argument.
Edit: Elaboration.
At the bottom level, you need to do something with each attribute of a data row.
This indeed sounds like a case for the Visitor pattern: Visitor simulates a double
dispatch operation, insofar as you are able to combine a variable "victim" object
with a variable "operation" encapsulated in a method.
Your attributes all want to be xml-ed, text-ed, insert-ed updat-ed and initializ-ed.
So you end up with a matrix of 5 x 3 classes to do each of these 5 operations
to each of 3 attribute types. The rest of the machinery of the visitor pattern
will traverse your list of attributes for you and apply the correct visitor for
the operation you chose in the right way for each attribute.
Writing 15 classes plus interface(s) does sound a little heavy. You can do this
and have a very general and flexible solution. On the other hand, in the time
you've spent thinking about a solution, you could have hacked together the code
to it for the currently known structure and crossed your fingers that the shape
of your classes won't change too much too often.
Where I thought of the command pattern was for choosing among a variety of similar
operations. If the operation to be performed came in as a String, perhaps in a
script or configuration file or such, you could then have a mapping from
"xml" -> XmlifierCommand
"text" -> TextPrinterCommand
"serial" -> SerializerCommand
...where each of those Commands would then fire up the appropriate Visitor to do
the job. But as the operation is more likely to be determined in code, you probably
don't need this.
I dunno why you'd store stuff in a database yourself these days instead of just using hibernate, but here's my call:
LongAttribute, DateAttribute, StringAttribute,… all have different internals (i.e. fields specific to them not present in Attribute class), so you cannot create one generic method to serialize them all. Now XML, SQL and plain text all have different properties when serializing to them. There's really no way you can avoid writing O(#subclasses of Attribute * #output formats) different methods of serializing.
Visitor is not a bad pattern for serializing. True, it's a bit overkill if used on non-recursive structures, but a random programmer reading your code will immediately grasp what it is doing.
Now for deserialization (from XML to object, from SQL to object) you need a Factory.
One more hint, for SQL update you probably want to have something that takes old version of the object, new version of the object and creates update query only on the difference between them.
In the end, I used the visitor pattern. Now looking back, it was a good choice.
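For reference, a minimal sketch of what a visitor over this structure could look like. It is not the asker's actual code: only two attribute types and one operation (XML output) are shown, the name and value fields and the XmlVisitor class are assumptions, and no XML escaping is done.

```java
import java.util.List;

// One method per attribute type; one visitor implementation per operation.
interface AttributeVisitor<R> {
    R visitLong(LongAttribute a);
    R visitString(StringAttribute a);
}

abstract class Attribute {
    final String name;
    Attribute(String name) { this.name = name; }
    abstract <R> R accept(AttributeVisitor<R> visitor);
}

final class LongAttribute extends Attribute {
    final long value;
    LongAttribute(String name, long value) { super(name); this.value = value; }
    @Override <R> R accept(AttributeVisitor<R> visitor) { return visitor.visitLong(this); }
}

final class StringAttribute extends Attribute {
    final String value;
    StringAttribute(String name, String value) { super(name); this.value = value; }
    @Override <R> R accept(AttributeVisitor<R> visitor) { return visitor.visitString(this); }
}

// The XML operation lives here, not in the Attribute classes (no escaping, sketch only).
final class XmlVisitor implements AttributeVisitor<String> {
    @Override public String visitLong(LongAttribute a) {
        return "<" + a.name + ">" + a.value + "</" + a.name + ">";
    }
    @Override public String visitString(StringAttribute a) {
        return "<" + a.name + ">" + a.value + "</" + a.name + ">";
    }
}

final class Table {
    final List<Attribute> m_attributes;
    Table(List<Attribute> attributes) { this.m_attributes = attributes; }

    // Applies one visitor to every attribute and concatenates the results.
    String render(AttributeVisitor<String> visitor) {
        StringBuilder sb = new StringBuilder();
        for (Attribute a : m_attributes) {
            sb.append(a.accept(visitor));
        }
        return sb.toString();
    }
}
```

A text printer or an SQL insert builder would be another AttributeVisitor implementation, leaving the Attribute classes unchanged.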
