I need to call the groupBy method on a Spark Dataset via Java interop from Clojure.
I only need to call this for one column, but the only groupBy signatures I can get to work involve multiple column names. The API seems to indicate that I should be able to use only one column name, but I cannot get this to work. What I really need is a good example to work from. What am I missing?
This does not work . . .
(-> a-dataset
    (.groupBy "a-column"))
This does . . .
(-> b-dataset
    (.groupBy "b-column" (into-array ["c-column"])))
The error message I receive says there is no groupBy method for dataset.
I know it is looking for a Column, but I don't know how to give it one.
I don't know a thing about Spark, but I think we can understand it better by translating this example from the Spark API documentation into Clojure:
// To create Dataset<Row> using SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");
people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), people.col("gender"))
  .agg(avg(people.col("salary")), max(people.col("age")));
We can assume that you already have a Dataset and you want to call .groupBy on it. The method that you are probably calling is the one that takes Column... as an argument. You were on the right track: variadic-argument methods in Java collect their arguments into an array, so this is just like receiving a Column[] as an argument.
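For reference, here is a sketch (not tested; based on the Spark 2.x Javadoc, with a placeholder column name) of why the single-string call works in plain Java but not through reflective interop:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.RelationalGroupedDataset;
import org.apache.spark.sql.Row;

class GroupByOverloads {
    // Dataset exposes, among others:
    //   groupBy(String col1, String... cols)
    //   groupBy(Column... cols)
    // The Java compiler silently supplies the empty varargs array,
    // which is why a single string argument compiles here.
    static RelationalGroupedDataset byOneColumn(Dataset<Row> people) {
        return people.groupBy("age");                // empty String[] added by the compiler
        // equivalent Column-based form:
        // return people.groupBy(people.col("age")); // Column... overload
    }
}
Clojure's interop does not do that varargs packing for you, which is why the trailing array has to be passed explicitly.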
The question, then, is how to get a Column from a Dataset. It seems you can call dataset.col(String colName) to get one. Putting everything together:
(.groupBy my-dataset (into-array Column [(.col my-dataset "a-column")]))
Again, I have no way to verify this, but I think this should help.
Related
I have this code that has been giving me an unexpectedly wrong result that I couldn't solve:
// A method that calls the collectDataRDD(logValues, rowData) method :
// ....
// my collectDataRDD(Values, rowData) method :
The problem is that when I try to run methods like getStatus() or getValidationDate() on the Data objects, which are the values of my Tuple2, I get only one output for all the objects in my JavaRDD, which is wrong because the JavaRDD contains multiple different objects. However, when I checked the keys of my Tuple2 it gave me correct results.
I have tried everything and still couldn't figure it out. Can anyone please help me solve this?
Thanks a lot in advance.
Verify if
ticketsrdd.foreach((Tuple2<String, Data> rowData) -> {
    collectLogDataRDD(logValues, rowData);
});
is what you want to do. This function is called for each element, one by one, and each Tuple2 holds only a single entry in that case.
JavaRDD<Tuple2<String, Data>> ticketsrdd = TransformToRDD.transformToRDD(transformer.transform());
DataStore.setData(tickets);
This will be something like a Map<String, Tuple2>: each Tuple2 has a String as its key and a Data object as its value.
Now when you write Data ticket = rowData._2; you are getting one Data object from one tuple. So for each tuple in ticketsrdd it is going to call collectLogDataRDD.
Let's say ticketsrdd has 100 elements; then it is going to call collectLogDataRDD 100 times, and each time ticket.getStatus() will also be called.
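For illustration, here is a minimal sketch of that per-element behaviour in local mode; the Data class is a hypothetical stand-in for the one in your code:
import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ForeachDemo {
    // Hypothetical stand-in for the Data class in the question.
    static class Data implements Serializable {
        private final String status;
        Data(String status) { this.status = status; }
        String getStatus() { return status; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("foreach-demo").setMaster("local[*]"));
        JavaRDD<Tuple2<String, Data>> ticketsrdd = sc.parallelize(Arrays.asList(
                new Tuple2<>("t-1", new Data("OPEN")),
                new Tuple2<>("t-2", new Data("CLOSED"))));
        // foreach visits one tuple per call, so _2() is a different Data
        // object each time and getStatus() is evaluated per element.
        ticketsrdd.foreach(rowData ->
                System.out.println(rowData._1() + " -> " + rowData._2().getStatus()));
        sc.stop();
    }
}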
This is what the code is doing. What different behavior do you expect?
Currently I have a simple Pig script which reads from a file on a Hadoop filesystem, as follows:
my_input = load 'input_file' as (A, B, C)
and then I have another line of code which needs to manipulate the fields, for instance converting them to uppercase (as in the Pig UDF tutorial).
I do something like,
manipulated = FOREACH my_input GENERATE myudf.Upper(A, B, C)
Now, in my Upper.java file, I know that I can get the values of A, B, C as follows (assuming they are all Strings):
public String exec(Tuple input) throws IOException
{
    // yada yada yada ...
    String A = (String) input.get(0);
    String B = (String) input.get(1);
    String C = (String) input.get(2);
    // yada yada yada ...
}
Is there any way I can get the value of a field by its name? For instance, if I need to get, say, 10 fields, is there no other way than to call input.get(i) from 0 to 9?
I am new to Pig, so I am interested in knowing why this is the case. Is there something like a tuple.getByFieldName('Field Name')?
This is not possible, nor would it be very good design to allow it. Pig field names are like variable names. They allow you to give a memorable name to something that gives you insight into what it means. If you use those names in your UDF, you are forcing every Pig script which uses the UDF to adhere to the same naming scheme. If you decide later that you want to think of your variables a little differently, you can't reflect that in their names because the UDF would not function anymore.
The code that reads data from the input tuple in your UDF is like a function declaration. It establishes how to treat each argument to the function.
If you really want to be able to do this, you can build a map easily enough using the TOMAP builtin function, and have your UDF read from the map. This greatly hurts the reusability of your UDF for the reasons mentioned above, but it is nevertheless a fairly simple workaround.
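For what it's worth, a sketch of that workaround (untested; UpperFromMap is a made-up name) might look like this, with the script passing the fields as a map, e.g. GENERATE myudf.UpperFromMap(TOMAP('A', A, 'B', B, 'C', C)):
import java.io.IOException;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperFromMap extends EvalFunc<String> {
    @Override
    @SuppressWarnings("unchecked")
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // The single argument is the map built by TOMAP in the script.
        Map<String, Object> fields = (Map<String, Object>) input.get(0);
        String a = (String) fields.get("A");
        return a == null ? null : a.toUpperCase();
    }
}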
While I agree that a function's flexibility would be affected if you used field names, technically it is possible to access fields by name.
The trick is to use the input schema, available through getInputSchema(), and get the mapping between field indexes and names from there. You can also override outputSchema and build the mapping there, using its input schema parameter. You would then be able to use this mapping in your exec method.
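A rough, untested sketch of that approach (class name and field aliases are just for illustration):
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class UpperByName extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        Schema schema = getInputSchema();
        if (input == null || schema == null) {
            return null;  // no schema declared in the script; nothing to look up
        }
        // Map each field alias (A, B, C, ...) to its position in the tuple.
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            positions.put(schema.getField(i).alias, i);
        }
        String a = (String) input.get(positions.get("A"));
        return a == null ? null : a.toUpperCase();
    }
}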
I don't think you can access a field by name. You would need a map-like structure to achieve that. In Pig's context, even though you cannot do it by name, you can still rely on position if the input (load) schema is properly defined and consistent.
The most you can do is validate the types of the fields you are ingesting in the UDF.
On the other hand, you can implement "outputSchema" in your UDF to publish its output by name.
UDF Manual
I have an application.conf file with a structure like the following:
poller {
datacenters = []
}
I would like to override "datacenters" on the command line.
For other configuration keys whose values are simple types (strings, numbers) I can override using -Dpath.to.config.value=<value>, and this works fine.
However, I can't seem to find a way to do this for lists. In the example above, I tried to set "datacenters" to ["SJC", "IAD"] like so: -Dpoller.datacenters="['SJC', 'IAD']", but I get an exception that the key value is a string, not a list.
Is there a way to signal to the typesafe config library that this value is a list?
An alternative syntax is implemented in version 1.0.1 for this:
-Dpoller.datacenters.0=SJC -Dpoller.datacenters.1=IAD
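Assuming that indexed syntax behaves as described, a minimal sketch of reading the value back with the library (class name is arbitrary) is:
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class PollerConfigDemo {
    // Run with: java -Dpoller.datacenters.0=SJC -Dpoller.datacenters.1=IAD ...
    public static void main(String[] args) {
        Config config = ConfigFactory.load();
        // The numerically indexed system properties are resolved into a list.
        System.out.println(config.getStringList("poller.datacenters")); // [SJC, IAD]
    }
}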
I had the same issue some weeks ago and finally dived into the source code to understand what was going on: this feature is not implemented, so it's not possible to define a list using a command-line argument.
Fixing it wouldn't be that hard, but someone needs to take the time to do it.
I don't know anything about the OutputTypes. I'm trying something like this:
output = collection.mapReduce(map, reduce, null,
        MapReduceCommand.OutputType.INLINE, null);
collection = output.getOutputCollection();
But the collection is null, because of the INLINE output type. I need the reduced collection because I need to reduce it further. How could I do this?
I finally found the solution to this:
output = collection.mapReduce(map, reduce, "mymap",
        MapReduceCommand.OutputType.REDUCE, null);
collection = output.getOutputCollection();
Note that you cannot store into the same target "mymap" again and again. You have to use a different name when you are looping, e.g. "mymap".concat(Integer.toString(i)).
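A sketch of that looping pattern with the legacy driver's DBCollection API (map, reduce and the number of passes are placeholders) could look like this:
import com.mongodb.DBCollection;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;

class RepeatedMapReduce {
    // Runs `passes` map-reduce rounds, feeding each round's output collection
    // into the next round; each pass writes to a differently named collection.
    static DBCollection reduceRepeatedly(DBCollection collection,
                                         String map, String reduce, int passes) {
        DBCollection current = collection;
        for (int i = 0; i < passes; i++) {
            MapReduceOutput output = current.mapReduce(map, reduce,
                    "mymap".concat(Integer.toString(i)),
                    MapReduceCommand.OutputType.REDUCE, null);
            current = output.getOutputCollection();
        }
        return current;
    }
}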
I had a situation where I had to use the setDataVector function. I was puzzled to see there was an extra second argument (Vector columnIdentifiers) in the function. I'm just resetting the data, so why do I need to send the column identifiers? And it doesn't keep the old column identifiers by default if I don't pass the second argument. It's irritating to have to initialize a vector with column identifiers just for this purpose. Any idea why it's been done like that?
From the actual code, it looks to me like the method could have been better named. Something like setDataAndColumns() makes more sense. The internal code looks like this:
this.dataVector = nonNullVector(dataVector);
this.columnIdentifiers = nonNullVector(columnIdentifiers);
Passing in null for columnIdentifiers will simply remove all the columns in the table. I guess your controller class needs to keep a copy of the columnIdentifiers to pass in as required.
The setDataVector(...) method is also invoked by all of the DefaultTableModel constructors, which is why it requires you to include both parameters.
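A small sketch of that idea (column names and method names are placeholders):
import java.util.Arrays;
import java.util.Vector;
import javax.swing.table.DefaultTableModel;

class TableController {
    // Keep the identifiers around so they can be re-supplied on every reset.
    private final Vector<String> columnIdentifiers =
            new Vector<>(Arrays.asList("Name", "Status"));
    private final DefaultTableModel model =
            new DefaultTableModel(new Vector<>(), columnIdentifiers);

    DefaultTableModel getModel() {
        return model;
    }

    void resetData(Vector<Vector<Object>> newRows) {
        // Passing null instead of columnIdentifiers here would drop all columns.
        model.setDataVector(newRows, columnIdentifiers);
    }
}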