I'm trying to filter a Spark DataFrame using a list in Java.
java.util.List<Long> selected = ....;
DataFrame result = df.filter(df.col("something").isin(????));
The problem is that the isin(...) method accepts a Scala Seq or varargs.
Passing in JavaConversions.asScalaBuffer(selected) doesn't work either.
Any ideas?
Use the stream method as follows (the array component type has to match the list's element type, Long here):
df.filter(col("something").isin(selected.stream().toArray(Long[]::new)));
A slightly shorter version would be:
df.filter(col("something").isin(selected.toArray()));
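For reference, here is a minimal end-to-end sketch (assuming Spark 2.x, where DataFrame is Dataset<Row>; the column name and the input data are made up for illustration):

import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class IsinFilterExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("isin-example").master("local[*]").getOrCreate();

        // illustrative single-column DataFrame
        StructType schema = new StructType().add("something", DataTypes.LongType);
        Dataset<Row> df = spark.createDataFrame(
                Arrays.asList(RowFactory.create(1L), RowFactory.create(2L), RowFactory.create(5L)),
                schema);

        List<Long> selected = Arrays.asList(1L, 2L, 3L);

        // toArray() produces an Object[], which matches isin's Object... varargs
        Dataset<Row> result = df.filter(col("something").isin(selected.toArray()));
        result.show();

        spark.stop();
    }
}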
I have a list of IDs in an ArrayList. Can anyone help me get the count of unique IDs from the list?
Assume the ArrayList contains:
ADD5C9
AA6F39
AA3D0D
AA48C9
8B9D48
63A859
ADD5C9
ADA162
AD9AD5
8B9D48
Can you please provide the code for counting the unique IDs in the list?
Thanks and Best Regards
Add all list elements to a Set to remove any duplicate values:
Set<String> set = new HashSet<>(theArrayList);
int numberOfUniques = set.size();
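Applied to the sample values from the question, a self-contained sketch looks like this (the duplicates ADD5C9 and 8B9D48 collapse, leaving 8 unique IDs):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueIdCount {
    public static void main(String[] args) {
        List<String> theArrayList = Arrays.asList(
                "ADD5C9", "AA6F39", "AA3D0D", "AA48C9", "8B9D48",
                "63A859", "ADD5C9", "ADA162", "AD9AD5", "8B9D48");

        // a HashSet keeps only one copy of each value
        Set<String> set = new HashSet<>(theArrayList);
        int numberOfUniques = set.size();

        System.out.println(numberOfUniques); // prints 8
    }
}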
Be aware that since JMeter 3.1 you should be using JSR223 Test Elements and the Groovy language for scripting.
In Groovy it is enough to call the unique() method on the ArrayList; it will remove all the duplicates:
def array = ['ADD5C9',
             'AA6F39',
             'AA3D0D',
             'AA48C9',
             '8B9D48',
             '63A859',
             'ADD5C9',
             'ADA162',
             'AD9AD5',
             '8B9D48']
def unique = array.unique()
unique.each { value -> log.info(value) }
More information: Apache Groovy - Why and How You Should Use It
Java-8+ solution
Or, you can use Stream::distinct, like so:
long count = list.stream().distinct().count();
How do I use an OR operation inside the when function in the Spark Java API? I want something like this, but I get a compiler error:
Dataset<Row> ds = ds1.withColumn("Amount2", when(ds2.col("Type").equalTo("A") Or ds2.col("Type").equalTo("B"), "Amount1").otherwise(0))
Can somebody please guide me with a sample expression?
You should use the or method:
ds2.col("Type").equalTo("A").or(ds2.col("Type").equalTo("B"))
Instead of chaining equalTo calls, isin should work just as well:
ds2.col("Type").isin("A", "B")
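Put together, the withColumn expression from the question might look like the sketch below (it assumes a single Dataset ds with Type and Amount1 columns, and that "Amount1" is meant as a column reference rather than a literal string):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

// when Type is "A" or "B", copy the Amount1 column, otherwise use 0
Dataset<Row> result = ds.withColumn("Amount2",
        when(col("Type").equalTo("A").or(col("Type").equalTo("B")),
                col("Amount1"))
        .otherwise(0));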
I'm translating an old enterprise app that uses C# with LINQ queries to Java 8. There are some of those queries I'm not able to reproduce using lambdas, as I don't know how C# handles them.
For example, in this LINQ query:
(from register in registers
 group register by register.muleID into groups
 select new Petition
 {
     Data = new PetitionData
     {
         UUID = groups.Key
     },
     Registers = groups.ToList<AuditRegister>()
 }).ToList<Petition>()
I understand this as a groupingBy in Java 8, but what is the "select new PetitionData" inside the query? I don't know how to code it in Java.
I have this at this moment:
Map<String, List<AuditRegister>> groupByMuleId =
registers.stream().collect(Collectors.groupingBy(AuditRegister::getMuleID));
Thank you and regards!
The select LINQ operation is similar to the map method of Stream in Java. They both transform each element of the sequence into something else.
collect(Collectors.groupingBy(AuditRegister::getMuleID)) returns a Map<String, List<AuditRegister>> as you know. But the groups variable in the C# version is an IEnumerable<IGrouping<string, AuditRegister>>. They are quite different data structures.
What you need is the entrySet method of Map. It turns the map into a Set<Map.Entry<String, List<AuditRegister>>>. Now, this data structure is much more similar to IEnumerable<IGrouping<string, AuditRegister>>. This means that you can create a stream from the return value of entrySet, call map, and transform each element into a Petition.
groups.Key is simply entry.getKey(), and groups.ToList() is simply entry.getValue(). It should be easy.
I suggest you create a separate method to pass into the map method:
// you can probably come up with a more meaningful name
public static Petition mapEntryToPetition(Map.Entry<String, List<AuditRegister>> entry) {
    Petition petition = new Petition();
    PetitionData data = new PetitionData();
    data.setUUID(entry.getKey());
    petition.setData(data);
    petition.setRegisters(entry.getValue());
    return petition;
}
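Then the whole pipeline, roughly mirroring the LINQ query, would be something like this (a sketch; it assumes the method above is in scope and that java.util.List and java.util.stream.Collectors are imported, as in your own snippet):

List<Petition> petitions = registers.stream()
        .collect(Collectors.groupingBy(AuditRegister::getMuleID))
        .entrySet()
        .stream()
        .map(entry -> mapEntryToPetition(entry))
        .collect(Collectors.toList());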
I am trying to use the map function on a DataFrame in Spark using Java. I am following the documentation, which says:
map(scala.Function1 f, scala.reflect.ClassTag evidence$4)
Returns a new RDD by applying a function to all rows of this DataFrame.
While using Function1 in map, I need to implement all of its methods. I have seen some questions related to this, but the solutions provided convert the DataFrame into an RDD.
How can I use the map function on a DataFrame without converting it into an RDD, and what is the second parameter of map, i.e. scala.reflect.ClassTag<R> evidence$4?
I am using Java 7 and Spark 1.6.
I know your question is about Java 7 and Spark 1.6, but in Spark 2 (and obviously Java 8), you can have a map function as part of a class, so you do not need to manipulate Java lambdas.
The call would look like:
Dataset<String> dfMap = df.map(
new CountyFipsExtractorUsingMap(),
Encoders.STRING());
dfMap.show(5);
The class would look like:
/**
 * Returns a substring of the values in the id2 column.
 *
 * @author jgp
 */
private final class CountyFipsExtractorUsingMap
        implements MapFunction<Row, String> {
    private static final long serialVersionUID = 26547L;

    @Override
    public String call(Row r) throws Exception {
        String s = r.getAs("id2").toString().substring(2);
        return s;
    }
}
You can find more details in this example on GitHub.
I think map is not the right method to use on a DataFrame. Maybe you should have a look at the examples in the API documentation; they show how to operate on DataFrames.
You can use the Dataset directly; there is no need to convert the data you read into an RDD, which is an unnecessary use of resources.
dataset.map(new MapFunction<Row, String>() { ... }, encoder); should be enough for your needs.
Since you don't describe a specific problem: there are some common alternatives to map on a DataFrame, like select, selectExpr, and withColumn. If the Spark SQL built-in functions can't do what you need, you can use a UDF.
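For example, the substring extraction from the earlier answer could be written with selectExpr instead of map (just a sketch; the fips alias is illustrative, and substring uses 1-based positions in Spark SQL):

Dataset<Row> result = df.selectExpr("id2", "substring(id2, 3) AS fips");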
I have some legacy code that uses a DataFrame, but now it needs to process a Dataset[String]. For the moment I just need to make it work, so I'm looking for a quick and easy way to convert from Dataset[String] to DataFrame so my method can work on it. Can anyone with Spark knowledge help?
As Alberto Bonsanto said in his comment, you can use the toDF method:
import sqlContext.implicits._
val ds = Seq("a", "b").toDS
val df = ds.toDF
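In the Java API the equivalent is a plain toDF() call on the Dataset (a sketch, assuming Spark 2.x, an existing SparkSession named spark, and imports for java.util.Arrays, Dataset, Row and Encoders):

Dataset<String> ds = spark.createDataset(Arrays.asList("a", "b"), Encoders.STRING());
Dataset<Row> df = ds.toDF(); // the single column is named "value" by default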