I am Using spark-sql 2.4.1 and java 8.
val country_df = Seq(
("us",2001),
("fr",2002),
("jp",2002),
("in",2001),
("fr",2003),
("jp",2002),
("in",2003)
).toDF("country","data_yr")
> val col_df = country_df.select("country").where($"data_yr" === 2001)
val data_df = Seq(
("us_state_1","fr_state_1" ,"in_state_1","jp_state_1"),
("us_state_2","fr_state_2" ,"in_state_2","jp_state_1"),
("us_state_3","fr_state_3" ,"in_state_3","jp_state_1")
).toDF("us","fr","in","jp")
> data_df.select("us","in").show()
how to populate this select clause (of data_df) dynamically , from the country_df for given year ?
i.e. From first dataframe , i will get values of column , those are
the columns i need to select from second datafame. How can this be
done ?
Tried this :
List<String> aa = col_df.select(functions.lower(col("data_item_code"))).map(row -> row.mkString(" ",", "," "), Encoders.STRING()).collectAsList();
data_df.select(aa.stream().map(s -> new Column(s)).toArray(Column[]::new));
Error :
.AnalysisException: cannot resolve '` un `' given input columns: [abc,.....all columns ...]
So what is wrong here , and how to fix this ?
You can try with the below code.
Select the column name from the first dataset.
List<String> columns = country_df.select("country").where($"data_yr" === 2001).as(Encoders.STRING()).collectAsList();
Use the column names in selectexpr in second dataset.
public static Seq<String> convertListToSeq(List<String> inputList) {
return JavaConverters.asScalaIteratorConverter(inputList.iterator()).asScala().toSeq();
}
//using selectExpr
data_df.selectExpr(convertListToSeq(columns)).show(true);
scala> val colname = col_df.rdd.collect.toList.map(x => x(0).toString).toSeq
scala> data_df.select(colname.head, colname.tail: _*).show()
+----------+----------+
| us| in|
+----------+----------+
|us_state_1|in_state_1|
|us_state_2|in_state_2|
|us_state_3|in_state_3|
+----------+----------+
Using pivot you can get the values as column names directly like this:
val selectCols = col_df.groupBy().pivot($"country").agg(lit(null)).columns
data_df.select(selectCols.head, selectCols.tail: _*)
Related
I have a JSON node on which I have to write PSQL query,
My table schema name(String),tagValues(jsonb). Example tagValue data is given below
Name_TagsTable
uid | name(String)| tagValues(jsonb)
-----+-------------------+-----------------------------
1 | myName | { "tags": [{"key":"key1","value" : "value1"}, {"key":"key1","value" : "value2"}, {"key":"key3","value" : "value3"}, {"key":"key4","value" : "value4"}] }
I need a query that gives me names for which
at least one of the tag in the tags list satisfy the condition
key = 'X' and value = 'Y'
Help me with the query. I am using PSQL 10.0
You can use the contains operator #> which also works with arrays
select *
from name_tagstable
where tagvalues -> 'tags' #> '[{"key": "x", "value": "y"}]';
I am facing issues while retrieving/searching through Postgres JSONB column using CriteriaBuilder.
Postgres Query : SELECT * FROM table where lower(jsonb_extract_path(tags,'Case')::text) = lower('"normal"') as text));
Tags: datatype of tags column is JSONB.
Above query is working fine using postgres client.
But I was facing issues while trying to get it done from CriteriaBuilder.
Following is my code:
private Specification<TemplateMetadata> findByTagsFilter(Map<String, String[]> requestParameterMap) {
return (root, query, builder) -> {
List<Predicate> predicates = new ArrayList<>();
if (!StringUtils.isEmpty(requestParameterMap)) {
predicates = requestParameterMap.entrySet().stream()
.filter(keyValue1 -> validateKeyValues(keyValue1.getKey()))
.map(keyValue2 -> Arrays.asList(keyValue2.getValue()).stream()
.map(v -> builder.equal(builder.function("jsonb_extract_path", String.class, builder.lower(root.<String>get("tags")), builder.literal(keyValue2.getKey())) , builder.lower(builder.literal("\""+v+"\"")).as(JsonDataType.class)))
.flatMap(list -> list.distinct())
.collect(Collectors.toList());
}
return builder.and(
predicates.toArray(new Predicate[predicates.size()])
);
};
}
requestParameterMap is filled by REST (API URI should looks like : ?testtag1=testtag1&testtag2=testtag2)
I am getting follwoing Error
ERROR: function lower(jsonb) does not exist\n Hint: No function matches the given name and argument types. You might need to add explicit type casts.\n Position: 444
Can some one help me on this.
I have a DataFrame containing below:
TradeId|Source
ABC|"USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602"
I want to pivot this data so it turns into below
TradeId|CCY|PV
ABC|USD|333.123
ABC|USD|-789.444
ABC|GBP|1234.567
The number of CCY|PV|Date triplets in the column "Source" is not fixed. I could do it in ArrayList but that requires to load the data in JVM and defeats the whole point of Spark.
Lets say my DataFrame looks as below:
DataFrame tradesSnap = this.loadTradesSnap(reportRequest);
String tempTable = getTempTableName();
tradesSnap.registerTempTable(tempTable);
tradesSnap = tradesSnap.sqlContext().sql("SELECT TradeId, Source FROM " + tempTable);
If you read databricks pivot, it says " A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns." And this is not what you desire I guess
I would suggest you to use withColumn and functions to get the final output you desire. You can do as following considering dataframe is what you have
+-------+----------------------------------------------------------------+
|TradeId|Source |
+-------+----------------------------------------------------------------+
|ABC |USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602|
+-------+----------------------------------------------------------------+
You can do the following using explode, split and withColumn to get the desired output
val explodedDF = dataframe.withColumn("Source", explode(split(col("Source"), "\\|")))
val finalDF = explodedDF.withColumn("CCY", split($"Source", ",")(0))
.withColumn("PV", split($"Source", ",")(1))
.withColumn("Date", split($"Source", ",")(2))
.drop("Source")
finalDF.show(false)
The final output is
+-------+---+--------+--------+
|TradeId|CCY|PV |Date |
+-------+---+--------+--------+
|ABC |USD|333.123 |20170605|
|ABC |USD|-789.444|20170605|
|ABC |GBP|1234.567|20150602|
+-------+---+--------+--------+
I hope this solves your issue
Rather than pivoting, what you are trying to achieve looks more like flatMap.
To put it simply, by using flatMap on a Dataset you apply to each row a function (map) that itself would produce a sequence of rows. Each set of rows is then concatenated into a single sequence (flat).
The following program shows the idea:
import org.apache.spark.sql.SparkSession
case class Input(TradeId: String, Source: String)
case class Output(TradeId: String, CCY: String, PV: String, Date: String)
object FlatMapExample {
// This function will produce more rows of output for each line of input
def splitSource(in: Input): Seq[Output] =
in.Source.split("\\|", -1).map {
source =>
println(source)
val Array(ccy, pv, date) = source.split(",", -1)
Output(in.TradeId, ccy, pv, date)
}
def main(args: Array[String]): Unit = {
// Initialization and loading
val spark = SparkSession.builder().master("local").appName("pivoting-example").getOrCreate()
import spark.implicits._
val input = spark.read.options(Map("sep" -> "|", "header" -> "true")).csv(args(0)).as[Input]
// For each line in the input, split the source and then
// concatenate each "sub-sequence" in a single `Dataset`
input.flatMap(splitSource).show
}
}
Given your input, this would be the output:
+-------+---+--------+--------+
|TradeId|CCY| PV| Date|
+-------+---+--------+--------+
| ABC|USD| 333.123|20170605|
| ABC|USD|-789.444|20170605|
| ABC|GBP|1234.567|20150602|
+-------+---+--------+--------+
You can now take the result and save it to a CSV, if you want.
There is input_file_name function in Apache Spark which is used by me to add new column to Dataset with the name of file which is currently being processed.
The problem is that I'd like to somehow customize this function to return only file name, ommitting the full path to it on s3.
For now, I am doing replacement of the path on the second step using map function:
val initialDs = spark.sqlContext.read
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path).withColumn("input_file_name", input_file_name)
...
...
def fromFile(fileName: String): String = {
val baseName: String = FilenameUtils.getBaseName(fileName)
val tmpFileName: String = baseName.substring(0, baseName.length - 8) //here is magic conversion ;)
this.valueOf(tmpFileName)
}
But I'd like to use something like
val initialDs = spark.sqlContext.read
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path).withColumn("input_file_name", **customized_input_file_name_function**)
In Scala:
#register udf
spark.udf
.register("get_only_file_name", (fullPath: String) => fullPath.split("/").last)
#use the udf to get last token(filename) in full path
val initialDs = spark.read
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path)
.withColumn("input_file_name", get_only_file_name(input_file_name))
Edit: In Java as per comment
#register udf
spark.udf()
.register("get_only_file_name", (String fullPath) -> {
int lastIndex = fullPath.lastIndexOf("/");
return fullPath.substring(lastIndex, fullPath.length - 1);
}, DataTypes.StringType);
import org.apache.spark.sql.functions.input_file_name
#use the udf to get last token(filename) in full path
Dataset<Row> initialDs = spark.read()
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path)
.withColumn("input_file_name", get_only_file_name(input_file_name()));
Borrowing from a related question here, the following method is more portable and does not require a custom UDF.
Spark SQL Code Snippet: reverse(split(path, '/'))[0]
Spark SQL Sample:
WITH sample_data as (
SELECT 'path/to/my/filename.txt' AS full_path
)
SELECT
full_path
, reverse(split(full_path, '/'))[0] as basename
FROM sample_data
Explanation:
The split() function breaks the path into it's chunks and reverse() puts the final item (the file name) in front of the array so that [0] can extract just the filename.
Full Code example here :
spark.sql(
"""
|WITH sample_data as (
| SELECT 'path/to/my/filename.txt' AS full_path
| )
| SELECT
| full_path
| , reverse(split(full_path, '/'))[0] as basename
| FROM sample_data
|""".stripMargin).show(false)
Result :
+-----------------------+------------+
|full_path |basename |
+-----------------------+------------+
|path/to/my/filename.txt|filename.txt|
+-----------------------+------------+
commons io is natural/easiest import in spark means(no need to add additional dependency...)
import org.apache.commons.io.FilenameUtils
getBaseName(String fileName)
Gets the base name, minus the full path and extension, from a full fileName.
val baseNameOfFile = udf((longFilePath: String) => FilenameUtils.getBaseName(longFilePath))
Usage is like...
yourdataframe.withColumn("shortpath" ,baseNameOfFile(yourdataframe("input_file_name")))
.show(1000,false)
I'm trying to create a multi-value parameter in SpagoBI.
Here is my data set query whose last line appears to be causing an issue.
select C."CUSTOMERNAME", C."CITY", D."YEAR", P."NAME"
from "CUSTOMER" C, "DAY" D, "PRODUCT" P, "TRANSACTIONS" T
where C."CUSTOMERID" = T."CUSTOMERID"
and D."DAYID" = T."DAYID"
and P."PRODUCTID" = T."PRODUCTID"
and _CITY_
I created before open script in my dataset which looks like this:
this.queryText = this.queryText.replace(_CITY_, " CUSTOMER.CITY in ( "+params["cp"].value+" ) ");
My parameter is set as string, display type dynamic list box.
When I run the report I'm getting that error.
org.eclipse.birt.report.engine.api.EngineException: There are errors evaluating script "
this.queryText = this.queryText.replace(_CITY_, " CUSTOMER.CITY in ( "+params["cp"].value+" ) ");
":
Fail to execute script in function __bm_beforeOpen(). Source:
Could anyone please help me?
Hello I managed to solve the problem. Here is my code:
var substring = "" ;
var strParamValsSelected=reportContext.getParameterValue("citytext");
substring += "?," + strParamValsSelected ;
this.queryText = this.queryText.replace("'xxx'",substring);
As You can see the "?" is necessary before my parameter. Maybe It will help somebody. Thank You so much for Your comments.
If you are using SpagoBI server and High charts (JFreeChart Engine) / JSChat Engine you can just use ($P{param_url}) in query,
or build dynamic query using Java script / groovy Script
so your query could also be:
select C."CUSTOMERNAME", C."CITY", D."YEAR", P."NAME"
from "CUSTOMER" C, "DAY" D, "PRODUCT" P, "TRANSACTIONS" T
where C."CUSTOMERID" = T."CUSTOMERID"
and D."DAYID" = T."DAYID"
and P."PRODUCTID" = T."PRODUCTID"
and CUSTOMER."CITY" in ('$P{param_url}')