I have a DataFrame and want to add a column of type String with null values.
How can this be done using the Spark Java API?
I used the lit function, but got an error when writing the DataFrame with saveAsTable.
I was able to solve it by using lit to create the null column and then casting that column to String type:
df.withColumn("col_name", functions.lit(null))
  .withColumn("col_name", functions.col("col_name").cast(DataTypes.StringType));
df.withColumn("col_name", lit(null).cast("string"))
or
import org.apache.spark.sql.types.StringType
df.withColumn("col_name", lit(null).cast(StringType))
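Since the question asks for the Java API, here is a minimal Java sketch of the same approach (assuming the standard org.apache.spark.sql imports):
import static org.apache.spark.sql.functions.lit;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;

// Add a String column whose value is null in every row
Dataset<Row> result = df.withColumn("col_name", lit(null).cast(DataTypes.StringType));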
I have a MySQL table that I load into Spark. The table contains a column of geometry type.
When I load the table into Spark, that column becomes a binary-type column in the DataFrame.
My questions are:
Why does the geometry type in MySQL become binary type in Spark?
Is there any alternative to fix that?
Thank you!
Geometry is a special data type.
Before using it, you should convert it to text or binary.
Conversion functions: https://dev.mysql.com/doc/refman/5.6/en/gis-format-conversion-functions.html
Or you can use GeoSpark:
var spatialDf = sparkSession.sql(
  """
    |SELECT ST_GeomFromWKT(_c0) AS countyshape, _c1, _c2
    |FROM rawdf
  """.stripMargin)
spatialDf.createOrReplaceTempView("spatialdf")
spatialDf.show()
Full tutorial below:
https://datasystemslab.github.io/GeoSpark/tutorial/sql/
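Alternatively, you can push the conversion down to MySQL so that Spark receives readable WKT text instead of raw binary. A sketch using the Spark JDBC reader (the connection details and the names my_table, geom, and geom_wkt are placeholders; use AsText instead of ST_AsText on older MySQL versions):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read a subquery instead of the raw table so MySQL converts the geometry column to WKT text
Dataset<Row> df = sparkSession.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://host:3306/mydb")
    .option("dbtable", "(SELECT id, ST_AsText(geom) AS geom_wkt FROM my_table) AS t")
    .option("user", "user")
    .option("password", "password")
    .load();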
I am using Spark SQL 2.4.1 with Java 8.
I am trying to join two datasets as below:
computed_df.as('s).join(accumulated_results_df.as('f),$"s.company_id" === $"f.company_id","inner")
This works fine in Databricks notebooks.
But when I try to implement the same in my Spark Java code in my IDE, it won't recognize the "$" function/operator, even after including
import static org.apache.spark.sql.functions.*;
So what should be done to use it in my Spark Java code?
Thanks
The answer is org.apache.spark.sql.Column. The $ syntax is Scala shorthand (an implicit conversion) that is not available from Java; use col("columnName") instead. From the Column documentation:
public class Column
...
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
expr("a + 1") // A column that is constructed from a parsed SQL Expression.
lit("abc") // A column that produces a literal (constant) value.
I have a database containing null values in some columns, and I am converting the DataFrame formed from the database to a JSON file. The problem is that the null columns do not appear in the output. Here is the code as well as the output:
dataFrame.show();
dataFrame.na().fill("null").coalesce(1)
.write()
.mode("append")
.format("Json")
.option("nullValue", "")
.save("D:\\XML File Testing\\"+"JsonParty1");
The dataframe.show() gives the following output:
(screenshot of the dataframe.show() output: https://i.stack.imgur.com/XxAQC.png)
Here is how it is being saved in the File (I am pasting just 1 column just to show you the example):
{"EMPNO":7839,"ENAME":"KING","JOB":"PRESIDENT","HIREDATE":"1981-11-17T00:00:00.000+05:30","SAL":5000.00,"DEPTNO":10}
As you can see, the "MGR" and "comm" columns are missing because they are null in the dataframe. Surprisingly, this works when the dataframe is formed from a structured file (for example, a delimited txt file) containing empty values, which Spark reads as null. I have tried various approaches but still failed to get the null columns into the JSON file. Any help would be much appreciated.
Try this:
import org.apache.spark.sql.functions._
dataFrame.withColumn("json", to_json(struct(dataFrame.columns.map(col): _*)))
  .select("json").write.mode("append").text("D:\\XML File Testing\\" + "JsonParty1")
I am using the below MongoDB query in Java to find the maximum value of the field price:
DBCursor cursor = coll.find(query, fields).sort(new BasicDBObject("price", -1)).limit(1);
The fields argument passed to coll.find here contains only the price field.
So I am getting output in the form:
{"price" : value}
Is there any way to get just the value in the output, without the field name and braces, so that it can be assigned to a variable or returned to the calling function? Or is there another query or mechanism I can use for the same purpose?
Thanks!
You can get the value of price from the DBCursor object as follows:
while (cursor.hasNext()) {
Double price = (Double) cursor.next().get("price");
}
On the mongo shell you can do it as follows :
db.priceObj.find({},{_id:0, price:1}).sort({price:-1}).limit(1)[0].price
You cannot get a bare value back, because MongoDB communicates using BSON, and a single value on its own would be invalid BSON. It is easy enough to extract it on your side.
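Putting those pieces together, a sketch of the whole lookup on the Java side (using the legacy DBCollection API from the question; the collection and field names come from the question):
import com.mongodb.BasicDBObject;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

// Project only price, sort descending so the maximum comes first, and take a single document
DBObject fields = new BasicDBObject("price", 1).append("_id", 0);
DBCursor cursor = coll.find(new BasicDBObject(), fields)
        .sort(new BasicDBObject("price", -1))
        .limit(1);
Double maxPrice = cursor.hasNext() ? (Double) cursor.next().get("price") : null;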
I have a query that fetches the sum of a column from a table formed through a subquery, something along the lines of:
select temp.mySum as MySum from (select sum(myColumn) as mySum from mySchema.myTable) temp;
However, I don't want MySum to be null when temp.mySum is null. Instead, I want MySum to carry the string 'value not available' when temp.mySum is null.
Thus I tried to use coalesce in the below manner:
select coalesce(temp.mySum, 'value not available') as MySum from (select sum(myColumn) as mySum from mySchema.myTable) temp;
However, the above query throws this error message:
Message: The data type, length or value of argument "2" of routine "SYSIBM.COALESCE" is incorrect.
This message is due to the datatype incompatibility between arguments 1 and 2 of the coalesce function, as explained in the answer below.
However, I am using this query directly in Jasper to send values to an Excel report:
hashmap.put("myQuery", this.myQuery);
JasperReport jasperReportOne = JasperCompileManager.compileReport(this.reportJRXML);
JasperPrint jasperPrintOne = JasperFillManager.fillReport(jasperReportOne, hashmap, con);
jprintList.add(jasperPrintOne);
JRXlsExporter exporterXLS = new JRXlsExporter();
exporterXLS.setParameter(JRExporterParameter.JASPER_PRINT_LIST, jprintList);
exporterXLS.exportReport();
In the Excel sheet, I am getting a null value when the value is not available. I want to show 'value not available' in the report.
How can this be achieved?
Thanks for reading!
The arguments to coalesce must be compatible. That's not the case if the first is numeric (as mySum probably is) and the second is a string.
The PubLib documentation has a table indicating compatibility between the various types, at least for the DB2 I work with (the mainframe one); no doubt there are similar restrictions for the iSeries and LUW variants as well.
You can try something like coalesce(temp.mySum, 0) instead, or convert the first argument to a string with something like coalesce(char(temp.mySum), 'value not available'). Either of those should work, since they make the two arguments compatible.