Database DataFrame Null values not coming to Json File - java

I have a database table containing null values in some columns, and I am converting the dataframe built from that table into a JSON file. The problem is that the columns holding null values do not appear in the output. Here is the code as well as the output:
dataFrame.show();
dataFrame.na().fill("null").coalesce(1)
.write()
.mode("append")
.format("Json")
.option("nullValue", "")
.save("D:\\XML File Testing\\"+"JsonParty1");
The dataframe.show() gives the following output:
(Screenshot of the dataframe as processed by Spark: https://i.stack.imgur.com/XxAQC.png)
Here is how it is being saved in the file (I am pasting just one record to show you the example):
{"EMPNO":7839,"ENAME":"KING","JOB":"PRESIDENT","HIREDATE":"1981-11-17T00:00:00.000+05:30","SAL":5000.00,"DEPTNO":10}
As you can see, my "MGR" and "comm" columns are missing because they show null in the dataframe. Surprisingly, this works when the dataframe is formed from a structured file (for example, a delimited txt file) containing empty values (which the Spark dataframe reads as null). I have tried various approaches but still failed to get the null columns into the JSON file. Any help would be much appreciated.

Try this:
import org.apache.spark.sql.functions._
dataFrame.withColumn("json", to_json(struct(dataFrame.columns.map(col):_*)
.select("json").write.mode("append").text("D:\\XML File Testing\\"+"JsonParty1")

Related

Spark Exception while writing a Dataframe to an External Partitioned Hive Table using Spark Java

I am running a Spark job on a multinode cluster and trying to insert (append) a dataframe into an external Hive table which is partitioned by 2 columns: date and hr.
dataframe.write().insertInto(hiveTable);
Hive table structure is as below:
CREATE EXTERNAL TABLE `database.hiveTable`(
`col1` string,
`col2` string,
`col3_json` string)
PARTITIONED BY (
`dt` string,
`hr` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION '/data/hdfs/tmp/test';
Note: the col3_json column holds data as a JSON string like:
{"group":[{"action":"Change","gid":"111","isId":"Y"},{"action":"Add","gid":"111","isId":"Y"},{"action":"Delete","gid":"111","isId":"N"}]}
The data is inserted successfully when the table is not partitioned, but the insert into the partitioned table above throws the error below:
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
.
.
.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$PathComponentTooLongException): The maximum path component name limit of hr=%7B%22group%22%7B%22action%22%3A%22Change%22,%22gid%22%22,%22isId%22%3A%22Y%22},%7B%22action%22Add%22,%22gid%111%22,%22isId%%22},%7B%22action%22%3A%22Delete%22,%22gid%2524%22,%22isId%22N%22}%5D} in directory /data/hdfs/tmp/test/.hive-staging_hive_2020-01-04_00-27-05_24_76879687968796-1/-ext-10000/_temporary/0/_temporary/attempt_20200104002705_0027_m_000000_0/dt=N is exceeded: limit=255 length=399
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxComponentLength(FSDirectory.java:1113)
I notice that the error contains a few strings that are present in the JSON data, like group, Change, gid, etc.
Not sure if this is related to the JSON data being inserted into col3_json.
Please suggest.
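One thing worth checking (a guess based on the error message, not a confirmed fix): the hr= path component in the exception contains your URL-encoded JSON string, and insertInto() resolves columns by position rather than by name. If the dataframe's columns are not in the same order as the table (data columns first, then dt and hr last), the JSON column can end up being used as a partition value. A minimal sketch of forcing that order before the insert, reusing the dataframe and hiveTable names from the question:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// insertInto() ignores column names and matches by position, so the order
// here must mirror the DDL: col1, col2, col3_json, then the partition columns.
Dataset<Row> ordered = dataframe.selectExpr("col1", "col2", "col3_json", "dt", "hr");
ordered.write().insertInto(hiveTable);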

Duplicate column name in spark read csv

I am reading a csv file which has a duplicate column.
I want to preserve the original column names in the dataframe.
I tried to set spark.sql.caseSensitive to true in my SparkContext conf, but unfortunately it had no effect.
The duplicate column name is NU_CPTE. Spark tries to rename the columns by appending the column index: 0 and 7.
NU_CPTE0|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE7
SparkSession spark= SparkSession
.builder()
.master("local[2]")
.appName("Application Test")
.getOrCreate();
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
Dataset<Row> df=spark.read().option("header","true").option("delimiter",";").csv("FILE_201701.csv");
df.show(10);
I want something like this as result:
NU_CPTE|CD_EVT_FINANCIER|TYP_MVT_ELTR|DT_OPERN_CLI|LI_MVT_ELTR| MT_OPERN_FINC|FLSENS|NU_CPTE
Spark was changed to allow duplicate column names by appending the column index, which is why you are getting the numbers appended to the duplicate column names. Please see the link below:
https://issues.apache.org/jira/browse/SPARK-16896
The way you're trying to set the caseSensitive property will indeed be ineffective. Try replacing:
spark.sparkContext().getConf().set("spark.sql.caseSensitive","true");
with:
spark.sql("set spark.sql.caseSensitive=true");
However, this still assumes your original columns have some sort of difference in casing. If they have the same casing, they will still be identical and will be suffixed with the column number.
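If what you actually want is the original header back (both columns literally named NU_CPTE), one hedged workaround is to overwrite all column names after the read with toDF. A sketch that assumes the file has exactly these 8 columns in this order (be aware that selecting NU_CPTE by name afterwards will be ambiguous):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Re-apply the original header, duplicate name included.
Dataset<Row> renamed = df.toDF("NU_CPTE", "CD_EVT_FINANCIER", "TYP_MVT_ELTR",
        "DT_OPERN_CLI", "LI_MVT_ELTR", "MT_OPERN_FINC", "FLSENS", "NU_CPTE");
renamed.show(10);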

Add a null value column in Spark Data Frame using Java

I have a dataframe and want to add a column of type String with null values.
How can it be done using the Spark Java API?
I used the lit function, but got an error when I tried to write the DF with saveAsTable.
I was able to solve it by using the lit function to add the null column and then casting that column to String type.
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

df.withColumn("col_name", functions.lit(null))
  .withColumn("col_name", functions.col("col_name").cast(DataTypes.StringType));
df.withColumn("col_name", lit(null).cast("string"))
or
import org.apache.spark.sql.types.StringType
df.withColumn("col_name", lit(null).cast(StringType))

Replace " " string with null in hadoop and spark

Hi, I have a project which uses HDFS, Hive and Spark. When I import data, numeric fields with no value are replaced with null, but strings are replaced by an empty string " ". To solve this issue I used this line while creating the table in Hive:
TBLPROPERTIES('serialization.null.format'='');
But still, when I convert this into a Spark data frame, the empty strings are represented as "" instead of null.
What can be the reason?
Are some Hive properties not supported by Spark?
@Manu,
Please use this as a template for converting the empty strings to null in your Spark data frames:
from pyspark.sql import Row
from pyspark.sql.functions import col, when

# Create a sample DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
This function will convert all the empty strings to null, as you requested:
def blank_as_null(z):
    return when(col(z) != "", col(z)).otherwise(None)

dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
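If you need the same conversion from the Java API rather than PySpark, here is a hedged sketch of the equivalent (df and col1 stand in for your own dataframe and column name):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;

// Keep non-empty values as they are; turn empty strings into a real, typed null.
Dataset<Row> withNulls = df.withColumn("col1",
        when(col("col1").notEqual(""), col("col1"))
                .otherwise(lit(null).cast(DataTypes.StringType)));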

Extract/Parse XML data/element from BLOB column in Oracle

I have 2 tables, CONFIGURATION_INFO and CONFIGURATION_FILE. I use the query below to find all employee files:
select i.cfg_id, Filecontent
from CONFIGURATION_INFO i,
CONFIGURATION_FILE f
where i.cfg_id=f.cfg_id
but I also need to parse or extract data from the BLOB column Filecontent and display all cfg_id values whose XML tag PCVERSION starts with 8. Is there any way?
The XML tag that needs to be extracted is <CSMCLIENT><COMPONENT><PCVERSION>8.1</PCVERSION></COMPONENT></CSMCLIENT>
It need not be a query; even Java or Groovy code would help me.
Note: some of the XMLs might be as big as 5 MB.
So basically the data in the Filecontent column of the table CONFIGURATION_INFO is a BLOB?
The way to query the XML out of a BLOB column is to use the XMLType function.
It converts the datatype of your column from BLOB to XMLType; the result is then parsed with an XPath expression via extract(). In Oracle Database:
select
xmltype(Filecontent, 871).extract('//CSMCLIENT/COMPONENT/PCVERSION/text()').getstringval()
from CONFIGURATION_INFO ...
Do the rest of the WHERE logic on your own.
Usually you know what kind of data is in the BLOB column, so you can parse it in the SQL query.
If it is a text column (varchar or something like that) you can use to_char(columnName).
There are a lot of functions that you can use; you can find them in this link.
Usually you will use to_char/to_date/hexToRow/rowTohex.
convert blob to file link
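Since the question mentions that Java or Groovy code would also help, here is a hedged plain-JDBC sketch; the connection details are placeholders, and it assumes CSMCLIENT is the document root of the stored XML:
import java.sql.Blob;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class PcVersionScan {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; replace with your own.
        String jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/service";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "select i.cfg_id, f.Filecontent"
               + " from CONFIGURATION_INFO i join CONFIGURATION_FILE f on i.cfg_id = f.cfg_id")) {

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            XPath xpath = XPathFactory.newInstance().newXPath();

            while (rs.next()) {
                Blob blob = rs.getBlob("Filecontent");
                // Stream the BLOB into the XML parser rather than materializing it as a
                // string, since some files can be around 5 MB.
                Document doc = dbf.newDocumentBuilder().parse(blob.getBinaryStream());
                String version = xpath.evaluate("/CSMCLIENT/COMPONENT/PCVERSION", doc);
                if (version != null && version.startsWith("8")) {
                    System.out.println(rs.getLong("cfg_id"));
                }
            }
        }
    }
}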
