Replace " " string with null in hadoop and spark - java

Hi, I have a project which uses HDFS, Hive, and Spark. When I import data, numeric fields with no value are replaced with null, but string fields are replaced with an empty string " ". To solve this I used the following line while creating the table in Hive:
TBLPROPERTIES('serialization.null.format'='');
But when I convert this into a Spark DataFrame, the empty strings are still represented as "" instead of null.
What can be the reason?
Do some of the Hive properties not apply in Spark?

@Manu,
Please use this as a starting point for your conversion problem with Spark DataFrames:
from pyspark.sql import Row
from pyspark.sql.functions import col, when

## Create a sample DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2=3)])
This function converts all empty strings in a column to null, as you requested:
def blank_as_null(z):
    return when(col(z) != "", col(z)).otherwise(None)
dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
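If you need the same thing from the Spark Java API (the question itself is tagged Java), a minimal, untested sketch could look like this; hiveDF and the column name col1 are placeholders for your own DataFrame and string column:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Replace empty strings in "col1" with real nulls; hiveDF is a placeholder
// for the DataFrame read from the Hive table.
Dataset<Row> dfWithNulls = hiveDF.withColumn("col1",
        functions.when(functions.col("col1").equalTo(""), functions.lit(null))
                 .otherwise(functions.col("col1")));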

Related

Converting String to Double with TableSource, Table or DataSet object in Java

I have imported data from a CSV file into Flink (Java). One of the attributes I had to import as a string (the Result attribute) because of parsing errors. Now I want to convert that String to a Double, but I don't know how to do this with an object of the TableSource, Table, or DataSet class. See my code below.
I've looked into the Flink documentation and tried some solutions with the Map and FlatMap classes, but I did not find a solution for this.
BatchTableEnvironment tableEnv = BatchTableEnvironment.create(fbEnv);
//Get H data from CSV file.
TableSource csvSource = CsvTableSource.builder()
.path("Path")
.fieldDelimiter(";")
.field("ID", Types.INT())
.field("result", Types.STRING())
.field("unixDateTime", Types.LONG())
.build();
// register the TableSource
tableEnv.registerTableSource("HTable", csvSource);
Table HTable = tableEnv.scan("HTable");
DataSet<Row> result = tableEnv.toDataSet(HTable, Row.class);
I think it should work to use a combination of replace and cast to convert the strings to doubles, as in "SELECT id, CAST(REPLACE(result, ',', '.') AS DOUBLE) AS result, ..." or the equivalent using the table API.
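For reference, a rough, untested sketch of that SQL variant in Java against the table registered above (field names taken from the snippet):
// Replace the decimal comma and cast the string column to DOUBLE via SQL.
Table fixedTable = tableEnv.sqlQuery(
    "SELECT ID, CAST(REPLACE(result, ',', '.') AS DOUBLE) AS result, unixDateTime FROM HTable");
DataSet<Row> fixedResult = tableEnv.toDataSet(fixedTable, Row.class);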

What should be included/imported to recognize the "$" operator in the join function in my Spark Java code?

Using spark-sql-2.4.1v with java8.
I am trying to join two data sets as below:
computed_df.as('s).join(accumulated_results_df.as('f),$"s.company_id" === $"f.company_id","inner")
This works fine in Databricks notebooks.
But when I try to implement the same in my Spark Java code in my IDE, it won't recognize the "$" function/operator even after including
import static org.apache.spark.sql.functions.*;
So what should be done to use it in my Spark Java code?
Thanks
The answer is org.apache.spark.sql.Column. See its API documentation:
public class Column
...
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
expr("a + 1") // A column that is constructed from a parsed SQL Expression.
lit("abc") // A column that produces a literal (constant) value.

Database DataFrame null values not coming to JSON file

I have a database containing null values in some columns, and I am converting the DataFrame formed from the database to a JSON file. The problem is that the null columns do not appear in the output. Here is the code as well as the output:
dataFrame.show();
dataFrame.na().fill("null").coalesce(1)
.write()
.mode("append")
.format("Json")
.option("nullValue", "")
.save("D:\\XML File Testing\\"+"JsonParty1");
The dataframe.show() gives the following output:
(Screenshot of the dataFrame.show() output: https://i.stack.imgur.com/XxAQC.png)
Here is how it is being saved in the file (I am pasting just one row to show you the example):
{"EMPNO":7839,"ENAME":"KING","JOB":"PRESIDENT","HIREDATE":"1981-11-17T00:00:00.000+05:30","SAL":5000.00,"DEPTNO":10}
As you can see, the "MGR" and "comm" columns are missing because they are null in the DataFrame. Surprisingly, this works when the DataFrame is formed from a structured file (for example a delimited txt file) containing empty values (which the Spark DataFrame takes as null). I have tried various approaches but still failed to get the null columns into the JSON file. Any help would be much appreciated.
Try this:
import org.apache.spark.sql.functions._
dataFrame.withColumn("json", to_json(struct(dataFrame.columns.map(col): _*)))
  .select("json").write.mode("append").text("D:\\XML File Testing\\"+"JsonParty1")

Add a null value column in Spark Data Frame using Java

I have a dataframe and want to add a column of type String with null values.
How can this be done using the Spark Java API?
I used the lit function, but I got an error when I tried to write the DF with saveAsTable.
I was able to solve it by using the lit function with a null value and casting the column to String type:
df.withColumn(
    "col_name", functions.lit(null)
).withColumn("col_name",
    functions.col("col_name").cast(DataTypes.StringType)
)
df.withColumn("col_name", lit(null).cast("string"))
or
import org.apache.spark.sql.types.StringType
df.withColumn("col_name", lit(null).cast(StringType))

Inserting filename as rowkey using HBase MapReduce

Using the Java API, I'm trying to Put() the content of some files into HBase 1.1.x. To do so, I created a WholeFileInput class (ref: Using WholeFileInputFormat with Hadoop MapReduce still results in Mapper processing 1 line at a time) so that MapReduce reads the entire file instead of one line at a time. But unfortunately, I cannot figure out how to form my rowkey from the given filename.
Example:
Input:
file-123.txt
file-524.txt
file-9577.txt
...
file-"anotherNumber".txt
Result on my HBase table:
Row-----------------Value
123-----------------"content of 1st file"
524-----------------"content of 2nd file"
...etc
If anyone has already faced this situation, please help me with it.
Thanks in advance.
Your rowkey can be like this:
rowkey = prefix + (filenamepart or full file name) + Murmurhash(fileContent)
where the prefix can be any value that falls within the pre-splits you defined at table creation time.
For example:
create 'tableName', {NAME => 'colFam', VERSIONS => 2, COMPRESSION => 'SNAPPY'},
{SPLITS => ['0','1','2','3','4','5','6','7']}
The prefix can be any random id generated within the range of the pre-splits.
This kind of row key also avoids hot-spotting as the data grows, and the data will be spread across the region servers.
