Initial data is in a Dataset<Row> and I am trying to write it to a pipe-delimited file. I want each non-empty, non-null value to be placed in quotes; empty or null values should not be quoted.
result.coalesce(1).write()
.option("delimiter", "|")
.option("header", "true")
.option("nullValue", "")
.option("quoteAll", "false")
.csv(Location);
Expected output:
"London"||"UK"
"Delhi"|"India"
"Moscow"|"Russia"
Current Output:
London||UK
Delhi|India
Moscow|Russia
If I change "quoteAll" to "true", the output I am getting is:
"London"|""|"UK"
"Delhi"|"India"
"Moscow"|"Russia"
The Spark version is 2.3 and the Java version is 8.
Java answer. CSV escaping is not just adding " symbols around values; you should also handle " characters inside strings. So let's use StringEscapeUtils and define a UDF that calls it, then apply the UDF to each of the columns.
import org.apache.commons.text.StringEscapeUtils;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import java.util.Arrays;
public class Test {
    void test(Dataset<Row> result, String Location) {
        // define the UDF; null and empty values stay unquoted
        UserDefinedFunction escape = udf(
            (String str) -> (str == null || str.isEmpty()) ? "" : StringEscapeUtils.escapeCsv(str),
            DataTypes.StringType
        );
        // call the UDF for each column
        Column[] columns = Arrays.stream(result.schema().fieldNames())
            .map(f -> escape.apply(col(f)).as(f))
            .toArray(Column[]::new);
        // save the result
        result.select(columns)
            .coalesce(1).write()
            .option("delimiter", "|")
            .option("header", "true")
            .option("nullValue", "")
            .option("quoteAll", "false")
            .csv(Location);
    }
}
Side note: coalesce(1) is a bad call. It collects all the data on one executor, so you can get an executor OOM in production for a huge dataset.
EDIT & Warning: I did not see the java tag. This is a Scala solution that uses foldLeft as a loop to go over all columns. If this is replaced by a Java-friendly loop, everything should work as is. I will try to look back at this at a later time.
A programmatic solution could be
val columns = result.columns
val randomColumnName = "RND"
val result2 = columns.foldLeft(result) { (data, column) =>
  data
    .withColumnRenamed(column, randomColumnName)
    .withColumn(column,
      when(col(randomColumnName).isNull, "")
        .otherwise(concat(lit("\""), col(randomColumnName), lit("\"")))
    )
    .drop(randomColumnName)
}
This will produce strings with " around them and write empty strings in place of nulls. If you need to keep the nulls, just keep them.
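Since the question is tagged Java, here is a minimal, untested sketch of the same loop written as a plain Java for loop (it assumes the same result Dataset<Row> and static imports from org.apache.spark.sql.functions):
// Hypothetical Java translation of the foldLeft above
String randomColumnName = "RND";
Dataset<Row> result2 = result;
for (String column : result.columns()) {
    result2 = result2
            .withColumnRenamed(column, randomColumnName)
            .withColumn(column,
                    when(col(randomColumnName).isNull(), "")
                            .otherwise(concat(lit("\""), col(randomColumnName), lit("\""))))
            .drop(randomColumnName);
}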
Then just write it down:
result2.coalesce(1).write()
.option("delimiter", "|")
.option("header", "true")
.option("quoteAll", "false")
.csv(Location);
This is certainly not an efficient answer and I am modifying it based on the one given by Artem Aliev, but I thought it would be useful to a few people, so I am posting this answer.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

public class Quotes {
    private static final String DELIMITER = "|";
    private static final String Location = "Give location here";

    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder()
                .master("local")
                .appName("Spark Session")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> result = sparkSession.read()
                .option("header", "true")
                .option("delimiter", DELIMITER)
                .csv("Sample file to read"); // Give the details of the file to read here

        UserDefinedFunction udfQuotesNonNull = udf(
                (String abc) -> (abc != null ? "\"" + abc + "\"" : abc), DataTypes.StringType
        );

        // Inducing a new column to be used for the join, as there is no identity column in the source dataset
        result = result.withColumn("ind_val", monotonically_increasing_id());

        // Dataset used for storing temporary results
        Dataset<Row> dataset1 = result.select(udfQuotesNonNull.apply(col("ind_val").cast("string")).alias("ind_val"));
        // Dataset used for storing the output
        Dataset<Row> dataset = result.select(udfQuotesNonNull.apply(col("ind_val").cast("string")).alias("ind_val"));

        String[] str = result.schema().fieldNames();
        dataset1.show();
        for (int j = 0; j < str.length - 1; j++) {
            dataset1 = result.select(
                    udfQuotesNonNull.apply(col("ind_val").cast("string")).alias("ind_val"),
                    udfQuotesNonNull.apply(col(str[j]).cast("string")).alias("\"" + str[j] + "\""));
            // Joining based on the induced column
            dataset = dataset.join(dataset1, "ind_val");
        }
        result = dataset.drop("ind_val");

        result.coalesce(1).write()
                .option("delimiter", DELIMITER)
                .option("header", "true")
                .option("quoteAll", "false")
                .option("nullValue", null)
                .option("quote", "\u0000")
                .option("spark.sql.sources.writeJobUUID", false)
                .csv(Location);
    }
}
I want to retrieve rows from Apache Hive via Apache Spark and put each row into the Aerospike cache.
Here is a simple case.
var dataset = session.sql("select * from employee");
final var aerospikeClient = aerospike; // to remove binding between lambda and the service class itself
dataset.foreach(row -> {
var key = new Key("namespace", "set", randomUUID().toString());
aerospikeClient.add(
key,
new Bin(
"json-repr",
row.json()
)
);
});
I get an error:
Caused by: java.io.NotSerializableException: com.aerospike.client.reactor.AerospikeReactorClient
Obviously I can't make AerospikeReactorClient serializable. I tried dataset.collectAsList() and that did work. But as far as I understand, this method loads all the content onto one node, and there might be an enormous amount of data, so it's not an option.
What are the best practices to deal with such problems?
You can write directly from a data frame. No need to loop through the dataset.
Launch the spark shell and import the com.aerospike.spark.sql._ package:
$ spark-shell
scala> import com.aerospike.spark.sql._
import com.aerospike.spark.sql._
Example of writing data into Aerospike
// Imports assumed for this example (Row, the type classes, and ArrayBuffer)
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val TEST_COUNT = 100

val simpleSchema: StructType = new StructType(
  Array(
    StructField("one", IntegerType, nullable = false),
    StructField("two", StringType, nullable = false),
    StructField("three", DoubleType, nullable = false)
  ))

val simpleDF = {
  val inputBuf = new ArrayBuffer[Row]()
  for (i <- 1 to TEST_COUNT) {
    val one = i
    val two = "two:" + i
    val three = i.toDouble
    val r = Row(one, two, three)
    inputBuf.append(r)
  }
  val inputRDD = spark.sparkContext.parallelize(inputBuf.toSeq)
  spark.createDataFrame(inputRDD, simpleSchema)
}
//Write the Sample Data to Aerospike
simpleDF.write
.format("aerospike") //aerospike specific format
.option("aerospike.writeset", "spark-test") //write to this set
.option("aerospike.updateByKey", "one")//indicates which columns should be used for construction of primary key
.option("aerospike.write.mode","update")
.save()
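Since the question itself is in Java, a rough sketch of the equivalent write from a Dataset<Row> (assuming the same connector options apply; not tested) would be:
// simpleDF here is assumed to be a Dataset<Row> built the same way as above
simpleDF.write()
        .format("aerospike") // aerospike specific format
        .option("aerospike.writeset", "spark-test") // write to this set
        .option("aerospike.updateByKey", "one") // columns used to construct the primary key
        .option("aerospike.write.mode", "update")
        .save();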
I managed to overcome this issue by creating the AerospikeClient manually inside the foreach lambda.
var dataset = session.sql("select * from employee");
dataset.foreach(row -> {
var key = new Key("namespace", "set", randomUUID().toString());
newAerospikeClient(aerospikeProperties).add(
key,
new Bin(
"json-repr",
row.json()
)
);
});
Now I only have to declare AerospikeProperties as Serializable.
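For completeness, a minimal, hypothetical sketch of such a serializable properties holder (the fields and names here are assumptions, not the actual class from the question):
// Hypothetical: a small serializable holder for connection settings, so the
// foreach lambda can capture it and ship it to the executors.
public class AerospikeProperties implements java.io.Serializable {
    private static final long serialVersionUID = 1L;

    private final String host;
    private final int port;

    public AerospikeProperties(String host, int port) {
        this.host = host;
        this.port = port;
    }

    public String getHost() { return host; }
    public int getPort() { return port; }
}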
I'm using spark-sql-2.4.1v with Java 8.
I have a dynamic list of columns that is passed into my function,
i.e.
List<String> cols = Arrays.asList("col_1","col_2","col_3","col_4");
Dataset<Row> df = ...; // which has the above columns plus "id", "name", and many other columns
I need to select cols plus "id" and "name".
I am doing it as below:
Dataset<Row> res_df = df.select("id", "name", cols.stream().toArray( String[]::new));
This gives a compilation error, so how do I handle this use case?
What I have tried:
When I do something like the below:
List<String> cols = new ArrayList<>(Arrays.asList("col_1","col_2","col_3","col_4"));
cols.add("id");
cols.add("name");
I am getting this error:
Exception in thread "main" java.lang.UnsupportedOperationException
at java.util.AbstractList.add(AbstractList.java:148)
at java.util.AbstractList.add(AbstractList.java:108)
You could create an array of Columns and pass it to the select statement.
import org.apache.spark.sql.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
List<String> cols = new ArrayList<>(Arrays.asList("col_1","col_2","col_3","col_4"));
cols.add("id");
cols.add("name");
Column[] cols2 = cols.stream()
        .map(s -> new Column(s))
        .collect(Collectors.toList())
        .toArray(new Column[0]);
df.select(cols2).show();
You have a bunch of ways to achieve this, relying on different select method signatures.
One possible solution, with the assumption that the cols list is immutable and not controlled by your code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import scala.collection.JavaConverters;
public class ATest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .master("local[2]")
                .getOrCreate();

        List<String> cols = Arrays.asList("col_1", "col_2");
        Dataset<Row> df = spark.sql("select 42 as ID, 'John' as NAME, 1 as col_1, 2 as col_2, 3 as col_3, 4 as col4");
        df.show();

        ArrayList<String> newCols = new ArrayList<>();
        newCols.add("NAME");
        newCols.addAll(cols);

        df.select("ID", JavaConverters.asScalaIteratorConverter(newCols.iterator()).asScala().toSeq())
                .show();
    }
}
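Another option along the same lines, relying on the select(String, String...) varargs signature: fold everything into a single String array and pass that array for the varargs parameter. A small untested sketch, continuing the example above:
// Hypothetical alternative using the varargs overload of select
ArrayList<String> rest = new ArrayList<>();
rest.add("NAME");
rest.addAll(cols);
df.select("ID", rest.toArray(new String[0])).show();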
I have a column with type Timestamp with the format yyyy-MM-dd HH:mm:ss in a dataframe.
The column is sorted by time, with earlier dates in earlier rows.
When I ran this command
List<Row> timeRows = df.withColumn(ts, df.col(ts).cast("long")).select(ts).collectAsList();
I face a strange issue where the value of the later date is smaller than the earlier date. Example:
[670] : 1550967304 (2019-02-23 04:30:15)
[671] : 1420064100 (2019-02-24 08:15:04)
Is this the correct way to convert to Epoch or is there another way?
Try using unix_timestamp to convert the string date time to a timestamp. According to the documentation:
unix_timestamp(Column s, String p): Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"ts" column syntax
val format = "yyyy-MM-dd HH:mm:ss"
df.withColumn("epoch_sec", unix_timestamp($"ts", format)).select("epoch_sec").collectAsList()
Also, see https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-datetime.html
You should use the built-in function unix_timestamp() in org.apache.spark.sql.functions
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html#unix_timestamp()
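A minimal Java sketch of that (assuming the column is read as a string named "ts" in the yyyy-MM-dd HH:mm:ss format from the question):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.unix_timestamp;

// epoch seconds parsed from the string column with an explicit pattern
Dataset<Row> withEpoch = df.withColumn(
        "epoch_sec",
        unix_timestamp(col("ts"), "yyyy-MM-dd HH:mm:ss"));
withEpoch.select("epoch_sec").show();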
I think you are looking to use unix_timestamp().
Which you can import from:
import static org.apache.spark.sql.functions.unix_timestamp;
And use like:
df = df.withColumn(
"epoch",
unix_timestamp(col("date")));
And here is a full example, where I tried to mimic your use-case:
package net.jgp.books.spark.ch12.lab990_others;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_unixtime;
import static org.apache.spark.sql.functions.unix_timestamp;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
 * Use of from_unixtime() and unix_timestamp().
 *
 * @author jgp
 */
public class EpochTimestampConversionApp {

    /**
     * main() is your entry point to the application.
     *
     * @param args
     */
    public static void main(String[] args) {
        EpochTimestampConversionApp app = new EpochTimestampConversionApp();
        app.start();
    }

    /**
     * The processing code.
     */
    private void start() {
        // Creates a session on a local master
        SparkSession spark = SparkSession.builder()
                .appName("expr()")
                .master("local")
                .getOrCreate();

        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField(
                        "event",
                        DataTypes.IntegerType,
                        false),
                DataTypes.createStructField(
                        "original_ts",
                        DataTypes.StringType,
                        false) });

        // Building a df with a sequence of chronological timestamps
        List<Row> rows = new ArrayList<>();
        long now = System.currentTimeMillis() / 1000;
        for (int i = 0; i < 1000; i++) {
            rows.add(RowFactory.create(i, String.valueOf(now)));
            now += new Random().nextInt(3) + 1;
        }
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();
        df.printSchema();

        // Turning the timestamps to Timestamp datatype
        df = df.withColumn(
                "date",
                from_unixtime(col("original_ts")).cast(DataTypes.TimestampType));
        df.show();
        df.printSchema();

        // Turning back the timestamps to epoch
        df = df.withColumn(
                "epoch",
                unix_timestamp(col("date")));
        df.show();
        df.printSchema();

        // Collecting the result and printing out
        List<Row> timeRows = df.collectAsList();
        for (Row r : timeRows) {
            System.out.printf("[%d] : %s (%s)\n",
                    r.getInt(0),
                    r.getAs("epoch"),
                    r.getAs("date"));
        }
    }
}
And the output should be:
...
[994] : 1551997326 (2019-03-07 14:22:06)
[995] : 1551997329 (2019-03-07 14:22:09)
[996] : 1551997330 (2019-03-07 14:22:10)
[997] : 1551997332 (2019-03-07 14:22:12)
[998] : 1551997333 (2019-03-07 14:22:13)
[999] : 1551997335 (2019-03-07 14:22:15)
Hopefully this helps.
I'm trying to build a version of the decision tree classification example from Spark 2.0.2 org.apache.spark.examples.ml.JavaDecisionTreeClassificationExample. I can't use this directly because it uses libsvm-encoded data. I need to avoid libsvm (undocumented AFAIK) to classify ordinary datasets more easily. I'm trying to adapt the example to use a Kryo-encoded dataset instead.
The issue originates in the map call below, particularly the consequences of using Encoders.kryo as the encoder, as instructed by SparkML feature vectors and Spark 2.0.2 Encoders in Java
public SMLDecisionTree(Dataset<Row> incomingDS, final String label, final String[] features)
{
    this.incomingDS = incomingDS;
    this.label = label;
    this.features = features;
    this.mapSet = new StringToDoubleMapperSet(features);

    this.sdlDS = incomingDS
            .select(label, features)
            .filter(new FilterFunction<Row>()
            {
                public boolean call(Row row) throws Exception
                {
                    return !row.getString(0).equals(features[0]); // header
                }
            })
            .map(new MapFunction<Row, LabeledFeatureVector>()
            {
                public LabeledFeatureVector call(Row row) throws Exception
                {
                    double labelVal = mapSet.addValue(0, row.getString(0));
                    double[] featureVals = new double[features.length];
                    for (int i = 1; i < row.length(); i++)
                    {
                        Double val = mapSet.addValue(i, row.getString(i));
                        featureVals[i - 1] = val;
                    }
                    return new LabeledFeatureVector(labelVal, Vectors.dense(featureVals));
                }
                // https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset
            }, Encoders.kryo(LabeledFeatureVector.class));

    Dataset<LabeledFeatureVector>[] splits = sdlDS.randomSplit(new double[] { 0.7, 0.3 });
    this.trainingDS = splits[0];
    this.testDS = splits[1];
}
This impacts the StringIndexer and VectorIndexer from the original Spark example, which are unable to handle the resulting Kryo-encoded dataset. Here is the pipeline building code taken from the Spark decision tree example code:
public void run() throws IOException
{
    sdlDS.show();

    StringIndexerModel labelIndexer = new StringIndexer()
            .setInputCol("label")
            .setOutputCol("indexedLabel")
            .fit(df);

    VectorIndexerModel featureIndexer = new VectorIndexer()
            .setInputCol("features")
            .setOutputCol("indexedFeatures")
            .setMaxCategories(4) // treat features with > 4 distinct values as continuous.
            .fit(df);

    DecisionTreeClassifier classifier = new DecisionTreeClassifier()
            .setLabelCol("indexedLabel")
            .setFeaturesCol("indexedFeatures");

    IndexToString labelConverter = new IndexToString()
            .setInputCol("prediction")
            .setOutputCol("predictedLabel")
            .setLabels(labelIndexer.labels());

    Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]
            { labelIndexer, featureIndexer, classifier, labelConverter });
This code apparently expects a dataset with "label" and "features" columns containing the label and a Vector of double-encoded features. The problem is that Kryo produces a single column named "values" that seems to hold a byte array. I know of no documentation for how to convert this to what the original StringIndexer and VectorIndexer expect. Can someone help? Java please.
Don't use the Kryo encoder in the first place. It is very limited in general and not applicable here at all. The simplest solution here is to drop the custom class and use a Row encoder. First you'll need a bunch of imports:
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.ml.linalg.*;
import java.util.ArrayList;
import java.util.List;
and a schema:
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("label", DataTypes.DoubleType, false));
fields.add(DataTypes.createStructField("features", new VectorUDT(), false));
StructType schema = DataTypes.createStructType(fields);
The encoder can be defined like this:
Encoder<Row> encoder = RowEncoder.apply(schema);
and used as shown below:
Dataset<Row> inputDs = spark.read().json(sc.parallelize(Arrays.asList(
        "{\"label\": 1.0, \"features\": \"foo\"}"
)));

inputDs.map(new MapFunction<Row, Row>() {
    public Row call(Row row) {
        return RowFactory.create(1.0, Vectors.dense(1.0, 2.0));
    }
}, encoder);
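A hedged follow-up sketch (assuming the result of the map above is assigned to a Dataset<Row> named prepared): the mapped dataset now carries the "label" and "features" columns the original indexers expect, so they can be fitted directly.
// Hypothetical wiring back into the pipeline from the question
StringIndexerModel labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("indexedLabel")
        .fit(prepared);

VectorIndexerModel featureIndexer = new VectorIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(4) // treat features with > 4 distinct values as continuous
        .fit(prepared);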
Below is what the input file (csv) looks like:
Carrier_create_date,Message,REF_SHEET_CREATEDATE,7/1/2008
Carrier_create_time,Message,REF_SHEET_CREATETIME,8:53:57
Carrier_campaign,Analog,REF_SHEET_CAMPAIGN,25
Carrier_run_no,Analog,REF_SHEET_RUNNO,7
Below is the list of columns each row has:
(Carrier_create_date, Carrier_create_time, Carrier_campaign, Carrier_run_no)
Desired output as dataframe:
7/1/2008,8:53:57,25,7
Basically, the input file has a column name and a value on each row.
What I have tried so far is:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
object coater4CR {
  // Define the application name
  val AppName: String = "coater4CR"

  // Set the logging level
  Logger.getLogger("org.apache").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // define the input parameters
    val input_file = "/Users/gangadharkadam/myapps/NlrPraxair/src/main/resources/NLR_Praxair/2008/3QTR2008/Coater_4/C025007.csv"

    // Create the Spark configuration and the Spark context
    println("Initializing the Spark Context...")
    val conf = new SparkConf().setAppName(AppName).setMaster("local")

    // Define the Spark Context
    val sc = new SparkContext(conf)

    // Read the csv file
    val inputRDD = sc.wholeTextFiles(input_file)
      .flatMap(x => x._2.split(" "))
      .map(x => {
        val rowData = x.split("\n")
        var Carrier_create_date: String = ""
        var Carrier_create_time: String = ""
        var Carrier_campaign: String = ""
        var Carrier_run_no: String = ""
        for (data <- rowData) {
          if (data.trim().startsWith("Carrier_create_date")) {
            Carrier_create_date = data.split(",")(3)
          } else if (data.trim().startsWith("Carrier_create_time")) {
            Carrier_create_time = data.split(",")(3)
          } else if (data.trim().startsWith("Carrier_campaign")) {
            Carrier_campaign = data.split(",")(3)
          } else if (data.trim().startsWith("Carrier_run_no")) {
            Carrier_run_no = data.split(",")(3)
          }
        }
        (Carrier_create_date, Carrier_create_time, Carrier_campaign, Carrier_run_no)
      }).foreach(println)
  }
}
Issues with the above code:
When I run the above code, I get an empty tuple as below:
(,,,)
When I change
Carrier_campaign = data.split(",")(3)
to
Carrier_campaign = data.split(",")(2)
I get the below output, which is somewhat closer:
(REF_SHEET_CREATEDATE,REF_SHEET_CREATETIME,REF_SHEET_CAMPAIGN,REF_SHEET_RUNNO)
(,,,)
Somehow the above code is not able to pick the last column position from the data row, but it works for column positions 0, 1, and 2.
So my questions are:
What is wrong with the above code?
What is an efficient approach to read this multiline input and load it in tabular format into a database?
Appreciate any help/pointers on this. Thanks.