I'm trying to convert a text file into a Parquet file in Java. All the examples I can find either convert to Parquet from other file formats or are written in Scala/Python.
Here is what I came up with
import java.io.IOException;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
private static final StructField[] fields = new StructField[]{
new StructField("timeCreate", DataTypes.StringType, false, Metadata.empty()),
new StructField("cookieCreate", DataTypes.StringType, false,Metadata.empty())
};//simplified
private static final StructType schema = new StructType(fields);
public static void main(String[] args) throws IOException {
SparkSession spark = SparkSession
.builder().master("spark://levanhuong:7077")
.appName("Convert text file to Parquet")
.getOrCreate();
spark.conf().set("spark.executor.memory", "1G");
WriteParquet(spark, args);
}
public static void WriteParquet(SparkSession spark, String[] args){
JavaRDD<String> data = spark.read().textFile(args[0]).toJavaRDD();
JavaRDD<Row> output = data.map((Function<String, Row>) s -> {
DataModel model = new DataModel(s);
return RowFactory.create(model);
});
Dataset<Row> df = spark.createDataFrame(output.rdd(),schema);
df.printSchema();
df.show(2);
df.write().parquet(args[1]);
}
args[0] is the path to the input file and args[1] is the path to the output file. Here is the simplified DataModel; the DateTime fields are properly formatted in the set() functions.
public class DataModel implements Serializable {
DateTime timeCreate;
DateTime cookieCreate;
public DataModel(String data){
String model[] = data.split("\t");
setTimeCreate(model[0]);
setCookieCreate(model[1]);
}
// setters that format the DateTime fields omitted
}
And here is the error. The log points to df.show(2), but I think the error is caused by map(). I'm not sure why, since I don't see any casting in the code:
java.lang.ClassCastException: cannot assign instance of
java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1
of type org.apache.spark.api.java.function.Function in instance
of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
I think this is enough to recreate the error, please tell me if I need to provide any more information.
A slightly different approach can be used, and it works fine:
JavaRDD<String> data = spark().read().textFile(args[0]).toJavaRDD();
JavaRDD<DataModel> output = data.map(s -> {
String[] parts = s.split("\t");
return new DataModel(parts[0], parts[1]);
});
Dataset<Row> result = spark().createDataFrame(output, DataModel.class);
Class "DataModel" is better looks as simple TO, without functionality:
public class DataModel implements Serializable {
private final String timeCreate;
private final String cookieCreate;
public DataModel(String timeCreate, String cookieCreate) {
this.timeCreate = timeCreate;
this.cookieCreate = cookieCreate;
}
public String getTimeCreate() {
return timeCreate;
}
public String getCookieCreate() {
return cookieCreate;
}
}
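To finish the conversion, the resulting Dataset can then be written out as Parquet the same way as in the question (a minimal sketch; result and args refer to the snippets above):

result.printSchema();
result.show(2);
// write the DataFrame to the output path in Parquet format
result.write().parquet(args[1]);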
Related
I am trying to read data stored in a Hive table in S3, convert it to Avro format, and then consume the Avro records to build the final object and push it to a Kafka topic. In the object I am trying to publish, I have a nested object (CarCostDetails) that has fields with string and decimal types. When this object is null, I am able to push records to Kafka, but if this object is populated with any value (0, +/-) then I get this exception when I do producer.send():
org.apache.avro.UnresolvedUnionException: Not in union [{"type":"bytes","logicalType":"decimal","precision":18,"scale":4},"null"]: 40000.0000
I am not defining the schema in my project; I am using a predefined schema pulled in as an external dependency.
Example:
CarDataLoad.scala
class CarDataLoad extends ApplicationRunner with Serializable {
override def run(args: ApplicationArguments): Unit = {
val spark = new SparkSession.Builder()
.appName("s3-to-kafka")
.enableHiveSupport
.getOrCreate()
getData(spark)
}
def getData(sparkSession: SparkSession){
val avroPath = copyToAvro(sparkSession)
val car = sparkSession.read.avro(avroPath)
import sparkSession.implicits._
val avroData = car.select(
$"car_specs",
$"car_cost_details",
$"car_key"
)
ingestDataframeToKafka(sparkSession, avroData)
}
def copyToAvro(sparkSession: SparkSession): String = {
val sourceDf = sparkSession.read.table("sample_table")
val targetPath = s"s3://some/target/path"
//write to a path (internal libraries to do that) in avro format
targetPath
}
def ingestDataframeToKafka(sparkSession: SparkSession, dataframe: sql.DataFrame): Unit ={
val batchProducer: CarProducerClass = new CarProducerClass(kafkaBootstapServers, kafkaSchemaRegistryUrl,
kafkaClientIdConfig, topic)
dataframe.collect.foreach(
row => {
val result = batchProducer.publishRecord(row)
}
)
batchProducer.closeProducer();
}
}
Producer class -
CarProducerClass.java
import org.apache.kafka.clients.producer.*;
import org.apache.spark.sql.Row;
import java.io.Serializable;
import java.math.BigDecimal;
import java.sql.Timestamp;
import java.util.*;
public class CarProducerClass {
// fields such as props, producer, log, topic and the kafka* configuration values
// are declared elsewhere in the class (omitted here)
private void initializeProducer() {
log.info("Initializing producer");
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstapServers);
props.put("schema.registry.url", kafkaSchemaRegistryUrl);
props.put("acks", "1");
props.put("batch.size", 16384);
props.put("buffer.memory", 33554432);
props.put("retries",3);
props.put(ProducerConfig.CLIENT_ID_CONFIG, kafkaClientIdConfig);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("key.subject.name.strategy", "io.confluent.kafka.serializers.subject.TopicNameStrategy");
props.put("value.subject.name.strategy", "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
log.info("Created producer");
producer = new KafkaProducer(props);
}
public Boolean publishRecord(Row row) {
Boolean publishRecordFlag = false;
if (producer == null) {
initializeProducer();
}
Car.Builder car = Car.newBuilder();
car.setCarSpecs(buildCarSpecs(row.getAs("car_specs")));
car.setCarCostDetails(buildCarCostDetails(row.getAs("car_cost_details")));
CarKey.Builder carKey = CarKey.newBuilder();
Row car_key = row.getAs("car_key");
carKey.setKey(car_key.getAs("car_id"));
try {
ProducerRecord<CarKey, Car> producerRecord
= new ProducerRecord(topic, null, System.currentTimeMillis(), carKey.build(), car.build());
//Exception occurs here
RecordMetadata metadata = (RecordMetadata) producer.send(producerRecord).get();
publishRecordFlag = true;
} catch (Exception e) {
log.info("Exception caught");
e.printStackTrace();
}
return publishRecordFlag;
}
public CarSpecs buildCarSpecs(Row car_specs) {
CarSpecs.Builder kafkaCarSpecs = CarSpecs.newBuilder();
kafkaCarSpecs.setCarName("CX5");
kafkaCarSpecs.setCarBrand("Mazda");
return kafkaCarSpecs.build();
}
public CarCostDetails buildCarCostDetails(Row car_cost_details) {
CarCostDetails.Builder kafkaCarCostDetails = CarCostDetails.newBuilder();
kafkaCarCostDetails.setPurchaseCity(car_cost_details.getAs("purchase_city"));
kafkaCarCostDetails.setPurchaseState(car_cost_details.getAs("purchase_state"));
kafkaCarCostDetails.setBasePrice((BigDecimal) car_cost_details.getAs("base_price"));
kafkaCarCostDetails.setTax((BigDecimal) car_cost_details.getAs("tax"));
kafkaCarCostDetails.setTotalCost((BigDecimal) car_cost_details.getAs("total_cost"));
kafkaCarCostDetails.setOtherCosts((BigDecimal) car_cost_details.getAs("other_costs"));
return kafkaCarCostDetails.build();
}
public void closeProducer() {
producer.close();
}
}
Avro schemas (predefined in another project that is already in production)
CarSpecs.avdl
protocol CarSpecsProtocol {
record CarSpecs {
string name;
string brand;
}
}
CarCostDetails.avdl
protocol CarCostDetailsProtocol {
record CarCostDetails {
string purchase_city;
string purchase_state;
decimal(18, 4) base_price;
union { decimal(18,4), null} tax;
union { decimal(18,4), null} total_cost;
union { decimal(18,4), null} other_costs;
}
}
Car.avdl
protocol CarProtocol {
import idl "CarCostDetails.avdl";
import idl "CarSpecs.avdl";
record Car {
union { null, CarSpecs} car_specs = null;
union { null, CarCostDetails} car_cost_details = null;
}
}
CarKey.avdl
protocol CarKeyProtocol {
record CarKey {
string id;
}
}
Avro generated Java Objects
@AvroGenerated
public class CarSpecs extends SpecificRecordBase implements SpecificRecord {
//basic generated fields like Schema SCHEMA$, SpecificData MODEL$ etc
private String name;
private String brand;
}
import java.math.BigDecimal;

@AvroGenerated
public class CarCostDetails extends SpecificRecordBase implements SpecificRecord {
//basic generated fields like Schema SCHEMA$, SpecificData MODEL$ etc
private String purchaseCity;
private String purchaseState;
private BigDecimal basePrice;
private BigDecimal tax;
private BigDecimal totalCost;
private BigDecimal otherCosts;
}
@AvroGenerated
public class Car extends SpecificRecordBase implements SpecificRecord {
//basic generated fields like Schema SCHEMA$, SpecificData MODEL$ etc
private CarSpecs carSpecs;
private CarCostDetails carCostDetails;
}
@AvroGenerated
public class CarKey extends SpecificRecordBase implements SpecificRecord {
//basic generated fields like Schema SCHEMA$, SpecificData MODEL$ etc
private String id;
}
What I have already tried:
Passing the spark-avro package on the spark command line, like so: --packages org.apache.spark:spark-avro_2.11:2.4.3
Ordering the fields like they are in the actual schema
Setting a default value of 0 for all decimal/BigDecimal fields
Checking that the source's datatype for these fields is java.math.BigDecimal. It is.
Explicitly casting the value to BigDecimal (like in example above)
All the above still result in org.apache.avro.UnresolvedUnionException
Add the decimal conversion to the global configuration (do it once at runtime, before sending any messages to Kafka, e.g., in initializeProducer):
import org.apache.avro.specific.SpecificData;
import org.apache.avro.Conversions;
SpecificData.get().addLogicalTypeConversion(new Conversions.DecimalConversion());
You might have seen a similar line in the static initializer generated from the Avro schema and applied to MODEL$, so remember to add all conversions used in your messages.
The following observations are based on the Avro 1.10.1 library source and runtime behavior.
The MODEL$ configuration should be applied (see SpecificData.getForClass), but it might not be if SpecificData and your message class are loaded by different class loaders (that was the case in my application – two separate OSGi bundles).
In this case getForClass falls back to the global instance.
Then GenericData.resolveUnion throws UnresolvedUnionException because conversionsByClass does not contain a value with the BigDecimal.class key,
and getSchemaName, overridden in SpecificData, returns Schema.Type.STRING for the BigDecimal class (and a few others, see SpecificData.stringableClasses).
This STRING is then matched against the values defined in the union schema (getIndexNamed) and not found (because it is not "bytes" or "null").
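For completeness, here is roughly where that registration could live in the question's producer (a minimal sketch reusing the fields from the question's CarProducerClass; the conversion only needs to be registered once per JVM, before the first record is built or sent):

private void initializeProducer() {
    // register the BigDecimal <-> decimal(bytes) conversion on the global SpecificData
    // instance that GenericData.resolveUnion falls back to
    SpecificData.get().addLogicalTypeConversion(new Conversions.DecimalConversion());
    // ... the existing Properties/producer setup from the question ...
    producer = new KafkaProducer(props);
}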
I've downloaded a large amount of historic crypto market data via an API. It is formatted like this:
[
[1601510400000,"4.15540000","4.16450000","4.15010000","4.15030000","4483.01000000",1601510459999,"18646.50051400",50,"2943.27000000","12241.83706500","0"],
...
[1609490340000,"4.94020000","4.95970000","4.93880000","4.94950000","5307.62000000",1609490399999,"26280.03711000",98,"3751.46000000","18574.22402400","0"]
]
I take that to be an array of arrays, with the inner arrays containing heterogeneous types (always the same types in the same order). As an intermediate step I've saved it to text files, but I'd like to read it back and map it onto an array of objects of this type:
public class MinuteCandle {
private long openTime;
private double openValue;
private double highValue;
private double lowValue;
private double closeValue;
private double volume;
private long closeTime;
private double quoteAssetVolume;
private int numberOfTrades;
private double takerBuyBaseAssetVolume;
private double takerBuyQuoteAssetVolume;
private double someGarbageData;
//...
}
I'm using the Spring Framework and the Jackson library for JSON mapping. Is this doable with that, or should I parse the text manually somehow?
Use @JsonFormat and annotate your class with it, specifying the shape as ARRAY:
@JsonFormat(shape = JsonFormat.Shape.ARRAY)
class MinuteCandle
Also, consider using BigDecimal instead of double if you want to store a price.
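A minimal sketch of how this could look end-to-end (the file name is illustrative; @JsonPropertyOrder pins the order in which the array elements are mapped, which must match the API's column order):

import java.io.File;
import com.fasterxml.jackson.annotation.JsonFormat;
import com.fasterxml.jackson.annotation.JsonPropertyOrder;
import com.fasterxml.jackson.databind.ObjectMapper;

@JsonFormat(shape = JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "openTime", "openValue", "highValue", "lowValue", "closeValue",
    "volume", "closeTime", "quoteAssetVolume", "numberOfTrades",
    "takerBuyBaseAssetVolume", "takerBuyQuoteAssetVolume", "someGarbageData" })
public class MinuteCandle {
    // fields and getters/setters as in the question
}

// reading the file back into an array of candles:
ObjectMapper mapper = new ObjectMapper();
MinuteCandle[] candles = mapper.readValue(new File("candles.json"), MinuteCandle[].class);

Jackson coerces the quoted numeric strings (e.g. "4.15540000") into the double fields by default.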
See also:
A realistic example where using BigDecimal for currency is strictly better than using double
How to deserialise anonymous array of mixed types with Jackson
I would do this in two steps:
Read the JSON content into a list of List<Object> with Jackson.
Convert each List<Object> into a MinuteCandle object
and collect these objects into a list of MinuteCandles.
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
public class Main {
public static void main(String[] args) throws Exception {
ObjectMapper objectMapper = new ObjectMapper();
File file = new File("example.json");
List<List<Object>> lists = objectMapper.readValue(file, new TypeReference<List<List<Object>>>() {});
List<MinuteCandle> minuteCandles = new ArrayList<>();
for (List<Object> list : lists) {
minuteCandles.add(MinuteCandle.createFromList(list));
}
}
}
The conversion from List<Object> to MinuteCandle (step 2 from above)
could be achieved by adding a static method in your MinuteCandle class.
public static MinuteCandle createFromList(List<Object> list) {
MinuteCandle m = new MinuteCandle();
m.openTime = (Long) list.get(0);
m.openValue = Double.parseDouble((String) list.get(1));
m.highValue = Double.parseDouble((String) list.get(2));
m.lowValue = Double.parseDouble((String) list.get(3));
m.closeValue = Double.parseDouble((String) list.get(4));
m.volume = Double.parseDouble((String) list.get(5));
m.closeTime = (Long) list.get(6);
m.quoteAssetVolume = Double.parseDouble((String) list.get(7));
m.numberOfTrades = (Integer) list.get(8);
m.takerBuyBaseAssetVolume = Double.parseDouble((String) list.get(9));
m.takerBuyQuoteAssetVolume = Double.parseDouble((String) list.get(10));
m.someGarbageData = Double.parseDouble((String) list.get(11));
return m;
}
Assuming the text stored in the file is valid JSON, and similar to the solution in How to Read JSON data from txt file in Java?, one can use com.google.gson.Gson as follows:
import com.google.gson.Gson;
import java.io.FileReader;
import java.io.Reader;
public class Main {
public static void main(String[] args) throws Exception {
try (Reader reader = new FileReader("somefile.txt")) {
Gson gson = new Gson();
MinuteCandle[] features = gson.fromJson(reader, MinuteCandle[].class);
}
}
}
I am trying to write a List of POJO objects into a CSV file. I use opencsv and the code is very minimal:
StatefulBeanToCsv sbc = new StatefulBeanToCsvBuilder(writer)
.withSeparator(CSVWriter.DEFAULT_SEPARATOR)
.build();
I use custom converters while reading; can I do something similar for writing as well?
For example:
If a field is of type List, it gets written as "[a,b]", but I would like it written as "a,b".
If a field is of type LocalDateTime, I would like to write it in the format "MM/dd/yyyy"
and discard the time completely in the output CSV.
I want output to be something like this:
date of issue,items
"02/22/2020","a,b"
Instead of:
date of issue,items
"2020-02-22T00:00","[a,b]"
Thank you so much, appreciate any help
:)
You can use the @CsvDate annotation to set a custom date format and @CsvBindAndSplitByName to convert the list to a string.
Please find an example below:
import static java.time.temporal.ChronoUnit.MONTHS;
import com.opencsv.CSVWriter;
import com.opencsv.bean.CsvBindAndSplitByName;
import com.opencsv.bean.CsvBindByName;
import com.opencsv.bean.CsvDate;
import com.opencsv.bean.StatefulBeanToCsv;
import com.opencsv.bean.StatefulBeanToCsvBuilder;
import java.io.FileWriter;
import java.io.Writer;
import java.time.LocalDateTime;
import java.util.List;
public class Main {
public static void main(String[] args) throws Exception {
Writer writer = new FileWriter("example.csv");
StatefulBeanToCsv<Item> sbc = new StatefulBeanToCsvBuilder<Item>(writer)
.withSeparator(CSVWriter.DEFAULT_SEPARATOR)
.build();
List<Item> items = List.of(
new Item(LocalDateTime.now().minus(4, MONTHS), List.of("1", "s")),
new Item(LocalDateTime.now().minus(1, MONTHS), List.of("1", "d")),
new Item(LocalDateTime.now().minus(3, MONTHS), List.of("1", "2", "3"))
);
sbc.write(items);
writer.close();
}
public static class Item {
#CsvBindByName(column = "date")
#CsvDate(value = "yyyy-MM-dd'T'hh:mm")
private LocalDateTime date;
#CsvBindAndSplitByName(column = "list", elementType = String.class, writeDelimiter = ",")
private List<String> array;
Item(LocalDateTime date, List<String> array) {
this.date = date;
this.array = array;
}
public LocalDateTime getDate() {
return date;
}
public void setDate(LocalDateTime date) {
this.date = date;
}
public List<String> getArray() {
return array;
}
public void setArray(List<String> array) {
this.array = array;
}
}
}
The output of example.csv:
"DATE","LIST"
"2020-03-10T02:37","1,s"
"2020-06-10T02:37","1,d"
"2020-04-10T02:37","1,2,3"
Let's say I have a bean:
public class Msg {
private int code;
private Object data;
... Getter/setters...
}
And I convert it into JSON or XML with this kind of test code:
public String convert() {
Msg msg = new Msg();
msg.setCode( 42 );
msg.setData( "Are you suggesting coconuts migrate?" );
ObjectMapper mapper = new ObjectMapper();
return mapper.writeValueAsString( msg );
}
The output will be something like this:
{"code":42,"data":"Are you suggesting coconuts migrate?"}
Now let's say I want to replace the 'data' attribute with some dynamic name:
public String convert(String name) {
Msg msg = new Msg();
msg.setCode( 42 );
msg.setData( "Are you suggesting coconuts migrate?" );
ObjectMapper mapper = new ObjectMapper();
// ...DO SOMETHING WITH MAPPER ...
return mapper.writeValueAsString( msg );
}
If I call the function convert( "toto") I woukld like to have this output:
{"code":42,"toto":"Are you suggesting coconuts migrate?"}
If I call the function convert( "groovy") I woukld like to have this output:
{"code":42,"groovy":"Are you suggesting coconuts migrate?"}
Of course I could do a String replace after JSON creation, but if you have an answer with a programmatic approach I'll take it.
Thanks
You can use the PropertyNamingStrategy class to override a property name. See a simple implementation of this class:
class ReplaceNamingStrategy extends PropertyNamingStrategy {
private static final long serialVersionUID = 1L;
private Map<String, String> replaceMap;
public ReplaceNamingStrategy(Map<String, String> replaceMap) {
this.replaceMap = replaceMap;
}
@Override
public String nameForGetterMethod(MapperConfig<?> config, AnnotatedMethod method, String defaultName) {
if (replaceMap.containsKey(defaultName)) {
return replaceMap.get(defaultName);
}
return super.nameForGetterMethod(config, method, defaultName);
}
}
An example program could look like this:
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.PropertyNamingStrategy;
import com.fasterxml.jackson.databind.cfg.MapperConfig;
import com.fasterxml.jackson.databind.introspect.AnnotatedMethod;
public class JacksonProgram {
public static void main(String[] args) throws IOException {
Msg msg = new Msg();
msg.setCode(42);
msg.setData("Are you suggesting coconuts migrate?");
System.out.println(convert(msg, "test"));
System.out.println(convert(msg, "toto"));
System.out.println(convert(msg, "groovy"));
}
public static String convert(Msg msg, String name) throws IOException {
ObjectMapper mapper = new ObjectMapper();
mapper.setPropertyNamingStrategy(new ReplaceNamingStrategy(Collections.singletonMap("data", name)));
return mapper.writeValueAsString(msg);
}
}
The above program prints:
{"code":42,"test":"Are you suggesting coconuts migrate?"}
{"code":42,"toto":"Are you suggesting coconuts migrate?"}
{"code":42,"groovy":"Are you suggesting coconuts migrate?"}
One possibility would be to use so-called "any getter":
public class Msg {
public int code;
@JsonAnyGetter
public Map<String,Object> otherFields() {
Map<String,Object> extra = new HashMap<String,Object>();
extra.put("data", findDataObject()); // or whatever mechanism you want
extra.put("name", "Some Name");
return extra;
}
}
so that you can return an arbitrary set of dynamic properties.
There is also a matching "any setter" (@JsonAnySetter) mechanism you can use to accept additional properties when reading.
I want to read this csvFile into an array of Flight class objects in which each index will refer to an object containing a record from the csvFile.
Here is a blueprint of the Flight class. It's not complete, so I am only providing the data members.
public class Flight {
private String flightID;
private String source;
private String destination;
private <some class to handle time> dep;
private <some class to handle time> arr;
private String[] daysOfWeek;
private <some class to handle date> efff;
private <some class to handle date> efft;
private <some class to handle dates> exc;
}
I want to implement a function something like :
public class DataManager {
public List<Flight> readSpiceJet() {
return new ArrayList<Flight>();
}
}
Feel free to modify this and please help me. :)
Thanks in advance.
You can try the OpenCSV framework.
Have a look at this example:
import java.io.FileReader;
import java.util.List;
import com.opencsv.CSVReader;
import com.opencsv.bean.ColumnPositionMappingStrategy;
import com.opencsv.bean.CsvToBean;
public class ParseCSVtoJavaBean
{
public static void main(String args[])
{
CSVReader csvReader = null;
try
{
/**
* Reading the CSV File
* Delimiter is comma
* Default Quote character is double quote
* Start reading from line 1
*/
csvReader = new CSVReader(new FileReader("Employee.csv"),',','"',1);
//mapping of columns with their positions
ColumnPositionMappingStrategy mappingStrategy =
new ColumnPositionMappingStrategy();
//Set mappingStrategy type to Employee Type
mappingStrategy.setType(Employee.class);
//Fields in Employee Bean
String[] columns = new String[]{"empId","firstName","lastName","salary"};
//Setting the colums for mappingStrategy
mappingStrategy.setColumnMapping(columns);
//create instance for CsvToBean class
CsvToBean<Employee> ctb = new CsvToBean<>();
//parsing csvReader(Employee.csv) with mappingStrategy
List<Employee> empList = ctb.parse(mappingStrategy, csvReader);
//Print the Employee Details
for(Employee emp : empList)
{
System.out.println(emp.getEmpId()+" "+emp.getFirstName()+" "
+emp.getLastName()+" "+emp.getSalary());
}
}
catch(Exception ee)
{
ee.printStackTrace();
}
finally
{
try
{
//closing the reader
csvReader.close();
}
catch(Exception ee)
{
ee.printStackTrace();
}
}
}
}
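The example assumes an Employee bean along these lines (a minimal sketch; the field names must match the columns array passed to the mapping strategy):

public class Employee {
    private String empId;
    private String firstName;
    private String lastName;
    private String salary;

    public String getEmpId() { return empId; }
    public void setEmpId(String empId) { this.empId = empId; }
    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public String getSalary() { return salary; }
    public void setSalary(String salary) { this.salary = salary; }
}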
EDIT 1:
To parse dates:
// requires java.util.Date, java.text.SimpleDateFormat and java.text.ParseException
String dateString;
Date date;
public void setDateString(String dateString) throws ParseException {
this.dateString = dateString;
// parse the incoming string and keep the Date field in sync
// (adjust the pattern to whatever format the CSV actually uses)
this.date = new SimpleDateFormat("MM/dd/yyyy").parse(dateString);
}
public void setDate(Date date) {
this.date = date;
}