Spark - Java - Create Parquet/Avro Without Using DataFrames of Spark SQL

I want to get the output of a Spark application (we only use core Spark, and the people working on the project do not want to change it to Spark SQL) as Parquet or Avro files.
When I looked into these two file types, I couldn't find any example that doesn't use DataFrames, or Spark SQL in general. Can I achieve this without using Spark SQL?
My data is tabular: it has columns, but during processing all of the data is used, not just a single column. Its columns are decided at runtime, so there are no generic columns like "name, ID, address". It looks like this:
No f1 f2 f3 ...
1, 123.456, 123.457, 123.458, ...
2, 123.789, 123.790, 123.791, ...
...

You can't save an RDD as Parquet without converting it to a DataFrame. An RDD has no schema, but Parquet is a columnar format that needs one, so we need to convert the RDD to a DataFrame.
You can use the createDataFrame API.
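For reference, a minimal sketch of that approach; the SparkSession, the JavaRDD<Object[]> holding one row per element, and the all-double column types are illustrative assumptions, not taken from the question:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RddToParquet {
    public static void write(SparkSession spark, JavaRDD<Object[]> dataRdd,
                             List<String> columnNames, String outputPath) {
        // Build the schema from the column names that are only known at runtime
        List<StructField> fields = new ArrayList<>();
        for (String name : columnNames) {
            fields.add(DataTypes.createStructField(name, DataTypes.DoubleType, true));
        }
        StructType schema = DataTypes.createStructType(fields);

        // Wrap each value array in a Row, keeping the column order
        JavaRDD<Row> rowRdd = dataRdd.map(RowFactory::create);

        Dataset<Row> df = spark.createDataFrame(rowRdd, schema);
        df.write().parquet(outputPath);
        // For Avro output, add the spark-avro package and use df.write().format("avro").save(outputPath)
    }
}

The DataFrame is used only at the very end as a thin writing layer; the rest of the job can stay on the core RDD API.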

I tried this and it works like a champ...
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.spark.api.java.JavaRDD;

public class ParquetHelper {

    private static ParquetWriter<GenericData.Record> writer = null;
    private static Schema schema;

    public ParquetHelper(Schema schema, String pathName) {
        try {
            Path path = new Path(pathName);
            writer = AvroParquetWriter.<GenericData.Record>builder(path)
                    .withRowGroupSize(ParquetWriter.DEFAULT_BLOCK_SIZE)
                    .withPageSize(ParquetWriter.DEFAULT_PAGE_SIZE)
                    .withSchema(schema)
                    .withConf(new Configuration())
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .withValidation(true)
                    .withDictionaryEncoding(false)
                    .build();
            ParquetHelper.schema = schema;
        } catch (IOException e) {
            throw new IllegalStateException("Could not open Parquet writer for " + pathName, e);
        }
    }

    /**
     * Writes every valid record of the RDD to the Parquet file and closes the writer.
     */
    public static void writeToParquet(JavaRDD<GenericData.Record> empRDDRecords) throws IOException {
        empRDDRecords.foreach(record -> {
            if (null != record && new RecordValidator().validate(record, schema).isEmpty()) {
                writeToParquet(record);
            } // TODO collect bad records here
        });
        writer.close();
    }

    // Single-record write used above; assumed to simply delegate to the ParquetWriter
    private static void writeToParquet(GenericData.Record record) throws IOException {
        writer.write(record);
    }
}
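A quick usage sketch for the helper above; the Avro schema and paths are illustrative, not from the question. Note that foreach runs on the executors, so in a real cluster the writer has to be reachable there (for example by writing per partition):

// Illustrative schema matching the tabular data in the question
Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Measurement\",\"fields\":["
      + "{\"name\":\"No\",\"type\":\"int\"},"
      + "{\"name\":\"f1\",\"type\":\"double\"},"
      + "{\"name\":\"f2\",\"type\":\"double\"}]}");

new ParquetHelper(schema, "/output/data.parquet");
ParquetHelper.writeToParquet(recordRdd);   // recordRdd is a JavaRDD<GenericData.Record> built elsewhere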

Related

How to convert Dataset<Row> to List<GenericRecord>

I would like to know how to convert a Dataset<Row> to a List<GenericRecord>.
I'm referring to:
org.apache.avro.generic.GenericRecord
org.apache.spark.sql.Dataset
org.apache.spark.sql.Row
Dataset<Row> data = spark.sql(SQL_QUERY);
The result is different per SQL_QUERY, therefore the schema can be different per use case.
Important to know: I'm reading from an Iceberg table, which saves files as .avro under the hood.
My current thinking is to find a way to convert each Row of the Dataset<Row> to byte[] and then to:
public static List<GenericRecord> deserialize(byte[] bytes) {
    List<GenericRecord> records = new ArrayList<>();
    try {
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new SeekableByteArrayInput(bytes),
                new ExpectedSpecificDatumReader()
        );
        while (reader.hasNext()) {
            records.add(reader.next(null));
        }
        reader.close();
    } catch (Exception e) {
        throw new Error(e);
    }
    return records;
}
Would appreciate your help here :)
Iceberg has a utility class that may help you: org.apache.iceberg.spark.SparkValueConverter
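If you would rather not depend on that class, here is a hedged sketch of a manual conversion. It assumes an Avro Schema whose field names match the Dataset's columns and a result small enough to collect to the driver; logical types (timestamps, decimals) may need extra handling:

import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class RowsToGenericRecords {
    public static List<GenericRecord> convert(Dataset<Row> data, Schema avroSchema) {
        List<GenericRecord> records = new ArrayList<>();
        for (Row row : data.collectAsList()) {
            GenericRecord record = new GenericData.Record(avroSchema);
            for (Schema.Field field : avroSchema.getFields()) {
                // fieldIndex throws if a column is missing, which surfaces schema drift early
                record.put(field.name(), row.get(row.fieldIndex(field.name())));
            }
            records.add(record);
        }
        return records;
    }
}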

Java Stream a Large SQL Query into API CSV File

I am writing a service that obtains data from a large SQL query in the database (over 100,000 records) and streams it into a CSV file served by an API. Is there any Java library function that does this, or any way to make the code below more efficient? I'm currently using Java 8 in a Spring Boot environment.
The code is below, with the SQL repository method and the service for the CSV. Preferably I want to write to the CSV file while the data is still being fetched from SQL, as the query may take 2-3 minutes for the user.
We are using Snowflake DB.
public class ProductService {

    private static final Logger logger = LoggerFactory.getLogger(ProductService.class);

    private final ProductRepository productRepository;
    private final ExecutorService executorService;

    public ProductService(ProductRepository productRepository) {
        this.productRepository = productRepository;
        this.executorService = Executors.newFixedThreadPool(20);
    }

    public InputStream getProductExportFile(ProductExportFilters filters) throws IOException {
        PipedInputStream is = new PipedInputStream();
        PipedOutputStream os = new PipedOutputStream(is);
        executorService.execute(() -> {
            try {
                Stream<ProductExport> productStream = productRepository.getProductExportStream(filters);
                Field[] fields = Stream.of(ProductExport.class.getDeclaredFields())
                        .peek(f -> f.setAccessible(true))
                        .toArray(Field[]::new);
                String[] headers = Stream.of(fields)
                        .map(Field::getName)
                        .toArray(String[]::new);
                CSVFormat csvFormat = CSVFormat.DEFAULT.builder()
                        .setHeader(headers)
                        .build();
                OutputStreamWriter outputStreamWriter = new OutputStreamWriter(os);
                CSVPrinter csvPrinter = new CSVPrinter(outputStreamWriter, csvFormat);
                productStream.forEach(productExport -> writeProductExportToCsv(productExport, csvPrinter, fields));
                // Close the printer first so buffered records are flushed to the underlying writer
                csvPrinter.close();
                outputStreamWriter.close();
            } catch (Exception e) {
                logger.warn("Unable to complete writing to csv stream.", e);
            } finally {
                try {
                    os.close();
                } catch (IOException ignored) { }
            }
        });
        return is;
    }

    private void writeProductExportToCsv(ProductExport productExport, CSVPrinter csvPrinter, Field[] fields) {
        Object[] values = Stream.of(fields)
                .map(f -> {
                    try {
                        return f.get(productExport);
                    } catch (IllegalAccessException e) {
                        return null;
                    }
                })
                .toArray();
        try {
            csvPrinter.printRecord(values);
            csvPrinter.flush();
        } catch (IOException e) {
            logger.warn("Unable to write record to file.", e);
        }
    }
}

// Repository method
public Stream<ProductExport> getProductExportStream(ProductExportFilters filters) {
    MapSqlParameterSource parameterSource = new MapSqlParameterSource();
    parameterSource.addValue("customerId", filters.getCustomerId().toString());
    parameterSource.addValue("practiceId", filters.getPracticeId().toString());
    StringBuilder sqlQuery = new StringBuilder("SELECT * FROM dbo.Product ");
    sqlQuery.append("\nWHERE CUSTOMERID = :customerId\n" +
            "AND PRACTICEID = :practiceId\n"
    );
Streaming allows you to transfer the data little by little, without having to load it all into the server's memory. You can do your processing in the extractData() method of a ResultSetExtractor. You can find the Javadoc for ResultSetExtractor here.
You can view an example using ResultSetExtractor here.
You can also easily run your queries as a ResultSet using JdbcTemplate; you can take a look at an example of using a ResultSetExtractor with it here.
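To make that concrete, here is a hedged sketch of the ResultSetExtractor approach; the template, printer, SQL and parameter names are placeholders rather than code from the question:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.sql.ResultSetMetaData;
import org.apache.commons.csv.CSVPrinter;
import org.springframework.jdbc.core.ResultSetExtractor;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class CsvResultSetExtractor {

    public void streamToCsv(NamedParameterJdbcTemplate jdbcTemplate,
                            MapSqlParameterSource params,
                            CSVPrinter csvPrinter) {
        String sql = "SELECT * FROM dbo.Product WHERE CUSTOMERID = :customerId AND PRACTICEID = :practiceId";
        jdbcTemplate.query(sql, params, (ResultSetExtractor<Void>) rs -> {
            try {
                ResultSetMetaData meta = rs.getMetaData();
                // Each row is written as soon as it is read; nothing is materialized in memory
                while (rs.next()) {
                    Object[] values = new Object[meta.getColumnCount()];
                    for (int i = 0; i < values.length; i++) {
                        values[i] = rs.getObject(i + 1);
                    }
                    csvPrinter.printRecord(values);
                }
                csvPrinter.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return null;
        });
    }
}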
There is a product which we bought for our company some time ago; we even got the source code back then: https://northconcepts.com/ We also evaluated Apache Camel, which had similar support, but it didn't suit our goal. If you really need speed you should go to the lowest level possible: pure JDBC and as simple a CSV writer as possible.
The NorthConcepts library itself provides the capability to read from JDBC and write to CSV at a lower level. We found a few tweaks which sped up the transmission and processing. With a single thread we are able to stream 100,000 records (with 400 columns) within 1-2 minutes.
Given that you didn't specify which database you use, I can only give you generic answers.
In general, code like this is network limited: a JDBC result set is usually transferred in packets of "only n rows", and only when you exhaust one does the database fetch the next packet. This property is often called the fetch size, and you should greatly increase it. With default settings, most databases transfer 10-100 rows per fetch. In Spring you can use the setFetchSize property. Some benchmarks here.
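For example, with a plain Spring JdbcTemplate (the class name and the value 5000 below are illustrative, not from the answer):

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class FetchSizeConfig {
    public static JdbcTemplate largeFetchTemplate(DataSource dataSource) {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
        // Rows fetched per network round trip; drivers often default to 10-100
        jdbcTemplate.setFetchSize(5000);
        return jdbcTemplate;
    }
}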
There is other similar low-level stuff you could do. For example, the Oracle JDBC driver has "InsensitiveResultSetBufferSize", which controls how big (in bytes) the buffer holding the result set is. But those things tend to be database specific.
That being said, the best way to really increase the speed of your transfer is to launch multiple queries. Divide your data on some column value, and then launch multiple parallel queries. Essentially, if you can design the data so that parallel queries work on easily distinguished subsets, the bottleneck moves to network or CPU throughput.
For example, one of your columns might be 'timestamp'. Instead of having one query fetch all rows, fetch multiple subsets of rows with a query like this:
SELECT * FROM dbo.Product
WHERE CUSTOMERID = :customerId
AND PRACTICEID = :practiceId
AND :lowerLimit <= timestamp AND timestamp < :upperLimit
Launch this query in parallel with different timestamp ranges. Aggregate the results of those subqueries in a shared ConcurrentLinkedQueue and build the CSV there, as sketched below.
With a similar approach I regularly read 100,000 rows/sec from an 80-column table in an Oracle DB. That is a 40-60 MB/sec sustained transfer rate from a table which is not even locked.
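A hedged sketch of that partitioning idea; the repository method getRowsBetween, the range list and the pool size are hypothetical and only illustrate the shape of the solution:

import java.sql.Timestamp;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelExport {

    public ConcurrentLinkedQueue<Object[]> fetchInParallel(List<Timestamp[]> ranges,
                                                           ProductRepository repository)
            throws InterruptedException {
        ConcurrentLinkedQueue<Object[]> rows = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        for (Timestamp[] range : ranges) {
            // Each task runs the same query restricted to [range[0], range[1])
            // getRowsBetween is a hypothetical repository method returning the rows of that slice
            pool.execute(() -> repository.getRowsBetween(range[0], range[1])
                                         .forEach(rows::add));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return rows;    // build the CSV from this shared queue afterwards
    }
}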

Skip header while reading a CSV file in Apache Beam

I want to skip the header line of a CSV file. As of now, I'm removing the header manually before loading it to Google Storage.
Below is my code :
PCollection<String> financeobj = p.apply(TextIO.read().from("gs://storage_path/Financials.csv"));

PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype
    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] strArr = c.element().split(",");
        ClassFinance fin = new ClassFinance();
        fin.setBeneficiaryFinance(strArr[0]);
        fin.setCatlibCode(strArr[1]);
        fin.set_rNR_(Double.valueOf(strArr[2]));
        fin.set_rNCS_(Double.valueOf(strArr[3]));
        fin.set_rCtb_(Double.valueOf(strArr[4]));
        fin.set_rAC_(Double.valueOf(strArr[5]));
        c.output(fin);
    }
}));
I have checked the existing question on Stack Overflow but I don't find it promising: Skipping header rows - is it possible with Cloud DataFlow?
Any help?
Edit: I have tried something like the following and it worked:
PCollection<String> financeobj = p.apply(TextIO.read().from("gs://google-bucket/final_input/Financials123.csv"));

PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype
    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] strArr2 = c.element().split(",");
        String header = Arrays.toString(strArr2);
        ClassFinance fin = new ClassFinance();
        if (header.contains("Beneficiary")) {
            System.out.println("Header");
        } else {
            fin.setBeneficiaryFinance(strArr2[0].trim());
            fin.setCatlibCode(strArr2[1].trim());
            fin.setrNR(Double.valueOf(strArr2[2].trim().replace("", "0")));
            fin.setrNCS(Double.valueOf(strArr2[3].trim().replace("", "0")));
            fin.setrCtb(Double.valueOf(strArr2[4].trim().replace("", "0")));
            fin.setrAC(Double.valueOf(strArr2[5].trim().replace("", "0")));
            c.output(fin);
        }
    }
}));
The older Stack Overflow post that you shared (Skipping header rows - is it possible with Cloud DataFlow?) does contain the answer to your question.
This option is currently not available in the Apache Beam SDK, although there is an open Feature Request in the Apache Beam JIRA issue tracker, BEAM-123. Note that, as of writing, this feature request is still open and unresolved, and it has been like that for 2 years already. However, it looks like some effort is being done in that sense, and the latest update in the issue is from February 2018, so I would advise you to stay updated on that JIRA issue, as it was last moved to the sdk-java-core component, and it may be getting more attention there.
With that information in mind, I would say that the approach you are using (removing the header before uploading the file to GCS) is the best option for you. I would refrain from doing it manually, as you can easily script that and automate the remove header ⟶ upload file process.
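For instance, a minimal sketch (not from the answer) that strips the first line of a local file before uploading it to GCS; the paths are placeholders, and the whole file is read into memory, which is fine for moderately sized CSVs:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StripHeader {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("Financials.csv");            // original file with the header row
        Path out = Paths.get("Financials_noheader.csv");  // file to upload to gs://...
        try (Stream<String> lines = Files.lines(in)) {
            // skip(1) drops the header row; the remaining lines are written unchanged
            Files.write(out, lines.skip(1).collect(Collectors.toList()));
        }
    }
}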
EDIT:
I have been able to come up with a simple filter using a DoFn. It might not be the most elegant solution (I am not an Apache Beam expert myself), but it does work, and you may be able to adapt it to your needs. It requires that you know beforehand the header of the CSV files being uploaded (as it will be filtering by element content), but again, take this just as a template that you may be able to modify to your needs:
public class RemoveCSVHeader {

    // The Filter class
    static class FilterCSVHeaderFn extends DoFn<String, String> {
        String headerFilter;

        public FilterCSVHeaderFn(String headerFilter) {
            this.headerFilter = headerFilter;
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            String row = c.element();
            // Filter out elements that match the header
            if (!row.equals(this.headerFilter)) {
                c.output(row);
            }
        }
    }

    // The main class
    public static void main(String[] args) throws IOException {
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        PCollection<String> vals = p.apply(TextIO.read().from("gs://BUCKET/FILE.csv"));

        String header = "col1,col2,col3,col4";

        vals.apply(ParDo.of(new FilterCSVHeaderFn(header)))
            .apply(TextIO.write().to("out"));

        p.run().waitUntilFinish();
    }
}
This code works for me. I have used Filter.by() to filter out the header row from the CSV file.
static void run(GcsToDbOptions options) {
    Pipeline p = Pipeline.create(options);

    // Read the CSV file from GCS input file path
    p.apply("Read Rows from " + options.getInputFile(), TextIO.read()
            .from(options.getInputFile()))
        // filter the header row
        .apply("Remove header row",
            Filter.by((String row) -> !(row.startsWith("dwid") || row.startsWith("\"dwid\"")
                || row.startsWith("'dwid'"))))
        // write the rows to database using prepared statement
        .apply("Write to Auths Table in Postgres", JdbcIO.<String>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(dataSource(options)))
            .withStatement(INSERT_INTO_MYTABLE)
            .withPreparedStatementSetter(new StatementSetter()));

    PipelineResult result = p.run();
    try {
        result.getState();
        result.waitUntilFinish();
    } catch (UnsupportedOperationException e) {
        // do nothing
    } catch (Exception e) {
        e.printStackTrace();
    }
}
https://medium.com/@baranitharan/the-textio-write-1be1c07fbef0
The TextIO.Write in Dataflow now has a withHeader function to add a header row to the data. This function was added in version 1.7.0.
So you can add a header to your csv like this:
TextIO.Write.named("WriteToText")
    .to("/path/to/the/file")
    .withHeader("col_name1,col_name2,col_name3,col_name4")
    .withSuffix(".csv");
The withHeader function automatically adds a newline character at the end of the header row.

Apache Flink transform DataStream (source) to a List?

My question is how to transform a DataStream into a List, for example in order to be able to iterate through it.
The code looks like :
package flinkoracle;

//imports
//....

public class FlinkOracle {

    final static Logger LOG = LoggerFactory.getLogger(FlinkOracle.class);

    public static void main(String[] args) {
        LOG.info("Starting...");

        // get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        TypeInformation[] fieldTypes = new TypeInformation[]{BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO};
        RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);

        // get the source from Oracle DB
        DataStream<?> source = env
                .createInput(JDBCInputFormat.buildJDBCInputFormat()
                        .setDrivername("oracle.jdbc.driver.OracleDriver")
                        .setDBUrl("jdbc:oracle:thin:@localhost:1521")
                        .setUsername("user")
                        .setPassword("password")
                        .setQuery("select * from table1")
                        .setRowTypeInfo(rowTypeInfo)
                        .finish());

        source.print().setParallelism(1);

        try {
            LOG.info("----------BEGIN----------");
            env.execute();
            LOG.info("----------END----------");
        } catch (Exception e) {
            e.printStackTrace();
        }

        LOG.info("End...");
    }
}
Thanks a lot in advance.
Br
Tamas
Flink provides an iterator sink to collect DataStream results for testing and debugging purposes. It can be used as follows:
import org.apache.flink.contrib.streaming.DataStreamUtils;

DataStream<Tuple2<String, Integer>> myResult = ...
Iterator<Tuple2<String, Integer>> myOutput = DataStreamUtils.collect(myResult);
You can copy the iterator to a new list like this:
while (iter.hasNext())
    list.add(iter.next());
Flink also provides a bunch of simple write*() methods on DataStream that are mainly intended for debugging purposes. The data flushing to the target system depends on the implementation of the OutputFormat, which means that not all elements sent to the OutputFormat immediately show up in the target system. Note: these write*() methods do not participate in Flink's checkpointing, and in failure cases those records might be lost.
writeAsText() / TextOutputFormat
writeAsCsv(...) / CsvOutputFormat
print() / printToErr()
writeUsingOutputFormat() / FileOutputFormat
writeToSocket
Source: link.
You may need to add the following dependency to use DataStreamUtils:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-contrib</artifactId>
<version>0.10.2</version>
</dependency>
In newer versions, DataStreamUtils::collect has been deprecated. Instead you can use DataStream::executeAndCollect which, if given a limit, will return a List of at most that size.
var list = source.executeAndCollect(100);
If you do not know how many elements there are, or if you simply want to iterate through the results without loading them all into memory at once, then you can use the no-arg version to get a CloseableIterator
try (var iterator = source.executeAndCollect()) {
// do something
}

Read an Access database which uses linked tables with Excel sheets

I am trying to read data from an Access database using the Java library Jackcess. The database has several tables and queries, some of which are linked tables pointing to Excel sheets on the file-system.
I saw that I can use a LinkResolver to intercept the resolving of the linked data, but it expects a full-blown Database, not just data for one single table.
I can easily use Apache POI to open the Excel file and extract the necessary data, but I don't know how I can pass the data in the LinkResolver.
What is the simplest way to provide the location of the Excel file or read the data from the Excel file and pass it back to Jackcess so it can load the linked data successfully?
At this point in time, the LinkResolver API is only built for loading "remote" Table instances from other databases. It was not built to be a general purpose API for any type of external file. You could certainly file a feature request with the Jackcess project.
UPDATE:
As of the 2.1.7 release, Jackcess provides the CustomLinkResolver utility to facilitate loading linked tables from files which are not Access databases (using a temporary db).
I came up with the following initial implementation of a LinkResolver which builds a temporary database with the content of the Excel file. It still lacks some things, like close handling and removal of the temporary database file, but it seems to work for basic purposes.
/**
 * Sample LinkResolver which reads the data from an Excel file.
 * The data is read from the first sheet and needs to contain a
 * header row with column names and then data rows with string/numeric values.
 */
public class ExcelFileLinkResolver implements LinkResolver {
    private final LinkResolver parentResolver;
    private final String fileNameInDB;
    private final String tableName;
    private final File excelFile;

    public ExcelFileLinkResolver(LinkResolver parentResolver, String fileNameInDB, File excelFile, String tableName) {
        this.parentResolver = parentResolver;
        this.fileNameInDB = fileNameInDB;
        this.excelFile = excelFile;
        this.tableName = tableName;
    }

    @Override
    public Database resolveLinkedDatabase(Database linkerDb, String linkeeFileName) throws IOException {
        if (linkeeFileName.equals(fileNameInDB)) {
            // TODO: handle close or create database in-memory if possible
            File tempFile = File.createTempFile("LinkedDB", ".mdb");
            Database linkedDB = DatabaseBuilder.create(Database.FileFormat.V2003, tempFile);
            try (Workbook wb = WorkbookFactory.create(excelFile, null, true)) {
                TableBuilder tableBuilder = new TableBuilder(tableName);
                Table table = null;
                List<Object[]> rows = new ArrayList<>();
                for (org.apache.poi.ss.usermodel.Row row : wb.getSheetAt(0)) {
                    if (table == null) {
                        for (Cell cell : row) {
                            tableBuilder.addColumn(new ColumnBuilder(cell.getStringCellValue()
                                    // column names cannot contain some characters
                                    .replace(".", ""),
                                    DataType.TEXT));
                        }
                        table = tableBuilder.toTable(linkedDB);
                    } else {
                        List<String> values = new ArrayList<>();
                        for (Cell cell : row) {
                            if (cell.getCellTypeEnum() == CellType.NUMERIC) {
                                values.add(Double.toString(cell.getNumericCellValue()));
                            } else {
                                values.add(cell.getStringCellValue());
                            }
                        }
                        rows.add(values.toArray());
                    }
                }

                Preconditions.checkNotNull(table, "Did not have a row in " + excelFile);
                table.addRows(rows);
            } catch (InvalidFormatException e) {
                throw new IllegalStateException(e);
            }

            return linkedDB;
        }

        return parentResolver.resolveLinkedDatabase(linkerDb, linkeeFileName);
    }
}
