skip header while reading a CSV file in Apache Beam - java

I want to skip header line from a CSV file. As of now I'm removing the header manually before loading it to google storage.
Below is my code :
PCollection<String> financeobj =p.apply(TextIO.read().from("gs://storage_path/Financials.csv"));
PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype
private static final long serialVersionUID = 1L;
#ProcessElement
public void processElement(ProcessContext c) {
String[] strArr = c.element().split(",");
ClassFinance fin = new ClassFinance();
fin.setBeneficiaryFinance(strArr[0]);
fin.setCatlibCode(strArr[1]);
fin.set_rNR_(Double.valueOf(strArr[2]));
fin.set_rNCS_(Double.valueOf(strArr[3]));
fin.set_rCtb_(Double.valueOf(strArr[4]));
fin.set_rAC_(Double.valueOf(strArr[5]));
c.output(fin);
}
}));
I have checked the existing question in stackoverflow but I dont find it promising : Skipping header rows - is it possible with Cloud DataFlow?
Any help ?
Edit : I have tried something like below and it worked :
PCollection<String> financeobj = p.apply(TextIO.read().from("gs://google-bucket/final_input/Financials123.csv"));
PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype
private static final long serialVersionUID = 1L;
#ProcessElement
public void processElement(ProcessContext c) {
String[] strArr2 = c.element().split(",");
String header = Arrays.toString(strArr2);
ClassFinance fin = new ClassFinance();
if(header.contains("Beneficiary"))
System.out.println("Header");
else {
fin.setBeneficiaryFinance(strArr2[0].trim());
fin.setCatlibCode(strArr2[1].trim());
fin.setrNR(Double.valueOf(strArr2[2].trim().replace("", "0")));
fin.setrNCS(Double.valueOf(strArr2[3].trim().replace("", "0")));
fin.setrCtb(Double.valueOf(strArr2[4].trim().replace("", "0")));
fin.setrAC(Double.valueOf(strArr2[5].trim().replace("", "0")));
c.output(fin);
}
}
}));

The older Stack Overflow post that you shared (Skipping header rows - is it possible with Cloud DataFlow?) does contain the answer to your question.
This option is currently not available in the Apache Beam SDK, although there is an open Feature Request in the Apache Beam JIRA issue tracker, BEAM-123. Note that, as of writing, this feature request is still open and unresolved, and it has been like that for 2 years already. However, it looks like some effort is being done in that sense, and the latest update in the issue is from February 2018, so I would advise you to stay updated on that JIRA issue, as it was last moved to the sdk-java-core component, and it may be getting more attention there.
With that information in mind, I would say that the approach you are using (removing the header before uploading the file to GCS) is the best option for you. I would refrain from doing it manually, as you can easily script that and automate the remove header ⟶ upload file process.
EDIT:
I have been able to come up with a simple filter using a DoFn. It might not be the most elegant solution (I am not an Apache Beam expert myself), but it does work, and you may be able to adapt it to your needs. It requires that you know beforehand the header of the CSV files being uploaded (as it will be filtering by element content), but again, take this just as a template that you may be able to modify to your needs:
public class RemoveCSVHeader {
// The Filter class
static class FilterCSVHeaderFn extends DoFn<String, String> {
String headerFilter;
public FilterCSVHeaderFn(String headerFilter) {
this.headerFilter = headerFilter;
}
#ProcessElement
public void processElement(ProcessContext c) {
String row = c.element();
// Filter out elements that match the header
if (!row.equals(this.headerFilter)) {
c.output(row);
}
}
}
// The main class
public static void main(String[] args) throws IOException {
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<String> vals = p.apply(TextIO.read().from("gs://BUCKET/FILE.csv"));
String header = "col1,col2,col3,col4";
vals.apply(ParDo.of(new FilterCSVHeaderFn(header)))
.apply(TextIO.write().to("out"));
p.run().waitUntilFinish();
}
}

This code works for me. I have used Filter.by() to filter out the header row from csv file.
static void run(GcsToDbOptions options) {
Pipeline p = Pipeline.create(options);
// Read the CSV file from GCS input file path
p.apply("Read Rows from " + options.getInputFile(), TextIO.read()
.from(options.getInputFile()))
// filter the header row
.apply("Remove header row",
Filter.by((String row) -> !((row.startsWith("dwid") || row.startsWith("\"dwid\"")
|| row.startsWith("'dwid'")))))
// write the rows to database using prepared statement
.apply("Write to Auths Table in Postgres", JdbcIO.<String>write()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(dataSource(options)))
.withStatement(INSERT_INTO_MYTABLE)
.withPreparedStatementSetter(new StatementSetter()));
PipelineResult result = p.run();
try {
result.getState();
result.waitUntilFinish();
} catch (UnsupportedOperationException e) {
// do nothing
} catch (Exception e) {
e.printStackTrace();
}}

https://medium.com/#baranitharan/the-textio-write-1be1c07fbef0
The TextIO.Write in Dataflow now has withHeader function to add a header row to the data. This function was added in verison 1.7.0.
So you can add a header to your csv like this:
TextIO.Write.named("WriteToText")
.to("/path/to/the/file")
.withHeader("col_name1,col_name2,col_name3,col_name4")
.withSuffix(".csv"));
The withHeader function automatically adds a newline character at the end of the header row.

Related

Transform a large jsonl file with unknown json properties into csv using apache beam google dataflow and java

How to convert a large jsonl file with unknown json properties into csv using Apache Beam, google dataflow and java
Here is my scenario:
A large jsonl file is in google storage
Json properties are unknown, so using Apache Beam's Schema can not be defined in Beam's pipeline.
Use Apache beam, google dataflow and java to convert jsonl to csv
Once transformation is done, store csv in google storage (same bucket where jsonl is stored)
Notify by some means, like transformation_done=true if possible (rest api or event)
Any help or guidance would be helpful, as I am new to Apache beam, though I am reading the doc from Apache Beam.
I have edited the question with an example JSONL data
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
{"Name":"May", "Session":"2012B", "Score":"14", "Completed":"false"}
{"Name":"Deloise", "Session":"2012A", "Score":"19", "Completed":"true"}
While json key's are there in an input file but it's not known while transforming.
I'll explain that by an example, suppose I have three clients and each got it's own google storage, so each upload their own jsonl file with different json properties.
Client 1: Input Jsonl File
{"city":"Mumbai", "pincode":"2012A"}
{"city":"Delhi", "pincode":"2012N"}
Client 2: Input Jsonl File
{"Relation":"Finance", "Code":"2012A"}
{"Relation":"Production", "Code":"20XXX"}
Client 3: Input Jsonl File
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
Question: How could I write A Generic beam pipeline which transforms all three as shown below
Client 1: Output CSV file
["city", "pincode"]
["Mumbai","2012A"]
["Delhi", "2012N"]
Client 2: Output CSV file
["Relation", "Code"]
["Finance", "2012A"]
["Production","20XXX"]
Client 3: Output CSV file
["Name", "Session", "Score", "true"]
["Gilbert", "2013", "24", "true"]
["Alexa", "2013", "29", "true"]
Edit: Removed the previous ans as questions have been modified with examples.
There is no generic way provided by anyone to achieve such result. You have to write the logic yourself depending on your requirements and how you are handling the pipeline.
Below there are some examples but you need to verify these for your case as I have only tried these on a small JSONL file.
TextIO
Approach 1
If you can collect the header value of the output csv
then it will be much easier. But getting the header beforehand itself another challenge.
//pipeline
pipeline.apply("ReadJSONLines",
TextIO.read().from("FILE URL"))
.apply(ParDo.of(new DoFn<String, String>() {
#ProcessElement
public void processLines(#Element String line, OutputReceiver<String> receiver) {
String values = getCsvLine(line, false);
receiver.output(values);
}
}))
.apply("WriteCSV",
TextIO.write().to("FileName")
.withSuffix(".csv")
.withoutSharding()
.withDelimiter(new char[] { '\r', '\n' })
.withHeader(getHeader()));
private static String getHeader() {
String header = "";
//your logic to get the header line.
return header;
}
probable ways to get the header line(Only assumptions may not work in your case) :
You can have a text file in GCS which will store the header of a particular JSON File. And in your logic you can fetch the header by reading the file , check this SO thread about how to read files from GCS
You can try to pass the header as a runtime argument but that depends how you are configuring and executing your pipeline.
Approach 2
This is a workaround I found for small JsonFiles(~10k lines). This below example may not work for large files.
final int[] count = { 0 };
pipeline.apply(//read file)
.apply(ParDo.of(new DoFn<String, String>() {
#ProcessElement
public void processLines(#Element String line, OutputReceiver<String> receiver) {
// check if its the first processing element. If yes then create the header
if (count[0] == 0) {
String header = getCsvLine(line, true);
receiver.output(header);
count[0]++;
}
String values = getCsvLine(line, false);
receiver.output(values);
}
}))
.apply(//write file)
FileIO
As mentioned by Saransh in comments by using FileIO all you have to do is read the JSONL line by line manually and then convert those into comma separated format.EG:
pipeline.apply(FileIO.match().filepattern("FILE PATH"))
.apply(FileIO.readMatches())
.apply(FlatMapElements
.into(TypeDescriptors.strings())
.via((FileIO.ReadableFile f) -> {
List<String> output = new ArrayList<>();
try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
String line = br.readLine();
while (line != null) {
if (output.size() == 0) {
String header = getCsvLine(line, true);
output.add(header);
}
String result = getCsvLine(line, false);
output.add(result);
line = br.readLine();
}
} catch (IOException e) {
throw new RuntimeException("Error while reading", e);
}
return output;
}))
.apply(//write to gcs)
In the above examples I have used a getCsvLine method(created for code usability) which takes a single line from the file and converts it into a comma separated format.To parse the JSON object I have used GSON.
/**
* #param line take each JSONL line
* #param isHeader true : Returns output combining the JSON keys || false:
* Returns output combining the JSON values
**/
public static String getCsvLine(String line, boolean isHeader) {
List<String> values = new ArrayList<>();
// convert the line into jsonobject
JsonObject jsonObject = JsonParser.parseString(line).getAsJsonObject();
// iterate json object and collect all values
for (Map.Entry<String, JsonElement> entry : jsonObject.entrySet()) {
if (isHeader)
values.add(entry.getKey());
else
values.add(entry.getValue().getAsString());
}
String result = String.join(",", values);
return result;
}

Stackoverflowerror while using distinct in apache spark

I use Spark 2.0.1.
I am trying to find distinct values in a JavaRDD as below
JavaRDD<String> distinct_installedApp_Ids = filteredInstalledApp_Ids.distinct();
I see that this line is throwing the below exception
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:226)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
..........
The same stacktrace is repeated again and again.
The input filteredInstalledApp_Ids has large input with millions of records.Will thh issue be the number of records or is there a efficient way to find distinct values in JavaRDD. Any help would be much appreciated. Thanks in advance. Cheers.
Edit 1:
Adding the filter method
JavaRDD<String> filteredInstalledApp_Ids = installedApp_Ids
.filter(new Function<String, Boolean>() {
#Override
public Boolean call(String v1) throws Exception {
return v1 != null;
}
}).cache();
Edit 2:
Added the method used to generate installedApp_Ids
public JavaRDD<String> getIdsWithInstalledApps(String inputPath, JavaSparkContext sc,
JavaRDD<String> installedApp_Ids) {
JavaRDD<String> appIdsRDD = sc.textFile(inputPath);
try {
JavaRDD<String> appIdsRDD1 = appIdsRDD.map(new Function<String, String>() {
#Override
public String call(String t) throws Exception {
String delimiter = "\t";
String[] id_Type = t.split(delimiter);
StringBuilder temp = new StringBuilder(id_Type[1]);
if ((temp.indexOf("\"")) != -1) {
String escaped = temp.toString().replace("\\", "");
escaped = escaped.replace("\"{", "{");
escaped = escaped.replace("}\"", "}");
temp = new StringBuilder(escaped);
}
// To remove empty character in the beginning of a
// string
JSONObject wholeventObj = new JSONObject(temp.toString());
JSONObject eventJsonObj = wholeventObj.getJSONObject("eventData");
int appType = eventJsonObj.getInt("appType");
if (appType == 1) {
try {
return (String.valueOf(appType));
} catch (JSONException e) {
return null;
}
}
return null;
}
}).cache();
if (installedApp_Ids != null)
return sc.union(installedApp_Ids, appIdsRDD1);
else
return appIdsRDD1;
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
I assume the main dataset is in inputPath. It appears that it's a comma-separated file with JSON-encoded values.
I think you could make your code a bit simpler by combination of Spark SQL's DataFrames and from_json function. I'm using Scala and leave converting the code to Java as a home exercise :)
The lines where you load a inputPath text file and the line parsing itself can be as simple as the following:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
val dataset = spark.read.csv(inputPath)
You can display the content using show operator.
dataset.show(truncate = false)
You should see the JSON-encoded lines.
It appears that the JSON lines contain eventData and appType fields.
val jsons = dataset.withColumn("asJson", from_json(...))
See functions object for reference.
With JSON lines, you can select the fields of your interest:
val apptypes = jsons.select("eventData.appType")
And then union it with installedApp_Ids.
I'm sure the code gets easier to read (and hopefully to write too). The migration will give you extra optimizations that you may or may not be able to write yourself using assembler-like RDD API.
And the best is that filtering out nulls is as simple as using na operator that gives DataFrameNaFunctions like drop. I'm sure you'll like them.
It does not necessarily answer your initial question, but this java.lang.StackOverflowError might get away just by doing the code migration and the code gets easier to maintain, too.

Read an Access database which uses linked tables with Excel sheets

I am trying to read data from an Access database using the Java library Jackcess. The database has several tables and queries, some of which are linked tables pointing to Excel sheets on the file-system.
I saw that I can use a LinkResolver to intercept the resolving of the linked data, but it expects a full-blown Database, not just data for one single table.
I can easily use Apache POI to open the Excel file and extract the necessary data, but I don't know how I can pass the data in the LinkResolver.
What is the simplest way to provide the location of the Excel file or read the data from the Excel file and pass it back to Jackcess so it can load the linked data successfully?
At this point in time, the LinkResolver API is only built for loading "remote" Table instances from other databases. It was not built to be a general purpose API for any type of external file. You could certainly file a feature request with the Jackcess project.
UPDATE:
As of the 2.1.7 release, jackcess provides the CustomLinkResolver utility to facilitate loading linked tables from files which are not access databases (using a temporary db).
I came up with the following initial implementation of a LinkResolver which builds a temporary database with the content from the Excel file. It still lacks some things like Close-handling and temp-file-removal of the temporary database, but it seems to work for basic purposes.
/**
* Sample LinkResolver which reads the data from an Excel file
* The data is read from the first sheet and needs to contain a
* header-row with column-names and then data-rows with string/numeric values.
*/
public class ExcelFileLinkResolver implements LinkResolver {
private final LinkResolver parentResolver;
private final String fileNameInDB;
private final String tableName;
private final File excelFile;
public ExcelFileLinkResolver(LinkResolver parentResolver, String fileNameInDB, File excelFile, String tableName) {
this.parentResolver = parentResolver;
this.fileNameInDB = fileNameInDB;
this.excelFile = excelFile;
this.tableName = tableName;
}
#Override
public Database resolveLinkedDatabase(Database linkerDb, String linkeeFileName) throws IOException {
if(linkeeFileName.equals(fileNameInDB)) {
// TODO: handle close or create database in-memory if possible
File tempFile = File.createTempFile("LinkedDB", ".mdb");
Database linkedDB = DatabaseBuilder.create(Database.FileFormat.V2003, tempFile);
try (Workbook wb = WorkbookFactory.create(excelFile, null, true)) {
TableBuilder tableBuilder = new TableBuilder(tableName);
Table table = null;
List<Object[]> rows = new ArrayList<>();
for(org.apache.poi.ss.usermodel.Row row : wb.getSheetAt(0)) {
if(table == null) {
for(Cell cell : row) {
tableBuilder.addColumn(new ColumnBuilder(cell.getStringCellValue()
// column-names cannot contain some characters
.replace(".", ""),
DataType.TEXT));
}
table = tableBuilder.toTable(linkedDB);
} else {
List<String> values = new ArrayList<>();
for(Cell cell : row) {
if(cell.getCellTypeEnum() == CellType.NUMERIC) {
values.add(Double.toString(cell.getNumericCellValue()));
} else {
values.add(cell.getStringCellValue());
}
}
rows.add(values.toArray());
}
}
Preconditions.checkNotNull(table, "Did not have a row in " + excelFile);
table.addRows(rows);
} catch (InvalidFormatException e) {
throw new IllegalStateException(e);
}
return linkedDB;
}
return parentResolver.resolveLinkedDatabase(linkerDb, linkeeFileName);
}
}

How to filter a stream of tuple with Apache Edgent

I created a source function according to this manual.
public static void main(String[] args) throws Exception {
DirectProvider dp = new DirectProvider();
Topology top = dp.newTopology();
final URL url = new URL("http://finance.yahoo.com/d/quotes.csv?s=BAC+COG+FCX&f=snabl");
TStream<String> linesOfWebsite = top.source(queryWebsite(url));
}
Now I'd like to filter this stream. I had something like this in mind:
TStream<Iterable<String>> simpleFiltered = source.filter(item-> item.contains("BAX");
Which is not working. Does anybody has an idea how to filter the stream? I don't want to change the request url to do the filtering upfront.
It's difficult to tell from the info provided. dp.submit(top) is needed to run the topology. The filter code isn't specifying an item that occurs using the URL that's being specified. e.g.,
...
TStream<String> linesOfWebsite = top.source(queryWebsite(url));
linesOfWebsite.print(); // show what's received
TStream<String> filtered = linesOfWebsite.filter(t -> t.contains("BAC"));
filtered.sink(t -> System.out.println("filtered: " + t));
dp.submit(top); // required

How to set HTTP header in Apache JClouds?

I'm using Apache JClouds to connect to my Openstack Swift installation. I managed to upload and download objects from Swift. However, I failed to see how to upload dynamic large object to Swift.
To upload dynamic large object, I need to upload all segments first, which I can do as usual. Then I need to upload a manifest object to combine them logically. The problem is to tell Swift this is a manifest object, I need to set a special header, which I don't know how to do that using JClouds api.
Here's a dynamic large object example from openstack official website.
The code I'm using:
public static void main(String[] args) throws IOException {
BlobStore blobStore = ContextBuilder.newBuilder("swift").endpoint("http://localhost:8080/auth/v1.0")
.credentials("test:test", "test").buildView(BlobStoreContext.class).getBlobStore();
blobStore.createContainerInLocation(null, "container");
ByteSource segment1 = ByteSource.wrap("foo".getBytes(Charsets.UTF_8));
Blob seg1Blob = blobStore.blobBuilder("/foo/bar/1").payload(segment1).contentLength(segment1.size()).build();
System.out.println(blobStore.putBlob("container", seg1Blob));
ByteSource segment2 = ByteSource.wrap("bar".getBytes(Charsets.UTF_8));
Blob seg2Blob = blobStore.blobBuilder("/foo/bar/2").payload(segment2).contentLength(segment2.size()).build();
System.out.println(blobStore.putBlob("container", seg2Blob));
ByteSource manifest = ByteSource.wrap("".getBytes(Charsets.UTF_8));
// TODO: set manifest header here
Blob manifestBlob = blobStore.blobBuilder("/foo/bar").payload(manifest).contentLength(manifest.size()).build();
System.out.println(blobStore.putBlob("container", manifestBlob));
Blob dloBlob = blobStore.getBlob("container", "/foo/bar");
InputStream input = dloBlob.getPayload().openStream();
while (true) {
int i = input.read();
if (i < 0) {
break;
}
System.out.print((char) i); // should print "foobar"
}
}
The "TODO" part is my problem.
Edited:
I've been pointed out that Jclouds handles large file upload automatically, which is not so useful in our case. In fact, we do not know how large the file will be or when the next segment will arrive at the time we start to upload the first segment. Our api is designed to make client able to upload their files in chunks of their own chosen size and at their own chosen time, and when done, call a 'commit' to make these chunks as a file. So this makes us want to upload the manifest on our own here.
According to #Everett Toews's answer, I've got my code correctly running:
public static void main(String[] args) throws IOException {
CommonSwiftClient swift = ContextBuilder.newBuilder("swift").endpoint("http://localhost:8080/auth/v1.0")
.credentials("test:test", "test").buildApi(CommonSwiftClient.class);
SwiftObject segment1 = swift.newSwiftObject();
segment1.getInfo().setName("foo/bar/1");
segment1.setPayload("foo");
swift.putObject("container", segment1);
SwiftObject segment2 = swift.newSwiftObject();
segment2.getInfo().setName("foo/bar/2");
segment2.setPayload("bar");
swift.putObject("container", segment2);
swift.putObjectManifest("container", "foo/bar2");
SwiftObject dlo = swift.getObject("container", "foo/bar", GetOptions.NONE);
InputStream input = dlo.getPayload().openStream();
while (true) {
int i = input.read();
if (i < 0) {
break;
}
System.out.print((char) i);
}
}
jclouds handles writing the manifest for you. Here are a couple of examples that might help you, UploadLargeObject and largeblob.MainApp.
Try using
Map<String, String> manifestMetadata = ImmutableMap.of(
"X-Object-Manifest", "<container>/<prefix>");
BlobBuilder.userMetadata(manifestMetadata)
If that doesn't work you might have to use the CommonSwiftClient like in CrossOriginResourceSharingContainer.java.

Categories

Resources