Apache Beam TextIO glob get original filename - java

I have set up a pipeline. I have to parse hundreds of *.gz files, so the glob works quite well.
But I need the original name of the file currently being processed, because I want to name the result files after the original files.
Can anyone help me here?
Here is my code:
@Default.String(LOGS_PATH + "*.gz")
String getInputFile();
void setInputFile(String value);
TextIO.Read read = TextIO.read().withCompressionType(TextIO.CompressionType.GZIP).from(options.getInputFile());
read.getName();
p.apply("ReadLines", read).apply(new CountWords())
.apply(MapElements.via(new FormatAsTextFn()))
.apply("WriteCounts", TextIO.write().to(WordCountOptions.LOGS_PATH + "_" + options.getOutput()));
p.run().waitUntilFinish();

This is possible starting with Beam 2.2, using a combination of FileIO.match(), FileIO.readMatches() and custom code to read the lines of text. You can already use this at HEAD, or you can wait until release 2.2 is finalized (it's currently in progress).
PCollection<KV<String, String>> filesAndLines =
    p.apply(FileIO.match().filepattern(...))
     .apply(FileIO.readMatches())
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
       @ProcessElement
       public void process(ProcessContext c) throws IOException {
         FileIO.ReadableFile f = c.element();
         String filename = f.getMetadata().resourceId().toString();
         String line;
         // f.open() returns a ReadableByteChannel, so wrap it in a Reader
         try (BufferedReader r = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
           while ((line = r.readLine()) != null) {
             c.output(KV.of(filename, line));
           }
         }
       }
     }));
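To then name the result files after the originals, later Beam releases (2.3+) add FileIO.writeDynamic(), which can partition output by a destination key. A minimal sketch, assuming the filesAndLines collection above and a placeholder outputDir:

filesAndLines.apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)                                   // group lines by source filename
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())  // write just the line text
    .to(outputDir)
    .withNaming(srcName -> FileIO.Write.defaultNaming(srcName, ".txt")));

Note that srcName here is the full resource id string, so in practice you would trim it to a basename before using it as a file prefix.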

Related

Transform a large jsonl file with unknown json properties into csv using apache beam google dataflow and java

How to convert a large jsonl file with unknown json properties into csv using Apache Beam, google dataflow and java
Here is my scenario:
A large JSONL file is in Google Storage
The JSON properties are unknown, so a Beam Schema cannot be defined in the pipeline
Use Apache Beam, Google Dataflow and Java to convert the JSONL to CSV
Once the transformation is done, store the CSV in Google Storage (the same bucket where the JSONL is stored)
Notify by some means, like transformation_done=true, if possible (REST API or event)
Any help or guidance would be helpful, as I am new to Apache Beam, though I am reading the docs.
I have edited the question with example JSONL data:
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
{"Name":"May", "Session":"2012B", "Score":"14", "Completed":"false"}
{"Name":"Deloise", "Session":"2012A", "Score":"19", "Completed":"true"}
The JSON keys are present in the input file, but they are not known at transform time.
I'll explain with an example: suppose I have three clients, each with their own Google Storage bucket, and each uploads their own JSONL file with different JSON properties.
Client 1: Input Jsonl File
{"city":"Mumbai", "pincode":"2012A"}
{"city":"Delhi", "pincode":"2012N"}
Client 2: Input Jsonl File
{"Relation":"Finance", "Code":"2012A"}
{"Relation":"Production", "Code":"20XXX"}
Client 3: Input Jsonl File
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
Question: How could I write a generic Beam pipeline which transforms all three as shown below?
Client 1: Output CSV file
["city", "pincode"]
["Mumbai","2012A"]
["Delhi", "2012N"]
Client 2: Output CSV file
["Relation", "Code"]
["Finance", "2012A"]
["Production","20XXX"]
Client 3: Output CSV file
["Name", "Session", "Score", "true"]
["Gilbert", "2013", "24", "true"]
["Alexa", "2013", "29", "true"]
Edit: Removed the previous answer, as the question has been modified with examples.
There is no generic way provided by anyone to achieve such a result. You have to write the logic yourself, depending on your requirements and how you are handling the pipeline.
Below are some examples, but you need to verify them for your case, as I have only tried them on a small JSONL file.
TextIO
Approach 1
If you can collect the header value of the output CSV beforehand, it will be much easier. But getting the header beforehand is itself another challenge.
//pipeline
pipeline.apply("ReadJSONLines",
    TextIO.read().from("FILE URL"))
    .apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processLines(@Element String line, OutputReceiver<String> receiver) {
        String values = getCsvLine(line, false);
        receiver.output(values);
      }
    }))
    .apply("WriteCSV",
        TextIO.write().to("FileName")
            .withSuffix(".csv")
            .withoutSharding()
            .withDelimiter(new char[] { '\r', '\n' })
            .withHeader(getHeader()));
private static String getHeader() {
  String header = "";
  // your logic to get the header line.
  return header;
}
Probable ways to get the header line (only assumptions, which may not work in your case):
You can have a text file in GCS which stores the header of a particular JSON file. In your logic you can fetch the header by reading that file; check this SO thread about how to read files from GCS.
You can try to pass the header as a runtime argument, but that depends on how you are configuring and executing your pipeline (see the sketch below).
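For the runtime-argument route, a minimal sketch of a custom options interface (the option name csvHeader and its wiring are assumptions, not an existing API):

public interface CsvOptions extends PipelineOptions {
  @Description("Header line for the output CSV")
  String getCsvHeader();
  void setCsvHeader(String value);
}

// then, when building the write:
// TextIO.write().to("FileName").withHeader(options.getCsvHeader())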
Approach 2
This is a workaround I found for small JSON files (~10k lines). The example below may not work for large files; in particular, the count array is local to each DoFn instance, so a distributed runner may emit the header more than once.
final int[] count = { 0 };
pipeline.apply(/* read file */)
    .apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processLines(@Element String line, OutputReceiver<String> receiver) {
        // check if this is the first element processed; if yes, emit the header first
        if (count[0] == 0) {
          String header = getCsvLine(line, true);
          receiver.output(header);
          count[0]++;
        }
        String values = getCsvLine(line, false);
        receiver.output(values);
      }
    }))
    .apply(/* write file */)
FileIO
As mentioned by Saransh in the comments, with FileIO all you have to do is read the JSONL line by line manually and then convert each line into comma-separated format. E.g.:
pipeline.apply(FileIO.match().filepattern("FILE PATH"))
.apply(FileIO.readMatches())
.apply(FlatMapElements
.into(TypeDescriptors.strings())
.via((FileIO.ReadableFile f) -> {
List<String> output = new ArrayList<>();
try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
String line = br.readLine();
while (line != null) {
if (output.size() == 0) {
String header = getCsvLine(line, true);
output.add(header);
}
String result = getCsvLine(line, false);
output.add(result);
line = br.readLine();
}
} catch (IOException e) {
throw new RuntimeException("Error while reading", e);
}
return output;
}))
.apply(//write to gcs)
In the above examples I have used a getCsvLine method (created for code reusability) which takes a single line from the file and converts it into comma-separated format. To parse the JSON object I have used Gson.
/**
 * @param line     a single JSONL line
 * @param isHeader true: returns output combining the JSON keys || false:
 *                 returns output combining the JSON values
 **/
public static String getCsvLine(String line, boolean isHeader) {
  List<String> values = new ArrayList<>();
  // convert the line into a JsonObject
  JsonObject jsonObject = JsonParser.parseString(line).getAsJsonObject();
  // iterate the JSON object and collect all keys or values
  for (Map.Entry<String, JsonElement> entry : jsonObject.entrySet()) {
    if (isHeader)
      values.add(entry.getKey());
    else
      values.add(entry.getValue().getAsString());
  }
  String result = String.join(",", values);
  return result;
}
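Note that getCsvLine does not escape values, so a value containing a comma or a quote would corrupt the row. A naive quoting helper, as a sketch (a library such as Apache Commons CSV handles this more robustly):

// hypothetical helper: wrap a value in quotes and double any embedded quotes
private static String quote(String value) {
  return "\"" + value.replace("\"", "\"\"") + "\"";
}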

skip header while reading a CSV file in Apache Beam

I want to skip the header line of a CSV file. As of now I'm removing the header manually before loading it to Google Storage.
Below is my code:
PCollection<String> financeobj = p.apply(TextIO.read().from("gs://storage_path/Financials.csv"));
PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype
  private static final long serialVersionUID = 1L;

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] strArr = c.element().split(",");
    ClassFinance fin = new ClassFinance();
    fin.setBeneficiaryFinance(strArr[0]);
    fin.setCatlibCode(strArr[1]);
    fin.set_rNR_(Double.valueOf(strArr[2]));
    fin.set_rNCS_(Double.valueOf(strArr[3]));
    fin.set_rCtb_(Double.valueOf(strArr[4]));
    fin.set_rAC_(Double.valueOf(strArr[5]));
    c.output(fin);
  }
}));
I have checked the existing question on Stack Overflow, but I don't find it promising: Skipping header rows - is it possible with Cloud DataFlow?
Any help?
Edit: I have tried something like below and it worked:
PCollection<String> financeobj = p.apply(TextIO.read().from("gs://google-bucket/final_input/Financials123.csv"));
PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype
  private static final long serialVersionUID = 1L;

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] strArr2 = c.element().split(",");
    String header = Arrays.toString(strArr2);
    ClassFinance fin = new ClassFinance();
    if (header.contains("Beneficiary"))
      System.out.println("Header");
    else {
      fin.setBeneficiaryFinance(strArr2[0].trim());
      fin.setCatlibCode(strArr2[1].trim());
      // default empty fields to "0" before parsing; note that the original
      // replace("", "0") would have inserted a 0 between every character
      fin.setrNR(Double.valueOf(strArr2[2].trim().isEmpty() ? "0" : strArr2[2].trim()));
      fin.setrNCS(Double.valueOf(strArr2[3].trim().isEmpty() ? "0" : strArr2[3].trim()));
      fin.setrCtb(Double.valueOf(strArr2[4].trim().isEmpty() ? "0" : strArr2[4].trim()));
      fin.setrAC(Double.valueOf(strArr2[5].trim().isEmpty() ? "0" : strArr2[5].trim()));
      c.output(fin);
    }
  }
}));
The older Stack Overflow post that you shared (Skipping header rows - is it possible with Cloud DataFlow?) does contain the answer to your question.
This option is currently not available in the Apache Beam SDK, although there is an open feature request in the Apache Beam JIRA issue tracker, BEAM-123. Note that, as of writing, this feature request is still open and unresolved, and it has been like that for two years already. However, it looks like some effort is being made in that direction, and the latest update on the issue is from February 2018, so I would advise you to stay updated on that JIRA issue, as it was last moved to the sdk-java-core component and may be getting more attention there.
With that information in mind, I would say that the approach you are using (removing the header before uploading the file to GCS) is the best option for you. I would refrain from doing it manually, as you can easily script and automate the remove header ⟶ upload file process.
EDIT:
I have been able to come up with a simple filter using a DoFn. It might not be the most elegant solution (I am not an Apache Beam expert myself), but it does work, and you may be able to adapt it to your needs. It requires that you know beforehand the header of the CSV files being uploaded (as it will be filtering by element content), but again, take this just as a template that you may be able to modify to your needs:
public class RemoveCSVHeader {
  // The Filter class
  static class FilterCSVHeaderFn extends DoFn<String, String> {
    String headerFilter;

    public FilterCSVHeaderFn(String headerFilter) {
      this.headerFilter = headerFilter;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      String row = c.element();
      // Filter out elements that match the header
      if (!row.equals(this.headerFilter)) {
        c.output(row);
      }
    }
  }

  // The main class
  public static void main(String[] args) throws IOException {
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);
    PCollection<String> vals = p.apply(TextIO.read().from("gs://BUCKET/FILE.csv"));
    String header = "col1,col2,col3,col4";
    vals.apply(ParDo.of(new FilterCSVHeaderFn(header)))
        .apply(TextIO.write().to("out"));
    p.run().waitUntilFinish();
  }
}
This code works for me. I have used Filter.by() to filter out the header row from the CSV file.
static void run(GcsToDbOptions options) {
  Pipeline p = Pipeline.create(options);
  // Read the CSV file from the GCS input file path
  p.apply("Read Rows from " + options.getInputFile(), TextIO.read()
      .from(options.getInputFile()))
      // filter the header row
      .apply("Remove header row",
          Filter.by((String row) -> !(row.startsWith("dwid") || row.startsWith("\"dwid\"")
              || row.startsWith("'dwid'"))))
      // write the rows to the database using a prepared statement
      .apply("Write to Auths Table in Postgres", JdbcIO.<String>write()
          .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(dataSource(options)))
          .withStatement(INSERT_INTO_MYTABLE)
          .withPreparedStatementSetter(new StatementSetter()));
  PipelineResult result = p.run();
  try {
    result.getState();
    result.waitUntilFinish();
  } catch (UnsupportedOperationException e) {
    // do nothing
  } catch (Exception e) {
    e.printStackTrace();
  }
}
https://medium.com/@baranitharan/the-textio-write-1be1c07fbef0
TextIO.Write in Dataflow now has a withHeader function to add a header row to the data. This function was added in version 1.7.0.
So you can add a header to your CSV like this:
TextIO.Write.named("WriteToText")
.to("/path/to/the/file")
.withHeader("col_name1,col_name2,col_name3,col_name4")
.withSuffix(".csv"));
The withHeader function automatically adds a newline character at the end of the header row.
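For reference, the equivalent in the current Beam 2.x SDK drops the named() builder; a minimal sketch:

TextIO.write()
    .to("/path/to/the/file")
    .withHeader("col_name1,col_name2,col_name3,col_name4")
    .withSuffix(".csv");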

Count the number of files first and send a different file every minute

I have a folder named collect, containing files such as selectData01.json, selectData02.json, selectData03.json and so on.
I have to count the number of files first, and then I will send a different file every minute.
Now I want to know how to achieve this.
public String getData() {
  String strLocation = new SendSituationData().getClass().getProtectionDomain().getCodeSource().getLocation().getPath();
  log.info("strLocation = " + strLocation);
  // String strParent = new File(strLocation).getParent() + "/collectData/conf.properties";
  // System.out.println("strParent = " + strParent);
  File fileConf = new File("collect/");
  System.out.println("fileConf = " + fileConf.getAbsolutePath());
  List<List<String>> listFiles = new ArrayList<>();
  // File root = new File(DashBoardListener.class.getClassLoader().getResource("collectData/").getPath());
  // File root = new File("collectData/application.conf");
  File root = new File(fileConf.getAbsolutePath());
  System.out.println("root.listFiles() = " + root.listFiles());
  Arrays
      .stream(Objects.requireNonNull(root.listFiles()))
      .filter(file -> file.getName().endsWith("json"))
      .map(File::toPath)
      .forEach(path -> {
        try {
          List<String> lines = Files.readAllLines(path);
          listFiles.add(lines);
        } catch (IOException e) {
          e.printStackTrace();
        }
      });
  String dataBody = listToString(listFiles.get(0));
  // log.info(dataBody);
  ResultMap result = buildRsult();
  // String jsonString = JSON.toJSONString(result);
  return dataBody; // the method must return a value; returning the first file's content here
}
public static String listToString(List<String> stringList) {
  if (stringList == null) {
    return null;
  }
  StringBuilder result = new StringBuilder();
  boolean flag = false;
  for (String string : stringList) {
    if (flag) {
      // a separator was presumably intended here; the original append("") was a no-op
      result.append(System.lineSeparator());
    } else {
      flag = true;
    }
    result.append(string);
  }
  return result.toString();
}
Supplement:
My friend, maybe I didn't express my purpose explicitly. If I have three files, I will send the first file at 0:00, the second file at 0:01, the third file at 0:02, then the first file again at 0:03, the second file at 0:04, and so on.
If I have five files, I will send the first file at 0:00, the second file at 0:01, the third file at 0:02, the fourth file at 0:03, the fifth file at 0:04, and so on.
I want to know how to achieve this.
Supplement:
I have a project that contains a folder named collect. Each file represents a string.
First, I want to count the number of files in the collect folder, and then I will send a file every minute.
Any suggestions?
I would use Apache Camel with the file2 component.
http://camel.apache.org/file2.html
Please read about the 'noop' option before running any tests.
Processed files are deleted by default, as far as I remember.
Update - simple example added:
I would recommend starting with https://start.spring.io/
Add at least two dependencies: Web and Camel (requires Spring Boot >=1.4.0.RELEASE and <2.0.0.M1)
Create a new route; you can start from this example:
@Component
public class FileRouteBuilder extends RouteBuilder {
  public static final String DESTINATION = "file://out/";
  public static final String SOURCE = "file://in/?noop=true";

  @Override
  public void configure() throws Exception {
    from(SOURCE)
        .process(exchange -> {
          // your processing here
        })
        .log("File: ${file:name} has been sent to: " + DESTINATION)
        .to(DESTINATION);
  }
}
My output:
2018-03-22 15:24:08.917 File: test1.txt has been sent to: file://out/
2018-03-22 15:24:08.931 File: test2.txt has been sent to: file://out/
2018-03-22 15:24:08.933 File: test3.txt has been sent to: file://out/
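To get closer to the one-file-per-minute requirement, the file component's polling options can throttle consumption. A sketch, assuming the standard delay and maxMessagesPerPoll options (with noop=true each file is consumed only once; cyclic resending would need idempotent=false or a custom route):

// poll the folder once a minute and pick up at most one file per poll
from("file://in/?noop=true&delay=60000&maxMessagesPerPoll=1")
    .log("File: ${file:name} has been sent to: " + DESTINATION)
    .to(DESTINATION);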

StackOverflowError while using distinct in Apache Spark

I use Spark 2.0.1.
I am trying to find distinct values in a JavaRDD as below
JavaRDD<String> distinct_installedApp_Ids = filteredInstalledApp_Ids.distinct();
I see that this line is throwing the below exception
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:226)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
..........
The same stacktrace is repeated again and again.
The input filteredInstalledApp_Ids is large, with millions of records. Is the issue the number of records, or is there an efficient way to find distinct values in a JavaRDD? Any help would be much appreciated. Thanks in advance. Cheers.
Edit 1:
Adding the filter method
JavaRDD<String> filteredInstalledApp_Ids = installedApp_Ids
    .filter(new Function<String, Boolean>() {
      @Override
      public Boolean call(String v1) throws Exception {
        return v1 != null;
      }
    }).cache();
Edit 2:
Added the method used to generate installedApp_Ids
public JavaRDD<String> getIdsWithInstalledApps(String inputPath, JavaSparkContext sc,
    JavaRDD<String> installedApp_Ids) {
  JavaRDD<String> appIdsRDD = sc.textFile(inputPath);
  try {
    JavaRDD<String> appIdsRDD1 = appIdsRDD.map(new Function<String, String>() {
      @Override
      public String call(String t) throws Exception {
        String delimiter = "\t";
        String[] id_Type = t.split(delimiter);
        StringBuilder temp = new StringBuilder(id_Type[1]);
        if ((temp.indexOf("\"")) != -1) {
          String escaped = temp.toString().replace("\\", "");
          escaped = escaped.replace("\"{", "{");
          escaped = escaped.replace("}\"", "}");
          temp = new StringBuilder(escaped);
        }
        // To remove the empty character at the beginning of the string
        JSONObject wholeventObj = new JSONObject(temp.toString());
        JSONObject eventJsonObj = wholeventObj.getJSONObject("eventData");
        int appType = eventJsonObj.getInt("appType");
        if (appType == 1) {
          try {
            return (String.valueOf(appType));
          } catch (JSONException e) {
            return null;
          }
        }
        return null;
      }
    }).cache();
    if (installedApp_Ids != null)
      return sc.union(installedApp_Ids, appIdsRDD1);
    else
      return appIdsRDD1;
  } catch (Exception e) {
    e.printStackTrace();
  }
  return null;
}
I assume the main dataset is in inputPath. Judging by the split on "\t", it's a tab-separated file with JSON-encoded values.
I think you could make your code a bit simpler with a combination of Spark SQL's DataFrames and the from_json function. I'm using Scala and leave converting the code to Java as a home exercise :)
The lines where you load the inputPath text file, and the line parsing itself, can be as simple as the following:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
val dataset = spark.read.option("sep", "\t").csv(inputPath)
You can display the content using the show operator.
dataset.show(truncate = false)
You should see the JSON-encoded lines.
It appears that the JSON lines contain eventData and appType fields.
val jsons = dataset.withColumn("asJson", from_json(...))
See the functions object for reference.
With JSON lines, you can select the fields of your interest:
val apptypes = jsons.select("eventData.appType")
And then union it with installedApp_Ids.
I'm sure the code gets easier to read (and hopefully to write, too). The migration will give you extra optimizations that you may or may not be able to write yourself using the assembler-like RDD API.
And the best part is that filtering out nulls is as simple as using the na operator, which gives DataFrameNaFunctions like drop. I'm sure you'll like them.
It does not necessarily answer your initial question, but this java.lang.StackOverflowError might go away just by doing the code migration, and the code gets easier to maintain, too.
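For completeness: the StackOverflowError itself comes from the driver recursing over a very deep RDD lineage, which builds up when sc.union is called repeatedly in a loop. If you stay on the RDD API, periodically truncating the lineage with checkpointing should stop the recursion. A minimal sketch, assuming a hypothetical inputPaths list:

sc.setCheckpointDir("/tmp/spark-checkpoints"); // hypothetical directory
JavaRDD<String> acc = null;
int i = 0;
for (String path : inputPaths) {   // inputPaths is a hypothetical List<String>
  acc = getIdsWithInstalledApps(path, sc, acc);
  if (++i % 50 == 0) {             // every 50 unions, flatten the lineage
    acc.checkpoint();
    acc.count();                   // force materialization so the checkpoint takes effect
  }
}
JavaRDD<String> distinctIds = acc.distinct();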

confirmation email - create template and combine it with an object

I need to implement email confirmation in my Java web application. I am stuck on the email I have to send to the user.
I need to combine a template (of a confirmation email) with the User object, and this will be the HTML content of the confirmation email.
I thought about using XSLT as the template engine, but I don't have an XML form of the User object and don't really know how to create XML from a User instance.
I thought about JSP, but how do I render a JSP page with an object and get the HTML as a result?
Any idea what packages I can use in order to create a template and combine it with an object?
I have used the following before. I seem to recall it wasn't complicated:
http://velocity.apache.org/
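A minimal sketch of the Velocity approach, assuming a confirmation.vm template on the classpath with references like ${user.name}:

import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;
import java.io.StringWriter;

VelocityEngine engine = new VelocityEngine();
engine.setProperty("resource.loader", "class");
engine.setProperty("class.resource.loader.class",
    "org.apache.velocity.runtime.resource.loader.ClasspathResourceLoader");
engine.init();

VelocityContext context = new VelocityContext();
context.put("user", user); // the template can then reference ${user.name}, etc.

StringWriter html = new StringWriter();
engine.mergeTemplate("confirmation.vm", "UTF-8", context, html);
String emailBody = html.toString();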
How complex is the user object? If it's just five string-valued fields (say) you could simply supply these as string parameters to the transformation, avoiding the need to build XML from your Java data.
Alternatively, Java XSLT processors typically provide some way to invoke methods on Java objects from within the XSLT code. So you could supply the Java object as a parameter to the stylesheet and invoke its methods using extension functions. The details are processor-specific.
Instead of learning a new framework and debugging someone else's complicated code, I decided to write my own small, suitable util:
public class StringTemplate {
  private String filePath;
  private String charsetName;
  private Collection<AbstractMap.SimpleEntry<String, String>> args;

  public StringTemplate(String filePath, String charsetName,
      Collection<AbstractMap.SimpleEntry<String, String>> args) {
    this.filePath = filePath;
    this.charsetName = charsetName;
    this.args = args;
  }

  public String generate() throws FileNotFoundException, IOException {
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = new BufferedReader(new InputStreamReader(
        getClass().getResourceAsStream(filePath), charsetName));
    try {
      String line = null;
      while ((line = reader.readLine()) != null) {
        builder.append(line);
        builder.append(System.getProperty("line.separator"));
      }
    } finally {
      reader.close();
    }
    // replace every occurrence of each key with its value
    for (AbstractMap.SimpleEntry<String, String> arg : this.args) {
      int index = builder.indexOf(arg.getKey());
      while (index != -1) {
        builder.replace(index, index + arg.getKey().length(), arg.getValue());
        index += arg.getValue().length();
        index = builder.indexOf(arg.getKey(), index);
      }
    }
    return builder.toString();
  }
}
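A hypothetical usage of the class above, assuming a classpath resource /templates/confirm.html that contains literal placeholders such as ${name}:

List<AbstractMap.SimpleEntry<String, String>> args = Arrays.asList(
    new AbstractMap.SimpleEntry<>("${name}", user.getName()),
    new AbstractMap.SimpleEntry<>("${confirmLink}", confirmationUrl));
String emailHtml = new StringTemplate("/templates/confirm.html", "UTF-8", args).generate();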
