I have written an Apache Beam pipeline that reads a directory containing 99 files, calculates a checksum for each one, and creates a key-value pair of the file and its checksum. What I need to do is write these key-value pairs to a manifest.json file. I am currently running into some serialization problems, and any advice or help would be greatly appreciated.
Here is my code:
public class BeamPipeline {
private static final Logger log = LoggerFactory.getLogger(BeamPipeline.class);
public static interface MyOptions extends PipelineOptions {
@Description("Input Path (with gs:// prefix)")
String getInput();
void setInput(String value);
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
JsonObject obj = new JsonObject();
File dir = new File(options.getInput());
for (File file : dir.listFiles()) {
String inputString = file.toString();
p
.apply("Match Files", FileIO.match().filepattern(inputString))
.apply("Read Files", FileIO.readMatches())
.apply(MapElements.via(new SimpleFunction<FileIO.ReadableFile, KV<String, String>>() {
public KV<String, String> apply(FileIO.ReadableFile file) {
String temp = null;
try {
temp = file.readFullyAsUTF8String();
} catch (IOException e) {
}
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(temp);
obj.addProperty(temp, sha256hex);
String json = obj.toString();
try (FileWriter fileWriter = new FileWriter("./manifest.json")) {
fileWriter.write(json);
} catch (IOException e) {
}
return KV.of(file.getMetadata().resourceId().toString(), sha256hex);
}
}))
.apply("Print", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c) {
log.info(String.format("File: %s, SHA-256 %s", c.element().getKey(), c.element().getValue()));
}
}));
}
p.run();
}
}
Here are my errors currently:
"main" java.lang.IllegalArgumentException: unable to serialize DoFnAndMainOutput{doFn=org.apache.beam.sdk.transforms.MapElements$1#50756c76, mainOutputTag=Tag<output>}
Caused by: java.io.NotSerializableException: com.google.gson.JsonObject
A DoFn is serialized together with every object it references. Here the JsonObject is created outside the DoFn and referenced inside it, and since com.google.gson.JsonObject is not serializable, that makes the whole DoFn non-serializable.
You can create the JsonObject inside the DoFn to avoid this serialization issue.
public class BeamPipeline {
private static final Logger log = LoggerFactory.getLogger(BeamPipeline.class);
public static interface MyOptions extends PipelineOptions {
@Description("Input Path (with gs:// prefix)")
String getInput();
void setInput(String value);
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
File dir = new File(options.getInput());
for (File file : dir.listFiles()) {
String inputString = file.toString();
p
.apply("Match Files", FileIO.match().filepattern(inputString))
.apply("Read Files", FileIO.readMatches())
.apply(MapElements.via(new SimpleFunction<FileIO.ReadableFile, KV<String, String>>() {
public KV<String, String> apply(FileIO.ReadableFile file) {
String temp = null;
try {
temp = file.readFullyAsUTF8String();
} catch (IOException e) {
}
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(temp);
JsonObject obj = new JsonObject();
obj.addProperty(temp, sha256hex);
String json = obj.toString();
try (FileWriter fileWriter = new FileWriter("./manifest.json")) {
fileWriter.write(json);
} catch (IOException e) {
}
return KV.of(file.getMetadata().resourceId().toString(), sha256hex);
}
}))
.apply("Print", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c) {
log.info(String.format("File: %s, SHA-256 %s", c.element().getKey(), c.element().getValue()));
}
}));
}
p.run();
}
}
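As a side note, writing manifest.json from inside the map function will overwrite the file for every element and will not behave as expected on a distributed runner. One option is to let Beam write the file from the KV pairs; below is a minimal sketch, assuming the output of the MapElements step is captured in a PCollection<KV<String, String>> named checksums (a name not in the original code) and that a single output shard is acceptable:
// Sketch only: turn each KV into a one-line JSON object and write a single file.
PCollection<String> jsonLines = checksums.apply("To JSON lines",
    MapElements.into(TypeDescriptors.strings())
        .via((KV<String, String> kv) -> {
            JsonObject entry = new JsonObject(); // created inside the lambda, so nothing non-serializable is captured
            entry.addProperty(kv.getKey(), kv.getValue());
            return entry.toString();
        }));
jsonLines.apply("Write manifest", TextIO.write()
    .to("./manifest")
    .withSuffix(".json")
    .withoutSharding()); // forces a single output file; fine for ~99 entries
withoutSharding() keeps the output as one manifest.json-style file; for large outputs you would normally drop that call and let Beam shard the files.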
Related
I have a DoFn that is supposed to split input into two separate PCollections. The pipeline builds and runs up until it is time to output in the DoFn, and then I get the following exception:
"java.lang.IllegalArgumentException: Unknown output tag Tag<edu.mayo.mcc.cdh.pipeline.PubsubToAvro$PubsubMessageToArchiveDoFn$2.<init>:219#2587af97b4865538>
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)...
If I declare the TupleTags I'm using in the ParDo, I get that error, but if I declare them outside of the ParDo I get a syntax error saying the OutputReceiver can't find the tags. Below is the apply and the ParDo/DoFn:
PCollectionTuple results = (messages.apply("Map to Archive", ParDo.of(new PubsubMessageToArchiveDoFn()).withOutputTags(noTag, TupleTagList.of(medaPcollection))));
PCollection<AvroPubsubMessageRecord> medaPcollectionTransformed = results.get(medaPcollection);
PCollection<AvroPubsubMessageRecord> noTagPcollectionTransformed = results.get(noTag);
static class PubsubMessageToArchiveDoFn extends DoFn<PubsubMessage, AvroPubsubMessageRecord> {
final TupleTag<AvroPubsubMessageRecord> medaPcollection = new TupleTag<AvroPubsubMessageRecord>(){};
final TupleTag<AvroPubsubMessageRecord> noTag = new TupleTag<AvroPubsubMessageRecord>(){};
@ProcessElement
public void processElement(ProcessContext context, MultiOutputReceiver out) {
String appCode;
PubsubMessage message = context.element();
String msgStr = new String(message.getPayload(), StandardCharsets.UTF_8);
try {
JSONObject jsonObject = new JSONObject(msgStr);
LOGGER.info("json: {}", jsonObject);
appCode = jsonObject.getString("app_code");
LOGGER.info(appCode);
if ("MEDA".equals(appCode)) {
LOGGER.info("Made it to MEDA tag");
out.get(medaPcollection).output(new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
} else {
LOGGER.info("Made it to default tag");
out.get(noTag).output(new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
}
} catch (Exception e) {
LOGGER.info("Error Processing Message: {}\n{}", msgStr, e);
}
}
}
Can you try without the MultiOutputReceiver out parameter in the processElement method? Outputs are then emitted with context.output, passing the corresponding TupleTag along with the element.
Your example using only the context:
static class PubsubMessageToArchiveDoFn extends DoFn<PubsubMessage, AvroPubsubMessageRecord> {
final TupleTag<AvroPubsubMessageRecord> medaPcollection = new TupleTag<AvroPubsubMessageRecord>(){};
final TupleTag<AvroPubsubMessageRecord> noTag = new TupleTag<AvroPubsubMessageRecord>(){};
@ProcessElement
public void processElement(ProcessContext context) {
String appCode;
PubsubMessage message = context.element();
String msgStr = new String(message.getPayload(), StandardCharsets.UTF_8);
try {
JSONObject jsonObject = new JSONObject(msgStr);
LOGGER.info("json: {}", jsonObject);
appCode = jsonObject.getString("app_code");
LOGGER.info(appCode);
if ("MEDA".equals(appCode)) {
LOGGER.info("Made it to MEDA tag");
context.output(medaPcollection, new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
} else {
LOGGER.info("Made it to default tag");
context.output(noTag, new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
}
} catch (Exception e) {
LOGGER.info("Error Processing Message: {}\n{}", msgStr, e);
}
}
}
I will also show you an example that works for me:
public class WordCountFn extends DoFn<String, Integer> {
private final TupleTag<Integer> outputTag = new TupleTag<Integer>() {};
private final TupleTag<Failure> failuresTag = new TupleTag<Failure>() {};
@ProcessElement
public void processElement(ProcessContext ctx) {
try {
// Could throw ArithmeticException.
final String word = ctx.element();
ctx.output(1 / word.length());
} catch (Throwable throwable) {
final Failure failure = Failure.from("step", ctx.element(), throwable);
ctx.output(failuresTag, failure);
}
}
public TupleTag<Integer> getOutputTag() {
return outputTag;
}
public TupleTag<Failure> getFailuresTag() {
return failuresTag;
}
}
For my first output (the good case), there is no need to pass a TupleTag: ctx.output(1 / word.length()).
For my second output (the failure case), I pass the failuresTag along with the corresponding element.
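For reference, when a DoFn like this is applied, the same tag instances have to be handed to withOutputTags; a minimal sketch using the getters above (words is a placeholder PCollection<String>):
WordCountFn fn = new WordCountFn();
PCollectionTuple results = words.apply("Count words",
    ParDo.of(fn).withOutputTags(fn.getOutputTag(), TupleTagList.of(fn.getFailuresTag())));
PCollection<Integer> lengths = results.get(fn.getOutputTag());    // main output
PCollection<Failure> failures = results.get(fn.getFailuresTag()); // failure output
Using the getters guarantees that the tags used in withOutputTags, in ctx.output and in results.get are the same objects, which is what the "Unknown output tag" error is complaining about.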
I was able to get around this by making my ParDo an anonymous function instead of a class. I put the whole function inline and had no problem finding the output tags after I did that. Thanks for the suggestions!
I'm trying to check whether config1 exists in a text file, using Google's Gson library.
My JSON file :
{
"maps":{
"config2":{
"component1":"url1",
"component2":"url1",
"component3":"url1"
},
"config1":{
"component1":"url1",
"component2":"url1",
"component3":"url1"
}
}
}
Loading:
public void load() throws IOException {
File file = getContext().getFileStreamPath("jsonfile.txt");
FileInputStream fis = getContext().openFileInput("jsonfile.txt");
InputStreamReader isr = new InputStreamReader(fis);
BufferedReader bufferedReader = new BufferedReader(isr);
StringBuilder sb = new StringBuilder();
String line;
while ((line = bufferedReader.readLine()) != null) {
sb.append(line);
}
String json = sb.toString();
Gson gson = new Gson();
Data data = gson.fromJson(json, Data.class);
componentURL = data.getMap().get("config1").get("component1");
}
Saving:
Gson gson = new Gson();
webViewActivity.Data data = gson.fromJson(json, webViewActivity.Data.class);
Map<String, String> configTest = data.getMap().get("config1");
data.getMap().get("config1").put(component, itemUrl);
String updatedJson = gson.toJson(data);
String filename = "jsonfile.txt";
FileOutputStream outputStream;
try {
outputStream = openFileOutput(filename, Context.MODE_PRIVATE);
outputStream.write(updatedJson.getBytes());
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
}
Data class:
public class Data {
private Map<String, Map<String, String>> map;
public Data() {
}
public Data(Map<String, Map<String, String>> map) {
this.map = map;
}
public Map<String, Map<String, String>> getMap() {
return map;
}
public void setMap(Map<String, Map<String, String>> map) {
this.map = map;
}
}
My problem is that I need to create the file once and then check whether it exists; if it does, I need to check whether config1 exists, and if it doesn't, I need to add config1 to the file.
But I can't check whether config1 exists, because I get:
java.lang.NullPointerException: Attempt to invoke virtual method 'java.util.Map com.a.app.ui.app.appFragment$Data.getMap()
I check whether it exists by doing:
Boolean configTest = data.getMap().containsKey("config1");
if(!configTest){}
How can I create the file and check the data without getting a NullPointerException?
I think you should modify the way you're handling things.
First, create a POJO for the Config1 values:
// file Config1.java
public class Config1
{
private String component1;
private String component2;
private String component3;
public String getComponent1 ()
{
return component1;
}
public void setComponent1 (String component1)
{
this.component1 = component1;
}
public String getComponent2 ()
{
return component2;
}
public void setComponent2 (String component2)
{
this.component2 = component2;
}
public String getComponent3 ()
{
return component3;
}
public void setComponent3 (String component3)
{
this.component3 = component3;
}
@Override
public String toString()
{
return "ClassPojo [component1 = "+component1+", component2 = "+component2+", component3 = "+component3+"]";
}
}
Then create a POJO for Config2:
// file Config2.java
public class Config2
{
private String component1;
private String component2;
private String component3;
public String getComponent1 ()
{
return component1;
}
public void setComponent1 (String component1)
{
this.component1 = component1;
}
public String getComponent2 ()
{
return component2;
}
public void setComponent2 (String component2)
{
this.component2 = component2;
}
public String getComponent3 ()
{
return component3;
}
public void setComponent3 (String component3)
{
this.component3 = component3;
}
@Override
public String toString()
{
return "ClassPojo [component1 = "+component1+", component2 = "+component2+", component3 = "+component3+"]";
}
}
Then you need a POJO for Maps:
// file Maps.java
public class Maps
{
private Config2 config2;
private Config1 config1;
public Config2 getConfig2 ()
{
return config2;
}
public void setConfig2 (Config2 config2)
{
this.config2 = config2;
}
public Config1 getConfig1 ()
{
return config1;
}
public void setConfig1 (Config1 config1)
{
this.config1 = config1;
}
@Override
public String toString()
{
return "ClassPojo [config2 = "+config2+", config1 = "+config1+"]";
}
}
And finally the class which wraps everything up, MyJsonPojo (you can rename it to whatever you want):
// file MyJsonPojo.java
public class MyJsonPojo
{
private Maps maps;
public Maps getMaps ()
{
return maps;
}
public void setMaps (Maps maps)
{
this.maps = maps;
}
@Override
public String toString()
{
return "ClassPojo [maps = "+maps+"]";
}
}
Finally, replace the code in your load() method with:
public void load() throws IOException {
File file = getContext().getFileStreamPath("jsonfile.txt");
FileInputStream fis = getContext().openFileInput("jsonfile.txt");
InputStreamReader isr = new InputStreamReader(fis);
BufferedReader bufferedReader = new BufferedReader(isr);
StringBuilder sb = new StringBuilder();
String line;
while ((line = bufferedReader.readLine()) != null) {
sb.append(line);
}
String json = sb.toString();
Gson gson = new Gson();
MyJsonPojo data = gson.fromJson(json, MyJsonPojo.class);
Maps maps = data.getMaps();
Config1 config1 = null;
if (maps != null) {
config1 = maps.getConfig1();
}
if (config1 != null) {
componentURL = config1.getComponent1();
}
}
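With these POJOs in place, the existence check from the question becomes a pair of null checks instead of containsKey(); a minimal sketch (data is the MyJsonPojo instance loaded above, itemUrl is the value from the question's save code):
// Hypothetical check after load(): config1 "exists" only if both levels are non-null.
boolean config1Exists = data.getMaps() != null && data.getMaps().getConfig1() != null;
if (!config1Exists) {
    if (data.getMaps() == null) {
        data.setMaps(new Maps());
    }
    Config1 config1 = new Config1();
    config1.setComponent1(itemUrl);
    data.getMaps().setConfig1(config1); // add config1 before writing the file back out
}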
For saving the values you can do this:
public void save() {
// set url here
Config1 config1 = new Config1();
config1.setComponent1(itemUrl);
// store it in maps
Maps maps = new Maps();
maps.setConfig1(config1);
// finally add it to the MyJsonPojo instance
MyJsonPojo myJsonPojo = new MyJsonPojo();
myJsonPojo.setMaps(maps);
Gson gson = new Gson();
String json = gson.toJson(myJsonPojo);
String filename = "jsonfile.txt";
FileOutputStream outputStream;
try {
outputStream = openFileOutput(filename, Context.MODE_PRIVATE);
outputStream.write(json.getBytes());
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
}
}
Please note that you may have to modify the save() code to match your structure, because I am not sure exactly how you have handled everything in your code; I have provided a basic implementation without much proofreading.
I am totally new to Apache Beam and Java.
I have been working with PHP for around 5 years but haven't worked in Java for the last 5 years :), and the Apache Beam SDK in Java is also new to me, so bear with me.
I would like to implement a pipeline where I get data from Google Pub/Sub, map the relevant fields into an array, and then check against a MySQL DB to see whether the message belongs to a particular table; after that I need to call our API, which will update some data in our app DB. Another pipeline will enrich the data from Elasticsearch and insert it into BigQuery.
But at the moment I am stuck reading data from MySQL: I simply cannot figure out how to use a value from my PCollection as an argument to JdbcIO.
My plan is to check whether the value I get from Pub/Sub (the listid value) is present in a MySQL table.
Here is my code so far; any help will be appreciated.
Pipeline p = Pipeline.create(options);
org.apache.beam.sdk.values.PCollection<PubsubMessage> messages = p.apply(PubsubIO.readMessagesWithAttributes()
.fromSubscription("*******"));
org.apache.beam.sdk.values.PCollection<String> messages2 = messages.apply("GetPubSubEvent",
ParDo.of(new DoFn<PubsubMessage, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
Map<String, String> Map = new HashMap<String, String>();
PubsubMessage message = c.element();
String messageText = new String(message.getPayload(), StandardCharsets.UTF_8);
JSONObject jsonObj = new JSONObject(messageText);
String requestURL = jsonObj.getJSONObject("httpRequest").getString("requestUrl");
String query = requestURL.split("\\?")[1];
final Map<String, String> querymap = Splitter.on('&').trimResults().withKeyValueSeparator("=")
.split(query);
JSONObject querymapJson = new JSONObject(querymap);
int subscriberid = 0;
int listid = 0;
int statid = 0;
int points = 0;
String stattype = "";
String requesttype = "";
try {
subscriberid = querymapJson.getInt("emp_uid");
} catch (Exception e) {
}
try {
listid = querymapJson.getInt("emp_lid");
} catch (Exception e) {
}
try {
statid = querymapJson.getInt("emp_statid");
} catch (Exception e) {
}
try {
stattype = querymapJson.getString("emp_stattype");
Map.put("stattype", stattype);
} catch (Exception e) {
}
try {
requesttype = querymapJson.getString("type");
} catch (Exception e) {
}
try {
statid = querymapJson.getInt("leadscore");
} catch (Exception e) {
}
Map.put("subscriberid", String.valueOf(subscriberid));
Map.put("listid", String.valueOf(listid));
Map.put("statid", String.valueOf(statid));
Map.put("requesttype", requesttype);
Map.put("leadscore", String.valueOf(points));
Map.put("requestip", jsonObj.getJSONObject("httpRequest").getString("remoteIp"));
System.out.print("Hello from message 1");
c.output(Map.toString());
}
}));
org.apache.beam.sdk.values.PCollection<String> messages3 = messages2.apply("Test",
ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
System.out.println(c.element());
System.out.print("Hello from message 2");
}
}));
org.apache.beam.sdk.values.PCollection<KV<String, String>> messages23 = messages2.apply(JdbcIO.<KV<String, String>>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("org.apache.derby.jdbc.ClientDriver",
"jdbc:derby://localhost:1527/beam"))
.withQuery("select * from artist").withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
@Override
public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
KV<String, String> kv = KV.of(resultSet.getString("label"), resultSet.getString("name"));
return kv;
}
}).withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
p.run().waitUntilFinish();
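For what it's worth, JdbcIO also provides readAll(), which takes an input PCollection and parameterizes the query per element; that is one way to look up a listid coming from Pub/Sub. A minimal sketch, assuming a PCollection<String> of list IDs named listIds, a hypothetical lists(listid, name) table, and the MySQL JDBC driver (all names and credentials are placeholders):
// Sketch only: per-element JDBC lookup; replace table, columns and credentials with real values.
PCollection<KV<String, String>> matches = listIds.apply("Lookup listid",
    JdbcIO.<String, KV<String, String>>readAll()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.cj.jdbc.Driver", "jdbc:mysql://localhost:3306/mydb")
            .withUsername("user")
            .withPassword("password"))
        .withQuery("select listid, name from lists where listid = ?")
        .withParameterSetter((element, statement) -> statement.setInt(1, Integer.parseInt(element)))
        .withRowMapper(resultSet -> KV.of(resultSet.getString("listid"), resultSet.getString("name")))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));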
Why am I getting a NullPointerException at the ts.reset() line in the InputFile class? If I use any built-in analyzer like WhitespaceAnalyzer, I don't get any exception. What is the problem here?
public class CourtesyTitleFilter extends TokenFilter
{
TokenStream input;
Map<String,String> courtesyTitleMap = new HashMap<String,String>();
private CharTermAttribute termAttr;
public CourtesyTitleFilter(TokenStream input) throws IOException
{
super(input);
termAttr = input.addAttribute(CharTermAttribute.class);
courtesyTitleMap.put("Dr", "doctor");
courtesyTitleMap.put("Mr", "mister");
courtesyTitleMap.put("Mrs", "miss");
}
@Override
public boolean incrementToken() throws IOException
{
if (!input.incrementToken())
return false;
String small = termAttr.toString();
if(courtesyTitleMap.containsKey(small)) {
termAttr.setEmpty().append(courtesyTitleMap.get(small));
System.out.print(courtesyTitleMap.get(small));
}
return true;
}
}
public class CourtesyTitleAnalyzer extends Analyzer
{
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader)
{
TokenStream filter = null;
Tokenizer whitespaceTokenizer = new WhitespaceTokenizer(reader);
try
{
filter = new CourtesyTitleFilter (whitespaceTokenizer);
}
catch(IOException e)
{
e.printStackTrace();
}
return new TokenStreamComponents(whitespaceTokenizer,filter);
}
}
public class InputFile
{
public static void main(String[] args) throws IOException, ParseException
{
TokenStream ts=null;
CourtesyTitleAnalyzer cta=new CourtesyTitleAnalyzer();
try
{
StringReader sb=new StringReader("Hello Mr Hari. Meet Dr Kalam and Mrs xyz");
ts = cta.tokenStream("field",sb);
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken())
{
String token = termAtt.toString();
System.out.println("[" + token + "]");
System.out.println("Token starting offset: " + offsetAtt.startOffset());
System.out.println(" Token ending offset: " + offsetAtt.endOffset());
System.out.println("");
}
ts.end();
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
ts.close();
cta.close();
}
}
}
The input field is already defined in the TokenFilter abstract class; you are hiding it by re-declaring it in your implementation.
So just delete the line TokenStream input; from your CourtesyTitleFilter.
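A minimal sketch of the corrected filter, relying on the inherited input field (imports added; otherwise it mirrors the code in the question):
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CourtesyTitleFilter extends TokenFilter {
    private final Map<String, String> courtesyTitleMap = new HashMap<>();
    private final CharTermAttribute termAttr;

    public CourtesyTitleFilter(TokenStream input) throws IOException { // throws kept to match the Analyzer's try/catch in the question
        super(input); // TokenFilter keeps the wrapped stream in its own protected 'input' field
        termAttr = addAttribute(CharTermAttribute.class); // the filter shares attributes with the wrapped stream
        courtesyTitleMap.put("Dr", "doctor");
        courtesyTitleMap.put("Mr", "mister");
        courtesyTitleMap.put("Mrs", "miss");
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) { // 'input' now resolves to the field inherited from TokenFilter
            return false;
        }
        String term = termAttr.toString();
        if (courtesyTitleMap.containsKey(term)) {
            termAttr.setEmpty().append(courtesyTitleMap.get(term));
        }
        return true;
    }
}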
Question: I want to change the hard-coded JSON file path. The path will come from detailsListHM, but I don't know how to do it.
Here is my main program:
public class Program {
// hard coding json file path
private static final String filePath = "C:/appSession.json";
public static void main(String[] args)
{
taskManager();
}
public static void taskManager()
{
detailsHM = jsonParser(filePath);
}
public static HashMap<String, String> jsonParser(String jsonFilePath)
{
HashMap<String, String> detailsHM = new HashMap<String, String>();
String refGene = "";
try {
// read the json file
FileReader reader = new FileReader(jsonFilePath);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
}
}
}
Here is another class called CustomConfiguration:
public class CustomConfiguration {
private static HashMap<String, String> detailsListHM =new HashMap<String,String>();
public static void readConfig(String a) {
//read from config.properties file
try {
String result = "";
Properties properties = new Properties();
String propFileName = a;
InputStream inputStream = new FileInputStream(propFileName);
properties.load(inputStream);
// get the property value and print it out
String lofreqPath = properties.getProperty("lofreqPath");
String bamFilePath = properties.getProperty("bamFilePath");
String bamFilePath2 = properties.getProperty("bamFilePath2");
String resultPath = properties.getProperty("resultPath");
String refGenPath = properties.getProperty("refGenPath");
String filePath = properties.getProperty("filePath");
Set keySet = properties.keySet();
List keyList = new ArrayList(keySet);
Collections.sort(keyList);
Iterator itr = keyList.iterator();
while (itr.hasNext()) {
String key = (String) itr.next();
String value = properties.getProperty(key.toString());
detailsListHM.put(key, value);
}
} catch (IOException ex) {
System.err.println("CustomConfiguration - readConfig():" + ex.getMessage());
}
}
public static HashMap<String, String> getConfigHM() {
return detailsListHM;
}
}
Add a new property called "json-filepath" and read it like this:
String filePath = properties.getProperty("json-filepath");
That way the end user can change the JSON file path, even at runtime.
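A minimal sketch of how this could be wired into the taskManager() method from the question, assuming a properties file named config.properties and the "json-filepath" key above (both placeholders; detailsHM and jsonParser are kept as in the question):
public static void taskManager() {
    // Load the properties once, then take the JSON path from the map instead of the hard-coded constant.
    CustomConfiguration.readConfig("config.properties");
    String jsonFilePath = CustomConfiguration.getConfigHM().get("json-filepath");
    detailsHM = jsonParser(jsonFilePath);
}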
You can also pass the filePath parameter using the main method's arguments:
public static void main(String[] args) {
String filePath = null;
if(args.length > 0) {
filePath = args[0];
}
}
And invoke your main class like this:
java Program C:/appSession.json