I am totally new to Apache Beam and Java.
I have been working in PHP for around 5 years, but I haven't touched Java in the last 5 years :), and the Apache Beam SDK for Java is also new to me, so bear with me.
I would like to implement a pipeline that reads data from Google Pub/Sub, maps the relevant fields into an array, and then checks a MySQL database to see whether the message belongs to a particular table; after that I need to call our API, which will update some data in our application database. Another pipeline will enrich the data from Elasticsearch and insert it into BigQuery.
At the moment, though, I am stuck reading data from MySQL: I simply cannot adapt the argument in the PCollection when using JdbcIO.
My plan is to check whether a value I get from Pub/Sub (the listid value) is present in a MySQL table.
Here is my code so far; any help will be appreciated.
Pipeline p = Pipeline.create(options);
org.apache.beam.sdk.values.PCollection<PubsubMessage> messages = p.apply(PubsubIO.readMessagesWithAttributes()
.fromSubscription("*******"));
org.apache.beam.sdk.values.PCollection<String> messages2 = messages.apply("GetPubSubEvent",
ParDo.of(new DoFn<PubsubMessage, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
Map<String, String> outputMap = new HashMap<String, String>();
PubsubMessage message = c.element();
String messageText = new String(message.getPayload(), StandardCharsets.UTF_8);
JSONObject jsonObj = new JSONObject(messageText);
String requestURL = jsonObj.getJSONObject("httpRequest").getString("requestUrl");
String query = requestURL.split("\\?")[1];
final Map<String, String> querymap = Splitter.on('&').trimResults().withKeyValueSeparator("=")
.split(query);
JSONObject querymapJson = new JSONObject(querymap);
int subscriberid = 0;
int listid = 0;
int statid = 0;
int points = 0;
String stattype = "";
String requesttype = "";
try {
subscriberid = querymapJson.getInt("emp_uid");
} catch (Exception e) {
}
try {
listid = querymapJson.getInt("emp_lid");
} catch (Exception e) {
}
try {
statid = querymapJson.getInt("emp_statid");
} catch (Exception e) {
}
try {
stattype = querymapJson.getString("emp_stattype");
Map.put("stattype", stattype);
} catch (Exception e) {
}
try {
requesttype = querymapJson.getString("type");
} catch (Exception e) {
}
try {
statid = querymapJson.getInt("leadscore");
} catch (Exception e) {
}
Map.put("subscriberid", String.valueOf(subscriberid));
Map.put("listid", String.valueOf(listid));
Map.put("statid", String.valueOf(statid));
Map.put("requesttype", requesttype);
Map.put("leadscore", String.valueOf(points));
Map.put("requestip", jsonObj.getJSONObject("httpRequest").getString("remoteIp"));
System.out.print("Hello from message 1");
c.output(Map.toString());
}
}));
org.apache.beam.sdk.values.PCollection<String> messages3 = messages2.apply("Test",
ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
System.out.println(c.element());
System.out.print("Hello from message 2");
}
}));
org.apache.beam.sdk.values.PCollection<KV<String, String>> messages23 = messages2.apply(JdbcIO.<KV<String, String>>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("org.apache.derby.jdbc.ClientDriver",
"jdbc:derby://localhost:1527/beam"))
.withQuery("select * from artist").withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
@Override
public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
return KV.of(resultSet.getString("label"), resultSet.getString("name"));
}
}).withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
p.run().waitUntilFinish();
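From what I have read so far (this is just my guess, I have not verified it), JdbcIO.read() can only be applied to the pipeline itself, while JdbcIO.readAll() can be applied to an existing PCollection, which sounds closer to what I need. Here is a rough sketch of what I am thinking of; the MySQL connection details and the lists/listid table and column names are assumptions, not real values from my project:
// Rough sketch only: connection details and the lists/listid names are made up.
org.apache.beam.sdk.values.PCollection<KV<String, String>> lookups = messages2
    .apply("Lookup listid in MySQL", JdbcIO.<String, KV<String, String>>readAll()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.cj.jdbc.Driver", "jdbc:mysql://localhost:3306/mydb")
            .withUsername("user")
            .withPassword("password"))
        .withQuery("select listid, name from lists where listid = ?")
        .withParameterSetter((element, preparedStatement) ->
            // here element would have to be just the listid, not the whole map string
            preparedStatement.setString(1, element))
        .withRowMapper(resultSet ->
            KV.of(resultSet.getString("listid"), resultSet.getString("name")))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));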
Related
I have a DoFn that is supposed to split input into two separate PCollections. The pipeline builds and runs up until it is time to output in the DoFn, and then I get the following exception:
"java.lang.IllegalArgumentException: Unknown output tag Tag<edu.mayo.mcc.cdh.pipeline.PubsubToAvro$PubsubMessageToArchiveDoFn$2.<init>:219#2587af97b4865538>
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)...
If I declare the TupleTags I'm using in the ParDo, I get that error, but if I declare them outside of the ParDo I get a syntax error saying the OutputReceiver can't find the tags. Below is the apply and the ParDo/DoFn:
PCollectionTuple results = (messages.apply("Map to Archive", ParDo.of(new PubsubMessageToArchiveDoFn()).withOutputTags(noTag, TupleTagList.of(medaPcollection))));
PCollection<AvroPubsubMessageRecord> medaPcollectionTransformed = results.get(medaPcollection);
PCollection<AvroPubsubMessageRecord> noTagPcollectionTransformed = results.get(noTag);
static class PubsubMessageToArchiveDoFn extends DoFn<PubsubMessage, AvroPubsubMessageRecord> {
final TupleTag<AvroPubsubMessageRecord> medaPcollection = new TupleTag<AvroPubsubMessageRecord>(){};
final TupleTag<AvroPubsubMessageRecord> noTag = new TupleTag<AvroPubsubMessageRecord>(){};
@ProcessElement
public void processElement(ProcessContext context, MultiOutputReceiver out) {
String appCode;
PubsubMessage message = context.element();
String msgStr = new String(message.getPayload(), StandardCharsets.UTF_8);
try {
JSONObject jsonObject = new JSONObject(msgStr);
LOGGER.info("json: {}", jsonObject);
appCode = jsonObject.getString("app_code");
LOGGER.info(appCode);
if ("MEDA".equals(appCode)) {
LOGGER.info("Made it to MEDA tag");
out.get(medaPcollection).output(new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
} else {
LOGGER.info("Made it to default tag");
out.get(noTag).output(new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
}
} catch (Exception e) {
LOGGER.info("Error Processing Message: {}\n{}", msgStr, e);
}
}
}
Can you try without the MultiOutputReceiver out parameter in the processElement method?
Outputs are then emitted with context.output, passing the element and the corresponding TupleTag.
Your example, using only the context:
static class PubsubMessageToArchiveDoFn extends DoFn<PubsubMessage, AvroPubsubMessageRecord> {
final TupleTag<AvroPubsubMessageRecord> medaPcollection = new TupleTag<AvroPubsubMessageRecord>(){};
final TupleTag<AvroPubsubMessageRecord> noTag = new TupleTag<AvroPubsubMessageRecord>(){};
@ProcessElement
public void processElement(ProcessContext context) {
String appCode;
PubsubMessage message = context.element();
String msgStr = new String(message.getPayload(), StandardCharsets.UTF_8);
try {
JSONObject jsonObject = new JSONObject(msgStr);
LOGGER.info("json: {}", jsonObject);
appCode = jsonObject.getString("app_code");
LOGGER.info(appCode);
if ("MEDA".equals(appCode)) {
LOGGER.info("Made it to MEDA tag");
context.output(medaPcollection, new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
} else {
LOGGER.info("Made it to default tag");
context.output(noTag, new AvroPubsubMessageRecord(
message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
}
} catch (Exception e) {
LOGGER.info("Error Processing Message: {}\n{}", msgStr, e);
}
}
}
I will also show you an example that works for me:
public class WordCountFn extends DoFn<String, Integer> {
private final TupleTag<Integer> outputTag = new TupleTag<Integer>() {};
private final TupleTag<Failure> failuresTag = new TupleTag<Failure>() {};
@ProcessElement
public void processElement(ProcessContext ctx) {
try {
// Could throw ArithmeticException.
final String word = ctx.element();
ctx.output(1 / word.length());
} catch (Throwable throwable) {
final Failure failure = Failure.from("step", ctx.element(), throwable);
ctx.output(failuresTag, failure);
}
}
public TupleTag<Integer> getOutputTag() {
return outputTag;
}
public TupleTag<Failure> getFailuresTag() {
return failuresTag;
}
}
In my first output (the success case), there is no need to pass the TupleTag: ctx.output(1 / word.length()).
For my second output (the failure case), I pass the failures tag together with the corresponding element.
I was able to get around this by making my ParDo an anonymous function instead of a class. I put the whole function inline and had no problem finding the output tags after I did that. Thanks for the suggestions!
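For anyone hitting the same issue, here is roughly what the inlined version looks like (reconstructed from memory, so treat it as a sketch rather than my exact code): the tags are declared in the enclosing scope, so the same instances are visible both to withOutputTags and inside the DoFn.
final TupleTag<AvroPubsubMessageRecord> medaPcollection = new TupleTag<AvroPubsubMessageRecord>() {};
final TupleTag<AvroPubsubMessageRecord> noTag = new TupleTag<AvroPubsubMessageRecord>() {};
PCollectionTuple results = messages.apply("Map to Archive",
    ParDo.of(new DoFn<PubsubMessage, AvroPubsubMessageRecord>() {
        @ProcessElement
        public void processElement(ProcessContext context, MultiOutputReceiver out) {
            PubsubMessage message = context.element();
            String msgStr = new String(message.getPayload(), StandardCharsets.UTF_8);
            String appCode = new JSONObject(msgStr).getString("app_code");
            AvroPubsubMessageRecord record = new AvroPubsubMessageRecord(
                message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis());
            // Same tag instances as used in withOutputTags below.
            if ("MEDA".equals(appCode)) {
                out.get(medaPcollection).output(record);
            } else {
                out.get(noTag).output(record);
            }
        }
    }).withOutputTags(noTag, TupleTagList.of(medaPcollection)));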
I am developing an MQTT connector for Kafka using Spring.
Using the MQTT library provided by Spring, messages are collected as follows.
Message handler:
@Bean
@ServiceActivator(inputChannel = "mqttInputChannel")
public MessageHandler handler() {
return new MessageHandler() {
@Override
public void handleMessage(Message<?> message) throws MessagingException {
String topic = message.getHeaders().get(MqttHeaders.RECEIVED_TOPIC).toString();
if(topic.equals("myTopic")) {
System.out.println("Mqtt data pub");
}
System.out.println(message.getPayload());
if(topic==null) {
topic = "mqttdata";
}
String tag = "test/vib";
String name = null;
if(name==null) {
name = KafkaMessageService.MQTT_PRODUCER;
}
HashMap<String, Object> datalist = new HashMap<String, Object>();
try {
datalist =convertJSONstringToMap(message.getPayload().toString());
System.out.println(datalist.get("mac"));
counts = kafkaMessageService.publish(topic, name, tag, (HashMap<String,Object>[] datalist);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
};
}
public static HashMap<String,Object> convertJSONstringToMap(String json) throws Exception {
ObjectMapper mapper = new ObjectMapper();
HashMap<String, Object> map = new HashMap<String, Object>();
map = mapper.readValue(json, new TypeReference<HashMap<String, Object>>() {});
return map;
}
Publish method:
public int publish(String topic,String producerName,String tag,HashMap<String,Object>[] datalist) throws NotMatchedProducerException,KafkaPubFailureException{
KafkaProducerAdaptor adaptor = searchProducerAdaptor(producerName);
if(adaptor==null) {
throw new NotMatchedProducerException();
}
KafkaTemplate<String,Object> kafkaTemplate = adaptor.getKafkaTemplate();
LocalDateTime currentDateTime = LocalDateTime.now();
String receivedTime = currentDateTime.toString();
ObjectMapper objectMapper = new ObjectMapper();
String key = adaptor.getName();
int counts = 0;
for(HashMap<String,Object> data : datalist) {
Map<String,Object> messagePacket = new HashMap<String,Object>();
messagePacket.put("tag", tag);
messagePacket.put("data", data);
messagePacket.put("receivedtime", receivedTime);
try {
kafkaTemplate.send(topic,key,objectMapper.valueToTree(messagePacket)).get();
logger.info("Sent message : topic=["+topic+"],key=["+key+"] value=["+messagePacket+"]");
} catch(Exception e) {
logger.info("Unable to send message : topic=["+topic+"],key=["+key+"] message=["+messagePacket+"] / due to : "+e.getMessage());
throw new KafkaPubFailureException(e);
}
counts++;
}
return counts;
}
I don't know how to declare a HashMap<String, Object>[] as an instance, or how to use it.
The source above was taken from the Spring support material as-is, with some modifications.
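To make the publish(...) call compile, I think I need something like the sketch below (just my guess; Java does not allow new HashMap<String, Object>[1] with generics directly, so an unchecked cast from a raw array, or switching the parameter to a List, seems to be required):
// Guess only: wrap the single parsed map in a one-element array inside the existing try block.
@SuppressWarnings("unchecked")
HashMap<String, Object>[] datalist = (HashMap<String, Object>[]) new HashMap[1];
datalist[0] = convertJSONstringToMap(message.getPayload().toString());
counts = kafkaMessageService.publish(topic, name, tag, datalist);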
The same record is being written to the topic, and the connector is not pulling the latest records from Splunk. The time parameters are set in the start method to pull the last one minute of data. Any inputs?
Currently I don't set an offset from the source. When poll runs each time, does it look for the source offset and then poll? Can we use the time as the offset in the logs?
@Override
public List<SourceRecord> poll() throws InterruptedException {
List<SourceRecord> results = new ArrayList<>();
Map<String, String> recordProperties = new HashMap<String, String>();
while (true) {
try {
String line = null;
InputStream stream = job.getResults(previewArgs);
String earlierKey = null;
String value = null;
ResultsReaderCsv csv = new ResultsReaderCsv(stream);
HashMap<String, String> event;
while ((event = csv.getNextEvent()) != null) {
for (String key: event.keySet()) {
if(key.equals("rawlogs")){
recordProperties.put("rawlogs", event.get(key)); results.add(extractRecord(Splunklog.SplunkLogSchema(), line, recordProperties));
return results;}}}
csv.close();
stream.close();
Thread.sleep(500);
} catch(Exception ex) {
System.out.println("Exception occurred : " + ex);
}
}
}
private SourceRecord extractRecord(Schema schema, String line, Map<String, String> recordProperties) {
Map<String, String> sourcePartition = Collections.singletonMap(FILENAME_FIELD, FILENAME);
Map<String, String> sourceOffset = Collections.singletonMap(POSITION_FIELD, recordProperties.get(OFFSET_KEY));
return new SourceRecord(sourcePartition, sourceOffset, TOPIC_NAME, schema, recordProperties);
}
@Override
public void start(Map<String, String> properties) {
try {
config = new SplunkSourceTaskConfig(properties);
} catch (ConfigException e) {
throw new ConnectException("Couldn't start SplunkSourceTask due to configuration error", e);
}
HttpService.setSslSecurityProtocol(SSLSecurityProtocol.TLSv1_2);
Service service = new Service("splnkip", port);
String credentials = "user:pwd";
String basicAuthHeader = Base64.encode(credentials.getBytes());
service.setToken("Basic " + basicAuthHeader);
String startOffset = readOffset();
JobArgs jobArgs = new JobArgs();
if (startOffset != null) {
log.info("-------------------------------task OFFSET!NULL ");
jobArgs.setExecutionMode(JobArgs.ExecutionMode.BLOCKING);
jobArgs.setSearchMode(JobArgs.SearchMode.NORMAL);
jobArgs.setEarliestTime(startOffset);
jobArgs.setLatestTime("now");
jobArgs.setStatusBuckets(300);
} else {
log.info("-------------------------------task OFFSET=NULL ");
jobArgs.setExecutionMode(JobArgs.ExecutionMode.BLOCKING);
jobArgs.setSearchMode(JobArgs.SearchMode.NORMAL);
jobArgs.setEarliestTime("+419m");
jobArgs.setLatestTime("+420m");
jobArgs.setStatusBuckets(300);
}
String mySearch = "search host=search query";
job = service.search(mySearch, jobArgs);
while (!job.isReady()) {
try {
Thread.sleep(500);
} catch (InterruptedException ex) {
log.error("Exception occurred while waiting for job to start: " + ex);
}
}
previewArgs = new JobResultsPreviewArgs();
previewArgs.put("output_mode", "csv");
stop = new AtomicBoolean(false);
}
I have written an Apache Beam pipeline that reads a directory containing 99 other files, calculates the checksum of each file, and creates a key-value pair of the file and its checksum. What I need to do is write these key-value pairs to a manifest.json file. I am currently running into some serialization problems, and any advice and help would be amazing.
Here is my code:
public class BeamPipeline {
private static final Logger log = LoggerFactory.getLogger(BeamPipeline.class);
public static interface MyOptions extends PipelineOptions {
@Description("Input Path(with gs:// prefix)")
String getInput();
void setInput(String value);
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
JsonObject obj = new JsonObject();
File dir = new File(options.getInput());
for (File file : dir.listFiles()) {
String inputString = file.toString();
p
.apply("Match Files", FileIO.match().filepattern(inputString))
.apply("Read Files", FileIO.readMatches())
.apply(MapElements.via(new SimpleFunction<FileIO.ReadableFile, KV<String, String>>() {
public KV<String, String> apply(FileIO.ReadableFile file) {
String temp = null;
try {
temp = file.readFullyAsUTF8String();
} catch (IOException e) {
}
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(temp);
obj.addProperty(temp, sha256hex);
String json = obj.toString();
try (FileWriter fileWriter = new FileWriter("./manifest.json")) {
fileWriter.write(json);
} catch (IOException e) {
}
return KV.of(file.getMetadata().resourceId().toString(), sha256hex);
}
}))
.apply("Print", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c) {
log.info(String.format("File: %s, SHA-256 %s", c.element().getKey(), c.element().getValue()));
}
}));
}
p.run();
}
}
Here are my errors currently:
"main" java.lang.IllegalArgumentException: unable to serialize DoFnAndMainOutput{doFn=org.apache.beam.sdk.transforms.MapElements$1#50756c76, mainOutputTag=Tag<output>}
Caused by: java.io.NotSerializableException: com.google.gson.JsonObject
DoFns are serialized together with all the objects they access.
JsonObject is not serializable. It is created outside the DoFn and referenced inside the DoFn, which makes the DoFn non-serializable.
You can create the JsonObject inside the DoFn to avoid this serialization issue.
public class BeamPipeline {
private static final Logger log = LoggerFactory.getLogger(BeamPipeline.class);
public static interface MyOptions extends PipelineOptions {
@Description("Input Path(with gs:// prefix)")
String getInput();
void setInput(String value);
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
File dir = new File(options.getInput());
for (File file : dir.listFiles()) {
String inputString = file.toString();
p
.apply("Match Files", FileIO.match().filepattern(inputString))
.apply("Read Files", FileIO.readMatches())
.apply(MapElements.via(new SimpleFunction<FileIO.ReadableFile, KV<String, String>>() {
public KV<String, String> apply(FileIO.ReadableFile file) {
String temp = null;
try {
temp = file.readFullyAsUTF8String();
} catch (IOException e) {
}
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(temp);
JsonObject obj = new JsonObject();
obj.addProperty(temp, sha256hex);
String json = obj.toString();
try (FileWriter fileWriter = new FileWriter("./manifest.json")) {
fileWriter.write(json);
} catch (IOException e) {
}
return KV.of(file.getMetadata().resourceId().toString(), sha256hex);
}
}))
.apply("Print", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c) {
log.info(String.format("File: %s, SHA-256 %s", c.element().getKey(), c.element().getValue()));
}
}));
}
p.run();
}
}
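As a side note (just a suggestion, not something the question asked for): writing manifest.json with a FileWriter inside the map function is a side effect that runs once per element. One alternative sketch is to format each KV as a JSON line and let TextIO write the file; here filesAndHashes stands for the PCollection<KV<String, String>> produced by the MapElements step, and "manifest" is an assumed output prefix:
filesAndHashes
    .apply("Format as JSON", MapElements.into(TypeDescriptors.strings())
        .via((KV<String, String> kv) -> {
            // One JSON object per file: {"<file>": "<sha256>"}
            JsonObject line = new JsonObject();
            line.addProperty(kv.getKey(), kv.getValue());
            return line.toString();
        }))
    .apply("Write manifest", TextIO.write()
        .to("manifest")
        .withSuffix(".json")
        .withoutSharding());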
I want to get a list of objects from a database.
I'm 100% sure that I retrieve the data (so my PHP code seems to be good), but the list stays empty.
public ArrayList<Categorie> getListCategorie() {
ArrayList<Categorie> listcategories = new ArrayList<>();
ConnectionRequest con2 = new ConnectionRequest();
con2.setUrl("http://localhost/pidev2017/selectcategorie.php");
con2.addResponseListener(new ActionListener<NetworkEvent>() {
@Override
public void actionPerformed(NetworkEvent evt) {
try {
JSONParser j = new JSONParser();
Map<String, Object> categories = j.parseJSON(new CharArrayReader(new String(con2.getResponseData()).toCharArray()));
List<Map<String, Object>> list = (List<Map<String, Object>>) categories.get("Categorie");
for (Map<String, Object> obj : list) {
Categorie categorie = new Categorie();
categorie.setId(Integer.parseInt(obj.get("id").toString()));
categorie.setNomCategorie(obj.get("nomCategorie").toString());
listcategories.add(categorie);
}
} catch (IOException ex) {
}
}
});
NetworkManager.getInstance().addToQueue(con2);
return listcategories;
}
When I try to fetch my result "listcategories", I find that it is empty.
Change
NetworkManager.getInstance().addToQueue(con2);
to
NetworkManager.getInstance().addToQueueAndWait(con2);
It's possible that you try to get the result before the data has been fetched.
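In context, the end of getListCategorie() would then look like the snippet below (the rest of the method stays the same); addToQueueAndWait blocks until the request completes, so the response listener has already filled listcategories before it is returned:
NetworkManager.getInstance().addToQueueAndWait(con2);
return listcategories;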