Different results when running Flink in local mode and on a YARN cluster - java

I run a job using the Flink Java API that reads bytes from Kafka, parses them, and inserts the result into a Cassandra database using a static method from another library (both parsing and inserting are done by that library). Running the code locally in my IDE, I get the expected result, but when running on a YARN cluster the parse method does not work as expected!
public class Test {

    static HashMap<Integer, Object> ConfigHashMap = new HashMap<>();

    public static void main(String[] args) throws Exception {
        // (environment and Kafka source setup omitted here)
        CassandraConnection.connect();
        Parser.setInsert(true);

        stream.flatMap(new FlatMapFunction<byte[], Void>() {
            @Override
            public void flatMap(byte[] value, Collector<Void> out) throws Exception {
                Parser.parse(ByteBuffer.wrap(value), ConfigHashMap);
                // Parser.parse(ByteBuffer.wrap(value));
            }
        });

        env.execute();
    }
}
The Parser class has a static HashMap field that holds the configuration the parsing is based on, and data is inserted into it during execution. The problem when running on YARN was that this data was not available to the task managers; they just printed that the config is not available!
So I redefined that HashMap as a parameter of the parse method, but the result was the same.
How can I fix the problem?

I changed the static methods and fields to non-static ones; using a RichFlatMapFunction solved the problem.
stream.flatMap(new RichFlatMapFunction<byte[], Void>() {

    CassandraConnection con = new CassandraConnection();
    int i = 0;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        con.connect();
    }

    @Override
    public void flatMap(byte[] value, Collector<Void> out) throws Exception {
        ByteBuffer tb = ByteBuffer.wrap(value);
        np.parse(tb, ConfigHashMap, con); // np is the (now non-static) parser instance
    }
});
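A related pattern worth noting (my sketch, not part of the original answer): instead of any static field, the configuration map can be handed to the function through its constructor, so Flink serializes it with the function and ships a copy to every task manager. This assumes the map's contents are serializable, that Parser has a no-argument constructor, and it reuses the instance parse method and CassandraConnection from the code above.

public class ParsingFunction extends RichFlatMapFunction<byte[], Void> {

    // Serialized together with the function, so each task manager receives its own copy.
    private final HashMap<Integer, Object> config;

    private transient CassandraConnection con;
    private transient Parser parser;

    public ParsingFunction(HashMap<Integer, Object> config) {
        this.config = config;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        con = new CassandraConnection();
        con.connect();
        parser = new Parser(); // assumes Parser can be instantiated per parallel task
    }

    @Override
    public void flatMap(byte[] value, Collector<Void> out) throws Exception {
        parser.parse(ByteBuffer.wrap(value), config, con);
    }
}

Usage would then be stream.flatMap(new ParsingFunction(ConfigHashMap)); with no static state involved.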

Related

How to implement a BOUNDED source for Flink's batch execution mode?

I'm trying to write a Flink (1.12.1) batch job with the following steps:
Custom SourceFunction to connect to MongoDB
Some flatMaps and maps to transform the data
Sink it into another MongoDB
I'm trying to run it in a StreamExecutionEnvironment with RuntimeExecutionMode.BATCH, but the application throws an exception because it detects my source as UNBOUNDED... and I can't set it to BOUNDED (it must finish after collecting all the documents in the Mongo collection).
The exception:
exception in thread "main" java.lang.IllegalStateException: Detected an UNBOUNDED source with the 'execution.runtime-mode' set to 'BATCH'. This combination is not allowed, please set the 'execution.runtime-mode' to STREAMING or AUTOMATIC
at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.shouldExecuteInBatchMode(StreamGraphGenerator.java:335)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.generate(StreamGraphGenerator.java:258)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getStreamGraph(StreamExecutionEnvironment.java:1958)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getStreamGraph(StreamExecutionEnvironment.java:1943)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782)
at com.grupotsk.bigdata.matadatapmexporter.MetadataPMExporter.main(MetadataPMExporter.java:33)
Some code:
Execution environment
public static StreamExecutionEnvironment getBatch() {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    env.addSource(new MongoSource()).print();
    return env;
}
Mongo source:
public class MongoSource extends RichSourceFunction<Document> {

    private static final long serialVersionUID = 8321722349907219802L;

    private MongoClient mongoClient;
    private MongoCollection<Document> mc;

    @Override
    public void open(Configuration con) {
        mongoClient = new MongoClient(
                new MongoClientURI("mongodb://localhost:27017/database"));
        mc = mongoClient.getDatabase("database").getCollection("collection");
    }

    @Override
    public void run(SourceContext<Document> ctx) throws Exception {
        MongoCursor<Document> itr = mc.find(Document.class).cursor();
        while (itr.hasNext())
            ctx.collect(itr.next());
        this.cancel();
    }

    @Override
    public void cancel() {
        mongoClient.close();
    }
}
Thanks !
Sources used with RuntimeExecutionMode.BATCH must implement Source rather than SourceFunction, and the sink should implement Sink rather than SinkFunction.
See Integrating Flink into your ecosystem - How to build a Flink connector from scratch for an introduction to these new interfaces. They are described in FLIP-27: Refactor Source Interface and FLIP-143: Unified Sink API.
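As an illustration (my sketch, not part of the original answer), the following minimal job shows that BATCH mode accepts a source built on the new Source interface; the built-in NumberSequenceSource stands in here for a custom MongoDB connector, which would likewise have to return Boundedness.BOUNDED from getBoundedness().

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.lib.NumberSequenceSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // NumberSequenceSource implements the new Source interface and reports
        // Boundedness.BOUNDED, so the BATCH runtime mode accepts it.
        env.fromSource(new NumberSequenceSource(1, 100),
                       WatermarkStrategy.noWatermarks(),
                       "bounded-numbers")
           .print();

        env.execute("batch-mode-demo");
    }
}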

Implementing Spring + Apache Flink project with Postgres

I have a Spring Boot Gradle project that uses Apache Flink to process datastream signals. When a new signal comes through the datastream, I would like to look up (i.e. findById()) its details using an ID in a Postgres database table which is already created, in order to get additional information about the signal and enrich the data. I would like to avoid using Spring dependencies to perform the lookup (i.e. an autowired repository) and want to stick with a Flink implementation for the lookup.
Where can I specify the Postgres connection configuration such as port, database, URL, username, password, etc.? (For simplicity, assume the Postgres DB is local on my machine.) Is it as simple as adding the configuration to the application.properties file? If so, how can I write the query method to look up the record in the Postgres table when searching by a non-primary-key value?
Some online sources suggest using skeleton code like the one below, but I am not sure how/if it fits my use case. (I have an EventEntity model created which contains all the params/columns from the table I'm looking up.)
public class DatabaseMapper extends RichFlatMapFunction<String, EventEntity> {
    // Declare DB connection & query statements

    @Override
    public void open(Configuration parameters) throws Exception {
        // Initialize DB connection
        // Prepare query statements
    }

    @Override
    public void flatMap(String value, Collector<EventEntity> out) throws Exception {
    }
}
Your sample code is correct. You can put all your custom initialization and preparation code for PostgreSQL in the open() method, and then use the pre-configured fields in your flatMap() function.
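As an illustration (my sketch, not part of the original answer), here is a minimal JDBC-based version of that skeleton; the JDBC URL, table and column names, and the EventEntity constructor are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class DatabaseMapper extends RichFlatMapFunction<String, EventEntity> {

    private transient Connection connection;
    private transient PreparedStatement lookupStmt;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Connection details are hard-coded for brevity; in practice pass them in
        // through the function's constructor or Flink's ParameterTool.
        connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
        // Look up by a non-primary-key column ("signal_id" is a hypothetical column name).
        lookupStmt = connection.prepareStatement(
                "SELECT id, name, description FROM events WHERE signal_id = ?");
    }

    @Override
    public void flatMap(String signalId, Collector<EventEntity> out) throws Exception {
        lookupStmt.setString(1, signalId);
        try (ResultSet rs = lookupStmt.executeQuery()) {
            while (rs.next()) {
                // Assumes EventEntity has a matching constructor; adapt to your model.
                out.collect(new EventEntity(
                        rs.getLong("id"), rs.getString("name"), rs.getString("description")));
            }
        }
    }

    @Override
    public void close() throws Exception {
        if (lookupStmt != null) lookupStmt.close();
        if (connection != null) connection.close();
    }
}

For high-throughput enrichment, an asynchronous lookup (as in the Redis sample below) is usually a better fit than a blocking query per record.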
Here is a sample for Redis operations.
I have used RichAsyncFunction here and I suggest you do the same, as it is the recommended best practice. Read more here: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/asyncio.html
You can pass configuration parameters through the constructor and use them in your initialization process:
public static class AsyncRedisOperations extends RichAsyncFunction<Object, Object> {

    private transient JedisPool jedisPool;
    private transient Logger logger; // org.slf4j.Logger, created in open() so the function stays serializable
    private final Configuration redisConf;

    public AsyncRedisOperations(Configuration redisConf) {
        this.redisConf = redisConf;
    }

    @Override
    public void open(Configuration parameters) {
        this.logger = LoggerFactory.getLogger(AsyncRedisOperations.class);
        JedisPoolConfig jedisPoolConfig = new JedisPoolConfig();
        jedisPoolConfig.setMaxTotal(this.redisConf.getInteger("pool", 8));
        jedisPoolConfig.setMaxIdle(this.redisConf.getInteger("pool", 8));
        jedisPoolConfig.setMaxWaitMillis(this.redisConf.getInteger("maxWait", 0));
        JedisPool jedisPool = new JedisPool(jedisPoolConfig,
                this.redisConf.getString("host", "192.168.10.10"),
                this.redisConf.getInteger("port", 6379), 5000);
        try {
            this.jedisPool = jedisPool;
            this.logger.info("Redis connected: " + jedisPool.getResource().isConnected());
        } catch (Exception e) {
            this.logger.error("Exception while connecting Redis", e);
        }
    }

    @Override
    public void asyncInvoke(Object in, ResultFuture<Object> out) {
        try (Jedis jedis = this.jedisPool.getResource()) {
            // Use the incoming element as the Redis key and complete the future with the value,
            // otherwise the async operator would wait for the result indefinitely.
            String value = jedis.get(in.toString());
            this.logger.info("Redis value: " + value);
            out.complete(Collections.singleton(value));
        }
    }
}

How to get a result from Rserve back in a variable in Apache Camel

Since there is not much documentation available for the Rcode component of Apache Camel, I am not sure how to get output back from a simple R code snippet that I run through Rcode.
In my RouteBuilder, I have the following code:
from("activiti:testCamelTask:sendMsgToCamel", "direct://rcode_source")
.setBody(simple(rSourceCode))
.to("rcode://localhost:6311/parse_and_eval?bufferSize=4194304")
.end();
Where rSourceCode contains my R Code, which is:
c <-4;
print(c);
The code runs correctly and I can see the output in the Rserve console.
I want to get the value of the variable c back into my Java code as a variable. How can this be done?
It is more typical to use Apache Camel to send the result on to some other component, possibly within the same program. But you can also store the data, e.g. via a bean:
import org.apache.camel.CamelContext;
import org.apache.camel.ProducerTemplate;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;
import org.rosuda.REngine.REXP;

public class RserveCamel {

    public static class StoringBean {
        private REXP payload;

        public REXP getPayload() {
            return payload;
        }

        public void setPayload(REXP payload) {
            this.payload = payload;
        }
    }

    public static void main(String[] args) throws Exception {
        StoringBean storingBean = new StoringBean();

        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                from("direct:rcode")
                    .to("rcode:localhost:6311/eval?bufferSize=4194304")
                    .to("log:test?showBody=true&showHeaders=false")
                    .bean(storingBean, "setPayload");
            }
        });

        ProducerTemplate producerTemplate = context.createProducerTemplate();
        context.start();
        producerTemplate.sendBody("direct:rcode", "c <- 4; print(c);");
        context.stop();

        System.out.println(storingBean.getPayload().asString());
    }
}

Using JNDI to share objects beetween differends Java programs

I'm trying to run the following example, but without success.
The intention is to share objects through JNDI between two Java programs running on different virtual machines. I'm using TomEE 1.5.1 and TomEE 1.6.
The JNDI parameters that I'm using are:
Hashtable<String, String> ctxProps = new Hashtable<String, String>();
ctxProps.put("java.naming.factory.initial","org.apache.openejb.client.RemoteInitialContextFactory");
ctxProps.put("java.naming.provider.url", "http://myjndihost.com:8080/tomee/ejb");
ctxProps.put("java.naming.security.principal", "tomee");
ctxProps.put("java.naming.security.credentials", "tomee");
The Sharing class provides three methods: connect (called from the constructor), set and get. Its source code is here:
public class Sharing
{
    private Context context = null;

    public Sharing() throws Exception
    {
        connect();
    }

    private void connect() throws Exception
    {
        Hashtable<String, String> ctxProps = new Hashtable<String, String>();
        ctxProps.put("java.naming.factory.initial", "org.apache.openejb.client.RemoteInitialContextFactory");
        ctxProps.put("java.naming.provider.url", "http://localhost:8080/tomee/ejb");
        ctxProps.put("java.naming.security.principal", "tomee");
        ctxProps.put("java.naming.security.credentials", "tomee");
        context = new InitialContext(ctxProps);
        context = context.createSubcontext("DEMO");
    }

    public void set(String key, Serializable value) throws Exception
    {
        try
        {
            context.bind(key, value);
        }
        catch (NameAlreadyBoundException ex)
        {
            context.rebind(key, value);
        }
    }

    public Object get(String key) throws Exception
    {
        try
        {
            return context.lookup(key);
        }
        catch (NameNotFoundException e)
        {
            return null;
        }
    }
}
Then, I run this TestSet.java
public class TestSet
{
    public static void main(String[] args) throws Exception
    {
        Sharing share = new Sharing();
        share.put("fecha", new Date());
    }
}
It throws a javax.naming.OperationNotSupportedException:
Exception in thread "main" javax.naming.OperationNotSupportedException
at org.apache.openejb.client.JNDIContext.createSubcontext(JNDIContext.java:551)
at javax.naming.InitialContext.createSubcontext(InitialContext.java:483)
at demo.Sharing.connect(Sharing.java:32)
at demo.Sharing.<init>(Sharing.java:20)
at demo.TestSet.main(TestSet.java:9)
But if I remove the createSubcontext call (in line 32 of Sharing.java), then the next exception occurs at line 39, when I try to bind an object:
Exception in thread "main" javax.naming.OperationNotSupportedException
at org.apache.openejb.client.JNDIContext.bind(JNDIContext.java:511)
at javax.naming.InitialContext.bind(InitialContext.java:419)
at demo.Sharing.put(Sharing.java:39)
at demo.TestSet.main(TestSet.java:10)
Of course, the next step will be to run TestGet.java, but I haven't gotten there yet because I could not run TestSet successfully.
The original example, which uses WebLogic, is taken from here:
http://www.javaworld.com/article/2076440/jndi/use-jndi-to-share-objects-between-different-virtual-machines.html
Thanks a lot.
Pablo.

neo4j rest graphdb hangs when connecting to remote heroku instance

public class Test
{
    private static RestAPI rest = new RestAPIFacade("myIp", "username", "password");

    public static void main(String[] args)
    {
        Map<String, Object> foo = new HashMap<String, Object>();
        foo.put("Test key", "testing");
        rest.createNode(foo);
    }
}
There is no output; it just hangs on the connection indefinitely.
Environment:
Eclipse
JDK 7
neo4j-rest-binding 1.9: https://github.com/neo4j/java-rest-binding
Heroku
Any ideas as to why this just hangs?
The following code works:
public class Test
{
    private static RestAPI rest = new RestAPIFacade("myIp", "username", "password");

    public static void main(String[] args)
    {
        Node node = rest.getNodeById(1);
    }
}
So it appears that I can correctly retrieve remote values.
I guess this is caused by the lack of transactions. By default, neo4j-rest-binding aggregates multiple operations into one request (i.e. one transaction). There are two ways to deal with this:
change the transactional behaviour to "1 operation = 1 transaction" by setting -Dorg.neo4j.rest.batch_transaction=false for your JVM. Be aware this could impact performance, since every atomic operation becomes a separate REST request (a programmatic sketch of this option follows after the code below).
use transactions in your code:
RestGraphDatabase db = new RestGraphDatabase("http://localhost:7474/db/data", username, password);
Transaction tx = db.beginTx();
try {
    Node node = db.createNode();
    node.setProperty("key", "value");
    tx.success();
} finally {
    tx.finish();
}
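As a programmatic alternative for the first option (my illustration, not from the original answer), the flag can also be set as a system property; this should be equivalent to the -D JVM argument as long as it runs before neo4j-rest-binding reads the property, i.e. before the RestAPI client is created:

// Must run before the RestAPIFacade is instantiated.
System.setProperty("org.neo4j.rest.batch_transaction", "false");
RestAPI rest = new RestAPIFacade("myIp", "username", "password");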

Categories

Resources