Drools in Spark for a streaming file - Java

We were able to successfully integrate Drools with Spark. Applying Drools rules works for a batch file stored in HDFS, but we could not figure out how to apply the rules to a streaming source so that we can make decisions instantly. Below are snippets of what we are trying to achieve.
Case 1: batch file read from HDFS
SparkConf conf = new SparkConf().setAppName("sample");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> javaRDD = sc.textFile("/user/root/spark/sample.dat");
List<String> store = new ArrayList<String>();
store = javaRDD.collect();
Case 2: using a streaming context
SparkConf sparkconf = new SparkConf().setAppName("sparkstreaming");
JavaStreamingContext ssc =
new JavaStreamingContext(sparkconf, new Duration(1));
JavaDStream<String> lines = ssc.socketTextStream("xx.xx.xx.xx", xxxx);
In the first case we were able to apply our rules to the variable store, but in the second case we could not apply them to the DStream lines.
If someone has an idea of how this can be done, it would be a great help.

Here is one way to get it done.
Create your knowledge session with business rules first.
//Create knowledge and session here
KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
kbuilder.add( ResourceFactory.newFileResource( "rulefile.drl"),
ResourceType.DRL );
Collection<KnowledgePackage> pkgs = kbuilder.getKnowledgePackages();
kbase.addKnowledgePackages( pkgs );
final StatelessKnowledgeSession ksession = kbase.newStatelessKnowledgeSession();
Create JavaDStream using StreamingContext.
SparkConf sparkconf = new SparkConf().setAppName("sparkstreaming");
JavaStreamingContext ssc =
new JavaStreamingContext(sparkconf, new Duration(1));
JavaDStream<String> lines = ssc.socketTextStream("xx.xx.xx.xx", xxxx);
Call DStream's foreachRDD to create facts and fire your rules.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        List<String> facts = rdd.collect();
        // Apply rules on the collected facts here
        ksession.execute(facts);
        return null;
    }
});
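If collecting each micro-batch to the driver is not desirable, one variation (a sketch, not part of the original answer) is to fire the rules per partition on the executors. Here buildSession is a hypothetical helper that repeats the KnowledgeBase setup shown above; knowledge sessions are generally not serializable, so they are built where they are used.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
            @Override
            public void call(Iterator<String> partition) throws Exception {
                // Build the session on the executor instead of shipping the driver-side ksession
                StatelessKnowledgeSession session = buildSession("rulefile.drl"); // hypothetical helper
                List<String> facts = new ArrayList<String>();
                while (partition.hasNext()) {
                    facts.add(partition.next());
                }
                session.execute(facts);
            }
        });
        return null;
    }
});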

Related

Apache Flink Dynamic Pipeline

I'm working on creating a framework that allows customers to create their own plugins for my software built on Apache Flink. I've outlined in a snippet below what I'm trying to get working (just as a proof of concept), however when trying to upload it I get an org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
I want to be able to branch the input stream into x number of different pipelines, then having those combine together into a single output. What I have below is just my simplified version I'm starting with.
public class ContentBase {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kf-service:9092");
properties.setProperty("group.id", "varnost-content");
// Setup up execution environment and get stream from Kafka
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
.map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));
// Create a new List of Streams, one for each "rule" that is being executed
// For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
List<String> codes = Arrays.asList("404", "200", "500");
List<DataStream<ObjectNode>> outputs = new ArrayList<>();
for (String code : codes) {
outputs.add(MyClass.filter(logs, "response", code));
}
// It seemed as though I needed a seed DataStream to union all others on
ObjectMapper mapper = new ObjectMapper();
ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"}");
DataStream<ObjectNode> alerts = see.fromElements(seedObject);
// Union the output of each "rule" above with the seed object to then output
for (DataStream<ObjectNode> output : outputs) {
alerts.union(output);
}
// Convert to string and sink to Kafka
alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
.addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();
}
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here.
There were a few errors I found.
1) A StreamExecutionEnvironment can apparently only have one input (I could be wrong about this), so adding the .fromElements input was not good.
2) I forgot that all DataStreams are immutable, so the .union operation returns a new DataStream rather than modifying the original.
The final result ended up being much simpler:
public class ContentBase {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kf-service:9092");
properties.setProperty("group.id", "varnost-content");
// Setup up execution environment and get stream from Kafka
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
.map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));
// Create a new List of Streams, one for each "rule" that is being executed
// For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
List<String> codes = Arrays.asList("404", "200", "500");
List<DataStream<ObjectNode>> outputs = new ArrayList<>();
for (String code : codes) {
outputs.add(MyClass.filter(logs, "response", code));
}
Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);
// Convert to string and sink to Kafka
alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
.addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();
}
}
The code you posted cannot be compiled because of the last part (converting to a string): alerts is an Optional, so alerts.map resolves to java.util.Optional's map rather than Flink's DataStream.map. Changing it to
alerts.get().map(ObjectNode::toString);
fixes it.
Good luck.
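Put together, the tail of the corrected pipeline would look roughly like this (a sketch reusing the topic and broker names from the question, and assuming outputs is non-empty so Optional.get() succeeds):
// Unwrap the Optional to get a Flink DataStream, then use Flink's map and the Kafka sink
DataStream<ObjectNode> alertStream = alerts.get();
alertStream.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
        .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();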

How to read multiple text files in Spark for document clustering?

I want to read multiple text documents from a directory for document clustering.
For that, I want to read data as:
SparkConf sparkConf = new SparkConf().setAppName(appName).setMaster("local[*]").set("spark.executor.memory", "2g");
JavaSparkContext context = new JavaSparkContext(sparkConf);
SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
Dataset<Row> dataset = spark.read().textFile("path to directory");
Here, I don't want to use
JavaPairRDD data = context.wholeTextFiles(path);
because I want a Dataset as the return type.
In Scala you could write this:
context.wholeTextFiles("...").toDS()
In Java you need to use an Encoder. See the Javadoc for more detail.
JavaPairRDD<String, String> rdd = context.wholeTextFiles("hdfs:///tmp/test_read");
Encoder<Tuple2<String, String>> encoder = Encoders.tuple(Encoders.STRING(), Encoders.STRING());
spark.createDataset(rdd.rdd(), encoder).show();
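The snippet above produces a Dataset of (path, content) pairs. If only the document text is needed for clustering, it can be mapped down to the content (a sketch; the variable names rdd, encoder, and spark are carried over from the snippet above):
// Keep only the file contents, dropping the file paths
Dataset<Tuple2<String, String>> files = spark.createDataset(rdd.rdd(), encoder);
Dataset<String> contents = files.map(
        (MapFunction<Tuple2<String, String>, String>) t -> t._2(), Encoders.STRING());
contents.show();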

Spark checkpointing error when joining static dataset with DStream

I am trying to write a Spark Streaming application in Java. My application reads a continuous feed from a Hadoop directory using textFileStream() with a 1-minute batch interval.
I need to perform a Spark aggregation (group by) operation on the incoming DStream. After aggregation, I join the aggregated DStream<Key, Value1> with an RDD<Key, Value2> created from a static dataset read by textFile() from a Hadoop directory.
The problem comes when I enable checkpointing. With an empty checkpoint directory it runs fine. After running 2-3 batches I stop it with Ctrl+C and run it again.
On the second run it immediately throws a Spark exception referencing SPARK-5063:
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063
Following is the block of code from the Spark application:
private void compute(JavaSparkContext sc, JavaStreamingContext ssc) {
JavaRDD<String> distFile = sc.textFile(MasterFile);
JavaDStream<String> file = ssc.textFileStream(inputDir);
// Read Master file
JavaRDD<MasterParseLog> masterLogLines = distFile.flatMap(EXTRACT_MASTER_LOGLINES);
final JavaPairRDD<String, String> masterRDD = masterLogLines.mapToPair(MASTER_KEY_VALUE_MAPPER);
// Continuous Streaming file
JavaDStream<ParseLog> logLines = file.flatMap(EXTRACT_CKT_LOGLINES);
// calculate the sum of required field and generate group sum RDD
JavaPairDStream<String, Summary> sumRDD = logLines.mapToPair(CKT_GRP_MAPPER);
JavaPairDStream<String, Summary> grpSumRDD = sumRDD.reduceByKey(CKT_GRP_SUM);
//GROUP BY Operation
JavaPairDStream<String, Summary> grpAvgRDD = grpSumRDD.mapToPair(CKT_GRP_AVG);
// Join Master RDD with the DStream //This is the block causing error (without it code is working fine)
JavaPairDStream<String, Tuple2<String, String>> joinedStream = grpAvgRDD.transformToPair(
new Function2<JavaPairRDD<String, String>, Time, JavaPairRDD<String, Tuple2<String, String>>>() {
private static final long serialVersionUID = 1L;
public JavaPairRDD<String, Tuple2<String, String>> call(
JavaPairRDD<String, String> rdd, Time v2) throws Exception {
return masterRDD.value().join(rdd);
}
}
);
joinedStream.print(10);
}
public static void main(String[] args) {
JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
public JavaStreamingContext create() {
// Create the context with a 60 second batch size
SparkConf sparkConf = new SparkConf();
final JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaStreamingContext ssc1 = new JavaStreamingContext(sc, Durations.seconds(duration));
app.compute(sc, ssc1);
ssc1.checkpoint(checkPointDir);
return ssc1;
}
};
JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkPointDir, contextFactory);
// start the streaming server
ssc.start();
logger.info("Streaming server started...");
// wait for the computations to finish
ssc.awaitTermination();
logger.info("Streaming server stopped...");
}
I know that the block of code which joins the static dataset with the DStream is causing the error, but it is based on the Spark Streaming page of the Apache Spark website (sub-heading "stream-dataset join" under "Join Operations"). Please help me get it working, even if there is a different way of doing it; I need checkpointing enabled in my streaming application.
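For reference, the stream-dataset join pattern from that page looks roughly like this (a sketch with illustrative names; stream stands in for a JavaPairDStream<String, String>, and the sample data is hypothetical):
// Re-join each micro-batch RDD with a static pair RDD inside transformToPair
final JavaPairRDD<String, String> staticDataset = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, String>("keyA", "metaA"),
        new Tuple2<String, String>("keyB", "metaB")));
JavaPairDStream<String, Tuple2<String, String>> joined = stream.transformToPair(
        new Function<JavaPairRDD<String, String>, JavaPairRDD<String, Tuple2<String, String>>>() {
            public JavaPairRDD<String, Tuple2<String, String>> call(JavaPairRDD<String, String> rdd) {
                return rdd.join(staticDataset);
            }
        });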
Environment Details:
CentOS 6.5: 2-node cluster
Java: 1.8
Spark: 1.4.1
Hadoop: 2.7.1

Return result in json format - Apache Spark Sql

Hi, I am new to Apache Spark SQL and I am using the Spark Java API to query data from Hive. Here is the code I have used, which is from the Spark documentation:
SparkConf sparkConf = new SparkConf().setAppName("Spark").setMaster("local");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());
Row[] result = sqlContext.sql("Select * from TableName").collect();
Right now I am using a for loop to iterate over the result and concatenate it into a JSON string, which is very slow. Is there any other way to achieve this? Can anyone tell me how to do it?
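One approach worth trying (a sketch, not from the original post, and assuming a Spark 1.x DataFrame API where toJSON() is available) is to let Spark render each row as JSON instead of concatenating strings by hand:
// Each element of the resulting collection is one row serialized as a JSON string
DataFrame df = sqlContext.sql("Select * from TableName");
List<String> jsonRows = df.toJSON().toJavaRDD().collect();
for (String row : jsonRows) {
    System.out.println(row);
}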

SecureASTCustomizer: how to restrict loops?

I'm trying to restrict the use of loops (for and while statements) in a Groovy script.
I tried http://groovy-sandbox.kohsuke.org/ but it does not seem possible to restrict loops with that library.
Code:
final String script = "while(true){}";
final ImportCustomizer imports = new ImportCustomizer();
imports.addStaticStars("java.lang.Math");
imports.addStarImports("groovyx.net.http");
imports.addStaticStars("groovyx.net.http.ContentType", "groovyx.net.http.Method");
final SecureASTCustomizer secure = new SecureASTCustomizer();
secure.setClosuresAllowed(true);
List<Integer> tokensBlacklist = new ArrayList<>();
tokensBlacklist.add(Types.KEYWORD_WHILE);
secure.setTokensBlacklist(tokensBlacklist);
final CompilerConfiguration config = new CompilerConfiguration();
config.addCompilationCustomizers(imports, secure);
Binding intBinding = new Binding();
GroovyShell shell = new GroovyShell(intBinding, config);
final Object eval = shell.evaluate(script);
What's wrong with my code? Or does someone know how I can restrict these loops or operators?
while and for are statements, not tokens. You should add them to the statements blacklist instead of the tokens blacklist.
List<Class> statementBlacklist = new ArrayList<>();
statementBlacklist.add( org.codehaus.groovy.ast.stmt.WhileStatement.class );
secure.setStatementsBlacklist( statementBlacklist );
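As a quick check (a sketch that continues from the snippet above and reuses the config and imports from the question; for loops are blocked the same way via ForStatement), evaluating the script should now be rejected at compile time:
// for-loops can be blocked the same way
statementBlacklist.add( org.codehaus.groovy.ast.stmt.ForStatement.class );
secure.setStatementsBlacklist( statementBlacklist );
try {
    new GroovyShell(new Binding(), config).evaluate("while(true){}");
} catch (Exception e) {
    // SecureASTCustomizer rejects the blacklisted statement during compilation
    System.out.println("Script rejected: " + e);
}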
