Apache Flink transform DataStream (source) to a List? - java

My question is how to transform from a DataStream to a List, for example in order to be able to iterate through it.
The code looks like this:
package flinkoracle;

//imports
//....

public class FlinkOracle {

    final static Logger LOG = LoggerFactory.getLogger(FlinkOracle.class);

    public static void main(String[] args) {
        LOG.info("Starting...");
        // get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        TypeInformation[] fieldTypes = new TypeInformation[]{BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO};
        RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);

        // get the source from Oracle DB
        DataStream<?> source = env
                .createInput(JDBCInputFormat.buildJDBCInputFormat()
                        .setDrivername("oracle.jdbc.driver.OracleDriver")
                        .setDBUrl("jdbc:oracle:thin:@localhost:1521")
                        .setUsername("user")
                        .setPassword("password")
                        .setQuery("select * from table1")
                        .setRowTypeInfo(rowTypeInfo)
                        .finish());

        source.print().setParallelism(1);

        try {
            LOG.info("----------BEGIN----------");
            env.execute();
            LOG.info("----------END----------");
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        LOG.info("End...");
    }
}
Thanks a lot in advance.
Br
Tamas

Flink provides an iterator sink to collect DataStream results for testing and debugging purposes. It can be used as follows:
import org.apache.flink.contrib.streaming.DataStreamUtils;
DataStream<Tuple2<String, Integer>> myResult = ...
Iterator<Tuple2<String, Integer>> myOutput = DataStreamUtils.collect(myResult);
You can copy the iterator into a new list like this:
List<Tuple2<String, Integer>> list = new ArrayList<>();
while (myOutput.hasNext())
    list.add(myOutput.next());
Flink also provides a bunch of simple write*() methods on DataStream that are mainly intended for debugging purposes. How the data is flushed to the target system depends on the implementation of the OutputFormat. This means that not all elements sent to the OutputFormat immediately show up in the target system. Note: these write*() methods do not participate in Flink's checkpointing, and in failure cases those records might be lost.
writeAsText() / TextOutputFormat
writeAsCsv(...) / CsvOutputFormat
print() / printToErr()
writeUsingOutputFormat() / FileOutputFormat
writeToSocket
Source: link.
You may need to add the following dependency to use DataStreamUtils:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-contrib</artifactId>
<version>0.10.2</version>
</dependency>
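Putting this together for the JDBC source in the question, here is a minimal sketch. It assumes the stream is typed as DataStream<Row>, matching the RowTypeInfo above; package names for Row and DataStreamUtils vary across Flink versions, so adjust the imports to your version.
import org.apache.flink.contrib.streaming.DataStreamUtils;
import org.apache.flink.types.Row;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// 'source' is the DataStream<Row> created from the JDBCInputFormat in the question;
// note: depending on the Flink version, collect(...) may declare a checked IOException
Iterator<Row> rowIterator = DataStreamUtils.collect(source);

// copy the iterator into a regular list so it can be traversed like any collection
List<Row> rows = new ArrayList<>();
while (rowIterator.hasNext()) {
    rows.add(rowIterator.next());
}

for (Row row : rows) {
    LOG.info("Row: {}", row);
}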

In newer versions, DataStreamUtils::collect has been deprecated. Instead you can use DataStream::executeAndCollect which, if given a limit, will return a List of at most that size.
var list = source.executeAndCollect(100);
If you do not know how many elements there are, or if you simply want to iterate through the results without loading them all into memory at once, then you can use the no-arg version to get a CloseableIterator:
try (var iterator = source.executeAndCollect()) {
    // do something
}
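Applied to the Row stream from the question, a minimal sketch of the iterator variant could look like the following (the enclosing method must declare or handle the checked Exception; names are taken from the question):
import org.apache.flink.types.Row;
import org.apache.flink.util.CloseableIterator;

// executeAndCollect() submits the job and streams the results back to the client
try (CloseableIterator<Row> iterator = source.executeAndCollect()) {
    while (iterator.hasNext()) {
        Row row = iterator.next();
        LOG.info("Row: {}", row);  // process each row without materializing the whole stream
    }
}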

Related

How to limit size of velocity output

I am using Java and Apache Velocity 1.7 to evaluate templates.
The following is sample code:
public void internalEvaluate(Map<String, Object> customContext, String templateText) throws IOException {
    // add custom context to VelocityContext
    VelocityContext context = new VelocityContext();
    for (Map.Entry<String, Object> entry : customContext.entrySet()) {
        context.put(entry.getKey(), entry.getValue());
    }
    // define writer
    StringWriter output = new StringWriter();
    // define logTag
    String logTag = "TestVTL";
    // check input template text
    if (templateText == null)
        templateText = "$noDescription";
    Velocity.evaluate(context, output, logTag, templateText);
    // write output to file
    saveToFile(output);
}
However, a specific customContext or templateText may produce a very large output.
The output can be written to a file, but the file is too large to open in an editor.
Below are my questions:
Q1.
I would like to limit or check the size of the output at runtime (or before calling evaluate()) and raise a warning about creating too large a file.
Does Velocity provide a configuration option or an API to do something like this?
Q2.
The evaluation process may take a long time.
I would like to know the progress status during the Velocity evaluation.
Is it possible to get progress information?
Best regards,
Since Velocity only sees a Writer, it has no means of counting the output size.
Your best option would be to implement a CustomStringWriter class that throws an exception once its internal size reaches a certain limit.
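A minimal sketch of such a writer might look like this (the class name, the limit parameter, and the choice of IllegalStateException are illustrative, not part of Velocity's API):
import java.io.StringWriter;

// A StringWriter that refuses to grow past a fixed number of characters.
public class CustomStringWriter extends StringWriter {

    private final int maxSize;

    public CustomStringWriter(int maxSize) {
        this.maxSize = maxSize;
    }

    @Override
    public void write(int c) {
        checkLimit(1);
        super.write(c);
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        checkLimit(len);
        super.write(cbuf, off, len);
    }

    @Override
    public void write(String str) {
        checkLimit(str.length());
        super.write(str);
    }

    @Override
    public void write(String str, int off, int len) {
        checkLimit(len);
        super.write(str, off, len);
    }

    private void checkLimit(int incoming) {
        if (getBuffer().length() + incoming > maxSize) {
            throw new IllegalStateException(
                    "Velocity output exceeded the limit of " + maxSize + " characters");
        }
    }
}
Passing an instance of this writer to Velocity.evaluate(...) and catching the exception around that call would turn an oversized output into the warning asked about in Q1.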

Using apache common exec to write to pdf

Hi all, I am trying to use Apache Commons Exec to create and write to a file.
The command-line invocation to write to a file is as follows:
Example: PDFAnnotator.exe "C:\My Documents\Test.pdf"
I have tried the following:
public PrintResultHandler print(final File file, final long printJobTimeout, final boolean printInBackground)
        throws IOException {
    int exitValue;
    ExecuteWatchdog watchdog = null;
    PrintResultHandler resultHandler;

    // build up the command line to using a 'java.io.File'
    CommandLine commandLine = new CommandLine("C:\\Program Files (x86)\\Adobe\\Reader 11.0\\Reader\\AcroRd32.exe");
    //CommandLine cmdLine = new CommandLine("C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe");
    Map map = new HashMap();
    map.put("file", new File("C:\\test\\invoice.pdf"));
    commandLine.addArgument("/p");
    commandLine.addArgument("/h");
    commandLine.addArgument("${file}");

    // create the executor and consider the exitValue '1' as success
    final Executor executor = new DefaultExecutor();
    executor.setExitValue(1);

    // create a watchdog if requested
    if (printJobTimeout > 0) {
        watchdog = new ExecuteWatchdog(printJobTimeout);
        executor.setWatchdog(watchdog);
    }

    // pass a "ExecuteResultHandler" when doing background printing
    if (printInBackground) {
        System.out.println("[print] Executing non-blocking print job ...");
        resultHandler = new PrintResultHandler(watchdog);
        executor.execute(commandLine, (Map<String, String>) resultHandler);
    } else {
        System.out.println("[print] Executing blocking print job ...");
        exitValue = executor.execute(commandLine);
        resultHandler = new PrintResultHandler(exitValue);
    }
    return resultHandler;
}
It does not create any PDF file as output. Can you please suggest what is wrong?
It seems this code has been modified from the Apache Commons Exec tutorial code, and a couple of the modifications you have made have caused problems.
Firstly, you have deleted the line
commandLine.setSubstitutionMap(map);
Without this line, you are creating the variable map, putting a single value into it and then doing nothing further with it. Clearly, a map that you never read any values out of achieves nothing. Reinstate this line; it's important.
The other problem is the line
executor.execute(commandLine, (Map<String, String>) resultHandler);
The difference between this code and the tutorial code is that you have added the cast to Map<String, String>. resultHandler is a PrintResultHandler, but this class does not implement Map<String, String> so this cast will fail.
I don't see why you have the cast at all. Get rid of it to leave you with:
executor.execute(commandLine, resultHandler);
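Putting both fixes together, the relevant part of the method from the question would look roughly like this (only the affected lines are shown; PrintResultHandler is the class from the tutorial/question):
Map<String, Object> map = new HashMap<>();
map.put("file", new File("C:\\test\\invoice.pdf"));
commandLine.addArgument("/p");
commandLine.addArgument("/h");
commandLine.addArgument("${file}");
commandLine.setSubstitutionMap(map);  // reinstated: resolves ${file} from the map

// ...

if (printInBackground) {
    System.out.println("[print] Executing non-blocking print job ...");
    resultHandler = new PrintResultHandler(watchdog);
    executor.execute(commandLine, resultHandler);  // no cast: resultHandler is an ExecuteResultHandler
}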
If your code continues not to work, then I can't say what the reasons would be. Maybe the Adobe Reader executable isn't where you think it is, or maybe the file doesn't exist or doesn't have read permissions. In any case, suitable details should be written to standard output or standard error to help you diagnose the problem further.

Apache Flink Dynamic Pipeline

I'm working on creating a framework to allow customers to create their own plugins for my software built on Apache Flink. I've outlined in a snippet below what I'm trying to get working (just as a proof of concept); however, when trying to upload it I get an org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
I want to be able to branch the input stream into x number of different pipelines, then have those combine into a single output. What I have below is just the simplified version I'm starting with.
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Setup up execution environment and get stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        // It seemed as though I needed a seed DataStream to union all others on
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"");
        DataStream<ObjectNode> alerts = see.fromElements(seedObject);

        // Union the output of each "rule" above with the seed object to then output
        for (DataStream<ObjectNode> output : outputs) {
            alerts.union(output);
        }

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
                .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
        see.execute();
    }
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here.
There were a few errors I found.
1) A Stream Execution Environment can only have one input (apparently? I could be wrong) so adding the .fromElements input was not good
2) I forgot all DataStreams are immutable so the .union operation creates a new DataStream output.
The final result ended up being much simpler
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Setup up execution environment and get stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
                .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
        see.execute();
    }
}
The code you posted cannot compile because of the last part (i.e., converting to string): alerts is the Optional returned by reduce, so that .map is the JDK's map, not Flink's DataStream.map. Changing it to
alerts.get().map(ObjectNode::toString);
should fix it.
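In full, the tail of the pipeline from the answer above would then read something like this (a sketch keeping the answer's variable names; if codes could ever be empty, check alerts.isPresent() before calling get()):
// unwrap the Optional produced by reduce(), then use Flink's DataStream.map
alerts.get()
        .map((MapFunction<ObjectNode, String>) ObjectNode::toString)
        .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
see.execute();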
Good luck.

Java: Antlr4 MySql get individual statements

I'm using Java with JDBC to run MySql code. I want to execute a DDL script, but JDBC can only execute a single statement at a time, which makes it unsuitable to execute a whole .sql file out of the box.
What I'm trying to do is use Antlr4 to parse the .sql file so I can break up each individual statement and then iteratively execute them with JDBC.
I've gotten this far:
InputStream resourceAsStream = Main.class.getClassLoader()
        .getResourceAsStream("an-arbitrary-ddl.sql");
CharStream codePointCharStream = CharStreams.fromStream(resourceAsStream);
MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
TokenStream tokenStream = new CommonTokenStream(tokenSource);
MySqlParser mySqlParser = new MySqlParser(tokenStream);
// Where do I go from here?
I'm sure I'm just not searching for the correct terms because I'm new to ANTLR and to parsing code by hand. I can't find any reference for how to get individual SQL statements out of the MySqlParser from here. What do I need to do next?
A parser is not the right tool for this kind of problem. A statement splitter is pretty easy to write manually and much faster if you do it yourself. I implemented such a splitter in C++ in MySQL Workbench; it shouldn't be difficult to port to Java. The code is very fast (1 million lines of SQL code in under 1 second on an average machine). A parser would need much longer.
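For reference, a heavily simplified Java sketch of such a splitter might look like the following. It respects string literals, backtick identifiers and comments, but it ignores the client-side DELIMITER command and doubled-quote escapes, so it is nowhere near the Workbench implementation; the class name is illustrative.
import java.util.ArrayList;
import java.util.List;

public final class SqlScriptSplitter {

    // Splits a script on ';' while copying quoted regions verbatim and skipping comments.
    public static List<String> splitStatements(String script) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        int n = script.length();
        while (i < n) {
            char c = script.charAt(i);
            if (c == '\'' || c == '"' || c == '`') {
                // copy a quoted region verbatim, honoring backslash escapes
                char quote = c;
                current.append(c);
                i++;
                while (i < n) {
                    char q = script.charAt(i);
                    current.append(q);
                    i++;
                    if (q == '\\' && i < n) {
                        current.append(script.charAt(i));
                        i++;
                    } else if (q == quote) {
                        break;
                    }
                }
            } else if ((c == '-' && i + 1 < n && script.charAt(i + 1) == '-') || c == '#') {
                // line comment: skip to end of line
                while (i < n && script.charAt(i) != '\n') {
                    i++;
                }
            } else if (c == '/' && i + 1 < n && script.charAt(i + 1) == '*') {
                // block comment: skip past the closing "*/"
                i += 2;
                while (i + 1 < n && !(script.charAt(i) == '*' && script.charAt(i + 1) == '/')) {
                    i++;
                }
                i = Math.min(i + 2, n);
            } else if (c == ';') {
                // statement boundary
                String statement = current.toString().trim();
                if (!statement.isEmpty()) {
                    statements.add(statement);
                }
                current.setLength(0);
                i++;
            } else {
                current.append(c);
                i++;
            }
        }
        String last = current.toString().trim();
        if (!last.isEmpty()) {
            statements.add(last);
        }
        return statements;
    }
}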
I'm sure this can be improved, but the simplest way I could do it was to create a listener and provide its constructor with a Consumer<String> object. The listener looks at individual statements and recursively reconstructs them. There is probably a more optimal solution, but I no longer have time to look for one.
/**
 * @author Paul Nelson Baker
 * @see GitHub
 * @see LinkedIn
 * @since 2018-09
 */
public class SqlStatementListener extends MySqlParserBaseListener {

    private final Consumer<String> sqlStatementConsumer;

    public SqlStatementListener(Consumer<String> sqlStatementConsumer) {
        this.sqlStatementConsumer = sqlStatementConsumer;
    }

    @Override
    public void enterSqlStatement(MySqlParser.SqlStatementContext ctx) {
        if (ctx.getChildCount() > 0) {
            StringBuilder stringBuilder = new StringBuilder();
            recreateStatementString(ctx.getChild(0), stringBuilder);
            stringBuilder.setCharAt(stringBuilder.length() - 1, ';');
            String recreatedSqlStatement = stringBuilder.toString();
            sqlStatementConsumer.accept(recreatedSqlStatement);
        }
        super.enterSqlStatement(ctx);
    }

    private void recreateStatementString(ParseTree currentNode, StringBuilder stringBuilder) {
        if (currentNode instanceof TerminalNode) {
            stringBuilder.append(currentNode.getText());
            stringBuilder.append(' ');
        }
        for (int i = 0; i < currentNode.getChildCount(); i++) {
            recreateStatementString(currentNode.getChild(i), stringBuilder);
        }
    }
}
Next you need to traverse the statements; the string consumer from earlier lets you lazily redirect the output wherever you need. This can be as simple as printing to stdout, but it can just as easily be used to append to a list.
public List<String> mySqlStatementsFrom(String sourceCode) {
    List<String> statements = new ArrayList<>();
    mySqlStatementsToConsumer(sourceCode, statements::add);
    return statements;
}

public void mySqlStatementsToConsumer(String sourceCode, Consumer<String> mySqlStatementConsumer) {
    CharStream codePointCharStream = CharStreams.fromString(sourceCode);
    MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
    TokenStream tokenStream = new CommonTokenStream(tokenSource);
    MySqlParser mySqlParser = new MySqlParser(tokenStream);
    SqlStatementListener statementListener = new SqlStatementListener(mySqlStatementConsumer);
    ParseTreeWalker.DEFAULT.walk(statementListener, mySqlParser.sqlStatements());
}
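To tie this back to the original goal, each extracted statement can then be executed individually over JDBC. A minimal sketch (jdbcUrl, user, password and ddlScript are placeholders for your own values):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// execute the extracted statements one at a time
try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password);
     Statement statement = connection.createStatement()) {
    for (String sql : mySqlStatementsFrom(ddlScript)) {
        statement.execute(sql);
    }
}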

How to filter a stream of tuple with Apache Edgent

I created a source function according to this manual.
public static void main(String[] args) throws Exception {
    DirectProvider dp = new DirectProvider();
    Topology top = dp.newTopology();
    final URL url = new URL("http://finance.yahoo.com/d/quotes.csv?s=BAC+COG+FCX&f=snabl");
    TStream<String> linesOfWebsite = top.source(queryWebsite(url));
}
Now I'd like to filter this stream. I had something like this in mind:
TStream<Iterable<String>> simpleFiltered = source.filter(item-> item.contains("BAX");
This is not working. Does anybody have an idea how to filter the stream? I don't want to change the request URL to do the filtering upfront.
It's difficult to tell from the info provided. dp.submit(top) is needed to run the topology. Also, the filter code isn't matching an item that actually occurs in the data returned by that URL. E.g.,
...
TStream<String> linesOfWebsite = top.source(queryWebsite(url));
linesOfWebsite.print(); // show what's received
TStream<String> filtered = linesOfWebsite.filter(t -> t.contains("BAC"));
filtered.sink(t -> System.out.println("filtered: " + t));
dp.submit(top); // required
