I had the following dataset as input
id,name,gender
asinha161,Aniruddha,Male
vic,Victor,Male
day1,Daisy,Female
jazz030,Jasmine,Female
Mic002,Michael,Male
I aim to segregate the males and females into two separate output files, as follows:
Dataset for males
id,name,gender
asinha161,Aniruddha,Male
vic,Victor,Male
Mic002,Michael,Male
Dataset for females
id,name,gender
day1,Daisy,Female
jazz030,Jasmine,Female
I attempted to write Cascading framework code that performs the above task; the code is as follows:
public class Main {
    public static void main(String[] args) {
        Tap sourceTap = new FileTap(new TextDelimited(true, ","), "inputFile.txt");
        Tap sink_one = new FileTap(new TextDelimited(true, ","), "maleFile.txt");
        Tap sink_two = new FileTap(new TextDelimited(true, ","), "FemaleFile.txt");

        Pipe assembly = new Pipe("inputPipe");

        // ...split into two pipes
        Pipe malePipe = new Pipe("for_male", assembly);
        malePipe = new Each(malePipe, new CustomFilterByGender("male"));
        Pipe femalePipe = new Pipe("for_female", assembly);
        femalePipe = new Each(femalePipe, new CustomFilterByGender("female"));

        // create the flow
        List<Pipe> pipes = new ArrayList<Pipe>(2);
        pipes.add(malePipe);
        pipes.add(femalePipe);

        Tap outputTap = new MultiSinkTap<>(sink_one, sink_two);
        FlowConnector flowConnector = new LocalFlowConnector();
        Flow flow = flowConnector.connect(sourceTap, outputTap, pipes);
        flow.complete();
    }
}
Here, CustomFilterByGender(String gender) is a custom filter that passes tuples through according to the gender value given as an argument.
Please note that I have not used a custom Buffer, for the sake of efficiency.
Using MultiSinkTap, I am not able to get the desired output: the connect() method of the LocalFlowConnector does not accept the MultiSinkTap object, which results in a compile-time error.
Please suggest changes to the above code to make it work, or the correct way to use MultiSinkTap.
Thank you for patiently going through the question :)
I think you want to write the output of different pipes into different output files. I have made some changes to your code that should do what you want.
public class Main {
    public static void main(String[] args) {
        Tap sourceTap = new FileTap(new TextDelimited(true, ","), "inputFile.txt");
        Tap sink_one = new FileTap(new TextDelimited(true, ","), "maleFile.txt");
        Tap sink_two = new FileTap(new TextDelimited(true, ","), "FemaleFile.txt");

        Pipe assembly = new Pipe("inputPipe");

        Pipe malePipe = new Pipe("for_male", assembly);
        malePipe = new Each(malePipe, new CustomFilterByGender("male"));
        Pipe femalePipe = new Pipe("for_female", assembly);
        femalePipe = new Each(femalePipe, new CustomFilterByGender("female"));

        List<Pipe> pipes = new ArrayList<Pipe>(2);
        pipes.add(malePipe);
        pipes.add(femalePipe);

        // map each tail pipe (by name) to the sink it should write to
        Map<String, Tap> sinks = new HashMap<String, Tap>();
        sinks.put("for_male", sink_one);
        sinks.put("for_female", sink_two);

        FlowConnector flowConnector = new LocalFlowConnector();
        Flow flow = flowConnector.connect(sourceTap, sinks, pipes);
        flow.complete();
    }
}
Instead of using MultiSinkTap, you can directly pass a Map of the sinks that you want to connect to the output pipes, in this case malePipe and femalePipe. The keys of the map must match the names of the tail pipes.
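Since CustomFilterByGender only appears by name in both snippets, here is a minimal sketch (my own illustration, not the original class; the field name "gender" and the matching logic are assumptions) of how such a filter is typically written against Cascading's Filter interface:

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Filter;
import cascading.operation.FilterCall;

public class CustomFilterByGender extends BaseOperation implements Filter {

    private final String gender;

    public CustomFilterByGender(String gender) {
        this.gender = gender;
    }

    @Override
    public boolean isRemove(FlowProcess flowProcess, FilterCall filterCall) {
        // Remove the tuple when its "gender" field does not match the requested value
        String value = filterCall.getArguments().getString("gender");
        return !gender.equalsIgnoreCase(value);
    }
}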
I'm working on a framework that allows customers to create their own plugins for my software built on Apache Flink. I've outlined in the snippet below what I'm trying to get working (just as a proof of concept); however, I'm getting an org.apache.flink.client.program.ProgramInvocationException: The main method caused an error. error when trying to upload it.
I want to be able to branch the input stream into some number of different pipelines and then combine those back into a single output. What I have below is just the simplified version I'm starting with.
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Set up the execution environment and get the stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new list of streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on Flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        // It seemed as though I needed a seed DataStream to union all others on
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"}");
        DataStream<ObjectNode> alerts = see.fromElements(seedObject);

        // Union the output of each "rule" above with the seed object to then output
        for (DataStream<ObjectNode> output : outputs) {
            alerts.union(output);
        }

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
                .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));

        see.execute();
    }
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here.
There were a few errors I found.
1) A StreamExecutionEnvironment can apparently only have one input (I could be wrong), so adding the .fromElements input was not good.
2) I forgot that DataStreams are immutable, so the .union operation returns a new DataStream instead of modifying the one it is called on (see the small sketch below).
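A minimal sketch of point 2 (just an illustration, not part of the final job): DataStream.union returns a new stream, so the result has to be reassigned on every iteration.

import java.util.List;
import org.apache.flink.streaming.api.datastream.DataStream;

// Generic helper: unions a list of streams onto a seed stream.
// Calling union() without reassigning the result changes nothing.
static <T> DataStream<T> unionAll(DataStream<T> seed, List<DataStream<T>> streams) {
    DataStream<T> result = seed;
    for (DataStream<T> stream : streams) {
        result = result.union(stream); // reassign: union() returns a new DataStream
    }
    return result;
}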
The final result ended up being much simpler
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Set up the execution environment and get the stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new list of streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on Flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        // Union all the rule outputs into a single stream
        Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
                .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));

        see.execute();
    }
}
The code you posted does not compile because of the last part (converting to string): you mixed up the Java Stream API's map with Flink's. Changing it to
alerts.get().map(ObjectNode::toString);
will fix it.
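If the rule list could ever be empty, a slightly more defensive variant (my own suggestion, not part of the answer above) avoids calling get() on an empty Optional:

// Only wire up the Kafka sink if at least one rule produced a stream
alerts.ifPresent(stream ->
        stream.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
              .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output",
                      new SimpleStringSchema())));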
Good luck.
I have a CSV File which I want to display in a Grid in Vaadin.
The File looks like this:
CTI.csv
Facebook
Twitter
Instagram
Wiki
So far I have tried it with a while loop and a for loop. The for loop looks like this:
Scanner sc = null;
try {
    sc = new Scanner(new File("C:/development/code/HelloWorld/src/CTI.csv"));
} catch (FileNotFoundException e) {
    e.printStackTrace();
}

List<CTITools> tools = null;
for (Iterator<String> s = sc; s.hasNext(); ) {
    tools = Arrays.asList(new CTITools(s.next()));
}

Grid<CTITools> grid = new Grid<>();
grid.setItems(tools);
grid.addColumn(CTITools::getTool).setCaption("Tool");

layout.addComponents(grid);
setContent(layout);
The issue now is that it only shows the last entry, "Wiki". If I hardcode the data like the following, it works:
List<CTITools> tools;
tools = Arrays.asList(
        new CTITools("DFC"),
        new CTITools("AgentInfo"),
        new CTITools("Customer"),
        new CTITools("Wiki"));

Grid<CTITools> grid = new Grid<>();
grid.setItems(tools);
grid.addColumn(CTITools::getTool).setCaption("Tool");
So what am I doing wrong? Why doesn't the loop work?
The line tools = Arrays.asList(new CTITools(s.next())); creates a new single-element list on each iteration. If you want your items stored in one list, create it once, then use tools.add(new CTITools(s.next())) inside the loop to add each item to that same list.
You create a new list at each iteration, so you lose your previous content:
tools = Arrays.asList(new CTITools(s.next()));
You should instead add items to the same tools list:
List<CTITools> tools = new ArrayList<>();
for (Iterator<String> s = sc; s.hasNext(); ) {
    tools.add(new CTITools(s.next()));
}
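Putting it together with the rest of your snippet, the reading part could look like this (a sketch that assumes CTITools is a simple wrapper with a getTool() accessor, as in your question, and that each CSV line holds one tool name); the try-with-resources also closes the Scanner for you:

List<CTITools> tools = new ArrayList<>();
try (Scanner sc = new Scanner(new File("C:/development/code/HelloWorld/src/CTI.csv"))) {
    while (sc.hasNextLine()) {
        tools.add(new CTITools(sc.nextLine())); // one CTITools entry per CSV line
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
}

Grid<CTITools> grid = new Grid<>();
grid.setItems(tools);
grid.addColumn(CTITools::getTool).setCaption("Tool");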
I have a project where I want to load in a given shapefile, and pick out polygons above a certain size before writing the results to a new shapefile. Maybe not the most efficient, but I've got code that successfully does all of that, right up to the point where it is supposed to write the shapefile. I get no errors, but the resulting shapefile has no usable data in it. I've followed as many tutorials as possible, but still I'm coming up blank.
The first bit of code is where I read in a shapefile, pick out the polygons I want, and put them into a feature collection. This part seems to work fine as far as I can tell.
public class ShapefileTest {
    public static void main(String[] args) throws MalformedURLException, IOException, FactoryException,
            MismatchedDimensionException, TransformException, SchemaException {
        File oldShp = new File("Old.shp");
        File newShp = new File("New.shp");

        // Get data from the original shapefile
        Map<String, Object> map = new HashMap<String, Object>();
        map.put("url", oldShp.toURI().toURL());

        // Connect to the dataStore
        DataStore dataStore = DataStoreFinder.getDataStore(map);

        // Get the typeName from the dataStore
        String typeName = dataStore.getTypeNames()[0];

        // Get the FeatureSource from the dataStore
        FeatureSource<SimpleFeatureType, SimpleFeature> source = dataStore.getFeatureSource(typeName);
        SimpleFeatureCollection collection = (SimpleFeatureCollection) source.getFeatures(); // Get all of the features - no filter

        // Start creating the new shapefile
        final SimpleFeatureType TYPE = createFeatureType(); // Calls a method that builds the feature type - tested and works.
        DefaultFeatureCollection newCollection = new DefaultFeatureCollection(); // To hold my new collection

        try (FeatureIterator<SimpleFeature> features = collection.features()) {
            while (features.hasNext()) {
                SimpleFeature feature = features.next(); // Get next feature
                SimpleFeatureBuilder fb = new SimpleFeatureBuilder(TYPE); // Create a new SimpleFeature based on the original
                Integer level = (Integer) feature.getAttribute(1); // Get the level for this feature
                MultiPolygon multiPoly = (MultiPolygon) feature.getDefaultGeometry(); // Get the geometry collection

                // First count how many new polygons we will have
                int numNewPoly = 0;
                for (int i = 0; i < multiPoly.getNumGeometries(); i++) {
                    double area = getArea(multiPoly.getGeometryN(i));
                    if (area > 20200) {
                        numNewPoly++;
                    }
                }

                // Now build an array of the larger polygons
                Polygon[] polys = new Polygon[numNewPoly]; // Array of new geometries
                int iPoly = 0;
                for (int i = 0; i < multiPoly.getNumGeometries(); i++) {
                    double area = getArea(multiPoly.getGeometryN(i));
                    if (area > 20200) { // Keep this polygon
                        polys[iPoly] = (Polygon) multiPoly.getGeometryN(i);
                        iPoly++;
                    }
                }

                GeometryFactory gf = new GeometryFactory(); // Create a geometry factory
                MultiPolygon mp = new MultiPolygon(polys, gf); // Create the MultiPolygon
                fb.add(mp); // Add the geometry collection to the feature builder
                fb.add(level);
                fb.add("dBA");
                SimpleFeature newFeature = SimpleFeatureBuilder.build(TYPE, new Object[]{mp, level, "dBA"}, null);
                newCollection.add(newFeature); // Add it to the collection
            }
        }
At this point I have a collection that looks right - it has the correct bounds and everything. The next bit of code is where I put it into the new shapefile.
        // Time to put together the new shapefile
        Map<String, Serializable> newMap = new HashMap<String, Serializable>();
        newMap.put("url", newShp.toURI().toURL());
        newMap.put("create spatial index", Boolean.TRUE);

        DataStore newDataStore = DataStoreFinder.getDataStore(newMap);
        newDataStore.createSchema(TYPE);
        String newTypeName = newDataStore.getTypeNames()[0];

        SimpleFeatureStore fs = (SimpleFeatureStore) newDataStore.getFeatureSource(newTypeName);
        Transaction t = new DefaultTransaction("add");
        fs.setTransaction(t);
        fs.addFeatures(newCollection);
        t.commit();

        ReferencedEnvelope env = fs.getBounds();
    }
}
I put in that very last line to check the bounds of the FeatureStore fs, and it comes back null. Unsurprisingly, when I load the newly created shapefile (which DOES get created and is about the right size), nothing shows up.
The solution actually had nothing to do with the code I posted - it had everything to do with my FeatureType definition. I had not added "the_geom" to my polygon feature type, so no geometry was being written to the file.
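For anyone hitting the same problem, a minimal sketch of what a working createFeatureType() could look like (the type name and the non-geometry attributes are just illustrative; the key point is the "the_geom" geometry attribute):

// Assumed imports: org.geotools.feature.simple.SimpleFeatureTypeBuilder,
// org.geotools.referencing.crs.DefaultGeographicCRS, org.opengis.feature.simple.SimpleFeatureType,
// and the JTS MultiPolygon class.
private static SimpleFeatureType createFeatureType() {
    SimpleFeatureTypeBuilder builder = new SimpleFeatureTypeBuilder();
    builder.setName("NoisePolygons");              // illustrative type name
    builder.setCRS(DefaultGeographicCRS.WGS84);    // or the CRS of the source shapefile
    builder.add("the_geom", MultiPolygon.class);   // the geometry attribute that was missing
    builder.add("level", Integer.class);
    builder.length(15).add("units", String.class); // e.g. "dBA"
    return builder.buildFeatureType();
}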
I believe you are missing the step to finalize/close things. Try closing the transaction after the t.commit() line:
t.close();
As an expedient alternative, you might try out the Shapefile dumper utility mentioned in the Shapefile DataStores docs. Using that may simplify your second code block into two or three lines.
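Roughly (quoting the utility from memory, so treat this as a sketch), the write step would then shrink to something like:

// Assumed import: org.geotools.data.shapefile.ShapefileDumper
ShapefileDumper dumper = new ShapefileDumper(new File("output/directory")); // hypothetical target directory
dumper.dump(newCollection); // writes a shapefile named after the feature type, plus its sidecar files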
I'm using Nutch to crawl some websites (as a process that runs separately from everything else), while I want to use a Java (Scala) program to analyse the HTML data of the websites using Jsoup.
I got Nutch to work by following the tutorial (without the script, only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.
The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but find it a bit overwhelming since I've never used Hadoop.
I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):
val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
var key = null
var value = null
reader.next(key, value) // test for a single value
println(key)
println(value)
However, I am getting this exception when I run it:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?
Scala:
val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)

val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  reader.next(key, content)
  (key, content)
}

println(webdata.head)
Java:
public class ContentReader {
public static void main(String[] args) throws IOException {
Configuration conf = NutchConfiguration.create();
Options opts = new Options();
GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
String[] remainingArgs = parser.getRemainingArgs();
FileSystem fs = FileSystem.get(conf);
String segment = remainingArgs[0];
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
Text key = new Text();
Content content = new Content();
// Loop through sequence files
while (reader.next(key, content)) {
try {
System.out.write(content.getContent(), 0,
content.getContent().length);
} catch (Exception e) {
}
}
}
}
Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).
I'm trying to restrict the use of loops (the for and while operators) in a Groovy script.
I tried http://groovy-sandbox.kohsuke.org/, but it does not seem possible to restrict loops with that library.
Code:
final String script = "while(true){}";
final ImportCustomizer imports = new ImportCustomizer();
imports.addStaticStars("java.lang.Math");
imports.addStarImports("groovyx.net.http");
imports.addStaticStars("groovyx.net.http.ContentType", "groovyx.net.http.Method");
final SecureASTCustomizer secure = new SecureASTCustomizer();
secure.setClosuresAllowed(true);
List<Integer> tokensBlacklist = new ArrayList<>();
tokensBlacklist.add(Types.KEYWORD_WHILE);
secure.setTokensBlacklist(tokensBlacklist);
final CompilerConfiguration config = new CompilerConfiguration();
config.addCompilationCustomizers(imports, secure);
Binding intBinding = new Binding();
GroovyShell shell = new GroovyShell(intBinding, config);
final Object eval = shell.evaluate(script);
What's wrong with my code? Or does someone know another way I can restrict these loops or operators?
WHILE and FOR are statements. You should add them to the statements blacklist rather than the tokens blacklist:
List<Class> statementBlacklist = new ArrayList<>();
statementBlacklist.add(org.codehaus.groovy.ast.stmt.WhileStatement.class);
statementBlacklist.add(org.codehaus.groovy.ast.stmt.ForStatement.class); // also blocks for / for-in loops
secure.setStatementsBlacklist(statementBlacklist);
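For completeness, here is a sketch of how that plugs into your setup (the class name is just for the example); with the blacklist in place, evaluating the script should fail at compile time instead of looping forever:

import java.util.ArrayList;
import java.util.List;
import groovy.lang.Binding;
import groovy.lang.GroovyShell;
import org.codehaus.groovy.control.CompilerConfiguration;
import org.codehaus.groovy.control.customizers.SecureASTCustomizer;

public class NoLoopsDemo {
    public static void main(String[] args) {
        SecureASTCustomizer secure = new SecureASTCustomizer();

        List<Class> statementBlacklist = new ArrayList<>();
        statementBlacklist.add(org.codehaus.groovy.ast.stmt.WhileStatement.class);
        statementBlacklist.add(org.codehaus.groovy.ast.stmt.ForStatement.class); // covers for and for-in loops
        secure.setStatementsBlacklist(statementBlacklist);

        CompilerConfiguration config = new CompilerConfiguration();
        config.addCompilationCustomizers(secure);

        GroovyShell shell = new GroovyShell(new Binding(), config);
        // Expected to throw MultipleCompilationErrorsException during compilation
        // instead of hanging at runtime.
        shell.evaluate("while(true){}");
    }
}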