Enhance the degree of parallelization of groupReduce transformation - java

In my Flink program I transform my data using a flatMap operation which divides several blocks of data into multiple smaller blocks. These blocks have a "position" attribute which describes their position in the respective original block. Now I use a groupReduce which needs to transform all small blocks that share the same "position" attribute. So it should be easily distributable across multiple nodes. But when I run my program on multiple nodes, the groupReduce is executed with a degree of parallelism (dop) of 1.
I guess this is because I have only one DataSet, but it seems that a GroupedDataSet is not available in the Flink Java API. Is there another way to increase the dop of my groupReduce transformation?
Here is the code I am using (dummy code ignoring "details"):
DataSet<SlicedTile> slicedTiles = tiles.flatMap()
    .groupBy(position)
    .sortGroup(time)
    .getDataSet();
// Until here the dop is correct
DataSet<SlicedTile> processedSlicedTiles = slicedTiles.reduceGroup(yourFunction);

The problem with your code is the getDataSet() call. It returns the input of the grouping operation. Hence, the dataset represented by slicedTiles is neither grouped nor are its groups sorted; instead, it is the result of the flatMap transformation, and the groupBy and sortGroup calls are not considered in the program at all.
Applying a groupReduce (or reduce) operation on a non-grouped dataset is always a non-parallel operation because all elements of the input data set are processed as a single group.
Logically, the three transformations groupBy().sortGroup().reduceGroup() belong together and are translated into a single groupReduce operator (possibly with an additional combiner if the GroupReduceFunction is combinable).
If you change your implementation as follows, it should work as expected.
DataSet<SlicedTile> slicedTiles = tiles.flatMap()
    .groupBy(position)
    .sortGroup(time)
    .reduceGroup(yourFunction);
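For completeness, here is a rough sketch of what yourFunction could look like, assuming SlicedTile is a POJO from your program; the class name and body are placeholders:

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.util.Collector;

// Hedged sketch: called once per "position" group; the slices arrive sorted by time
// because of sortGroup(time).
public class ProcessPosition implements GroupReduceFunction<SlicedTile, SlicedTile> {
    @Override
    public void reduce(Iterable<SlicedTile> slicesOfOnePosition, Collector<SlicedTile> out) throws Exception {
        for (SlicedTile slice : slicesOfOnePosition) {
            // process the time-sorted slices of this position and emit results
            out.collect(slice);
        }
    }
}

With this, reduceGroup(new ProcessPosition()) runs with the full dop as long as there are enough distinct position values to spread across the parallel instances.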
I will open a JIRA issue to add JavaDocs to the Grouping.getDataSet() method to document the behavior of this function.

Related

How to call plenty of hierarchical RESTful APIs simultaneously and efficiently in Java

Suppose I've got a RESTful API /getDivision/dID=?, where the ? needs to be replaced with the actual division ID.
Assuming dID=1, it returns JSON like below:
{
  "division": 1,
  "subdivisions": "2,3,4,5",
  "status": true
}
The returned JSON contains the subdivision IDs of dID=1, separated by commas.
Note that a subdivision also has its own subdivisions, and the top division ID is 1. So the API calls are hierarchical.
Now, in Java, I want to collect all the division IDs in a thread-safe collection. How can I run the routine concurrently and efficiently?
I've tried multithreading, but I think it has a high performance overhead. Can I do something akin to non-blocking I/O?
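One possible shape for this, sketched under the assumption that fetchSubdivisionIds() wraps whatever HTTP/JSON client you use, is to recurse with CompletableFuture over a thread pool and collect the IDs in a concurrent set:

import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DivisionCrawler {
    private final ExecutorService pool = Executors.newFixedThreadPool(16);
    // thread-safe collection that ends up holding every division ID
    private final Set<Integer> divisionIds = ConcurrentHashMap.newKeySet();

    // Placeholder: call /getDivision/dID=<id>, parse the JSON and split "subdivisions" on commas.
    private List<Integer> fetchSubdivisionIds(int dId) {
        return Collections.emptyList();
    }

    // Crawl the division tree starting at dId without blocking the caller.
    public CompletableFuture<Void> crawl(int dId) {
        if (!divisionIds.add(dId)) {
            return CompletableFuture.completedFuture(null); // already visited
        }
        return CompletableFuture
                .supplyAsync(() -> fetchSubdivisionIds(dId), pool)
                .thenCompose(children -> CompletableFuture.allOf(
                        children.stream().map(this::crawl).toArray(CompletableFuture[]::new)));
    }
}

Calling new DivisionCrawler().crawl(1).join() waits for the whole tree, after which divisionIds holds every ID. For truly non-blocking I/O you could swap the blocking fetch for an asynchronous client such as java.net.http.HttpClient.sendAsync (Java 11+).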

MongoDB (Java): efficient update of multiple documents to different(!) values

I have a MongoDB database and the program I'm writing is meant to change the values of a single field for all documents in a collection. Now if I want them all to change to a single value, like the string value "mask", then I know that updateMany does the trick and it's quite efficient.
However, what I want is an efficient solution for updating to different new values; in fact, I want to pick the new value of the field in question for each document from a list, e.g. an ArrayList. But then something like this
collection.updateMany(new BasicDBObject(),
        new BasicDBObject("$set", new BasicDBObject(fieldName,
                listOfMasks.get(random.nextInt(size)))));
wouldn't work, since updateMany doesn't recompute the value that the field should be set to; it just computes what the argument
listOfMasks.get(random.nextInt(size))
would be once and then it uses that for all the documents. So I don't think there's a solution to this problem that can actually employ updateMany since it's simply not versatile enough.
But I was wondering if anyone has any ideas for at least making it faster than simply iterating through all the documents and calling updateOne each time with a new value from the ArrayList (in a random order, but that's just a detail), like below?
// Loop until the MongoCursor is empty (until the search is complete)
try {
    while (cursor.hasNext()) {
        // Pick a random mask
        String mask = listOfMasks.get(random.nextInt(size));
        // Update this document
        collection.updateOne(cursor.next(), Updates.set("test_field", mask));
    }
} finally {
    cursor.close();
}
MongoDB provides the bulk write API to batch updates. This would be appropriate for your example of setting the value of a field to a random value (determined on the client) for each document.
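For instance, a rough sketch of the bulk write approach with the Java driver (the field name and mask list follow your snippet; the batch size of 1000 and the use of _id as the filter are assumptions):

// Uses com.mongodb.client.model.{WriteModel, UpdateOneModel, Filters, Updates, BulkWriteOptions}
List<WriteModel<Document>> batch = new ArrayList<>();
try (MongoCursor<Document> cursor = collection.find().iterator()) {
    while (cursor.hasNext()) {
        Document doc = cursor.next();
        String mask = listOfMasks.get(random.nextInt(size)); // pick a random mask per document
        batch.add(new UpdateOneModel<>(
                Filters.eq("_id", doc.get("_id")),
                Updates.set("test_field", mask)));
        if (batch.size() == 1000) { // flush in chunks to keep memory bounded
            collection.bulkWrite(batch, new BulkWriteOptions().ordered(false));
            batch.clear();
        }
    }
}
if (!batch.isEmpty()) {
    collection.bulkWrite(batch, new BulkWriteOptions().ordered(false));
}

Each batch is still one round trip to the server, so this avoids the per-document network latency of the updateOne loop.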
Alternatively, if there is a pattern to the changes needed, you could potentially use a find-and-modify operation with the available update operators.

QueryRecord vs. PartitionRecord for better performance?

In a NiFi dataflow, if I want to split a single flowfile into two sets based on the value of a particular field, which is faster: QueryRecord or PartitionRecord, used in the following manners?
QueryRecord:
SELECT * FROM FLOWFILE WHERE WEIGHT < 1000;
PartitionRecord:
1. In UpdateRecord, in RecordPath mode, populate a new "string" field greater_or_less with the value of /weight.
2. In UpdateRecord, in Literal Value mode, update greater_or_less to ${field.value:toNumber():lt(1000)}.
3. In PartitionRecord, partition the flowfile on greater_or_less.
In the PartitionRecord method, I will have two schemas, with one being the original data format, and the other having the greater_or_less field in addition to the original data format. We'll begin step 1 in the original schema, output from step 1 in the new schema, and then output step 3 in the original schema. The output of step 3 should be two flowfiles, one being equivalent to the output of the QueryRecord method.
In summation, although QueryRecord is a bit simpler to implement, I don't have any knowledge of the back-end machinations of NiFi, or how the overheads of these processors compare, so I am not sure which method is optimal. My instincts tell me that QueryRecord is expensive, but I am not sure how it compares to the type-switching and record-reading-and-writing of the PartitionRecord method.
I don't know which is faster off the top of my head, but QueryRecord runs on Apache Calcite under the covers, which is very quick.
Have you considered using GenerateFlowFile to produce test data and trying it out?
I would expect that PartitionRecord would be best, but use a filter with a predicate instead of generating a new field in your schema with UpdateRecord.
Both use a Record Reader and Writer for record-level processing, so there is no difference between them in the record conversion part of the implementation.
The difference is that PartitionRecord accesses the records natively, which is faster, whereas QueryRecord has the extra overhead of running SQL: it has to structure its records and metadata according to Calcite's specifications.
As some quick stats: I was able to process 47 GB of data with a task time of 1:18:00 using QueryRecord versus 0:47:00 using PartitionRecord, with the same number of threads.

Using multiple Leaves in Lucene Classifiers

I am trying to use the KNearestNeighbour classifier in Lucene. The document classifier accepts a LeafReader in its constructor for training the classifier.
The problem is that the index I am using to train the classifier has multiple leaves, but the constructor only accepts one leaf, and I could not find a way to add the remaining LeafReaders to the class. I might be missing something. Could anyone please help me out with this?
Here is the code I am using currently:
FSDirectory index = FSDirectory.open(Paths.get(indexLoc));
IndexReader reader = DirectoryReader.open(index);
List<LeafReaderContext> leaves = reader.leaves();
LeafReaderContext leaf = leaves.get(0);
LeafReader atomicReader = leaf.reader();
KNearestNeighborDocumentClassifier knn = new KNearestNeighborDocumentClassifier(atomicReader, BM25, null, 10, 0, 0, "Topics", field2analyzer, "Text");
Leaves represent the individual segments of your index. In terms of performance and resource usage, you should iterate over the leaves, run the classification for each segment and accumulate your results.
for (LeafReaderContext context : indexReader.getContext().leaves()) {
    LeafReader reader = context.reader();
    // run for each leaf
}
If that is not possible, you can use the SlowCompositeReaderWrapper which, as the name suggests, might be very slow as it aggregates all the leaves on the fly.
LeafReader singleLeaf = SlowCompositeReaderWrapper.wrap(indexReader);
// run classifier on singleLeaf
Depending on your Lucene version, this sits in lucene-core or lucene-misc (since Lucene 6.0, I think). Also, this class is deprecated and scheduled for removal in Lucene 7.0.
The third option might be to run forceMerge(1) so that you only have one segment and can use its single leaf. However, forcing a merge down to a single segment has other issues and might not work for your use case. If your data is write-once and then only used for reading, a forceMerge is fine. If you have regular updates, you'll end up having to use the first option and aggregate the classification results yourself.
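To illustrate the first option, here is a rough sketch of aggregating the per-segment results by majority vote; the constructor arguments mirror your snippet, docToClassify stands for the Document you want to classify, and the exact classifier API differs between Lucene versions:

// Hedged sketch: one classifier per segment, tallying the assigned classes.
// assignClass() throws IOException, so handle or declare it.
Map<BytesRef, Integer> votes = new HashMap<>();
for (LeafReaderContext context : reader.leaves()) {
    KNearestNeighborDocumentClassifier knn = new KNearestNeighborDocumentClassifier(
            context.reader(), BM25, null, 10, 0, 0, "Topics", field2analyzer, "Text");
    ClassificationResult<BytesRef> result = knn.assignClass(docToClassify);
    if (result != null) {
        votes.merge(result.getAssignedClass(), 1, Integer::sum);
    }
}
BytesRef bestClass = votes.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse(null);

A simple vote count ignores the per-class scores; if you need finer control you could sum the scores from each ClassificationResult instead.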

Hadoop and MapReduce: How do I send the equivalent of an array of lines pulled from a CSV to the map function, where each array contains lines x - y?

Okay, so I have been reading a lot about Hadoop and MapReduce, and maybe it's because I'm not as familiar with iterators as most, but I have a question I can't seem to find a direct answer to. Basically, as I understand it, the map function is executed in parallel by many machines and/or cores. Thus, whatever you are working on must not depend on prior code having been executed for the program to make any kind of speed gains. This works perfectly for me, but what I'm doing requires me to test information in small batches. Basically I need to send batches of lines of a .csv as arrays of 32, 64, 128 or however many lines each. Like lines 0-127 go to core1's execution of the map function, lines 128-255 go to core2's, etc. Also I need to have the contents of each batch available as a whole inside the function, as if I had passed it an array. I read a little about how the new Java API allows for something called push and pull, and that this allows things to be sent in batches, but I couldn't find any example code. I don't know; I'm going to continue researching, and I'll post anything I find, but if anyone knows, could they please post in this thread? I would really appreciate any help I might receive.
Edit:
If you could simply ensure that the chunks of the .csv are sent in sequence, you could perform it this way. I guess this also assumes that there are globals in MapReduce.
//** concept not code **//
GLOBAL_COUNTER = 0;
GLOBAL_ARRAY = NEW ARRAY();

map()
{
    GLOBAL_ARRAY[GLOBAL_COUNTER] = ITERATOR_VALUE;
    GLOBAL_COUNTER++;
    if (GLOBAL_COUNTER == 127)
    {
        //EXECUTE TEST WITH AN ARRAY OF 128 VALUES FOR COMPARISON
        GLOBAL_COUNTER = 0;
    }
}
If you're trying to get a chunk of lines from your CSV file into the mapper, you might consider writing your own InputFormat/RecordReader and potentially your own WritableComparable object. With the custom InputFormat/RecordReader you'll be able to specify how objects are created and passed to the mapper based on the input you receive.
If the mapper is doing what you want, but you need these chunks of lines sent to the reducer, make the output key for the mapper the same for each line you want in the same reduce function.
The default TextInputFormat will give input to your mapper like this (the keys/offsets in this example are just random numbers):
0 Hello World
123 My name is Sam
456 Foo bar bar foo
Each of those lines will be read into your mapper as a key,value pair. Just modify the key to be the same for each line you need and write it to the output:
0 Hello World
0 My name is Sam
1 Foo bar bar foo
The first time the reduce function is called, it will receive a key,value pair with the key being "0" and the value being an Iterable object containing "Hello World" and "My name is Sam". You'll be able to access both of these values in the same reduce method call by using the Iterable object.
Here is some pseudo code:
int count = 0

map(key, value) {
    int newKey = count / 2
    context.write(newKey, value)
    count++
}

reduce(key, values) {
    for value in values
        // Do something to each line
}
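In runnable form, that idea might look roughly like the following hedged sketch (new org.apache.hadoop.mapreduce API, a chunk size of 128 lines, illustrative class names):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ChunkJob {

    // Give every block of 128 consecutive lines seen by one mapper the same key.
    public static class ChunkMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private long count = 0;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            long chunkKey = count / 128; // lines 0-127 -> key 0, lines 128-255 -> key 1, ...
            context.write(new LongWritable(chunkKey), line);
            count++;
        }
    }

    // Each reduce() call then sees one whole chunk as an Iterable.
    public static class ChunkReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable chunkKey, Iterable<Text> lines, Context context)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                // all lines of this chunk are available here; run the batch test on them
            }
        }
    }
}

Note that count is local to each mapper, so with several input splits two mappers can emit the same chunk key; if that matters, mix something split-specific (for example the first line's byte offset) into the key.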
Hope that helps. :)
If the end goal of what you want is to force certain sets to go to certain machines for processing, you want to look into writing your own Partitioner. Otherwise, Hadoop will split data automatically for you depending on the number of reducers.
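A minimal Partitioner sketch, assuming LongWritable chunk keys and Text values as in the sketch above (a Partitioner chooses the reducer, and hence indirectly the machine, that each key is sent to):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hedged sketch: route each chunk key to a fixed reducer.
public class ChunkPartitioner extends Partitioner<LongWritable, Text> {
    @Override
    public int getPartition(LongWritable chunkKey, Text value, int numReduceTasks) {
        return (int) (chunkKey.get() % numReduceTasks);
    }
}

Register it with job.setPartitionerClass(ChunkPartitioner.class). The default hash partitioner behaves very similarly; a custom one only pays off when you need specific keys to land on specific reducers.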
I suggest reading the tutorial on the Hadoop site to get a better understanding of M/R.
If you simply want to send N lines of input to a single mapper, you can use the NLineInputFormat class. You could then do the line parsing (splitting on commas, etc.) in the mapper.
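A minimal configuration sketch, assuming the new mapreduce API (the 128 comes from the question; the input path is a placeholder):

// Uses org.apache.hadoop.mapreduce.lib.input.NLineInputFormat.
// Each input split (and hence each mapper) covers 128 lines, but map() is still
// called once per line within that split.
Job job = Job.getInstance(new Configuration(), "csv-batches");
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 128);
NLineInputFormat.addInputPath(job, new Path("/data/input.csv")); // placeholder path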
If you want to have access to the lines before and after the line the mapper is currently processing, you may have to write your own input format. Subclassing FileInputFormat is usually a good place to start. You could create an InputFormat that reads N lines, concatenates them, and sends them as one block to a mapper, which then splits the input into N lines again and begins processing.
As far as globals in Hadoop go, you can specify some custom parameters when you create the job configuration, but as far as I know, you cannot change them in a worker and expect the change to propagate throughout the cluster. To set a job parameter that will be visible to workers, do the following where you are creating the job:
job.getConfiguration().set(Constants.SOME_PARAM, "my value");
Then, to read the parameter's value in the mapper or reducer:
public void map(Text key, Text value, Context context) {
    Configuration conf = context.getConfiguration();
    String someParam = conf.get(Constants.SOME_PARAM);
    // use someParam in processing input
}
Hadoop has support for basic types such as int, long, String, and boolean to be used as parameters.
