In a NiFi dataflow, if I want to split a single flowfile into two sets based on the value of a particular field, which of the following approaches is faster in terms of performance: QueryRecord or PartitionRecord?
QueryRecord:
SELECT * FROM FLOWFILE WHERE WEIGHT < 1000;
PartitionRecord:
1. In UpdateRecord, in RecordPath mode, populate a new "string" field greater_or_less with the value of /weight
2. In UpdateRecord, in Literal Value mode, update greater_or_less to ${field.value:toNumber():lt(1000)}
3. In PartitionRecord, partition the flowfile on greater_or_less
In the PartitionRecord method, I will have two schemas: one is the original data format, and the other adds the greater_or_less field to the original data format. Step 1 reads records in the original schema and writes them out in the new schema, and step 3 writes its output back in the original schema. The output of step 3 should be two flowfiles, one of which is equivalent to the output of the QueryRecord method.
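For concreteness, this is roughly how I plan to configure those processors (property names taken from the stock UpdateRecord/PartitionRecord documentation; the greater_or_less field name is my own):
UpdateRecord (step 1): Replacement Value Strategy = Record Path Value, dynamic property /greater_or_less = /weight
UpdateRecord (step 2): Replacement Value Strategy = Literal Value, dynamic property /greater_or_less = ${field.value:toNumber():lt(1000)}
PartitionRecord (step 3): dynamic property greater_or_less = /greater_or_less, so each distinct value of the field becomes its own flowfile with a greater_or_less attribute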
In summary, although QueryRecord is a bit simpler to implement, I don't have any knowledge of the back-end machinations of NiFi or of how the overheads of these processors compare, so I am not sure which method is optimal. My instinct tells me that QueryRecord is expensive, but I am not sure how it compares to the type switching and record reading/writing of the PartitionRecord method.
I don't know which is faster off the top of my head, but both run on Apache Calcite under the covers which is very quick.
Have you considered using GenerateFlowFile to produce test data and trying it out?
I would expect that PartitionRecord would be best, but use a filter with a predicate instead of generating a new field in your schema with UpdateRecord.
Both use a Record Reader and a Record Writer for record-level processing, so there is no difference between them in the record-conversion (AbstractRecordProcessor) part of the implementation.
The difference is that PartitionRecord accesses records natively, which is faster for record-level processing, whereas QueryRecord has the extra overhead of running SQL: it has to structure its records and metadata according to Calcite's specifications.
As some quick five-minute stats: I was able to process 47 GB of data with a task time of 1:18:00 using QueryRecord versus 0:47:00 using PartitionRecord, with the same number of threads.
In my Flink program I transform my data using a flatMap operation which divides several blocks of data into multiple smaller blocks. These blocks have a "position" attribute which describes their position in the respective original block. Then I use a groupReduce which needs to transform all small blocks that share the same "position" attribute, so it should be easily distributable across multiple nodes. But when I run my program on multiple nodes, the groupReduce is executed with a dop (degree of parallelism) of 1.
I guess this is because I have only one DataSet, but it seems that a GroupedDataSet is not available in Flink Java API. Is there another possibility to enhance the dop of my groupReduce transformation?
Here is the code I am using (dummy code ignoring "details"):
DataSet<SlicedTile> slicedTiles = tiles.flatMap()
    .groupBy(position)
    .sortGroup(time)
    .getDataSet();
// Until here the dop is correct
DataSet<SlicedTile> processedSlicedTiles = slicedTiles.reduceGroup(...);
The problem with your code is the getDataSet() call. It returns the input of the grouping operation. Hence, the dataset represented by slicedTiles is neither grouped nor are its groups sorted but instead it is the result of the flatMap transformation and the groupBy and sortGroup calls are not considered in the program at all.
Applying a groupReduce (or reduce) operation on a non-grouped dataset is always a non-parallel operation because all elements of the input data set are processed as a single group.
Logically, the three transformations groupBy().sortGroup().reduceGroup() belong together and are translated into a single groupReduce operator (possibly with an additional combiner if the GroupReduceFunction is combinable).
If you change your implementation as follows, it should work as expected.
DataSet<SlicedTile> slicedTiles = tiles.flatMap()
    .groupBy(position)
    .sortGroup(time)
    .reduceGroup(yourFunction);
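For illustration, here is a minimal, self-contained version of the same groupBy().sortGroup().reduceGroup() pattern. The Tuple2 type and the field positions are made up for the example and are not taken from the original program:
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class GroupedReduceSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Toy input: (position, time) pairs standing in for the SlicedTile records.
        DataSet<Tuple2<Integer, Long>> slices = env.fromElements(
                new Tuple2<>(0, 3L), new Tuple2<>(0, 1L), new Tuple2<>(1, 2L));

        // groupBy + sortGroup + reduceGroup chained directly, so the grouping is
        // actually applied and the reduce runs in parallel, one group at a time.
        DataSet<String> result = slices
                .groupBy(0)                    // group by "position"
                .sortGroup(1, Order.ASCENDING) // sort each group by "time"
                .reduceGroup(new GroupReduceFunction<Tuple2<Integer, Long>, String>() {
                    @Override
                    public void reduce(Iterable<Tuple2<Integer, Long>> values, Collector<String> out) {
                        StringBuilder sb = new StringBuilder();
                        for (Tuple2<Integer, Long> v : values) {
                            sb.append(v.f1).append(' ');
                        }
                        out.collect(sb.toString().trim());
                    }
                });

        // On older Flink releases you may additionally need env.execute() after a print sink.
        result.print();
    }
}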
I will open a JIRA issue to add JavaDocs to the Grouping.getDataSet() method to document the behavior of this function.
I need to parse files that may be quite large, possibly hundreds of megabytes and millions of lines. I have been trying to do this using FlatPack. I would think the way to do this would be to use the buffered parsers and the new stream methods. But even though dataset.next() returns true for the correct number of records, the Optional returned by dataset.getRecord() never contains a value.
I have looked at this example/test but it only counts the number of records and does not actually do anything with the content.
You can use the class BuffReaderParseFactory instead of DefaultParserFactory.
It will read one record from the input file only when you call "next()".
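For example, a rough sketch of how I would expect the buffered factory to be wired up (the file name, delimiter, qualifier, and column name are placeholders, and the method names are from memory of FlatPack's factory API, so double-check them against the version you are using):
import java.io.FileReader;
import java.io.Reader;
import net.sf.flatpack.DataSet;
import net.sf.flatpack.Parser;
import net.sf.flatpack.brparse.BuffReaderParseFactory;

public class BufferedFlatPackSketch {
    public static void main(String[] args) throws Exception {
        try (Reader data = new FileReader("big-file.csv")) {
            // BuffReaderParseFactory streams from the Reader instead of
            // materializing every row up front the way DefaultParserFactory does.
            Parser parser = BuffReaderParseFactory.getInstance()
                    .newDelimitedParser(data, ',', '"');
            DataSet ds = parser.parse();

            while (ds.next()) {
                // "SOME_COLUMN" is a stand-in for one of your header names.
                String value = ds.getString("SOME_COLUMN");
                // ... process one record at a time ...
            }
        }
    }
}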
The explanations for both DefaultParserFactory and BuffReaderParseFactory are not exactly helpful. Both factories are said to return a PZParser (from newDelimitedParser), but only one of them returns an actual value from a record. Based on the examples I've seen, I think BuffReaderParseFactory is just for checking performance (hence it should be faster), while DefaultParserFactory, on the other hand, actually exposes all the records.
I have a text file with a sequence of integers per line:
47202 1457 51821 59788
49330 98706 36031 16399 1465
...
The file has 3 million lines in this format. I have to load this file into memory, extract 5-grams out of it, and do some statistics on it. I do have a memory limitation (8 GB of RAM). I have tried to minimize the number of objects I create (only one class with 6 float variables and some methods). Each line of that file basically generates a number of objects of this class, proportional to the size of the line in terms of the number of words. I am starting to feel that Java is not a good way to do these things when C++ is around.
Edit:
Assume that each line produces (n-1) objects of that class, where n is the number of space-separated tokens in that line (e.g. 1457). With an average of 10 words per line, each line gets mapped to 9 objects on average, so there will be about 9 * 3 * 10^6 objects. The memory needed is therefore 9 * 3 * 10^6 * (8-byte object header + 6 * 4-byte floats), plus a Map<String, Object> and another Map<Integer, ArrayList<Object>>. I need to keep everything in memory, because some mathematical optimization will happen afterwards.
Reading/Parsing the file:
The best way to handle large files, in any language, is to try and NOT load them into memory.
In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
You might also try reading the file line-by-line and discarding each line after you read it - again to avoid holding the entire file in memory at once.
Handling the resulting objects
For dealing with the objects you produce while parsing, there are several options:
Same as with the file itself: if you can perform whatever you want to perform without keeping all of the objects in memory (while "streaming" the file), that is the best solution. You didn't describe the problem you're trying to solve, so I don't know whether that's possible.
Compression of some sort: switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it (see the sketch after this list), or find some pattern in your data that allows you to store it more compactly.
Caching/offload: if your data still doesn't fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk or bringing in a library like Ehcache or the like.
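A minimal flyweight-style sketch of the second option above, with made-up names, just to show the shape: all records live in one big float[] and no per-record object is ever allocated.
public class SliceStore {
    private static final int FIELDS = 6; // six float attributes per record
    private final float[] data;
    private int size;

    public SliceStore(int capacity) {
        this.data = new float[capacity * FIELDS];
    }

    // Appends one record and returns its index.
    public int add(float a, float b, float c, float d, float e, float f) {
        int base = size * FIELDS;
        data[base] = a; data[base + 1] = b; data[base + 2] = c;
        data[base + 3] = d; data[base + 4] = e; data[base + 5] = f;
        return size++;
    }

    // Reads a single field of a record straight out of the backing array.
    public float get(int record, int field) {
        return data[record * FIELDS + field];
    }

    public int size() {
        return size;
    }
}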
A note on Java collections, and maps in particular
For small objects, Java collections (and maps in particular) incur a large memory penalty, due mostly to everything being wrapped as an Object and to the Map.Entry inner-class instances. At the cost of a slightly less elegant API, you should probably look at the GNU Trove collections if memory consumption is an issue.
Optimal would be to hold only integers and line ends.
To that end, one way would be to convert the file into two files:
one binary file of integers (4 bytes)
one binary file with indexes where the next line would start.
For this one can use a Scanner to read, and a DataOutputStream+BufferedOutputStream to write.
Then you can load those two files in arrays of primitive type:
int[] integers = new int[(int)integersFile.length() / 4];
int[] lineEnds = new int[(int)lineEndsFile.length() / 4];
Reading can be done with MappedByteBuffer.asIntBuffer(). (You then would not even need the arrays, but it would become a bit COBOL-like in its verbosity.)
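For illustration, a rough sketch of that two-pass approach, assuming hypothetical file names and whitespace-separated integers as in the question:
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Scanner;

public class ToBinarySketch {
    public static void main(String[] args) throws Exception {
        File textFile = new File("numbers.txt");      // hypothetical input path
        File integersFile = new File("integers.bin"); // one 4-byte int per value
        File lineEndsFile = new File("lineEnds.bin"); // running count of ints at each line end

        // Pass 1: convert the text file into the two binary files.
        try (Scanner in = new Scanner(textFile);
             DataOutputStream ints = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(integersFile)));
             DataOutputStream ends = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(lineEndsFile)))) {
            int count = 0;
            while (in.hasNextLine()) {
                String line = in.nextLine().trim();
                if (line.isEmpty()) {
                    continue;
                }
                for (String token : line.split("\\s+")) {
                    ints.writeInt(Integer.parseInt(token));
                    count++;
                }
                ends.writeInt(count); // exclusive end index of this line
            }
        }

        // Pass 2: load both files into primitive int arrays.
        int[] integers = new int[(int) (integersFile.length() / 4)];
        int[] lineEnds = new int[(int) (lineEndsFile.length() / 4)];
        try (DataInputStream ints = new DataInputStream(
                new BufferedInputStream(new FileInputStream(integersFile)))) {
            for (int i = 0; i < integers.length; i++) integers[i] = ints.readInt();
        }
        try (DataInputStream ends = new DataInputStream(
                new BufferedInputStream(new FileInputStream(lineEndsFile)))) {
            for (int i = 0; i < lineEnds.length; i++) lineEnds[i] = ends.readInt();
        }

        // Line k spans integers[lineEnds[k - 1] .. lineEnds[k] - 1] (starting at 0 for k == 0).
        System.out.println(integers.length + " integers across " + lineEnds.length + " lines");
    }
}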
To allow users to search across multiple fields with Lucene 3.5, I currently create a QueryParser for each field to be searched and add the resulting queries to a DisjunctionMaxQuery. This works great when using OR as the default operator, but I now want to change the default operator to AND to get more accurate (and fewer) results.
Problem is, queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all terms must then be in at least one field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all the terms in the query, although combined they do. I would want this document returned for the above query, but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per field terms in their queries (author:bill) which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice which is time and space consuming.
Any ideas?
Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for (int i = 0; i < parsers.length; i++)
{
    parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
    parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to @seeta and @femtoRgon for the help!
Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.
If you want to be able to search multiple fields with the same set of terms, then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
may not be the best implementation.
You're effectively getting either the score from the title or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approximately equal weight to a hit on java in the body with no hit on programming at all.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit the score growth from multiple hits on the same term (in different fields), but you do want the score to grow for hits on different terms, I believe.
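For what it's worth, a rough sketch of building that per-term dismax structure programmatically against the Lucene 3.5 classes (the field names, terms, and 0.2 tie-breaker are just the values from this discussion, and the terms are assumed to be already analyzed/lowercased):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PerTermDismax {
    // One dismax per term across all fields, every term required (AND semantics).
    static Query build(String[] fields, String[] terms, float tieBreaker) {
        BooleanQuery root = new BooleanQuery();
        for (String term : terms) {
            DisjunctionMaxQuery perTerm = new DisjunctionMaxQuery(tieBreaker);
            for (String field : fields) {
                perTerm.add(new TermQuery(new Term(field, term)));
            }
            root.add(perTerm, BooleanClause.Occur.MUST);
        }
        return root;
    }

    public static void main(String[] args) {
        // Roughly the structure above: +(title:java body:java)~0.2 +(title:programming body:programming)~0.2
        Query q = build(new String[]{"title", "body"},
                        new String[]{"java", "programming"}, 0.2f);
        System.out.println(q);
    }
}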
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
I also still wouldn't count out indexing an "all" field. It's an implementation I've used before, indexing BOTH the specific fields and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.
How can I verify two CRC implementations will generate the same checksums?
I'm looking for an exhaustive implementation evaluating methodology specific to CRC.
You can separate the problem into edge cases and random samples.
Edge cases. There are two variables in the CRC input: the number of bytes and the value of each byte. So create arrays of length 0, 1, and MAX_BYTES, with byte values ranging from 0 to MAX_BYTE_VALUE. The edge-case suite is something you'll most likely want to keep within a JUnit suite.
Random samples. Using the ranges above, run CRC on randomly generated arrays of bytes in a loop. The longer you let the loop run, the more you exhaust the inputs. If you are low on computing power, consider deploying the test to EC2.
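A minimal sketch of that random-sample loop, assuming two hypothetical implementations behind a common byte[] -> long interface (the lambdas below are stand-ins for the real code under test):
import java.util.Random;

public class RandomSampleCheck {
    // Hypothetical common shape for the two implementations being compared.
    interface Crc {
        long compute(byte[] data);
    }

    public static void main(String[] args) {
        Crc implementation1 = data -> 0L; // stand-in: plug in the first CRC here
        Crc implementation2 = data -> 0L; // stand-in: plug in the second CRC here

        Random rnd = new Random(42); // fixed seed so a failure is reproducible
        for (int i = 0; i < 1_000_000; i++) {
            byte[] data = new byte[rnd.nextInt(4096)]; // length 0 covers the empty-input edge case
            rnd.nextBytes(data);
            if (implementation1.compute(data) != implementation2.compute(data)) {
                throw new AssertionError("CRC mismatch at iteration " + i);
            }
        }
        System.out.println("all samples matched");
    }
}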
Create several unit tests with the same input that will compare the output of both implementations against each other.
One nice property of CRCs is that for a given set of parameters (polynomial, reflection, initial state, etc.) you will get a constant value when you recompute the CRC over the original dataset + the original CRC. These constants are documented for common CRCs but you can just blindly generate them using two different random data sets and check that they are the same:
implementation 1: crc(rand_data_1 + crc(rand_data_1)) -> constant_1
implementation 2: crc(rand_data_2 + crc(rand_data_2)) -> constant_2
assert constant_1 == constant_2
You can use the same method within an implementation to get a warm fuzzy feeling about its correctness. If your implementation works with arbitrary polynomials, you can have the unit test exhaustively check every possible polynomial using this method without needing to know what the constants are.
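As a concrete illustration of the constant-residue check within a single implementation, here is a sketch using java.util.zip.CRC32 (CRC-32 is reflected, so the appended CRC bytes go least-significant byte first); to compare two implementations you would compute one constant with each and assert they are equal:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Random;
import java.util.zip.CRC32;

public class CrcResidueCheck {
    // Computes crc(data + crc(data)) for the standard CRC-32.
    static long residue(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        byte[] crcBytes = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt((int) crc.getValue())
                .array();

        CRC32 full = new CRC32();
        full.update(data);
        full.update(crcBytes);
        return full.getValue();
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        byte[] randData1 = new byte[64];
        byte[] randData2 = new byte[4096];
        rnd.nextBytes(randData1);
        rnd.nextBytes(randData2);

        long constant1 = residue(randData1);
        long constant2 = residue(randData2);
        System.out.printf("constant_1 = 0x%08X, constant_2 = 0x%08X%n", constant1, constant2);
        if (constant1 != constant2) {
            throw new AssertionError("residues differ; parameters or implementation disagree");
        }
    }
}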
This technique is powerful but it would also be wise to add an independent test that verifies the result based on known input for the pathological case where your CRC implementations both produce bad results that happen to get by the constant equivalence check.
First, if it is a standard CRC implementation, you should be able to find known values somewhere on the net.
Second, you could generate some number of payloads, run each CRC on the payloads, and check that the CRC values match.
By writing a unit test for each which takes the same input and verify against the expected output.