How to bucket outputs in Scalding

I'm trying to write a pipe to different directories, so that the output is bucketed into one directory per bucket based on some ids.
In plain MapReduce code I would use the MultipleOutputs class and do something like this in the reducer:
protected void reduce(final SomeKey key,
final Iterable<SomeValue> values,
final Context context) {
...
for (SomeValue value: values) {
String bucketId = computeBucketIdFrom(...);
multipleOutputs.write(key, value, folderName + "/" + bucketId);
...
So I guess one could do it like this in Scalding:
...
val somePipe = Csv(in, separator = "\t",
fields = someSchema,
skipHeader = true)
.read
for (i <- 0 until numberOfBuckets) {
somePipe
.filter('someId) {id: String => (id.hashCode % numberOfBuckets) == i}
.write(Csv(out + "/bucket" + i ,
writeHeader = true,
separator = "\t"))
}
But I suspect this would end up reading the same pipe many times, which would hurt overall performance.
Are there any alternatives?
Thanks

Yes, there is a better way: use TemplatedTsv.
Your code above can be written as follows:
val somePipe = Tsv(in, fields = someSchema, skipHeader = true)
.read
.write(TemplatedTsv(out, "%s", 'some_id, writeHeader = true))
This will put the records for each distinct 'some_id value into their own folder, out/<some_id>/.
However, you can also create integer buckets. Just change the last lines,
.map('some_id -> 'bucket) { id: String => java.lang.Math.floorMod(id.hashCode, numberOfBuckets) } // floorMod keeps the bucket non-negative
.write(TemplatedTsv(out, "%02d", 'bucket, writeHeader = true, fields = ('all except 'bucket)))
This will create two-digit folders of the form out/dd/. You can also check the TemplatedTsv API here.
There is one potential problem with TemplatedTsv: the reducers can generate lots of small files, which can be bad for the next job that uses your results. It is therefore better to sort on the template fields before writing to disk. I wrote a blog post about it here.
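A hash code can be negative, so taking a plain remainder of it can produce a negative bucket number, which does not fit a two-digit folder template. The sketch below shows one way to compute a non-negative bucket in plain Java. The class and method names are hypothetical and are not part of Scalding.

```java
final class BucketIds {
    // Math.floorMod never returns a negative value for a positive divisor,
    // unlike the % operator, which keeps the sign of the dividend.
    static int bucketFor(String id, int numberOfBuckets) {
        return Math.floorMod(id.hashCode(), numberOfBuckets);
    }

    // Two-digit folder name, mirroring a "%02d" path template.
    static String folderFor(String id, int numberOfBuckets) {
        return String.format("%02d", bucketFor(id, numberOfBuckets));
    }

    public static void main(String[] args) {
        // Sample ids are made up for illustration.
        for (String id : new String[] {"a", "user-42", "some-other-id"}) {
            System.out.println(id + " -> " + folderFor(id, 16));
        }
    }
}
```

With this, every id lands in a folder between 00 and numberOfBuckets - 1, regardless of the sign of its hash code.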

Related

velocity template drop element from array

I'm trying to get the last element of an array in a velocity template dropped before joining it together into a string and showing the result in the "className": key below:
#set($elem = '"System.NotImplementedException: Test Exception')
#set($trace = $elem.replace('"',""))
#set($tracearray = $trace.split("\."))
#set($arraysize = $tracearray.size())
#set($lastelem = $tracearray.size() - 1)
{
"className":$tracearray.remove($lastelem).toString(),
"method":"$tracearray[$lastelem]"
}#if($foreach.hasNext),#end
#end
]
I've tried several different ways to get the array to drop the element and join it together into a string but haven't had any luck so far.
From the above example I'm looking for the following output to be achieved.
{
"className":"System",
"method":"NotImplementedException: Test Exception"
}
The $elem variable will be holding strings of various lengths and with a different number of .'s in them to split on so the lengths of the arrays will vary.

If you only need to remove the last element, why bother splitting the whole string? You could just do some parsing to extract the class name:
#set($elem = '"System.NotImplementedException: Test Exception')
#set($trace = $elem.replace('"',""))
#set($dot = $trace.lastIndexOf('.'))
#set($className = $trace.substring(0, $dot))
#set($method = $trace.substring($dot + 1))
{
"className": "$className",
"method": "$method"
}
Or, to accommodate the fact that the message at the end could contain a dot:
#set($elem = '"System.NotImplementedException: Test Exception')
#set($trace = $elem.replace('"',""))
#set($colon = $trace.indexOf(':'))
#set($dot = $trace.lastIndexOf('.', $colon))
#set($className = $trace.substring(0, $dot))
#set($method = $trace.substring($dot + 1))
{
"className": "$className",
"method": "$method"
}
With the approach you have chosen, you would need another tool to join the array elements back together with '.'. That said, if you are able to populate your Velocity context with a custom tool, all of this would be more easily done from within that tool.
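As a rough idea of what such a custom tool could look like, here is a minimal Java sketch of the same parsing logic (the last '.' before the first ':'). The class name TraceTool and the method name split are hypothetical, and how the tool gets registered in the Velocity context is left out.

```java
final class TraceTool {
    // Returns {className, method}, splitting at the last '.' before the first ':'.
    static String[] split(String raw) {
        String trace = raw.replace("\"", "");
        int colon = trace.indexOf(':');
        // Without a colon, fall back to the last '.' in the whole string.
        int dot = colon >= 0 ? trace.lastIndexOf('.', colon) : trace.lastIndexOf('.');
        if (dot < 0) {
            // No '.' to split on: treat everything as the method part.
            return new String[] {"", trace};
        }
        return new String[] {trace.substring(0, dot), trace.substring(dot + 1)};
    }

    public static void main(String[] args) {
        String[] parts = split("\"System.NotImplementedException: Test Exception");
        System.out.println("className=" + parts[0]);
        System.out.println("method=" + parts[1]);
    }
}
```

For the sample string from the question this yields "System" as the class name and "NotImplementedException: Test Exception" as the method.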

Recursive calculation in Java

I'm trying to solve a calculation problem in Java.
Suppose my data looks as follows:
466,2.0762
468,2.0799
470,2.083
472,2.0863
474,2.09
476,2.0939
478,2.098
It's a list of ordered pairs, in the form of [int],[double]. Each line in my file contains one pair. The file can contain seven to seven thousand of those lines, all of them formatted as plain text.
Each [int] must be subtracted from the [int] one line above and the result written onto another file. The same calculation must be done for every [double]. For example, in the data reported above, the calculation should be:
478-476 -> result to file
476-474 -> result to file
(...)
2.098-2.0939 -> result to file
2.0939-2.09 -> result to file
and so on.
I beg your pardon if this question looks trivial to the vast majority of you, but after weeks of trying to solve it, I got nowhere. I also had trouble finding anything even remotely similar on this board!
Any help will be appreciated.
Thanks!

1. Read the file
2. Build the result
3. Write to a file
For the first task there are already several good answers here, for example: Reading a plain text file in Java.
As you can see, a file can be read line by line. From that you can build a List<String> containing the lines of your file.
For the second task, iterate through all the lines and build the result, again as a List<String>:
List<String> inputLines = ...
List<String> outputLines = new LinkedList<String>();
int lastInt = 0;
double lastDouble = 0; // must be a double, otherwise the assignment below does not compile
boolean firstValue = true;
for (String line : inputLines) {
// Split by ",", then values[0] is the integer and values[1] the double
String[] values = line.split(",");
int currentInt = Integer.parseInt(values[0]);
double currentDouble = Double.parseDouble(values[1]);
if (firstValue) {
// Nothing to compare to on the first run
firstValue = false;
} else {
// Compare to last values and build the result
int diffInt = lastInt - currentInt;
double diffDouble = lastDouble - currentDouble;
String outputLine = diffInt + "," + diffDouble;
outputLines.add(outputLine);
}
// Current values become last values
lastInt = currentInt;
lastDouble = currentDouble;
}
For the third task there are again good solutions on SO: iterate through outputLines and write each line to a file. See: How to create a file and write to a file in Java?
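Putting the three steps together, a self-contained sketch might look like the following. It makes two assumptions that are not in the answer above: the difference is taken as current minus previous (matching the 478-476 example in the question), and BigDecimal is swapped in for double so that results such as 0.0037 are written without floating-point noise. The class name and the command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.math.BigDecimal;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineDiffs {
    // For each line after the first, emits "intDiff,decimalDiff" (current minus previous).
    static List<String> diffs(List<String> inputLines) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < inputLines.size(); i++) {
            String[] prev = inputLines.get(i - 1).split(",");
            String[] curr = inputLines.get(i).split(",");
            int diffInt = Integer.parseInt(curr[0].trim()) - Integer.parseInt(prev[0].trim());
            BigDecimal diffDec = new BigDecimal(curr[1].trim()).subtract(new BigDecimal(prev[1].trim()));
            out.add(diffInt + "," + diffDec.toPlainString());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            // No files given: run on a few lines from the question instead.
            diffs(Arrays.asList("466,2.0762", "468,2.0799", "470,2.083")).forEach(System.out::println);
            return;
        }
        // args[0] is the input file, args[1] the output file.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), diffs(lines));
    }
}
```

To get previous minus current instead, swap the operands of the two subtractions.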

Ektorp CouchDb: Query for pattern with multiple contains

I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (split on whitespace).
I found some code which allows me to do a search by pattern:
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
String trim = serialNumber.trim();
if (StringUtils.isEmpty(trim)) {
return new ArrayList<>();
}
ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
return db.queryView(viewQuery, DeviceEntityCouch.class);
}
This works quite nicely when looking for just one pattern. But how do I have to modify my code to get a multiple contains on doc.serialNumber?
EDIT:
This is the current workaround, but I guess there must be a better way.
Also, it only gives OR logic: an entry ends up in the list if it matches term1 or term2.
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
String trim = serialNumber.trim();
if (StringUtils.isEmpty(trim)) {
return new ArrayList<>();
}
String[] split = trim.split(" ");
List<DeviceEntityCouch> list = new ArrayList<>();
for (String s : split) {
ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
}
return list;
}

It looks like you are implementing a full-text search here. That's not going to be very efficient in CouchDB (I guess the same applies to other databases).
Correct me if I am wrong, but from your code it looks like you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be something like the following (with some assumptions):
1. Your serial numbers are not very long (say 20 chars?).
2. You force the search term to always be 5 characters long.
3. Generate a view that emits every 5-character substring of your serial number, more or less like this (could be optimized):
...
for (var i = 0; i <= doc.serialNo.length - 5; i++) {
emit([doc.serialNo.substring(i, i + 5), doc._id]);
}
...
4. Use the _count reduce function.
Now the following url:
http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true
Will return a list of documents with a hit count for a key of 01234.
If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for the information about complex keys lookups.
I am not sure how efficient CouchDB is at updating that view. It depends on how many records you have and how many new entries appear between queries of the view (as I understand it, CouchDB rebuilds the view's B-tree on demand).
I generated a view like that, splitting doc ids into 5-character keys. For just over 1K docs it generated over 30K results, the ids being 32 characters long. Simple maths really: (serialNo.length - searchablekey.length + 1) * docscount.
Generating the view took a while, but the lookups were fast.
You could generate keys of multiple lengths, etc. All comes down to your records count vs speed of lookups.
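To sanity-check the key count per document, the substring generation can be mirrored outside CouchDB. The following Java sketch is only an illustration of the arithmetic. The class and method names are hypothetical, and it is not something Ektorp or CouchDB provides.

```java
import java.util.ArrayList;
import java.util.List;

final class SubstringKeys {
    // All substrings of exactly keyLength characters, in order of position.
    // A string of length n yields n - keyLength + 1 keys (none if it is too short).
    static List<String> keys(String serial, int keyLength) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i + keyLength <= serial.length(); i++) {
            keys.add(serial.substring(i, i + keyLength));
        }
        return keys;
    }

    public static void main(String[] args) {
        // A made-up 10-character serial number.
        System.out.println(keys("0123456789", 5));
    }
}
```

For a 32-character id and 5-character keys this gives 28 keys per document, which is where the index size in the tens of thousands of rows for about a thousand documents comes from.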

JVisualVM HeapDump OQL rendering array inside an Object

I am trying to write a query such as this:
select {r: referrers(f), count:count(referrers(f))}
from com.a.b.myClass f
However, the output doesn't show the actual objects:
{
count = 3.0,
r = [object Object]
}
Removing the Javascript Object notation once again shows referrers normally, but they are no longer compartmentalized. Is there a way to format it inside the Object notation?

So I see that you asked this question a year ago, so I don't know if you still need the answer, but since I was searching around for something similar, I can answer this. The problem is that referrers(f) returns an enumeration and so it doesn't really translate well when you try to put it into your hashmap. I was doing a similar type of analysis where I was trying to find unique char arrays (count the unique combinations of char arrays up to the first 50 characters). What I came up with was this:
var counts = {};
filter(
map(
unique(
map(
filter(heap.objects('char[]'), "it.length > 50"), // filter out strings less than 50 chars in length
function(charArray) { // chop the string at 50 chars and then count the unique combos
var subs = charArray.toString().substr(0,50);
if (! counts[subs]) {
counts[subs] = 1;
} else {
counts[subs] = counts[subs] + 1;
}
return subs;
}
) // map
) // unique
, function(subs) { // map the strings into an array that has the string and the counts of that string
return { string: subs, count: counts[subs] };
}) // map
, "it.count > 5000"); // filter out strings that have counts < 5000
This essentially shows how to take an enumeration (heap.objects('char[]') in this case) and filter it and map it so that you can compute statistics on it. Hope this helps someone.
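For readers more at home in Java than in OQL, the same filter, truncate, count, and threshold logic can be sketched on an ordinary list of strings. This does not run against a heap dump. It only restates the steps, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PrefixCounts {
    // Counts values longer than prefixLength by their first prefixLength characters,
    // then keeps only the prefixes seen more than minCount times.
    static Map<String, Integer> frequentPrefixes(List<String> values, int prefixLength, int minCount) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            if (value.length() > prefixLength) {
                counts.merge(value.substring(0, prefixLength), 1, Integer::sum);
            }
        }
        // Removing from the values view also removes the entries from the map.
        counts.values().removeIf(count -> count <= minCount);
        return counts;
    }

    public static void main(String[] args) {
        // Tiny made-up sample; the query above uses a prefix of 50 and a threshold of 5000.
        List<String> sample = Arrays.asList("abcd", "abce", "abcf", "xyzq", "ab");
        System.out.println(frequentPrefixes(sample, 3, 1));
    }
}
```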

Introduce a counter into a loop within scala

I'm writing a small program which will convert a very large file into multiple smaller files, each containing 100 lines.
I'm iterating over an iterator of lines:
while (lines.hasNext) {
val line = lines.next()
}
I want to introduce a counter and, when it reaches a certain value, reset the counter and proceed. In Java I would do something like:
int counter = 0;
while (lines.hasNext) {
val line = lines.next()
if(counter == 100){
counter = 0;
}
++counter
}
Is there something similar in scala or an alternative method ?

Traditionally in Scala you use .zipWithIndex:
scala> List("foo","bar")
res0: List[java.lang.String] = List(foo, bar)
scala> for((x,i) <- res0.zipWithIndex) println(i + " : " +x)
0 : foo
1 : bar
(This will work with your lines too, as long as they are an Iterator, i.e. something with hasNext and next() methods, or some other Scala collection.)
But if you need more complicated logic, like resetting the counter, you may write it the same way as in Java:
var counter = 0
while (lines.hasNext) {
val line = lines.next()
if (counter % 100 == 0) {
// now write to another file
}
counter += 1
}
Maybe you can tell us why you want to reset counter, so we may say how to do it better?
EDIT
According to your update, this is better done with the grouped method, as @pr1001 proposed:
lines.grouped(100).foreach(l => l.foreach(/* write line to file*/))

If your resetting counter represents the fact that there are repeated groups of data in the original list, you might want to use the grouped method:
scala> val l = List("one", "two", "three", "four")
l: List[java.lang.String] = List(one, two, three, four)
scala> l.grouped(2).toList
res0: List[List[java.lang.String]] = List(List(one, two), List(three, four))
Update: Since you're reading from a file, you should be able to pretty efficiently iterate over the file:
val bigFile = io.Source.fromFile("/tmp/verybigfile")
val groupedLines = bigFile.getLines.grouped(2).zipWithIndex
groupedLines.foreach(group => {
val (lines, index) = group
val p = new java.io.PrintWriter("/tmp/" + index)
lines.foreach(p.println)
p.close()
})
Of course this could also be written as a for comprehension...
You might even be able to get better performance by converting groupedLines to a parallel collection with .par before writing out each group of lines to its own file.
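Since the question compares against Java, here is a rough Java counterpart of the grouped approach. It is a sketch only: the class and method names are hypothetical, and the sink callback stands in for whatever opens and writes the file for each group. Groups are handed over one at a time, so the whole file never has to be held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

final class LineChunks {
    // Passes consecutive groups of at most `size` lines to `sink`, together with
    // the zero-based group index. Returns the number of groups handed over.
    static int forEachGroup(Iterator<String> lines, int size, BiConsumer<Integer, List<String>> sink) {
        int index = 0;
        List<String> current = new ArrayList<>(size);
        while (lines.hasNext()) {
            current.add(lines.next());
            if (current.size() == size) {
                sink.accept(index++, current);
                current = new ArrayList<>(size);
            }
        }
        if (!current.isEmpty()) {
            // The last group may be shorter than `size`.
            sink.accept(index++, current);
        }
        return index;
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("one", "two", "three", "four", "five").iterator();
        // A real sink would write each group to its own file, e.g. named after the index.
        forEachGroup(lines, 2, (i, group) -> System.out.println(i + ": " + group));
    }
}
```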

This would work:
lines grouped 100 flatMap (_.zipWithIndex) foreach {
case (line, count) => //whatever
}

You may use zipWithIndex along with some transformation.
scala> List(10, 20, 30, 40, 50).zipWithIndex.map(p => (p._1, p._2 % 3))
res0: List[(Int, Int)] = List((10,0), (20,1), (30,2), (40,0), (50,1))
