Getting sub-scores for sub-queries in Lucene - java

I have constructed a query that's essentially a weighted sum of other queries:
val query = new BooleanQuery
for ((subQuery, weight) <- ...) {
  subQuery.setBoost(weight)
  query.add(subQuery, BooleanClause.Occur.MUST)
}
When I query the index, I get back documents with the overall scores. This is good, but I also need to know what the sub-scores for each of the sub-queries were. How can I get those? Here's what I'm doing now:
for (scoreDoc <- searcher.search(query, nHits).scoreDocs) {
  val score = scoreDoc.score
  val subScores = subQueries.map { subQuery =>
    val weight = searcher.createNormalizedWeight(subQuery)
    val scorer = weight.scorer(reader, true, true)
    scorer.advance(scoreDoc.doc)
    scorer.score
  }
}
I think this gives me the right scores, but it seems wasteful to advance to and re-score the document when I know it's already been scored as part of the overall score.
Is there a more efficient way to get those sub-scores?
[My code here is in Scala, but feel free to respond in Java if that's easier.]
EDIT: Here's what things look like after following Robert Muir's suggestion.
The query:
val query = new BooleanQuery
for ((subQuery, weight) <- ...) {
  val weightedQuery = new BoostedQuery(subQuery, new ConstValueSource(weight))
  query.add(weightedQuery, BooleanClause.Occur.MUST)
}
The search:
val collector = new DocScoresCollector(nHits)
searcher.search(query, collector)
for (docScores <- collector.getDocSubScores) {
  ...
}
The collector:
class DocScoresCollector(maxSize: Int) extends Collector {
  var scorer: Scorer = null
  var subScorers: Seq[Scorer] = null
  val priorityQueue = new DocScoresPriorityQueue(maxSize)

  override def setScorer(scorer: Scorer): Unit = {
    this.scorer = scorer
    // a little reflection hackery is required here because of a bug in
    // BoostedQuery's scorer's getChildren method
    // https://issues.apache.org/jira/browse/LUCENE-4261
    this.subScorers = scorer.getChildren.asScala.map(childScorer =>
      childScorer.child ...some hackery... ).toList
  }

  override def acceptsDocsOutOfOrder: Boolean = false

  override def collect(doc: Int): Unit = {
    this.scorer.advance(doc)
    val score = this.scorer.score
    val subScores = this.subScorers.map(_.score)
    priorityQueue.insertWithOverflow(DocScores(doc, score, subScores))
  }

  override def setNextReader(context: AtomicReaderContext): Unit = {}

  def getDocSubScores: Seq[DocScores] = {
    val buffer = Buffer.empty[DocScores]
    while (this.priorityQueue.size > 0) {
      buffer += this.priorityQueue.pop
    }
    buffer
  }
}

case class DocScores(doc: Int, score: Float, subScores: Seq[Float])

class DocScoresPriorityQueue(maxSize: Int) extends PriorityQueue[DocScores](maxSize) {
  def lessThan(a: DocScores, b: DocScores) = a.score < b.score
}

There is a scorer navigation API. The basic idea is that you write a collector, and in its setScorer method, where you would normally just save a reference to that Scorer so you can later score() each hit, you can now also walk the tree of that Scorer's subscorers, and so on.
Note that Scorers have pointers back to the Weight that created them, and the Weight back to the Query.
Using all of this, you can stash away references to the subscorers you care about in your setScorer method, e.g. all the ones created from TermQueries. Then, when scoring hits, you can investigate things like the freq() and score() of those nodes in your collector.
In the 3.x series this is a visitor API limited to boolean relationships; in the 4.x series (as of now only an alpha release), you can just get the child + relationship of each subscorer, so it can work with arbitrary queries (including custom ones you write).
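To make the walk concrete, here is a rough, untested sketch (mine, not from the answer) of collecting every TermQuery scorer reachable from the top-level Scorer, assuming the 4.x getChildren/ChildScorer API; you would call it from setScorer and then read freq()/score() off the stashed scorers while collecting:
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.lucene.search.{Scorer, TermQuery}

// Recursively walk the scorer tree, keeping the scorers whose originating
// query is a TermQuery (the leaf nodes you usually care about).
def collectTermScorers(scorer: Scorer, out: mutable.Buffer[Scorer]): Unit = {
  if (scorer.getWeight.getQuery.isInstanceOf[TermQuery])
    out += scorer
  for (child <- scorer.getChildren.asScala) // each ChildScorer has a child and a relationship
    collectTermScorers(child.child, out)
}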
Caveats:
you will need to return false from acceptsDocsOutOfOrder in your collector, as your collector requires this document-at-a-time processing for this to work.
you probably want a bugfix branch of the 3.6 series (http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/) or a snapshot of 4.x (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/). This is because this functionality generally didn't work, since disjunctions (OR queries) always set their subscorers 'one doc ahead' of the current document until some things were fixed last week, and those fixes didn't make it in time for 3.6.1. See https://issues.apache.org/jira/browse/LUCENE-3505 for more details.
There aren't really any good examples, except some simple tests that sum up the term frequencies of all the leaf nodes (see below)
Tests:
4.x series: http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/core/src/test/org/apache/lucene/search/TestBooleanQueryVisitSubscorers.java
3.x series: http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/lucene/core/src/test/org/apache/lucene/search/TestBooleanQueryVisitSubscorers.java


feed_dict equivalent in Java

I am using Java to serve a TensorFlow model trained with Python. The model has two inputs. The code is the following:
def predict(float32InputShape: (Long, Long),
            float32Inputs: Seq[Seq[Float]],
            uint8InputShape: (Long, Long),
            uint8Inputs: Seq[Seq[Byte]]): Array[Float] = {
  val float32Input = Tensor.create(
    Array(float32InputShape._1, float32InputShape._2),
    FloatBuffer.wrap(float32Inputs.flatten.toArray)
  )
  val uint8Input = Tensor.create(
    classOf[UInt8],
    Array(uint8InputShape._1, uint8InputShape._2),
    ByteBuffer.wrap(uint8Inputs.flatten.toArray)
  )
  val tfResult = session
    .runner()
    .feed("serving_default_float32_Input", float32Input)
    .feed("serving_default_uint8_Input", uint8Input)
    .fetch("PartitionedCall")
    .run()
    .get(0)
    .expect(classOf[java.lang.Float])
  tfResult
}
What I would like to do is to refactor that method to make it more generic by passing the inputs like with feed_dict in Python. That is, something like:
def predict2(inputs: Map[String, Seq[Seq[Float]]]): Array[Float] = {
  ...
  session
    .runner()
    .feed(inputs)
  ...
}
Where the key of the inputs map would be the name of the input layer. It's not possible to do so with the feed method unless I make a macro (which I want to avoid).
Is there any way to do this with the Java API of TensorFlow (I'm using TF 2.0)?
Edit:
I found the solution (thanks to @geometrikal's answer). The code is in Scala, but it shouldn't be too hard to do the same in Java.
val runnerWithInputLayers = inputs.foldLeft(session.runner()) {
  case (sess, (layerName, array)) =>
    val tensor = createTensor(array)
    sess.feed(layerName, tensor)
}
val output = runnerWithInputLayers
  .fetch(outputLayer)
  .run()
  .get(0)
  .expect(Float.getClass)
It's possible because the .feed method returns a Session.Runner with the input layer provided.
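For reference, here is a rough, self-contained sketch (mine, not the asker's exact code) of what predict2 could look like built around that foldLeft, assuming a hypothetical createTensor helper that only handles float inputs and an outputLayer parameter naming the fetched node:
import java.nio.FloatBuffer
import org.tensorflow.{Session, Tensor}

def predict2(session: Session,
             outputLayer: String,
             inputs: Map[String, Seq[Seq[Float]]]): Tensor[java.lang.Float] = {
  // Hypothetical helper: builds a 2-D float tensor from a Seq[Seq[Float]].
  def createTensor(array: Seq[Seq[Float]]): Tensor[java.lang.Float] =
    Tensor.create(
      Array(array.length.toLong, array.head.length.toLong),
      FloatBuffer.wrap(array.flatten.toArray))

  // Feed every (layer name, values) pair, threading the Runner through the fold.
  val runner = inputs.foldLeft(session.runner()) {
    case (r, (layerName, array)) => r.feed(layerName, createTensor(array))
  }
  runner.fetch(outputLayer).run().get(0).expect(classOf[java.lang.Float])
}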
You can feed each input in a loop. I'm not so familiar with the Java syntax, but the pseudo-code is something like:
e.g.
// pseudo-code: assumes each value in inputs is already a Tensor
var runner = session.runner()
for ((key, value) <- inputs) {
  runner = runner.feed(key, value)
}
val tfResult = runner
  .fetch("PartitionedCall")
  .run()
  .get(0)
  .expect(classOf[java.lang.Float])
Remember you can break up the function chain at any point, e.g. result = foo.bar().baz().qux() can be written temp = foo.bar().baz(); result = temp.qux()

Kotlin: Find Count from Nested set in List (more functional approach)

The function below creates a Map of passengers to their trip counts and returns the passengers whose count is at least minTrips. The code works completely fine. Please see below:
fun List<Trip>.filter(minTrips: Int): Set<Passenger> {
    var passengerMap: HashMap<Passenger, Int> = HashMap()
    this.forEach { it: Trip ->
        it.passengers.forEach { it: Passenger ->
            var count: Int? = passengerMap.get(it)
            if (count == null) {
                count = 1
                passengerMap.put(it, count)
            } else {
                count += 1
                passengerMap.put(it, count)
            }
        }
    }
    val filteredMinTrips: Map<Passenger, Int> = passengerMap.filterValues { it >= minTrips }
    println(" Filter Results = ${filteredMinTrips}")
    return filteredMinTrips.keys
}
Even though this is written in Kotlin, it seems like the code was first written in Java and then converted over to Kotlin. If it was truly written in Kotlin I am sure it wouldn't have been so many lines of code. How can I reduce the lines of code? What would be a more functional approach to solve this? What function or functions can I use to extract the Passengers Set directly where Passengers are > minTrips? This is too much code and seems crazy. Any pointers would be helpful here.
One way you could do this is to take advantage of Kotlin's flatmap and grouping calls. By creating a list of all passengers on all trips, you can group them, count them, and return the ones that have over a certain number.
Assuming you have data classes like this (essential details only):
data class Passenger(val id: Int)
data class Trip(val passengers: List<Passenger>)
I was able to write this:
fun List<Trip>.frequentPassengers(minTrips: Int): Set<Passenger> =
    this
        .flatMap { it.passengers }
        .groupingBy { it }
        .eachCount()
        .filterValues { it >= minTrips }
        .keys
This is nice because it is a single expression. Going through it, we look at each Trip and extract all of its Passengers. If we had just done map here, we would have List<List<Passenger>>, but we want a List<Passenger>, so we flatMap to achieve that. Next, we group by the Passenger objects themselves and call eachCount() on the returned object, giving us a Map<Passenger, Int>. Finally we filter the map down to the Passengers we find interesting, and return the set of keys.
Note that I renamed your function, List already has a filter on it, and even though the signatures are different I found it confusing.
You basically want to count the trips for each passenger, so you can put all passengers in a list, then group by them, and afterwards count the occurrences in each group:
fun List<Trip>.usualPassengers(minTrips: Int) =  // 1
    flatMap(Trip::passengers)                    // 2
        .groupingBy { it }                       // 3
        .eachCount()                             // 4
        .filterValues { it >= minTrips }         // 5
        .keys                                    // 6
Explanation:
1. the return type Set<Passenger> can be inferred
2. this can be omitted; a list of the form [p1, p2, p1, p5, ...] is returned
3. a Grouping is created, which looks like this: [p1=[p1, p1], p2=[p2], ...]
4. the number of occurrences in each group is counted: [p1=2, p2=1, ...]
5. all entries with values less than minTrips are filtered out
6. all keys that are left are returned: [p1, p2, ...]
p1...pn are Passenger instances.

Spark streaming mapWithState timeout delayed?

I expected the new mapWithState API for Spark 1.6+ to near-immediately remove objects that are timed-out, but there is a delay.
I'm testing the API with the adapted version of the JavaStatefulNetworkWordCount below:
SparkConf sparkConf = new SparkConf()
    .setAppName("JavaStatefulNetworkWordCount")
    .setMaster("local[*]");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
ssc.checkpoint("./tmp");

StateSpec<String, Integer, Integer, Tuple2<String, Integer>> mappingFunc =
    StateSpec.function((word, one, state) -> {
        if (state.isTimingOut())
        {
            System.out.println("Timing out the word: " + word);
            return new Tuple2<String, Integer>(word, state.get());
        }
        else
        {
            int sum = one.or(0) + (state.exists() ? state.get() : 0);
            Tuple2<String, Integer> output = new Tuple2<String, Integer>(word, sum);
            state.update(sum);
            return output;
        }
    });

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
    ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
            StorageLevels.MEMORY_AND_DISK_SER_2)
        .flatMap(x -> Arrays.asList(SPACE.split(x)))
        .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
        .mapWithState(mappingFunc.timeout(Durations.seconds(5)));

stateDstream.stateSnapshots().print();
Together with nc (nc -l -p <port>)
When I type a word into the nc window I see the tuple being printed in the console every second. But it doesn't seem like the timing out message gets printed out 5s later, as expected based on the timeout set. The time it takes for the tuple to expire seems to vary between 5 & 20s.
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as checkpoints?
Once an event times out it's NOT deleted right away, but is only marked for deletion by saving it to a 'deltaMap':
override def remove(key: K): Unit = {
  val stateInfo = deltaMap(key)
  if (stateInfo != null) {
    stateInfo.markDeleted()
  } else {
    val newInfo = new StateInfo[S](deleted = true)
    deltaMap.update(key, newInfo)
  }
}
Then, timed out events are collected and sent to the output stream only at checkpoint. That is: events which time out at batch t will appear in the output stream only at the next checkpoint - by default, after 5 batch-intervals on average, i.e. at batch t+5:
override def checkpoint(): Unit = {
  super.checkpoint()
  doFullScan = true
}

...

removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled

...

// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  ...
Elements are actually removed only when there are enough of them, and when the state map is being serialized - which currently also happens only at checkpoint:
/** Whether the delta chain length is long enough that it should be compacted */
def shouldCompact: Boolean = {
  deltaChainLength >= deltaChainThreshold
}

// Write the data in the parent state map while copying the data into a new parent map for
// compaction (if needed)
val doCompaction = shouldCompact
...
By default checkpointing occurs every 10 iterations, thus in the example above every 10 seconds; since your timeout is 5 seconds, events are expected within 5-15 seconds.
EDIT: Corrected and elaborated answer following comments by @YuvalItzchakov
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as snapshots?
Every time a mapWithState is invoked (with your configuration, around every 1 second), the MapWithStateRDD will internally check for expired records and time them out. You can see it in the code:
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
    wrappedState.wrapTimingOutState(state)
    val returned = mappingFunction(batchTime, key, None, wrappedState)
    mappedData ++= returned
    newStateMap.remove(key)
  }
}
(Other than the time taken to execute each job, it turns out that newStateMap.remove(key) actually only marks records for deletion. See "Edit" for more.)
You have to take into account the time it takes for each stage to be scheduled, and the amount of time it takes for each execution of such a stage to actually take its turn and run. It isn't accurate because this runs as a distributed system where other factors can come into play, making your timeout more or less accurate than you expect it to be.
Edit
As @etov rightly points out, newStateMap.remove(key) doesn't actually remove the element from the OpenHashMapBasedStateMap[K, S], but simply marks it for deletion. This is also a reason why you're seeing the expiration time adding up.
The actual relevant piece of code is here:
// Write the data in the parent state map while
// copying the data into a new parent map for compaction (if needed)
val doCompaction = shouldCompact
val newParentSessionStore = if (doCompaction) {
  val initCapacity = if (approxSize > 0) approxSize else 64
  new OpenHashMapBasedStateMap[K, S](initialCapacity = initCapacity, deltaChainThreshold)
} else { null }

val iterOfActiveSessions = parentStateMap.getAll()

var parentSessionCount = 0

// First write the approximate size of the data to be written, so that readObject can
// allocate appropriately sized OpenHashMap.
outputStream.writeInt(approxSize)

while (iterOfActiveSessions.hasNext) {
  parentSessionCount += 1

  val (key, state, updateTime) = iterOfActiveSessions.next()
  outputStream.writeObject(key)
  outputStream.writeObject(state)
  outputStream.writeLong(updateTime)

  if (doCompaction) {
    newParentSessionStore.deltaMap.update(
      key, StateInfo(state, updateTime, deleted = false))
  }
}

// Write the final limit marking object with the correct count of records written.
val limiterObj = new LimitMarker(parentSessionCount)
outputStream.writeObject(limiterObj)
if (doCompaction) {
  parentStateMap = newParentSessionStore
}
If deltaMap should be compacted (marked with the doCompaction variable), then (and only then) is the map cleared of all the deleted instances. How often does that happen? Once the delta chain exceeds the threshold:
val DELTA_CHAIN_LENGTH_THRESHOLD = 20
Which means the delta chain is longer than 20 items, and there are items that have been marked for deletion.
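Neither answer says so explicitly, but if you need timed-out entries to surface closer to the configured timeout, one knob worth experimenting with (an assumption on my part, not something verified against the asker's setup) is an explicit, shorter checkpoint interval on the state stream, since timeout emission and compaction happen at checkpoint time. A rough Scala sketch:
import org.apache.spark.streaming.{Seconds, StateSpec}

// Sketch only: 'wordPairs' stands in for the (word, 1) pair DStream and
// 'mappingFunc' for a Scala equivalent of the state function in the question.
val stateDstream = wordPairs.mapWithState(
  StateSpec.function(mappingFunc).timeout(Seconds(5)))

// Checkpoint more often than the roughly 10 x batch-interval default, so the
// full scan that emits and compacts timed-out entries runs sooner.
stateDstream.checkpoint(Seconds(2))
stateDstream.stateSnapshots().print()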

Process the list of different types - is using scala (or functional programming) more expensive than Java?

First of all, let me be clear that I am very new to Scala and functional programming, so my understanding and implementation may be incorrect or inefficient.
Given a file look like this:
type1 param11 param12 ...
type2 param21 param22 ...
type2 param31 param32 ...
type1 param41 param42 ...
...
Basically, each line starts with the type of an object which can be created from the parameters that follow on the same line. I'm working on an application which goes through each line, creates an object of the given type, and returns the lists of all the objects.
In Java, my implementation is like this:
public void parse(List<Type1> type1s, List<Type2> type2s, List<String> lines) {
    for (String line : lines) {
        if (line.startsWith("type1")) {
            Type1 type1 = Type1.createObj(line);
            type1s.add(type1);
        } else if (line.startsWith("type2")) {
            Type2 type2 = Type2.createObj(line);
            type2s.add(type2);
        } else {
            throw new IllegalArgumentException(String.format("Unknown type %s", line));
        }
    }
}
In order to do the same thing in Scala, I do this:
def parse(lines: List[String]): (List[Type1], List[Type2]) = {
  val type1Lines = lines filter (x => x.startsWith("type1"))
  val type2Lines = lines filter (x => x.startsWith("type2"))
  val type1s = type1Lines map (x => Type1.createObj(x))
  val type2s = type2Lines map (x => Type2.createObj(x))
  (type1s, type2s)
}
As I understand it, while my Java implementation only goes through the list once, the Scala one has to do it three times: to filter type1, to filter type2, and to create objects from them. That means the Scala implementation should be slower than the Java one, right? Moreover, the Java implementation is also more memory-efficient, as it only has 3 collections: type1s, type2s and lines. On the other hand, the Scala one has 5: lines, type1Lines, type2Lines, type1s and type2s.
So my questions are:
1. Is there a better way to rewrite my Scala implementation so that the list is iterated only once?
2. Using immutable objects means a new object is created every time; does that mean functional programming requires more memory than other approaches?
Updated: I created a simple test to demonstrate that the Scala program is slower: a program receives a list of Strings with size = 1000000. It iterates through the list and checks each item; if an item starts with "type1", it adds 1 to a list named type1s, otherwise it adds 2 to another list named type2s.
Java implementation:
public static void test(List<String> lines) {
    System.out.println("START");
    List<Integer> type1s = new ArrayList<Integer>();
    List<Integer> type2s = new ArrayList<Integer>();
    long start = System.currentTimeMillis();
    for (String l : lines) {
        if (l.startsWith("type1")) {
            type1s.add(1);
        } else {
            type2s.add(2);
        }
    }
    long end = System.currentTimeMillis();
    System.out.println(String.format("END after %s milliseconds", end - start));
}
Scala implementation:
def test(lines: List[String]) = {
  println("START")
  val start = java.lang.System.currentTimeMillis()
  val type1Lines = lines filter (x => x.startsWith("type1"))
  val type2Lines = lines filter (x => x.startsWith("type2"))
  val type1s = type1Lines map (x => 1)
  val type2s = type2Lines map (x => 2)
  val end = java.lang.System.currentTimeMillis()
  println("END after %s milliseconds".format(end - start))
}
On average, the Java application took 44 milliseconds while the Scala one needed 200 milliseconds.
object ScalaTester extends App {
  val random = new Random

  test((0 until 1000000).toList map { _ => s"type${random nextInt 10}" })

  def test(lines: List[String]) {
    val start = Platform.currentTime
    val m = lines groupBy {
      case s if s startsWith "type1" => "type1"
      case s if s startsWith "type2" => "type2"
      case _ => ""
    }
    println(s"Total type1: ${m("type1").size}; Total type2: ${m("type2").size}; time=${Platform.currentTime - start}")
  }
}
The real advantage of Scala (and functional programming in general) is the ability to process data by transforming one structure into another.
Of course you can combine mappings, flatMappings, filters, groups and so forth in a single line of code. It results in a single data collection.
Or you may do it one step after another, creating new collections each time. That does produce a little overhead, but does one really care about it? Even though you create extra collections, Scala-style programming helps you design parallelism-friendly code (as Niklas already mentioned) and protects you from the very elusive side-effect errors that imperative-style programming is prone to.
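That said, if you specifically want the original parse to iterate the list only once, here is a rough single-pass sketch (mine, not from the answer; Type1/Type2 and their createObj factories are assumed from the question):
def parse(lines: List[String]): (List[Type1], List[Type2]) = {
  // Accumulate both result lists in a single left fold over the input.
  val (t1s, t2s) = lines.foldLeft((List.empty[Type1], List.empty[Type2])) {
    case ((acc1, acc2), line) if line.startsWith("type1") =>
      (Type1.createObj(line) :: acc1, acc2)
    case ((acc1, acc2), line) if line.startsWith("type2") =>
      (acc1, Type2.createObj(line) :: acc2)
    case (_, line) =>
      sys.error(s"Unknown type $line")
  }
  // Prepending reverses order, so restore the original line order here.
  (t1s.reverse, t2s.reverse)
}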

Using MapMaker#makeComputingMap to prevent simultaneous RPCs for the same data

We have a slow backend server that is getting crushed by load and we'd like the middle-tier Scala server to only have one outstanding request to the backend for each unique lookup.
The backend server only stores immutable data, but upon the addition of new data, the middle-tier servers will request the newest data on behalf of the clients and the backend server has a hard time with the load. The immutable data is cached in memcached using unique keys generated upon the write, but the write rate is high so we get a low memcached hit rate.
One idea I have is to use Google Guava's MapMaker#makeComputingMap() to wrap the actual lookup and after ConcurrentMap#get() returns, the middle-tier will save the result and just delete the key from the Map.
This seems a little wasteful, although the code is very easy to write, see below for an example of what I'm thinking.
Is there a more natural data structure, library or part of Guava that would solve this problem?
import com.google.common.collect.MapMaker

object Test
{
  val computer: com.google.common.base.Function[Int, Long] =
  {
    new com.google.common.base.Function[Int, Long] {
      override
      def apply(i: Int): Long =
      {
        val l = System.currentTimeMillis + i
        System.err.println("For " + i + " returning " + l)
        Thread.sleep(2000)
        l
      }
    }
  }

  val map =
  {
    new MapMaker().makeComputingMap[Int, Long](computer)
  }

  def get(k: Int): Long =
  {
    val l = map.get(k)
    map.remove(k)
    l
  }

  def main(args: Array[String]): Unit =
  {
    val t1 = new Thread() {
      override def run(): Unit =
      {
        System.err.println(get(123))
      }
    }
    val t2 = new Thread() {
      override def run(): Unit =
      {
        System.err.println(get(123))
      }
    }
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    System.err.println(get(123))
  }
}
I'm not sure why you implement remove yourself, why not simply have weak or soft values and let the GC clean up for you?
new MapMaker().weakValues().makeComputingMap[Int, Long](computer)
I think what you do is quite reasonable. You only use the structure to get lock striping on the key, to ensure that accesses to the same key conflict. Don't worry that you don't actually need a value mapping per key; ConcurrentHashMap and friends are the only structures in the Java libraries + Guava that offer you lock striping.
This does induce some minor runtime overhead, plus the size of the hashtable which you don't need (which might even grow, if accesses to the same segment pile up and remove() doesn't keep up).
If you want to make it as cheap as possible, you could code some simple lock striping yourself. Basically, an Object[] (or Array[AnyRef] :)) of N locks (N = concurrency level); you just map the hash of the lookup key into this array and lock on that element. Another advantage of this is that you really don't have to do the hashcode tricks that CHM has to do, because the latter has to split the hashcode into one part that selects the lock and another for the needs of the hashtable, whereas you can use the whole of it just for lock selection.
Edit: sketching my comment below:
val concurrencyLevel = 16
// use 'until' so we get exactly concurrencyLevel locks
val locks = (for (i <- 0 until concurrencyLevel) yield new AnyRef).toArray

def access(key: K): V = {
  // mask the sign bit so negative hashCodes don't produce a negative index
  val lock = locks((key.hashCode & Int.MaxValue) % locks.size)
  lock synchronized {
    // 'cache' and 'backendServer' are whatever lookup/store the middle tier already has
    val valueFromCache = cache.lookup(key)
    valueFromCache match {
      case Some(v) => return v
      case None =>
        val valueFromBackend = backendServer.lookup(key)
        cache.put(key, valueFromBackend)
        return valueFromBackend
    }
  }
}
(Btw, is the toArray call needed? Or is the returned IndexedSeq already fast to access by index?)
