Is there a per-request limit for simultaneous transactions? - java

I'm using a lot of (sharded) counters in my application. According to my current design, a single request can cause 100-200 different counters to increment.
So for each counter I'm picking one shard whose value I increment. I'm incrementing each shard in a transaction, which means I will end up doing 100-200 transactions as part of processing a single request. Naturally I intend to do this asynchronously, so that I will essentially be running all 100-200 transactions in parallel.
As this number feels pretty high, I'm left wondering whether there is some per-request or per-instance limit on the number of simultaneous transactions (or datastore requests). I could not find information on this in the documentation.
By the way, for some reason Google's documentation states that "if your app has counters that are updated frequently, you should not increment them transactionally" [1], but on the other hand their code example on sharding counters uses a transaction to increment the shard [2]. I figure I can use transactions if I just use enough shards. I prefer transactions because I'd rather my counters not miss increments.
[1] https://cloud.google.com/appengine/docs/java/datastore/transactions
[2] https://cloud.google.com/appengine/articles/sharding_counters

There are three limitations that will probably cause you problems here:
1/sec write limit per entity group
5 entity groups per cross-group (XG) transaction
10 concurrent 'threads' per instance
The last one is the tricky one for your use case.
It's a bit hard to find info on this (and it may in fact be out of date, so it's worth testing), but each instance only allows 10 concurrent core threads regardless of instance size (F1/F2/...).
That is, ignoring background threads, if you assume that each request takes a thread, as does each RPC (datastore, memcache, text search, etc.), you can only use 10 at a time. If the scheduler thinks an incoming request would exceed 10, it will route the request to a new instance.
In a scenario where you want to write to 100 entities in parallel, I'd expect only about 10 concurrent writes to be allowed (the rest blocking), and your instance could only service one request at a time.
Alternatives for you:
Use dedicated memcache - you'll need to handle backing the counters onto durable storage, but you could do that in batches on a backend. This may result in you losing some data due to flushes; whether that's OK or not is for you to decide
Use CloudSQL sequences or tables - if you don't require huge scale but do require lots of counters, this may be a good approach - you could store counts as raw counts, or as time-series data and post-process for accurate counts
Use pull queues to update counters in batches on a backend. You can process many 'events' across your many counter tables in larger batches. The downside is that the counts will not be up to date at any given point in time
The best approach is probably a hybrid.
For example, accepting some eventual consistency in counts (a sketch of the memcache and queue steps follows this list):
When a request comes in - atomic increment of counters in memcache
When a request comes in - queue an 'event' task
Serve needed counts from memcache - if not present load from the datastore
Use TTLs on memcache, so that eventually the datastore is seen as the 'source of truth'
Run a cron which pulls 100 'event' tasks off the queue every 5 minutes (or as appropriate), and updates counters for all the events in a transaction in the datastore
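A minimal Java sketch of the first two steps, assuming App Engine's memcache and task queue APIs (the queue name "counter-events" and the key prefix are my own placeholders, not from any article):
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class CounterEvents {
    private static final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
    // Hypothetical pull queue; it would need to be declared with <mode>pull</mode> in queue.xml.
    private static final Queue eventQueue = QueueFactory.getQueue("counter-events");

    /** Atomically bumps the in-memcache count and records an event for later batch persistence. */
    public static void recordHit(String counterName) {
        // increment() is atomic; the third argument seeds the value if the key is missing.
        cache.increment("counter:" + counterName, 1L, 0L);
        // Enqueue a pull task so a backend/cron can later fold the event into the datastore.
        eventQueue.add(TaskOptions.Builder.withMethod(TaskOptions.Method.PULL)
                .payload(counterName.getBytes()));
    }
}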
UPDATE: I found this section in the docs about controlling the maximum number of concurrent requests; it makes a nebulous reference to:
You may experience increased API latency if this setting is too high.
I'd say it's worth playing with.
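If I'm reading it right, for Java that setting is the max-concurrent-requests element under automatic-scaling in appengine-web.xml (a sketch; the value 20 is just an example, not a recommendation):
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <threadsafe>true</threadsafe>
  <automatic-scaling>
    <!-- Allow more concurrent requests per instance than the default. -->
    <max-concurrent-requests>20</max-concurrent-requests>
  </automatic-scaling>
</appengine-web-app>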

I see that you're using a sharded counter approach to avoid contention, as described in: cloud.google.com/appengine/articles/sharding_counters.
Can you collect all your counters in a single entity, so that each shard is a bunch of counters? Then you wouldn't need so many separate transactions. According to cloud.google.com/appengine/docs/python/ndb/#quotas, an entity can be 1MB max, and certainly 200 integers will fit into that size restriction just fine.
It may be that you don't know the property names in advance. Here is an approach expressed in Go using its PropertyLoadSaver interface that can deal with dynamic counter names.
const (
counterPrefix = "COUNTER:"
)
type shard struct {
// We manage the saving and loading of counters explicitly.
counters map[string]int64 `datastore:"-"`
}
// NewShard constructs a new shard.
func NewShard() *shard {
return &shard{make(map[string]int64)}
}
// Save implements PropertyLoadSaver.
func (s *shard) Save(c chan<- datastore.Property) error {
defer close(c)
for key, value := range s.counters {
c <- datastore.Property{
Name: counterPrefix + key,
Value: value,
NoIndex: true,
}
}
return nil
}
// Load implements PropertyLoadSaver.
func (s *shard) Load(c <-chan datastore.Property) error {
s.counters = make(map[string]int64)
for prop := range c {
if strings.HasPrefix(prop.Name, counterPrefix) {
s.counters[prop.Name[len(counterPrefix):]] = prop.Value.(int64)
}
}
return nil
}
The key is to use the raw API for defining your own property names when saving to the datastore. The Java API almost certainly has similar access, given the existence of PropertyContainer.
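For the Java low-level API, a rough sketch of the same idea might look like this (the kind, key name, and COUNTER: prefix are illustrative, not from the article):
import java.util.Map;
import com.google.appengine.api.datastore.Entity;

public class ShardProperties {
    private static final String COUNTER_PREFIX = "COUNTER:";

    /** Writes a map of counters as dynamically named, unindexed properties on a shard entity. */
    static Entity toEntity(String shardName, Map<String, Long> counters) {
        // Entity is a PropertyContainer, so property names can be decided at runtime.
        Entity shard = new Entity("CounterShard", shardName);
        for (Map.Entry<String, Long> c : counters.entrySet()) {
            shard.setUnindexedProperty(COUNTER_PREFIX + c.getKey(), c.getValue());
        }
        return shard;
    }

    /** Reads the counters back by stripping the prefix from each stored property name. */
    static void readInto(Entity shard, Map<String, Long> counters) {
        for (Map.Entry<String, Object> p : shard.getProperties().entrySet()) {
            if (p.getKey().startsWith(COUNTER_PREFIX)) {
                counters.put(p.getKey().substring(COUNTER_PREFIX.length()), (Long) p.getValue());
            }
        }
    }
}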
And the rest of the code described in the sharding article would be expressed in terms of manipulating a single entity that knows about multiple counters. So, for example, rather than having Increment() deal with a single counter:
// Increment increments the named counter.
func Increment(c appengine.Context, name string) error {
...
}
we'd change its signature to a bulk-oriented operation:
// Increment increments the named counters.
func Increment(c appengine.Context, names []string) error {
...
}
and the implementation would find a single shard, call Increment() for each of the counters we'd want to increment, and Save() that single entity to the datastore, all within a single transaction. Query would also involve consulting all the shards... but reads are fast. We still maintain the sharding architecture to avoid write contention.
The complete example code for Go is:
package sharded_counter
import (
"fmt"
"math/rand"
"strings"
"appengine"
"appengine/datastore"
)
const (
numShards = 20
shardKind = "CounterShard"
counterPrefix = "counter:"
)
type shard struct {
// We manage the saving and loading of counters explicitly.
counters map[string]int64 `datastore:"-"`
}
// NewShard constructs a new shard.
func NewShard() *shard {
return &shard{make(map[string]int64)}
}
// Names returns a list of the counter names stored in the shard.
func (s *shard) Names() []string {
names := make([]string, 0, len(s.counters))
for name := range s.counters {
names = append(names, name)
}
return names
}
// Lookup finds the counter's value.
func (s *shard) Lookup(name string) int64 {
return s.counters[name]
}
// Increment adds to the counter's value.
func (s *shard) Increment(name string) {
s.counters[name]++
}
// Save implements PropertyLoadSaver.
func (s *shard) Save(c chan<- datastore.Property) error {
for key, value := range s.counters {
c <- datastore.Property{
Name: counterPrefix + key,
Value: value,
NoIndex: true,
}
}
close(c)
return nil
}
// Load implements PropertyLoadSaver.
func (s *shard) Load(c <-chan datastore.Property) error {
s.counters = make(map[string]int64)
for prop := range c {
if strings.HasPrefix(prop.Name, counterPrefix) {
s.counters[prop.Name[len(counterPrefix):]] = prop.Value.(int64)
}
}
return nil
}
// AllCounters returns all counters.
func AllCounters(c appengine.Context) (map[string]int64, error) {
results := make(map[string]int64)
q := datastore.NewQuery(shardKind)
q = q.Ancestor(ancestorKey(c))
for t := q.Run(c); ; {
var s shard
_, err := t.Next(&s)
if err == datastore.Done {
break
}
if err != nil {
return results, err
}
for _, name := range s.Names() {
results[name] += s.Lookup(name)
}
}
return results, nil
}
// ancestorKey returns a key that all counter shards share as their ancestor.
func ancestorKey(c appengine.Context) *datastore.Key {
return datastore.NewKey(c, "CountersAncestor", "CountersAncestor", 0, nil)
}
// Increment increments the named counters.
func Increment(c appengine.Context, names []string) error {
shardName := fmt.Sprintf("shard%d", rand.Intn(numShards))
err := datastore.RunInTransaction(c, func(c appengine.Context) error {
key := datastore.NewKey(c, shardKind, shardName, 0, ancestorKey(c))
s := NewShard()
err := datastore.Get(c, key, s)
// A missing entity and a present entity will both work.
if err != nil && err != datastore.ErrNoSuchEntity {
return err
}
for _, name := range names {
s.Increment(name)
}
_, err = datastore.Put(c, key, s)
return err
}, nil)
return err
}
which, if you look closely, is pretty much the sharding-article example with a single, unnamed counter, extended to handle multiple counter names. I changed the query side a little so that reads use the same ancestor key, keeping everything in the same entity group.

Thanks for the responses! I think I now have the answers I need.
Regarding the per-request or per-instance limit
There is a per-instance limit on concurrent threads, which effectively limits the number of concurrent transactions. The default limit is 10. It can be increased, but it is unclear what side effects that will have.
Regarding the underlying problem
I chose to divide the counters into groups in such a way that counters that are usually incremented "together" are in the same group. Shards carry partial counts for all counters within the group that the individual shard is associated with.
Counts are still incremented in transactions, but thanks to the grouping, at most five transactions per request are needed. Each transaction increments numerous partial counts stored in a single shard, which is represented as a single datastore entity.
Even if the transactions are run in series, the time to process a request will still be acceptable. Each counter group has a few hundred counters. I make sure there are enough shards to avoid contention.
It should be noted that this solution is only possible because the counters can be divided into fairly large groups of counters that are typically incremented together.
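A rough Java sketch of that per-group transaction, using the low-level datastore API (the kind name, property prefix, and shard count below are my own placeholders, not taken from the question):
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;
import java.util.Random;

public class GroupedCounters {
    private static final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    private static final int NUM_SHARDS = 20; // enough shards per group to avoid write contention
    private static final Random rnd = new Random();

    /** Increments several counters of one group inside a single transaction on one shard entity. */
    public static void incrementAll(String group, Iterable<String> counterNames) {
        String shardName = group + "-shard" + rnd.nextInt(NUM_SHARDS);
        Transaction txn = ds.beginTransaction();
        try {
            Entity shard;
            try {
                shard = ds.get(txn, KeyFactory.createKey("CounterShard", shardName));
            } catch (EntityNotFoundException e) {
                shard = new Entity("CounterShard", shardName);
            }
            for (String name : counterNames) {
                Long current = (Long) shard.getProperty("COUNTER:" + name);
                shard.setUnindexedProperty("COUNTER:" + name, (current == null ? 0L : current) + 1);
            }
            ds.put(txn, shard);
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}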

Related

Iterate over large collection in mongo [duplicate]

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40,000th page? Also see this article.
I found it performant to combine the two concepts (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is that you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in JavaScript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined, but because of the projection on an index, it's extremely fast (essentially, the data blobs are never examined). Then with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined the two answers.
The problem is that when you use skip and limit without a sort, the results are just paginated in table order, i.e. in the same sequence as the data was written to the table, so the engine first has to build a temporary index. It is better to use the ready-made _id index: sort by _id, and it is very quick even with large tables, for example:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it would be:
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
'sort' => array('_id' => 1),
'limit' => $limit,
'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options );
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach: combine skip/limit (as an edge case, really) with range-based buckets on the sort key, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and sub-pages within that range if you still need skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. Using the sort index this way avoids the cursor traversing the entire collection to reach the final page.
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit this issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, by assigning a counting number to each document, write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
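A minimal sketch of that index jump, using the MongoDB Java driver and assuming each document carries an indexed counting field named "seq" (my name, not from the answer):
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class SeqPagination {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("mydb").getCollection("myCollection");
        long page = 5432, pageSize = 16;
        // Jump straight to the page via the indexed counting integer; no skip() needed.
        for (Document doc : coll.find(Filters.and(
                        Filters.gte("seq", page * pageSize),
                        Filters.lt("seq", (page + 1) * pageSize)))
                .sort(Sorts.ascending("seq"))) {
            System.out.println(doc.toJson());
        }
    }
}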
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with deletes, but still possible because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the current application I am dealing with, but next time I'll build a better one if I encounter a similar situation.
If you have Mongo's default _id, which is an ObjectId, use that instead. This is probably the most viable option for most projects anyway.
As stated in the official MongoDB docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
let endValue = null;
db.students.find( { _id: { $lt: startValue } } )
.sort( { _id: -1 } )
.limit( nPerPage )
.forEach( student => {
print( student.name );
endValue = student._id;
} );
return endValue;
}
Ascending order example here.
If you know the ID of the element from which you want to start, you can limit like this:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a neat little solution that works like a charm.
For faster pagination don't use the skip() function. Use limit() and find(), where you query on the last _id of the previous page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(),"product");
int page =0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; //this is the last id of the precedent page
for(int i=0; i<(totalElements/pageSize); i++) {
page +=1;
Aggregation aggregation = Aggregation.newAggregation(
Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
Aggregation.sort(Sort.Direction.ASC,"_id"),
new CustomAggregationOperation(queryOffersByProduct),
Aggregation.limit((long)pageSize)
);
List<ProductGroupedOfferDTO> productGroupedOfferDTOS = mongockTemplate.aggregate(aggregation,"product",ProductGroupedOfferDTO.class).getMappedResults();
lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size()-1).getId();
}

Reactor EmitterProcessor that only retains last n elements?

How do I create an EmitterProcessor that retains only the latest n elements, such that it also works even if there are no subscribers?
At the moment I create a processor like this:
EmitterProcessor<Integer> processor = EmitterProcessor.create();
And an external system provides temperature updates randomly throughout the day. In the callback from that system I do:
void tempConsumer(int temp) {
processor.onNext(temp);
}
However onNext(...) blocks once processor.getBufferSize() elements have been added.
How can I create a processor that discards the oldest element, in this case, rather than blocking?
This seems to be covered to some degree in reactor-core #763. Simon Baslé first discusses a proposed change to EmitterProcessor such that when "sending data while there are NO subscribers [and] the queue contains bufferSize elements, the oldest element is dropped and the onNext is enqueued." But then in the next comment, he says "we won't go ahead with my suggested change above. We instead advise you to use the sink() rather than directly the onNext. Namely, to use the onRequest callback inside the sink() to perform exactly as many sink.next(...) as there are requests."
However, if I understand things correctly this only covers the case where you can calculate new elements on demand, e.g. like so:
FluxSink<Integer> sink = processor.sink();
Random random = new Random();
sink.onRequest(n -> { // Generate the next n requested elements on demand.
    for (long i = 0; i < n; i++) sink.next(random.nextInt());
});
But in my situation, I can't generate the latest n temperature readings on demand. Of course, I could maintain my own external bounded buffer of the latest readings and then read from that in onRequest(...) but I'm assuming Reactor can do this for me?
I presume this question is a dup - but my Google foo has failed me here.
Ricard Kollcaku's answer that one should use ReplayProcessor seems to be the right way to do things. Here is another example that I wrote to get clear in my head how to use it:
ReplayProcessor<Integer> flux = ReplayProcessor.create(Queues.SMALL_BUFFER_SIZE);
FluxSink<Integer> sink = flux.sink();
// ReplayProcessor.getBufferSize() returns unbounded,
// while CAPACITY returns the capacity of the underlying buffer.
int capacity = flux.scan(Scannable.Attr.CAPACITY);
// Add twice as many elements as the underlying buffer can take.
int count = capacity * 2;
for (int i = 0; i < count; i++) {
sink.next(i);
}
// If `capacity` is 256, this will print values 256 through 511.
flux.subscribe(System.out::println);
I also found this section, in Hands-On Reactive Programming with Reactor, useful in explaining things.
You must use ReplayProcessor, like in this example:
ReplayProcessor<Integer> directProcessor = ReplayProcessor.cacheLast();
Flux.range(1, 10)
.map(integer -> {
directProcessor.onNext(integer);
return integer;
}).doOnComplete(() -> {
directProcessor.subscribe(System.out::println);
directProcessor.subscribe(System.out::println);
})
.subscribe();

Parse.com query count stability

For testing purposes I put the following code in the onCreate() of an Activity:
// Create 50 objects
for (int i = 0; i < 50; i++) {
ParseObject obj = new ParseObject("test_obj");
obj.put("foo", "bar");
try {
obj.save();
} catch (ParseException pe) {
Log.d("Parsetest", "Failed to save " + pe.toString());
}
}
// Count them
for (int i = 0; i < 10; i ++) {
ParseQuery<ParseObject> query = ParseQuery.getQuery("test_obj");
query.countInBackground(new CountCallback() {
@Override
public void done(int count, ParseException e) {
if (e == null) {
Log.d("Parsetest", "Background found " + count + " objects");
} else {
Log.d("Parsetest", "Query issue" + e.toString());
}
}
});
}
I would expect the count to be always fifty, however running this code yields something like:
D/Parsetest(17761): Background found 0 objects
D/Parsetest(17761): Background found 0 objects
D/Parsetest(17761): Background found 0 objects
D/Parsetest(17761): Background found 0 objects
D/Parsetest(17761): Background found 0 objects
D/Parsetest(17761): Background found 0 objects
D/Parsetest(17761): Background found 50 objects
D/Parsetest(17761): Background found 0 objects
D/Parsetest(17761): Background found 0 objects
D/Parsetest(17761): Background found 0 objects
Can somebody explain this behavior and how to correct this ?
Without knowing further details, I'm inclined to believe the inconsistency is due to threading and the mixing of synchronous/asynchronous calls.
For example, calling obj.save(); is synchronous (reference), however, without seeing the rest of your code, it's possible that the synchronous save is being executed on a background thread.
Additionally, query.countInBackground is asynchronous and is being called multiple times with a for loop. This is going to simultaneously create 10 separate background processes to query Parse for the count of objects and depending on how the save is handled there could be race conditions.
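One way to sidestep that mixing is to keep everything asynchronous and only count once the batch save has completed, roughly like this (a sketch using the Parse Android SDK; object and tag names are from the question's snippet):
import android.util.Log;
import com.parse.CountCallback;
import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;
import com.parse.SaveCallback;
import java.util.ArrayList;
import java.util.List;

public class CountAfterSave {
    /** Saves the 50 test objects in one async batch, then counts only after the save finishes. */
    static void saveThenCount() {
        List<ParseObject> objects = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            ParseObject obj = new ParseObject("test_obj");
            obj.put("foo", "bar");
            objects.add(obj);
        }
        ParseObject.saveAllInBackground(objects, new SaveCallback() {
            @Override
            public void done(ParseException e) {
                if (e != null) {
                    Log.d("Parsetest", "Failed to save " + e);
                    return;
                }
                // The count now runs strictly after the batch save has finished.
                ParseQuery.getQuery("test_obj").countInBackground(new CountCallback() {
                    @Override
                    public void done(int count, ParseException e2) {
                        if (e2 == null) {
                            Log.d("Parsetest", "Found " + count + " objects");
                        }
                    }
                });
            }
        });
    }
}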
Lastly, there are documented limitations on count operations with Parse.
Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
From Héctor Ramos on the Parse Developers Google group,
Count queries have always been expensive once you throw some constraints in. If you only care about the total size of the collection, you can run a count query without any constraints and that one should be pretty fast, as getting the total number of records is a different problem than counting how many of these match an arbitrary list of constraints. This is just the reality of working with database systems.
Given the cost of count operations, it is possible that Parse has mechanisms in place to prevent rapid bursts of count operations from a given client.
If you are needing to perform count operations often, the recommended approach is to use cloud code afterSave hooks to increment/decrement a counter as needed.

How would I use slick 3.0 to return one row at a time?

How would I build a Scala query to return one row of my table at a time?
My tables are in the following location if they help in answering this question:
Slick 3.0 (scala) queries don't return data till they are run multiple times (I think)
val q5 = for {
c <- dat.patientsss
} yield (c.PID, c.Gender, c.Age, c.Ethnicity)
Await.result((db.stream(q5.result).foreach(println)),Duration.Inf)
but instead of printing, I need to return each row.
Answer
Use a materialized result instead:
val result = Await.result((db.run(q5.result)), Duration.Inf)
result is a Seq that contains all your patient data. Use foreach to iterate over the result set:
result.foreach(r => yourFancyAlgorithm(r)) // r is a single patient's data row
Sidenote
Await blocks the current thread, making one of Slick's best features obsolete. Blocking threads is something you should not do. I highly recommend reading about Future and Promise in Scala.
The above example can be simply written as:
val result = db.run(q5.result)
result in this case will be of type Future[Seq[(yourPatientsData)]]. To access the data, use map on the result:
result.map(d => whatever) // d is of type Seq[(yourPatientsData)]
While waiting for the result, the rest of your application will continue to do its calculations and stuff. Finally when the result is ready, the callback (d => whatever) will run.

What determines the number of reducers and how to avoid bottlenecks regarding reducers?

Suppose I have a big tsv file with this kind of information:
2012-09-22 00:00:01.0 249342258346881024 47268866 0 0 0 bo
2012-09-22 00:00:02.0 249342260934746115 1344951 0 0 4 ot
2012-09-22 00:00:02.0 249342261098336257 346095334 1 0 0 ot
2012-09-22 00:05:02.0 249342261500977152 254785340 0 1 0 ot
I want to implement a MapReduce job that enumerates time intervals of five minutes and filters some information from the TSV input. The output file would look like this:
0 47268866 bo
0 134495 ot
0 346095334 ot
1 254785340 ot
The key is the number of the interval, e.g., 0 refers to the interval from 2012-09-22 00:00:00.0 to 2012-09-22 00:04:59.
I don't know whether this problem doesn't fit the MapReduce approach or whether I'm just not thinking about it right. In the map function, I'm just passing the timestamp as key and the filtered information as value. In the reduce function, I count the intervals by using global variables and produce the output mentioned.
i. Does the framework determine the number of reducers automatically, or is it user defined? With one reducer I think there is no problem with my approach, but I'm wondering whether a single reducer can become a bottleneck when dealing with really large files.
ii. How can I solve this problem with multiple reducers?
Any suggestions would be really appreciated!
Thanks in advance!
EDIT:
The first question is answered by @Olaf, but the second still gives me some doubts regarding parallelism. The output of my map function is currently this (I'm just passing the timestamp with minute precision):
2012-09-22 00:00 47268866 bo
2012-09-22 00:00 344951 ot
2012-09-22 00:00 346095334 ot
2012-09-22 00:05 254785340 ot
So in the reduce function I receive inputs where the key represents the minute when the information was collected and the values are the information itself, and I want to enumerate five-minute intervals beginning with 0. I'm currently using a global variable to store the beginning of the interval, and when a key goes past it I increment the interval counter (which is also a global variable).
Here is the code:
private long stepRange = TimeUnit.MINUTES.toMillis(5);
private long stepInitialMillis = 0;
private int stepCounter = 0;
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
long millis = Long.valueOf(key.toString());
if (stepInitialMillis == 0) {
stepInitialMillis = millis;
} else {
if (millis - stepInitialMillis > stepRange) {
stepCounter = stepCounter + 1;
stepInitialMillis = millis;
}
}
for (Text value : values) {
context.write(new Text(String.valueOf(stepCounter)),
new Text(key.toString() + "\t" + value));
}
}
So, with multiple reducers, I will have my reduce function running on two or more nodes, in two or more JVMs, and I will lose the control given by the global variables, and I can't think of a workaround for my case.
The number of reducers depends on the configuration of the cluster, although you can limit the number of reducers used by your MapReduce job.
A single reducer would indeed become a bottleneck in your MapReduce job if you are dealing with any significant amount of data.
The Hadoop MapReduce engine guarantees that all values associated with the same key are sent to the same reducer, so your approach should work with multiple reducers. See the Yahoo! tutorial for details: http://developer.yahoo.com/hadoop/tutorial/module4.html#listreducing
EDIT: To guarantee that all values for the same time interval go to the same reducer, you would have to use some unique identifier of the time interval as the key. You would have to do it in the mapper. I'm reading your question again and, unless you want to somehow aggregate the data between the records corresponding to the same time interval, you don't need any reducer at all.
EDIT: As @SeanOwen pointed out, the number of reducers depends on the configuration of the cluster. Usually it is configured to between 0.95 and 1.75 times the number of maximum reduce tasks per node times the number of data nodes. If the mapred.reduce.tasks value is not set in the cluster configuration, the default number of reducers is 1.
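For example, a small sketch of setting the reducer count explicitly on a Job (the node and slot numbers are placeholders for that heuristic, not values from the question):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "five-minute-buckets");
        int dataNodes = 10;          // assumed cluster size
        int reduceSlotsPerNode = 2;  // assumed reduce slots per node
        // Heuristic from above: ~0.95 x reduce slots per node x number of data nodes.
        job.setNumReduceTasks((int) (0.95 * reduceSlotsPerNode * dataNodes));
    }
}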
It looks like you want to aggregate some data by five-minute blocks. MapReduce with Hadoop works great for this sort of thing! There should be no reason to use any "global variables". Here is how I would set it up:
The mapper reads one line of the TSV. It grabs the timestamp, and computes which five-minute bucket it belongs in. Make that into a string, and emit it as the key, something like "20120922:0000", "20120922:0005", "20120922:0010", etc. As for the value that is emitted along with that key, just keep it simple to start with, and send on the whole tab-delimited line as another Text object.
Now that the mapper has determined how the data needs to be organized, it's the reducer's job to do the aggregation. Each reducer will get a key (one of the five-minute buckets), along with the list of all the lines that fit into that bucket. It can iterate over that list, extract whatever it wants from it, and write output to the context as needed.
As for mappers, just let Hadoop figure that part out. Set the number of reducers to how many nodes you have in the cluster, as a starting point. It should run just fine.
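A rough sketch of that mapper (the class name and date formats are my guesses at the layout described above, not the asker's code):
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FiveMinuteBucketMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final SimpleDateFormat input = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S");
    private final SimpleDateFormat bucket = new SimpleDateFormat("yyyyMMdd:HHmm");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        try {
            // Round the timestamp down to its five-minute bucket, e.g. "20120922:0005".
            Date ts = input.parse(fields[0]);
            long bucketMillis = ts.getTime() - (ts.getTime() % (5 * 60 * 1000L));
            // Emit the bucket as the key and pass the whole tab-delimited line through as the value.
            context.write(new Text(bucket.format(new Date(bucketMillis))), value);
        } catch (java.text.ParseException e) {
            // Skip malformed lines.
        }
    }
}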
Hope this helps.
