I have to read a file and store the values and then later do a lookup.
For example, the file will look as follows:
Gryffindor = 5
Gryffindor.Name.Harry = 10
Gryffindor.Name.Harry.Cloak.Black = 15
and so on...
I need to store these (I was thinking of a map). Later, I need to process every character and look up this map to assign them points. Suppose I encounter Harry: I know that he's from Gryffindor and that he's wearing a blue cloak. I will have to look up this map (or whatever object I use) as
Gryffindor.Name.Harry.Cloak.Blue
which should return nothing. I then need to fall back to just the name and look up
Gryffindor.Name.Harry
which should return 10.
Similarly, if I look up Ron (suppose he's wearing black),
Gryffindor.Name.Ron.Cloak.Black
should return nothing, fall back to
Gryffindor.Name.Ron
again nothing, fall back to
Gryffindor
which should return 5.
What would be an elegant way to store and read this data? I was thinking of using a map to store the key-value pairs and then a switch case to read them back. How would you do it?
Java has a built-in Properties class that implements Map and can read and write the data format you describe (see that class's load() and store() methods).
There's nothing in there to implement your "fall back to a higher-level key" feature, so you'll need to write a method that looks in the Properties instance for data under the desired key, and keeps trying successively shorter versions of the same key if it finds nothing.
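A minimal sketch of that fallback method, assuming the file has already been loaded into a Properties instance (the method name lookupWithFallback is just a placeholder):
// props is a java.util.Properties populated via props.load(reader)
static String lookupWithFallback(Properties props, String key) {
    String current = key;
    while (true) {
        String value = props.getProperty(current);
        if (value != null) {
            return value;
        }
        int lastDot = current.lastIndexOf('.');
        if (lastDot < 0) {
            return null; // no match at any level
        }
        current = current.substring(0, lastDot); // drop the last segment
    }
}
With your example data, "Gryffindor.Name.Harry.Cloak.Blue" falls back to "Gryffindor.Name.Harry" and returns "10", and "Gryffindor.Name.Ron.Cloak.Black" falls back all the way to "Gryffindor" and returns "5".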
I have a MongoDB database and the program I'm writing is meant to change the values of a single field for all documents in a collection. Now if I want them all to change to a single value, like the string value "mask", then I know that updateMany does the trick and it's quite efficient.
However, what I want is an efficient solution for updating to different new values; in fact I want to pick the new value of the field for each document from a list, e.g. an ArrayList. But then something like this
collection.updateMany(new BasicDBObject(),
        new BasicDBObject("$set", new BasicDBObject(fieldName,
                listOfMasks.get(random.nextInt(size)))));
wouldn't work, since updateMany doesn't recompute the value the field should be set to; it just evaluates the argument
listOfMasks.get(random.nextInt(size))
once and then uses that value for all the documents. So I don't think there's a solution to this problem that can actually employ updateMany, since it's simply not versatile enough.
But I was wondering if anyone has ideas for at least making it faster than simply iterating through all the documents and calling updateOne on each one, setting a new value from the ArrayList each time (in a random order, but that's just a detail), like below?
// Loop until the MongoCursor is empty (until the search is complete)
try {
    while (cursor.hasNext()) {
        // Pick a random mask
        String mask = listOfMasks.get(random.nextInt(size));
        // Update this document
        collection.updateOne(cursor.next(), Updates.set("test_field", mask));
    }
} finally {
    cursor.close();
}
MongoDB provides the bulk write API to batch updates. This would be appropriate for your example of setting the value of a field to a random value (determined on the client) for each document.
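A minimal sketch of that approach with the Java driver, reusing the collection, listOfMasks, random, size and "test_field" names from your code and filtering each update by _id:
// Filters, UpdateOneModel, Updates and WriteModel are in com.mongodb.client.model
List<WriteModel<Document>> requests = new ArrayList<>();
MongoCursor<Document> cursor = collection.find().iterator();
try {
    while (cursor.hasNext()) {
        Document doc = cursor.next();
        // pick a client-side random mask for this particular document
        String mask = listOfMasks.get(random.nextInt(size));
        requests.add(new UpdateOneModel<>(
                Filters.eq("_id", doc.get("_id")),
                Updates.set("test_field", mask)));
    }
} finally {
    cursor.close();
}
// one round trip (per batch) instead of one updateOne call per document
collection.bulkWrite(requests);
For a very large collection you would submit the requests in chunks of a few thousand models at a time rather than building one giant list in memory.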
Alternatively, if there is a pattern to the changes needed, you could potentially use a find-and-modify operation with the available update operators.
I am currently working with Java and MySQL, and I found an issue I don't know how to solve.
I have a class that stores a 365-character String representing a binary string ("010111010010100..."), and I would like to be able to store and read that field from the database.
Once it is read, I will perform an AND Logic operation with another bitarray.
I read about the BitSet class, which supports the logical operations (AND, OR, XOR, ...) between instances. I tried it, but I didn't like the solutions I got. I could also transform the String into a byte array, store and read that from the database, and later perform the logical AND, but I'm not sure whether I would always need to create a BitSet for that, and how performant it would be.
I don't know the most performant way to do what I want:
Convert the binary String into another representation.
Store that element in the database (in the case of BitSet I tried defining the database field as BLOB, but I had a lot of issues converting the BitSet to a BLOB and reading the BLOB back into a BitSet).
Read the element from the database (at this point it would be great to work with the element directly, without any cast or transformation).
Perform a logical AND with another bit array and get the result.
I have tried a lot of options, but they didn't work.
Could someone help me with this problem and how to better approach it from the performance point of view?
Thanks!
Storing bits in a string is a bit odd; I used a long to store the number and did bitwise operations on that. That won't work for you, since you use many more bits than a long can hold. If it has to remain a string, maybe you can write a short function that applies the AND operation on each character of the string, somehow like this:
StringBuilder result = new StringBuilder(365);
for (int i = 0; i < 365; i++) {
    // the result bit is '1' only when both input bits are '1'
    result.append(stringname.charAt(i) == '1' && binarystring.charAt(i) == '1' ? '1' : '0');
}
Go through your string and, at each position, check whether both your string and the other binary string (the one you want to AND it with) have a '1'; if they do, append '1', otherwise append '0'.
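If you do want to try the BitSet route mentioned in the question, a rough sketch (assuming the column is a BLOB/VARBINARY; ps is a PreparedStatement, rs a ResultSet, and "bits" a placeholder column name):
// convert the 365-character "0"/"1" string into a java.util.BitSet
static BitSet toBitSet(String bits) {
    BitSet set = new BitSet(bits.length());
    for (int i = 0; i < bits.length(); i++) {
        if (bits.charAt(i) == '1') {
            set.set(i);
        }
    }
    return set;
}
// store: BitSet -> byte[] -> BLOB column
ps.setBytes(1, toBitSet(binaryString).toByteArray());
// read back and AND with another bit set
BitSet stored = BitSet.valueOf(rs.getBytes("bits"));
stored.and(otherBitSet); // stored now holds the bitwise AND of the two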
Hello, I am implementing a Facebook-like program in Java using the Hadoop framework (I am new to this). The main idea is that I have an input .txt file like this:
Christina Bill,James,Nick,Jessica
James Christina,Mary,Toby,Nick
...
The first entry is the user and the comma-separated values are his friends.
In the map function I scan each line of the file and emit the user with each one of his friends like
Christina Bill
Christina James
which will be converted to (Christina,[Bill,James,..])...
BUT the description of my assignment specifies that the reduce function will receive as key a tuple of two users, followed by both of their friend lists; you count the common friends, and if that number is equal to or greater than a set number, like 5, you can safely assume that their uncommon friends can be suggested. So how exactly do I pass a pair of users to the reduce function? I thought the input of the reduce function has to be the same as the output of the map function. I started coding this, but I don't think this is the right approach. Any ideas?
public class ReduceFunction<KEY> extends Reducer<KEY, Text, KEY, Text> {
    private Text suggestedFriend = new Text();

    public void reduce(KEY key1, KEY key2, Iterable<Text> value1, Iterable<Text> value2, Context context) {
    }
}
The output of the map phase should, indeed, be of the same type as the input of the reduce phase. This means that, if there is a requirement for the input of the reduce phase, you have to change your mapper.
The idea is simple:
map(user u, friends F):
    for each f in F do
        emit (u-f, F \ f)

reduce(userPair u1-u2, friends F1, F2):
    #commonFriends = |F1 intersection F2|
To implement this logic, you can just use a Text key, in which you concatenate the names of the users, using, e.g., the '-' character between them.
Note that in each reduce method, you will only receive two lists of friends, assuming that each user appears once in your input data. Then, you only have to compare the two lists for common names of friends.
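A rough sketch of such a mapper, assuming the line format from your question ("user friend1,friend2,..."); the class name is a placeholder, and the pair key is ordered alphabetically so that both users' lines land under the same reducer key:
import java.io.IOException;
import java.util.Arrays;
import java.util.stream.Collectors;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FriendPairMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\\s+", 2);
        if (parts.length < 2) {
            return; // skip malformed lines
        }
        String user = parts[0];
        String[] friends = parts[1].split(",");
        for (String friend : friends) {
            // order the pair so "Christina-James" and "James-Christina"
            // become the same key
            String pairKey = user.compareTo(friend) < 0
                    ? user + "-" + friend
                    : friend + "-" + user;
            // F \ f from the pseudo code: this user's friends without the
            // friend that is already part of the key
            String others = Arrays.stream(friends)
                    .filter(f -> !f.equals(friend))
                    .collect(Collectors.joining(","));
            context.write(new Text(pairKey), new Text(others));
        }
    }
}
The reducer then receives exactly two comma-separated lists per pair key and only has to intersect them.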
Check whether you can implement a custom record reader and read two records at once from the input file in the mapper class. Then emit context.write(outkey, NullWritable.get()); from the mapper class. In the reducer class you then need to handle the two records that came in as the key (outkey) from the mapper class. Good luck!
I am using Hazelcast 2.5 in a cluster. I have a map (key: String, value: ArrayList of user-defined objects). I am able to put/remove fine in most places, but in one specific part of my code the put operation fails silently (the key string used for the put operation is unique and the ArrayList is not empty either). No exceptions are thrown.
In case there was a lock involved, I even tried tryPut, and that call gave me a true return value. Right after the put operation, I tried printing out the keySet for the map but cannot see the key I just inserted; the size of the map has not changed either (yet tryPut gave me a true return value, and I'm reasonably sure the string I am using for the key is unique, and I am hoping the binary form of the key is unique as well). If the binary form of my key is not unique, I would expect tryPut to return false or at least replace the previously added key/value pair with the new one (unless I misinterpreted the docs).
boolean putVal = testMap.tryPut(this.testObj.UUID, testEntity, timeout, TimeUnit.MILLISECONDS); //timeout is 2000L or 2 seconds in this case
Any thoughts on troubleshooting this or figuring out if the binary form for my key is causing the issue will be appreciated.
Thanks
Try to do a get and see if there is any value associated with that key. If not, the put should be successful.
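A quick diagnostic along those lines, reusing the key and value from the question:
boolean putOk = testMap.tryPut(this.testObj.UUID, testEntity, 2000L, TimeUnit.MILLISECONDS);
// read the entry back through the same map proxy immediately after the put
Object readBack = testMap.get(this.testObj.UUID);
System.out.println("tryPut returned " + putOk + ", get returned " + readBack);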
Okay, so I have been reading a lot about Hadoop and MapReduce, and maybe it's because I'm not as familiar with iterators as most, but I have a question I can't seem to find a direct answer to. Basically, as I understand it, the map function is executed in parallel by many machines and/or cores. Thus, whatever you are working on must not depend on prior code having been executed for the program to make any kind of speed gains. This works perfectly for me, but what I'm doing requires me to test information in small batches. Basically I need to send batches of lines of a .csv as arrays of 32, 64, 128 or however many lines each. Like lines 0-127 go to core1's execution of the map function, lines 128-255 go to core2's, etc. Also I need to have the contents of each batch available as a whole inside the function, as if I had passed it an array. I read a little about how the new Java API allows for something called push and pull, and that this allows things to be sent in batches, but I couldn't find any example code. I dunno, I'm going to continue researching, and I'll post anything I find, but if anyone knows, could they please post in this thread. I would really appreciate any help I might receive.
edit
If you could simply ensure that the chunks of the .csv are sent in sequence, you could perform it this way. I guess this also assumes that there are globals in MapReduce.
//** concept not code **//
GLOBAL_COUNTER = 0;
GLOBAL_ARRAY = NEW ARRAY();
map()
{
    GLOBAL_ARRAY[GLOBAL_COUNTER] = ITERATOR_VALUE;
    GLOBAL_COUNTER++;
    if (GLOBAL_COUNTER == 128)
    {
        // EXECUTE TEST WITH AN ARRAY OF 128 VALUES FOR COMPARISON
        GLOBAL_COUNTER = 0;
    }
}
If you're trying to get a chunk of lines from your CSV file into the mapper, you might consider writing your own InputFormat/RecordReader and potentially your own WritableComparable object. With the custom InputFormat/RecordReader you'll be able to specify how objects are created and passed to the mapper based on the input you receive.
If the mapper is doing what you want, but you need these chunks of lines sent to the reducer, make the output key for the mapper the same for each line you want in the same reduce function.
The default TextInputFormat will give input to your mapper like this (the keys/offsets in this example are just random numbers):
0 Hello World
123 My name is Sam
456 Foo bar bar foo
Each of those lines will be read into your mapper as a key,value pair. Just modify the key to be the same for each line you need and write it to the output:
0 Hello World
0 My name is Sam
1 Foo bar bar foo
The first time the reduce function is called, it will receive a key,value pair with the key being "0" and the value being an Iterable object containing "Hello World" and "My name is Sam". You'll be able to access both of these values in the same reduce method call by using the Iterable object.
Here is some pseudo code:
int count = 0

map (key, value) {
    int newKey = count / 2
    context.write(newKey, value)
    count++
}

reduce (key, values) {
    for value in values
        // Do something to each line
}
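A concrete (if simplified) Java version of that pseudo code might look like this; the class name and the batch size of 128 are placeholders, and the counter lives in the mapper instance, so batches never span input splits:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchKeyMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final int BATCH_SIZE = 128;
    private final IntWritable batchKey = new IntWritable();
    private int count = 0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // every BATCH_SIZE consecutive lines share the same key, so they
        // arrive together as one Iterable in a single reduce() call
        batchKey.set(count / BATCH_SIZE);
        context.write(batchKey, line);
        count++;
    }
}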
Hope that helps. :)
If the end goal of what you want is to force certain sets to go to certain machines for processing you want to look into writing your own Partitioner. Otherwise, Hadoop will split data automatically for you depending on the number of reducers.
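For reference, a bare-bones custom Partitioner looks roughly like this (the class name and the IntWritable key type are just assumptions for illustration); you would register it with job.setPartitionerClass(...):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class BatchPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // decide which reducer gets each key; a simple modulo spreads
        // the keys evenly across reducers
        return key.get() % numPartitions;
    }
}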
I suggest reading the tutorial on the Hadoop site to get a better understanding of M/R.
If you simply want to send N lines of input to a single mapper, you can use the NLineInputFormat class. You could then do the line parsing (splitting on commas, etc.) in the mapper.
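Configuring it is short; this assumes the new (org.apache.hadoop.mapreduce) API and a Job object named job, with 128 as the batch size from your question:
// NLineInputFormat is in org.apache.hadoop.mapreduce.lib.input
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 128);
// each map task now receives 128 consecutive lines of the CSV,
// still delivered one line per map() call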
If you want to have access to the lines before and after the line the mapper is currently processing, you may have to write your own input format. Subclassing FileInputFormat is usually a good place to start. You could create an InputFormat that reads N lines, concatenates them, and sends them as one block to a mapper, which then splits the input into N lines again and begins processing.
As far as globals in Hadoop go, you can specify some custom parameters when you create the job configuration, but as far as I know, you cannot change them in a worker and expect the change to propagate throughout the cluster. To set a job parameter that will be visible to workers, do the following where you are creating the job:
job.getConfiguration().set(Constants.SOME_PARAM, "my value");
Then, to read the parameter's value in the mapper or reducer:
public void map(Text key, Text value, Context context) {
    Configuration conf = context.getConfiguration();
    String someParam = conf.get(Constants.SOME_PARAM);
    // use someParam in processing input
}
Hadoop has support for basic types such as int, long, String, and boolean to be used as parameters.