Construct document-term matrix via Java and MapReduce - java

Background:
I’m trying to make a “document-term” matrix in Java on Hadoop using MapReduce. A document-term matrix is like a huge table where each row represents a document and each column represents a possible word/term.
Problem Statement:
Assuming that I already have a term index list (so that I know which term is associated with which column number), what is the best way to look up the index for each term in each document so that I can build the matrix row-by-row (i.e., document-by-document)?
So far I can think of two approaches:
Approach #1:
Store the term index list on the Hadoop distributed file system. Each time a mapper reads a new document for indexing, spawn a new MapReduce job -- one job for each unique word in the document, where each job queries the distributed terms list for its term. This approach sounds like overkill, since I am guessing there is some overhead associated with starting up a new job, and since this approach might call for tens of millions of jobs. Also, I’m not sure if it’s possible to call a MapReduce job within another MapReduce job.
Approach #2:
Append the term index list to each document so that each mapper ends up with a local copy of the term index list. This approach is pretty wasteful with storage (there will be as many copies of the term index list as there are documents). Also, I’m not sure how to merge the term index list with each document -- would I merge them in a mapper or in a reducer?
Question Update 1
Input File Format:
The input file will be a CSV (comma separated value) file containing all of the documents (product reviews). There is no column header in the file, but the values for each review appear in the following order: product_id, review_id, review, stars. Below is a fake example:
"Product A","1","Product A is very, very expensive.","2"
"Product G","2","Awesome product!!","5"
Term Index File Format:
Each line in the term index file consists of the following: an index number, a tab, and then a word. Each possible word is listed only once in the index file, such that the term index file is analogous to what could be a list of primary keys (the words) for an SQL table. For each word in a particular document, my tentative plan is to iterate through each line of the term index file until I find the word. The column number for that word is then defined as the column/term index associated with that word. Below is an example of the term index file, which was constructed using the two example product reviews mentioned earlier.
1 awesome
2 product
3 a
4 is
5 very
6 expensive
Output File Format:
I would like the output to be in the “Matrix Market” (MM) format, which is the industry standard for compressing matrices with many zeros. This is the ideal format because most reviews will contain only a small proportion of all possible words, so for a particular document it is only necessary to specify the non-zero columns.
The first row in the MM format has three tab separated values: the total number of documents, the total number of word columns, and the total number of lines in the MM file excluding the header. After the header, each additional row contains the matrix coordinates associated with a particular entry, and the value of the entry, in this order: reviewID, wordColumnID, entry (how many times this word appears in the review). For more details on the Matrix Market format, see this link: http://math.nist.gov/MatrixMarket/formats.html.
Each review’s ID will equal its row index in the document-term matrix. This way I can preserve the review’s ID in the Matrix Market format so that I can still associate each review with its star rating. My ultimate goal -- which is beyond the scope of this question -- is to build a natural language processing algorithm to predict the number of stars in a new review based on its text.
Using the example above, the final output file would look like this (I can't get Stackoverflow to show tabs instead of spaces):
2 6 7
1 2 1
1 3 1
1 4 1
1 5 2
1 6 1
2 1 1
2 2 1

Well, you can use something analogous to an inverted index concept.
I'm suggesting this because I'm assuming both files are big, so comparing them one-to-one would be a real performance bottleneck.
Here's an approach you can use:
You can feed both the input CSV file(s) (say, datafile1, datafile2) and the term index file (say, term_index_file) as input to your job.
Then in each mapper, you check the source file name, something like this:
Pseudo code for the mapper:
map(key, row, context) {
    String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    if (filename.startsWith("datafile")) {
        // split the review_id and the words out of the row
        ....
        context.write(new Text(word), new Text("-1" + "|" + review_id));
    } else if (filename.startsWith("term_index_file")) {
        // split the index and the word
        ....
        context.write(new Text(word), new Text(index + "|" + "0"));
    }
}
Example output from different mappers:
Key        Value    Source
product    -1|1     datafile
very       5|0      term_index_file
very       -1|1     datafile
product    -1|2     datafile
very       -1|1     datafile
product    2|0      term_index_file
...
...
Explanation (of the example):
As the example shows, the key will be your word and the value will be made of two parts separated by a delimiter "|".
If the source is a datafile, then you emit key=product and value=-1|1, where -1 is a dummy element and 1 is the review_id.
If the source is the term_index_file, then you emit key=product and value=2|0, where 2 is the index of the word 'product' and 0 is a dummy review_id, which we will use for sorting (explained later).
Since we provide the term_index_file as a normal input file to the job, no index will ever be processed by two different mappers.
So 'product', 'very', or any other indexed word in the term_index_file will only be seen by one mapper. Note that this is only true for the term_index_file, not the datafile.
Next step:
The Hadoop MapReduce framework, as you may well know, groups by key.
So you will have something like this going to different reducers:
reduce-1: key=product, value=<-1|1, -1|2, 2|0>
reduce-2: key=very, value=<5|0, -1|1, -1|1>
But we have a problem in the above case: we want the values sorted on the part after the '|', i.e. reduce-1 should see 2|0, -1|1, -1|2 and reduce-2 should see 5|0, -1|1, -1|1.
To achieve that you can use a secondary sort implemented with a sort comparator. Explaining it fully here would get lengthy, so please look up "Hadoop secondary sort"; a sketch of the supporting pieces follows.
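For reference, here is a minimal sketch of the supporting pieces for such a secondary sort, assuming the mapper is changed to emit a composite Text key of the form word|sortField (where sortField is "0" for term_index_file records and the review_id for datafile records). The class names are placeholders, not part of the original answer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortSupport {

    // Partition on the word part only, so every record for a word reaches the same reducer.
    public static class WordPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String word = key.toString().split("\\|")[0];
            return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group on the word part only, so one reduce() call sees all values for a word,
    // while the full "word|sortField" key ordering decides the order of the values.
    public static class WordGroupingComparator extends WritableComparator {
        public WordGroupingComparator() {
            super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String wordA = a.toString().split("\\|")[0];
            String wordB = b.toString().split("\\|")[0];
            return wordA.compareTo(wordB);
        }
    }

    // In the driver:
    //   job.setPartitionerClass(SecondarySortSupport.WordPartitioner.class);
    //   job.setGroupingComparatorClass(SecondarySortSupport.WordGroupingComparator.class);
    // The default Text sort already puts "word|0" (the index record) before "word|<review_id>";
    // for strictly numeric ordering of the review ids you would also set a custom sort comparator.
}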
In reduce-1, since the values are sorted as above, the first iteration gives us the '0' record and with it index_id=2, which we can then use for the subsequent iterations. In the next two iterations we get review ids 1 and 2 consecutively. We use a counter to keep track of repeated review ids; a repeated review_id means that a word appeared more than once in the same review. We reset the counter only when we find a different review_id, at which point we emit the previous review_id's details for this index_id, like this:
previous_review_id + "\t" + index_id + "\t" + count
When the loop ends, we'll be left with a single previous_review_id, which we finally emit in the same fashion.
Pseudo code for the reducer:
reduce(key, Iterable<Text> values, context) {
    String index_id = null;
    int count = 1;
    String previousReview_id = null;
    for (Text value : values) {
        String[] split = value.toString().split("\\|");
        ....
        // When consecutive review_ids are the same, we increment count;
        // as soon as the review_id differs, we emit the previously seen
        // review_id's details, then reset the counter.
        if (split[0].equals("-1") && split[1].equals(previousReview_id)) {
            count++;
        } else if (split[0].equals("-1") && !split[1].equals(previousReview_id)) {
            if (previousReview_id != null) {
                context.write(previousReview_id + "\t" + index_id + "\t" + count);
            }
            previousReview_id = split[1]; // remember the new review_id
            count = 1;                    // reset the count for the new review_id
        } else {
            index_id = split[0];
        }
    }
    // The last previousReview_id is left over, so write it after the loop completes.
    context.write(previousReview_id + "\t" + index_id + "\t" + count);
}
This job is done with multiple reducers in order to leverage Hadoop for what it is best known for: parallel performance. As a result, the final output will be scattered and deviate from your desired order, something like the following:
1 4 1
2 1 1
1 5 2
1 2 1
1 3 1
1 6 1
2 2 1
But if you want everything sorted by review_id (as in your desired output), you can write one more job that does that for you using a single reducer, taking the output of the previous job as input. At the same time it can calculate the header line 2 6 7 and put it at the front of the output.
This is just one approach (an idea) that I think might help you. You will definitely want to modify it, plug in a better algorithm, and use it in the way that benefits you most.
You can also use composite keys for better clarity, rather than a delimiter such as "|".
I am open to any clarification. Please ask if you think it might be useful to you.
Thank you!

You can load the term index list in Hadoop distributed cache so that it is available to mappers and reducers. For instance, in Hadoop streaming, you can run your job as follows:
$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapper.py \
-reducer myReducer.py \
-file myMapper.py \
-file myReducer.py \
-file myTermIndexList.txt
Now in myMapper.py you can load the file myTermIndexList.txt and use it for your purpose. If you give a more detailed description of your input and desired output, I can give you more details.
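Since the main question targets the Java API, here is a minimal sketch of the same idea with the Java distributed cache, assuming Hadoop 2.x (job.addCacheFile); the class name and file paths are placeholders.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DocTermMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Integer> termIndex = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file is symlinked into the task's working directory
        // under the name given after '#' in the cache URI (see driver comment below).
        try (BufferedReader in = new BufferedReader(new FileReader("myTermIndexList.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");            // "index<TAB>word"
                termIndex.put(parts[1], Integer.parseInt(parts[0]));
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // ... look up termIndex.get(word) for each word in the review ...
    }

    // In the driver (the HDFS path is a placeholder):
    //   job.addCacheFile(new URI("/user/me/myTermIndexList.txt#myTermIndexList.txt"));
}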

Approach #1 is not good, but it is very common if you don't have much Hadoop experience. Starting jobs is very expensive. What you want to do instead is have 2-3 jobs that feed each other to get the desired result. A common solution to similar problems is to have the mapper tokenize the input and emit pairs, group them in the reducer while performing some kind of calculation, and then feed that into job 2. In the mapper of job 2 you invert the data in some way, and in its reducer you do some other calculation.
I would highly recommend learning more about Hadoop through a training course. Interestingly, Cloudera's developer course has a very similar problem to the one you are trying to address. Alternatively, or perhaps in addition to a course, I would look at "Data-Intensive Text Processing with MapReduce", specifically the sections on "Computing Relative Frequencies" and "Inverted Indexing for Text Retrieval":
http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

Related

how to output sorted files in java

I have a problem where I want to scan the files that are in a certain folder and output them.
The only problem is that the output is (1.jpg, 10.jpg, 11.jpg, 12.jpg, ..., 19.jpg, 2.jpg) when I want it to be (1.jpg, 2.jpg, and so on). Since I use File actual = new File(i.); (i is the number of times the loop repeats) to scan for images, I don't know how to sort the output.
This is my code for now:
//variables
String htmlHeader = ("<!DOCTYPE html>:\n"
        + "<html lang=\"en\">\n"
        + "<head>\n"
        + "<meta charset=\"UTF-8\">\n"
        + "<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n"
        + "<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n"
        + "<title>Document</title>\n"
        + "</head>"
        + "<body>;\n");
String mangaName = ("THREE DAYS OF HAPPINESS");
String htmlEnd = ("</body>\n</html>");
String image = ("image-");

//ask for page number
Scanner scan = new Scanner(System.in);
System.out.print("enter a chapter number: ");
int n = scan.nextInt();

//create file for chapter
File creator = new File("manga.html");

//for loop
for (int i = 1; i <= n; ++i) {
    //writing to HTML file
    BufferedWriter bw = null;
    bw = new BufferedWriter(new FileWriter("manga" + i + ".html"));
    bw.write(htmlHeader);
    bw.write("<h2><center>" + mangaName + "</center></h2</br>");

    //scanning files
    File actual = new File("Three Days Of Happiness Chapter " + i + " - Manganelo_files.");
    for (File f : actual.listFiles()) {
        String pageName = f.getName();
        //create list
        List<String> list = Arrays.asList(pageName);
        list.sort(Comparator.nullsFirst(Comparator.comparing(String::length).thenComparing(Comparator.naturalOrder())));
        System.out.println("list");
        //writing body to html file
        bw.write("<p><center><img src=\"Three Days Of Happiness Chapter " + i + " - Manganelo_files/" + pageName + "\" <br/></p>\n");
        System.out.println(pageName);
    }
    bw.write(htmlEnd);
    bw.close();
    System.out.println("Process Finished");
}
When you try to sort the names, you'll most certainly notice that they are sorted alphanumerically (e.g. Comparing 9 with 12; 12 would come before 9 because the leftmost digit 1 < 9).
One way to get around this is to use an extended numbering format when naming & storing your files.
This has been working great for me when sorting pictures, for example. I use YYYY-MM-DD for all dates regardless of whether the day contains one digit (e.g. 9) or two digits (e.g. 11), which means I always type 9 as 09. This also means that every file name in a given folder has the same length, and each digit is compared properly against the corresponding digit of any adjacent file name.
One solution to your problem is to do the same and add zeros to the left of the file names so that they are easily sorted both by the OS and by your Java program. The drawback is that you need to decide the maximum number of files you will want to store in a given folder beforehand, by setting the number of digits accordingly (e.g. 3 digits would mean a maximum of 1000 uniquely and linearly numbered file names, from 000 to 999). The plus, however, is that this saves you the hassle of sorting unevenly numbered files: your files are pre-sorted once and are ready to be read quickly whenever you need them.
Generally, file systems do not have an order to the files in a directory. Instead, anything that lists files (be it an ls or dir command on a command line, calling Files.list in java code, or opening Finder or Explorer) will apply a sorting order.
One common sorting order is 'alphanumerically'. In which case, the order you describe is correct: 2 comes after 1 and also after 10. You can't wave a magic wand and tell the OS or file system driver not to do that; files as a rule don't have an 'ordering' property.
Instead, make your filenames such that they do sort the way you want, when sorting alphanumerically. Thus, the right name for the first file would be 01.jpg. Or possibly even 0001.jpg - you're going to have to make a call about how many digits you're going to use before you start, unfortunately.
String.format("%05d", 1) becomes "00001" - that's pretty useful here.
The same principle applies to reading files - you can't just rely on the OS sorting it for you. Instead, read it all into e.g. a list of some sort and then sort that. You're going to have to write a fairly funky sorting order: Find the dot, strip off the left side, check if it is a number, etc. Quite complicated. It would be a lot simpler if the 'input' is already properly zero-prefixed, then you can just sort them naturally instead of having to write a complex comparator.
That comparator should probably be modal. Comparators work by being handed 2 elements, and you must say which one is 'earlier', and you must be consistent (if a is before b, and later I ask: so, how about b and a?, you must indicate that b is after a).
Thus, an algorithm would look something like this (a sketch follows after the list):
Determine if a is numeric or not (find the dot, parseInt the substring from start to the dot).
Determine if b is numeric or not.
If both are numeric, check ordering of these numbers. If they have an order (i.e. aren't identical), return an answer. Otherwise, compare the stuff after the dot (1.jpg should presumably be sorted before 1.png).
If neither are numeric, just compare alphanum (aName.compareTo(bName)).
If one is numeric and the other one is not, the numeric one always wins, and vice versa.
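Here is a minimal sketch of such a comparator, assuming file names of the form <number>.<extension>; anything that does not parse as a number falls back to plain string comparison. The class and method names are placeholders.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NumericNameSort {
    // Orders "1.jpg" before "2.jpg" before "10.jpg"; non-numeric names sort alphabetically after.
    static final Comparator<String> BY_NUMERIC_PREFIX = (a, b) -> {
        Integer na = numericPrefix(a);
        Integer nb = numericPrefix(b);
        if (na != null && nb != null) {
            int cmp = Integer.compare(na, nb);
            return cmp != 0 ? cmp : a.compareTo(b);   // same number: compare the extensions
        }
        if (na != null) return -1;   // numeric names come before non-numeric ones
        if (nb != null) return 1;
        return a.compareTo(b);       // neither is numeric: plain alphanumeric order
    };

    static Integer numericPrefix(String name) {
        int dot = name.indexOf('.');
        if (dot <= 0) return null;
        try {
            return Integer.parseInt(name.substring(0, dot));
        } catch (NumberFormatException e) {
            return null;             // not a number before the dot
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("10.jpg", "2.jpg", "1.jpg", "cover.jpg");
        names.sort(BY_NUMERIC_PREFIX);
        System.out.println(names);   // [1.jpg, 2.jpg, 10.jpg, cover.jpg]
    }
}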

List to txt file getting insanely large, I mean huge

thank you for your help in advance.
I have written a program to gather data on some institutions. From the main website an id is provided, and then I have to use that id to find a second id that I need, from a different source. To save time when institutions are repeated (I can have the same institution in different years, for example, so even though some data will be different, the second id won't), I store the pairs of ids in two arrays at the same index: if I have the indexOf one, I have the other without having to go back to the source for that institution every time.
This works great so far; the problem appears when I try to store these pairs. Each string (id) is alphanumeric, can contain spaces, and never exceeds 20 characters.
I try to save the list after some iterations so I can stop the job and continue the next day, starting with the dictionary already built. The data is written to a text file with one pair per line:
id1;id2
The problem is that this txt file gets huge. I mean, with 16 000 lines, it's already above 300MB, and when it got above 60 000 it was already above 22GB of size.
There is obviously a problem as I have had files with more text taking far less space. I write the file in the following way:
List<String> aux = new ArrayList<>();
for (int i = 0; i < nipcs.size(); i++) {
    aux.add(id1.get(i) + ";" + id2.get(i));
}
Files.write(file.toPath(), aux, StandardCharsets.UTF_8);
I want to know if these file sizes are normal, or if there is a better method. Basically I do not care whether the txt file is "human-readable"; I just need to be able to load its content back into the id1 and id2 arrays the next day.
Thank you again!
Edit:
Here it is a sample of the txt file written, random 6 lines, copy-paste. I know I should expect to have around 120 000 lines max when all the institutions are registered.
ESA58357021;645090
507755383;346686
510216099;632378
207781818;840321
513268006;840323
502106344;54991
I think I found out what was happening. I paused Dropbox, and the symptom seems to have disappeared. I believe that writing the file while Dropbox was syncing was probably messing with the structure of the file.
If this goes south again I will post it here.

Count duplicates in huge text files collection

I have this collection of folders:
60G ./big_folder_6
52G ./big_folder_8
61G ./big_folder_7
60G ./big_folder_4
58G ./big_folder_5
63G ./big_folder_2
54G ./big_folder_9
61G ./big_folder_3
39G ./big_folder_10
74G ./big_folder_1
Each folder contains 100 txt files, with one sentence per line. For example, the file ./big_folder_6/001.txt:
sentence ..
sentence ..
...
Each file in the folder is between 4 and 6 GB (as you can see from the totals reported above), with roughly 40-60 million sentences. A single file fits in memory.
I need to deduplicate and count the globally unique sentences, so as to obtain a new collection of files where each line carries its count:
count ...unique sentence...
The collection is huge.
My first implementation (using Java) was a "merge sort" approach: order the lines into a new collection of 500 files (dispatching each line to the right file using its first N characters), then sort and aggregate duplicates within each individual file.
I know it is a wordcount map-reduce problem, but I would prefer to avoid it. The question is: am I using the right approach to solve this kind of problem, or should I consider other tools/approaches besides MapReduce?
Do you mean deleting duplicated lines within each file, or across all files?
In any case, you can't read the whole file into memory at once; you need to read line by line or a memory exception will be thrown. Use a BufferedReader and a map that stores each string with the count of the repeated line as its value; when you read a line, put it in the map, incrementing the value if it already exists.
After reading a file, write all the lines and their counts to a new file and release the memory.
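A minimal sketch of that per-file pass, assuming one sentence per line; the file paths are placeholders.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class CountLines {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("big_folder_6/001.txt");       // placeholder input file
        Path out = Paths.get("counts/001_counts.txt");      // placeholder output file

        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                counts.merge(line, 1, Integer::sum);         // count duplicates line by line
            }
        }

        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                writer.write(e.getValue() + "\t" + e.getKey());
                writer.newLine();
            }
        }
    }
}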
UPDATE 1
The problem is that you have a lot of gigabytes, so you can't keep every line in memory (that could throw a memory exception), but at the same time you have to keep them in memory to quickly check whether they are duplicated. What comes to mind is, instead of using the string itself as the key, use a hash of the string (using String.hashCode()), and when a line is seen for the first time write it to the new file, flushing every 100 lines or so to reduce the time spent writing to disk. After you have processed all the files, written the unique lines to a file, and kept only integers in the map (the string's hashCode as key and the count as value), you start reading back the file containing only unique lines and create a new file writing each line together with its count.

How to parse a single row from a CSV file for each java run and for the next run parse the next row?

I'm a beginner to java and I have a java class that reads in data from a CSV file which looks like this:
BUS|ID|Load|Max_Power
1 | 2 | 1 | 10.9
2 | 3 | 2 | 8.0
My problem is this: I have to consider for each java run (program execution), only 1 row at a time. For example for my first run I need to read in only the first row and then for my second run I need to read in the data from the second row.
Would using Hashmaps be the right way to search for the keys for each run?
When your program has to "remember" something beyond its termination, you need to store that information somewhere (a file, the registry, ...).
Step 1: Figure out how to do file I/O (read/write files) with Java. You need this to store your information (the line number, in this case).
Step 2: Implement the logic (a sketch follows after the hint below):
read lineToRead from the memory file (e.g. 1)
read line lineToRead (1) from the data file and parse the data (take a look at #Kent's answer for a nice explanation of how to do so)
increment lineToRead (1 -> 2) and save it into the memory file.
Hint: When multiple instances of your program are going to run in parallel, you have to ensure mutual exclusion / make the whole process (read, increment, write) atomic to prevent the lost update effect.
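Here is a minimal sketch of those steps, assuming a plain-text "memory" file that holds just the next line number; the file names are placeholders and the row is split on '|' as in the sample above.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class OneRowPerRun {
    public static void main(String[] args) throws IOException {
        Path memoryFile = Paths.get("lineToRead.txt");   // remembers the next row index (header = row 0)
        Path dataFile = Paths.get("data.csv");

        // Step 1: read which line to process next (default to the first data row).
        int lineToRead = Files.exists(memoryFile)
                ? Integer.parseInt(Files.readAllLines(memoryFile, StandardCharsets.UTF_8).get(0).trim())
                : 1;

        // Step 2: read and parse exactly that row.
        List<String> rows = Files.readAllLines(dataFile, StandardCharsets.UTF_8);
        if (lineToRead < rows.size()) {
            String[] fields = rows.get(lineToRead).split("\\|");
            System.out.println("Processing row " + lineToRead + ": " + String.join(", ", fields));

            // Step 3: remember the next row for the next run.
            Files.write(memoryFile, String.valueOf(lineToRead + 1).getBytes(StandardCharsets.UTF_8));
        } else {
            System.out.println("No more rows to process.");
        }
    }
}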
When you read the 1st line (the header), split it by | to get a string array (headerArray). Then initialize a HashMap<String, List<String>> (or a Multimap if you use Guava or another API) with the elements of that string array as keys.
Then for each data row, split by | again to get a string array (dataArray); you locate the map value via map.get(headerArray[index of dataArray]). Once you have the map entry/value, you apply the following logic (add the value to the list), as sketched below.
You can also design a ValueObjectType with those attributes and a special setter accepting an int index and a String value, which decides which attribute the value should go to. That way you don't need the map at all; you just need a List<ValueObjectType>.
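A minimal sketch of the map-based variant described above, using the sample columns (BUS|ID|Load|Max_Power); the file name is a placeholder and whitespace handling is simplified.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PipeCsvByColumn {
    public static void main(String[] args) throws IOException {
        Map<String, List<String>> columns = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.csv"))) {
            String[] headerArray = reader.readLine().split("\\|");   // BUS|ID|Load|Max_Power
            for (String h : headerArray) {
                columns.put(h.trim(), new ArrayList<>());
            }
            String row;
            while ((row = reader.readLine()) != null) {
                String[] dataArray = row.split("\\|");
                for (int i = 0; i < dataArray.length; i++) {
                    // The header name at index i tells us which list this value belongs to.
                    columns.get(headerArray[i].trim()).add(dataArray[i].trim());
                }
            }
        }
        System.out.println(columns.get("Max_Power"));   // e.g. [10.9, 8.0]
    }
}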
You can use the com.csvreader.CsvReader class (available in javacsv.jar).
This class provides functionality to read a CSV file row by row.
This will serve your purpose.
Here is the sample code:
CsvReader csv = new CsvReader(fileName);
csv.readHeaders(); // headers
while (csv.readRecord()) {
    String value = csv.get(0);
}

What determines the number of reducers and how to avoid bottlenecks regarding reducers?

Suppose I have a big tsv file with this kind of information:
2012-09-22 00:00:01.0 249342258346881024 47268866 0 0 0 bo
2012-09-22 00:00:02.0 249342260934746115 1344951 0 0 4 ot
2012-09-22 00:00:02.0 249342261098336257 346095334 1 0 0 ot
2012-09-22 00:05:02.0 249342261500977152 254785340 0 1 0 ot
I want to implement a MapReduce job that enumerates time intervals of five minutes and filter some information of the tsv inputs. The output file would look like this:
0 47268866 bo
0 134495 ot
0 346095334 ot
1 254785340 ot
The key is the number of the interval; e.g., 0 refers to the interval from 2012-09-22 00:00:00.0 to 2012-09-22 00:04:59.
I don't know whether this problem doesn't fit the MapReduce approach or whether I'm just not thinking about it right. In the map function, I'm just passing the timestamp as key and the filtered information as value. In the reduce function, I count the intervals by using global variables and produce the output mentioned.
i. Does the framework determine the number of reducers automatically, or is it user defined? With one reducer, I think there is no problem with my approach, but I'm wondering whether a single reducer can become a bottleneck when dealing with really large files. Can it?
ii. How can I solve this problem with multiple reducers?
Any suggestions would be really appreciated!
Thanks in advance!
EDIT:
The first question is answered by #Olaf, but the second still gives me some doubts regarding parallelism. The map output of my map function is currently this (I'm just passing the timestamp with minute precision):
2012-09-22 00:00 47268866 bo
2012-09-22 00:00 344951 ot
2012-09-22 00:00 346095334 ot
2012-09-22 00:05 254785340 ot
So in the reduce function I receive inputs where the key represents the minute when the information was collected and the values are the information itself, and I want to enumerate five-minute intervals beginning with 0. I'm currently using a global variable to store the beginning of the interval, and when a key goes past it I increment the interval counter (also a global variable).
Here is the code:
private long stepRange = TimeUnit.MINUTES.toMillis(5);
private long stepInitialMillis = 0;
private int stepCounter = 0;

@Override
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    long millis = Long.valueOf(key.toString());
    if (stepInitialMillis == 0) {
        stepInitialMillis = millis;
    } else {
        if (millis - stepInitialMillis > stepRange) {
            stepCounter = stepCounter + 1;
            stepInitialMillis = millis;
        }
    }
    for (Text value : values) {
        context.write(new Text(String.valueOf(stepCounter)),
                new Text(key.toString() + "\t" + value));
    }
}
So, with multiple reducers, my reduce function would run on two or more nodes, in two or more JVMs, and I would lose the control given by the global variables; I can't think of a workaround for my case.
The number of reducers depends on the configuration of the cluster, although you can limit the number of reducers used by your MapReduce job.
A single reducer would indeed become a bottleneck in your MapReduce job if you are dealing with any significant amount of data.
The Hadoop MapReduce engine guarantees that all values associated with the same key are sent to the same reducer, so your approach should work with multiple reducers. See the Yahoo! tutorial for details: http://developer.yahoo.com/hadoop/tutorial/module4.html#listreducing
EDIT: To guarantee that all values for the same time interval go to the same reducer, you would have to use some unique identifier of the time interval as the key. You would have to do it in the mapper. I'm reading your question again and, unless you want to somehow aggregate the data between the records corresponding to the same time interval, you don't need any reducer at all.
EDIT: As #SeanOwen pointed out, the number of reducers depends on the configuration of the cluster. Usually it is configured to between 0.95 and 1.75 times the maximum number of reduce tasks per node times the number of data nodes. If the mapred.reduce.tasks value is not set in the cluster configuration, the default number of reducers is 1.
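For reference, with the Java API you can also set the reducer count explicitly in the driver; a minimal sketch (the job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class IntervalJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "five-minute-intervals");
        // 0 would make this a map-only job; otherwise pick a count suited to the cluster,
        // e.g. roughly 0.95-1.75 x (reduce slots per node x number of data nodes), as noted above.
        job.setNumReduceTasks(10);
        // ... set mapper, reducer, input/output formats and paths as usual, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}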
It looks like you want to aggregate some data by five-minute blocks. MapReduce with Hadoop works great for this sort of thing! There should be no reason to use any "global variables". Here is how I would set it up:
The mapper reads one line of the TSV. It grabs the timestamp, and computes which five-minute bucket it belongs in. Make that into a string, and emit it as the key, something like "20120922:0000", "20120922:0005", "20120922:0010", etc. As for the value that is emitted along with that key, just keep it simple to start with, and send on the whole tab-delimited line as another Text object.
Now that the mapper has determined how the data needs to be organized, it's the reducer's job to do the aggregation. Each reducer will get a key (one of the five-minute buckets), along with the list of all the lines that fit into that bucket. It can iterate over that list and extract whatever it wants from it, writing output to the context as needed.
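A minimal sketch of such a mapper, assuming the timestamp is the first tab-separated field in the format shown in the question; the class name is a placeholder.

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FiveMinuteBucketMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S");
    private final SimpleDateFormat out = new SimpleDateFormat("yyyyMMdd:HHmm");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String timestamp = line.split("\t", 2)[0];                    // e.g. "2012-09-22 00:05:02.0"
        try {
            Date d = in.parse(timestamp);
            long fiveMin = 5 * 60 * 1000L;
            Date bucket = new Date((d.getTime() / fiveMin) * fiveMin); // round down to the bucket start
            context.write(new Text(out.format(bucket)), value);        // key like "20120922:0005"
        } catch (ParseException e) {
            // skip malformed lines
        }
    }
}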
As for mappers, just let hadoop figure that part out. Set the number of reducers to how many nodes you have in the cluster, as a starting point. Should run just fine.
Hope this helps.
