Java Mapreduce - getting names of files with matches & printing to output file - java

Hi, I've been trying to come up with a modified version of the standard WordCount v1.0,
wherein I read all files from an input directory (args[0]) and print the output to an output directory (args[1]) consisting not just of the words and their number of occurrences, but also a list of the files in which each word occurred.
So for example I have 2 text files:
//1.txt
I love hadoop
and big data
//2.txt
I like programming
hate big data
The output would be:
//Output.txt
I 2 1.txt 2.txt
love 1 1.txt
hadoop 1 1.txt
and 1 1.txt
big 2 1.txt 2.txt
data 2 1.txt 2.txt
like 1 1.txt
programming 1 2.txt
hate 1 2.txt
At this stage I'm not sure how to extract the name of the file where the match occurred. Furthermore, I'm not sure how to store the file names - whether I should use a triple or nested maps, perhaps Map(K1, Map(K2, V))? I don't know which of these is possible in a MapReduce program, so any tips would be greatly appreciated.

Getting file names is generally not encouraged. Different input formats have different ways of doing this, and some of them may not provide such functionality at all.
Assuming that you are working with simple TextInputFormat, you can use mapper context to retrieve the split:
FileSplit split = (FileSplit)context.getInputSplit();
String filename = split.getPath().getName();
To produce the desired format, have the mapper emit pairs <Text(word), Text(filename)>. The reducer can then collect them into a Map<String (word), Set<String> (filenames)>. This assumes no combiner is used.
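A minimal sketch of that mapper and reducer, assuming the new org.apache.hadoop.mapreduce API, TextInputFormat, and simple whitespace tokenization (class names are made up):

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WordFileIndex {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Name of the file this split came from
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new Text(filename));
                }
            }
        }
    }

    public static class FileListReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;                        // total occurrences of the word
            Set<String> files = new TreeSet<>();  // distinct files it appeared in
            for (Text v : values) {
                count++;
                files.add(v.toString());
            }
            context.write(key, new Text(count + " " + String.join(" ", files)));
        }
    }
}

With the default TextOutputFormat this prints lines like big<TAB>2 1.txt 2.txt, where the count is the total number of occurrences across all files.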

Related

Mapreduce questions

I am trying to implement a MapReduce program to do word counts from 2 files, and then compare the word counts from these files to see what the most common words are...
I noticed that after doing the word count for file 1, the results go into the directory "/data/output1/", and there are 3 files inside:
- "_SUCCESS"
- "_logs"
- "part-r-00000"
The "part-r-00000" is the file that contains the results from file1 wordcount. How do I make my program read that particular file if the file name is generated in real-time without me knowing beforehand the filename?
Also, for the (key, value) pairs, I have added an identifier to the "value", so as to be able to identify which file and count that word belongs to.
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    Text newValue = new Text();
    newValue.set(value.toString() + "_f2");
    context.write(key, newValue);
}
At a later stage, how do I "remove" the identifier so that I can get back just the "value"?
Just point your next MR job at /data/output1/. It will read all three files as input, but _SUCCESS and _logs are both empty, so they'll have no effect on your program. They're written so that you can tell that the MR job writing to the directory has finished successfully.
If you want to implement word count over 2 different files, you can use the MultipleInputs class, which lets you apply a separate mapper to each input file in the same job. Refer to this link for an example of how to implement it: http://www.hadooptpoint.com/hadoop-multiple-input-files-example-in-mapreduce/. You define a separate mapper for each input file, so each mapper can add a different identifier to its output; when that output reaches the reducer, it can tell which mapper (and therefore which file) each value came from and process it accordingly. You can remove the identifier the same way you added it: for example, if mapper 1 tags its output values with a prefix like #1 and mapper 2 with #2, the reducer can use the prefix to identify which mapper the input came from and then simply strip that prefix off.
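A rough sketch of how that wiring could look with the Java API, assuming the inputs are the word<TAB>count outputs of the two earlier word-count jobs (class names, paths, and the "#1"/"#2" tags are illustrative, not part of the original answer):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompareWordCounts {

    // Mapper for file 1's word counts: tag each count with "#1".
    public static class File1Mapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t"); // word<TAB>count from job 1
            ctx.write(new Text(parts[0]), new Text("#1" + parts[1]));
        }
    }

    // Mapper for file 2's word counts: tag each count with "#2".
    public static class File2Mapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            ctx.write(new Text(parts[0]), new Text("#2" + parts[1]));
        }
    }

    // Reducer: strip the tag again and put the two counts side by side.
    public static class CompareReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long count1 = 0, count2 = 0;
            for (Text v : values) {
                String s = v.toString();
                long count = Long.parseLong(s.substring(2)); // drop the "#1"/"#2" tag
                if (s.startsWith("#1")) count1 = count; else count2 = count;
            }
            ctx.write(word, new Text(count1 + "\t" + count2));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compare word counts");
        job.setJarByClass(CompareWordCounts.class);
        // One mapper per input directory, both feeding the same reducer.
        MultipleInputs.addInputPath(job, new Path("/data/output1"), TextInputFormat.class, File1Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/output2"), TextInputFormat.class, File2Mapper.class);
        job.setReducerClass(CompareReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/output_compare"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}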
As for your other question about reading the output file: the output file names always follow a pattern. If you are using Hadoop 1.x the results are stored in files named part-00000 and onwards, and with Hadoop 2.x they are stored in files named part-r-00000 and onwards; if further output needs to be written to the same output path, it goes into part-r-00001 and so on. The other two files that are generated have no significance for the developer; they act more as markers for Hadoop itself.
Hope this solves your query. Please comment if the answer is not clear.

Count duplicates in huge text files collection

I have this collection of folders:
60G ./big_folder_6
52G ./big_folder_8
61G ./big_folder_7
60G ./big_folder_4
58G ./big_folder_5
63G ./big_folder_2
54G ./big_folder_9
61G ./big_folder_3
39G ./big_folder_10
74G ./big_folder_1
Each folder contains 100 txt files, with one sentence per line. For example, the file ./big_folder_6/001.txt:
sentence ..
sentence ..
...
Each file in a folder is between 4 and 6 GB (as you can see from the totals reported above), with roughly 40-60 million sentences. A single file fits in memory.
I need to deduplicate the sentences and count the globally unique ones, so as to obtain a new collection of files where the lines are counted:
count ...unique sentence...
The collection is huge.
My first implementation (using Java) was a "merge sort" approach: redistribute the lines into a new collection of 500 files (dispatching each line to the right file using its first N characters), then sort and aggregate the duplicates within each file.
I know it is a word-count MapReduce problem, but I would prefer to avoid it. The question is: am I using the right approach to solve this kind of problem, or should I consider other tools/approaches besides MapReduce?
Do you mean deleting duplicated lines within each file, or across all files?
In either case, you can't read a whole file at once; you need to read it line by line or a memory exception will be thrown. Use a BufferedReader (example here), and use a map that stores each string with the count of the repeated line as its value: when you read a line, put it in the map, incrementing the value if it already exists.
After reading the file, write all the lines and their counts to a new file and release the memory.
UPDATE 1
The problem is that you have a lot of gigabytes. You can't keep every line in memory because it may throw a memory exception, but at the same time you have to keep them in memory to quickly check whether they are duplicated. What comes to my mind is, instead of keeping the string itself as the key, to store a hash of the string (using String.hashCode()); when a hash is seen for the first time, write the line to the new file, flushing only every 100 lines or so to reduce the time spent writing to disk. After you have processed all the files, written the unique lines out, and are left with only integers in the map (the string's hash code as the key and the count as the value), you start reading the file containing only unique lines and create a new file writing each line together with its count.
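A rough plain-Java sketch of that idea (file names are placeholders; note that String.hashCode() collisions can silently merge distinct sentences, so this approximates rather than guarantees exact deduplication):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class HashDedupSketch {
    public static void main(String[] args) throws IOException {
        // hash code of a sentence -> how many times it has been seen
        Map<Integer, Integer> counts = new HashMap<>();
        Path uniqueLines = Paths.get("unique_lines.txt");

        try (BufferedWriter out = Files.newBufferedWriter(uniqueLines)) {
            for (String file : args) {                          // the input text files
                try (BufferedReader in = Files.newBufferedReader(Paths.get(file))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        int h = line.hashCode();
                        Integer seen = counts.merge(h, 1, Integer::sum);
                        if (seen == 1) {                        // first occurrence: keep the text
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
            }
        }

        // Second pass: re-read the unique lines and prepend their counts.
        try (BufferedReader in = Files.newBufferedReader(uniqueLines);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("counted_lines.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(counts.get(line.hashCode()) + " " + line);
                out.newLine();
            }
        }
    }
}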

Change a word that you don't know in a file using Java

Backstory:
I am working in LTspice, where I am creating a circuit with over 1000 resistors.
There are 9 different types of resistors. I need to change the value of each type of resistor, many times. I can do this manually, but I don't want to. The file is essentially a text file and can be read by a program like Notepad. The file type is .asc.
I was going to create a Java program to help me with this.
File Snippet:
SYMATTR InstName RiMC3
SYMATTR Value 0.01
SYMBOL res -1952 480 R90
WINDOW 0 0 56 VBottom 2
WINDOW 3 32 56 VTop 2
SYMATTR InstName RiMA3
SYMATTR Value 0.01
SYMBOL res -2336 160 R0
SYMATTR InstName ReC3
SYMATTR Value 8
Question:
How can I change a word that I don't know, in a file, when I know where it is relative to another word that I do know?
An example:
I know the word "RiMC3"; I need to change the 3rd word after this word to "0.02".
In the file snippet the value is "0.01", but this will not always be the case.
My Solution:
I need a place to start.
Is this kind of operation called something special? I have not found anything like this on Google.
If you want to do this programmatically, you need to think about the limitations and requirements.
We don't know exactly how you want to do this, or in what context. But you can write this out on paper, in English, to give you a place to start.
For example, if we are going to make a standalone Java program (or class) to do this, and given simple line-oriented text, a naive approach might be:
Open the file for read
Open a file for write
Scan the file line by line
For each line:
    Match the pattern or regular expression you are looking for and, if it matches, modify the line in memory
    Write out the possibly modified line to the output file
Finish up:
    Close the files
    Rename the output file to the input file
Buffering, error handling, application domain specifics are left as an exercise for the reader.
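A minimal sketch of that recipe for the .asc snippet above, assuming the value to change always sits on the "SYMATTR Value" line that follows the matching "SYMATTR InstName" line (file and variable names are made up):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ChangeResistorValue {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("circuit.asc");      // assumed input file
        Path output = Paths.get("circuit.asc.tmp"); // temporary output file

        String target = "RiMC3";   // the instance name we know
        String newValue = "0.02";  // the value we want to set

        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter out = Files.newBufferedWriter(output)) {
            String line;
            boolean valueLineIsNext = false;
            while ((line = in.readLine()) != null) {
                if (valueLineIsNext && line.trim().startsWith("SYMATTR Value")) {
                    line = "SYMATTR Value " + newValue;   // the modified line
                    valueLineIsNext = false;
                } else if (line.trim().equals("SYMATTR InstName " + target)) {
                    valueLineIsNext = true;               // the next Value line belongs to it
                }
                out.write(line);
                out.newLine();
            }
        }

        // Replace the original file with the rewritten one.
        Files.move(output, input, StandardCopyOption.REPLACE_EXISTING);
    }
}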

Construct document-term matrix via Java and MapReduce

Background:
I’m trying to make a “document-term” matrix in Java on Hadoop using MapReduce. A document-term matrix is like a huge table where each row represents a document and each column represents a possible word/term.
Problem Statement:
Assuming that I already have a term index list (so that I know which term is associated with which column number), what is the best way to look up the index for each term in each document so that I can build the matrix row-by-row (i.e., document-by-document)?
So far I can think of two approaches:
Approach #1:
Store the term index list on the Hadoop distributed file system. Each time a mapper reads a new document for indexing, spawn a new MapReduce job -- one job for each unique word in the document, where each job queries the distributed terms list for its term. This approach sounds like overkill, since I am guessing there is some overhead associated with starting up a new job, and since this approach might call for tens of millions of jobs. Also, I’m not sure if it’s possible to call a MapReduce job within another MapReduce job.
Approach #2:
Append the term index list to each document so that each mapper ends up with a local copy of the term index list. This approach is pretty wasteful with storage (there will be as many copies of the term index list as there are documents). Also, I’m not sure how to merge the term index list with each document -- would I merge them in a mapper or in a reducer?
Question Update 1
Input File Format:
The input file will be a CSV (comma separated value) file containing all of the documents (product reviews). There is no column header in the file, but the values for each review appear in the following order: product_id, review_id, review, stars. Below is a fake example:
"Product A","1","Product A is very, very expensive.","2"
"Product G","2","Awesome product!!","5"
Term Index File Format:
Each line in the term index file consists of the following: an index number, a tab, and then a word. Each possible word is listed only once in the index file, such that the term index file is analogous to what could be a list of primary keys (the words) for an SQL table. For each word in a particular document, my tentative plan is to iterate through each line of the term index file until I find the word. The column number for that word is then defined as the column/term index associated with that word. Below is an example of the term index file, which was constructed using the two example product reviews mentioned earlier.
1 awesome
2 product
3 a
4 is
5 very
6 expensive
Output File Format:
I would like the output to be in the “Matrix Market” (MM) format, which is the industry standard for compressing matrices with many zeros. This is the ideal format because most reviews will contain only a small proportion of all possible words, so for a particular document it is only necessary to specify the non-zero columns.
The first row in the MM format has three tab separated values: the total number of documents, the total number of word columns, and the total number of lines in the MM file excluding the header. After the header, each additional row contains the matrix coordinates associated with a particular entry, and the value of the entry, in this order: reviewID, wordColumnID, entry (how many times this word appears in the review). For more details on the Matrix Market format, see this link: http://math.nist.gov/MatrixMarket/formats.html.
Each review’s ID will equal its row index in the document-term matrix. This way I can preserve the review’s ID in the Matrix Market format so that I can still associate each review with its star rating. My ultimate goal -- which is beyond the scope of this question -- is to build a natural language processing algorithm to predict the number of stars in a new review based on its text.
Using the example above, the final output file would look like this (I can't get Stackoverflow to show tabs instead of spaces):
2 6 7
1 2 1
1 3 1
1 4 1
1 5 2
1 6 1
2 1 1
2 2 1
Well, you can use something analogous to the inverted index concept.
I'm suggesting this because I'm assuming both files are big, so comparing them against each other one-to-one would be a real performance bottleneck.
Here's a way that can be used -
You can feed both the Input File Format CSV file(s) (say, datafile1, datafile2) and the term index file (say, term_index_file) as input to your job.
Then in each mapper, you filter on the source file name, something like this -
Pseudo code for mapper -
map(key, row, context) {
    String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    if (filename.startsWith("datafile")) {
        // split the review_id and the words out of the row
        ....
        context.write(new Text(word), new Text("-1" + "|" + review_id));
    } else if (filename.startsWith("term_index_file")) {
        // split the index and the word
        ....
        context.write(new Text(word), new Text(index + "|" + "0"));
    }
}
e.g. output from different mappers
Key Value source
product -1|1 datafile
very 5|0 term_index_file
very -1|1 datafile
product -1|2 datafile
very -1|1 datafile
product 2|0 term_index_file
...
...
Explanation (the example):
As the example clearly shows, the key will be your word and the value will be made of two parts separated by the delimiter "|".
If the source is a datafile, then you emit key=product and value=-1|1, where -1 is a dummy element and 1 is a review_id.
If the source is the term_index_file, then you emit key=product and value=2|0, where 2 is the index of the word 'product' and 0 is a dummy review_id, which we will use for sorting - explained later.
Definitely, no duplicate index will be processed by two different mappers if we provide the term_index_file as a normal input file to the job.
So 'product', 'very', or any other indexed word in the term_index_file will only be available to one mapper. Note that this is only valid for the term_index_file, not the datafile.
Next step:
The Hadoop MapReduce framework, as you might well know, will group values by key.
So you will have something like this going to different reducers:
reduce-1: key=product, value=<-1|1, -1|2, 2|0>
reduce-2: key=very, value=<5|0, -1|1, -1|1>
But we have a problem in the above case: we want the values sorted by the part after the '|', i.e. in reduce-1 -> <2|0, -1|1, -1|2> and in reduce-2 -> <5|0, -1|1, -1|1>.
To achieve that you can use a secondary sort implemented with a sort comparator. Please Google this; here's a link that might help, since explaining it here would get quite lengthy.
In reduce-1, since the values are sorted as above, the first iteration gives us the value containing '0' and with it index_id=2, which can then be used in the subsequent iterations. In the next two iterations we get review ids 1 and 2 consecutively, and we use a counter to keep track of repeated review ids; a repeated review id means the word appeared more than once in the same review_id's row. We reset the counter only when we find a different review_id, and emit the previous review_id's details for that particular index_id, something like this -
previous_review_id + "\t" + index_id + "\t" + count
When the loop ends, we'll be left with a single previous_review_id, which we finally emit in the same fashion.
Pseudo code for reducer -
reduce(key, Iterable<Text> values, context) {
    String index_id = null;
    int count = 1;
    String previousReview_id = null;
    for (Text value : values) {
        String[] split = value.toString().split("\\|");
        ....
        // when consecutive review_ids are the same, we increment the count;
        // as soon as the review_id differs, we emit the previously detected
        // review_id, then reset the counter.
        if (split[0].equals("-1") && split[1].equals(previousReview_id)) {
            count++;
        } else if (split[0].equals("-1") && !split[1].equals(previousReview_id)) {
            if (previousReview_id != null) {   // nothing to emit before the first review_id
                context.write(previousReview_id + "\t" + index_id + "\t" + count);
            }
            previousReview_id = split[1]; // restarting with the new review_id
            count = 1;                    // resetting the count for the new review_id
        } else {
            index_id = split[0];
        }
    }
    // the last previousReview_id would otherwise be left out,
    // so write it now, after the loop completes
    context.write(previousReview_id + "\t" + index_id + "\t" + count);
}
This job is run with multiple reducers in order to leverage Hadoop for what it is best known for - performance - and as a result the final output will be scattered, something like the following, deviating from your desired output.
1 4 1
2 1 1
1 5 2
1 2 1
1 3 1
1 6 1
2 2 1
But if you want everything sorted according to the review_id (as in your desired output), you can write one more job that does that for you, using a single reducer and the output of the previous job as its input, and at the same time calculate the header line 2 6 7 and put it at the front of the output.
This is just one approach (or an idea) that I think might help you. You will definitely want to modify it, plug in a better algorithm, and use it in whatever way benefits you most.
You can also use composite keys for better clarity, rather than a delimiter such as "|".
I am open to any clarification. Please ask if you think it might be useful to you.
Thank you!
You can load the term index list into the Hadoop distributed cache so that it is available to the mappers and reducers. For instance, in Hadoop Streaming, you can run your job as follows:
$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapper.py \
-reducer myReducer.py \
-file myMapper.py \
-file myReducer.py \
-file myTermIndexList.txt
Now in myMapper.py you can load the file myTermIndexList.txt and use it for your purposes. If you give a more detailed description of your input and desired output, I can give you more details.
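Since the question is about Java, the equivalent with the Java API might look roughly like this, assuming Hadoop 2.x and its Job.addCacheFile method (class names, paths, and the "#terms" symlink name are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TermIndexCacheExample {

    // In the driver, before submitting the job:
    //   job.addCacheFile(new URI("/user/me/myTermIndexList.txt#terms"));

    public static class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, Integer> termIndex = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached file is available in the task's working directory
            // under the symlink name given after '#'.
            try (BufferedReader reader = new BufferedReader(new FileReader("terms"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t");   // "index<TAB>word"
                    termIndex.put(parts[1], Integer.valueOf(parts[0]));
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... look each term up in termIndex while building the matrix entries ...
        }
    }
}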
Approach #1 is not good, but it is very common if you don't have much Hadoop experience. Starting jobs is very expensive. What you want instead is 2-3 jobs that feed into each other to get the desired result. A common solution to similar problems is to have the mapper tokenize the input and emit pairs, group them in the reducer while executing some kind of calculation, and then feed that into job 2. In the mapper of job 2 you invert the data in some way, and in its reducer you do some other calculation.
I would highly recommend learning more about Hadoop through a training course. Interestingly, Cloudera's dev course has a very similar problem to the one you are trying to address. Alternatively, or perhaps in addition to a course, I would look at "Data-Intensive Text Processing with MapReduce", specifically the sections on "Computing Relative Frequencies" and "Inverted Indexing for Text Retrieval":
http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

csv rows into separate txt files?

Task 1: Read each row from one csv file into one separate txt file.
Task 2: The reverse: in one folder, read the text from each txt file and put it into a row of a single csv; in other words, read all txt files into one csv file.
How would you do this? Would Java or Python be a good choice to get this task done quickly?
Update:
For Java, there are already some quite useful libraries you can use, for example opencsv or javacsv. But if you have no knowledge of CSV, better have a look at the Wikipedia article on CSV first. And this post tells you about all the possibilities in Java.
Note: due to the simplicity of the question, someone presumed this was homework. I hereby declare it is not.
More background: I am working on my own machine learning experiments and setting up a large-scale test set. I need crawling, scraping, and file type conversion as basic utilities for the experiment. I'm building a lot of things by myself for now, and I suddenly want to learn Python because of some recent discoveries and the feeling that Python is more concise than Java for many parsing and file handling situations. Hence this question.
I just want to save time for both you and me by getting to the gist without stating the not-so-related background. My question is really more about the second part, "Java vs Python": I ran into a few lines of Python code using some csv library (not sure which, that's why I asked), but I just don't know how to use Python. Those are all the reasons why I asked this question. Thanks.
From what you write, there is little need to use anything specific to CSV files. In particular for Task 1, this is a pure data I/O operation on text files. In Python, for instance:
for i, l in enumerate(open(the_file)):
    f = open('new_file_%i.csv' % i, 'w')
    f.write(l)
    f.close()
For Task 2, if you can guarantee that each file has the same structure (same number of fields per row) it is again a pure data I/O operation:
from glob import glob

# glob files
files = glob('file_*.csv')
target = open('combined.csv', 'w')
for f in files:
    target.write(open(f).read())
    target.write(new_line_separator_for_your_platform)
target.close()
Whether you do this in Java or Python depends on the availability on the target system and your personal preference only.
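If you do go with Java, a plain-Java sketch of the two tasks (no CSV library) might look like the following; the file names, and joining a txt file's lines with commas to form a row, are assumptions you would adjust to your data:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class CsvSplitJoin {

    // Task 1: write each row of the CSV into its own txt file.
    static void splitRows(Path csv) throws IOException {
        List<String> rows = Files.readAllLines(csv);
        for (int i = 0; i < rows.size(); i++) {
            Files.write(Paths.get("row_" + i + ".txt"),
                        Collections.singletonList(rows.get(i)));
        }
    }

    // Task 2: concatenate every txt file in a directory into one CSV, one row per file
    // (joining each file's lines with commas; adjust to your row format).
    static void joinFiles(Path dir, Path csv) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(csv);
             DirectoryStream<Path> txts = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path txt : txts) {
                out.write(String.join(",", Files.readAllLines(txt)));
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        splitRows(Paths.get("input.csv"));
        joinFiles(Paths.get("."), Paths.get("combined.csv"));
    }
}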
In that case I would use Python, since it is often more concise than Java.
Plus, CSV files are really easy to handle with Python without installing anything; I don't know about Java.
Task 1
It would roughly be this based on an example from the official documentation:
import csv

with open('some.csv', 'r') as f:
    reader = csv.reader(f)
    rownumber = 0
    for row in reader:
        g = open("anyfile" + str(rownumber) + ".txt", "w")
        g.write(','.join(row))  # row is a list of fields; join it back into one line
        rownumber = rownumber + 1
        g.close()
Task 2
f = open("csvfile.csv","w")
dirList=os.listdir(path)
for fname in dirList:
if fname[-4::] == ".txt":
g = open("fname")
for line in g: f.write(line)
g.close
f.close()
In python,
Task 1:
import csv

with open('file.csv', 'rb') as df:
    reader = csv.reader(df)
    for rownumber, row in enumerate(reader):
        with open(str(rownumber) + '.txt', 'w') as f:
            f.write(','.join(row))
Task 2:
from glob import glob

with open('output.csv', 'wb') as output:
    for f in glob('*.txt'):
        with open(f) as myFile:
            rows = myFile.readlines()
            output.writelines(rows)
You will need to adjust these for your use cases.
