Bag of Words for each line - java

I currently have a huge CSV file which contains Reddit post titles, and I would like to create a feature vector for each post.
Suppose the post title is "to be or not to be" and it belongs to "some_category".
The CSV file is in the following format:
"some_category1", "some title1"
"some_category2", "some title2"
"some_category1", "some title3"
I would like to create a feature vector as below.
"some_category" : to(2) be(2) or(1) not(1).
I need to do this whole thing on Hadoop. I am stuck at the very first step: how do I convert each line into a feature vector? (I feel it's similar to word count, but how do I apply it to each line?)
My initial thought was that the key for each line (i.e., each post's title and category) should be the category of the post, and the value should be the feature vector of the title (i.e., the word counts of the title).
Any help on how to approach this problem is appreciated.

To answer your first part:
Reading a CSV line by line in Hadoop has been answered in this post: StackOverflow: how-to-read-first-line-in-hadoop-hdfs-file-efficiently-using-java.
Just change the last line to:
final Scanner sc = new Scanner(input);
while (sc.hasNextLine()) {
//doStuff with sc.nextLine()!
}
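For context, here is a minimal, self-contained sketch of the surrounding HDFS read; the path is a placeholder for your own file:
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLineReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "/user/you/posts.csv" is a placeholder path
        try (FSDataInputStream input = fs.open(new Path("/user/you/posts.csv"));
             Scanner sc = new Scanner(input)) {
            while (sc.hasNextLine()) {
                String line = sc.nextLine();
                // doStuff with line
            }
        }
    }
}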
To create a feature vector, I would use your mentioned counting strategy:
/**
 * We will use Java 8 style to do that easily:
 * 0) Split each line on whitespace: split("\\s+")
 * 1) Create a stream: Arrays.stream(array)
 * 2) Collect the input (.collect) and group every identical word (Function.identity())
 *    to its corresponding count (Collectors.counting())
 *
 * Requires java.util.Arrays, java.util.Map, java.util.function.Function and java.util.stream.Collectors.
 *
 * @param title the right-hand side after the comma
 * @return a map mapping each word to its count
 **/
private Map<String, Long> createFeatureVectorForTitle(String title) {
    return Arrays.stream(title.split("\\s+"))
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
Your idea for keying each category to the created feature vector sounds legit. Although I'm not too familiar with Hadoop, perhaps somebody can point out a better solution.
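If it helps, a rough sketch of what such a mapper could look like (the names are illustrative; it assumes the createFeatureVectorForTitle method above sits in the same class, and that titles contain no embedded commas):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TitleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Naive CSV split: the first comma separates category from title.
        String[] parts = value.toString().split(",", 2);
        if (parts.length < 2) {
            return;
        }
        String category = parts[0].replace("\"", "").trim();
        String title = parts[1].replace("\"", "").trim();
        // Key each record by its category; the value is the title's word-count map.
        context.write(new Text(category), new Text(createFeatureVectorForTitle(title).toString()));
    }
}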

I solved it using two MapReduce jobs and adding an index to make each row unique to process:
1, "some_category1", "some title1"
2, "some_category2", "some title2"
3, "some_category1", "some title3"
The output of the first MapReduce job:
"1, some_category" to 2
"1, some_category" be 2
"1, some_category" or 1
"1, some_category" not 1
where the index and category together form the key, and the values are the words in the title with their counts.
The second MapReduce job produces the final output in this format:
"some_category" : to(2) be(2) or(1) not(1).

Related

A pair of strings as a KEY in reduce function - HADOOP

Hello, I am implementing a Facebook-like program in Java using the Hadoop framework (I am new to this). The main idea is that I have an input .txt file like this:
Christina Bill,James,Nick,Jessica
James Christina,Mary,Toby,Nick
...
The first name is the user, and the comma-separated names are his friends.
In the map function I scan each line of the file and emit the user with each one of his friends, like
Christina Bill
Christina James
which will be converted to (Christina,[Bill,James,..])...
BUT the description of my assignment specifies that the reduce function will receive as key a tuple of two users, followed by both their friend lists; you count the common friends, and if that number is equal to or greater than a set number, like 5, you can safely assume that their uncommon friends can be suggested. So how exactly do I pass a pair of users to the reduce function? I thought the input of the reduce function has to be the same as the output of the map function. I started coding this, but I don't think this is the right approach. Any ideas?
public class ReduceFunction<KEY> extends Reducer<KEY,Text,KEY,Text> {
    private Text suggestedFriend = new Text();

    public void reduce(KEY key1, KEY key2, Iterable<Text> value1, Iterable<Text> value2, Context context){
    }
}
The output of the map phase should, indeed, be of the same type as the input of the reduce phase. This means that, if there is a requirement for the input of the reduce phase, you have to change your mapper.
The idea is simple:
map(user u, friends F):
    for each f in F do
        emit (u-f, F\f)
reduce(userPair u1-u2, friends F1, F2):
    #commonFriends = |F1 intersection F2|
To implement this logic, you can just use a Text key, in which you concatenate the names of the users, using, e.g., the '-' character between them.
Note that in each reduce method, you will only receive two lists of friends, assuming that each user appears once in your input data. Then, you only have to compare the two lists for common names of friends.
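If it's useful, here is a rough Java sketch of that mapper; it assumes one user per line followed by a space and a comma-separated friend list, and (for brevity) emits the full friend list F rather than F\f. Note the names in the pair key are put in sorted order, so that (A,B) and (B,A) land on the same reducer:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FriendPairMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Line format: "Christina Bill,James,Nick,Jessica"
        String[] parts = value.toString().split("\\s+", 2);
        if (parts.length < 2) {
            return;
        }
        String user = parts[0];
        for (String friend : parts[1].split(",")) {
            // Canonical ordering: both users' emissions meet at the same key.
            String pair = user.compareTo(friend) < 0
                    ? user + "-" + friend
                    : friend + "-" + user;
            context.write(new Text(pair), new Text(parts[1]));
        }
    }
}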
Check if you can implement a custom record reader that reads two records at once from the input file in the mapper class, and then emit context.write(outkey, NullWritable.get()); from the mapper class. In the reducer class you then need to handle the two records that arrive as the key (outkey) from the mapper class. Good luck!

How to do a counter of specific words in Java?

Hey guys, I am developing a project where I have 4 questions that someone can rate as great, good, regular, or poor. Afterwards I need to check, for each of the 4 questions, how many people voted great, how many voted good, regular, and poor. So I would like to scan the .txt file and count how many times the words (great, good, regular, and poor) appear in it. I was trying to do it like in Python, where you only need a dictionary (or a counter) and can simply do something like:
dict["great"] += 1
However, that isn't possible in Java. Does anyone know a method similar to this one in Java, or another way to do it simply (without having to create a lot of variables to save each question's answers)?
Thank you very much for your help.
In Java 8 the compute method was added to the Map interface. It may be a bit more complicated than in Python, but it's probably the closest it gets to the Python code:
Map<String, Integer> map = new HashMap<>();
String rating = ...
map.compute(rating, (key, oldValue) -> ((oldValue == null) ? 1 : oldValue+1));
The lambda expression passed as the second parameter to compute receives the old value the key was mapped to, or null if there was no mapping.
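For instance, counting all four ratings from a list of answers; a self-contained example:
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RatingCounter {
    public static void main(String[] args) {
        List<String> answers = Arrays.asList("great", "good", "great", "poor");
        Map<String, Integer> counts = new HashMap<>();
        for (String rating : answers) {
            // Maps a new key to 1, otherwise increments the old value.
            counts.compute(rating, (key, oldValue) -> (oldValue == null) ? 1 : oldValue + 1);
        }
        System.out.println(counts); // e.g. {good=1, great=2, poor=1} (order not guaranteed)
    }
}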
This is 100% possible in Java.
Use a HashMap to store the values.
For example:
HashMap<String, Integer> counts = new HashMap<>();
counts.put("great", 0);
counts.put("good", 0);
counts.put("regular", 0);
counts.put("poor", 0);
Now, suppose you read in a string input.
To increase the counter, do:
counts.put(input, counts.get(input) + 1);
This will increase the counter in that position by 1.
Use counts.get(input) to get the counter for the input string.
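Note that if the keys are not pre-initialized as above, counts.get(input) returns null and the increment throws a NullPointerException; since Java 8, getOrDefault sidesteps that:
counts.put(input, counts.getOrDefault(input, 0) + 1);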

Parsing and manipulating oddly formatted data, whilst maintaining formatting

I'm a pretty newbie programmer, and basically I'm trying to parse and manipulate a DL_POLY CONFIG file, which has the layout
CONFIG file created from Xmol file config.xmol
2 3 10000000 0.5000000000E-03
31.309729731729 0.000000000000 0.000000000000
0.000000000000 31.309729731729 0.000000000000
0.000000000000 0.000000000000 31.309729731729
Ca 1
6.421269411 -1.034199034 1.228702751
-1.06475894897 1.10274459622 1.31459311620
-6319.67959205 -10299.4183311 468.606019012
which goes on like that for about 150-odd more entries of just the
Ca 1
6.421269411 -1.034199034 1.228702751
-1.06475894897 1.10274459622 1.31459311620
-6319.67959205 -10299.4183311 468.606019012
segment, where the second row holds the x, y and z coordinates, to which I need to add a slight displacement, and in the top row Ca identifies the atom (in this instance, calcium) and the integer is an atom counter (this is the first atom; I have a system of about 75 CaCO3).
Now I've written some Java code which reads in the strings, sticks them in an ArrayList, and tokenises them, and from there I'm pretty sure how to add the displacement; only maintaining this weird formatting complicates it all. Obviously I'm aiming for as general a solution as I can get here, so I can reuse it; whilst I'm sure I could force the output into the correct format, that would mean I could only ever use it for this one file.
So, my questions are: how can I manipulate values in a file in Java while keeping the format 100% intact? And within this system, how can I tell it to add the displacement only on the second row of each segment?
It's a bit complicated (or maybe not, I really don't know) but I would really appreciate some help.
So, I've got something like this:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;
import java.util.ArrayList;
import java.io.FileReader;
public class testArrayReader {
    static ArrayList<String> temp = new ArrayList<String>();

    public static void main(String[] args) {
        String[] arr = null;
        String[][] twodim = null;
        System.out.println("Array List initialised!");
        try {
            FileReader input = new FileReader(urlfortextfile);
            BufferedReader reader = new BufferedReader(input);
            System.out.println("Scanned!");
            String line;
            int onedimcounter = 0;
            while ((line = reader.readLine()) != null) {
                temp.add(onedimcounter++, line);
            }
            System.out.println(temp);
            twodim = temp.toArray(new String[temp.size()][temp.get(0).length()]);
            System.out.println("stage 2 complete");
            System.out.println(twodim);
        }
        catch (FileNotFoundException ex) {
            System.out.println("No file found boss.");
        }
        catch (IOException ex) {
            System.out.println("IO error.");
        }
    }
}
A few more queries:
1) [1st line, 2nd line, ..., nth line] - the comma denotes that the first and second line are separate elements, right?
2) I'm getting an ArrayStoreException and I'm really not 100% sure why - the documentation mentioned something about a casting error, so I'm assuming my ArrayList items are still stuck as Objects. How do I fix this?
3) My current plan for the modification is to locate the element index in the final array, modify it, and reprint, but I've chunked the file line by line to preserve the formatting. I need a bit of confirmation that I'm on the right track here: my idea is to parse the line for doubles, do what I need to do, then have the program count the number of whitespace characters between digits and rebuild the string, which I can then reinsert. Something like a counter with an if statement based on a boolean looking for whitespace, so the counter tells me how many " " to insert when I concatenate the final string.
Cheers.
First, parse the file into a table of values with associated position-in-file metadata.
Second, implement all mutations on that table in terms of atomic duplication/insertion/removal of cells/rows/columns, operations which also update the position-in-file metadata.
Third, implement a table serialize operator which takes in the old content, so that you can look up the whitespace between data lines and between cells within a line, and so you can deduce the number format (number of significant digits) from the old file when serializing changed numeric values.
How do I find and parse the position-in-file metadata?
To associate position information, keep track of
/** Number of line breaks since start of file */
int lineNumber;
/** Number of chars since start of file */
int charInFile;
/** Number of chars since start of line (if on the zero-th line) or last line break. */
int charInLine;
Then with each token, associate the position before the first character, and the position after the last character in the token.
When you parse a complex construct like a table, table row, or table cell, store with it the position before the first token that it spans, and the position after the last token it spans.
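As an illustration, a minimal sketch of what such a position-carrying token could look like (the names are mine, not from any library):
/** A point in the file, tracked while scanning character by character. */
class FilePosition {
    int lineNumber;  // number of line breaks since start of file
    int charInFile;  // number of chars since start of file
    int charInLine;  // number of chars since start of line or last line break

    FilePosition(int lineNumber, int charInFile, int charInLine) {
        this.lineNumber = lineNumber;
        this.charInFile = charInFile;
        this.charInLine = charInLine;
    }
}

/** A token plus the positions just before and just after it. */
class Token {
    String text;
    FilePosition start; // position before the first character
    FilePosition end;   // position after the last character

    Token(String text, FilePosition start, FilePosition end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}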
What's a table serialize operator? I know of serialization, just not that.
By operator, I just mean a part of a programming language that allows you to specify a relation between inputs and outputs. I use it to avoid language-specific jargon like function, method, or procedure.

Construct document-term matrix via Java and MapReduce

Background:
I’m trying to make a “document-term” matrix in Java on Hadoop using MapReduce. A document-term matrix is like a huge table where each row represents a document and each column represents a possible word/term.
Problem Statement:
Assuming that I already have a term index list (so that I know which term is associated with which column number), what is the best way to look up the index for each term in each document so that I can build the matrix row-by-row (i.e., document-by-document)?
So far I can think of two approaches:
Approach #1:
Store the term index list on the Hadoop distributed file system. Each time a mapper reads a new document for indexing, spawn a new MapReduce job -- one job for each unique word in the document, where each job queries the distributed terms list for its term. This approach sounds like overkill, since I am guessing there is some overhead associated with starting up a new job, and since this approach might call for tens of millions of jobs. Also, I’m not sure if it’s possible to call a MapReduce job within another MapReduce job.
Approach #2:
Append the term index list to each document so that each mapper ends up with a local copy of the term index list. This approach is pretty wasteful with storage (there will be as many copies of the term index list as there are documents). Also, I’m not sure how to merge the term index list with each document -- would I merge them in a mapper or in a reducer?
Question Update 1
Input File Format:
The input file will be a CSV (comma separated value) file containing all of the documents (product reviews). There is no column header in the file, but the values for each review appear in the following order: product_id, review_id, review, stars. Below is a fake example:
"Product A","1","Product A is very, very expensive.","2"
"Product G","2","Awesome product!!","5"
Term Index File Format:
Each line in the term index file consists of the following: an index number, a tab, and then a word. Each possible word is listed only once in the index file, such that the term index file is analogous to what could be a list of primary keys (the words) for an SQL table. For each word in a particular document, my tentative plan is to iterate through each line of the term index file until I find the word. The column number for that word is then defined as the column/term index associated with that word. Below is an example of the term index file, which was constructed using the two example product reviews mentioned earlier.
1 awesome
2 product
3 a
4 is
5 very
6 expensive
Output File Format:
I would like the output to be in the "Matrix Market" (MM) format, an industry-standard format for storing sparse matrices (matrices with many zeros). This is the ideal format because most reviews will contain only a small proportion of all possible words, so for a particular document it is only necessary to specify the non-zero columns.
The first row in the MM format has three tab separated values: the total number of documents, the total number of word columns, and the total number of lines in the MM file excluding the header. After the header, each additional row contains the matrix coordinates associated with a particular entry, and the value of the entry, in this order: reviewID, wordColumnID, entry (how many times this word appears in the review). For more details on the Matrix Market format, see this link: http://math.nist.gov/MatrixMarket/formats.html.
Each review’s ID will equal its row index in the document-term matrix. This way I can preserve the review’s ID in the Matrix Market format so that I can still associate each review with its star rating. My ultimate goal -- which is beyond the scope of this question -- is to build a natural language processing algorithm to predict the number of stars in a new review based on its text.
Using the example above, the final output file would look like this (I can't get Stackoverflow to show tabs instead of spaces):
2 6 7
1 2 1
1 3 1
1 4 1
1 5 2
1 6 1
2 1 1
2 2 1
Well, you can use something analogous to an inverted index concept.
I'm suggesting this because I'm assuming both files are big, so comparing them one-to-one would be a real performance bottleneck.
Here's a way that can be used -
You can feed both the Input File Format csv file(s) (say, datafile1, datafile2) and the term index file (say, term_index_file) as input to your job.
Then, in each mapper, you filter on the source file name, something like this -
Pseudo code for mapper -
map(key, row, context) {
    String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    if (filename.startsWith("datafile")) {
        // split the review_id and words from the row
        ....
        context.write(new Text(word), new Text("-1" + "|" + review_id));
    } else if (filename.startsWith("term_index_file")) {
        // split index and word
        ....
        context.write(new Text(word), new Text(index + "|" + "0"));
    }
}
e.g. output from different mappers
Key Value source
product -1|1 datafile
very 5|0 term_index_file
very -1|1 datafile
product -1|2 datafile
very -1|1 datafile
product 2|0 term_index_file
...
...
Explanation (the example):
As the example shows, the key will be your word and the value will be made of two parts separated by the delimiter "|".
If the source is a datafile, then you emit key=product and value=-1|1, where -1 is a dummy element and 1 is the review_id.
If the source is the term_index_file, then you emit key=product and value=2|0, where 2 is the index of the word 'product' and 0 is a dummy review_id, which we will use for sorting - explained later.
Definitely, no duplicate index will be processed by two different mappers, since we are providing the term_index_file as a normal input file to the job.
So 'product', 'very', or any other indexed word in the term_index_file will only be available to one mapper. Note this is only valid for the term_index_file, not the datafile.
Next step:
The Hadoop MapReduce framework, as you might well know, will group values by key.
So you will have something like this going to different reducers:
reduce-1: key=product, value=<-1|1, -1|2, 2|0>
reduce-2: key=very, value=<5|0, -1|1, -1|1>
But we have a problem in the above case: we want the values sorted by the part after the '|', i.e., in reduce-1 -> <2|0, -1|1, -1|2> and in reduce-2 -> <5|0, -1|1, -1|1>.
To achieve that you can use a secondary sort, implemented with a sort comparator. Please google for this; explaining it here would get really lengthy, but here's a link that might help.
In reduce-1, since the values are sorted as above, the first iteration yields the '0' record, and with it index_id=2, which can then be used in the subsequent iterations. In the next two iterations we get review ids 1 and 2 consecutively, and we use a counter to keep track of repeated review ids. A repeated review id means the word appeared more than once in the same review_id's row. We reset the counter only when we find a different review_id, and at that point we emit the previous review_id's details for the particular index_id, something like this -
previous_review_id + "\t" + index_id + "\t" + count
When the loop ends, we'll be left with a single previous_review_id, which we finally emit in the same fashion.
Pseudo code for reducer -
reduce(key, Iterable values, context) {
    String index_id = null;
    int count = 1;
    String previousReview_id = null;
    for (value : values) {
        String[] split = value.split("\\|");
        ....
        // when consecutive review_ids are the same, we increment count,
        // and as soon as the review_id differs, we emit the previous
        // review_id detected, then reset the counter.
        if (split[0].equals("-1") && split[1].equals(previousReview_id)) {
            count++;
        } else if (split[0].equals("-1") && !split[1].equals(previousReview_id)) {
            if (previousReview_id != null) {
                context.write(previousReview_id + "\t" + index_id + "\t" + count);
            }
            previousReview_id = split[1]; // resetting with the new review_id
            count = 1; // resetting count for the new review_id
        } else {
            index_id = split[0];
        }
    }
    // the last previousReview_id will be left out,
    // so write it now, after the loop completes
    context.write(previousReview_id + "\t" + index_id + "\t" + count);
}
This job is done with multiple reducers in order to leverage Hadoop for what it is best known for - performance. As a result, the final output will be scattered, something like the following, deviating from your desired output.
1 4 1
2 1 1
1 5 2
1 2 1
1 3 1
1 6 1
2 2 1
But if you want everything sorted by review_id (as in your desired output), you can write one more job that does that for you using a single reducer, with the output of the previous job as its input, and at the same time calculate the header line 2 6 7 and put it at the front of the output.
This is just an approach (or an idea) that I think might help you. You will definitely want to modify it, plug in a better algorithm, and use it in the way you think will benefit you.
You can also use Composite keys for better clarity than using a delimiter such as "|".
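For illustration, a rough sketch of such a composite key, assuming the word-plus-sort-field layout described above (you would still need a partitioner and grouping comparator that look only at the word, so all records for a word meet in one reduce call):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: the word plus a sort field (0 for the term-index record,
// otherwise the review id), replacing the "index|review_id" string trick.
public class WordReviewKey implements WritableComparable<WordReviewKey> {
    private String word;
    private long sortField;

    public WordReviewKey() {} // required by Hadoop for deserialization

    public WordReviewKey(String word, long sortField) {
        this.word = word;
        this.sortField = sortField;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeLong(sortField);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        sortField = in.readLong();
    }

    @Override
    public int compareTo(WordReviewKey other) {
        int cmp = word.compareTo(other.word);
        return (cmp != 0) ? cmp : Long.compare(sortField, other.sortField);
    }
}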
I am open to any clarification. Please ask if you think it might be useful to you.
Thank you!
You can load the term index list into the Hadoop distributed cache so that it is available to mappers and reducers. For instance, in Hadoop streaming, you can run your job as follows:
$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapper.py \
-reducer myReducer.py \
-file myMapper.py \
-file myReducer.py \
-file myTermIndexList.txt
Now in myMapper.py you can load the file myTermIndexList.txt and use it for your purpose. If you give a more detailed description of your input and desired output, I can give you more details.
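In plain Java MapReduce the equivalent is the job's cache-file mechanism; roughly like this (the HDFS path is a placeholder, and the '#terms' fragment creates a local symlink named terms in the task's working directory):
// Driver side: register the term index file with the job.
job.addCacheFile(new URI("/user/you/myTermIndexList.txt#terms"));

// Mapper side: load the local copy once per task, e.g. in setup().
@Override
protected void setup(Context context) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader("terms"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // parse "index<tab>word" lines into an in-memory Map<String, Integer>
        }
    }
}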
Approach #1 is not good, but it is very common if you don't have much Hadoop experience. Starting jobs is very expensive. What you want to do instead is have 2-3 jobs that feed each other to get the desired result. A common solution to similar problems is to have the mapper tokenize the input and output pairs, group them in the reducer while executing some kind of calculation, and then feed that into job 2. In the mapper of job 2 you invert the data in some way, and in its reducer you do some other calculation.
I would highly recommend learning more about Hadoop through a training course. Interestingly, Cloudera's dev course has a very similar problem to the one you are trying to address. Alternatively, or perhaps in addition to a course, I would look at "Data-Intensive Text Processing with MapReduce", specifically the sections on "Computing Relative Frequencies" and "Inverted Indexing for Text Retrieval":
http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

cleanly lining up output

I have a text file that contains data saved in CSV format, and I want to open it in Java and display it lined up perfectly. I have gotten to the point of reading it from the text file; now I want to split it at the commas and display it all perfectly aligned.
Last, First, car year, car model
barry, john, 1956, chevy impala
and I want it to display like this:
last First car year car model
barry john 1956 chevy impala
and I am just using the Scanner class to get the data from the text file.
Determine the max lengths of the column values (including column headers), then create a format String and use that format string to build the aligned rows:
// some easy magic first
String[][] values = getCsvValues(file);
int[] maxLengths = determineMaxLengths(values);

// create the format string, something like "%10s %5s %10s %n"
StringBuilder formatBuilder = new StringBuilder();
for (int maxLength : maxLengths)
    formatBuilder.append("%").append(maxLength).append("s ");
formatBuilder.append("%n"); // newline

// output
for (String[] row : values)
    System.out.printf(formatBuilder.toString(), row);
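The "easy magic" helpers above are left to the reader; a possible sketch of determineMaxLengths (getCsvValues would just read the file and split each line on commas):
// For each column, find the longest value so the format widths fit everything.
static int[] determineMaxLengths(String[][] values) {
    int[] maxLengths = new int[values[0].length];
    for (String[] row : values) {
        for (int i = 0; i < maxLengths.length; i++) {
            maxLengths[i] = Math.max(maxLengths[i], row[i].length());
        }
    }
    return maxLengths;
}
Also note that "%10s" right-aligns each value; "%-10s" left-aligns, which usually looks more natural for text columns.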
Depending on how certain you need to be that it all lines up, you might have to go through and find the longest string for each column, but I would use tabs: "\t". Two or three between columns usually works for me; for simple debugging that's what I use.
If you're really serious about it being right all the way down, try looking at Formatter and printf.
