Building a simple index for a corpus - Java

I am working on a small project to build an index for a corpus of 1,400 files and then search it for keywords. The index should store the frequency of each keyword and its position (the file name). The output should be the top ten relevant documents, ranked by the frequency of the keyword in each.
For example:
flower text1.txt 3
flower text2.txt 2
.
.
This is what I have so far. I'm having difficulty with the tuple, as I want to add 3 values to the hash map:
import java.io.*;
import java.nio.charset.*;
import java.nio.file.*;
import java.util.*;

public class MyIndex {

    static Map<String, Tuple<Integer, String>> map = new HashMap();

    static String readFile(String path, Charset encoding) throws IOException {
        byte[] encoded = Files.readAllBytes(Paths.get(path));
        return new String(encoded, encoding);
    }

    public static void main(String[] args) throws IOException {
        File myDirectory = new File("/Users/.../processedFiles");
        File[] fileList = myDirectory.listFiles();
        for (int i = 1; i < fileList.length; i++) {
            Scanner scan = new Scanner(new File(fileList[i].getPath()));
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                map.put(line, new Tuple(1, fileList[i].getName())); // tuple is frequency of word and file name
            }
        }
    }

    public class Tuple<X, Y> {
        public final X x;
        public final Y y;
        public Tuple(X x, Y y) {
            this.x = x;
            this.y = y;
        }
    }
}
The error is in put(...).
I didn't add the frequency method yet; this is what I have so far:
static void frequency(String[] array) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (String string : array) {
        int count = 0;
        if (!map.containsKey(string)) {
            map.put(string, 1);
        } else {
            count = map.get(string);
            map.put(string, count + 1);
        }
    }
}
Is there a better way to do this from scratch, since we cannot use Lucene etc.?
How do I put it all together to read and index the 1,400 files using the Tuple class?
I am open to any suggestions.
Thanks.

I want to add 3 values to the hashmap
Your map's definition only stores one tuple per string. I suggest making the value an ArrayList of Tuples. (P.S. a Pair class already exists, so you don't have to create your own Tuple class.) This will transform your map from what you asked for:
flower text1.txt 3, flower text2.txt 2
into
flower text1.txt 3, text2.txt 2
where the key is "flower" and the value is an ArrayList with position 0 = Tuple(3, text1.txt) and position 1 = Tuple(2, text2.txt). You can refer to the code below.
ArrayList<Tuple<Integer, String>> A = map.get("flower");
System.out.println(A.get(0).y + " " + A.get(0).x);
System.out.println(A.get(1).y + " " + A.get(1).x);
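For the insertion side, here is a minimal sketch, assuming the index is declared as Map<String, ArrayList<Tuple<Integer, String>>> inside MyIndex (which already imports java.util.*) and that Tuple is made a static nested class. Declaring Tuple static is also what removes the compile error in put(...), because a non-static inner class cannot be instantiated from the static main without an enclosing instance:
// index maps each word to its list of (frequency, file name) tuples, one per file.
static Map<String, ArrayList<Tuple<Integer, String>>> index = new HashMap<>();

static void addOccurrence(String word, String fileName) {
    ArrayList<Tuple<Integer, String>> postings = index.computeIfAbsent(word, k -> new ArrayList<>());
    for (int i = 0; i < postings.size(); i++) {
        if (postings.get(i).y.equals(fileName)) {
            // Tuple fields are final, so replace the tuple with an incremented copy.
            postings.set(i, new Tuple<>(postings.get(i).x + 1, fileName));
            return;
        }
    }
    postings.add(new Tuple<>(1, fileName)); // first occurrence of this word in this file
}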
I'm not sure why you need a separate frequency method, since you can update the frequency while you read the files. Because this sounds like your assignment, I won't give you all the details, but I'll point you in the right direction:
while (scan.hasNextLine()) {
    // Read all the words in the line and update their count in the map while being aware of the name of the file you're currently reading.
}
There are still things you need to figure out, but I hope this helps.
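And for the query side, a minimal sketch of the requested top-ten output for one keyword, again assuming the index and Tuple from the sketch above (x = frequency, y = file name):
static void printTopTen(String keyword) {
    ArrayList<Tuple<Integer, String>> postings = index.getOrDefault(keyword, new ArrayList<>());
    postings.sort((a, b) -> b.x - a.x); // highest frequency first
    for (Tuple<Integer, String> t : postings.subList(0, Math.min(10, postings.size()))) {
        System.out.println(keyword + " " + t.y + " " + t.x); // e.g. "flower text1.txt 3"
    }
}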

Related

Load 2D array variables into class instances

I'm lacking knowledge of 2D arrays and I need help populating data from an array into a few class variables.
So I have a simple Product class that looks like this:
public class Product {
    int prodID;
    String prodName;
    Double prodCost;
    int prodQuantity;
}
I also have a class with two methods:
Taking a CSV and converting it to an array - done
Taking variables from the array and adding them to the appropriate variables - not finished
The array/CSV looks like this:
product ID | product name | product cost | quantity
001        | item1        | 5.99         | 3
002        | item2        | 2.99         | 5
I want to write code that iterates over the array and creates a Product instance for each line. Eventually I will have a list of products. I can always assume the CSV is in a fixed format, so there will always be only 4 fields, as seen in the table above.
So this is what I have so far:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

public class productsImport extends Product {

    public static List<List<String>> csvToArray() {
        String fileName = "c:\\temp\\test.csv";
        File file = new File(fileName);
        // this gives you a 2-dimensional array of strings
        List<List<String>> lines = new ArrayList<>();
        Scanner inputStream;
        try {
            inputStream = new Scanner(file);
            while (inputStream.hasNext()) {
                String line = inputStream.next();
                String[] values = line.split(",");
                // this adds the currently parsed line to the 2-dimensional string array
                lines.add(Arrays.asList(values));
            }
            inputStream.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return lines;
    }

    public static void mapToProdcut(List<List<String>> lines) {
        for (List<String> line : lines) {
            Product p = new Product();
            for (String value : line) {
                ???
            }
        }
    }

    public static void main(String[] args) {
        csvToArray();
        mapToProdcut(csvToArray());
    }
}
The first method converts the CSV to an array. The second method is where I'm stuck. I don't know how to iterate properly over the array to make sure that p.prodID, p.prodName, p.prodCost and p.prodQuantity are all populated with the corresponding column. I want to skip over the first row, because it will always show the field titles and they're not relevant.
Any help with this would be great :)
The first thing to do is to create a constructor that takes all the fields as parameters, to simplify the code:
public Product(int prodID, String prodName, Double prodCost, int prodQuantity) {
    this.prodID = prodID;
    this.prodName = prodName;
    this.prodCost = prodCost;
    this.prodQuantity = prodQuantity;
}
If you are running Java 8 you can use streams
List<Product> products =
    lines.stream()
         .skip(1)
         .map(s -> new Product(
                 Integer.valueOf(s.get(0)),
                 s.get(1),
                 Double.valueOf(s.get(2)),
                 Integer.valueOf(s.get(3))))
         .collect(Collectors.toList());
Otherwise you can use a for loop
List<Product> products = new ArrayList<>();
for (int i = 1; i < lines.size(); i++) {
    List<String> s = lines.get(i);
    Product product = new Product(
            Integer.valueOf(s.get(0)),
            s.get(1),
            Double.valueOf(s.get(2)),
            Integer.valueOf(s.get(3)));
    products.add(product);
}
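One small caveat worth adding (an assumption, since the real file isn't shown): if the CSV fields carry stray spaces, Integer.valueOf(" 001") throws a NumberFormatException, so trimming each field before parsing is a cheap safeguard. A hypothetical helper for the mapping step, reusing the Product constructor above:
private static Product toProduct(List<String> s) {
    // Trim each field before parsing so padded values don't break the numeric parsing.
    return new Product(
            Integer.valueOf(s.get(0).trim()),
            s.get(1).trim(),
            Double.valueOf(s.get(2).trim()),
            Integer.valueOf(s.get(3).trim()));
}
With that helper, the stream version becomes lines.stream().skip(1).map(productsImport::toProduct).collect(Collectors.toList()).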

Displaying word frequencies of 0 in an ArrayList

I'm looking for some assistance. I've made a program that uses two classes that I've also written. The first class, CollectionOfWords, reads in text files and stores the words they contain in a HashMap. The second, WordFrequencies, creates an object from the CollectionOfWords class, then reads in another document to see whether the document's contents are in the collection. It then outputs an ArrayList with the frequencies counted in the document.
Whilst this works and returns the frequencies of the words found in both the collection and the document, I'd like it to also produce zero values for the words that are in the collection but not in the document, if that makes sense. For example, test3 returns [1, 1, 1], but I'd like it to return [1, 0, 0, 0, 1, 0, 1], where the zeroes represent the words that are in the collection but not found in test3.
The test text files I use can be found here:
https://drive.google.com/open?id=1B1cDpjmZZo01HizxJUSWSVIlHcQke2mU
Cheers
WordFrequencies
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Scanner;

public class WordFrequencies {

    static HashMap<String, Integer> collection = new HashMap<>();

    private static ArrayList<Integer> processDocument(String inFileName) throws IOException {
        // Resets the collection's frequency values to zero
        collection.clear();
        // Reads in the new document file to an ArrayList
        Scanner textFile = new Scanner(new File(inFileName));
        ArrayList<String> file = new ArrayList<String>();
        while (textFile.hasNext()) {
            file.add(textFile.next().trim().toLowerCase());
        }
        /* Iterates the ArrayList of words -and- updates collection with
           frequency of words in the document */
        for (String word : file) {
            Integer dict = collection.get(word);
            if (!collection.containsKey(word)) {
                collection.put(word, 1);
            } else {
                collection.put(word, dict + 1);
            }
        }
        textFile.close();
        // Stores the frequency values in an ArrayList
        ArrayList<Integer> values = new ArrayList<>(collection.values());
        return values;
    }

    public static void main(String[] args) {
        // Stores text files for the dictionary (collection of words)
        List<String> textFileList = Arrays.asList("Test.txt", "Test2.txt");
        // Declares empty ArrayLists for output of processDocument function
        ArrayList<Integer> test3 = new ArrayList<Integer>();
        ArrayList<Integer> test4 = new ArrayList<Integer>();
        // Creates a new CollectionOfWords object called dictionary
        CollectionOfWords dictionary = new CollectionOfWords(collection);
        // Reads in the text files from the list and processes them
        for (String text : textFileList) {
            dictionary.scanFile(text);
        }
        try {
            test3 = processDocument("test3.txt");
            test4 = processDocument("test4.txt");
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(test3);
        System.out.println(test4);
    }
}
CollectionOfWords
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Scanner;

public class CollectionOfWords {

    // Declare set in a higher scope (making it a property within the object)
    private HashMap<String, Integer> collection = new HashMap<String, Integer>();

    // Assigns the value of the parameter to the field of the same name
    public CollectionOfWords(HashMap<String, Integer> collection) {
        this.collection = collection;
    }

    // Gets input text file, removes white spaces and adds to dictionary object
    public void scanFile(String textFileName) {
        try {
            Scanner textFile = new Scanner(new File(textFileName));
            while (textFile.hasNext()) {
                collection.put(textFile.next().trim(), 0);
            }
            textFile.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }

    public void printDict(HashMap<String, Integer> dictionary) {
        System.out.println(dictionary.keySet());
    }
}
I didn't go through the trouble of figuring out your entire code, so sorry if this answer is stupid.
As a solution to your problem, you could initialize the map with every word in the dictionary mapping to zero. Right now you use the clear method on the hash map; this does not set everything to zero, it removes all the mappings.
The following code should work; use it instead of collection.clear():
for (Map.Entry<String, Integer> entry : collection.entrySet()) {
    entry.setValue(0);
}
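On Java 8 the same reset can be done in one call with Map.replaceAll (a small sketch; note also that a plain HashMap does not guarantee iteration order, so if the zeros must line up with a fixed word order, using a LinkedHashMap for collection is the safer assumption):
// Reset every word's count to zero without removing the keys (Java 8+).
collection.replaceAll((word, count) -> 0);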

Show duplicates in a String Array from csv File (Java)

My problem is that I created an array from a CSV file and I now have to output any values with duplicates.
The file has a layout of 5 x 9952. It consists of the data:
id, birthday, name, sex, first name
I'd now like the program to show me, for each column (e.g. name), which duplicates there are, like when two people have the same name. But whatever I try from what I found on the Internet only shows me duplicates within a row (like when name and first name are the same).
Here's what I've got so far:
package javacvs;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * @author Tobias
 */
public class main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        String csvFile = "/Users/Tobias/Desktop/PatDaten/123.csv";
        String line = "";
        String cvsSplitBy = ",";
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            while ((line = br.readLine()) != null) {
                // use comma as separator
                String[] patDaten = line.split(cvsSplitBy);
                for (int i = 0; i < patDaten.length - 1; i++) {
                    for (int j = i + 1; j < patDaten.length; j++) {
                        if ((patDaten[i].equals(patDaten[j])) && (i != j)) {
                            System.out.println("Duplicate Element is : " + patDaten[j]);
                        }
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
(I changed the name of the csv as it contains confidential data)
The real thing here: stop thinking "low level". Good OOP is about creating helpful abstractions.
In other words: your first stop should be to create a meaningful class definition that represents the content of one row; let's call it the Person class for now. Then you separate your further concerns:
you create one class/method that does nothing else but reading that CSV file - and creating one Person object per row
you create a meaningful data structure that tells you about duplicates
The latter could (for example) be some kind of reverse index. Meaning: you have a Map<String, List<Person>>. And after you have read all your Person objects (maybe into a simple list), you can do this:
Map<String, List<Person>> personsByName = new HashMap<>();
for (Person p : persons) {
    List<Person> personsForName = personsByName.get(p.getName());
    if (personsForName == null) {
        personsForName = new ArrayList<>();
        personsByName.put(p.getName(), personsForName);
    }
    personsForName.add(p);
}
After that loop, the map contains every name used in your table, and for each name you have a list of the corresponding persons.
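For completeness, here is a self-contained sketch of the same idea, with a hypothetical Person class (holding only the columns from the question) and Java 8's computeIfAbsent doing the map bookkeeping; a name is a duplicate whenever its list ends up with more than one person:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Person {
    final String id, birthday, name, sex, firstName;
    Person(String[] row) {
        // assumes the column order from the question: id, birthday, name, sex, first name
        id = row[0]; birthday = row[1]; name = row[2]; sex = row[3]; firstName = row[4];
    }
    String getName() { return name; }
}

class DuplicateNames {
    static void printDuplicateNames(List<Person> persons) {
        Map<String, List<Person>> personsByName = new HashMap<>();
        for (Person p : persons) {
            personsByName.computeIfAbsent(p.getName(), k -> new ArrayList<>()).add(p);
        }
        for (Map.Entry<String, List<Person>> e : personsByName.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println("Duplicate name: " + e.getKey() + " (" + e.getValue().size() + " rows)");
            }
        }
    }
}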
You are iterating over the fields within one row instead of iterating down a column. What you need to do is run the same kind of loop, but over a column.
What you can do is accumulate the names in a separate array and then iterate over it. I am sure you know which index the column you want to compare has, so you will need one extra loop to accumulate the column you want to check for duplicates.
It's a bit unclear what you want presented: the whole record, or only that there are duplicate names.
For the name only:
String csvFile = "test.csv";
List<String> readAllLines = Files.readAllLines(Paths.get(csvFile));
Set<String> names = new HashSet<>();
readAllLines.stream().map(s -> s.split(",")[2]).forEach(name -> {
    if (!names.add(name)) {
        System.out.println("Duplicate name: " + name);
    }
});
For the whole record:
String csvFile = "test.csv";
List<String> readAllLines = Files.readAllLines(Paths.get(csvFile));
Set<String> names = new HashSet<>();
readAllLines.stream().forEach(record -> {
    String name = record.split(",")[2];
    if (!names.add(name)) {
        System.out.println("Duplicate name: " + name + " with record " + record);
    }
});
Your problem is the nesting of your loops. What you do is read one line, split it up, and then compare the fields of this one row with each other. You never compare one line with the other lines!
So first you need an array of all lines so you can compare them. As GhostCat recommended in his answer, you should use your own Person class with the five fields as attributes. But you could also use a second array, so you can work with the indexes as Alexander Petrov said in his answer. In the latter case, you get a two-dimensional array:
String[][] patDaten;
After that you read all lines of your CSV file, and for each line you create a new Person or a new inner array.
After reading the entire file, you compare the fields as you want. Here you use your double loop: with the Person approach you compare the i-th person's getName() with the j-th person's getName(), and with the 2D array you compare patDaten[i][1] with patDaten[j][1].
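A compact sketch of that column-wise check over the two-dimensional array (an illustration only; it assumes patDaten has already been filled with one inner array per CSV row, that row 0 is the header, and that col is the index of the column you want to check):
import java.util.HashMap;
import java.util.Map;

class ColumnDuplicates {
    // Count how often each value occurs in one column, then report values seen more than once.
    static void printColumnDuplicates(String[][] patDaten, int col) {
        Map<String, Integer> counts = new HashMap<>();
        for (int row = 1; row < patDaten.length; row++) {   // row 0 is the header line
            counts.merge(patDaten[row][col], 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                System.out.println("Duplicate in column " + col + ": " + e.getKey() + " (" + e.getValue() + "x)");
            }
        }
    }
}
Calling printColumnDuplicates(patDaten, 2) would then report duplicate names, since name is the third column in the question's layout.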

Searching through a hash map for multiple keys in Java

I am trying to figure out how to go about searching some user input for multiple keywords. The keywords come from a hash map called Synonym. So basically I enter some sentence, and if the sentence contains one or more keywords or keyword synonyms, I want to call a parse-file method. So far I could only search for one keyword. I am stuck trying to take user input, which could be a long sentence or just one word containing the keyword(s), and search the hash map keys for the matching word. For example, if the hash map is
responses.put("textbook name", new String[] { "name of textbook", "text", "portfolio" });
responses.put("current assignment", new String[] { "homework","current work" });
and the user inputs "what is the name of textbook that has the homework", I want to search a text file for the keys textbook name and current assignment, assuming that the text file contains the sentence "The current assignment is in the second textbook name ralphy". I've got most of my implementation done; the issue is dealing with more than one keyword. Can someone help me solve this?
Here is my code
private static HashMap<String, String[]> responses = new HashMap<String, String[]>(); // this

public static void parseFile(String s) throws FileNotFoundException {
    File file = new File("data.txt");
    Scanner scanner = new Scanner(file);
    while (scanner.hasNextLine()) {
        final String lineFromFile = scanner.nextLine();
        if (lineFromFile.contains(s)) {
            // a match!
            System.out.println(lineFromFile);
            // break;
        }
    }
}

private static HashMap<String, String[]> populateSynonymMap() {
    responses.put("test", new String[] { "test load", "quantity of test", "amount of test" });
    responses.put("textbook name", new String[] { "name of textbook", "text", "portfolio" });
    responses.put("professor office", new String[] { "room", "post", "place" });
    responses.put("day", new String[] { "time", "date" });
    responses.put("current assignment", new String[] { "homework", "current work" });
    return responses;
}

public static void main(String args[]) throws ParseException, IOException {
    /* Initialization */
    HashMap<String, String[]> synonymMap = new HashMap<String, String[]>();
    synonymMap = populateSynonymMap(); // populate the map
    Scanner scanner = new Scanner(System.in);
    String input = null;
    /* End Initialization */
    System.out.println("Welcome To DataBase ");
    System.out.println("What would you like to know?");
    System.out.print("> ");
    input = scanner.nextLine().toLowerCase();
    String[] inputs = input.split(" ");
    for (String ing : inputs) { // iterate over each word of the sentence.
        boolean found = false;
        for (Map.Entry<String, String[]> entry : synonymMap.entrySet()) {
            String key = entry.getKey();
            String[] value = entry.getValue();
            if (input.contains(key) || key.contains(input) || Arrays.asList(value).contains(input)) {
                found = true;
                parseFile(entry.getKey());
            }
        }
    }
}
Any help would be appreciated
I have answered a very similar question, Understand two or more keys with Hashmaps, but I'll make my point more clear here. Given the data structures you are already using, consider the following:
1) Input list --> split the sentence into words (keeping their order) and store them in a list, for example [what, is, the, name, of, textbook, that, has, the, homework]
2) Keyword list --> all keys from the HashMap you are using, for example [test, textbook name, professor office]
Now you have to set some criterion, for example that a keyword phrase taken from the sentence can be at most 3 words long (e.g. 'name of textbook'). Why this criterion? To limit the processing; otherwise you'll end up checking a lot of combinations of the input.
Once you have this, you check what the input list and the keyword list have in common under the criterion you set. If you don't set a criterion, you may have to try every combination against the key set. Once you find a single match or multiple matches, output the synonym list, etc.
For example, check [name of textbook] against all the keys of the map.
If you want to check in the reverse direction, do the same process by creating a list of synonyms and checking against it.
My two tips for tackling this problem:
1) Define a set of keywords and don't check against the value lists; the hash map structure is not good for that. Be prepared for some redundant data here.
2) Decide how many consecutive words you want to search for in this key set, and preferably keep only distinct words.
Hope this helps!
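A sketch of that phrase-matching idea, assuming the responses map from the question and the 3-word cap suggested above; the class and method names here are made up for illustration:
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class PhraseMatcher {

    // Flatten the synonym map: every key and every synonym points back to its key.
    static Map<String, String> buildLookup(Map<String, String[]> responses) {
        Map<String, String> lookup = new HashMap<>();
        for (Map.Entry<String, String[]> e : responses.entrySet()) {
            lookup.put(e.getKey(), e.getKey());
            for (String synonym : e.getValue()) {
                lookup.put(synonym, e.getKey());
            }
        }
        return lookup;
    }

    // Check every 1-, 2- and 3-word phrase of the input against the lookup table
    // and return the matching keys in the order they were found.
    static Set<String> matchKeys(String input, Map<String, String> lookup) {
        String[] words = input.toLowerCase().split("\\s+");
        Set<String> matchedKeys = new LinkedHashSet<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= 3 && start + len <= words.length; len++) {
                if (len > 1) phrase.append(' ');
                phrase.append(words[start + len - 1]);
                String key = lookup.get(phrase.toString());
                if (key != null) {
                    matchedKeys.add(key);
                }
            }
        }
        return matchedKeys;
    }
}
For the sample sentence "what is the name of textbook that has the homework", matchKeys(input, buildLookup(responses)) would return [textbook name, current assignment], and you could then call parseFile for each matched key.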
You could use a single regex pattern per "dictionary entry" and test each pattern against your input. Depending on your performance requirements and the size of your dictionary and input, it might be a good solution.
If you're using Java 8, try this:
public static class DicEntry {
    String key;
    String[] syns;
    Pattern pattern;

    public DicEntry(String key, String... syns) {
        this.key = key;
        this.syns = syns;
        pattern = Pattern.compile(".*(?:" + Stream.concat(Stream.of(key), Stream.of(syns))
                .map(x -> "\\b" + Pattern.quote(x) + "\\b")
                .collect(Collectors.joining("|")) + ").*");
    }
}

public static void main(String args[]) throws ParseException, IOException {
    // Initialization
    List<DicEntry> synonymMap = populateSynonymMap();
    Scanner scanner = new Scanner(System.in);
    // End Initialization
    System.out.println("Welcome To DataBase ");
    System.out.println("What would you like to know?");
    System.out.print("> ");
    String input = scanner.nextLine().toLowerCase();
    boolean found;
    for (DicEntry entry : synonymMap) {
        if (entry.pattern.matcher(input).matches()) {
            found = true;
            System.out.println(entry.key);
            parseFile(entry.key);
        }
    }
}

private static List<DicEntry> populateSynonymMap() {
    List<DicEntry> responses = new ArrayList<>();
    responses.add(new DicEntry("test", "test load", "quantity of test", "amount of test"));
    responses.add(new DicEntry("textbook name", "name of textbook", "text", "portfolio"));
    responses.add(new DicEntry("professor office", "room", "post", "place"));
    responses.add(new DicEntry("day", "time", "date"));
    responses.add(new DicEntry("current assignment", "homework", "current work"));
    return responses;
}
Sample output:
Welcome To DataBase
What would you like to know?
> what is the name of textbook that has the homework
textbook name
current assignment
Make a list of / append the keys that match. For the given example, when the keyword "textbook" matches, store it in a "temp" variable. Continue the loop; when the keyword "current" matches, append it to temp, so temp now contains "textbook current". Similarly, continue and append the next keyword "assignment" to temp.
Now temp contains "textbook current assignment".
At the end, call parseFile(temp).
This should work for single or multiple matches.
// Only limitation is that the keys have to be given in an ordered sequence; if you want
// to evaluate all the possible combinations then better add all the keys to a list
// and append them in the required combination.
// There might be corner cases which I haven't thought of, but this might help / point to a better solution.
String temp = "";
// flag - used to indicate whether any word was found in the dictionary or not
int flag = 0;
for (String ing : inputs) { // iterate over each word of the sentence.
    boolean found = false;
    for (Map.Entry<String, String[]> entry : synonymMap.entrySet()) {
        String key = entry.getKey();
        String[] value = entry.getValue();
        if (input.contains(key)) {
            flag = 1;
            found = true;
            temp = temp + " " + key;
        } else if (key.contains(input)) {
            flag = 1;
            found = true;
            temp = temp + " " + input;
        } else if (Arrays.asList(value).contains(input)) {
            flag = 1;
            found = true;
            temp = temp + " " + input;
        }
    }
}
if (flag == 1) {
    parseFile(temp);
}

Using maps and sets together as a data structure in Java

I have a program that takes tracks and how many times each was played, and outputs them. Simple, but I couldn't get the counts into descending order. My second problem is that if there are multiple tracks with the same count, it should look at the tracks' names and print them in alphabetical order. I've reached the point where I can print everything as it should be, just without the ordering, because I am using maps, and whenever I use a list to sort things out it gets sorted in ascending order.
Here is my code and output
import java.util.*;
import java.io.*;
import java.lang.*;
import lab.itunes.*;

public class Music {
    public static void main(String[] args) throws Exception {
        try {
            Scanner input = new Scanner(System.in);
            PrintStream output = new PrintStream(System.out);
            Map<String, Integer> mapp = new HashMap<String, Integer>();
            List<Integer> list1 = new ArrayList<Integer>();
            output.print("Enter the name of the iTunes library XML file:");
            String entry = input.nextLine();
            Scanner fileInput = new Scanner(new File(entry));
            Library music = new Library(entry); // this class was given to us.
            Iterator<Track> itr = music.iterator(); // scan through it
            while (itr.hasNext()) {
                Track token = itr.next(); // get the tracks
                mapp.put(token.getName(), token.getPlayCount()); // fill our map
                list1.add(token.getPlayCount()); // fill our list too
            }
            for (Map.Entry<String, Integer> testo : mapp.entrySet()) {
                String keys = testo.getKey();
                Integer values = testo.getValue();
                output.printf("%d\t%s%n", values, keys); // printing the keys and values in random order.
            }
        } catch (FileNotFoundException E) {
            System.out.print("That file does not exist");
        }
    }
}
The output is this:
Enter the name of the iTunes library XML file:library.txt
87 Hotel California
54 Like a Rolling Stone
19 Billie Jean
75 Respect
26 Imagine
19 In the Ghetto
74 Macarena
27 Hey Jude
67 I Gotta Feeling
99 The Twist
Can you please give me a hint for this? I've worked for at least 4 hours to get this far. Thanks.
Does the Library class have a sort() method? If not, you could add one and call sort() on the Library music just before you ask it for its iterator().
public class Library
{
    // ... existing code ...

    public void sort()
    {
        class TrackPlayCountComparator implements Comparator<Track>
        {
            @Override
            public int compare(Track t1, Track t2) {
                int compare = t2.getPlayCount() - t1.getPlayCount();
                if (compare == 0) {
                    return t1.getName().compareTo(t2.getName());
                }
                return compare;
            }
        }
        Collections.sort(this.tracks, new TrackPlayCountComparator());
    }
}
That simplifies your code to this:
public static void main(String[] args) throws Exception
{
    Scanner input = new Scanner(System.in);
    System.out.print("Enter the name of the iTunes library XML file: ");
    String entry = input.nextLine();
    try {
        input = new Scanner(new File(entry));
        input.close();
        Library music = new Library(entry); // this class was given to us.
        music.sort(); // sort the tracks
        PrintStream output = new PrintStream(System.out);
        for (Iterator<Track> itr = music.iterator(); itr.hasNext(); ) {
            Track track = itr.next();
            output.printf("%d\t%s%n", track.getPlayCount(), track.getName());
        }
    } catch (FileNotFoundException E) {
        System.out.print("That file does not exist");
    }
}
I'm assuming your question is: how can I sort a map on the values, rather than the keys?
If so, here is some sample code to get you started:
map.entrySet().stream()
   .sorted(Map.Entry.comparingByValue())
   .map(entry -> entry.getKey() + "\t" + entry.getValue())
   .forEach(output::println);
If you need to sort in reverse order then just change the comparingByValue comparator:
.sorted(Map.Entry.comparingByValue((val1, val2) -> val2 - val1))
To sort by value then alphabetically:
.sorted((entry1, entry2) -> entry1.getValue().equals(entry2.getValue()) ? entry1.getKey().compareTo(entry2.getKey()) : entry2.getValue() - entry1.getValue())
You could make that a bit neater by putting the comparator in a separate method.
private Comparator<Map.Entry<String, Integer>> songComparator() {
    return (entry1, entry2) -> {
        int difference = entry2.getValue() - entry1.getValue();
        if (difference == 0) {
            return entry1.getKey().compareTo(entry2.getKey());
        } else {
            return difference;
        }
    };
}
You would then use songComparator() to create the comparator passed to sorted().
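Put together, the pipeline would read as follows (a sketch, assuming the mapp and output variables from the question's main, and that songComparator is declared static so it can be called from there):
// Sort by play count descending, ties broken alphabetically, then print "count<TAB>name".
mapp.entrySet().stream()
    .sorted(songComparator())
    .map(entry -> entry.getValue() + "\t" + entry.getKey())
    .forEach(output::println);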
Use Collections.sort() to sort a collection by its natural order, or define a Comparator and pass it as the second argument.
First you must change your List to take the 'Track' type, and you no longer need a Map:
// the list will store every track
List<Track> tracks = new ArrayList<Track>();
String entry = input.nextLine();
Scanner fileInput = new Scanner(new File(entry));
Library music = new Library(entry); // this class was given to us.
Iterator<Track> itr = music.iterator(); // scan through it
while (itr.hasNext()) {
    tracks.add(itr.next()); // add each track
}
// you can define classes anonymously:
Collections.sort(tracks, new Comparator<Track>()
{
    @Override
    public int compare(Track t1, Track t2) {
        int diff = t2.getPlayCount() - t1.getPlayCount();
        // if there is no difference in play count, return name comparison
        return (diff == 0 ? t1.getName().compareTo(t2.getName()) : diff);
    }
});
See Anonymous Classes for more information.
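After the sort, printing in the question's output format is just a loop (a sketch that reuses the output PrintStream from the question):
for (Track track : tracks) {
    output.printf("%d\t%s%n", track.getPlayCount(), track.getName());
}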
