Indexing for each word in the TextFile Content Using Java

I am trying to index each word in a text file using Java.
By "index" I mean recording, for each word, the poem number and line number where it occurs.
This is my sample file: https://pastebin.com/hxB8t56p
(the actual file I want to index is much larger)
This is the code I have tried so far:
ArrayList<String> ar = new ArrayList<String>();
ArrayList<String> sen = new ArrayList<String>();
ArrayList<String> fin = new ArrayList<String>();
ArrayList<String> word = new ArrayList<String>();
String content = new String(Files.readAllBytes(Paths.get("D:\\folder\\poem.txt")), StandardCharsets.UTF_8);
String[] split = content.split("\\s"); // Split text file content
for(String b:split) {
ar.add(b); // add each whitespace-separated token to ar, so ar holds every word of the poem
}
FileInputStream fstream = null;
String answer = "";fstream=new FileInputStream("D:\\folder\\poemt.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int count = 1;
int songnum = 0;
while((strLine=br.readLine())!=null) {
String text = strLine.replaceAll("[0-9]", ""); // Replace numbers from txt
String nums = strLine.split("(?=\\D)")[0]; // get digits from strLine
if (nums.matches(".*[0-9].*")) {
songnum = Integer.parseInt(nums); // Parse string to int
}
String regex = ".*\\d+.*";
boolean result = strLine.matches(regex);
if (result == true) { // check if strLine contain digit
count = 1;
}
answer = songnum + "." + count + "(" + text + ")";
count++;
sen.add(answer); // added songnum + line number and text to sen
}
for(int i = 0;i<sen.size();i++) { // loop to match and get word+poem number+line number
for (int j = 0; j < ar.size(); j++) {
if (sen.get(i).contains(ar.get(j))) {
if (!ar.get(j).isEmpty()) {
String x = ar.get(j) + " - " + sen.get(i);
x = x.replaceAll("\\(.*\\)", ""); // replace single line sentence
String[] sp = x.split("\\s+");
word.add(sp[0]); // each word in the poem is added to the word arraylist
fin.add(x); // word+poem number+line number
}
}
}
}
Set<String> listWithoutDuplicates = new LinkedHashSet<String>(fin); // Remove duplicates
fin.clear();fin.addAll(listWithoutDuplicates);
Locale lithuanian = new Locale("ta");
Collator lithuanianCollator = Collator.getInstance(lithuanian); // sort array
Collections.sort(fin,lithuanianCollator);
System.out.println(fin);
(change in blossom. - 0.2,1.2, & the - 0.1,1.2, & then - 0.1,1.2)

I will first copy the intended output for your pasted example, and then go over the code to find how to change it:
Poem.txt
0.And then the day came,
to remain blossom.
1.more painful
then the blossom.
Expected output
[blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1,1.2, then - 0.1,1.2, to - 0.2]
As @Pal Laden notes in the comments, some words (the, and) are not being indexed. Stopwords are probably meant to be ignored for indexing purposes.
Current output of code is
[blossom. - 0.2, blossom. - 1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1, the - 1.2, then - 0.1, then - 1.2, to - 0.2]
So, assuming you fix your stopwords, you are actually quite close. Your fin array contains word+poem number+line number, but it should contain word+*list* of poem number+line number. There are several ways to fix this. First, we will need to do stopword removal:
// build stopword-removal set "toIgnore"
String[] stopWords = new String[]{ "a", "the", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);
if ( ! toIgnore.contains(sp[0])) fin.add(x); // only process non-ignored words
// was: fin.add(x);
Now, let's fix the list problem. The easiest (but ugly) way is to fix "fin" at the very end:
List<String> fixed = new ArrayList<>();
String prevWord = "";
String prevLocs = "";
for (String s : fin) {
String[] parts = s.split(" - ");
if (parts[0].equals(prevWord)) {
prevLocs += "," + parts[1];
} else {
if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
prevWord = parts[0];
prevLocs = parts[1];
}
}
// last iteration
if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
System.out.println(fixed);
How to do it the right way (TM)
Your code can be much improved. In particular, using flat ArrayLists for everything is not always the best idea. Maps are great for building indices:
// build stopwords
String[] stopWords = new String[]{ "and", "a", "the", "to", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);
// prepare always-sorted, quick-lookup set of terms
Collator lithuanianCollator = Collator.getInstance(new Locale("ta"));
Map<String, List<String>> terms = new TreeMap<>((o1, o2) -> lithuanianCollator.compare(o1, o2));
// read lines; if line starts with number, store separately
Pattern countPattern = Pattern.compile("([0-9]+)\\.(.*)");
String content = new String(Files.readAllBytes(Paths.get("/tmp/poem.txt")), StandardCharsets.UTF_8);
int poemCount = 0;
int lineCount = 1;
for (String line: content.split("[\n\r]+")) {
line = line.toLowerCase().trim(); // remove spaces on both sides
// update locations
Matcher m = countPattern.matcher(line);
if (m.matches()) {
poemCount = Integer.parseInt(m.group(1));
lineCount = 1;
line = m.group(2); // ignore number for word-finding purposes
} else {
lineCount ++;
}
// read words in line, with locations already taken care of
for (String word: line.split(" ")) {
if ( ! toIgnore.contains(word)) {
if ( ! terms.containsKey(word)) {
terms.put(word, new ArrayList<>());
}
terms.get(word).add(poemCount + "." + lineCount);
}
}
}
// output formatting to match that of your code
List<String> output = new ArrayList<>();
for (Map.Entry<String, List<String>> e: terms.entrySet()) {
output.add(e.getKey() + " - " + String.join(",", e.getValue()));
}
System.out.println(output);
Which gives me [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, to - 0.2]. I have not fixed the list of stopwords to get a perfect match, but that should be easy to do.
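As a side note, the containsKey/put/get sequence in the inner loop can be condensed with Map.computeIfAbsent. The following is only a sketch of the same approach under the same assumptions as above: the class name WordIndexSketch, the method buildIndex and the stop-word list are made up for illustration, and the poem is passed in as a list of lines rather than read from a file.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordIndexSketch {
    // Hypothetical stop-word list, same as in the answer above; extend as needed.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "a", "the", "to", "of", "more"));
    // A line such as "0.And then" starts a new poem: group 1 is the number, group 2 the text.
    static final Pattern POEM_START = Pattern.compile("([0-9]+)\\.(.*)");

    static Map<String, List<String>> buildIndex(List<String> lines) {
        // Keys stay sorted by the collator for the "ta" locale, as in the question.
        Map<String, List<String>> terms =
                new TreeMap<>(Collator.getInstance(Locale.forLanguageTag("ta")));
        int poem = 0;
        int lineNo = 1;
        for (String raw : lines) {
            String line = raw.toLowerCase().trim();
            Matcher m = POEM_START.matcher(line);
            if (m.matches()) {
                poem = Integer.parseInt(m.group(1));
                lineNo = 1;
                line = m.group(2); // drop the poem number before splitting into words
            } else {
                lineNo++;
            }
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    continue;
                }
                // Creates the location list on first sight of the word, then appends to it.
                terms.computeIfAbsent(word, k -> new ArrayList<>()).add(poem + "." + lineNo);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> poem = Arrays.asList(
                "0.And then the day came,", "to remain blossom.",
                "1.more painful", "then the blossom.");
        buildIndex(poem).forEach((word, locs) ->
                System.out.println(word + " - " + String.join(",", locs)));
    }
}
```

Splitting on "\\s+" instead of a single space also avoids empty tokens when a line contains runs of spaces or tabs.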

Related

Calculating Word Frequency Using StreamTokenizer () , HashMap() , HashSet(). in Java Core

import java.io.*;
import java.util.*;
class A {
public static void main(String args[]) throws Exception {
Console con = System.console();
String str;
int i=0;
HashMap map = new HashMap();
HashSet set = new HashSet();
System.out.println("Enter File Name : ");
str = con.readLine();
File f = new File(str);
f.createNewFile();
FileInputStream fis = new FileInputStream(str);
StreamTokenizer st = new StreamTokenizer(fis);
while(st.nextToken()!=StreamTokenizer.TT_EOF) {
String s;
switch(st.ttype) {
case StreamTokenizer.TT_NUMBER: s = st.nval+"";
break;
case StreamTokenizer.TT_WORD: s = st.sval;
break;
default: s = ""+((char)st.ttype);
}
map.put(i+"",s);
set.add(s);
i++;
}
Iterator iter = set.iterator();
System.out.println("Frequency Of Words :");
while(iter.hasNext()) {
String word;
int count=0;
word=(String)iter.next();
for(int j=0; j<i ; j++) {
String word2;
word2=(String)map.get(j+"");
if(word.equals(word2))
count++;
}
System.out.println(" WORD : "+ word+" = "+count);
}
System.out.println("Total Words In Files: "+i);
}
}
For this code, I first created a text file which contains the following data:
# Hello Hii World # * c++ java salesforce
And the output of this code is:
Frequency Of Words :
WORD : # = 1
WORD : # = 1
WORD : c = 1
WORD : salesforce = 1
WORD : * = 1
WORD : Hii = 1
WORD : + = 2
WORD : java = 1
WORD : World = 1
WORD : Hello = 1
Total Words In Files: 11
I am unable to find out why this shows c++ as separate words. I want c++ to be treated as a single word in the output.

You can do it in this way
// Create the file at path specified in the String str
// ...
HashMap<String, Integer> map = new HashMap<>();
InputStream fis = new FileInputStream(str);
Reader bufferedReader = new BufferedReader(new InputStreamReader(fis));
StreamTokenizer st = new StreamTokenizer(bufferedReader);
st.wordChars('+', '+');
while(st.nextToken() != StreamTokenizer.TT_EOF) {
String s;
switch(st.ttype) {
case StreamTokenizer.TT_NUMBER:
s = String.valueOf(st.nval);
break;
case StreamTokenizer.TT_WORD:
s = st.sval;
break;
default:
s = String.valueOf((char)st.ttype);
}
Integer val = map.get(s);
if(val == null)
val = 1;
else
val++;
map.put(s, val);
}
Set<String> keySet = map.keySet();
Iterator<String> iter = keySet.iterator();
System.out.println("Frequency Of Words :");
int sum = 0;
while(iter.hasNext()) {
String word = iter.next();
int count = map.get(word);
sum += count;
System.out.println(" WORD : " + word + " = " + count);
}
System.out.println("Total Words In Files: " + sum);
Note that I've updated your code to use generics instead of the raw versions of HashMap and Iterator. Moreover, the constructor you used for StreamTokenizer is deprecated. Using both a map and a set was unnecessary, because you can iterate over the map's keys with the keySet() method. The map now goes from String (the word) to Integer (the number of times the word occurs).
Anyway, for the example you gave, I think a simple split would have been more appropriate.
For further information about the wordChars method, see the StreamTokenizer documentation for wordChars(int, int).
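For what it's worth, the split-based alternative mentioned above might look roughly like this. It is only a sketch: the class name SplitWordCount and the method countWords are made up for illustration, and it treats every whitespace-separated token as a word, so "c++" stays in one piece without any tokenizer configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SplitWordCount {
    // Counts how often each whitespace-separated token occurs, in first-seen order.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input is blank
            }
            // merge() stores 1 for a new token, otherwise adds 1 to the existing count.
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("# Hello Hii World # * c++ java salesforce");
        System.out.println("Frequency Of Words :");
        int total = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(" WORD : " + e.getKey() + " = " + e.getValue());
            total += e.getValue();
        }
        System.out.println("Total Words In Files: " + total);
    }
}
```

Unlike StreamTokenizer, this does not parse numbers or split off punctuation, so a token like "World," would be counted separately from "World".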

How to check if a text file contains a single string or two strings?

Let's say I have a text file whose lines each begin with a name made of two strings:
New York 52.523405 13.4114
San Antonio 41.387917 2.169919
Los Angeles 51.050991 13.733634
and this is my code to split the string out from the line:
for (int i = 0; i < noOfStores;i++){
nextLine = console.readLine();
nextLine = nextLine.trim();
String temp[] = nextLine.split(" ");
String firstWord = temp[0] + " " + temp[1];
storeNames[i] = firstWord;
latitudes[i] = Double.parseDouble(temp[2]);
longitudes[i] = Double.parseDouble(temp[3]);
}
But what if the text file contains only one string at the start of each line, like this:
Berlin 52.523405 13.4114
Barcelona 41.387917 2.169919
Dresden 51.050991 13.733634
How can I check whether a text file contains one or two strings when reading it?

Use split(" "), get the returned array's length, and parse the last two items in the array (at indices length - 2 and length - 1) as doubles. Then iterate through the remaining items before those two and combine them into the city String. Something like:
for (int i = 0; i < noOfStores;i++){
nextLine = console.readLine();
nextLine = nextLine.trim();
String temp[] = nextLine.split(" ");
int length = temp.length;
if (length < 3) {
// output is not as expected; throw some type of exception here.
}
latitudes[i] = Double.parseDouble(temp[length - 2]);
longitudes[i] = Double.parseDouble(temp[length - 1]);
// this should handle city names with 1, 2 or any number of tokens
StringBuilder wordSb = new StringBuilder();
for (int j = 0; j < length - 2; j++) {
wordSb.append(temp[j]);
if (j != length - 3) {
wordSb.append(" ");
}
}
storeNames[i] = wordSb.toString();
}

Use a regular expression.
String testData = "New York 52.523405 13.4114\n" +
"San Antonio 41.387917 2.169919\n" +
"Los Angeles 51.050991 13.733634\n" +
"Berlin 52.523405 13.4114\n" +
"Barcelona 41.387917 2.169919\n" +
"Dresden 51.050991 13.733634";
Pattern p = Pattern.compile("\\s*(.*?)\\s+(-?[0-9.]+)\\s+(-?[0-9.]+)\\s*");
try (BufferedReader in = new BufferedReader(new StringReader(testData))) {
String line;
while ((line = in.readLine()) != null) {
Matcher m = p.matcher(line);
if (! m.matches())
throw new IllegalArgumentException("Bad data: " + line);
String storeName = m.group(1);
double latitude = Double.parseDouble(m.group(2));
double longitude = Double.parseDouble(m.group(3));
System.out.printf("Store '%s' is at %f, %f%n", storeName, latitude, longitude);
}
}
Output
Store 'New York' is at 52.523405, 13.411400
Store 'San Antonio' is at 41.387917, 2.169919
Store 'Los Angeles' is at 51.050991, 13.733634
Store 'Berlin' is at 52.523405, 13.411400
Store 'Barcelona' is at 41.387917, 2.169919
Store 'Dresden' is at 51.050991, 13.733634

The most suitable tool for your task is a regular expression.
Since you can be sure that the two numbers don't contain any spaces, you can match each of them with "\S+" and leave everything else to be matched by the pattern for the name.
This allows any number of words (and literally anything else) in the name part, while also allowing numbers in any format (such as scientific notation), as long as they contain no spaces.
String[] lines = new String[]{
"New York 52.523405 13.4114",
"San Antonio 41.387917 2.169919",
"Los Angeles 51.050991 13.733634",
"Berlin 52.523405 13.4114",
"Barcelona 41.387917 2.169919",
"Dresden 51.050991 13.733634",
"Some scientific notation 1E-4 13.733634"
};
Pattern pattern = Pattern.compile("(.*)\\s+(\\S+)\\s+(\\S+)");
for (String line : lines) {
Matcher matcher = pattern.matcher(line);
if (matcher.matches()) {
String name = matcher.group(1);
double latitude = Double.parseDouble(matcher.group(2));
double longitude = Double.parseDouble(matcher.group(3));
System.out.printf("'%s', %.4f %.4f\n", name, latitude, longitude);
}
}
Result:
'New York', 52.5234 13.4114
'San Antonio', 41.3879 2.1699
'Los Angeles', 51.0510 13.7336
'Berlin', 52.5234 13.4114
'Barcelona', 41.3879 2.1699
'Dresden', 51.0510 13.7336
'Some scientific notation', 0.0001 13.7336

Creating an ArrayList from data in a text file

I am trying to write a program that uses two classes to find the total $ amount from a text file of retail transactions. The first class must read the file, and the second class must perform the calculations. The problem I am having is that in the first class, the ArrayList only seems to get the price of the last item in the file. Here is the input (which is in a text file):
$69.99 3 Shoes
$79.99 1 Pants
$17.99 1 Belt
And here is my first class:
class ReadInputFile {
static ArrayList<Double> priceArray = new ArrayList<>();
static ArrayList<Double> quantityArray = new ArrayList<>();
static String priceSubstring = new String();
static String quantitySubstring = new String();
public void gatherData () {
String s = "C:\\filepath";
try {
FileReader inputFile = new FileReader(s);
BufferedReader bufferReader = new BufferedReader(inputFile);
String line;
String substring = " ";
while ((line = bufferReader.readLine()) != null)
substring = line.substring(1, line.lastIndexOf(" ") + 1);
priceSubstring = substring.substring(0,substring.indexOf(" "));
quantitySubstring = substring.substring(substring.indexOf(" ") + 1 , substring.lastIndexOf(" ") );
double price = Double.parseDouble(priceSubstring);
double quantity = Double.parseDouble(quantitySubstring);
priceArray.add(price);
quantityArray.add(quantity);
System.out.println(priceArray);
} catch (IOException e) {
e.printStackTrace();
}
}
The output and value of priceArray is [17.99], but the desired output is [69.99,79.99,17.99].
Not sure where the problem is, but thanks in advance for any help!

Basically what you have is:
while ((line = bufferReader.readLine()) != null) {
substring = line.substring(1, line.lastIndexOf(" ") + 1);
}
priceSubstring = substring.substring(0,substring.indexOf(" "));
quantitySubstring = substring.substring(substring.indexOf(" ") + 1 , substring.lastIndexOf(" ") );
double price = Double.parseDouble(priceSubstring);
double quantity = Double.parseDouble(quantitySubstring);
priceArray.add(price);
quantityArray.add(quantity);
System.out.println(priceArray);
So all the loop does is create a substring of the line you just read and then read the next line, which means only the substring of the last line gets processed by the remaining code.
Wrap the code that you want executed on each iteration of the loop in {...}.
For example...
while ((line = bufferReader.readLine()) != null) {
substring = line.substring(1, line.lastIndexOf(" ") + 1);
priceSubstring = substring.substring(0,substring.indexOf(" "));
quantitySubstring = substring.substring(substring.indexOf(" ") + 1 , substring.lastIndexOf(" ") );
double price = Double.parseDouble(priceSubstring);
double quantity = Double.parseDouble(quantitySubstring);
priceArray.add(price);
quantityArray.add(quantity);
System.out.println(priceArray);
}
This will execute all the code within the {...} block for each line of the file
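Separately from the missing braces, the nested substring/indexOf arithmetic is easy to get wrong. A possible alternative is to split each line on whitespace. The sketch below assumes every line has the form "$price quantity item" as in the sample input; the class name TransactionParseSketch and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TransactionParseSketch {
    // Extracts the price from each line such as "$69.99 3 Shoes".
    static List<Double> prices(List<String> lines) {
        List<Double> result = new ArrayList<>();
        for (String line : lines) {
            // At most three parts: price, quantity, item name (which may contain spaces).
            String[] parts = line.trim().split("\\s+", 3);
            result.add(Double.parseDouble(parts[0].substring(1))); // skip the leading '$'
        }
        return result;
    }

    // Extracts the quantity (second column) from each line.
    static List<Double> quantities(List<String> lines) {
        List<Double> result = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 3);
            result.add(Double.parseDouble(parts[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("$69.99 3 Shoes", "$79.99 1 Pants", "$17.99 1 Belt");
        System.out.println(prices(lines));
        System.out.println(quantities(lines));
    }
}
```

In the real program the lines would come from the BufferedReader loop, with the parsing done inside the braces of the while loop.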

End line StringBuilder in RandomAccessFile

I'm trying to use the RandomAccessFile class, but I have a problem with the Strings.
This is the first part, which writes to a file:
public static void main(String[] args) throws IOException {
File file = new File("/home/pep/java/randomFile.dat");
RandomAccessFile randomFile = new RandomAccessFile(file, "rw");
String[] surnames = { "SMITH",
"LOMU" };
int[] dep = { 10,
20 };
Double[] salary = { 1200.50,
1200.50 };
StringBuilder buffer = null;
int n = surnames.length;
for (int i = 0; i<n; i++){
randomFile.writeInt(i+1); //ID
buffer = new StringBuilder(surnames[i]);
buffer.setLength(10); //10 characters
randomFile.writeChars(buffer.toString());
randomFile.writeInt(dep[i]);
randomFile.writeDouble(salary[i]);
}
randomFile.close();
}
In the second part, I try to read this file:
File file = new File("/home/pep/java/randomFile.dat");
RandomAccessFile randomFile = new RandomAccessFile(file, "r");
char[] surname = new char[10];
char aux;
int id, dep, pos;
Double salary;
pos = 0;
for (;;) {
randomFile.seek(pos);
id = randomFile.readInt();
for (int i = 0; i < surname.length; i++) {
aux = randomFile.readChar();
surname[i] = aux;
}
String surnameStr = new String(surname); //HERE IS THE PROBLEM!!
dep = randomFile.readInt();
salary = randomFile.readDouble();
System.out.println("ID: " + id + ", Surname: " + surnameStr + ", Departament: " + dep + ", Salary: " + salary);
pos = pos + 36; // 4 + 20 + 4 + 8
if (randomFile.getFilePointer() == randomFile.length())
break;
}
randomFile.close();
}
When I expect to read:
ID: 1, Surname: SMITH, Dep: 10, Salary: 1200.50
I receive:
ID: 1, Surname: SMITH
It's as if there were an end of line in the surname, because if I don't display the surname, the other info is correct.
Thank you!

Where does cognom come from? [Edit: OK, I found it. It's Catalan for surname. And now the typo coming from departamento is also clear. :-]
What do you get if you insert System.out.println( Arrays.toString( surname )) before the problem line? I assume it's something like [S, M, I, T, H, [], [], [], [], []] (in Eclipse's Console view). Where [] stands for a square, i.e. a non-printable character.
What do you get if you insert System.out.println( (int) surname[5] )? I assume it's 0. And I assume this 0 value is causing the problem.
What do you get if you use a surname that's exactly 10 characters long?
Hint 1: There's a typo in Departament.
Hint 2: Give System.out.printf(...) a chance in favour of println(...).
Hint 3: The if in your solution can be shortened to the more elegant:
cognom[i] = aux != 0 ? aux : ' ';

The problem was in the char array. I changed the loop that reads the chars:
for (int i = 0; i < surname.length; i++) {
aux = randomFile.readChar();
surname[i] = aux != 0 ? aux : ' ';
}

Creating a StringBuilder and setting its length to ten will cause null characters (\u0000) to be written for strings shorter than ten characters, and that in turn will cause a decoding problem when you read. It would be much better to create a String, pad it with spaces to ten chars, write it, then trim() the resulting String when you read it.
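A minimal sketch of that pad-and-trim idea, assuming the same 10-character field as in the question. The class name PaddedRecordSketch and the helpers writeName and readName are made up for illustration, and a temporary file stands in for the real data file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class PaddedRecordSketch {
    static final int NAME_LENGTH = 10; // field width in chars (20 bytes on disk)

    // Writes the name left-justified and padded with spaces to exactly NAME_LENGTH chars.
    static void writeName(RandomAccessFile raf, String name) throws IOException {
        String padded = String.format("%-" + NAME_LENGTH + "s", name);
        // Names longer than the field are cut off so the record size stays fixed.
        raf.writeChars(padded.substring(0, NAME_LENGTH));
    }

    // Reads NAME_LENGTH chars and strips the padding again.
    static String readName(RandomAccessFile raf) throws IOException {
        char[] chars = new char[NAME_LENGTH];
        for (int i = 0; i < NAME_LENGTH; i++) {
            chars[i] = raf.readChar();
        }
        return new String(chars).trim();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("padded", ".dat");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            writeName(raf, "SMITH");
            raf.seek(0);
            System.out.println("Surname: [" + readName(raf) + "]");
        }
    }
}
```

Because the field still occupies 20 bytes, the record length of 36 bytes used for seeking in the question stays the same.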

array in array list

In the input file, there are 2 columns: 1) stem, 2) affixes. In my code, I treat each of the columns as a token, i.e. tokens[1] and tokens[2]. However, tokens[2] contains several values: ng ny nge
stem affixes
---- -------
nyak ng ny nge
My problem is: how can I declare the contents of tokens[2]? Below is a snippet of my code:
try {
FileInputStream fstream2 = new FileInputStream(file2);
DataInputStream in2 = new DataInputStream(fstream2);
BufferedReader br2 = new BufferedReader(new InputStreamReader(in2));
String str2 = "";
String affixes = " ";
while ((str2 = br2.readLine()) != null) {
System.out.println("Original:" + str2);
tokens = str2.split("\\s");
if (tokens.length < 4) {
continue;
}
String stem = tokens[1];
System.out.println("stem is: " + stem);
// here is my point
affixes = tokens[3].split(" ");
for (int x=0; x < tokens.length; x++)
System.out.println("affix is: " + affixes);
}
in2.close();
} catch (Exception e) {
System.err.println(e);
} //end of try2

You index tokens as an array (tokens[1]) and assign the result of str2.split("\\s") to it, so tokens must be of type String[].
Next, you are trying to set the value of affixes by splitting tokens[3]. We know that tokens[3] is a String, so calling split on it yields another String[] array.
So the following is wrong, because you are declaring a String where you need a String[]:
String affixes = " ";
so the correct type should go like this:
String[] affixes = null;
then you can go ahead and assign it an array.
affixes = tokens[3].split(" ");

Are you looking for something like this?
public static void main(String[] args) {
String line = "nyak ng ny nge";
MyObject object = new MyObject(line);
System.out.println("Stem: " + object.stem);
System.out.println("Affixes: ");
for (String affix : object.affixes) {
System.out.println(" " + affix);
}
}
static class MyObject {
public final String stem;
public final String[] affixes;
public MyObject(String line) {
String[] stemSplit = line.split(" +", 2);
stem = stemSplit[0];
affixes = stemSplit[1].split(" +");
}
}
Output:
Stem: nyak
Affixes:
ng
ny
nge
