Find the line number of a text file by each word - java

I want to find the line numbers on which each word of a text file occurs. However, the method I wrote below only gives the first line number, while I need a list of line numbers.
For instance, if "a" occurs on lines 1, 3 and 5, it should have the list [1,3,5]. This list will then be passed into another method for further processing. But my result only shows [1] for "a".
Can someone help me fix this? Thank you!
public SomeObject<Word> buildIndex(String fileName, Comparator<Word> comparator) {
SomeObject<Word> someObject = new SomeObject<>(comparator);
Comparator<Word> comp = checkComparator(someObject.comparator());
int num = 0;
if (fileName != null) {
File file = new File(fileName);
try (Scanner scanner = new Scanner(file, "latin1")) {
while (scanner.hasNextLine()) {
String lines;
if (comparator instanceof IgnoreCase) {
lines = scanner.nextLine().toLowerCase();
} else {
lines = scanner.nextLine();
}
if (lines != null) {
String[] lineFromText = lines.split("\n");
List<Integer> list = new ArrayList<>();
for (int i = 0; i < lineFromText.length; i++) {
String[] wordsFromText = lineFromText[i].split("\\W");
num++;
for (String s : wordsFromText) {
if (s != null && lineFromText[i].contains(s)) {
list.add(num);
}
if (s != null && !s.trim().isEmpty() && s.matches("^[a-zA-Z]*$")) {
doInsert(s, comp, someObject, list);
}
}
}
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
return someObject;
}

Does something like this work for you?
It reads in the lines one at a time and finds the words by splitting on whitespace.
It then puts the words and the line numbers in a map where the
key is the word and the value is a list of line numbers.
int lineCount = 1;
String fileName = "SomeFileName";
Map<String, List<Integer>> index = new HashMap<>();
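// scan the file itself; new Scanner("fileName") would scan the literal string instead
// (new File(...) needs java.io.File and may throw FileNotFoundException)
Scanner scanner = new Scanner(new File(fileName));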
while (scanner.hasNextLine()) {
//get single line from file
String line = scanner.nextLine().toLowerCase();
//split into words
for (String word : line.split("\\s+")) {
// if the word already has a List in the map, add lineCount to it;
// otherwise create a new List first and then add lineCount
index.compute(word,
(wd, list) -> list == null ? new ArrayList<>()
: list).add(lineCount);
}
// bump lineCount for next line
lineCount++;
}
Print them out.
index.forEach((k, v) -> System.out.println(k + " --> " + v));

Related

How to ignore duplicate strings when using RegEx to match string?

EDIT: edited for clarity as to what I'm having trouble with. I'm not getting the right results because it's counting dupes. I HAVE to use RegEx. I could use a tokenizer, but I did not.
What I am trying to do here: there are 5 input files, and I need to calculate how many "USER DEFINED VARIABLES" there are. Please ignore the messy code, I'm just learning Java.
I replaced: everything within ( and ), all non-word characters, any statements such as int, main etc., any digit with a space in front of it, and any blank space with a new line, and then trimmed the result.
This leaves me with a list of strings which I match with my RegEx. However, at this point, how do I make my count include only unique identifiers?
EXAMPLE:
For example, in the input file I have attached beneath the code, I am receiving
"distinct/unique identifiers: 10" in my output file, when it should be "distinct/unique identifiers: 3"
And for example, in the 5th input file I have attached, I should have "distinct/unique identifiers: 3" instead I currently have "distinct/unique identifiers: 6"
I cannot use Set, Map etc.
Any help is great! Thanks.
import java.util.*;
import java.util.regex.*;
import java.io.*;
public class A1_123456789 {
public static void main(String[] args) throws IOException {
if (args.length < 1) {
System.out.println("Wrong number of arguments");
System.exit(1);
}
for (int i = 0; i < args.length; i++) {
FileReader jk = new FileReader(args[i]);
BufferedReader ij = new BufferedReader(jk);
FileWriter fw = null;
BufferedWriter bw = null;
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile("[_a-zA-Z][_a-zA-Z0-9]{0,30}");
String line;
int count = 0;
while ((line = ij.readLine()) != null) {
line = line.replaceAll("\\(([^\\)]+)\\)", " " );
line = line.replaceAll("[^\\w]", " ");
line = line.replaceAll("\\bint\\b|\\breturn\\b|\\bmain\\b|\\bprintf\\b|\\bif\\b|\\belse\\b|\\bwhile\\b", " ");
line = line.replaceAll(" \\d", "");
line = line.replaceAll(" ", "\n");
line = line.trim();
Matcher m = p.matcher(line);
while (m.find()) {
count++;
}
}
try {
String s1 = args[i];
String s2 = s1.replaceAll("input","output");
fw = new FileWriter(s2);
bw = new BufferedWriter(fw);
bw.write("distinct/unique identifiers: " + count);
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (bw != null) {
bw.close();
}
if (fw != null) {
fw.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}
//This is the 3rd input file below.
int celTofah(int cel)
{
int fah;
fah = 1.8*cel+32;
return fah;
}
int main()
{
int cel, fah;
cel = 25;
fah = celTofah(cel);
printf("Fah: %d", fah);
return 0;
}
//This is the 5th input file below.
int func2(int i)
{
while(i<10)
{
printf("%d\t%d\n", i, i*i);
i++;
}
}
int func1()
{
int i = 0;
func2(i);
}
int main()
{
func1();
return 0;
}
Try this
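// Declare this once, before the while loop, so it collects words across all lines
LinkedList<String> dtaa = new LinkedList<>();
// Inside the loop: split on whitespace (by now the line uses "\n" as separator)
String[] parts = line.split("\\s+");
for (int ii = 0; ii < parts.length; ii++) {
// add only non-empty words that have not been seen before
if (!parts[ii].isEmpty() && !dtaa.contains(parts[ii])) {
dtaa.add(parts[ii]);
}
}
// After the loop: the number of distinct words
count = dtaa.size();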
instead of
Matcher m = p.matcher(line);
while (m.find()) {
count++;
}
Amal Dev has suggested a correct implementation, but given the OP wants to keep Matcher, we have:
// Previous code to here
// Linked list of unique entries
LinkedList<String> uniqueMatches = new LinkedList<>();
// Existing code
while ((line = ij.readLine()) != null) {
line = line.replaceAll("\\(([^\\)]+)\\)", " " );
line = line.replaceAll("[^\\w]", " ");
line = line.replaceAll("\\bint\\b|\\breturn\\b|\\bmain\\b|\\bprintf\\b|\\bif\\b|\\belse\\b|\\bwhile\\b", " ");
line = line.replaceAll(" \\d", "");
line = line.replaceAll(" ", "\n");
line = line.trim();
Matcher m = p.matcher(line);
while (m.find()) {
// New code - get this match
String thisMatch = m.group();
// If we haven't seen this string before, add it to the list
if(!uniqueMatches.contains(thisMatch))
uniqueMatches.add(thisMatch);
}
}
// Now see how many unique strings we have collected
count = uniqueMatches.size();
Note I haven't compiled this, but hopefully it works as is...

how to fix error: cannot find symbol

I feel as if I am missing something really simple but I can't find it.
The goal of this code is to take a Shakespeare file and use a hash map to find the number of times a word is given by the text as well as words of "n" characters long. However I can't even get to the debugging portion because I get the error
Bard.java:13: error: cannot find symbol
Pattern getout = Pattern.compile("[\\w']+"); //this will take only the words
^ symbol: class Pattern location: class Bard
Bard.java:13: error: cannot find symbol
Pattern getout = Pattern.compile("[\\w']+"); //this will take only the words
plus a few more at other locations. Help would be greatly appreciated.
import java.io.*;
import java.util.*;
public class Bard {
public static void main(String[] args) {
HashMap < String, Integer > m1 = new HashMap < String, Integer > (); // sets the hashmap
//create file reader for the shakespere text
try (BufferedReader br = new BufferedReader(new FileReader("shakespeare.txt"))) {
String line = br.readLine();
Pattern getout = Pattern.compile("[\\w']+"); //this will take only the words
//create the hashmap
while (line != null) {
Matcher m = getout.matcher(line); //find the relevent information
while (m.find()) {
if (m1.get(m.group()) == null && !m.group().toUpperCase().equals(m.group())) { //find new word that is not in all caps.
m1.put(m.gourp(), 1);
} else { //increments the onld word
int newValue = m1.get(m.group());
newValue++;
m1.put(m.group, newValue);
}
}
line = br.readLine();
}
} catch (Exception e) {
e.printStackTrace();
}
try (BufferedReader br2 = new BufferedReader(new FileReader("input.txt"))) {
String line2 = br2.readLine();
FileWriter output = new FileWriter("analysis.txt");
while (line2 != null) {
if (line2.matches("[\\d\\s]+")) { // if i am dealing with the two integers
String[] args = line.split(" "); // split them up
wordSize = Integer.parseInt(args[0]); // set the first on the the word size
numberOfWords = Integer.parseInt(args[1]); // set the other one to the number of words wanted
String[] wordsToReturn = new String[numberOfWords]; //create array to place the words
int i = 0;
int j;
for (String word: m1.keySet()) { //
if (word.length() == wordSize) {
wordToReturn[i] = word;
i++;
}
for (j = 0; numberOfWords > j; j++) {
output.write(wordToReturn[j]);
}
}
} else {
output.write(m1.get(line2));
}
}
line2 = br2.readLine();
} catch (Exception e) {
e.printStackTrace();
}
}
}
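You have not imported the Pattern class (or Matcher). Both live in java.util.regex, which import java.util.*; does not cover, because a wildcard import does not include subpackages. Import them with: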
import java.util.regex.*;
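For reference, here is a minimal sketch of the counting step with the regex imports in place. The class name WordCount and the method countWords are made up for illustration, the input is a list of strings rather than shakespeare.txt so the example is self-contained, and skipping all-caps words is my reading of the condition in the question, not a confirmed requirement.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;   // not covered by import java.util.*;
import java.util.regex.Pattern;   // not covered by import java.util.*;

// Hypothetical class name, for illustration only.
class WordCount {
    // Same pattern as in the question: runs of word characters and apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Counts how often each word occurs.
    // Assumption: tokens that equal their own upper-case form are skipped
    // (this also skips one-letter capitals such as "I" and pure numbers).
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = WORD.matcher(line);
            while (m.find()) {
                String word = m.group();
                if (word.toUpperCase().equals(word)) {
                    continue;
                }
                // insert 1 for a new word, otherwise add 1 to the old count
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be, or not to be", "HAMLET speaks");
        System.out.println(countWords(lines));
    }
}
```

Using merge avoids the separate get/put branches, so there is no path where get returns null and the unboxing throws.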

Splitting up a text file into two files (java)

I need some help figuring out how to split a text file into two files in Java.
I have a text file in which each line contains, in alphabetical order, a word, a space and its index, i.e.
...
stand 345
stand 498
stare 894
...
What I would like to do is to read in this file and then write two separate files. One file should contain only one instance of the word and the other the positions of the word in the document.
The file is really big and I was wondering if I can use an array or a list to store the word and index before creating the file or if there is a better way.
I don't really know how to think.
I would suggest creating a HashMap with the word as key and a list of indices as value, declared as Map<String, List<String>>. This way you can easily check which words you have already put in the map and update their index lists.
List<String> list = map.get(word);
if (list == null)
{
list = new ArrayList<String>();
map.put(word, list);
}
list.add(index);
After reading and storing all values, you just need to iterate through the map and write its keys in one file and values in another.
// assumes map is declared as Map<String, List<String>>, so no cast is needed
for (Map.Entry<String, List<String>> entry : map.entrySet()) {
String key = entry.getKey();
List<String> value = entry.getValue();
// writing code here
}
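Putting the two snippets above together, a rough sketch could look like the following. The names WordIndexSplitter and split are hypothetical. I use a LinkedHashMap so the words come out in the order they were read, and the input is passed as a list of lines with Writers for the output so the example runs without any files.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, for illustration only.
class WordIndexSplitter {
    // Reads "word index" lines and writes each distinct word to wordsOut
    // and the indices of that word (space separated) to indicesOut.
    static void split(List<String> lines, Writer wordsOut, Writer indicesOut) {
        // LinkedHashMap keeps insertion order, so sorted input stays sorted
        Map<String, List<String>> map = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            map.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        PrintWriter words = new PrintWriter(wordsOut);
        PrintWriter indices = new PrintWriter(indicesOut);
        for (Map.Entry<String, List<String>> entry : map.entrySet()) {
            words.println(entry.getKey());
            indices.println(String.join(" ", entry.getValue()));
        }
        words.flush();
        indices.flush();
    }

    public static void main(String[] args) {
        StringWriter words = new StringWriter();
        StringWriter indices = new StringWriter();
        split(Arrays.asList("stand 345", "stand 498", "stare 894"), words, indices);
        System.out.print(words);
        System.out.print(indices);
    }
}
```

For real files you would read the lines with something like Files.newBufferedReader and pass Files.newBufferedWriter results as the two Writers. Keep in mind that the whole map is held in memory, so this only suits files that fit comfortably in the heap.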
If your file is really long, then you should consider using a database. If your file is not too big then you can use a HashMap. You can also use a class like this, it requires that the file is sorted, and it writes the words in one file and the indices in another file:
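import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
public class Split {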
private String fileName;
private PrintWriter fileWords;
private PrintWriter fileIndices;
public Split(String fname) {
fileName = fname;
if (initFiles()) {
writeList();
}
closeFiles();
}
private boolean initFiles() {
boolean retval = false;
try {
fileWords = new PrintWriter("words-" + fileName, "UTF-8");
fileIndices = new PrintWriter("indices-" + fileName, "UTF-8");
retval = true;
} catch (Exception e) {
System.err.println(e.getMessage());
}
return retval;
}
private void closeFiles() {
if (null != fileWords) {
fileWords.close();
}
if (null != fileIndices) {
fileIndices.close();
}
}
private void writeList() {
String lastWord = null;
List<String> wordIndices = new ArrayList<String>();
Path file = Paths.get(fileName);
Charset charset = Charset.forName("UTF-8");
try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
String line = null;
while ((line = reader.readLine()) != null) {
int len = line.length();
if (len > 0) {
int ind = line.indexOf(' ');
if (ind > 0 && ind < (len - 1)) {
String word = line.substring(0, ind);
String indice = line.substring(ind + 1, len);
if (!word.equals(lastWord)) {
if (null != lastWord) {
writeToFiles(lastWord, wordIndices);
}
lastWord = word;
wordIndices = new ArrayList<String>();
wordIndices.add(indice);
} else {
wordIndices.add(indice);
}
}
}
}
if (null != lastWord) {
writeToFiles(lastWord, wordIndices);
}
} catch (IOException x) {
System.err.format("IOException: %s%n", x);
}
}
private void writeToFiles(String word, List<String> list) {
boolean first = true;
fileWords.println(word);
for (String elem : list) {
if (first) {
first = false;
}
else {
fileIndices.print(" ");
}
fileIndices.print(elem);
}
fileIndices.println();
}
}
Note that the file name handling is not very robust. You can use it like this:
Split split = new Split("data.txt") ;
You can use this to save the words and the indices. You just need to call addLine for each line of your file.
Map<String, Set<Integer>> entries = new LinkedHashMap<>();
public void addLine(String word, Integer index) {
Set<Integer> indicesOfWord = entries.get(word);
if (indicesOfWord == null) {
entries.put(word, indicesOfWord = new TreeSet<>());
}
indicesOfWord.add(index);
}
To store them in separate files you can use this method:
public void storeInSeparateFiles(){
for (Entry<String, Set<Integer>> entry : entries.entrySet()) {
String word = entry.getKey();
Set<Integer> indices = entry.getValue();
// TODO: Save in separate files.
}
}

how to get specifics rows of 2d array returned by reading CSV file in java

This is my data.csv file. I want to extract the rows having classtype x (any number) and store those extracted rows in a new array, so if I have n classtypes I will end up with n new arrays.
age sex zipcode classtype
21 m 23423 1
12 f 23133 2
23 m 32323 2
23 f 23211 1
Example: If I want to retrieve the rows which have classtype 1 and store these values in a new 2D array, then the output should look like this:
array1={{21,m,23423,1},{23,f,23211,1}}
I have written the below code which gives me arrayList as output.
public class CsvParser {
public static void main(String[] args) {
try {
FileReader fr = new FileReader((args.length > 0) ? args[0] : "data.csv");
Map<String, List<String>> values = parseCsv(fr, "\\s,", true);
System.out.println(values);
} catch (IOException e) {
e.printStackTrace();
}
}
public static Map<String, List<String>> parseCsv(Reader reader, String separator, boolean hasHeader) throws IOException {
Map<String, List<String>> values = new LinkedHashMap<String, List<String>>();
List<String> columnNames = new LinkedList<String>();
BufferedReader br = null;
br = new BufferedReader(reader);
String line;
int numLines = 0;
while ((line = br.readLine()) != null) {
if (StringUtils.isNotBlank(line)) {
if (!line.startsWith("#")) {
String[] tokens = line.split(separator);
if (tokens != null) {
for (int i = 0; i < tokens.length; ++i) {
if (numLines == 0) {
columnNames.add(hasHeader ? tokens[i] : ("row_"+i));
} else {
List<String> column = values.get(columnNames.get(i));
if (column == null) {
column = new LinkedList<String>();
}
column.add(tokens[i]);
values.put(columnNames.get(i), column);
}
}
}
++numLines;
}
}
}
return values;
}
The output of this code is:
{age=[21,12,23,23],sex=[m,f,m,f],zipcode=[23423,23133,32323,23211],classtype=[1,2,2,1]}
I found a few links that talk about grouping elements with the Java Collectors class, but I don't know whether that is useful.
http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-
Your help will be very useful.
You can try something like
String[][] allArrays = new String[50][]; //Set it to however many you need
String classType = "1";
int counter = 0;
Scanner s = new Scanner(new File(fileName));
while(s.hasNextLine()) {
String row = s.nextLine();
if (row.endsWith(classType)) {
allArrays[counter++] = row.split(","); //Adds the row, with each element being split by the comma
}
}
Do not reinvent the wheel: you can use an existing library to load the contents of a CSV file into a Java collection. I usually use OpenCSV to read a CSV file into a List<String[]>. Reading everything is a one-liner.
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
List<String[]> lines= reader.readAll();
Then iterate the list like this to do the grouping.
Map<String, List<String[]>> values = new LinkedHashMap<String, List<String[]>>();
for(String[] line : lines){
String key = line[3]; // classtype is the 4th column, i.e. index 3
if(values.get(key) == null){
values.put(key, new ArrayList<String[]>());
}
values.get(key).add(line);
}
System.out.println(values);
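Since the question mentions Collectors.groupingBy, here is a small sketch of the same grouping done with streams. The names GroupByClassType, groupByLastColumn and rowsFor are made up for this example, and it assumes the header row has already been dropped and that classtype is always the last column.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical class name, for illustration only.
class GroupByClassType {
    // Groups the rows by the value in their last column (classtype).
    // LinkedHashMap keeps the groups in order of first appearance.
    static Map<String, List<String[]>> groupByLastColumn(List<String[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                row -> row[row.length - 1],
                LinkedHashMap::new,
                Collectors.toList()));
    }

    // Returns the rows of one class type as a 2D array (empty if there are none).
    static String[][] rowsFor(Map<String, List<String[]>> groups, String classType) {
        List<String[]> rows = groups.getOrDefault(classType, Collections.emptyList());
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"21", "m", "23423", "1"},
                new String[] {"12", "f", "23133", "2"},
                new String[] {"23", "m", "32323", "2"},
                new String[] {"23", "f", "23211", "1"});
        String[][] array1 = rowsFor(groupByLastColumn(rows), "1");
        System.out.println(Arrays.deepToString(array1));
    }
}
```

With n distinct classtype values the map has n entries, and rowsFor gives you the 2D array for whichever one you ask for.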

Combining a line counter method with a word counting method

I have a method that counts the occurrences of words in a text file and returns the number of times each word is found on a particular line. However, it doesn't keep track of which line number the words are located on. I have a separate method that counts the number of lines in the text file, and I would like to combine the two into one method that tracks the line numbers and keeps a log of the word occurrences on each line.
Here are the two methods I would like to combine to give a result something like "Word occurs X times on line Y":
public class Hash
{
private static final Object dummy = new Object(); // dummy variable
public void hashbuild()
{
File file = new File("getty.txt");
// LineNumberReader lnr1 = null;
String line1;
try{
Scanner scanner = new Scanner(file);
//lnr1 = new LineNumberReader(new FileReader("getty.txt"));
// try{while((line1 = lnr1.readLine()) != null)
// {}}catch(Exception e){}
while(scanner.hasNextLine())
{
String line= scanner.nextLine();
List<String> wordList1 = Arrays.asList(line.split("\\s+"));
Map<Object, Integer> hm = new LinkedHashMap<Object, Integer>();
for (Object item : wordList1)
{
Integer count = hm.get(item);
if (hm.put(item, (count == null ? 1 : count + 1))!=null)
{
System.out.println("Found Duplicate : " +item);
}
}
for ( Object key : hm.keySet() )
{
int value = hm.get( key );
if (value>1)
{
System.out.println(key + " occurs " + (value) + " times on line # "+lnr1.getLineNumber());
}
}
}
} catch (FileNotFoundException f)
{f.printStackTrace();}
}
}
Here is my original line counting method:
public void countLines()
{
LineNumberReader lnr = null; String line;
try
{
lnr = new LineNumberReader(new FileReader("getty.txt"));
while ((line = lnr.readLine()) != null)
{
System.out.print("\n" +lnr.getLineNumber() + " " +line);
}
System.out.println("\n");
}catch(Exception e){}
}
Why don't you just keep track of the line number in the while loop? Initialize a new int variable before the loop and increment it each time you call nextLine().
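A minimal sketch of that idea, assuming the same whitespace split as in the question. The class name LineWordCounter and the method report are made up for illustration, and the Scanner reads from a string here so the example runs without getty.txt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Hypothetical class name, for illustration only.
class LineWordCounter {
    // Returns one message per word that occurs more than once on a line.
    static List<String> report(Scanner scanner) {
        List<String> messages = new ArrayList<>();
        int lineNumber = 0; // the counter suggested above
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            lineNumber++; // bump once per line read
            // fresh map for every line, so counts are per line
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > 1) {
                    messages.add(e.getKey() + " occurs " + e.getValue()
                            + " times on line # " + lineNumber);
                }
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        // A Scanner over a String scans that text; use new Scanner(new File(...)) for a file.
        Scanner scanner = new Scanner("a b a\nc d\ne e e");
        report(scanner).forEach(System.out::println);
    }
}
```

This removes the need for the LineNumberReader entirely, since the Scanner loop already visits every line exactly once.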
