I have been working on an assignment in which I have to read words from a file, find the longest word, and check how many sub-words that longest word contains.
This should work for all the words in the file.
I tried it in Java. The code I wrote works for a small amount of data in the file, but my task is to process a huge amount of data.
Example:
File words: "call","me","later","hey","how","callmelater","now","iam","busy","noway","nowiambusy"
Output:
callmelater : subwords->call,me,later
Here I read the words from the file into a linked list, find the longest word and remove it from the list, and then check how many sub-words the extracted word contains.
Main Class Assignment:
import java.util.Scanner;
public class Assignment {
public static void main (String[] args){
long start = System.currentTimeMillis();
Assignment a = new Assignment();
a.throwInstructions();
Scanner userInput = new Scanner(System.in);
String filename = userInput.nextLine();
// String filename = "ab.txt";
// String filename = "abc.txt";
Logic testRun = new Logic(filename);
// //testRun.result();
long end = System.currentTimeMillis();
System.out.println("Time taken:"+(end - start) + " ms");
}
public void throwInstructions(){
System.out.println("Keep input file in same directory, where the code is");
System.out.println("Please specify the file name : ");
}
}
Class Logic for processing:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
public class Logic {
private String filename;
private File file;
private List<String> words = new LinkedList<String>();
private Map<String, String> matchedWords = new HashMap<>();
@Override
public String toString() {
return "Logic [words=" + words + "]";
}
// constructor
public Logic(String filename) {
this.filename = filename;
file = new File(this.filename);
fetchFile();
run();
result();
}
// find the such words and store in map
public void run() {
while (!words.isEmpty()) {
String LongestWord = extractLongestWord(words);
findMatch(LongestWord);
}
}
// find longest word
private String extractLongestWord(List<String> words) {
String longWord;
longWord = words.get(0);
int maxLength = words.get(0).length();
for (int i = 0; i < words.size(); i++) {
if (maxLength < words.get(i).length()) {
maxLength = words.get(i).length();
longWord = words.get(i);
}
}
words.remove(words.indexOf(longWord));
return longWord;
}
// find the match for word in array of sub words
private void findMatch(String LongestWord) {
boolean chunkFound = false;
int chunkCount = 0;
StringBuilder subWords = new StringBuilder();
for (int i = 0; i < words.size(); i++) {
if (LongestWord.indexOf(words.get(i)) != -1) {
subWords.append(words.get(i) + ",");
chunkFound = true;
chunkCount++;
}
}
if (chunkFound) {
matchedWords.put(LongestWord,
"\t" + (subWords.substring(0, subWords.length() - 1))
+ "\t:Subword Count:" + chunkCount);
}
}
// fetch data from file and store in list
public void fetchFile() {
String word;
try {
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
while ((word = br.readLine()) != null) {
words.add(word);
}
fr.close();
br.close();
} catch (FileNotFoundException e) {
// e.printStackTrace();
System.out
.println("ERROR: File -> "
+ file.toString()
+ " not Exists,Please check filename or location and try again.");
} catch (IOException e) {
// e.printStackTrace();
System.out.println("ERROR: Problem reading -> " + file.toString()
+ " File, Some problem with file format.");
}
}
// display result
public void result() {
Set set = matchedWords.entrySet();
Iterator i = set.iterator();
System.out.println("WORD:\tWORD-LENGTH:\tSUBWORDS:\tSUBWORDS-COUNT");
while (i.hasNext()) {
Map.Entry me = (Map.Entry) i.next();
System.out.print(me.getKey() + ": ");
System.out.print("\t" + ((String) me.getKey()).length() + ": ");
System.out.println(me.getValue());
}
}
}
This is where my program falls short: on large input it seems to run forever.
The complexity of my program is high.
To reduce the processing time I need a more efficient approach, something like binary search or merge sort, that runs in O(log n) or O(n log n).
If someone can help me with this, or at least suggest which direction I should take, I would appreciate it. Also, which programming language would be good for implementing such text-processing tasks quickly?
Thanks in advance.
This problem requires a Trie. But you have to augment your trie: a generic one will not do. Geek Viewpoint has a good Trie written in Java. Where your particular work will happen is in the method getWordList. Your getWordList will take as input the longest word (i.e. longestWord) and then try to see if each substring comprises words that exist in the dictionary. I think I have given you enough -- I can't do your work for you. But if you have further questions, don't hesitate to ask.
Other than in getWordList, you might be able to pretty much keep the trie from Geek Viewpoint the way it is.
You are also in luck because Geek Viewpoint demonstrates the trie using a Boggle example and your problem is a very very trivial version of Boggle.
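The trie idea above can be sketched roughly as follows. This is only an illustration under my own assumptions: the class and method names (SubwordTrie, insert, subwordsOf) are hypothetical and are not the API of the Geek Viewpoint trie, and the sketch reports every dictionary word found at any position inside the given word.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is not the Geek Viewpoint implementation.
class SubwordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Adds one dictionary word, one character per trie level.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    // Returns each occurrence of a dictionary word inside text,
    // excluding text itself. Overlapping matches are all reported.
    List<String> subwordsOf(String text) {
        List<String> found = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int end = start; end < text.length(); end++) {
                node = node.children.get(text.charAt(end));
                if (node == null) {
                    break; // no dictionary word continues along this path
                }
                boolean wholeText = start == 0 && end == text.length() - 1;
                if (node.endOfWord && !wholeText) {
                    found.add(text.substring(start, end + 1));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        SubwordTrie trie = new SubwordTrie();
        for (String w : new String[] {"call", "me", "later", "hey", "how", "callmelater"}) {
            trie.insert(w);
        }
        System.out.println(trie.subwordsOf("callmelater")); // prints [call, me, later]
    }
}
```

With this layout the cost of checking one word depends on the square of its length rather than on the number of words in the file, which is the main reason a trie suits large inputs here.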
Not sure I understand your context, but from reading the problem description it sounds to me like a linked list is an inappropriate data structure. You don't need to check every single word against the longest word.
A "trie" is probably a perfect data structure for this application.
But if you haven't learned about that in your class, then perhaps you can at least cut down your search space with hashtables. While you are doing the initial list processing calculating the longest word, you can simultaneously process each word into a hash table based on first letter. That way, when you are ready to check your longest word for subwords, you can check only those words with first letters in the longest word. (I'm assuming there could be overlapping words, unlike your example.)
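A minimal sketch of that first-letter hashtable idea, assuming the words are already in a List. The names FirstLetterIndex, index and subwords are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: buckets words by first letter to shrink the search space.
class FirstLetterIndex {
    // Groups the words by their first character.
    static Map<Character, List<String>> index(List<String> words) {
        Map<Character, List<String>> byFirstLetter = new HashMap<>();
        for (String w : words) {
            if (!w.isEmpty()) {
                byFirstLetter.computeIfAbsent(w.charAt(0), k -> new ArrayList<>()).add(w);
            }
        }
        return byFirstLetter;
    }

    // For each position in longest, tests only the words that start with
    // the letter found at that position.
    static List<String> subwords(String longest, Map<Character, List<String>> byFirstLetter) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < longest.length(); i++) {
            List<String> candidates =
                    byFirstLetter.getOrDefault(longest.charAt(i), Collections.emptyList());
            for (String candidate : candidates) {
                if (!candidate.equals(longest) && longest.startsWith(candidate, i)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> words = List.of("call", "me", "later", "hey", "how", "callmelater",
                "now", "iam", "busy", "noway", "nowiambusy");
        System.out.println(subwords("callmelater", index(words))); // prints [call, me, later]
    }
}
```

This still compares strings one by one inside each bucket, so it is a stopgap rather than a replacement for the trie, but it skips words such as "hey" or "busy" that cannot start anywhere in the longest word.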
Do you know anything about the input you will be receiving? If you have more details about the input word distribution, then you can customize your solution to the data you expect.
If you can choose your language, and time efficiency is important, you might want to switch to C++, as for many applications it's several times faster than Java.
I am trying to solve this question.
Problem Statement
You are developing a File Manager but encountered a problem. You realised that two files cannot have the same name and, if a conflict arises, the file which came later has to be appended with a number N such that N is the smallest positive number that is not used with that particular file name. The number is appended in the form file_name(N). Write code to solve your problem. You will be given an array of strings of file names. You need to assume that if a file name appears earlier in the array, it was created first.
NOTE: file_name and file_name(2) are two different file names, i.e. if a file name already has a number appended to it, it's a different file name.
Input
The first line contains N, the number of strings.
The next line contains N space-separated strings (file names).
Output
Print the names of files, after making the necessary changes separated by space.
Constraints
1 ≤ N ≤ 50
1 ≤ file_name.length ≤ 25
filename has no white space characters
Sample Input
7
file sample sample file file file(1) file(1)
Sample Output
file sample sample(1) file(1) file(2) file(1)(1) file(1)(2)
Below is my code. When I tested it with my own file names, it renamed them correctly, but when I submit it, the tests fail. I would like to know what's wrong with my code and why it's not working.
import java.util.Scanner;
public class Dcoder {
public static void main (String[] args) {
Scanner scanner = new Scanner (System.in);
// Read number of file names and create
// an array to hold them
String[] fileNames = new String[scanner.nextInt ()];
// Fill the array with the supplied names
// from System.in
for (int i = 0; i < fileNames.length; i++)
fileNames [i] = scanner.next ();
// Modify the file names
for (String fileName : fileNames) {
int count = 0;
for (int i = 0; i < fileNames.length; i++)
if (fileName.equals (fileNames [i])) {
fileNames [i] = fileNames [i] + (count == 0 ? "" : "(" + count + ")");
count++;
}
}
// Print out the modified list of file names
for (String fileName : fileNames)
System.out.print (" " + fileName);
}
}
If all tests fail, then it is likely because your output has a space before the first name.
The output should be the file name, space-separated, not space-prefixed.
If you try input "file file(1) file file", your code outputs
file file(1) file(1)(1) file(2)
but correct output is
file file(1) file(2) file(3)
For better performance, you should use a Set.
static void printUnique(String... fileNames) {
Set<String> used = new HashSet<>();
for (int i = 0; i < fileNames.length; i++) {
String newName = fileNames[i];
for (int j = 1; ! used.add(newName); j++)
newName = fileNames[i] + "(" + j + ")";
if (i != 0)
System.out.print(" ");
System.out.print(newName);
}
System.out.println();
}
Test
printUnique("file", "sample", "sample", "file", "file", "file(1)", "file(1)");
printUnique("file", "file(1)", "file", "file");
Output
file sample sample(1) file(1) file(2) file(1)(1) file(1)(2)
file file(1) file(2) file(3)
Your solution is a procedural approach to the Problem.
Procedural approaches are not bad on their own.
But Java is an Object Oriented programming language and if you want to become a good Java programmer you should start looking for more OO-like solutions.
But OOP doesn't mean to "split up" code into random classes.
The ultimate goal of OOP is to reduce code duplication, improve readability and support reuse as well as extending the code.
Doing OOP means that you follow certain principles which are (among others):
information hiding / encapsulation
single responsibility
separation of concerns
KISS (Keep it simple (and) stupid.)
DRY (Don't repeat yourself.)
"Tell! Don't ask."
Law of demeter ("Don't talk to strangers!")
So what could a more OO-like approach look like?
The underlying question of that problem is: "How often does a specific file name appear in the input?" We want to find an association between Strings (file names) and integer values (number of occurrences). This could be represented as a Map<String,Integer>. The whole logic is as simple as checking whether the current fileName already exists in the output and, if so, adding the counter suffix. This means we need another Collection to hold the output.
My Solution would look like this:
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
public class FileNameCounter {
public List<String> renameDoubledFiles(List<String> input) {
Map<String, Integer> occurrencesOfNames = new HashMap<>();
LinkedList<String> output = new LinkedList<>();
for (String fileName : input) {
if (output.contains(fileName)) {
Integer counter = updateCountFor(fileName, occurrencesOfNames);
String suffixedName = appendCounterSuffix(fileName, counter);
output.add(suffixedName);
} else {
output.add(fileName);
}
}
return output;
}
private Integer updateCountFor(String fileName, Map<String, Integer> occurrencesOfNames) {
Integer counter = occurrencesOfNames.getOrDefault(fileName, Integer.valueOf(0));
occurrencesOfNames.put(fileName, ++counter);
return counter;
}
private String appendCounterSuffix(String fileName, Integer counter) {
return String.format("%s(%d)", fileName, counter);
}
}
and here is the JUnit test to prove that it works:
import static org.junit.jupiter.api.Assertions.*;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.junit.jupiter.api.Test;
class FileNameCounterTest {
@Test
void test() {
List<String> input = Arrays.asList("file sample sample file file file(1) file(1)".split(" "));
List<String> renamedDoubledFiles = new FileNameCounter().renameDoubledFiles(input);
String output = renamedDoubledFiles.stream().collect(Collectors.joining(" "));
assertEquals("file sample sample(1) file(1) file(2) file(1)(1) file(1)(2)", output);
}
}
Edit: after reviewing Tormod's answer and implementing his advice.
As the title states I'm attempting to print the total number of different words after receiving a file name from command line input. I receive the following message after attempting to compile the program:
Note: Project.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Here is my code. Any help is greatly appreciated:
import java.lang.*;
import java.util.*;
import java.io.*;
public class Project {
public static void main(String[] args) throws IOException {
File file = new File(args[0]);
Scanner s = new Scanner(file);
HashSet lib = new HashSet<>();
try (Scanner sc = new Scanner(new FileInputStream(file))) {
int count = 0;
while(sc.hasNext()) {
sc.next();
count++;
}
System.out.println("The total number of word in the file is: " + count);
}
while (s.hasNext()) {
String data = s.nextLine();
String[] pieces = data.split("\\s+");
for (int count = 0; count < pieces.length; count++)
{
if(!lib.contains(pieces[count])) {
lib.add(pieces[count]);
}
}
}
System.out.print(lib.size());
}
}
I would implement it using a HashSet: add all the words, then read out the size. If you want to make it case-insensitive, just convert all the words to uppercase or something like that. This uses some memory but...
One problem with your algorithm is that you have only one "words" array, and it holds only the words on the current line, so you only count repeated words within the same line.
A HashSet stores strings by their hash value, and thus stores each word only once.
construction: Set<String> lib = new HashSet<>(); (declaring the element type avoids the "unchecked or unsafe operations" note)
inside the loop: if(!lib.contains(word)){lib.add(word);}
check the word count: lib.size()
for(String s : words) {
if(s.equals(word))
count++;
}
You are comparing the words to an empty String; since each entry is a word, the comparison is always going to be false.
Like Tormod said, the best would be to store the words in a HashSet, as it won't keep duplicates. Then just read out its size.
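As a rough sketch of the HashSet approach described above: the class name DistinctWords is made up, and the example scans a string instead of a file so that it runs on its own. In the real program the Scanner would be built from the File instead.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Locale;
import java.util.Scanner;
import java.util.Set;

// Hypothetical example class; reads from any Readable so it can be tried without a file.
class DistinctWords {
    // Counts distinct whitespace-separated words, ignoring case.
    static int countDistinct(Readable source) {
        Set<String> seen = new HashSet<>(); // typed set, so no unchecked warning
        try (Scanner sc = new Scanner(source)) {
            while (sc.hasNext()) {
                // add() ignores words that are already present
                seen.add(sc.next().toLowerCase(Locale.ROOT));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String text = "The cat and the hat\nand THE bat";
        System.out.println(countDistinct(new StringReader(text))); // prints 5
    }
}
```

Because the set lives outside the reading loop, words repeated on different lines are still counted only once.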
import java.io.*;
import java.util.*;
public class ListSetMap2
{
public static void main(String[] args)
{
Map<String, Integer> my_collection = new HashMap<String, Integer>();
Scanner keyboard = new Scanner(System.in);
System.out.println("Enter a file name");
String filenameString = keyboard.nextLine();
File filename = new File(filenameString);
int word_position = 1;
int word_num = 1;
try
{
Scanner data_store = new Scanner(filename);
System.out.println("Opening " + filenameString);
while(data_store.hasNext())
{
String word = data_store.next();
if(word.length() > 5)
{
if(my_collection.containsKey(word))
{
my_collection.get(my_collection.containsKey(word));
Integer p = (Integer) my_collection.get(word_num++);
my_collection.put(word, p);
}
else
{
Integer i = (Integer) my_collection.get(word_num);
my_collection.put(word, i);
}
}
}
}
catch (FileNotFoundException e)
{
System.out.println("Nope!");
}
}
}
I'm trying to write a program that scans a file, logs the words in a HashMap collection, and counts the number of times each word occurs in the document, with only words over 5 characters being counted.
It's a bit of a mess in the middle, but I'm running into issues with how to count the number of times a word occurs while keeping an individual count for each word. I'm sure there is a simple solution here and I'm just missing it. Please help!
Your logic for setting the frequency of a word is wrong. Here is a simple approach that should work for you:
// if the word is already present in the hashmap
if (my_collection.containsKey(word)) {
// just increment the current frequency of the word
// this overrides the existing frequency
my_collection.put(word, my_collection.get(word) + 1);
} else {
// since the word is not there just put it with a frequency 1
my_collection.put(word, 1);
}
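For comparison, the same update can be written with Map.merge (available since Java 8). This is just a sketch: the class name WordFrequencies and the minLength parameter are introduced here for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing the containsKey/put pattern collapsed into merge().
class WordFrequencies {
    // Counts how often each word longer than minLength characters occurs.
    static Map<String, Integer> count(String[] words, int minLength) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String word : words) {
            if (word.length() > minLength) {
                // Stores 1 for a new word, otherwise adds 1 to the stored count.
                frequencies.merge(word, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "apple", "banana", "cherry", "kiwi", "banana"};
        Map<String, Integer> result = count(words, 5);
        System.out.println(result.get("banana")); // prints 3
        System.out.println(result.get("cherry")); // prints 1
    }
}
```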
(Only giving hints, since this seems to be homework.) my_collection is (correctly) a HashMap that maps String keys to Integer values; in your situation, a key is supposed to be a word, and the corresponding value is supposed to be the number of times you have seen that word (frequency). Each time you call my_collection.get(x), the parameter x needs to be a String, namely the word whose frequency you want to know (unfortunately, HashMap doesn't enforce this). Each time you call my_collection.put(x, y), x needs to be a String, and y needs to be an Integer or int, namely the frequency for that word.
Given this, give some more thought to what you're using as parameters, and the sequence in which you need to make the calls and how you need to manipulate the values. For example, if you've already determined that my_collection doesn't contain the word, does it make sense to ask my_collection for the word's frequency? If it does contain the word, how do you need to change the frequency before putting the new value into my_collection?
(Also, please choose a more descriptive name for my_collection, e.g. frequencies.)
Try this way -
while(data_store.hasNext()) {
String word = data_store.next();
if(word.length() > 5){
if(my_collection.get(word)==null) my_collection.put(word, 1);
else{
my_collection.put(word, my_collection.get(word)+1);
}
}
}
This is my code. It produces the error java.util.NoSuchElementException.
It is meant to search a file, example.txt, for a word (e.g. and), find all instances of the word, and print the words on either side of it as well (e.g. cheese and ham, tom and jerry) in ONE JOptionPane. Code:
import java.io.File;
import java.util.Arrays;
import java.util.Scanner;
import javax.swing.JOptionPane;
public class openFileSearchWord {
public static void main(String Args[])
{
int i=0,j=0;
String searchWord = JOptionPane.showInputDialog("What Word Do You Want To Search For?");
File file = new File("example.txt");
try
{
Scanner fileScanner = new Scanner(file);
String[] array = new String[5];
String[] input = new String[1000];
while (fileScanner.hasNextLine())
{
for(i=0;i<1000;i++)
{
input[i] = fileScanner.next();
if(input[i].equalsIgnoreCase(searchWord))
{
array[j] = input[i-1] + input[i] + input[i+1];
j++;
}
}
}
Arrays.toString(array);
JOptionPane.showMessageDialog(null, array);
fileScanner.close();
}
catch(Exception e)
{
System.out.println(e);
}
}
}
It looks like you're assuming each line will have 1000 words.
while (fileScanner.hasNextLine())
{
for(i=0;i<1000;i++) // <-- hardcoded limit?
{
....
}
}
You can try adding another catch block, or check hasNext() in the for loop's condition.
while (fileScanner.hasNextLine())
{
for(i=0;i<1000 && fileScanner.hasNext();i++)
{
....
}
}
There are also many issues with your code, like if input[i-1] hits the -1 index, or if your 'array' array hits the limit.
I took the liberty to have some fun.
Scanner fileScanner = new Scanner(file);
List<String> array = new ArrayList<String>();
String previous = "", current = "", next; // start empty so the first comparison is safe
while (fileScanner.hasNext())
{
next = fileScanner.next(); // Get the next word
if(current.equalsIgnoreCase(searchWord))
{
array.add( previous + current + next );
}
// Shift stuff
previous = current;
current = next;
next = "";
}
fileScanner.close();
// Edge case check - if the last word was the keyword
if(current.equalsIgnoreCase(searchWord))
{
array.add( previous + current );
}
// Do whatever with array
....
I see a few errors here...
You are creating two arrays, one with 5 and one with 1000 elements.
In your code you are referencing elements directly by index, but that index might not exist.
input[i-1] ... what if i = 0? The index is -1.
array[j] ... what if j > 4? Index 5 doesn't exist.
I suggest using a List of elements instead of fixed arrays.
List<String> array = new ArrayList<>();
You are assuming that the input has a certain shape but don't do anything to check what it actually is.
Just as Drejc told you, a match on the very first word would fail because of the negative index, and the program will also fail if it finds more than 5 matches of the desired word.
I also want to add another point. Think about what happens when you execute this line:
array[j] = input[i-1] + input[i] + input[i+1];
You have not assigned input[i+1] yet. In that iteration you have only assigned input[i], not the next element.
You should build the concatenation of the three elements (previousWord + match + nextWord) when you reach nextWord.
Another solution, although less efficient, would be to copy all the words into an array or List at the beginning and then run your search over it with index checks. This would work, but you would go through all the words twice.
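The two-pass variant mentioned last could look roughly like this. It is a sketch, not the asker's program: NeighbourSearch and phrasesAround are hypothetical names, and the text is passed in as a String so the example runs without example.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: collects every word first, then looks at neighbours by index.
class NeighbourSearch {
    static List<String> phrasesAround(String text, String searchWord) {
        // First pass: copy all words into a growable list.
        List<String> words = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext()) {
                words.add(sc.next());
            }
        }
        // Second pass: guard both neighbours with bounds checks.
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).equalsIgnoreCase(searchWord)) {
                StringBuilder phrase = new StringBuilder();
                if (i > 0) {
                    phrase.append(words.get(i - 1)).append(' ');
                }
                phrase.append(words.get(i));
                if (i + 1 < words.size()) {
                    phrase.append(' ').append(words.get(i + 1));
                }
                phrases.add(phrase.toString());
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        System.out.println(phrasesAround("cheese and ham tom and jerry", "and"));
        // prints [cheese and ham, tom and jerry]
    }
}
```

To show everything in one dialog, the returned list can be joined with String.join("\n", phrases) and passed to JOptionPane.showMessageDialog as the message.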
Write a Java program to read input from a file, and then sort the characters within each word. Once you have done that, sort all the resulting words in ascending order, followed finally by the sum of the numeric values in the file.
Remove the special characters and stop words while processing the data.
Measure the time taken to execute the code.
Let's say the content of the file is: Sachin Tendulkar scored 18111 ODI runs and 14692 Test runs.
Output: achins adeklnrtu adn cdeors dio estt nrsu nrsu 32803
Time Taken: 3 milliseconds
My code takes 15 milliseconds to execute.
Please suggest a faster way to solve this problem.
Code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;
public class Sorting {
public static void main(String[] ags)throws Exception
{
long st=System.currentTimeMillis();
int v=0;
List ls=new ArrayList();
//To read data from file
BufferedReader in=new BufferedReader(
new FileReader("D:\\Bhive\\File.txt"));
String read=in.readLine().toLowerCase();
//Spliting the string based on spaces
String[] sp=read.replaceAll("\\.","").split(" ");
for(int i=0;i<sp.length;i++)
{
//Check for the array if it matches number
if(sp[i].matches("(\\d+)"))
//Adding the numbers
v+=Integer.parseInt(sp[i]);
else
{
//sorting the characters
char[] c=sp[i].toCharArray();
Arrays.sort(c);
String r=new String(c);
//Adding the resulting word into list
ls.add(r);
}
}
//Sorting the resulting words in ascending order
Collections.sort(ls);
//Appending the number in the end of the list
ls.add(v);
//Displaying the string using Iterator
Iterator it=ls.iterator();
while(it.hasNext())
System.out.print(it.next()+" ");
long time=System.currentTimeMillis()-st;
System.out.println("\n Time Taken:"+time);
}
}
Use indexOf() to extract words from your string instead of split(" "). It improves performance.
See this thread: Performance of StringTokenizer class vs. split method in Java
Also, try to increase the size of the output, copy-paste the line Sachin Tendulkar scored 18111 ODI runs and 14692 Test runs. 50,000 times in the text file and measure the performance. That way, you will be able to see considerable time difference when you try different optimizations.
EDIT
Tested this code (used .indexOf())
long st = System.currentTimeMillis();
int v = 0;
List ls = new ArrayList();
// To read data from file
BufferedReader in = new BufferedReader(new FileReader("D:\\File.txt"));
String read = in.readLine().toLowerCase();
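read = read.replaceAll("\\.", ""); // Strings are immutable, so keep the result
int pos = 0, end;
while (pos < read.length()) {
// The last token has no trailing space, so fall back to the end of the string
end = read.indexOf(' ', pos);
if (end < 0) end = read.length();
String curString = read.substring(pos,end);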
pos = end + 1;
// Check for the array if it matches number
try {
// Adding the numbers
v += Integer.parseInt(curString);
}
catch (NumberFormatException e) {
// sorting the characters
char[] c = curString.toCharArray();
Arrays.sort(c);
String r = new String(c);
// Adding the resulting word into the list
ls.add(r);
}
}
//sorting the list
Collections.sort(ls);
//adding the number
ls.add(v);
// Displaying the list with a raw Iterator, because it now holds Strings and an Integer
Iterator it = ls.iterator();
while (it.hasNext()) {
System.out.print(it.next() + " ");
}
long time = System.currentTimeMillis() - st;
System.out.println("\n Time Taken: " + time + " ms");
Performance using 1 line in file
Your code: 3 ms
My code: 2 ms
Performance using 50K lines in file
Your code: 45 ms
My code: 32 ms
As you see, the difference is significant when the input size increases. Please test it on your machine and share results.
The only thing I see: the following line is needlessly expensive:
System.out.print(it.next()+" ");
That's because print is inefficient, due to all the flushing going on. Instead, construct the entire string using a string builder, and then reduce to one call of print.
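A small sketch of that suggestion, with hypothetical names (JoinedOutput, render): the whole line is assembled in a StringBuilder and printed once. Whether this makes a measurable difference will depend on the output size and on where the output goes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the full output line before a single print call.
class JoinedOutput {
    static String render(List<String> sortedWords, int sum) {
        StringBuilder out = new StringBuilder();
        for (String word : sortedWords) {
            out.append(word).append(' ');
        }
        out.append(sum); // the numeric total goes last
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("achins", "adeklnrtu", "adn");
        System.out.println(render(words, 32803)); // prints achins adeklnrtu adn 32803
    }
}
```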
I removed the list and used arrays only. On my machine your code takes 6 ms; with arrays only it takes 4 to 5 ms. Run this code on your machine and let me know the time.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;
public class Sorting {
public static void main(String[] ags)throws Exception
{
long st=System.currentTimeMillis();
int v=0;
//To read data from file
BufferedReader in=new BufferedReader(new FileReader("File.txt"));
String read=in.readLine().toLowerCase();
//Spliting the string based on spaces
String[] sp=read.replaceAll("\\.","").split(" ");
int j=0;
for(int i=0;i<sp.length;i++)
{
//Check for the array if it matches number
if(sp[i].matches("(\\d+)"))
//Adding the numbers
v+=Integer.parseInt(sp[i]);
else
{
//sorting the characters
char[] c=sp[i].toCharArray();
Arrays.sort(c);
read=new String(c);
sp[j]= read;
j++;
}
}
//Sorting only the first j entries, which hold the character-sorted words
Arrays.sort(sp, 0, j);
//Displaying the words, followed by the sum of the numbers
for(int i=0;i<j; i++)
System.out.print(sp[i]+" ");
System.out.print(v);
st=System.currentTimeMillis()-st;
System.out.println("\n Time Taken:"+st);
}
}
I ran the same code using a PriorityQueue instead of a List. Also, as nes1983 suggested, building the output string first, instead of printing every word individually helps reduce the runtime.
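The PriorityQueue version was not posted, so the following is only my guess at what such a variant might look like. QueueSorting and process are hypothetical names, and the input is taken as a String so the example is self-contained.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical reconstruction: words go into a PriorityQueue and are polled in sorted order.
class QueueSorting {
    static String process(String line) {
        PriorityQueue<String> words = new PriorityQueue<>();
        int sum = 0;
        for (String token : line.toLowerCase().replace(".", "").split(" ")) {
            if (token.isEmpty()) {
                continue; // skip gaps left by repeated spaces
            }
            if (token.chars().allMatch(Character::isDigit)) {
                sum += Integer.parseInt(token);
            } else {
                char[] c = token.toCharArray();
                Arrays.sort(c); // sort the characters within the word
                words.add(new String(c));
            }
        }
        // poll() always returns the smallest remaining word
        StringBuilder out = new StringBuilder();
        while (!words.isEmpty()) {
            out.append(words.poll()).append(' ');
        }
        return out.append(sum).toString();
    }

    public static void main(String[] args) {
        System.out.println(process("Sachin Tendulkar scored 18111 ODI runs and 14692 Test runs."));
        // prints achins adeklnrtu adn cdeors dio estt nrsu nrsu 32803
    }
}
```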
My runtime after these modifications was definitely reduced.
I modified the code further, also including @Teja's logic, and the runtime went from 2 milliseconds to 1 millisecond:
long st=System.currentTimeMillis();
BufferedReader in=new BufferedReader(new InputStreamReader(new FileInputStream("D:\\Bhive\\File.txt")));
String read= in.readLine().toLowerCase();
String[] sp=read.replaceAll("\\.","").split(" ");
int v=0;
int len = sp.length;
int j=0;
for(int i=0;i<len;i++)
{
if(isNum(sp[i]))
v+=Integer.parseInt(sp[i]);
else
{
char[] c=sp[i].toCharArray();
Arrays.sort(c);
String r=new String(c);
sp[j] = r;
j++;
}
}
Arrays.sort(sp, 0, j); // only the first j entries hold the character-sorted words
long time=System.currentTimeMillis()-st;
System.out.println("\n Time Taken:"+time);
for(int i=0;i<j; i++)
System.out.print(sp[i]+" ");
System.out.print(v);
I wrote a small utility that checks whether a string consists only of digits, to use instead of the regular expression:
private static boolean isNum(String cs){
// An empty string is not a number
if(cs.isEmpty())
{
return false;
}
for(char c : cs.toCharArray())
{
// Any non-digit character means Integer.parseInt would fail
if(!Character.isDigit(c))
{
return false;
}
}
return true;
}
Calculate the time before calling the System.out operations, as they are blocking.