Problem Description
I have a list of strings that contains 8000 items. The items in the list are shown below.
List<String> stringList = new ArrayList<String>(8000);
stringList.add("this is first string.");
stringList.add("text which I want to search.");
stringList.add("separated string items.");
....
So you can see that every item in my list is a sentence that has more than three words.
Question.
A user can search through the list in the following way. For example, if the user wants to search for the word "first", the search algorithm must work like this.
The search algorithm must run over the list and compare the word "first" with every word in each sentence; if any word in the sentence starts with "first", it must return that sentence. To implement this algorithm I wrote the code shown below.
The algorithm I implemented is very slow, so I want to know whether there is a faster algorithm, or how I can make my algorithm faster.
Code Example
Iterator<String> stringListIter = stringList.iterator();
while (stringListIter.hasNext()) {
    String currItem = stringListIter.next();
    String[] separatedStr = currItem.split(" ");
    for (int i = 0; i < separatedStr.length; ++i) {
        if (separatedStr[i].startsWith(textToFind)) {
            retList.add(currItem);
            break; // add each sentence at most once
        }
    }
}
You could use the String#contains method along with String#startsWith instead of splitting your String and searching each token.
String currItem = stringListIter.next();
if(currItem.startsWith(textToFind.concat(space))){
retList.add(currItem);
} else if(currItem.endsWith(space.concat(textToFind))){
retList.add(currItem);
} else if(currItem.contains(space.concat(textToFind).concat(space))){
retList.add(currItem);
} else if(currItem.equals(textToFind)){
retList.add(currItem);
}
First if - Checks if it's the first word.
Second if - Checks if it's the last word.
Third if - Checks if it's somewhere in the middle.
Last if - Checks if it's the only word.
I would hold a Map<String, Set<Integer>> where every word is a key and the value is the set of indexes of the sentences that contain this word.
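A minimal sketch of this index idea, under some assumptions of my own: the class name PrefixIndex and the method search are illustrative, and I use a TreeMap rather than a plain HashMap because its sorted keys let us take the range of words that start with the searched text. Words are split on single spaces, as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PrefixIndex {
    // word -> indexes of the sentences that contain it (sorted keys allow prefix ranges)
    private final TreeMap<String, Set<Integer>> index = new TreeMap<>();
    private final List<String> sentences;

    public PrefixIndex(List<String> sentences) {
        this.sentences = sentences;
        for (int i = 0; i < sentences.size(); i++) {
            for (String word : sentences.get(i).split(" ")) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(i);
            }
        }
    }

    // Returns every sentence that has at least one word starting with prefix.
    public List<String> search(String prefix) {
        Set<Integer> hits = new TreeSet<>();
        // Keys in the range [prefix, prefix + '\uffff') all start with prefix.
        for (Set<Integer> ids : index.subMap(prefix, prefix + Character.MAX_VALUE).values()) {
            hits.addAll(ids);
        }
        List<String> result = new ArrayList<>();
        for (int id : hits) {
            result.add(sentences.get(id));
        }
        return result;
    }
}
```

The index is built once; each search then touches only the matching words instead of all 8000 sentences.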
A task perfectly suited for Lucene.
for(String s : yourList){
if(s.contains(textToFind)){
retList.add(s);
}
}
Related
I am having a problem with Java programming that involves a List. Basically, I am trying to count the occurrences of each word in each sentence, from a list containing several sentences.
The code for the list containing sentences is as below:
List<List<String>> sort = new ArrayList<>();
for (String sentence : complete.split("[.?!]\\s*"))
{
sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); //put each sentences in list
}
The output from the list is as follows:
[hurricane, gilbert, head, dominican, coast]
[hurricane, gilbert, sweep, dominican, republic, sunday, civil, defense, alert, heavily, populate, south, coast, prepare, high, wind]
[storm, approach, southeast, sustain, wind, mph, mph]
[there, alarm, civil, defense, director, a, television, alert, shortly]
The desired output should be as follows (only an example). It should list every unique word in the list and count its occurrences per sentence.
Word: hurricane
Sentence 1: 1 times
Sentence 2: 1 times
Sentence 3: 0 times
Sentence 4: 0 times
Word: gilbert
Sentence 1: 0 times
Sentence 2: 2 times
Sentence 3: 1 times
Sentence 4: 0 times
Word: head
Sentence 1: 3 times
Sentence 2: 2 times
Sentence 3: 0 times
Sentence 4: 0 times
and goes on....
In the example above, the word 'hurricane' occurs 1 time in the first sentence, 1 time in the second sentence, and not at all in the third and fourth sentences.
How do I achieve this output? I was thinking of building a 2D matrix. Any help will be appreciated. Thanks!
This is a working solution. I did not take care of the printing. The result is a Map -> Word, Array. Where Array contains the count of Word in each sentence indexed from 0. Runs in O(N) time. Play here: https://repl.it/Bg6D
List<List<String>> sort = new ArrayList<>();
Map<String, ArrayList<Integer>> res = new HashMap<>();
// split by sentence
for (String sentence : someText.split("[.?!]\\s*")) {
sort.add(Arrays.asList(sentence.split("[ ,;:]+"))); //put each sentences in list
}
// put all word in a hashmap with 0 count initialized
final int sentenceCount = sort.size();
sort.stream().forEach(sentence -> sentence.stream().forEach(s -> res.put(s, new ArrayList<Integer>(Collections.nCopies(sentenceCount, 0)))));
int index = 0;
// count the occurrences of each word for each sentence.
for (List<String> sentence: sort) {
for (String s : sentence) {
res.get(s).set(index, res.get(s).get(index) + 1);
}
index++;
}
EDIT:
In answer to your comment.
List<Integer> getSentence(int sentence, Map<String, ArrayList<Integer>> map) {
return map.entrySet().stream().map(e -> e.getValue().get(sentence)).collect(Collectors.toList());
}
Then you can call
List<Integer> sentence0List = getSentence(0, res);
However, be aware that this approach is not optimal, since it walks every entry of the map and therefore runs in O(W) time, with W being the number of distinct words. For a small vocabulary it is totally fine, but it does not scale. You have to decide what you will do with the result. If you need to call getSentence many times, this is not the correct approach. In that case you will need the data structured differently, something like:
Sentences = [
{'word1': N, 'word2': N},... // sentence 1
{'word1': N, 'word2': N},... // sentence 2
]
So you are able to easily access the word count per each sentence.
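A hedged sketch of that per-sentence layout in Java, assuming the same List<List<String>> input as above. The class and method names (SentenceCounts, countPerSentence) are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SentenceCounts {
    // One map per sentence: word -> number of times it occurs in that sentence.
    static List<Map<String, Integer>> countPerSentence(List<List<String>> sentences) {
        List<Map<String, Integer>> result = new ArrayList<>();
        for (List<String> sentence : sentences) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : sentence) {
                counts.merge(word, 1, Integer::sum); // start at 1, otherwise add 1
            }
            result.add(counts);
        }
        return result;
    }
}
```

With this structure, the counts for sentence i are simply result.get(i), and a word that does not occur in that sentence is absent from its map.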
EDIT 2:
Call this method:
Map<String, Float> getFrequency(Map<String, ArrayList<Integer>> stringMap) {
Map<String, Float> res = new HashMap<>();
stringMap.entrySet().stream().forEach(e -> res.put(e.getKey()
, e.getValue().stream().mapToInt(Integer::intValue).sum() / (float)e.getValue().size()));
return res;
}
Will return something like this:
{standard=0.25, but=0.25, industry's=0.25, been=0.25, 1500s=0.25, software=0.25, release=0.25, type=0.5, when=0.25, dummy=0.5, Aldus=0.25, only=0.25, passages=0.25, text=0.5, has=0.5, 1960s=0.25, Ipsum=1.0, five=0.25, publishing=0.25, took=0.25, centuries=0.25, including=0.25, in=0.25, like=0.25, containing=0.25, printer=0.25, is=0.25, t
You could solve your problem by first creating an index of the words. You could use a HashMap and just put in every single word you find in your text (so you would have no need to check for duplicates).
Then you can iterate the HashMap and check for every Word in every sentence. You can count occurrences by using the indexOf method of your list. As long as it returns a value greater than -1 you can count up the occurrence in the sentence. This method does only return the first occurrence so you
Some Pseudocode would be like:
Array sentences = text.split(sentence delimiter)
for each word in text
put word on hashmap
for each entry in hashmap
for each sentence
int count = 0
while subList(count, sentence.length) indexOf(entry) > -1
count for entry ++
Note that this is very greedy and not performance oriented at all. Oh yea, and also note, that there are some java nlp libraries out there which may have already solved your problem in a performance oriented and reusable way.
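The pseudocode above could be sketched in Java roughly as follows. This is not the answerer's code: I swap the subList/indexOf loop for Collections.frequency, which counts all occurrences in one call, and the names WordMatrix and count are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WordMatrix {
    // word -> list of counts, one entry per sentence (same order as the input).
    static Map<String, List<Integer>> count(List<List<String>> sentences) {
        // Collect every distinct word first; a set ignores duplicates.
        Set<String> vocabulary = new LinkedHashSet<>();
        for (List<String> sentence : sentences) {
            vocabulary.addAll(sentence);
        }
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String word : vocabulary) {
            List<Integer> perSentence = new ArrayList<>();
            for (List<String> sentence : sentences) {
                perSentence.add(Collections.frequency(sentence, word));
            }
            result.put(word, perSentence);
        }
        return result;
    }
}
```

As the answer warns, this scans every sentence once per distinct word, so it is simple rather than fast.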
First you can segment your sentences and then tokenize them using a text segmentor such as NLTK or Stanford tokenizer. Splitting the string (containing sentences) around "[.?!]" is not a good idea. What happens to an "etc." or "e.g." that occurs in the middle of the sentence? Splitting a sentence around "[ ,;:]" is also not a good idea. You can have plenty of other symbols in a sentence such as quotation marks, dash and so on.
After segmentation and tokenization you can split your sentences around space and store them in a List<List<String>>:
List<List<String>> sentenceList = new ArrayList<>();
Then for your index you can create a HashMap<String,List<Integer>>:
HashMap<String,List<Integer>> words = new HashMap<>();
Keys are all words in all sentences. Values you can update as follows:
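// assumes every list in words already holds one 0 per sentence
for (int i = 0; i < sentenceList.size(); i++) {
    List<String> sentence = sentenceList.get(i);
    for (String w : words.keySet()) {
        if (sentence.contains(w)) {
            List<Integer> tmp = words.get(w);
            tmp.set(i, tmp.get(i) + 1);
        }
    }
}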
This solution has the time complexity of O(number_of_sentences*number_of_words) which is equivalent to O(n^2). An optimized solution is:
// assumes every list in words already holds one 0 per sentence
for (int i = 0; i < sentenceList.size(); i++) {
    for (String w : sentenceList.get(i)) {
        List<Integer> tmp = words.get(w);
        tmp.set(i, tmp.get(i) + 1);
    }
}
This has the time complexity of O(number_of_sentences*average_length_of_sentences). Since average_length_of_sentences is usually small this is equivalent to O(n).
I'm trying to create a program that can abbreviate certain words in a string given by the user.
This is how I've laid it out so far:
Create a hashmap from a .txt file such as the following:
thanks,thx
your,yr
probably,prob
people,ppl
Take a string from the user
Split the string into words
Check the hashmap to see if that word exists as a key
Use hashmap.get() to return the key value
Replace the word with the key value returned
Return an updated string
It all works perfectly fine until I try to update the string:
public String shortenMessage( String inMessage ) {
String updatedstring = "";
String rawstring = inMessage;
String[] words = rawstring.replaceAll("[^a-zA-Z ]", "").toLowerCase().split("\\s+");
for (String word : words) {
System.out.println(word);
if (map.containsKey(word) == true) {
String x = map.get(word);
updatedstring = rawstring.replace(word, x);
}
}
System.out.println(updatedstring);
return updatedstring;
}
Input:
thanks, your, probably, people
Output:
thanks, your, probably, ppl
Does anyone know how I can update all the words in the string?
Thanks in advance
updatedstring = rawstring.replace(word, x);
This keeps overwriting your updatedstring with a copy of rawstring that has only a single replacement applied.
You need to do something like
updatedstring = rawstring;
...
updatedstring = updatedstring.replace(word, x);
Edit:
That is the solution to the problem you are seeing but there are a few other problems with your code:
Your replacement won't work for words that you had to lowercase or strip characters from. You create the words array that you iterate over from an altered version of your rawstring. Then you go back and try to replace the altered versions in your original rawstring, where they don't exist. This will not find the words you think you are replacing.
If you are doing global replacements, you could just create a set of words instead of an array since once the word is replaced, it shouldn't come up again.
You might want to replace the words one at a time, because your global replacement could cause weird bugs where a word in the replacement map is a substring of another replacement word. Instead of using String.replace, make an array/list of words, iterate over the words, replace the element in the list if needed, and join them. In Java 8:
String.join(" ", elements);
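A minimal sketch of the word-by-word approach, assuming the abbreviations are already loaded into a Map<String, String>. The names Shortener and shorten are hypothetical, and keeping punctuation such as a trailing comma attached to the replaced word is my own choice, not something the question specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Shortener {
    static String shorten(String message, Map<String, String> abbreviations) {
        List<String> elements = new ArrayList<>();
        for (String token : message.split("\\s+")) {
            String lower = token.toLowerCase();
            // Lookup key: letters only, as in the question's cleanup step.
            String key = lower.replaceAll("[^a-z]", "");
            if (abbreviations.containsKey(key)) {
                // Replace just the word, so punctuation like "," survives.
                elements.add(lower.replace(key, abbreviations.get(key)));
            } else {
                elements.add(token);
            }
        }
        return String.join(" ", elements);
    }
}
```

Because each token is looked up as a whole word, an abbreviation for "your" cannot accidentally rewrite part of "yourself".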
I'm really really really not sure what is the best way to approach this. I've gotten as far as I can, but I basically want to scan a user response with an array of words and search for matches so that my AI can tell what mood someone is in based off the words they used. However, I've yet to find a clear or helpful answer. My code is pretty cluttered too because of how many different methods I've tried to use. I either need a way to compare sections of arrays to each other or portions of strings. I've found things for finding a part of an array. Like finding eggs in green eggs and ham, but I've found nothing that finds a section of an array in a section of another array.
public class MoodCompare extends Mood1 {
public static void MoodCompare(String inputMood){
int inputMoodLength = inputMood.length();
int HappyLength = Arrays.toString(Happy).length();
boolean itWorks = false;
String[] inputMoodArray = inputMood.split(" ");
if(Arrays.toString(Happy).contains(Arrays.toString(inputMoodArray)) == true)
System.out.println("Success!");
InputMood is the data the user has input that should have keywords lurking in them to their mood. Happy is an array of the class Mood1 that is being extended. This is only a small piece of the class, much less the program, but it should be all I need to make a valid comparison to complete the class.
If anyone can help me with this, you will save me hours of work. So THANK YOU!!!
Manipulating strings will be nicer when you do not use the relatively primitive arrays, where you have to walk through them yourself, etcetera. A Dutch proverb says: not seeing the wood for the trees.
In this case it seems you check words of the input against a set of words for some mood.
Let's use Java collections:
Turning an input string into a list of words:
String input = "...";
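// copy into an ArrayList: the list from Arrays.asList is fixed-size, so remove() would throw
List<String> sentence = new ArrayList<>(Arrays.asList(input.split("\\W+")));
sentence.remove("");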
\\W+ is a sequence of one or more non-word characters. Note that "word" characters here means A-Za-z0-9_.
Now a mood would be a set of unique words:
Set<String> moodWords = new HashSet<>();
Collections.addAll(moodWords, "happy", "wow", "hurray", "great");
Evaluation could be:
int matches = 0;
for (String word : sentence) {
if (moodWords.contains(word)) {
++matches;
}
}
int percent = sentence.isEmpty() ? 0 : matches * 100 / sentence.size();
System.out.printf("Happiness: %d %%%n", percent);
In Java 8 this is even more compact (count() returns a long, hence the cast):
int matches = (int) sentence.stream().filter(moodWords::contains).count();
Explanation:
The foreach-word-in-sentence takes every word. For every word it checks whether it is contained in moodWords, the set of all mood words.
The percentage is the share of words in the sentence that are mood words. The boundary condition of an empty sentence is handled by the if-then-else expression ... ? ... : ... - an empty sentence is given the arbitrary percentage 0%.
The printf format used %d for the integer, %% for the percent sign % (self-escaped) and %n for the line break character(s).
If I'm understanding your question correctly, you mean something like this?
String words[] = {"green", "eggs", "and", "ham"};
String response = "eggs or ham";
Mood mood = new Mood();
for(String foo : words)
{
if(response.contains(foo))
{
//Check if happy etc...
if(response.equals("green"))
mood.sad++;
...
}
}
System.out.println("Success");
...
//CheckMood() etc... other methods.
Try to use tokens.
Every time that the program needs to compare the contents of a row from one array to the other array, just tokenize the contents in parallel and compare them.
Visit the following Javadoc page for further reference: http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html
or even view the following web pages:
http://introcs.cs.princeton.edu/java/72regular/Tokenizer.java.html
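As a rough illustration of the tokenizing idea, the sketch below splits the user's response with StringTokenizer and counts how many tokens appear in a keyword array. The class TokenMatch, the method countMatches and the delimiter set are assumptions for this example, not part of the question's Mood1 class.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

public class TokenMatch {
    // Counts the tokens of response that also occur in keywords (case-insensitive).
    static int countMatches(String response, String[] keywords) {
        Set<String> wanted = new HashSet<>(Arrays.asList(keywords));
        // Split on whitespace and common punctuation.
        StringTokenizer tokens = new StringTokenizer(response.toLowerCase(), " \t\n\r,.!?;:");
        int matches = 0;
        while (tokens.hasMoreTokens()) {
            if (wanted.contains(tokens.nextToken())) {
                matches++;
            }
        }
        return matches;
    }
}
```

The keywords are expected in lower case, since the response is lowercased before tokenizing.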
I have made an array list that stores words from a text file, but now I need to make a separate array list that takes only the letters from the words in the file, converts them to lower case, and removes all punctuation in or around the words. So basically it stores all the words again with the mentioned elements removed from them.
List<String> grams = new ArrayList<String>();
for(String gram : words){
gram = gram.trim();
for(int i=0,int l=gram.size();i<l;++i){ //this line is wrong
const String punChars = ",[]:'-!_().?\/~"; //this line is wrong
if(gram.indexOf(i) != -1){ //this line is wrong
gram.remove(i); //this line is wrong
}
gram.add(gram.remove(0).toLowerCase()); //this line is wrong
}
}
I'm basically trying to read each character in a select string in the array as I put it into the new array list and then if it has any punctuation around or in it I remove it: I try to do this using a const to store the punctuation values and then I check the string with an if statement to remove that position in the string.
Next I try and add the word but remove the uppercase and change it to lowercase.
I'm a little lost and am not sure what I am doing with this bit here...
My interpretation of your problem is that you want to go through a list of words, convert them into lowercase, remove punctuation and then insert them into another list? If that's the case, then I don't see why you need to modify the original list. You can do something like this:
for(String gram : words) {
gram = gram.trim().toLowerCase(); //trim string and convert to lowercase
gram = gram.replaceAll("[^A-Za-z0-9]", ""); //remove any non alphanumeric characters
grams.add(gram); //add to the grams list
}
Why not cycle through each character in the String like this:
for (char c : gram.toCharArray()) {
...
}
Then check to see if your character is in list of excluded characters and if not then lower case it and add it to your output list?
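A small sketch of that character-by-character approach. The name Cleaner.clean is illustrative, and the excluded characters are taken from the question's punChars string, with the backslash escaped so that it compiles (I am assuming the asker meant a literal backslash and a slash).

```java
public class Cleaner {
    // Returns gram in lower case with the listed punctuation characters removed.
    static String clean(String gram) {
        String punChars = ",[]:'-!_().?\\/~";
        StringBuilder sb = new StringBuilder();
        for (char c : gram.trim().toCharArray()) {
            if (punChars.indexOf(c) == -1) { // keep only non-excluded characters
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }
}
```

Each cleaned word can then be added to the new list with grams.add(Cleaner.clean(gram)).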
I have a simple, general question regarding a real small issue that bothers me:
I'm printing a list of elements on the fly, so I don't have prior knowledge about the number of printed elements. I want a simple format where the elements are separated by a comma (elem1, elem2...) or something similar. Now, if I use a simple loop like:
while(elements to check) {
if (elem should be printed) {
print elem . ","
}
}
I get a comma after the last element...
I know this sounds quite stupid, but is there a way to handle this?
Let's assume that "should be printed" means "at least one non-whitespace character". In Perl, the idiomatic way to write this would be (you'll need to adjust the grep to taste):
print join "," => grep { /\S/ } @elements;
The "grep" is like "filter" in other languages and /\S/ is a regex matching one non-whitespace character. Only elements matching the expression are returned. The "join" should be self-explanatory.
perldoc -f join
perldoc -f grep
the way of having all your data in an array and then
print join(',', @yourarray)
is a good one.
You can also, after looping for your concatenation
declare eltToPrint
while (LOOP on elt) {
eltToPrint .= elt.','
}
remove the last comma with a regex :
eltToPrint =~s/,$//;
ps : works also if you put the comma at the beginning
eltToPrint =~s/^,//;
Before Java 8, Java did not have a built-in join; if you don't want to reinvent the wheel, you can use Guava's Joiner. It can skipNulls, or useForNull(something).
An object which joins pieces of text (specified as an array, Iterable, varargs or even a Map) with a separator. It either appends the results to an Appendable or returns them as a String. Example:
Joiner joiner = Joiner.on("; ").skipNulls();
return joiner.join("Harry", null, "Ron", "Hermione");
This returns the string "Harry; Ron; Hermione". Note that all input elements are converted to strings using Object.toString() before being appended.
Why not add a comma BEFORE each element (but the first one)? Pseudo-code:
is_first = true
loop element over element_array
BEGIN LOOP
if (! is_first)
print ","
end if
print element
is_first = false
END
print NEWLINE
I guess the simplest way is to create a new array containing only the elements from the original array that you need to print (i.e. a filter operation). Then print the newly created array, preferably using your language's built-in array/vector print/join function.
(In Perl)
@orig=("a","bc","d","ef","g");
@new_list=();
for $x(@orig){
push(@new_list,$x) if (length($x)==1);
}
print join(',',@new_list)."\n";
(In Java)
List<String> orig=Arrays.asList(new String[]{"a","bc","d","ef","g"});
List<String> new_list=new ArrayList<String>();
for(String x: orig){
if (x.length()==1)
new_list.add(x);
}
System.out.println(new_list);
You have several options depending on language.
e.g. in JavaScript just do:
var prettyString = someArray.join(', ');
in PHP you can implode()
$someArray = array('apple', 'orange', 'pear');
$prettyString = implode(",", $someArray);
if all else fails, you can either add the comma after every entry and trim the last one when done, or check in your while/foreach loop (bad for perf) if this is not the last item (if so, add a comma)
update: since you noted Java... you could create a method like this:
public static String join(String[] strings, String separator) {
StringBuffer sb = new StringBuffer();
for (int i=0; i < strings.length; i++) {
if (i != 0) sb.append(separator);
sb.append(strings[i]);
}
return sb.toString();
}
update 2: sounds like you really want this then if you are not outputting every element (pseudo-code):
first = true;
for(item in list){
if(item meets condition){
if(!first){
print ", ";
} else {
first = false;
}
print item;
}
}
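On Java 8 or later, the same filter-then-join idea can be sketched with streams. The class FilterJoin and the length == 1 condition are placeholders borrowed from the earlier Perl/Java example; substitute whatever "should be printed" means for you.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterJoin {
    // Keeps only the elements that meet the condition and joins them with ", ".
    static String joinShort(List<String> elements) {
        return elements.stream()
                .filter(e -> e.length() == 1)       // placeholder condition
                .collect(Collectors.joining(", ")); // separator only between elements
    }
}
```

Collectors.joining places the separator only between elements, so there is no trailing comma to trim.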