I'm really not sure what the best way to approach this is. I've gotten as far as I can, but I basically want to scan a user's response against an array of words and search for matches, so that my AI can tell what mood someone is in based on the words they used. However, I've yet to find a clear or helpful answer. My code is pretty cluttered too because of how many different methods I've tried. I need a way to compare either sections of arrays or portions of strings to each other. I've found ways to find one element in an array, like finding eggs in green eggs and ham, but nothing that finds a section of one array in a section of another array.
public class MoodCompare extends Mood1 {
public static void MoodCompare(String inputMood){
int inputMoodLength = inputMood.length();
int HappyLength = Arrays.toString(Happy).length();
boolean itWorks = false;
String[] inputMoodArray = inputMood.split(" ");
if(Arrays.toString(Happy).contains(Arrays.toString(inputMoodArray)) == true)
System.out.println("Success!");
inputMood is the data the user has entered, which should contain keywords hinting at their mood. Happy is an array in the class Mood1, which this class extends. This is only a small piece of the class, let alone the program, but it should be all I need to make a valid comparison and complete the class.
If anyone can help me with this, you will save me hours of work. So THANK YOU!!!
Manipulating strings is nicer when you avoid relatively primitive arrays, which you have to walk through yourself, and so on. As a Dutch proverb says: you can't see the forest for the trees.
In this case it seems you want to check the words of the input against a set of words for some mood.
Let's use Java collections:
Turning an input string into a list of words:
String input = "...";
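// Arrays.asList returns a fixed-size list, so copy it into an ArrayList before removing anything
List<String> sentence = new ArrayList<>(Arrays.asList(input.split("\\W+")));
sentence.remove(""); // a leading separator yields one empty first element
\\W+ is a sequence of one or more non-word characters. Note that "word" characters here are A-Z, a-z, 0-9 and _.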
Now a mood would be a set of unique words:
Set<String> moodWords = new HashSet<>();
Collections.addAll(moodWords, "happy", "wow", "hurray", "great");
Evaluation could be:
int matches = 0;
for (String word : sentence) {
if (moodWords.contains(word)) {
++matches;
}
}
int percent = sentence.isEmpty() ? 0 : matches * 100 / sentence.size();
System.out.printf("Happiness: %d %%%n", percent);
In Java 8 this is even more compact:
int matches = (int) sentence.stream().filter(moodWords::contains).count(); // count() returns a long
Explanation:
The foreach-word-in-sentence takes every word. For every word it checks whether it is contained in moodWords, the set of all mood words.
The percentage is the share of the words in the sentence that are mood words. The boundary condition of an empty sentence is handled by the conditional expression ... ? ... : ..., which gives an empty sentence the arbitrary percentage 0%.
The printf format used %d for the integer, %% for the percent sign % (self-escaped) and %n for the line break character(s).
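To pick one mood out of several, the same set lookup can be repeated per mood. The following is only a sketch: the class name MoodDetector and the "sad" word list are made up for illustration, and in your program the word sets would come from the arrays in Mood1.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class MoodDetector {
    // Hypothetical word lists; replace them with your own (e.g. the arrays in Mood1).
    private static final Map<String, Set<String>> MOODS = new LinkedHashMap<>();
    static {
        MOODS.put("happy", new HashSet<>(Arrays.asList("happy", "wow", "hurray", "great")));
        MOODS.put("sad", new HashSet<>(Arrays.asList("sad", "awful", "terrible", "lonely")));
    }

    // Returns the mood with the most matching words, or "unknown" if nothing matches.
    static String detect(String input) {
        String[] words = input.toLowerCase().split("\\W+");
        String best = "unknown";
        int bestCount = 0;
        for (Map.Entry<String, Set<String>> mood : MOODS.entrySet()) {
            int count = 0;
            for (String word : words) {
                if (mood.getValue().contains(word)) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                best = mood.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("Wow, what a great day!")); // prints happy
    }
}
```

Lower-casing the input first means "Wow" and "wow" count as the same word. On a tie, the mood that was registered first wins, which may or may not be what you want.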
If I'm understanding your question correctly, you mean something like this?
String words[] = {"green", "eggs", "and", "ham"};
String response = "eggs or ham";
Mood mood = new Mood();
for(String foo : words)
{
if(response.contains(foo))
{
//Check if happy etc...
if(foo.equals("green")) // compare the matched word, not the whole response
mood.sad++;
...
}
}
System.out.println("Success");
...
//CheckMood() etc... other methods.
Try to use tokens.
Every time that the program needs to compare the contents of a row from one array to the other array, just tokenize the contents in parallel and compare them.
Visit the following Javadoc page for further reference: http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html
or even view the following web pages:
http://introcs.cs.princeton.edu/java/72regular/Tokenizer.java.html
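As a rough sketch of the tokenizing idea, both strings can be split into tokens with java.util.StringTokenizer and the token sets intersected. The class and method names here (TokenCompare, tokens, common) are made up for the example, and the delimiter list is an assumption about what separates words in the input.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

class TokenCompare {
    // Splits the text on whitespace and common punctuation and collects lower-cased tokens.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        StringTokenizer tokenizer = new StringTokenizer(text.toLowerCase(), " \t\n\r.,!?;:");
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    // Returns the tokens that occur in both strings.
    static Set<String> common(String first, String second) {
        Set<String> shared = tokens(first);
        shared.retainAll(tokens(second));
        return shared;
    }

    public static void main(String[] args) {
        // The shared tokens of the two example strings are "eggs" and "ham".
        System.out.println(common("green eggs and ham", "Eggs or ham?"));
    }
}
```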
I need to find out whether a string formed by removing the space between two words contains a word from a dictionary.
I already have the dictionary stored in a BST.
I get as a input a text file with random spaces removed.
For example:
We left in pretty good time, and came after nightfallto Klausenburgh.
Here I stopped for the night at the Hotel Royale. I had for dinner, or
rather supper, a chicken done up some way with red pepper, which was
very goodbut thirsty. (Mem., get recipe for Mina.) I asked the
waiter, and he said it was called "paprika hendl," and that, as it was a
nationaldish, I should be able to get it anywhere along the
Carpathians. I found my smattering of German very useful here; indeed, I
don't know how I should be able to get on without it.
I read the file and saved every word in a list.
I need to verify whether a word is in the dictionary and count its frequency; I already did this part. The hard part is that I need to check whether I can get dictionary words out of a string that had a space removed.
For example,
'goodbut' should give me 'good', which should be added to the frequency counter, since 'but' is not in my dictionary.
I have a list of all the strings from the text file that were not in the dictionary when I looked up the frequencies. I need to go through those words to see if I can find a legal word in them.
But I don't know how, nor where to start.
For each word in the text:
Iterable<String> words = ...;
for (String word : words) {
processSubWords(word);
}
You want to generate each possible sub-word (this can only happen for words with 2 or more characters):
void processSubWords(String word) {
if (word.length() > 1) {
for (int i = 1; i < word.length(); i++) {
final String left = word.substring(0, i);
final String right = word.substring(i);
lookupAndUpdate(left);
lookupAndUpdate(right);
}
}
}
Then in lookupAndUpdate you would do a dictionary lookup and update as necessary if there's a match.
As an example, if you passed goodbut to processSubWords, it would call lookupAndUpdate with the following strings:
g
oodbut
go
odbut
goo
dbut
good
but
goodb
ut
goodbu
t
Of those, only good should (likely) match your dictionary.
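For completeness, here is one way lookupAndUpdate could look. This is a sketch under assumptions: the dictionary is a plain Set<String> of lower-case words standing in for your BST, the frequencies live in a HashMap, and the class name SubWordCounter is made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class SubWordCounter {
    private final Set<String> dictionary;               // stand-in for the BST dictionary
    private final Map<String, Integer> frequency = new HashMap<>();

    SubWordCounter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Tries every split point of the word and looks up both halves.
    void processSubWords(String word) {
        for (int i = 1; i < word.length(); i++) {
            lookupAndUpdate(word.substring(0, i));
            lookupAndUpdate(word.substring(i));
        }
    }

    // Increments the counter if the candidate is a dictionary word.
    void lookupAndUpdate(String candidate) {
        String key = candidate.toLowerCase();
        if (dictionary.contains(key)) {
            frequency.merge(key, 1, Integer::sum);
        }
    }

    Map<String, Integer> frequency() {
        return frequency;
    }
}
```

With a dictionary containing "good" but not "but", calling processSubWords("goodbut") leaves the map as {good=1}. If both halves of a split are dictionary words, both get counted.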
I think a regex matcher with a counter should produce the desired result. The example code would be something like this:
public int countWords(String key, String source) {
Pattern pattern = Pattern.compile(key);
Matcher matcher = pattern.matcher(source);
int count = 0;
while (matcher.find()) {
count++;
}
return count;
}
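Here key is the word "good" from your example and source is the text. The method returns a count of 2 for this input. Note that key is compiled as a regular expression, so wrap it in Pattern.quote(key) if it may contain regex metacharacters.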
Below are the intended output and the actual output I got from this line of code:
ArrayList<String> nodes = new ArrayList<String>(Arrays.asList(str.split("(?i:"+word+")"+"[.,!?:;]?")));
on the input:
input : "Cow shouts COW! other cows shout COWABUNGA! stupid cow."
The string will be split into an ArrayList at the acceptable "cow" versions.
Original Output(from line above) :
ArrayList nodes = {, shouts , other , s shout ,ABUNGA! stupid }
vs
Intended Output :
ArrayList nodes = {, shouts , other cows shout COWABUNGA! stupid }
What I'm trying to achieve:
1. Case-insensitive search. (ACHIEVED)
2. Take into account the possibility of one of these punctuation marks ".,:;!?" directly after the word being split on, hence "[.,!?:;]?". (ACHIEVED)
3. Only split on the exact word plus "[.,!?:;]?". It should not split at "cows" or "COWABUNGA!". (NOT ACHIEVED, need help)
4. Find a way to add the accepted splitting-word versions {Cow,COW!,cow.} to another ArrayList for later use in the method. (IN PROGRESS)
As you can see, I have fulfilled 1 and 2, and I am posting this question while I work on 4. I know this issue can be solved with extra lines of code, but I'd like to keep it minimal and efficient.
UPDATE : I found that "{"+input.length+"}" can limit the matches down to letter length but I don't know if it'll work or not.
All help will be appreciated. I apologize if this question is too trivial but alas, I am new. Thanks in advance!
The following code produces the output you specified given your input. I have broken the regular expression down into named components, so each bit should be self-explanatory.
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public class Moo {
public static void main(String[] args) {
String input = "Cow shouts COW! other cows shout COWABUNGA! stupid cow.";
System.out.println(splitter(input, "cow"));
}
public static List<String> splitter(String input, String word) {
String beginningOfInputOrWordBoundary = "(\\A|\\W)";
String caseInsensitiveWord = "(?i:"+Pattern.quote(word)+")";
String optionalPunctuation = "\\p{Punct}?";
String endOfInputOrWordBoundary = "(\\z|\\W)";
String regex =
beginningOfInputOrWordBoundary +
caseInsensitiveWord +
optionalPunctuation +
endOfInputOrWordBoundary;
return Arrays.asList(input.split(regex));
}
}
Resulting output:
[, shouts, other cows shout COWABUNGA! stupid]
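For the fourth goal in the question (keeping the matched versions such as Cow, COW! and cow. for later), a Matcher can collect them instead of splitting on them. This is only a sketch; the class name MatchedVariants and the use of \b word boundaries are my own choices, not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchedVariants {
    // Collects every whole-word, case-insensitive occurrence of word,
    // including one optional trailing punctuation character.
    static List<String> variants(String input, String word) {
        Pattern pattern = Pattern.compile("\\b(?i:" + Pattern.quote(word) + ")\\b\\p{Punct}?");
        Matcher matcher = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Cow shouts COW! other cows shout COWABUNGA! stupid cow.";
        System.out.println(variants(input, "cow")); // prints [Cow, COW!, cow.]
    }
}
```

The \b after the word is what rejects "cows" and "COWABUNGA!", because there the next character is still a word character.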
A word is a sequence of letters. Any character that is not a letter implies the end of a word.
Thus, this should provide the desired result:
(?i:Cow)[^\\p{IsAlphabetic}]
The problem is as follows:
You are given a dictionary which contains a list of words and has the method .contains(), which returns a boolean indicating whether the word is in the dictionary. The actual implementation of the dictionary doesn't matter for the problem.
The input is a string of dictionary words with all spaces removed. However, it may also contain characters which aren't part of any dictionary word. The output must be a String with the words separated by spaces, and any run of characters not found in the dictionary must be joined to a word which is found in the dictionary.
For example:
Dictionary = ["hi", "mike", "java"]
Input = "HiMikeJava"
Output = "Hi Mike Java"
Input = "HiMikeLJava"
Output = "Hi MikeL Java"
Input = "HiMikeLJavaSS"
Output = "Hi MikeL JavaSS"
The problem that I find is that the input could contain characters not found in the dictionary. Any help is appreciated.
Note: If you answer in code, please answer in Java since it is the only programming language I know. Thanks.
I've just done this, so I have a solution... it's not very elegant but it works. That said, this is your homework so you have to do the work!
But, some pointers:
Create a method which takes an array of String inputs
Create an outer loop to run while the input has chars in it
Create an inner loop to go through each word
In this loop take a substring of input the length of the word you
are looking at and compare it to the word (Think: How will you
handle the fact that the cases might be different?)
If they match - great print a space and then the substring and
remove the substring from input.
If none of them match then print the first character of the input
and remove this character from the input (How will you know whether
it has been found or not? Do you need an extra variable)
If you implement it exactly like this there might be some cases where you will get an IndexOutOfBoundsException. Also, you will have an unwanted space before the first word. I leave this for you to figure out, given that it's homework!
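For readers who just want to see the shape of such a loop in Java, here is a minimal sketch. It is not the code referred to above; it assumes the dictionary is a Set<String> of lower-case words, it greedily takes the longest dictionary word at each position, and the name WordSplitter is made up. A greedy choice can pick the wrong split for dictionaries with overlapping words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class WordSplitter {
    // Splits input into dictionary words; characters that start no dictionary
    // word are appended to the previous token.
    static String split(String input, Set<String> dictionary) {
        List<StringBuilder> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int length = longestMatch(input, pos, dictionary);
            if (length > 0) {
                tokens.add(new StringBuilder(input.substring(pos, pos + length)));
                pos += length;
            } else {
                if (tokens.isEmpty()) {
                    tokens.add(new StringBuilder()); // input starts with unknown characters
                }
                tokens.get(tokens.size() - 1).append(input.charAt(pos));
                pos++;
            }
        }
        return String.join(" ", tokens);
    }

    // Length of the longest dictionary word starting at pos (case-insensitive), or 0.
    private static int longestMatch(String input, int pos, Set<String> dictionary) {
        for (int length = input.length() - pos; length > 0; length--) {
            if (dictionary.contains(input.substring(pos, pos + length).toLowerCase())) {
                return length;
            }
        }
        return 0;
    }
}
```

With the dictionary {hi, mike, java}, split("HiMikeLJavaSS", dictionary) gives "Hi MikeL JavaSS", matching the third example in the question.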
/// <summary>
/// Word Break Problem
/// </summary>
/// <param name="inputDictionary"></param>
/// <param name="inputString"></param>
/// <param name="outputString"></param>
public static void WordBreakProblem(List<string> inputDictionary, string inputString, string outputString)
{
if (inputString.Length == 0)
{
Console.WriteLine(outputString);
return;
}
for (int i = 1; i <= inputString.Length; i++)
{
string subString = inputString.Substring(0, i);
if (inputDictionary.Contains(subString, StringComparer.OrdinalIgnoreCase))
WordBreakProblem(inputDictionary, inputString[i..], outputString + " " + subString);
}
}
I need help completing this little project.
The program will take a phone number as input and convert it into a proper English word.
Explanation:
There are some letters associated with each of the digits 0-9, saved in the first ten lines of a text file, something like
1 akl
2 dgh
3 qnm
4 rtu
5 zx
6 cvf
7 eip
8 wjs
9 yb
0 o
Line 11 holds the total number of words, i.e. 50000.
After that, starting from line 12, all 50000 words are listed, one word per line.
The program will keep taking numbers as input from the user until the user enters -1,
and then generate the proper English words matching them from this text file. Each letter represents a digit from the list.
For example, the user enters
6182703
output will be :
Fashion
If more than one word matches, the system will list all the words separated by hyphens ('-').
How should I start this, and what approach should I use?
If someone could give pseudocode or hints, that would be really great.
I would take a dictionary of words and sort it in a file by your needs.
e.g:
apple = 17717
cherry = 627449
Then go through the file with a search algorithm.
EDIT: or you could store the data in a Relational DB (http://hsqldb.org/ is simple) to avoid a bigger memory footprint. If you like the solution you also could investigate some key/value stores etc.
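A rough in-memory version of this idea: encode every word as its digit string once, and index the words by that string. The sketch below assumes the mapping lines look like "1 akl" as in the question; the class name DigitIndex and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DigitIndex {
    private final Map<Character, Character> letterToDigit = new HashMap<>();
    private final Map<String, List<String>> wordsByNumber = new HashMap<>();

    // mappingLines are lines such as "1 akl"; words is the word list from the file.
    DigitIndex(List<String> mappingLines, List<String> words) {
        for (String line : mappingLines) {
            String[] parts = line.trim().split("\\s+");
            char digit = parts[0].charAt(0);
            for (char letter : parts[1].toCharArray()) {
                letterToDigit.put(letter, digit);
            }
        }
        for (String word : words) {
            String number = encode(word);
            if (number != null) {
                wordsByNumber.computeIfAbsent(number, k -> new ArrayList<>()).add(word);
            }
        }
    }

    // Converts a word to its digit string, or returns null if a letter has no digit.
    String encode(String word) {
        StringBuilder number = new StringBuilder();
        for (char letter : word.toLowerCase().toCharArray()) {
            Character digit = letterToDigit.get(letter);
            if (digit == null) {
                return null;
            }
            number.append(digit.charValue());
        }
        return number.toString();
    }

    // Returns all matching words joined by hyphens, or an empty string if none match.
    String lookup(String number) {
        return String.join("-", wordsByNumber.getOrDefault(number, new ArrayList<>()));
    }
}
```

Each lookup is then a single map access instead of a scan over all 50000 words, at the cost of building the index once up front.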
A lot of the detail in your question relates to the input spec, which is all pretty trivial.
After parsing your input, you're going to have a list of "candidate" words (all the words), and a mapping of digits to the set of characters it can be represented with.
List<String> words;
Map<Character, Set<Character>> digitMapping;
The simplest way of generating the word for a number is probably this: sequentially filter the list of candidates, testing if they match the input digits, and removing them otherwise. Something like this might do the trick (consider this pseudocode - I haven't tried compiling it):
List<String> getMatches(String inputDigits) {
// Take a copy of the word list. You don't want to ruin the list for the next caller
List<String> candidates = new ArrayList<String>(words);
for (Iterator<String> it = candidates.iterator(); it.hasNext(); ) {
String candidate = it.next();
// A word can only match if it has exactly one letter per digit
boolean matches = candidate.length() == inputDigits.length();
for (int i = 0; matches && i < inputDigits.length(); ++i) {
Set<Character> letters = digitMapping.get(inputDigits.charAt(i));
// An unknown digit or a letter outside the digit's set rules the word out
matches = letters != null && letters.contains(candidate.charAt(i));
}
if (!matches) {
it.remove(); // called at most once per candidate
}
}
return candidates;
}
It will return all the words that match, so in your example, "555" will likely return an empty list. "6182703" might only return a single word, "fashion", while "202" might return several words in a list ("dog", "hog", "god"). You'll need to decide how you want to handle the zero and multiple cases.
Edit: Details on populating digitMapping:
The digitMapping will be something like:
Map<Character, Set<Character>> digitMapping = new HashMap<Character, Set<Character>>();
Then you'll need to grab a char and a String from the input. For the input line "1 akl", your char will be '1', while your String will be "akl". You're mapping from the character to the set of characters in the string, so will need to construct an empty set, put it into the map, then populate the set. Something like (again, I haven't even tried compiling this, so take it with a grain of salt):
private void addDigitToMap(char digit, String chars) {
Set<Character> set = new HashSet<Character>();
digitMapping.put(digit, set); // the key is the digit, the value is its set of letters
for (char c : chars.toCharArray()) {
set.add(c); // autoboxed to Character
}
}
So now the map will have an entry that points to a set of the characters it can be represented by.
I am having a problem with word boundary identification. I removed all the markup from a Wikipedia document, and now I want to get a list of entities (meaningful terms). I am planning to take bigrams and trigrams of the document and check whether they exist in a dictionary (WordNet). Is there a better way to achieve this?
Below is the sample text. I want to identify entities(shown as surrounded by double quotes)
Vulcans are a humanoid species in the fictional "Star Trek" universe who evolved on the planet Vulcan and are noted for their attempt to live by reason and logic with no interference from
emotion They were the first extraterrestrial species officially to make first contact with Humans and later became one of the founding members of the "United Federation of Planets"
I think what you're talking about is really still a subject of burgeoning research rather than a simple matter of applying well-established algorithms.
I can't give you a simple "do this" answer, but here are some pointers off the top of my head:
I think using WordNet could work (not sure where bigrams/trigrams come into it though), but you should view WordNet lookups as part of a hybrid system, not the be-all and end-all to spotting named entities
then, start by applying some simple, common-sense criteria (sequences of capitalised words; try and accommodate frequent lower-case function words like 'of' into these; sequences consisting of "known title" plus capitalised word(s));
look for sequences of words that statistically you wouldn't expect to appear next to one another by chance as candidates for entities;
can you build in dynamic web lookup? (your system spots the capitalised sequence "IBM" and sees if it finds e.g. a wikipedia entry with the text pattern "IBM is ... [organisation|company|...]");
see if anything here and in the "information extraction" literature in general gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html
The truth is that when you look at what literature there is out there, it doesn't seem like people are using terribly sophisticated, well-established algorithms. So I think there's a lot of room for looking at your data, exploration and seeing what you can come up with... Good luck!
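As a starting point for the capitalised-sequence criterion, here is a very rough sketch in Java. The regular expression is my own simplification: it only finds runs of two or more capitalised words, allows a lower-case "of" in between, and will miss single-word entities. The class name CapitalisedSequences is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapitalisedSequences {
    // Two or more capitalised words in a row, optionally joined by "of".
    private static final Pattern CANDIDATE =
            Pattern.compile("\\b[A-Z][a-z]+(?:\\s+(?:of\\s+)?[A-Z][a-z]+)+");

    static List<String> candidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Vulcans are a humanoid species in the fictional Star Trek universe. "
                + "They later became founding members of the United Federation of Planets.";
        System.out.println(candidates(text)); // prints [Star Trek, United Federation of Planets]
    }
}
```

The candidates would still need filtering, for example against WordNet or a web lookup as suggested above.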
If I understand correctly, you're looking to extract substrings delimited by double-quotation marks ("). You could use capture-groups in regular expressions:
String text = "Vulcans are a humanoid species in the fictional \"Star Trek\"" +
" universe who evolved on the planet Vulcan and are noted for their " +
"attempt to live by reason and logic with no interference from emotion" +
" They were the first extraterrestrial species officially to make first" +
" contact with Humans and later became one of the founding members of the" +
" \"United Federation of Planets\"";
String[] entities = new String[10]; // An array to hold matched substrings
Pattern pattern = Pattern.compile("[\"](.*?)[\"]"); // The regex pattern to use
Matcher matcher = pattern.matcher(text); // The matcher - our text - to run the regex on
int startFrom = text.indexOf('"'); // The index position of the first " character
int endAt = text.lastIndexOf('"'); // The index position of the last " character
int count = 0; // An index for the array of matches
while (startFrom <= endAt) { // startFrom will be changed to the index position of the end of the last match
matcher.find(startFrom); // Run the regex find() method, starting at the first " character
entities[count++] = matcher.group(1); // Add the match to the array, without its " marks
startFrom = matcher.end(); // Update the startFrom index position to the end of the matched region
}
OR write a "parser" with String functions:
int startFrom = text.indexOf('"'); // The index-position of the first " character
int nextQuote = text.indexOf('"', startFrom+1); // The index-position of the next " character
int count = 0; // An index for the array of matches
while (startFrom > -1) { // Keep looping as long as there is another " character (indexOf returns -1 when there is none)
entities[count++] = text.substring(startFrom+1, nextQuote); // Retrieve the substring and add it to the array
startFrom = text.indexOf('"', nextQuote+1); // Find the next " character after nextQuote
nextQuote = text.indexOf('"', startFrom+1); // Find the next " character after that
}
In both, the sample-text is hard-coded for the sake of the example and the same variable is presumed to be present (the String variable named text).
If you want to test the contents of the entities array:
int i = 0;
while (i < count) {
System.out.println(entities[i]);
i++;
}
I have to warn you, there may be issues with border/boundary cases (i.e. when a " character is at the beginning or end of a string). These examples will not work as expected if the number of " characters is odd. You could use a simple parity check beforehand:
static int countQuoteChars(String text) {
int nextQuote = text.indexOf('"'); // Find the first " character
int count = 0; // A counter for " characters found
while (nextQuote != -1) { // While there is another " character ahead
count++; // Increase the count by 1
nextQuote = text.indexOf('"', nextQuote+1); // Find the next " character
}
return count; // Return the result
}
static boolean quoteCharacterParity(int numQuotes) {
if (numQuotes % 2 == 0) { // If the number of " characters modulo 2 is 0
return true; // Return true for even
}
return false; // Otherwise return false
}
Note that if numQuotes happens to be 0 this method still returns true (because 0 modulo any number is 0, so (numQuotes % 2 == 0) will be true), though you wouldn't want to go ahead with the parsing if there are no " characters, so you'd want to check for this condition somewhere.
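If you'd rather not size an array up front or check parity separately, a variant of the regex approach that loops on matcher.find() and collects into a List copes with zero or an odd number of " characters by itself. This is a sketch, and the class name QuotedExtractor is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedExtractor {
    // Matches an opening quote, captures everything up to the next quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the text between each complete pair of " characters.
    static List<String> extract(String text) {
        List<String> entities = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(text);
        while (matcher.find()) {       // find() returns false when no complete pair is left
            entities.add(matcher.group(1));
        }
        return entities;
    }

    public static void main(String[] args) {
        String text = "the fictional \"Star Trek\" universe and the \"United Federation of Planets\"";
        System.out.println(extract(text)); // prints [Star Trek, United Federation of Planets]
    }
}
```

A trailing unmatched " is simply ignored, and a text without any " characters yields an empty list.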
Hope this helps!
Someone else asked a similar question about how to find "interesting" words in a corpus of text. You should read the answers. In particular, Bolo's answer points to an interesting article which uses the density of appearance of a word to decide how important it is---using the observation that when a text talks about something, it usually refers to that something fairly often. This article is interesting because the technique does not require prior knowledge on the text that is being processed (for instance, you don't need a dictionary targeted at the specific lexicon).
The article suggests two algorithms.
The first algorithm rates single words (such as "Federation", or "Trek", etc.) according to their measured importance. It is straightforward to implement, and I could even provide a (not very elegant) implementation in Python.
The second algorithm is more interesting as it extracts noun phrases (such as "Star Trek", etc.) by completely ignoring whitespace and using a tree structure to decide how to split noun phrases. The results given by this algorithm when applied to Darwin's seminal text on evolution are very impressive. However, I admit implementing this algorithm would take a bit more thought, as the description given by the article is fairly elusive and, what's more, the authors seem a bit difficult to track down. That said, I did not spend much time on it, so you may have better luck.