Hi friends, I'm doing my final-year project on semantic similarity between sentences.
I'm using the WordNet 2.1 database to retrieve word meanings. I have to split each line into words, look up the meaning of each word, and store it in an array. But I only get the meanings for the words of the first sentence.
String[] sentences = result.split("[\\.\\!\\?]");
for (int i=0;i<sentences.length;i++)
{
System.out.println(i);
System.out.println(sentences[i]);
int wcount1 = sentences[i].split("\\s+").length;
System.out.println(wcount1);int wcount1=wordCount(w2);
System.out.println(wcount1);
String[] word1 = sentences[i].split(" ");
for (int j=0;j<wcount1;j++){
System.out.println(j);
System.out.println(word1[j]);
}
}
IndexWordSet set = wordnet.lookupAllIndexWords(word1[j]);
System.out.println(set);
IndexWord[] ws = set.getIndexWordArray();
**POS p = ws[0].getPOS();///line no 103**
Set<String> synonyms = new HashSet<String>();
IndexWord indexWord = wordnet.lookupIndexWord(p, word1[j]);
Synset[] synSets = indexWord.getSenses();
for (Synset synset : synSets)
{
Word[] words = synset.getWords();
for (Word word : words)
{
synonyms.add(word.getLemma());
}
}
System.out.println(synonyms);
OUTPUT:
Only the words of sentences[0] (the first sentence) show their meanings; the loop never reaches the other words.
It shows this error:
**java.lang.ArrayIndexOutOfBoundsException: 0
at first_JWNL.main(first_JWNL.java:102)**
When you declare the variable wcount1, you assign it the value sentences[i].split("\\s+").length. Yet when you assign the variable word1, you use sentences[i].split(" ").
Is it possible that, because you are using two different regular expressions, the second split (the one assigned to word1) does not split the sentence the same way? If so, the value of wcount1 may be bigger than the length of word1, and accessing word1[j] (System.out.println(word1[j]);) would throw the ArrayIndexOutOfBoundsException.
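The mismatch between the two splits can be reproduced without WordNet. The following is only a sketch: the class name SplitMismatchDemo and the sample input are made up for illustration. It shows that the two regexes can return arrays of different lengths, and that a sentence following ". " starts with a space, which produces an empty first token.

```java
import java.util.Arrays;

class SplitMismatchDemo {
    // Number of tokens when splitting on runs of any whitespace.
    static int countByWhitespace(String s) {
        return s.split("\\s+").length;
    }

    // Number of tokens when splitting on single space characters only.
    static int countBySpace(String s) {
        return s.split(" ").length;
    }

    // Trim first, then split once, so the count and the array always agree.
    static String[] words(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical input: a leading space (as left behind after ". ") and one tab.
        String sentence = " second\tsentence here";
        System.out.println(countByWhitespace(sentence));      // prints 4 (includes a leading "")
        System.out.println(countBySpace(sentence));           // prints 3 (the tab is not a separator)
        System.out.println(Arrays.toString(words(sentence))); // prints [second, sentence, here]
    }
}
```

Splitting once and looping up to word1.length avoids the index mismatch. The empty first token may also matter: if the dictionary lookup returns no entries for an empty string, reading element 0 of the result would fail with index 0, but that part is a guess about the WordNet call rather than something this sketch verifies.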
I want to count the occurrence of multiple substrings in a string.
I am able to do so by using the following code:
int score = 0;
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (String word : words) {
if(StringUtils.containsIgnoreCase(text, word)){
score += 1;
}
}
My algorithm increases the score by +1 for each word from my "words" list that occurs in the text.
In my example, the score would be 2 because "this" and "is" occur in the text.
However, my code has to loop through the text for each string in my list.
Is there a faster way to do this?
How about the following:
String text = "This is some random text. This is some random text.";
text = text.toLowerCase();
String[] tokens = text.split("\\PL+");
java.util.Set<String> source = new java.util.HashSet<>();
for (String token : tokens) {
source.add(token);
}
java.util.List<String> words = java.util.Arrays.asList("this", "is", "for", "stackoverflow");
source.retainAll(words);
int score = source.size();
Split text into words.
Add the words to a Set so that each word appears only once. Hence Set will contain the word this only once despite the fact that the word this appears twice in text.
After calling method retainAll, the Set only contains words that are in the words list. Hence your score is the number of elements in the Set.
The fastest way would be to put each word of the text into a hash map (or hash set).
Then, for each word you are searching for, you just have to look up its key in the hash map.
Given that your text has n words and your list has m words,
the solution would take O(n+m) instead of O(n*m).
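A minimal sketch of that idea, assuming whole-word matching is acceptable (the class name WordScore and the method score are made up for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WordScore {
    // Builds a set of the text's words once (O(n)), then performs one
    // hash lookup per search word (O(m) on average).
    static int score(String text, List<String> words) {
        Set<String> textWords =
                new HashSet<>(Arrays.asList(text.toLowerCase().split("\\P{L}+")));
        int score = 0;
        for (String word : words) {
            if (textWords.contains(word.toLowerCase())) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String text = "This is some random text. This is some random text.";
        List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
        System.out.println(score(text, words)); // prints 2
    }
}
```

Note that this matches whole words only. The containsIgnoreCase version in the question also counts a word that appears inside a longer word, so the two can give different scores on other inputs.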
This is a case where regex is your friend:
public static Map<String, Integer> countTokens(String src, List<String> tokens) {
Map<String, Integer> countMap = new HashMap<>();
String end = "(\\s|$)"; //trap whitespace or the end of the string.
String start = "(^|\\s)"; //trap for whitespace or the start of a the string
//iterate through your tokens
for (String token : tokens) {
//create a pattern, note the case insensitive flag
Pattern pattern = Pattern.compile(start+token+end, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(src);
int cnt = 0;
//count your matches.
while(matcher.find()) {
cnt++;
}
countMap.put(token, cnt);
}
return countMap;
}
public static void main(String[] args) throws IOException {
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (Entry<String, Integer> entry : countTokens(text, words).entrySet()) {
System.out.println(entry);
}
}
If you want to find tokens within other tokens, like "is" within "this", simply remove the start and end regexes.
You can use the split method to convert the string to an array of strings, sort it, and then binary-search the array for each element of the list. This is implemented in the code below.
String[] wordsArr = "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
int count = 0;
Arrays.sort(wordsArr);
for(String word: words)
if(Arrays.binarySearch( wordsArr, word )>-1)
count++;
Another good approach is to use a TreeSet; I got the inspiration for this one from @Abra.
String[] wordsArr = "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
TreeSet<String> setOfWords = new TreeSet<String>(Arrays.asList(wordsArr));
int count = 0;
for(String word: words)
if(setOfWords.contains(word))
count++;
Both of these methods have a time complexity of O(N log(M)) for the lookups, N being the size of the words list and M being the size of wordsArr or setOfWords (sorting the array or building the TreeSet costs an additional O(M log(M))). However, be careful when using this approach, since it has one obvious flaw: it doesn't account for periods. If you were to search for "text", it won't be found, because the set/array contains "text.". You can get around that by removing all punctuation from your initial text and from the searched words. If you want it to be accurate, you can instead set the regex string in split() to "[^a-zA-Z]", which splits your String around non-alphabetical characters.
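The effect of the split regex can be seen side by side. This is a sketch only: SortedLookup and its count method are illustrative names, and "[^a-zA-Z]+" (with a trailing +) is used here so that a period followed by a space is treated as one separator.

```java
import java.util.Arrays;
import java.util.List;

class SortedLookup {
    // Sorts the tokens produced by the given split regex, then counts how
    // many of the search words are found by binary search.
    static int count(String text, String splitRegex, List<String> words) {
        String[] tokens = text.toLowerCase().split(splitRegex);
        Arrays.sort(tokens);
        int count = 0;
        for (String word : words) {
            if (Arrays.binarySearch(tokens, word) >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "This is some random text. This is some random text.";
        List<String> words = Arrays.asList("this", "text");
        System.out.println(count(text, " ", words));          // prints 1: only "text." is present
        System.out.println(count(text, "[^a-zA-Z]+", words)); // prints 2
    }
}
```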
I have an ArrayList of Strings and it contains records such as:
this is a first sentence
hello my name is Chris
what's up man what's up man
today is tuesday
I need to clear this list, so that the output does not contain repeated content. In the case above, the output should be:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
As you can see, the 3rd String has been modified and now contains only one statement, what's up man, instead of two.
In my list, a String is sometimes correct and sometimes doubled as shown above.
I want to get rid of it, so I thought about iterating through this list:
for (String s: myList) {
but I cannot find a way of eliminating the duplicates, especially since the length of each string is not fixed. By that I mean there might be a long record:
this is a very long sentence this is a very long sentence
or sometimes a short one:
single word single word
Is there some native Java function for that, maybe?
Assuming the String is repeated just twice, with a space in between as in your examples, the following code would remove the repetitions:
for (int i=0; i<myList.size(); i++) {
String s = myList.get(i);
String fs = s.substring(0, s.length()/2);
String ls = s.substring(s.length()/2+1, s.length());
if (fs.equals(ls)) {
myList.set(i, fs);
}
}
The code just splits each entry of the list into two substrings (dividing at the halfway point). If both are equal, it replaces the original element with one half, thus removing the repetition.
I was testing the code and did not see @Brendan Robert's answer. This code follows the same logic as his answer.
I would suggest using regular expressions. I was able to remove duplicates using this pattern: \b([\w\s']+) \1\b
public class Main {
static String [] phrases = {
"this is a first sentence",
"hello my name is Chris",
"what's up man what's up man",
"today is tuesday",
"this is a very long sentence this is a very long sentence",
"single word single word",
"hey hey"
};
public static void main(String[] args) throws Exception {
String duplicatePattern = "\\b([\\w\\s']+) \\1\\b";
Pattern p = Pattern.compile(duplicatePattern);
for (String phrase : phrases) {
Matcher m = p.matcher(phrase);
if (m.matches()) {
System.out.println(m.group(1));
} else {
System.out.println(phrase);
}
}
}
}
Results:
this is a first sentence
hello my name is Chris
what's up man
today is tuesday
this is a very long sentence
single word
hey
Assumptions:
Uppercase words are equal to lowercase counterparts.
String fullString = "lol lol";
String[] words = fullString.split("\\W+");
StringBuilder stringBuilder = new StringBuilder();
Set<String> wordsHashSet = new HashSet<>();
for (String word : words) {
// Check for duplicates
if (wordsHashSet.contains(word.toLowerCase())) continue;
wordsHashSet.add(word.toLowerCase());
stringBuilder.append(word).append(" ");
}
String nonDuplicateString = stringBuilder.toString().trim();
Simple logic: split the string on spaces (" "), add each word to a LinkedHashSet, convert the set back to a String, and strip the "[", "]" and "," characters that toString() adds.
String s = "I want to walk my dog I want to walk my dog";
Set<String> temp = new LinkedHashSet<>();
String[] arr = s.split(" ");
for ( String ss : arr)
temp.add(ss);
String newl = temp.toString()
.replace("[","")
.replace("]","")
.replace(",","");
System.out.println(newl);
Output: I want to walk my dog
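One caveat with stripping characters from toString() is that a comma or bracket that belongs to a word gets removed as well. A possible variation, sketched here with the made-up names UniqueWords and dedupe, joins the set directly with String.join:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {
    // Keeps the first occurrence of each word, in the original order.
    static String dedupe(String s) {
        Set<String> unique =
                new LinkedHashSet<>(Arrays.asList(s.trim().split("\\s+")));
        return String.join(" ", unique);
    }

    public static void main(String[] args) {
        System.out.println(dedupe("I want to walk my dog I want to walk my dog"));
        // prints: I want to walk my dog
        System.out.println(dedupe("wait, what wait, what"));
        // prints: wait, what
    }
}
```

Like the set-based code above, this drops every repeated word, not only a repeated phrase, so "to be or not to be" would become "to be or not".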
It depends on your situation, but assuming that the string can be repeated at most twice (not three or more times), you could find the length of the entire string, find the halfway point, and compare each index after the halfway point with the matching index at the beginning. If the string can be repeated more than twice, you will need a more complicated algorithm that first determines how many times the string is repeated, then finds the starting index of each repeat and truncates everything from the beginning of the first repeat onward. If you can provide some more context about the scenarios you expect to handle, we can start putting together some ideas.
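One way the more general case could look, as a sketch under stated assumptions: repeats are whole-word repeats separated by whitespace, and the entire string consists of the repeated phrase. The names RepeatCollapser and collapse are made up for illustration.

```java
import java.util.Arrays;

class RepeatCollapser {
    // Returns the shortest leading group of words that, repeated a whole
    // number of times (at least twice), reproduces the input.
    // If there is no such group, the trimmed input is returned unchanged.
    static String collapse(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return trimmed;
        }
        String[] words = trimmed.split("\\s+");
        int n = words.length;
        for (int unit = 1; unit <= n / 2; unit++) {
            if (n % unit != 0) {
                continue; // the unit must divide the word count evenly
            }
            boolean repeats = true;
            for (int i = unit; i < n && repeats; i++) {
                repeats = words[i].equals(words[i - unit]);
            }
            if (repeats) {
                return String.join(" ", Arrays.copyOfRange(words, 0, unit));
            }
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(collapse("what's up man what's up man")); // prints what's up man
        System.out.println(collapse("hey hey hey"));                 // prints hey
        System.out.println(collapse("today is tuesday"));            // prints today is tuesday
    }
}
```

A legitimately repeated phrase such as "no no" would also be collapsed, so whether this is acceptable depends on the data.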
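// Java 8: drop consecutive repeated words, ignoring case
String str1 = "I am am am a good Good coder";
String[] arrStr = str1.split(" ");
String[] element = new String[1]; // holds the last word that was kept
// The lambda parameter must not be named str1: it would clash with the local variable above.
return Arrays.stream(arrStr).filter(word -> {
if (!word.equalsIgnoreCase(element[0])) {
element[0] = word;
return true;
}
return false;
}).collect(Collectors.joining(" "));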
How can I find the last word of a string? I am not trying to find a fixed word, in other words, I would not know what the last word is, however I want to retrieve it.
Here is my code:
myString = myString.trim();
String[] wordList = myString.split("\\s+");
System.out.println(wordList[wordList.length-1]);
Provided you consider words in a sentence to be delimited by whitespace and punctuation (commas, spaces, new lines, brackets, and so on), so that punctuation can appear at the end of the sentence, and you want to allow non-ASCII characters in the words, the following will find the last word in a string without the trailing punctuation:
static String lastWord(String sentence) {
Pattern p = Pattern.compile("([\\p{Alpha}]+)(?=\\p{Punct}*$)", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = p.matcher(sentence);
if (m.find()) {
return m.group();
}
return ""; // or null
}
The regular expression uses a look-ahead to allow zero or more punctuation characters at the end of the string, and matches the alphabetic word before them.
If you want to also allow numbers in the word, change {Alpha} to {Alnum}.
Read the String API for various methods you might use.
For example you could:
Use the lastIndexOf(...) method to find where the start of the word is
Then use the substring(...) method to get the word
Use the StringTokenizer for this
StringTokenizer st = new StringTokenizer("this is a test"); // take any String
int count = st.countTokens(); // number of tokens in that String
String[] myStringArray = new String[count];
for (int i = 0; i < count; i++) {
myStringArray[i] = st.nextToken(); // insert the words/tokens into the string array
}
System.out.println("Last Word is--" + myStringArray[myStringArray.length - 1]); // get the last word of the given String
System.out.println("enter the string");
Scanner input = new Scanner(System.in);
String instr = input.nextLine();
instr = instr.trim();
int index = instr.lastIndexOf(" ");
int l = instr.length();
System.out.println(l);
String lastStr = instr.substring(index+1,l);
System.out.println("last string .."+lastStr);
I need to extract several integers from a string that looks like this:
22:43:12:45
I need to extract 22, 43, 12, and 45 as separate integers. I need to use string methods or scanner methods to read up until : and extract the 22. Then read between the first : and second : to give me 43, and so on.
I can extract up to the first : no problem, but I don't know what to do thereafter.
Any help would be much appreciated.
String[] parts = str.split(":");
int[] numbers = new int[parts.length];
Iterate over this String array to get an array of integers:
int index = 0;
for(String part : parts)
numbers[index++] = Integer.parseInt(part);
You should look at the String.split method. Given a regular expression, this method splits the string around its matches. In your case the regular expression is ":".
String s ="22:43:12:45";
int i =0;
for (String s1 : s.split(":")) { // For each string in the resulting split array
i = Integer.parseInt(s1);
System.out.println(i);
}
split returns a string array with your string separated. In this case, the resulting array will have "22" at index 0, "43" at index 1, and so on. To convert these to integers you can use the parseInt method, which takes a string and returns the equivalent int.
You can use only indexOf and substring methods of String:
String text = "22:43:12:45";
int start = 0;
int colon;
while (start < text.length()) {
colon = text.indexOf(':', start);
if (colon < 0) {
colon = text.length();
}
// You can store the returned value somewhere, in a list for example
Integer.parseInt(text.substring(start, colon));
start = colon + 1;
}
Using Scanner is even simpler:
String text = "22:43:12:45";
Scanner scanner = new Scanner(text);
scanner.useDelimiter(":");
while (scanner.hasNext()) {
// Store the returned value somewhere to use it
scanner.nextInt();
}
However, String.split is the shortest solution.
Make a regex like
(\d\d):(\d\d):(\d\d):(\d\d)
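A sketch of how that regex could be applied with capturing groups, assuming the input always has exactly four two-digit fields (the class name ColonFields and the method parse are made up for illustration):

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonFields {
    private static final Pattern FOUR_FIELDS =
            Pattern.compile("(\\d\\d):(\\d\\d):(\\d\\d):(\\d\\d)");

    // Returns the four two-digit fields as ints, or throws if the input
    // does not have exactly the shape dd:dd:dd:dd.
    static int[] parse(String s) {
        Matcher m = FOUR_FIELDS.matcher(s);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected dd:dd:dd:dd but got: " + s);
        }
        int[] out = new int[m.groupCount()];
        for (int g = 1; g <= m.groupCount(); g++) {
            out[g - 1] = Integer.parseInt(m.group(g)); // group 0 is the whole match
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("22:43:12:45"))); // prints [22, 43, 12, 45]
    }
}
```

Compared with split, the regex also validates the shape of the input, at the cost of being tied to a fixed number of fields.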
How to count the number of times each word appear in a String in Java using Regular Expression?
I don't think a regex can solve your problem completely.
You want to:
1. Split a string into words. A regular expression can do this for a very simple definition of a word, "parts of a string separated by whitespace or punctuation", which is not a very good definition even if you stick to English text.
2. Count the number of occurrences of each word derived from step 1. To do that you must store some kind of mapping, and regexes neither store nor count.
A workable approach could be to:
1. split the input string (by regex or other means) into an array of word strings,
2. iterate over the array, building a Map to keep count of each word,
3. iterate over the map to output a list of words and their number of occurrences.
If your input is limited to English, you still have to consider how you want your algorithm to behave for things like they're <-> they are, and for compound words. Add other languages to the mix for additional kinds of headaches (different ways of writing the same word, words split into parts, spelling that depends on where in a sentence the word occurs, etc.).
I would split your task into a) identify words and b) count number of each unique word in text.
a) could be solved with splitting the text with a regex.
b) could be solved by building a map with the result from a).
String text = "I like good mules. Mules are good :)";
String[] words = text.split("([\\W\\s]+)");
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String word: words) {
if (counts.containsKey(word)) {
counts.put(word, counts.get(word) + 1);
} else {
counts.put(word, 1);
}
}
result: {Mules=1, are=1, good=2, mules=1, like=1, I=1}
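The result above counts Mules and mules separately. If case should be ignored, one possible variation lower-cases the text first and uses Map.merge to shorten the counting. This is a sketch; the names WordCounter and count are made up for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCounter {
    // Counts words case-insensitively. A TreeMap keeps the keys sorted,
    // so the printed order is predictable.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) { // a leading delimiter yields one empty token
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("I like good mules. Mules are good :)"));
        // prints {are=1, good=2, i=1, like=1, mules=2}
    }
}
```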
Pattern p = Pattern.compile("\\babba\\b");
Matcher m = p.matcher("abba is abba with abbabba and abba doing abba");
int count = 0;
while(m.find()){
count++;
}
System.out.println(count); //4
Using Guava, this is a one-liner:
Multiset<String> countOfEachWord =
HashMultiset.create(Splitter.on(" ").omitEmptyStrings().split(myString));
then to get the count of "dog" for example you would say:
countOfEachWord.count("dog")
Must you use a regex? If not this might help:
public static int count(final String string, final String substring)
{
int count = 0;
int idx = 0;
while ((idx = string.indexOf(substring, idx)) != -1)
{
idx++;
count++;
}
return count;
}
int CountWords(String t){
return t.split("([[a-z][A-Z][0-9][\\Q-\\E]]+)",-1).length+(t.replaceAll("([[a-z][A-Z][0-9][\\W]]*)", "")).length()-1;
}
Intended for English words (including chemical names) mixed with Chinese words.