Check amount of particular words in string - java

If I want to compare the amount one word is used to the other, how would I do that?
It wouldn't be str.contains("cat") > str.contains("dog")
So for example:
if(str.contains("cat") == str.contains("dog")){
System.out.println("true");
}
else
system.out.print("false");
This would hypothetically print true if cat and dog appear the same amount of times. But obviously it doesn't, what would I have to d, to get it to check?

To count number of ocurrences of a String in another String create a function (extracted from here):
The "split and count" method:
public class CountSubstring {
public static int countSubstring(String subStr, String str){
// the result of split() will contain one more element than the delimiter
// the "-1" second argument makes it not discard trailing empty strings
return str.split(Pattern.quote(subStr), -1).length - 1;
}
The "remove and count the difference" method:
public static int countSubstring(String subStr, String str){
return (str.length() - str.replace(subStr, "").length()) / subStr.length();
}
Then you just have to compare:
return countSubstring("dog", phrase) > countSubstring("cat", phrase);
ADDITIONAL INFORMATION
To compare strings use String::equals or String::equalsIgnoreCase if you don't mean uppercase and lowercase.
string1.equals(string2);
string1.equalsIgnoreCase(string2);
To find number of ocurrences of a string in another string use indexOf
string1.indexOf(string2, index);
To see if a string contains another string use contains
string1.contains(string2);

String#contains() will return true if the searched string is found at least once, which is done for performance reasons. Thus str.contains("cat") == str.contains("dog") would be true if both cat and dog are found independent of how often they are found.
What you could do is use 2 regular expressions and check the number of matches:
int countWords(String input, String word ) {
Pattern p = Pattern.compile( "\\b" + word + "\\b" );
int count = 0;
Matcher m = p.matcher( input );
while( m.find() ) {
count++;
}
return count;
}
Usage:
String str = "dog eats dog but cat eats hotdog";
System.out.println("dogs: " + countWords( str, "dog"));
System.out.println("cats: " + countWords( str, "cat"));
Output:
dogs: 2
cats: 1

There are many possible solutions for your problem. I won't show you a full solution, but will try to guide you. Here are some:
split the String according to space(s), iterate over the resulted array and increment the counter when you match the String you're looking for.
use a regex that matches exactly the word you're looking for, there are many useful methods in the Matcher and Pattern classes, go through them.
Java 8 Stream tools has countless methods, you can get the result in one line.

You could get the count in the following way for dog
index = -1;
dogcount = 0;
do {
index = str.indexOf("dog",index+1);
if(index > -1)
dogcount++;
}while(index == -1);
similarly get cat count

Use Apache Commons StringUtils.CountMatches: - counts the number of occurrences of one String in another
https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#countMatches%28java.lang.String,%20java.lang.String%29

Related

How to find first occurance of whitespace(tab+space+etc) in java?

So I have something like this
System.out.println(some_string.indexOf("\\s+"));
this gives me -1
but when I do with specific value like \t or space
System.out.println(some_string.indexOf("\t"));
I get the correct index.
Is there any way I can get the index of the first occurrence of whitespace without using split, as my string is very long.
PS - if it helps, here is my requirement. I want the first number in the string which is separated from the rest of the string by a tab or space ,and i am trying to avoid split("\\s+")[0]. The string starts with that number and has a space or tab after the number ends
The point is: indexOf() takes a char, or a string; but not a regular expression.
Thus:
String input = "a\tb";
System.out.println(input);
System.out.println(input.indexOf('\t'));
prints 1 because there is a TAB char at index 1.
System.out.println(input.indexOf("\\s+"));
prints -1 because there is no substring \\s+ in your input value.
In other words: if you want to use the powers of regular expressions, you can't use indexOf(). You would be rather looking towards String.match() for example. But of course - that gives a boolean result; not an index.
If you intend to find the index of the first whitespace, you have to iterate the chars manually, like:
for (int index = 0; index < input.length(); index++) {
if (Character.isWhitespace(input.charAt(index))) {
return index;
}
}
return -1;
Something of this sort might help? Though there are better ways to do this.
class Sample{
public static void main(String[] args) {
String s = "1110 001";
int index = -1;
for(int i = 0; i < s.length(); i++ ){
if(Character.isWhitespace(s.charAt(i))){
index = i;
break;
}
}
System.out.println("Required Index : " + index);
}
}
Well, to find with a regular expression, you'll need to use the regular expression classes.
Pattern pat = Pattern.compile("\\s");
Matcher m = pat.matcher(s);
if ( m.find() ) {
System.out.println( "Found \\s at " + m.start());
}
The find method of the Matcher class locates the pattern in the string for which the matcher was created. If it succeeds, the start() method gives you the index of the first character of the match.
Note that you can compile the pattern only once (even create a constant). You just have to create a Matcher for every string.

Java Get first character values for a string

I have inputs like
AS23456SDE
MFD324FR
I need to get First Character values like
AS, MFD
There should no first two or first 3 characters input can be changed. Need to get first characters before a number.
Thank you.
Edit : This is what I have tried.
public static String getPrefix(String serial) {
StringBuilder prefix = new StringBuilder();
for(char c : serial.toCharArray()){
if(Character.isDigit(c)){
break;
}
else{
prefix.append(c);
}
}
return prefix.toString();
}
Here is a nice one line solution. It uses a regex to match the first non numeric characters in the string, and then replaces the input string with this match.
public String getFirstLetters(String input) {
return new String("A" + input).replaceAll("^([^\\d]+)(.*)$", "$1")
.substring(1);
}
System.out.println(getFirstLetters("AS23456SDE"));
System.out.println(getFirstLetters("1AS123"));
Output:
AS
(empty)
A simple solution could be like this:
public static void main (String[]args) {
String str = "MFD324FR";
char[] characters = str.toCharArray();
for(char c : characters){
if(Character.isDigit(c))
break;
else
System.out.print(c);
}
}
Use the following function to get required output
public String getFirstChars(String str){
int zeroAscii = '0'; int nineAscii = '9';
String result = "";
for (int i=0; i< str.lenght(); i++){
int ascii = str.toCharArray()[i];
if(ascii >= zeroAscii && ascii <= nineAscii){
result = result + str.toCharArray()[i];
}else{
return result;
}
}
return str;
}
pass your string as argument
I think this can be done by a simple regex which matches digits and java's string split function. This Regex based approach will be more efficient than the methods using more complicated regexs.
Something as below will work
String inp = "ABC345.";
String beginningChars = inp.split("[\\d]+",2)[0];
System.out.println(beginningChars); // only if you want to print.
The regex I used "[\\d]+" is escaped for java already.
What it does?
It matches one or more digits (d). d matches digits of any language in unicode, (so it matches japanese and arabian numbers as well)
What does String beginningChars = inp.split("[\\d]+",2)[0] do?
It applies this regex and separates the string into string arrays where ever a match is found. The [0] at the end selects the first result from that array, since you wanted the starting chars.
What is the second parameter to .split(regex,int) which I supplied as 2?
This is the Limit parameter. This means that the regex will be applied on the string till 1 match is found. Once 1 match is found the string is not processed anymore.
From the Strings javadoc page:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
This will be efficient if your string is huge.
Possible other regex if you want to split only on english numerals
"[0-9]+"
public static void main(String[] args) {
String testString = "MFD324FR";
int index = 0;
for (Character i : testString.toCharArray()) {
if (Character.isDigit(i))
break;
index++;
}
System.out.println(testString.substring(0, index));
}
this prints the first 'n' characters before it encounters a digit (i.e. integer).

How to calculate syllables in text with regex and Java

I have text as a String and need to calculate number of syllables in each word. I've tried to split all text into array of words and than processed each word separately. I used regular expressions for that. But pattern for syllables doesn't work as it should. Please advice how to change it to calculate correct number of syllables. My initial code.
public int getNumSyllables()
{
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
int count=0;
List <String> tokens = new ArrayList<String>();
for(String word: words){
tokens = Arrays.asList(word.split("[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*"));
count+= tokens.size();
}
return count;
}
This question is from a Java Course of UCSD, am I right?
I think you should provide enough information for this question, so that it won't confused people who want to offer some help. And here I have my own solution, which already been tested by the test case from the local program, also the OJ from UCSD.
You missed some important information about the definition of syllable in this question. Actually I think the key point of this problem is how should you deal with the e. For example, let's say there is a combination of te. And if you put te in the middle of a word, of course it should be counted as a syllable; However if it's at the end of a word, the e should be thought as a silent e in English, so it should not be thought as a syllable.
That's it. And I would like to write down my thought with some pseudo code:
if(last character is e) {
if(it is silent e at the end of this word) {
remove the silent e;
count the rest part as regular;
} else {
count++;
} else {
count it as regular;
}
}
You may find that I am not only using regex to deal with this problem. Actually I have thought about it: can this question really be done only using regex? My answer is: nope, I don't think so. At least now, with the knowledge UCSD gives us, it's too difficult to do that. Regex is a powerful tool, it can map the desired characters very fast. However regex is missing some functionality. Take the te as example again, regex won't be able to think twice when it is facing the word like teate (I made up this word just for example). If our regex pattern would count the first te as syllable, then why the last te not?
Meanwhile, UCSD actually have talked about it on the assignment paper:
If you find yourself doing mental gymnastics to come up with a single regex to count syllables directly, that's usually an indication that there's a simpler solution (hint: consider a loop over characters--see the next hint below). Just because a piece of code (e.g. a regex) is shorter does not mean it is always better.
The hint here is that, you should think this problem together with some loop, combining with regex.
OK, I should finally show my code now:
protected int countSyllables(String word)
{
// TODO: Implement this method so that you can call it from the
// getNumSyllables method in BasicDocument (module 1) and
// EfficientDocument (module 2).
int count = 0;
word = word.toLowerCase();
if (word.charAt(word.length()-1) == 'e') {
if (silente(word)){
String newword = word.substring(0, word.length()-1);
count = count + countit(newword);
} else {
count++;
}
} else {
count = count + countit(word);
}
return count;
}
private int countit(String word) {
int count = 0;
Pattern splitter = Pattern.compile("[^aeiouy]*[aeiouy]+");
Matcher m = splitter.matcher(word);
while (m.find()) {
count++;
}
return count;
}
private boolean silente(String word) {
word = word.substring(0, word.length()-1);
Pattern yup = Pattern.compile("[aeiouy]");
Matcher m = yup.matcher(word);
if (m.find()) {
return true;
} else
return false;
}
You may find that besides from the given method countSyllables, I also create two additional methods countit and silente. countit is for counting the syllables inside the word, silente is trying to figure it out that is this word end with a silent e. And it should also be noticed that the definition of not silent e. For example, the should be consider not silent e, while ate is considered silent e.
And here is the status my code has already passed the test, from both local test case and OJ from UCSD:
And from OJ the test result:
P.S: It should be fine to use something like [^aeiouy] directly, because the word is parsed before we call this method. Also change to lowercase is necessary, that would save a lot of work dealing with the uppercase. What we want is only the number of syllables.
Talking about number, an elegant way is to define count as static, so the private method could directly use count++ inside. But now it's fine.
Feel free to contact me if you still don't get the method of this question :)
Using the concept of user5500105, I have developed the following method to calculate the number of Syllables in a word. The rules are:
consecutive vowels are counted as 1 syllable. eg. "ae" "ou" are 1 syllable
Y is considered as a vowel
e at the end is counted as syllable if e is the only vowel: eg: "the" is one syllable, since "e" at the end is the only vowel while "there" is also 1 syllable because "e" is at the end and there is another vowel in the word.
public int countSyllables(String word) {
ArrayList<String> tokens = new ArrayList<String>();
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
while (m.find()) {
tokens.add(m.group());
}
//check if e is at last and e is not the only vowel or not
if( tokens.size() > 1 && tokens.get(tokens.size()-1).equals("e") )
return tokens.size()-1; // e is at last and not the only vowel so total syllable -1
return tokens.size();
}
This gives you a number of syllables vowels in a word:
public int getNumVowels(String word) {
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
int count = 0;
while (m.find()) {
count++;
}
return count;
}
You can call it on every word in your string array:
String[] words = getText().split("\\s+");
for (String word : words ) {
System.out.println("Word: " + word + ", vowels: " + getNumVowels(word));
}
Update: as freerunner noted, calculating the number of syllables is more complicated than just counting vowels. One need to take into account combinations like ou, ui, oo, the final silent e and possibly something else. As I am not a native English speaker, I am not sure what the correct algorithm would be.
This is how I do it. This is about as simple an algorithm I could come up with.
public static int syllables(String s) {
final Pattern p = Pattern.compile("([ayeiou]+)");
final String lowerCase = s.toLowerCase();
final Matcher m = p.matcher(lowerCase);
int count = 0;
while (m.find())
count++;
if (lowerCase.endsWith("e"))
count--;
return count < 0 ? 1 : count;
}
I use this in combination with a soundex function to determine if words sound alike. The syllable count improves accuracy of my soundex function.
Note: This is strictly for counting the syllables in a word. I assume you can parse your input for words using something like java.util.StringTokenizer.
Your line
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
is splitting ON words, and returning only the space between words! You want to split on the space between words, as follows:
String[] words = getText().toLowerCase().split("\\s+");
you can do it as the following :
public int getNumSyllables()
{
return getSyllables(getTokens("[a-zA-Z]+"));
}
protected List<String> getWordTokens(String word,String pattern)
{
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile(pattern);
Matcher m = tokSplitter.matcher(word);
while (m.find()) {
tokens.add(m.group());
}
return tokens;
}
private int getSyllables(List<String> tokens)
{
int count=0;
for(String word : tokens)
if(word.toLowerCase().endsWith("e") && getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size() > 0)
count+=getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size();
else
count+=getWordTokens(word.toLowerCase(), "[aeiouy]+").size();
return count;
}
I count the the separately, then split the text based on words which are ended with e.
Then counting the syllables, here is my implementation:
int syllables = 0;
word = word.toLowerCase();
if(word.contains("the ")){
syllables ++;
}
String[] split = word.split("e!$|e[?]$|e,|e |e[),]|e$");
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile("[aeiouy]+");
for (int i = 0; i < split.length; i++) {
String s = split[i];
Matcher m = tokSplitter.matcher(s);
while (m.find()) {
tokens.add(m.group());
}
}
syllables += tokens.size();
I've testesd an all test cases are passed.
You are using method split incorrectly. This method recieve separator. Need write something like this:
String[] words = getText().toLowerCase().split(" ");
But if you want to count the number of syllables, it is enough to count the number of vowels:
String input = "text";
Set<Character> vowel = new HashSet<>();
vowel.add('a');
vowel.add('e');
vowel.add('i');
vowel.add('o');
vowel.add('u');
int count = 0;
for (char c : input.toLowerCase().toCharArray()) {
if (vowel.contains(c)){
count++;
}
}
System.out.println("count = " + count);

Finding the longest substring between a "start" string and one of 3 possible "end" strings

So my question is substring-related.
How do you find the longest possible substring between a starting string and one of three ending strings? I also need to find the index of the original string that the largest substring starts at.
So:
Start string:
"ATG"
3 possible end strings:
"TAG"
"TAA"
"TGA"
An example original string might be:
"SDAFKJDAFKATGDFSDFAKJDNKSJFNSDTGASDFKJSDNKFJSNDJFATGDSDFKJNSDFTAGSDFSDATGFF"
So the result of that should give me:
- Longest substring length: 23 (from the substring ATGDFSDFAKJDNKSJFNSDTGA)
- Index of longest substring: 10
I cannot use Regex.
Thanks for any help!
This is arguably the easiest way, and it's just one line:
String target = str.replaceAll(".*ATG(.*)(TAG|TAA|TGA).*", "$1");
To find the index:
int index = str.indexOf("ATG") + 3;
Note: I have interpreted your remark "I cannot use regex" to mean "I am unskilled at regex", because if it's a java question, regex is available.
Well, this looks like a fun one.
It seems the most straightforward way to do this would be to build your own mini finite state machine. You would have to parse each character in the string and keep track of all possible character sequences that would terminate the sequence.
If you hit a 'T', you need to jump ahead and look at the next character. If it's an 'A' or a 'G' you need to jump ahead again, otherwise, add those tokens to your string. Continue the pattern until you get to the end of the original string, or match one of your terminal patterns.
So, maybe something that looks like this (simplified example):
String longestSequence(String original) {
StringBuilder sb = new StringBuilder();
char[] tokens = original.toCharArray();
for (int i = 0; i < tokens.length; ++i) {
// read each token, and compare / look ahead to see if you should keep going or terminate.
}
return sb.toString();
}
match your string to this regex:
ATG[A-Z]+(TAG|TAA|TGA)
if multiple match occurs then iterate and keep the one with highest length.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// using pattern with flags
Pattern pattern = Pattern.compile("ATG[A-Z]+(TAG|TAA|TGA)");
Matcher matcher = pattern.matcher( yourInputStringHere );
while (matcher.find()) {
System.out.println("Found the text \"" + matcher.group()
+ "\" starting at " + matcher.start()
+ " and ending at index " + matcher.end());
}
There are already some beautiful and elegant solutions to your problem (Bohemian and inquisitive). If you still - as originally stated - can't use regex, here's an alternative. This code is not especially elegant, and as pointed, there are better ways to do it, but it should at least clearly show you the logic behind the solution to your problem.
How do you find the longest possible substring between a starting string
and one of three ending strings?
First, find the index of starting string, then find the index of each ending string, and get substrings for each ending, then their length. Remember that if string is not found, its index will be -1.
String originalString = "SDAFKJDAFKATGDFSDFAKJDNKSJFNSDTGASDFKJSDNKFJSNDJFATGDSDFKJNSDFTAGSDFSDATGFF";
String STARTING_STRING = "ATG";
String END1 = "TAG";
String END2 = "TAA";
String END3 = "TGA";
//let's find the index of STARTING_STRING
int posOfStartingString = originalString.indexOf(STARTING_STRING);
//if found
if (posOfStartingString != -1) {
int tagPos[] = new int[3];
//let's find the index of each ending strings in the original string
tagPos[0] = originalString.indexOf(END1, posOfStartingString+3);
tagPos[1] = originalString.indexOf(END2, posOfStartingString+3);
tagPos[2] = originalString.indexOf(END3, posOfStartingString+3);
int lengths[] = new int[3];
//we can now use the following methods:
//public String substring(int beginIndex, int endIndex)
//where beginIndex is our posOfStartingString
//and endIndex is position of each ending string (if found)
//
//and finally, String.length() to get the length of each substring
if (tagPos[0] != -1) {
lengths[0] = originalString.substring(posOfStartingString, tagPos[0]).length();
}
if (tagPos[1] != -1) {
lengths[1] = originalString.substring(posOfStartingString, tagPos[1]).length();
}
if (tagPos[2] != -1) {
lengths[2] = originalString.substring(posOfStartingString, tagPos[2]).length();
}
} else {
//no starting string in original string
}
lengths[] table now contains length of strings starting with STARTING_STRING and 3 respective endings. Then just find which one is the longest and you will have your answer.
I also need to find the index of the original string that the largest substring starts at.
This will be the index of where starting string starts, in this case 10.

Java: Finding the number of word matches in a given string

I am trying to find the number of word matches for a given string and keyword combination, like this:
public int matches(String keyword, String text){
// ...
}
Example:
Given the following calls:
System.out.println(matches("t", "Today is really great, isn't that GREAT?"));
System.out.println(matches("great", "Today is really great, isn't that GREAT?"));
The result should be:
0
2
So far I found this: Find a complete word in a string java
This only returns if the given keyword exists but not how many occurrences. Also, I am not sure if it ignores case sensitivity (which is important for me).
Remember that substrings should be ignored! I only want full words to be found.
UPDATE
I forgot to mention that I also want keywords that are separated via whitespace to match.
E.g.
matches("today is", "Today is really great, isn't that GREAT?")
should return 1
Use a regular expression with word boundaries. It's by far the easiest choice.
int matches = 0;
Matcher matcher = Pattern.compile("\\bgreat\\b", Pattern.CASE_INSENSITIVE).matcher(text);
while (matcher.find()) matches++;
Your milage may vary on some foreign languages though.
How about taking advantage of indexOf ?
s1 = s1.toLowerCase(Locale.US);
s2 = s2.toLowerCase(Locale.US);
int count = 0;
int x;
int y = s2.length();
while((x=s1.indexOf(s2)) != -1){
count++;
s1 = s1.substr(x,x+y);
}
return count;
Efficient version
int count = 0;
int y = s2.length();
for(int i=0; i<=s1.length()-y; i++){
int lettersMatched = 0;
int j=0;
while(s1[i]==s2[j]){
j++;
i++;
lettersMatched++;
}
if(lettersMatched == y) count++;
}
return count;
For more efficient solution, you will have to modify KMP algorithm a little. Just google it, its simple.
well,you can use "split" to separate the words and find if there exists a word matches exactly.
hope that helps!
one option would be RegEx. Basically it sounds like you are looking to match a word with any punctuation on the left or right. so:
" great."
" great!"
" great "
" great,"
"Great"
would all match, but
"greatest"
wouldn't

Categories

Resources