Counting the occurrences of string in Java using string.split()

I'm new to Java. I thought I would write a program to count the occurrences of a character or a sequence of characters in a sentence. I wrote the following code. But I then saw there are some ready-made options available in Apache Commons.
Anyway, can you look at my code and say if there is any rookie mistake? I tested it for a couple of cases and it worked fine. I can think of one case where if the input is a big text file instead of a small sentence/paragraph, the split() function may end up being problematic since it has to handle a large variable. However this is my guess and would love to have your opinions.
private static void countCharInString() {
    // Get the sentence and the search keyword
    System.out.println("Enter a sentence\n");
    Scanner in = new Scanner(System.in);
    String inputSentence = in.nextLine();
    System.out.println("\nEnter the character to search for\n");
    String checkChar = in.nextLine();
    in.close();
    // Count the number of occurrences
    String[] splitSentence = inputSentence.split(checkChar);
    int countChar = splitSentence.length - 1;
    System.out.println("\nThe character/sequence of characters '" + checkChar + "' appear(s) '" + countChar + "' time(s).");
}
Thank you :)

Because of edge cases, split() is the wrong approach.
Instead, use replaceAll() to remove all other characters then use the length() of what's left to calculate the count:
int count = input.replaceAll(".*?(" + check + "|$)", "$1").length() / check.length();
FYI, the regex created (for example, when check = 'xyz') looks like ".*?(xyz|$)", which means "everything up to and including 'xyz' or end of input", and each match is replaced by the captured text (either 'xyz' or nothing if it's end of input). This leaves just a string of 0-n copies of the check string. Dividing its length by the length of check then gives you the total. Note that check is inserted into the pattern unescaped, so this assumes it contains no regex metacharacters.
To protect against the check being null or zero-length (causing a divide-by-zero error), code defensively like this:
int count = check == null || check.isEmpty() ? 0 : input.replaceAll(".*?(" + check + "|$)", "$1").length() / check.length();
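The one-liner above treats check as a regex fragment, and "." does not match line breaks by default. A minimal sketch of a hardened variant follows. It assumes the logic may be wrapped in a helper (the class name ReplaceAllCount and the method count are illustrative, not from the answer): Pattern.quote escapes the search string, (?s) lets "." match line terminators, and \z anchors at the true end of input.

```java
import java.util.regex.Pattern;

class ReplaceAllCount {
    // Counts non-overlapping occurrences of check in input.
    // Hypothetical helper: same replaceAll idea, but with the search text quoted.
    static int count(String input, String check) {
        if (input == null || check == null || check.isEmpty()) {
            return 0; // avoids the divide-by-zero mentioned above
        }
        // (?s): '.' also matches line terminators; \z: very end of input
        String regex = "(?s).*?(" + Pattern.quote(check) + "|\\z)";
        // What remains is only the copies of check, so divide by its length
        return input.replaceAll(regex, "$1").length() / check.length();
    }
}
```

With this, a search for "." in "a.b.c" counts the two literal dots instead of treating "." as "any character".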

One flaw I can immediately think of: if your inputSentence consists of nothing but a single occurrence of checkChar, split() will return an empty array and your count will be -1 instead of 1.
An example interaction:
Enter a sentence
onlyme
Enter the character to search for
onlyme
The character/sequence of characters 'onlyme' appear(s) '-1' time(s).
A better way would be to use the .indexOf() method of String to count the occurrences like this:
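int count = 0;
int i = 0;
// Assumes checkChar is not empty: indexOf("") never returns -1, so the loop would not end
while ((i = inputSentence.indexOf(checkChar, i)) != -1) {
    count++;
    i = i + checkChar.length();
}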

split is the wrong approach for a number of reasons:
String.split takes a regular expression
Regular expressions have characters with special meanings, so you cannot use it for all characters (without escaping them). This requires an escaping function.
Performance: String.split is optimized for single characters. If this were not the case, you would be creating and compiling a regular expression every time. Still, String.split creates one object for the String[] and one object for each String in it, every time that you call it. And you have no use for these objects; all you want to know is the count. Although a future all-knowing HotSpot compiler might be able to optimize that away, the current one does not - it is roughly 10 times as slow as simply counting characters as below.
It will not count correctly if checkChar occurs at the end of the input (possibly several times in a row), because split discards trailing empty strings.
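The edge cases listed above are easy to reproduce. Here is a small sketch (the class and method names are illustrative, not from the question) that compares the split-based count with a plain indexOf scan on two inputs:

```java
class SplitPitfalls {
    // The approach from the question: number of pieces minus one.
    static int countWithSplit(String sentence, String check) {
        return sentence.split(check).length - 1;
    }

    // Plain substring scan for comparison; treats check as literal text.
    static int countWithIndexOf(String sentence, String check) {
        if (check.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = sentence.indexOf(check, from)) != -1) {
            count++;
            from += check.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // "." is a regex metacharacter: every character is a separator
        System.out.println(countWithSplit("a.b.c", "."));    // prints -1
        System.out.println(countWithIndexOf("a.b.c", "."));  // prints 2
        // The trailing "a" produces an empty piece that split drops
        System.out.println(countWithSplit("banana", "a"));   // prints 2
        System.out.println(countWithIndexOf("banana", "a")); // prints 3
    }
}
```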
A better approach is much simpler: just go and count the characters in the string that match your checkChar. If you think about the steps you need to take to count characters, that's what you'd end up with by yourself:
public static int occurrences(String str, char checkChar) {
    int count = 0;
    for (int i = 0, l = str.length(); i < l; i++) {
        if (str.charAt(i) == checkChar)
            count++;
    }
    return count;
}
If you want to count the occurrences of a sequence of multiple characters, it becomes slightly trickier to write with some efficiency because you don't want to create a new substring every time.
public static int occurrences(String str, String checkChars) {
    // An empty search string would never advance the offset below
    if (checkChars.isEmpty())
        return 0;
    int count = 0;
    int offset = 0;
    while ((offset = str.indexOf(checkChars, offset)) != -1) {
        offset += checkChars.length();
        count++;
    }
    return count;
}
That's still 10-12 times as fast as String.split() when matching a two-character string.
Warning: Performance timings are ballpark figures that depend on many circumstances. Since the difference is an order of magnitude, it's safe to say that String.split is slower in general. (Tests performed on jdk 1.8.0-b28 64-bit, using 10 million iterations, verified that results were stable and the same with and without -Xcomp, after performing tests 10 times in same JVM instances.)

Related

How to remove a trailing comma from a string (Java)

I have an array, which I iterate through with a for-each loop. I append each element to a string followed by a comma, but I am struggling to find a "logic"-based solution for getting rid of the trailing comma.
I have almost a year's worth of Java under my belt, so most of the solutions I am trying to implement are logic-based (for now), since this is a part of the Java MOOC course. What are some options I can look at? What am I missing?
for (int number : array) {
    String thread = "";
    thread += number + ", ";
    System.out.print(thread);
}

You can use a Stream to achieve this result. Since array is an int[], map each element to a String before joining:
System.out.println(Arrays.stream(array).mapToObj(String::valueOf).collect(Collectors.joining(",")));

I'm not sure of the constraints of this project for your course, but if you're able to, try a StringJoiner! It automatically adds a delimiter between the items that are added to it. Another note: I think you're going to want to declare your String outside of your for loop; otherwise it resets on every iteration.
StringJoiner joiner = new StringJoiner(",");
for (int number : array) {
    joiner.add(String.valueOf(number));
}
System.out.print(joiner.toString());

What I like to do when I'm just doing something simple and quick is this:
String thread = "";
for (int i = 0; i < array.length; i++) {
    int number = array[i];
    thread += number;
    if (i < array.length - 1) {
        thread += ", ";
    }
}
Basically all it does is check that we aren't on the last index and append the comma only if it isn't the last index. It's quick, simple, and doesn't require any other classes.

Presuming you have a string ending with a comma followed by zero or more whitespace characters, you could do the following. String.replaceAll() uses a regular expression to find the part to replace.
\\s* means 0 or more whitespace characters
$ means at the end of the input
String str = "a, a, b,c, ";
str = str.replaceAll(",\\s*$","");
System.out.println(str);
Prints
a, a, b,c

Character occurrence in a txt file java

I'm writing a character occurrence counter in a txt file. I keep getting a result of 0 for my count when I run this:
public double charPercent(String letter) {
    Scanner inputFile = new Scanner(theText);
    int charInText = 0;
    int count = 0;
    // counts all of the user identified character
    while(inputFile.hasNext()) {
        if (inputFile.next() == letter) {
            count += count;
        }
    }
    return count;
}
Anyone see where I am going wrong?

This is because Scanner.next() returns entire words rather than characters. This means that the string it returns will rarely be the same as the single-letter parameter (except for cases where the word is a single letter, such as 'I' or 'A'). I also don't see the need for this line:
int charInText = 0;
as the variable is not being used.
Instead you could try something like this:
public double charPercent(String letter) {
    Scanner inputFile = new Scanner(theText);
    int totalCount = 0;
    while (inputFile.hasNext()) {
        // Read the word once; calling next() twice would consume two words
        String word = inputFile.next();
        // Difference of the word's length with and without the given letter
        int occurrencesInWord = word.length() - word.replace(letter, "").length();
        totalCount += occurrencesInWord;
    }
    return totalCount;
}
By using the difference between the length of the word at inputFile.next() with and without the letter, you will know the number of times the letter occurs in that specific word. This is added to the total count and repeated for all words in the txt.

Use inputFile.next().equals(letter) instead of inputFile.next() == letter.
== compares references. You should compare the contents of the String objects, so use String's equals().
And as said in the comments, change count += count to count += 1 or count++.
Read here for more explanation.

Do you mean to compare the entire next word to your desired letter?
inputFile.next() will return the next String, delimited by whitespace (tab, enter, spacebar). Unless your file only contains singular letters all separated by spaces, your code won't be able to find all the occurrences of letters in those words.
You might want to try calling inputFile.next() to get the next String, and then breaking that String down into a charArray. From there, you can iterate through the charArray (think for loops) to find the desired character. As a commenter mentioned, you don't want to use == to compare two Strings, but you can use it to compare two characters. If the character from the charArray of your String matches your desired character, then try count++ to increment your counter by 1.
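The approach described above could look roughly like the following. This is only a sketch: the question does not show what theText is, so the helper below assumes the text is available as a String, and the names CharCounter and countChar are made up for illustration.

```java
import java.util.Scanner;

class CharCounter {
    // Counts how often letter occurs in text, word by word (case-sensitive).
    static int countChar(String text, char letter) {
        int count = 0;
        try (Scanner words = new Scanner(text)) {
            while (words.hasNext()) {
                // next() returns a whole whitespace-delimited word,
                // so look at each of its characters in turn
                for (char c : words.next().toCharArray()) {
                    if (c == letter) { // == is fine for comparing chars
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

Because the parameter is a char rather than a String, the comparison with == is safe here.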

Java newbie: cutting a string off?

I'm new to programming (taking a class) and I'm not sure how to accomplish this one task.
"Ignoring case, find the last occurrence of an ‘a’ in the input and remove all of the characters following it. In the case where there are no ‘a’s in the word, remove all but the first two characters (reminder: do not use if statements or loops). At the end of the now truncated word, add a number that is the percentage that the length of the truncated word is of the length of the original word; this percentage should be rounded to the closest integer value."
I'll be fine with the percentage part, but I'm not sure how to do the first part.
How do I remove only after the last occurrence of 'a'?
If there is no 'a' how do I cut it off after the first two letters without using an if statement?
I'm assuming it's to be done using string manipulation and various substrings, but I'm not sure how the criteria for the substrings should be defined.
Remember, Java newbie! I don't know a lot of fancy coding techniques yet.
Thank you!

String#toLowerCase - converts the whole String to lower case
String#lastIndexOf will tell you where the last occurrence of the specified String occurs; it returns -1 if there is no occurrence, which is important.
String#substring will allow you to generate a new String from a section of the current String
Math#max, Math#min

Given String input, consider the following as a possible starting point:
int indexOfSmallA = input.lastIndexOf('a');
int indexOfBigA = input.lastIndexOf('A');
int endIndex = Math.max(indexOfSmallA, indexOfBigA);
// if not found, cut at 2 or at the end of input, else cut just after the last 'a'
endIndex = (endIndex == -1) ? Math.min(2, input.length()) : endIndex + 1;
// keep everything up to the cut; substring(endIndex) would return the removed tail instead
String result = input.substring(0, endIndex);

To find the last occurrence of 'a' or 'A' you can use...
int index = Math.max(str.lastIndexOf('a'), str.lastIndexOf('A'));
index = (index == -1) ? Math.min(2, str.length()) : index + 1;
Once you have the index, you can use the following to remove the characters after it...
str = str.substring(0, index);
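Putting the pieces together, the whole exercise might be sketched as follows. The assumptions are that the conditional operator is allowed (the assignment only rules out if statements and loops) and that the percentage is appended directly to the truncated word. The names TruncateAtLastA and truncate are hypothetical.

```java
class TruncateAtLastA {
    static String truncate(String word) {
        // Last 'a' in either case; -1 if the word contains none
        int lastA = Math.max(word.lastIndexOf('a'), word.lastIndexOf('A'));
        // No 'a': keep at most the first two characters; otherwise keep through the last 'a'
        int end = (lastA == -1) ? Math.min(2, word.length()) : lastA + 1;
        String kept = word.substring(0, end);
        // Floating-point division, then round to the closest integer
        long percent = Math.round(100.0 * kept.length() / word.length());
        return kept + percent;
    }
}
```

For example, "Alphabet" becomes "Alpha63" (5 of 8 characters is 62.5%, which rounds to 63), and "hello" becomes "he40".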

Efficient Regular Expression for big data, if a String contains a word

I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient for hundreds of keywords that I will search for in thousands of documents.
What can I do to find the keywords efficiently (without falsely matching a longer word that merely contains the keyword)?
For example:
String keyword = "ac";
String document = "..."; // few-page-long file
If I use:
if (document.contains(keyword)) {
    // do something
}
it will also return true if the document contains a word like "account",
so I tried to use a regular expression as follows:
String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
//do something
}
Summary:
Hopefully this will be useful to someone else:
My regular expression would work, but it is extremely impractical when working with big data (it didn't terminate).
@anubhava perfected the regular expression. It was easy to understand and implement, and it managed to terminate, which is a big thing, but it was still a bit slow (roughly 240 seconds).
@Tomalak's solution is a bit complex to implement and understand, but it was the fastest solution, so hats off, mate (18 seconds).
So @Tomalak's solution was ~15 times faster than @anubhava's.

I don't think you need .* in your regex.
Try this regex:
String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";
Here \\b is used for word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.
You must also use Pattern.quote if your keyword contains special regex characters.
EDIT: You might use this regex if your keywords are separated by space.
String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
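As a rough sketch of how the word-boundary pattern might be used when the same keyword is checked against many documents, the Pattern can be compiled once and reused. The class WholeWordRegex and its method names are illustrative and not part of the answer.

```java
import java.util.regex.Pattern;

class WholeWordRegex {
    // Compile once per keyword; reuse the Pattern for every document.
    static Pattern wordPattern(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
    }

    static boolean containsWord(Pattern pattern, String document) {
        // find() looks for a match anywhere, so no leading or trailing .* is needed
        return pattern.matcher(document).find();
    }

    static boolean containsWord(String document, String keyword) {
        return containsWord(wordPattern(keyword), document);
    }
}
```

With this, "ac" is found in "the ac unit" but not in "open an account".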

The fastest-possible way to find substrings in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {
    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);
        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }

    public static int indexOfWord(String input, String word) {
        String nonWord = "^\\W?$", before, after;
        int index, before_i, after_i = 0;
        while (true) {
            index = input.indexOf(word, after_i);
            if (index == -1 || word.isEmpty()) break;
            before_i = index - 1;
            after_i = index + word.length();
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
        }
        return -1;
    }
}
This would print:
Hit for "long" at position 44.
This should perform better than a pure regular expressions approach.
Consider whether ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.
I've created a Gist with an alternative implementation that does that.
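The Gist itself is not reproduced here. As a rough sketch of the idea, assuming a "word character" means a letter, digit, or underscore (adjust isWordChar to your own definition), the regex checks can be replaced with Character tests like this. The class name IndexOfWord is made up for illustration.

```java
class IndexOfWord {
    // Returns the index of the first whole-word occurrence of word, or -1.
    static int indexOfWord(String input, String word) {
        if (word.isEmpty()) {
            return -1;
        }
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) {
                return -1;
            }
            int end = index + word.length();
            // The neighbours must be absent (string edge) or non-word characters
            boolean startOk = index == 0 || !isWordChar(input.charAt(index - 1));
            boolean endOk = end == input.length() || !isWordChar(input.charAt(end));
            if (startOk && endOk) {
                return index;
            }
            from = index + 1; // try the next candidate position
        }
    }

    // Assumed definition of a word character; not taken from the answer.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }
}
```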

Word Count no duplicates

Here is my word count program using Java. I need to reprogram this so that something, something; something? something! and something count as one word. That means it should not count the same word twice, regardless of case and punctuation.
import java.util.Scanner;
public class WordCount1
{
    public static void main(String[]args)
    {
        final int Lines=6;
        Scanner in=new Scanner (System.in);
        String paragraph = "";
        System.out.println( "Please input "+ Lines + " lines of text.");
        for (int i=0; i < Lines; i+=1)
        {
            paragraph=paragraph+" "+in.nextLine();
        }
        System.out.println(paragraph);
        String word="";
        int WordCount=0;
        for (int i=0; i<paragraph.length()-1; i+=1)
        {
            if (paragraph.charAt(i) != ' ' || paragraph.charAt(i) !=',' || paragraph.charAt(i) !=';' || paragraph.charAt(i) !=':' )
            {
                word= word + paragraph.charAt(i);
                if(paragraph.charAt(i+1)==' ' || paragraph.charAt(i) ==','|| paragraph.charAt(i) ==';' || paragraph.charAt(i) ==':')
                {
                    WordCount +=1;
                    word="";
                }
            }
        }
        System.out.println("There are "+WordCount +" words ");
    }
}

Since this is homework, here are some hints and advice.
There is a clever little method called String.split that splits a string into parts, using a separator specified as a regular expression. If you use it the right way, this will give you a one line solution to the "word count" problem. (If you've been told not to use split, you can ignore that ... though it is the simple solution that a seasoned Java developer would consider first.)
Format / indent your code properly ... before you show it to other people. If your instructor doesn't deduct marks for this, he / she isn't doing his / her job properly.
Use standard Java naming conventions. The capitalization of Lines is incorrect. It could be LINES for a manifest constant or lines for a variable, but a mixed case name starting with a capital letter should always be a class name.
Be consistent in your use of white space characters around operators (including the assignment operator).
It is a bad idea (and completely unnecessary) to hard-wire the number of lines of input that the user must supply. And you are not dealing with the case where he / she supplies fewer than 6 lines.

You should just remove punctuation and change to a single case before doing further processing. (Be careful with locales and Unicode.)
Once you have broken the input into words, you can count the number of unique words by passing them into a Set and checking the size of the set.

Here You Go. This Works. Just Read The Comments And You Should Be Able To Follow.
import java.util.Arrays;
import java.util.HashSet;
import javax.swing.JOptionPane;
// Program Counts Words In A Sentence. Duplicates Are Not Counted.
public class WordCount
{
    public static void main(String[]args)
    {
        // Initialize Variables
        String sentence = "";
        int wordCount = 1, startingPoint = 0;
        // Prompt User For Sentence
        sentence = JOptionPane.showInputDialog(null, "Please input a sentence.", "Input Information Below", 2);
        // Remove All Punctuations. To Check For More Punctuations Just Add Another Replace Statement.
        sentence = sentence.replace(",", "").replace(".", "").replace("?", "");
        // Convert All Characters To Lowercase - Must Be Done To Compare Upper And Lower Case Words.
        sentence = sentence.toLowerCase();
        // Count The Number Of Words
        for (int i = 0; i < sentence.length(); i++)
            if (sentence.charAt(i) == ' ')
                wordCount++;
        // Initialize Array And A Count That Will Be Used As An Index
        String[] words = new String[wordCount];
        int count = 0;
        // Put Each Word In An Array
        for (int i = 0; i < sentence.length(); i++)
        {
            if (sentence.charAt(i) == ' ')
            {
                words[count] = sentence.substring(startingPoint,i);
                startingPoint = i + 1;
                count++;
            }
        }
        // Put Last Word In Sentence In Array
        words[wordCount - 1] = sentence.substring(startingPoint, sentence.length());
        // Put Array Elements Into A Set. This Will Remove Duplicates
        HashSet<String> wordsInSet = new HashSet<String>(Arrays.asList(words));
        // Format Words In Hash Set To Remove Brackets, And Commas, And Convert To String
        String wordsString = wordsInSet.toString().replace(",", "").replace("[", "").replace("]", "");
        // Print Out None Duplicate Words In Set And Word Count
        JOptionPane.showMessageDialog(null, "Words In Sentence:\n" + wordsString + " \n\n" +
                "Word Count: " + wordsInSet.size(), "Sentence Information", 2);
    }
}

If you know the marks you want to ignore (;, ?, !), you could do a simple String.replace to remove those characters from the word. You may want to use String.startsWith and String.endsWith to help.
Convert your values to lower case for easier matching (String.toLowerCase).
The use of a 'Set' is an excellent idea. If you want to know how many times a particular word appears you could also take advantage of a Map of some kind

You'll need to strip out the punctuation; here's one approach: Translating strings character by character
The above can also be used to normalize the case, although there are probably other utilities for doing so.
Now all of the variations you describe will be converted to the same string, and thus be recognized as such. As pretty much everyone else has suggested, a set would be a good tool for counting the number of distinct words.

Your real problem is that you want a distinct word count, so you should either keep track of which words you have already encountered or delete them from the text entirely.
Let's say you choose the first option and store the words you have already encountered in a List; then you can check against that list whether you have already seen a word.
List<String> encounteredWords = new ArrayList<String>();
// continue here once you have found out what the word is
if (!encounteredWords.contains(word.toLowerCase())) {
    encounteredWords.add(word.toLowerCase());
    wordCount++;
}
But Antimony made an interesting remark as well: he uses the property of a Set to get the distinct word count. By definition a set can never contain duplicates, so if you add the same word again, the set won't grow in size.
Set<String> wordSet = new HashSet<String>();
// continue here once you have found out what the word is
wordSet.add(word.toLowerCase());
// continue here once you have scanned through all the words
return wordSet.size();

remove all punctuations
convert all strings to lowercase OR uppercase
put those strings in a set
get the size of the set
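The four steps above can be sketched in a few lines. This assumes that "punctuation" means the ASCII punctuation matched by \p{Punct} and that words are separated by whitespace. The names DistinctWords and countDistinct are illustrative.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class DistinctWords {
    static int countDistinct(String text) {
        // Steps 1 and 2: strip ASCII punctuation, then normalise the case
        String cleaned = text.replaceAll("\\p{Punct}", "").toLowerCase(Locale.ROOT);
        // Step 3: a Set silently ignores words it already contains
        Set<String> words = new HashSet<>();
        for (String word : cleaned.split("\\s+")) {
            if (!word.isEmpty()) { // split can yield an empty first piece
                words.add(word);
            }
        }
        // Step 4: the size of the set is the number of distinct words
        return words.size();
    }
}
```

Note that stripping punctuation this way also turns "don't" into "dont", which may or may not be what the assignment expects.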

As you parse your input string, store it word by word in a map data structure. Just ensure that "word", "word?" "word!" all are stored with the key "word" in the map, and increment the word's count whenever you have to add to the map.
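A map-based variant might look like the sketch below, under the same assumptions about punctuation and whitespace as a simple cleanup step. The class name WordFrequencies is hypothetical. Map.merge either stores 1 for a new word or adds 1 to the existing count.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class WordFrequencies {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // "word", "word?" and "Word!" all end up under the key "word"
        String cleaned = text.replaceAll("\\p{Punct}", "").toLowerCase(Locale.ROOT);
        for (String word : cleaned.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The number of distinct words is then counts.size(), and counts.get("word") gives how often a particular word appeared.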
