Word Count no duplicates - java

Here is my word count program in Java. I need to rework it so that something, something; something? something! and something count as one word. That means it should not count the same word twice, regardless of case and punctuation.
import java.util.Scanner;
public class WordCount1
{
public static void main(String[]args)
{
final int Lines=6;
Scanner in=new Scanner (System.in);
String paragraph = "";
System.out.println( "Please input "+ Lines + " lines of text.");
for (int i=0; i < Lines; i+=1)
{
paragraph=paragraph+" "+in.nextLine();
}
System.out.println(paragraph);
String word="";
int WordCount=0;
for (int i=0; i<paragraph.length()-1; i+=1)
{
if (paragraph.charAt(i) != ' ' || paragraph.charAt(i) !=',' || paragraph.charAt(i) !=';' || paragraph.charAt(i) !=':' )
{
word= word + paragraph.charAt(i);
if(paragraph.charAt(i+1)==' ' || paragraph.charAt(i) ==','|| paragraph.charAt(i) ==';' || paragraph.charAt(i) ==':')
{
WordCount +=1;
word="";
}
}
}
System.out.println("There are "+WordCount +" words ");
}
}

Since this is homework, here are some hints and advice.
There is a clever little method called String.split that splits a string into parts, using a separator specified as a regular expression. If you use it the right way, this will give you a one line solution to the "word count" problem. (If you've been told not to use split, you can ignore that ... though it is the simple solution that a seasoned Java developer would consider first.)
Format / indent your code properly ... before you show it to other people. If your instructor doesn't deduct marks for this, he / she isn't doing his / her job properly.
Use standard Java naming conventions. The capitalization of Lines is incorrect. It could be LINES for a manifest constant or lines for a variable, but a mixed case name starting with a capital letter should always be a class name.
Be consistent in your use of white space characters around operators (including the assignment operator).
It is a bad idea (and completely unnecessary) to hard wire the number of lines of input that the user must supply. And you are not dealing with the case where he / she supplies fewer than 6 lines.

You should just remove punctuation and convert everything to a single case before doing further processing. (Be careful with locales and Unicode.)
Once you have broken the input into words, you can count the number of unique words by passing them into a Set and checking the size of the set.
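A minimal sketch of that idea, assuming a "word" is any run of Unicode letters or digits (so an apostrophe splits "don't" in two) and that Locale.ROOT lower-casing is acceptable. The class and method names are made up for illustration:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class UniqueWordCount {
    // Counts distinct words, ignoring case and punctuation.
    static int countUnique(String text) {
        Set<String> words = new HashSet<>();
        // Split on runs of anything that is not a letter or a digit.
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) { // a leading separator yields one empty token
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words.size();
    }

    public static void main(String[] args) {
        // The two distinct words are "something" and "and", so this prints 2.
        System.out.println(countUnique("something, something; something? Something! and something"));
    }
}
```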

Here You Go. This Works. Just Read The Comments And You Should Be Able To Follow.
import java.util.Arrays;
import java.util.HashSet;
import javax.swing.JOptionPane;
// Program Counts Words In A Sentence. Duplicates Are Not Counted.
public class WordCount
{
public static void main(String[]args)
{
// Initialize Variables
String sentence = "";
int wordCount = 1, startingPoint = 0;
// Prompt User For Sentence
sentence = JOptionPane.showInputDialog(null, "Please input a sentence.", "Input Information Below", 2);
// Remove All Punctuations. To Check For More Punctuations Just Add Another Replace Statement.
sentence = sentence.replace(",", "").replace(".", "").replace("?", "").replace(";", "").replace("!", "").replace(":", "");
// Convert All Characters To Lowercase - Must Be Done To Compare Upper And Lower Case Words.
sentence = sentence.toLowerCase();
// Count The Number Of Words
for (int i = 0; i < sentence.length(); i++)
if (sentence.charAt(i) == ' ')
wordCount++;
// Initialize Array And A Count That Will Be Used As An Index
String[] words = new String[wordCount];
int count = 0;
// Put Each Word In An Array
for (int i = 0; i < sentence.length(); i++)
{
if (sentence.charAt(i) == ' ')
{
words[count] = sentence.substring(startingPoint,i);
startingPoint = i + 1;
count++;
}
}
// Put Last Word In Sentence In Array
words[wordCount - 1] = sentence.substring(startingPoint, sentence.length());
// Put Array Elements Into A Set. This Will Remove Duplicates
HashSet<String> wordsInSet = new HashSet<String>(Arrays.asList(words));
// Format Words In Hash Set To Remove Brackets, And Commas, And Convert To String
String wordsString = wordsInSet.toString().replace(",", "").replace("[", "").replace("]", "");
// Print Out None Duplicate Words In Set And Word Count
JOptionPane.showMessageDialog(null, "Words In Sentence:\n" + wordsString + " \n\n" +
"Word Count: " + wordsInSet.size(), "Sentence Information", 2);
}
}

If you know the marks you want to ignore (;, ?, !), you could do a simple String.replace to remove those characters from the word. You may want to use String.startsWith and String.endsWith to help.
Convert your values to lower case for easier matching (String.toLowerCase).
The use of a Set is an excellent idea. If you want to know how many times a particular word appears, you could also take advantage of a Map of some kind.

You'll need to strip out the punctuation; here's one approach: Translating strings character by character
The above can also be used to normalize the case, although there are probably other utilities for doing so.
Now all of the variations you describe will be converted to the same string, and thus be recognized as such. As pretty much everyone else has suggested, a set would be a good tool for counting the number of distinct words.

Your real problem is that you want a distinct word count, so you should either keep track of which words you have already encountered, or delete them from the text entirely.
Let's say that you choose the first option and store the words you have already encountered in a List; then you can check against that list whether you have already seen that word.
List<String> encounteredWords = new ArrayList<String>();
// continue here once you have found out what the word is
if (!encounteredWords.contains(word.toLowerCase())) {
    encounteredWords.add(word.toLowerCase());
    wordCount++;
}
But Antimony made an interesting remark as well: he uses the property of a Set to get the distinct word count. By definition a set can never contain duplicates, so if you just add more of the same word, the set won't grow in size.
Set<String> wordSet = new HashSet<String>();
// continue here once you have found out what the word is
wordSet.add(word.toLowerCase());
// continue here once you have scanned through all the words
return wordSet.size();

remove all punctuations
convert all strings to lowercase OR uppercase
put those strings in a set
get the size of the set

As you parse your input string, store it word by word in a map data structure. Just ensure that "word", "word?" "word!" all are stored with the key "word" in the map, and increment the word's count whenever you have to add to the map.
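One way this could look, as a sketch only: the names are hypothetical, and it assumes that stripping everything except letters and digits is an acceptable way to normalize the key. A TreeMap is used purely so that the output order is predictable.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordFrequency {
    // Maps each normalized word to the number of times it appears.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // produced when the text starts with a separator
            }
            // "word", "word?" and "Word!" all end up under the key "word".
            counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("word, Word? word! other")); // prints {other=1, word=3}
    }
}
```

The size of the returned map is the distinct word count, so this covers both questions at once.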

Related

Java change full name to initial. last name

I have a database of player names that I need converted so that I can work with them further (for example: I need Antonio Brown converted to A. Brown). My problem is that there are also names that consist only of a first name (for example Antonio). Therefore I get an ArrayIndexOutOfBoundsException: 1. Is there another way to get what I want, and why does it still split even with the if condition?
if(spalte[1].contains(" ")){
String[] me = spalte[0].split(" ", 2);
String na = me[0].substring(0);
name = na + ". " + me[1];
} else {
name = spalte[1];
}
Firstly, I highly recommend that you keep your code formatted and your variables named properly. It not only helps others to understand a snippet better but also makes debugging a bit easier.
While working with arrays and String::split, you have to be careful with indices because they can easily go out of bounds.
Do you need to make the code handle multiple spaces: Antonio Light Brown -> A. L. Brown? The steps are simple and practically the same for any number of names:
Split by a space delimiter
Shorten the n-1 first partitions
Concatenate the String back
Here is the code:
String[] split = name.trim().split("\\s+"); // Trim the ends and split on runs of whitespace to avoid empty parts
StringBuilder sb = new StringBuilder(); // StringBuilder builds the String
for (int i = 0; i < split.length; i++) { // Iterate the parts
    if (i < split.length - 1) { // If not the last part
        sb.append(split[i].charAt(0)).append(". "); // Append the first letter and a dot
    } else {
        sb.append(split[i]); // Or else keep the entire word
    }
}
System.out.println(sb.toString()); // StringBuilder::toString returns a composed String
Hypothetically: How would you handle names such as O'Neil or de Anthony? You can include the conditional concatenation in the for-loop.

Capitalize each first letter of each word in a String that is wrapped in asterisks from markdown. Allow for multiple words in asterisks

I have a Spring blog. In the Post model, I wrote a method that capitalizes each first letter of each word in a Post's title. This works fine. However, the input field when creating the title allows for bold and italic options via a markdown editor, which then wraps words in asterisks. This is where the issues arise.
A bold or italicized word gets capitalized as long as it's the only word wrapped in asterisks. But, if two or more words are conveniently wrapped all together, like a book or movie title that has spaces in between words, it breaks and says "java.lang.StringIndexOutOfBoundsException: String index out of range:"
In the if statements, I've tried using word.charAt(i) == ' ' to check if there's a space, but I can't figure it out: sometimes it capitalizes the second word after the space, like "word Word", but then the first word gets neglected.
I simply want to capitalize every word so that
italics: *word word*
bold: **word word**
both: ***word word***
returns Word Word in italics, in bold, or in both, respectively.
Is this even a good approach? Any help is greatly appreciated! Thank you in advance.
public String makeTitleUppercase(String title) {
StringBuffer sb = new StringBuffer();
String[] sentence = title.split(" ");
for (String word : sentence) {
char[] letters = word.trim().toCharArray();
//Capitalize each first letter of each word (works):
letters[0] = Character.toUpperCase(letters[0]);
//Capitalizing bold and italicized markdown (issues):
for (int i = 0; i < letters.length; i++) {
//word.charAt(i) == ' ' where???
// *italics*:
if (word.charAt(i) == '*') {
letters[1] = Character.toUpperCase(letters[1]);
//**bold**:
if (word.charAt(i + 1) == '*') {
letters[2] = Character.toUpperCase(letters[2]);
}
//***both***
if (word.charAt(i + 2) == '*') {
letters[3] = Character.toUpperCase(letters[3]);
}
break;
}
}
word = new String(letters);
sb.append(word).append(" ");
System.out.println("get here");
}
return sb.toString().trim();
}
You've gotten yourself a bit into trouble by focusing on the details of the markdown formatting. Note in particular that you scan each word for asterisks regardless of whether you've already successfully capitalized the first letter; this results in your exception when between one and three asterisks appear at the end of the word.
You should generalize your characterization a bit, so that the formatting is not handled as a special case. For instance, the rule you want might be "in each whitespace-delimited word, skip any asterisks and capitalize the next character, if any". After the tokenization, that might look like this:
char[] letters = word.toCharArray(); // no need to trim()
for (int i = 0; i < letters.length; i++) {
if (letters[i] != '*') {
// Capitalize the first non-asterisk (even if that doesn't change it)
letters[i] = Character.toUpperCase(letters[i]);
// No need to look any further
break;
}
}
// That's it for capitalizing!
Special cases complicate your reasoning. Sometimes they cannot be avoided, but when you have the choice to just be more general, it's usually a win.
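Put together with the tokenization, the whole method might look like the sketch below. The class and method names are hypothetical, and it assumes that collapsing runs of whitespace into a single space is acceptable for a title:

```java
class TitleCase {
    // Capitalizes the first non-asterisk character of every whitespace-delimited word.
    static String capitalizeWords(String title) {
        StringBuilder sb = new StringBuilder();
        for (String word : title.trim().split("\\s+")) {
            char[] letters = word.toCharArray();
            for (int i = 0; i < letters.length; i++) {
                if (letters[i] != '*') {
                    letters[i] = Character.toUpperCase(letters[i]);
                    break; // only the first non-asterisk character is touched
                }
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(letters);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("***word word***")); // prints ***Word Word***
    }
}
```

A token made only of asterisks, or an empty title, passes through unchanged because the inner loop never finds a character to capitalize.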

Character occurrence in a txt file java

I'm writing a character occurrence counter in a txt file. I keep getting a result of 0 for my count when I run this:
public double charPercent(String letter) {
Scanner inputFile = new Scanner(theText);
int charInText = 0;
int count = 0;
// counts all of the user identified character
while(inputFile.hasNext()) {
if (inputFile.next() == letter) {
count += count;
}
}
return count;
}
Anyone see where I am going wrong?
This is because Scanner.next() returns entire words rather than characters. This means that the string it returns will rarely be the same as the single-letter parameter (except for cases where the word is a single letter such as 'I' or 'A'). I also don't see the need for this line:
int charInText = 0;
as the variable is not being used.
Instead you could try something like this:
public double charPercent(String letter) {
    Scanner inputFile = new Scanner(theText);
    int totalCount = 0;
    while (inputFile.hasNext()) {
        // Read each word once; calling next() twice would consume two words
        String word = inputFile.next();
        // Difference of the word's length with and without the given letter
        int occurrencesInWord = word.length() - word.replace(letter, "").length();
        totalCount += occurrencesInWord;
    }
    return totalCount;
}
By using the difference between the length of the word at inputFile.next() with and without the letter, you will know the number of times the letter occurs in that specific word. This is added to the total count and repeated for all words in the txt.
Use inputFile.next().equals(letter) instead of inputFile.next() == letter.
This is because == compares references. You should compare the contents of the String objects, so use String's equals().
And, as said in the comments, change count += count to count += 1 or count++.
Read here for more explanation.
Do you mean to compare the entire next word to your desired letter?
inputFile.next() will return the next String, delimited by whitespace (tab, enter, spacebar). Unless your file only contains singular letters all separated by spaces, your code won't be able to find all the occurrences of letters in those words.
You might want to try calling inputFile.next() to get the next String, and then breaking that String down into a charArray. From there, you can iterate through the charArray (think for loops) to find the desired character. As a commenter mentioned, you don't want to use == to compare two Strings, but you can use it to compare two characters. If the character from the charArray of your String matches your desired character, then try count++ to increment your counter by 1.
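A small sketch of that suggestion. It takes the text as a String parameter instead of the theText field from the question, and the class and method names are invented for the example:

```java
import java.util.Scanner;

class CharCounter {
    // Counts how often target occurs in text, word by word.
    static int countChar(String text, char target) {
        int count = 0;
        try (Scanner words = new Scanner(text)) {
            while (words.hasNext()) {
                // Break the word into characters; comparing chars with == is fine.
                for (char c : words.next().toCharArray()) {
                    if (c == target) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The comparison is case sensitive, so the leading 'A' is not counted: prints 8.
        System.out.println(countChar("A banana and a bandana", 'a'));
    }
}
```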

Counting the occurrences of string in Java using string.split()

I'm new to Java. I thought I would write a program to count the occurrences of a character or a sequence of characters in a sentence. I wrote the following code. But I then saw there are some ready-made options available in Apache Commons.
Anyway, can you look at my code and say if there is any rookie mistake? I tested it for a couple of cases and it worked fine. I can think of one case where if the input is a big text file instead of a small sentence/paragraph, the split() function may end up being problematic since it has to handle a large variable. However this is my guess and would love to have your opinions.
private static void countCharInString() {
//Get the sentence and the search keyword
System.out.println("Enter a sentence\n");
Scanner in = new Scanner(System.in);
String inputSentence = in.nextLine();
System.out.println("\nEnter the character to search for\n");
String checkChar = in.nextLine();
in.close();
//Count the number of occurrences
String[] splitSentence = inputSentence.split(checkChar);
int countChar = splitSentence.length - 1;
System.out.println("\nThe character/sequence of characters '" + checkChar + "' appear(s) '" + countChar + "' time(s).");
}
Thank you :)
Because of edge cases, split() is the wrong approach.
Instead, use replaceAll() to remove all other characters then use the length() of what's left to calculate the count:
int count = input.replaceAll(".*?(" + check + "|$)", "$1").length() / check.length();
FYI, the regex created (for example when check = 'xyz') looks like ".*?(xyz|$)", which means "everything up to and including 'xyz' or end of input", and each match is replaced by the captured text (either 'xyz' or nothing if it's end of input). This leaves just a string of 0-n copies of the check string. Then dividing by the length of check gives you the total.
To protect against the check being null or zero-length (causing a divide-by-zero error), code defensively like this:
int count = check == null || check.isEmpty() ? 0 : input.replaceAll(".*?(" + check + "|$)", "$1").length() / check.length();
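Note that the one-liner concatenates check straight into the pattern, so a search string containing regex metacharacters such as . or ( changes what gets matched. The sketch below is a different technique from the replaceAll trick: it quotes the search string with Pattern.quote and counts non-overlapping matches with Matcher.find(). The class and method names are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OccurrenceCount {
    // Counts non-overlapping occurrences of check in input, treating check literally.
    static int count(String input, String check) {
        if (input == null || check == null || check.isEmpty()) {
            return 0;
        }
        Matcher m = Pattern.compile(Pattern.quote(check)).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("a.b.c", "."));       // prints 2, the dot is literal
        System.out.println(count("onlyme", "onlyme")); // prints 1
    }
}
```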
A flaw that I can immediately think of: if your inputSentence consists only of a single occurrence of checkChar, split() will return an empty array and your count will be -1 instead of 1.
An example interaction:
Enter a sentence
onlyme
Enter the character to search for
onlyme
The character/sequence of characters 'onlyme' appear(s) '-1' time(s).
A better way would be to use the .indexOf() method of String to count the occurrences like this:
int count = 0;
int i = 0;
// Assumes checkChar is not empty; an empty string would never advance i
while ((i = inputSentence.indexOf(checkChar, i)) != -1) {
    count++;
    i = i + checkChar.length();
}
split is the wrong approach for a number of reasons:
String.split takes a regular expression
Regular expressions have characters with special meanings, so you cannot use it for all characters (without escaping them). This requires an escaping function.
Performance: String.split is optimized for single characters. If this were not the case, you would be creating and compiling a regular expression every time. Still, String.split creates one object for the String[] and one object for each String in it, every time that you call it. And you have no use for these objects; all you want to know is the count. Although a future all-knowing HotSpot compiler might be able to optimize that away, the current one does not - it is roughly 10 times as slow as simply counting characters as below.
It will not count correctly if you have repeating instances of your checkChar.
A better approach is much simpler: just go and count the characters in the string that match your checkChar. If you think about the steps you need to take to count characters, that's what you'd end up with by yourself:
public static int occurrences(String str, char checkChar) {
int count = 0;
for (int i = 0, l = str.length(); i < l; i++) {
if (str.charAt(i) == checkChar)
count++;
}
return count;
}
If you want to count the occurrences of a multi-character string, it becomes slightly trickier to write with some efficiency because you don't want to create a new substring every time.
public static int occurrences(String str, String checkChars) {
int count = 0;
int offset = 0;
while ((offset = str.indexOf(checkChars, offset)) != -1) {
offset += checkChars.length();
count++;
}
return count;
}
That's still 10-12 times as fast as String.split() when matching a two-character string.
Warning: Performance timings are ballpark figures that depend on many circumstances. Since the difference is an order of magnitude, it's safe to say that String.split is slower in general. (Tests performed on jdk 1.8.0-b28 64-bit, using 10 million iterations, verified that results were stable and the same with and without -Xcomp, after performing tests 10 times in same JVM instances.)

Checking if a character is an integer or letter

I am modifying a file using Java. Here's what I want to accomplish:
if an & symbol, along with an integer, is detected while being read, I want to drop the & symbol and translate the integer to binary.
if an & symbol, along with a (random) word, is detected while being read, I want to drop the & symbol and replace the word with the integer 16, and if a different string of characters is being used along with the & symbol, I want to set the number 1 higher than integer 16.
Here's an example of what I mean. If a file is inputted containing these strings:
&myword
&4
&anotherword
&9
&yetanotherword
&10
&myword
The output should be:
&0000000000010000 (which is 16 in decimal)
&0000000000000100 (or the number '4' in decimal)
&0000000000010001 (which is 17 in decimal, since 16 is already used, so 16+1=17)
&0000000000000101 (or the number '9' in decimal)
&0000000000010001 (which is 18 in decimal, or 17+1=18)
&0000000000000110 (or the number '10' in decimal)
&0000000000010000 (which is 16 because value of myword = 16)
Here's what I tried so far, but haven't succeeded yet:
for (i=0; i<anyLines.length; i++) {
char[] charray = anyLines[i].toCharArray();
for (int j=0; j<charray.length; j++)
if (Character.isDigit(charray[j])) {
anyLines[i] = anyLines[i].replace("&","");
anyLines[i] = Integer.toBinaryString(Integer.parseInt(anyLines[i]);
}
else {
continue;
}
if (Character.isLetter(charray[j])) {
anyLines[i] = anyLines[i].replace("&","");
for (int k=16; j<charray.length; k++) {
anyLines[i] = Integer.toBinaryString(Integer.parseInt(k);
}
}
}
}
I hope that I am articulate enough. Any suggestions on how to accomplish this task?
Character.isLetter() //tests to see if it is a letter
Character.isDigit() //tests to see if it is a digit
It looks like something you could match against a regex. I don't know Java but you should have at least one regex engine at your disposal. Then the regex would be:
regex1: &(\d+)
and
regex2: &(\w+)
or
regex3: &(\d+|\w+)
In the first case, if regex1 matches, you know you ran into a number, and that number is in the first capturing group (e.g. match.group(1)). If regex2 matches, you know you have a word. You can then look up that word in a dictionary and see what its associated number is or, if it is not present, add it to the dictionary and associate it with the next free number (16 + dictionary size, taken before the word is added).
regex3 on the other hand will match both numbers and words, so it's up to you to see what's in the capturing group (it's just a different approach).
If neither regex matches, then you have an invalid sequence, or you need some other action. Note that \w in a regex only matches word characters (i.e. letters, digits and _; in Java these are ASCII only by default), so &çSomeWord or &*SomeWord won't match at all, while the captured group in &Hello.World would be just "Hello". Because \w also matches digits, test regex1 before regex2.
Regex libs usually provide a length for the matched text, so you can move i forward by that much in order to skip already matched text.
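In Java, the regex approach could be sketched with java.util.regex as below. This is only an illustration under a few assumptions: it follows the regex3 idea with a single capturing group, then checks whether the group is all digits. Words are numbered from 16 as in the question, values are padded to 16 binary digits, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmpersandTranslator {
    // '&' followed by a run of word characters (ASCII letters, digits, underscore).
    private static final Pattern TOKEN = Pattern.compile("&(\\w+)");
    private static final int FIRST_WORD_VALUE = 16;

    // Words seen so far and the number assigned to each.
    private final Map<String, Integer> symbols = new HashMap<>();

    String translate(String line) {
        Matcher m = TOKEN.matcher(line);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String token = m.group(1);
            int value;
            if (token.matches("\\d+")) {
                value = Integer.parseInt(token); // throws for numbers beyond int range
            } else {
                Integer known = symbols.get(token);
                if (known == null) {
                    known = FIRST_WORD_VALUE + symbols.size(); // next free number
                    symbols.put(token, known);
                }
                value = known;
            }
            // Copy the text before the match, then the zero-padded binary value.
            out.append(line, last, m.start()).append('&').append(toBinary16(value));
            last = m.end(); // skip past the text that was just matched
        }
        return out.append(line, last, line.length()).toString();
    }

    private static String toBinary16(int value) {
        return String.format("%16s", Integer.toBinaryString(value)).replace(' ', '0');
    }
}
```

Calling translate on "&myword", "&4", "&anotherword" and then "&myword" again on the same instance yields the binary forms of 16, 4, 17 and 16. Text that does not match the pattern is copied through unchanged.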
You have to somehow tokenize your input. It seems you are splitting it into lines and then analyzing each line individually. If this is what you want, okay. If not, you could simply search for & (indexOf('&')) and then somehow determine what the next token is (either a number or a "word", however you want to define word).
What do you want to do with input which does not match your pattern? Neither the description of the task nor the example really covers this.
You need to have a dictionary of already read strings. Use a Map<String, Integer>.
I would post this as a comment, but don't have the ability yet. What is the issue you are running into? Error? Incorrect Results? 16's not being correctly incremented? Also, the examples use a '%' but in your description you say it should start with a '&'.
Edit2: Was thinking it was line by line, but re-reading indicates you could be trying to find say "I went to the &store" and want it to say "I went to the &000010000". So you would want to split by whitespace and then iterate through and pass the strings into your 'replace' method, which is similar to below.
Edit1: If I understand what you are trying to do, code like this should work.
Map<String, Integer> usedWords = new HashMap<String, Integer>();
List<String> output = new ArrayList<String>();
int wordIncrementer = 16;
String[] arr = test.split("\n");
for(String s : arr)
{
if(s.startsWith("&"))
{
String line = s.substring(1).trim(); //Removes &
try
{
Integer lineInt = Integer.parseInt(line);
output.add("&" + Integer.toBinaryString(lineInt));
}
catch(Exception e)
{
System.out.println("Line was not an integer. Parsing as a String.");
String outputString = "&";
if(usedWords.containsKey(line))
{
outputString += Integer.toBinaryString(usedWords.get(line));
}
else
{
outputString += Integer.toBinaryString(wordIncrementer);
usedWords.put(line, wordIncrementer++);
}
output.add(outputString);
}
}
else
{
continue; //Nothing indicating that we should parse the line.
}
}
How about this?
String input = "&myword\n&4\n&anotherword\n&9\n&yetanotherword\n&10\n&myword";
String[] lines = input.split("\n");
int wordValue = 16;
// to keep track of words that are already used
Map<String, Integer> wordValueMap = new HashMap<String, Integer>();
for (String line : lines) {
// if line doesn't begin with &, then ignore it
if (!line.startsWith("&")) {
continue;
}
// remove &
line = line.substring(1);
Integer binaryValue = null;
if (line.matches("\\d+")) {
binaryValue = Integer.parseInt(line);
}
else if (line.matches("\\w+")) {
binaryValue = wordValueMap.get(line);
// if the map doesn't contain the word value, then assign and store it
if (binaryValue == null) {
binaryValue = wordValue;
wordValueMap.put(line, binaryValue);
wordValue++;
}
}
    // skip lines that are neither a number nor a word
    if (binaryValue == null) {
        continue;
    }
    // I'm using Commons Lang's StringUtils.leftPad(..) to create the zero padded string
    String out = "&" + StringUtils.leftPad(Integer.toBinaryString(binaryValue), 16, "0");
    System.out.println(out);
}
Here's the printout:
&0000000000010000
&0000000000000100
&0000000000010001
&0000000000001001
&0000000000010010
&0000000000001010
&0000000000010000
Just FYI, the binary value for 10 is "1010", not "110" as stated in your original post.
