I am modifying a file using Java. Here's what I want to accomplish:
if an & symbol, along with an integer, is detected while being read, I want to drop the & symbol and translate the integer to binary.
if an & symbol, along with an arbitrary word, is detected while being read, I want to drop the & symbol and replace the word with the integer 16. Each new, different word used with the & symbol should get the next higher number (17, 18, and so on), while a word that was seen before keeps the value it was given.
Here's an example of what I mean. If a file is inputted containing these strings:
&myword
&4
&anotherword
&9
&yetanotherword
&10
&myword
The output should be:
&0000000000010000 (which is 16 in decimal)
&0000000000000100 (or the number '4' in decimal)
&0000000000010001 (which is 17 in decimal, since 16 is already used, so 16+1=17)
&0000000000000101 (or the number '9' in decimal)
&0000000000010001 (which is 18 in decimal, or 17+1=18)
&0000000000000110 (or the number '10' in decimal)
&0000000000010000 (which is 16 because value of myword = 16)
Here's what I tried so far, but haven't succeeded yet:
for (i=0; i<anyLines.length; i++) {
char[] charray = anyLines[i].toCharArray();
for (int j=0; j<charray.length; j++)
if (Character.isDigit(charray[j])) {
anyLines[i] = anyLines[i].replace("&","");
anyLines[i] = Integer.toBinaryString(Integer.parseInt(anyLines[i]);
}
else {
continue;
}
if (Character.isLetter(charray[j])) {
anyLines[i] = anyLines[i].replace("&","");
for (int k=16; j<charray.length; k++) {
anyLines[i] = Integer.toBinaryString(Integer.parseInt(k);
}
}
}
}
I hope that I am articulate enough. Any suggestions on how to accomplish this task?
Character.isLetter() //tests to see if it is a letter
Character.isDigit() //tests to see if it is a digit
It looks like something you could match against a regex. I don't know Java but you should have at least one regex engine at your disposal. Then the regex would be:
regex1: &(\d+)
and
regex2: &(\w+)
or
regex3: &(\d+|\w+)
In the first case, if regex1 matches, you know you ran into a number, and that number is in the first capturing group (e.g. match.group(1)). If regex2 matches, you know you have a word. Since \w also matches digits, try regex1 first. You can then look that word up in a dictionary and see what its associated number is or, if it is not present, add it to the dictionary and associate it with the next free number (16 + dictionary size, measured before the word is added).
regex3 on the other hand will match both numbers and words, so it's up to you to see what's in the capturing group (it's just a different approach).
If neither of the regexes matches, then you have an invalid sequence, or you need some other action. Note that \w in a regex only matches word characters (i.e. letters, digits, _ and, depending on the engine, possibly a few other characters), so &çSomeWord or &*SomeWord won't match at all, while the captured group in &Hello.World would be just "Hello".
Regex libs usually provide a length for the matched text, so you can move i forward by that much in order to skip already matched text.
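For what it's worth, here is a rough Java sketch of the regex3 approach combined with a dictionary. The class and method names (AmpersandTranslator, translate, toBinary16) are made up for illustration, and it assumes one token per line, 16-bit zero-padded output as in the question's example, and that lines which don't match are passed through unchanged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmpersandTranslator {
    // regex3 from above: the digit alternative comes first so "&4" is read as a number.
    private static final Pattern TOKEN = Pattern.compile("&(\\d+|\\w+)");
    private static final int FIRST_WORD_VALUE = 16;

    static List<String> translate(List<String> lines) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TOKEN.matcher(line.trim());
            if (!m.matches()) {
                out.add(line); // assumption: leave non-matching lines untouched
                continue;
            }
            String token = m.group(1);
            int value;
            if (token.matches("\\d+")) {
                // assumption: the number fits in an int
                value = Integer.parseInt(token);
            } else {
                Integer known = dictionary.get(token);
                if (known == null) {
                    // next free number: 16 + number of words seen so far
                    known = FIRST_WORD_VALUE + dictionary.size();
                    dictionary.put(token, known);
                }
                value = known;
            }
            out.add("&" + toBinary16(value));
        }
        return out;
    }

    // Left-pads the binary form with zeros to at least 16 characters.
    static String toBinary16(int value) {
        String bits = Integer.toBinaryString(value);
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < 16; i++) {
            sb.append('0');
        }
        return sb.append(bits).toString();
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        for (String s : "&myword\n&4\n&anotherword\n&9\n&yetanotherword\n&10\n&myword".split("\n")) {
            input.add(s);
        }
        for (String s : translate(input)) {
            System.out.println(s);
        }
    }
}
```

Using the dictionary size to derive the next number avoids keeping a separate counter in sync with the map.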
You have to somehow tokenize your input. It seems you are splitting it into lines and then analyzing each line individually. If this is what you want, okay. If not, you could simply search for & (indexOf('&')) and then somehow determine what the next token is (either a number or a "word", however you want to define word).
What do you want to do with input which does not match your pattern? Neither the description of the task nor the example really covers this.
You need to have a dictionary of already read strings. Use a Map<String, Integer>.
I would post this as a comment, but don't have the ability yet. What is the issue you are running into? Error? Incorrect Results? 16's not being correctly incremented? Also, the examples use a '%' but in your description you say it should start with a '&'.
Edit2: Was thinking it was line by line, but re-reading indicates you could be trying to find say "I went to the &store" and want it to say "I went to the &000010000". So you would want to split by whitespace and then iterate through and pass the strings into your 'replace' method, which is similar to below.
Edit1: If I understand what you are trying to do, code like this should work.
Map<String, Integer> usedWords = new HashMap<String, Integer>();
List<String> output = new ArrayList<String>();
int wordIncrementer = 16;
String[] arr = test.split("\n");
for(String s : arr)
{
if(s.startsWith("&"))
{
String line = s.substring(1).trim(); //Removes &
try
{
Integer lineInt = Integer.parseInt(line);
output.add("&" + Integer.toBinaryString(lineInt));
}
catch(Exception e)
{
System.out.println("Line was not an integer. Parsing as a String.");
String outputString = "&";
if(usedWords.containsKey(line))
{
outputString += Integer.toBinaryString(usedWords.get(line));
}
else
{
outputString += Integer.toBinaryString(wordIncrementer);
usedWords.put(line, wordIncrementer++);
}
output.add(outputString);
}
}
else
{
continue; //Nothing indicating that we should parse the line.
}
}
How about this?
String input = "&myword\n&4\n&anotherword\n&9\n&yetanotherword\n&10\n&myword";
String[] lines = input.split("\n");
int wordValue = 16;
// to keep track words that are already used
Map<String, Integer> wordValueMap = new HashMap<String, Integer>();
for (String line : lines) {
// if line doesn't begin with &, then ignore it
if (!line.startsWith("&")) {
continue;
}
// remove &
line = line.substring(1);
Integer binaryValue = null;
if (line.matches("\\d+")) {
binaryValue = Integer.parseInt(line);
}
else if (line.matches("\\w+")) {
binaryValue = wordValueMap.get(line);
// if the map doesn't contain the word value, then assign and store it
if (binaryValue == null) {
binaryValue = wordValue;
wordValueMap.put(line, binaryValue);
wordValue++;
}
}
// I'm using Commons Lang's StringUtils.leftPad(..) to create the zero padded string
String out = "&" + StringUtils.leftPad(Integer.toBinaryString(binaryValue), 16, "0");
System.out.println(out);
}
Here's the printout:-
&0000000000010000
&0000000000000100
&0000000000010001
&0000000000001001
&0000000000010010
&0000000000001010
&0000000000010000
Just FYI, the binary value for 10 is "1010", not "110" as stated in your original post.
I need to capitalize first letter in every word in the string, BUT it's not so easy as it seems to be as the word is considered to be any sequence of letters, digits, "_" , "-", "`" while all other chars are considered to be separators, i.e. after them the next letter must be capitalized.
Example what program should do:
For input: "#he&llo wo!r^ld"
Output should be: "#He&Llo Wo!R^Ld"
There are questions that sound similar here, but their solutions really don't help.
This one for example:
String output = Arrays.stream(input.split("[\\s&]+"))
.map(t -> t.substring(0, 1).toUpperCase() + t.substring(1))
.collect(Collectors.joining(" "));
As in my task there can be various separators, this solution doesn't work.
It is possible to split a string and keep the delimiters, so taking into account the requirement for delimiters:
word is considered to be any sequence of letters, digits, "_" , "-", "`" while all other chars are considered to be separators
the pattern which keeps the delimiters in the result array would be: "((?<=[^-`\\w])|(?=[^-`\\w]))":
[^-`\\w]: all characters except -, backtick and word characters \w: [A-Za-z0-9_]
Then, the "words" are capitalized, and delimiters are kept as is:
static String capitalize(String input) {
if (null == input || 0 == input.length()) {
return input;
}
return Arrays.stream(input.split("((?<=[^-`\\w])|(?=[^-`\\w]))"))
.map(s -> s.matches("[-`\\w]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
.collect(Collectors.joining(""));
}
Tests:
System.out.println(capitalize("#he&l_lo-wo!r^ld"));
System.out.println(capitalize("#`he`&l+lo wo!r^ld"));
Output:
#He&L_lo-wo!R^Ld
#`he`&L+Lo Wo!R^Ld
Update
If you need to process not only ASCII characters but also other alphabets or character sets (e.g. Cyrillic, Greek, etc.), the class \\p{IsWord} may be used, and matching of Unicode characters needs to be enabled using the pattern flag (?U):
static String capitalizeUnicode(String input) {
if (null == input || 0 == input.length()) {
return input;
}
return Arrays.stream(input.split("(?U)((?<=[^-`\\p{IsWord}])|(?=[^-`\\p{IsWord}]))"))
.map(s -> s.matches("(?U)[-`\\p{IsWord}]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
.collect(Collectors.joining(""));
}
Test:
System.out.println(capitalizeUnicode("#he&l_lo-wo!r^ld"));
System.out.println(capitalizeUnicode("#привет&`ёж`+дос^βιδ/ως"));
Output:
#He&L_lo-wo!R^Ld
#Привет&`ёж`+Дос^Βιδ/Ως
You can't use split that easily - split will eliminate the separators and give you only the things in between. As you need the separators, no can do.
One real dirty trick is to use something called 'lookahead'. That argument you pass to split is a regular expression. Most 'characters' in a regexp have the property that they consume the matching input. If you do input.split("\\s+") then that doesn't 'just' split on whitespace, it also consumes them: The whitespace is no longer part of the individual entries in your string array.
However, consider ^ and $, or \\b. These still match things but don't consume anything. You don't consume 'end of string'. In fact, ^^^hello$$$ matches the string "hello" just as well. You can do this yourself, using lookahead: It matches when the lookahead is there but does not consume it:
String[] args = "Hello World$Huh   Weird".split("(?=[\\s_$-]+)");
for (String arg : args) System.out.println("*" + arg + "*");
Unfortunately, this 'works', in that it saves your separators, but isn't getting you all that much closer to a solution:
*Hello*
* World*
*$Huh*
* *
* *
* Weird*
You can go with lookbehind as well, but it's limited; they don't do variable length, for example.
The conclusion should rapidly become: Actually, doing this with split is a mistake.
Then, once split is off the table, you should no longer use streams, either: Streams don't do well once you need to know stuff about the previous element in a stream to do the job: A stream of characters doesn't work, as you need to know if the previous character was a non-letter or not.
In general, "I want to do X, and use Y" is a mistake. Keep an open mind. It's akin to asking: "I want to butter my toast, and use a hammer to do it". Oookaaaaayyyy, you can probably do that, but, eh, why? There are butter knives right there in the drawer, just.. put down the hammer, that's toast. Not a nail.
Same here.
A simple loop can take care of this, no problem:
private static final String BREAK_CHARS = "&-_`";
public String toTitleCase(String input) {
StringBuilder out = new StringBuilder();
boolean atBreak = true;
for (char c : input.toCharArray()) {
out.append(atBreak ? Character.toUpperCase(c) : c);
atBreak = Character.isWhitespace(c) || (BREAK_CHARS.indexOf(c) > -1);
}
return out.toString();
}
Simple. Efficient. Easy to read. Easy to modify. For example, if you want to go with 'any non-letter counts', trivial: atBreak = !Character.isLetter(c);.
Contrast to the stream solution which is fragile, weird, far less efficient, and requires a regexp that needs half a page's worth of comment for anybody to understand it.
Can you do this with streams? Yes. You can butter toast with a hammer, too. Doesn't make it a good idea though. Put down the hammer!
You can use a simple FSM as you iterate over the characters in the string, with two states, either in a word, or not in a word. If you are not in a word and the next character is a letter, convert it to upper case, otherwise, if it is not a letter or if you are already in a word, simply copy it unmodified.
boolean isWord(int c) {
return c == '`' || c == '_' || c == '-' || Character.isLetter(c) || Character.isDigit(c);
}
String capitalize(String s) {
StringBuilder sb = new StringBuilder();
boolean inWord = false;
for (int c : s.codePoints().toArray()) {
if (!inWord && Character.isLetter(c)) {
sb.appendCodePoint(Character.toUpperCase(c));
} else {
sb.appendCodePoint(c);
}
inWord = isWord(c);
}
return sb.toString();
}
Note: I have used codePoints(), appendCodePoint(int), and int so that characters outside the basic multilingual plane (with code points greater than 64k) are handled correctly.
I need to capitalize first letter in every word
Here is one way to do it. Admittedly this is a mite longer, but your requirement to change the first letter to upper case (not the first digit or first non-letter) required a helper method. Otherwise it would have been easier. Some others seem to have missed this point.
Establish word pattern, and test data.
String wordPattern = "[\\w`-]+";
Pattern p = Pattern.compile(wordPattern);
String[] inputData = { "#he&llo wo!r^ld", "0hel`lo-w0rld" };
Now this simply finds each successive word in the string based on the established regular expression. As each word is found, its first letter is changed to upper case and the word is written back into a StringBuilder at the position where the match was found. Note that the hyphen is placed last in the character class so that it is taken literally rather than as a range operator.
for (String input : inputData) {
    StringBuilder sb = new StringBuilder(input);
    Matcher m = p.matcher(input);
    while (m.find()) {
        // The replacement has the same length as the match, so offsets stay valid.
        sb.replace(m.start(), m.end(),
                upperFirstLetter(m.group()));
    }
    System.out.println(input + " -> " + sb);
}
prints
#he&llo wo!r^ld -> #He&Llo Wo!R^Ld
0hel`lo-w0rld -> 0Hel`lo-w0rld
Words may start with digits, and the requirement was to convert the first letter (not the first character) to upper case, so this method finds the first letter, converts it to upper case and returns the new string. For example, 01_hello would become 01_Hello.
public static String upperFirstLetter(String word) {
char[] chs = word.toCharArray();
for (int i = 0; i < chs.length; i++) {
if (Character.isLetter(chs[i])) {
chs[i] = Character.toUpperCase(chs[i]);
break;
}
}
return String.valueOf(chs);
}
I have inputs like
AS23456SDE
MFD324FR
I need to get First Character values like
AS, MFD
The prefix does not have a fixed length of two or three characters, because the input can vary. I need to get the characters that come before the first digit.
Thank you.
Edit : This is what I have tried.
public static String getPrefix(String serial) {
StringBuilder prefix = new StringBuilder();
for(char c : serial.toCharArray()){
if(Character.isDigit(c)){
break;
}
else{
prefix.append(c);
}
}
return prefix.toString();
}
Here is a nice one line solution. It uses a regex to match the first non numeric characters in the string, and then replaces the input string with this match.
public String getFirstLetters(String input) {
return new String("A" + input).replaceAll("^([^\\d]+)(.*)$", "$1")
.substring(1);
}
System.out.println(getFirstLetters("AS23456SDE"));
System.out.println(getFirstLetters("1AS123"));
Output:
AS
(empty)
A simple solution could be like this:
public static void main (String[]args) {
String str = "MFD324FR";
char[] characters = str.toCharArray();
for(char c : characters){
if(Character.isDigit(c))
break;
else
System.out.print(c);
}
}
Use the following function to get required output
public String getFirstChars(String str){
    int zeroAscii = '0'; int nineAscii = '9';
    String result = "";
    for (int i = 0; i < str.length(); i++){
        int ascii = str.charAt(i);
        if(ascii >= zeroAscii && ascii <= nineAscii){
            // First digit reached: everything collected so far is the prefix
            return result;
        }else{
            result = result + str.charAt(i);
        }
    }
    // No digit found, so the whole string is the prefix
    return str;
}
pass your string as argument
I think this can be done by a simple regex which matches digits and java's string split function. This Regex based approach will be more efficient than the methods using more complicated regexs.
Something as below will work
String inp = "ABC345.";
String beginningChars = inp.split("[\\d]+",2)[0];
System.out.println(beginningChars); // only if you want to print.
The regex I used "[\\d]+" is escaped for java already.
What it does?
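It matches one or more digits (\d). By default, Java's \d matches only the ASCII digits 0-9; with the UNICODE_CHARACTER_CLASS flag (or (?U) embedded in the pattern) it also matches digits from other scripts, such as Arabic-Indic numerals.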
What does String beginningChars = inp.split("[\\d]+",2)[0] do?
It applies this regex and separates the string into string arrays where ever a match is found. The [0] at the end selects the first result from that array, since you wanted the starting chars.
What is the second parameter to .split(regex,int) which I supplied as 2?
This is the Limit parameter. This means that the regex will be applied on the string till 1 match is found. Once 1 match is found the string is not processed anymore.
From the String Javadoc page:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
This will be efficient if your string is huge.
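To make the effect of the limit parameter more concrete, here is a small sketch comparing a few limits on the same input. The class name SplitLimitDemo and the helper prefixBeforeDigits are made up for this illustration, and the pattern is written as [0-9]+ to keep it to ASCII digits.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Returns everything before the first run of ASCII digits.
    static String prefixBeforeDigits(String input) {
        // Limit 2: the pattern is applied at most once, so the rest of the
        // string is left unsplit in the second array element.
        return input.split("[0-9]+", 2)[0];
    }

    public static void main(String[] args) {
        String s = "AB12CD34";
        System.out.println(Arrays.toString(s.split("[0-9]+", 2)));  // [AB, CD34]
        System.out.println(Arrays.toString(s.split("[0-9]+", 0)));  // [AB, CD]
        System.out.println(Arrays.toString(s.split("[0-9]+", -1))); // [AB, CD, ]
        System.out.println(prefixBeforeDigits("MFD324FR"));         // MFD
    }
}
```

Note that a string starting with a digit gives an empty prefix, because the first array element is the empty string before the first match.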
Another possible regex, if you want to be explicit about splitting only on the ASCII digits 0-9:
"[0-9]+"
public static void main(String[] args) {
String testString = "MFD324FR";
int index = 0;
for (Character i : testString.toCharArray()) {
if (Character.isDigit(i))
break;
index++;
}
System.out.println(testString.substring(0, index));
}
this prints the first 'n' characters before it encounters a digit (i.e. integer).
I have automated some flow of filling in a form from a website by taking the fields data from a csv.
Now, for the address there are 3 fields in the form:
Address 1 ____________
Address 2 ____________
Address 3 ____________
Each field has a limit of 35 characters, so whenever I reach 35 characters I continue the address string in the next address field...
Now, the issue is that my current solution splits the string but cuts a word in two when it reaches 35 characters. For instance, if the string contains the word 'barcelona' and its 'o' is the 35th character, then Address 2 will start with 'na'.
In that case I want to detect that the 35th character is in the middle of a word and move the whole word to the next field.
this is my current solution:
private def enterAddress(purchaseInfo: PurchaseInfo) = {
val webElements = driver.findElements(By.className("address")).toList
val strings = purchaseInfo.supplierAddress.grouped(35).toList
strings.zip(webElements).foreach{
case (text, webElement) => webElement.sendKeys(text)
}
}
I would appreciate some help here, preferably with Scala but java will be fine as well :)
Thanks a lot!
Since you said you'd accept Java code as well... the following code will wrap a given input string to several lines of a given maximum length:
import java.util.ArrayList;
import java.util.List;
public class WordWrap {
public static void main(String[] args) {
String input = "This is a rather long address, somewhere in a small street in Barcelona";
List<String> wrappedLines = wrap(input, 35);
for (String line : wrappedLines) {
System.out.println(line);
}
}
private static List<String> wrap(String input, int maxLength) {
String[] words = input.split(" ");
List<String> lines = new ArrayList<String>();
StringBuilder sb = new StringBuilder();
for (String word : words) {
if (sb.length() == 0) {
// Note: Will not work if a *single* word already exceeds maxLength
sb.append(word);
} else if (sb.length() + word.length() < maxLength) {
// Use < maxLength as we add +1 space.
sb.append(" " + word);
} else {
// Line is full
lines.add(sb.toString());
// Restart
sb = new StringBuilder(word);
}
}
// Add the last line
if (sb.length() > 0) {
lines.add(sb.toString());
}
return lines;
}
}
Output:
This is a rather long address,
somewhere in a small street in
Barcelona
This is not necessarily the best approach, but I guess you'll have to adapt it to Scala anyway.
If you prefer a library solution (because... why re-invent the wheel?) you can also have a look at WordUtils.wrap() from Apache Commons.
Words in the English language are delimited by space (or other punctuation, but that is irrelevant in this case unless you actually want to wrap lines based on that), and there are a couple of options for using this to your advantage:
One thing you could potentially do is take a substring of 35 characters from your string, use String.lastIndexOf to figure out where the space is, and add only up to that space to your address line, then repeating the process starting from that space character until you have entered the string.
Another method (showcased in Marvin's answer) is to just use String.split on spaces and concatenate them back together until the next word would cause the string to exceed 35 characters.
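The first option (cutting at the last space within the limit) could look roughly like the sketch below. It is in Java rather than Scala, the class name AddressWrap is invented for the example, and it assumes that a single word longer than the limit may simply be hard-split, which the question does not specify.

```java
import java.util.ArrayList;
import java.util.List;

class AddressWrap {
    static List<String> wrap(String text, int maxLength) {
        List<String> lines = new ArrayList<>();
        String rest = text.trim();
        while (rest.length() > maxLength) {
            // Search backwards from index maxLength, so a space sitting exactly
            // at the limit still yields a line of maxLength characters.
            int cut = rest.lastIndexOf(' ', maxLength);
            if (cut <= 0) {
                // Assumption: a word longer than maxLength is split in the middle.
                cut = maxLength;
            }
            lines.add(rest.substring(0, cut).trim());
            rest = rest.substring(cut).trim();
        }
        if (!rest.isEmpty()) {
            lines.add(rest);
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "This is a rather long address, somewhere in a small street in Barcelona";
        for (String line : wrap(input, 35)) {
            System.out.println(line);
        }
    }
}
```

The resulting list can then be zipped with the three address fields much like the grouped(35) result in the question; what to do when more than three lines come back is left to the caller.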
I am trying to create a method which will either remove all duplicates from a string or only keep the same 2 characters in a row based on a parameter.
For example:
helllllllo -> helo
or
helllllllo -> hello - This keeps double letters
Currently I remove duplicates by doing:
private String removeDuplicates(String word) {
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < word.length(); i++) {
char letter = word.charAt(i);
if (buffer.length() == 0 || letter != buffer.charAt(buffer.length() - 1)) {
buffer.append(letter);
}
}
return buffer.toString();
}
If I want to keep double letters I was thinking of having a method like private String removeDuplicates(String word, boolean doubleLetter)
When doubleLetter is true it will return hello not helo
I'm not sure of the most efficient way to do this without duplicating a lot of code.
why not just use a regex?
public class RemoveDuplicates {
public static void main(String[] args) {
System.out.println(new RemoveDuplicates().result("hellllo", false)); //helo
System.out.println(new RemoveDuplicates().result("hellllo", true)); //hello
}
public String result(String input, boolean doubleLetter){
String pattern = null;
if(doubleLetter) pattern = "(.)(?=\\1{2})";
else pattern = "(.)(?=\\1)";
return input.replaceAll(pattern, "");
}
}
(.) --> matches any character and puts in group 1.
?= --> this is called a positive lookahead.
?=\\1 --> positive lookahead for the first group
So overall, this regex looks for any character that is followed (positive lookahead) by itself. For example aa or bb, etc. It is important to note that only the first character is part of the match actually, so in the word 'hello', only the first l is matched (the part (?=\1) is NOT PART of the match). So the first l is replaced by an empty String and we are left with helo, which does not match the regex
The second pattern is the same thing, but this time we look ahead for TWO occurrences of the first group, for example helllo. On the other hand 'hello' will not be matched.
Look here for a lot more: Regex
P.S. Feel free to accept the answer if it helped.
try
String s = "helllllllo";
System.out.println(s.replaceAll("(\\w)\\1+", "$1"));
output
helo
Taking this previous SO example as a starting point, I came up with this:
String str1= "Heelllllllllllooooooooooo";
String removedRepeated = str1.replaceAll("(\\w)\\1+", "$1");
System.out.println(removedRepeated);
String keepDouble = str1.replaceAll("(\\w)\\1{2,}", "$1");
System.out.println(keepDouble);
It yields:
Helo
Heelo
What it does:
(\\w)\\1+ will match any word character (letter, digit or underscore) and place it in a regex capture group. This group is later referenced through \\1+, meaning that it will match one or more repetitions of that same character.
(\\w)\\1{2,} is the same as above, the only difference being that it looks only for characters which occur three or more times in a row (the character plus at least two repetitions). This leaves the double characters untouched.
EDIT:
Re-read the question and it seems that you want to replace multiple characters by doubles. To do that, simply use this line:
String keepDouble = str1.replaceAll("(\\w)\\1+", "$1$1");
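Putting the two variants behind the boolean parameter the question asked for might look like this. It is only a sketch: the class name RepeatCollapser is made up, and it keeps \\w from the lines above, so repeated punctuation or spaces are left alone.

```java
class RepeatCollapser {
    // Collapses each run of a repeated word character to one copy,
    // or to two copies when doubleLetter is true.
    static String removeDuplicates(String word, boolean doubleLetter) {
        String replacement = doubleLetter ? "$1$1" : "$1";
        return word.replaceAll("(\\w)\\1+", replacement);
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("helllllllo", false)); // helo
        System.out.println(removeDuplicates("helllllllo", true));  // hello
    }
}
```

Only the replacement string changes between the two modes, so no code needs to be duplicated.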
Try this, this will be most efficient way[Edited after comment]:
public static String removeDuplicates(String str) {
int checker = 0;
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < str.length(); ++i) {
int val = str.charAt(i) - 'a';
if ((checker & (1 << val)) == 0)
buffer.append(str.charAt(i));
checker |= (1 << val);
}
return buffer.toString();
}
I am using bits to identify uniqueness.
EDIT:
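The whole logic is that once a character has been seen, its corresponding bit is set; the next time that character comes up, it is not added to the StringBuffer because the bit is already set. Note that this drops every later occurrence of a character, not only consecutive repeats (for example "hello world" becomes "helo wrd"), and a single int only has enough bits to track the lowercase letters a to z reliably.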
Here is my word count program using Java. I need to reprogram this so that something, something; something? something! and something count as one word. That means it should not count the same word twice regardless of case and punctuation.
import java.util.Scanner;
public class WordCount1
{
public static void main(String[]args)
{
final int Lines=6;
Scanner in=new Scanner (System.in);
String paragraph = "";
System.out.println( "Please input "+ Lines + " lines of text.");
for (int i=0; i < Lines; i+=1)
{
paragraph=paragraph+" "+in.nextLine();
}
System.out.println(paragraph);
String word="";
int WordCount=0;
for (int i=0; i<paragraph.length()-1; i+=1)
{
if (paragraph.charAt(i) != ' ' || paragraph.charAt(i) !=',' || paragraph.charAt(i) !=';' || paragraph.charAt(i) !=':' )
{
word= word + paragraph.charAt(i);
if(paragraph.charAt(i+1)==' ' || paragraph.charAt(i) ==','|| paragraph.charAt(i) ==';' || paragraph.charAt(i) ==':')
{
WordCount +=1;
word="";
}
}
}
System.out.println("There are "+WordCount +" words ");
}
}
Since this is homework, here are some hints and advice.
There is a clever little method called String.split that splits a string into parts, using a separator specified as a regular expression. If you use it the right way, this will give you a one line solution to the "word count" problem. (If you've been told not to use split, you can ignore that ... though it is the simple solution that a seasoned Java developer would consider first.)
Format / indent your code properly ... before you show it to other people. If your instructor doesn't deduct marks for this, he / she isn't doing his job properly.
Use standard Java naming conventions. The capitalization of Lines is incorrect. It could be LINES for a manifest constant or lines for variable, but a mixed case name starting with a capital letter should always be a class name.
Be consistent in your use of white space characters around operators (including the assignment operator).
It is a bad idea (and completely unnecessary) to hard wire the number of lines of input that the user must supply. And you are not dealing with the case where he / she supplies fewer than 6 lines.
You should just remove punctuation and change to a single case before doing further processing. (Be careful with locales and unicode)
Once you have broken the input into words, you can count the number of unique words by passing them into a Set and checking the size of the set.
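If you get stuck, the split-then-Set idea from the hints could be sketched as follows. The class name DistinctWordCount and the choice of what counts as a separator (anything that is not a letter, digit or apostrophe) are assumptions made for the example, not part of the assignment.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class DistinctWordCount {
    static Set<String> distinctWords(String text) {
        Set<String> words = new TreeSet<>();
        // Lower-case first, then split on runs of separator characters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+");
        for (String token : tokens) {
            if (!token.isEmpty()) { // split can yield an empty leading token
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = distinctWords("Something, something; SOMETHING? something! and something");
        System.out.println(words + " -> " + words.size() + " distinct words");
    }
}
```

Locale.ROOT is used so that lower-casing does not depend on the machine's default locale.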
Here You Go. This Works. Just Read The Comments And You Should Be Able To Follow.
import java.util.Arrays;
import java.util.HashSet;
import javax.swing.JOptionPane;
// Program Counts Words In A Sentence. Duplicates Are Not Counted.
public class WordCount
{
public static void main(String[]args)
{
// Initialize Variables
String sentence = "";
int wordCount = 1, startingPoint = 0;
// Prompt User For Sentence
sentence = JOptionPane.showInputDialog(null, "Please input a sentence.", "Input Information Below", 2);
// Remove All Punctuations. To Check For More Punctuations Just Add Another Replace Statement.
sentence = sentence.replace(",", "").replace(".", "").replace("?", "");
// Convert All Characters To Lowercase - Must Be Done To Compare Upper And Lower Case Words.
sentence = sentence.toLowerCase();
// Count The Number Of Words
for (int i = 0; i < sentence.length(); i++)
if (sentence.charAt(i) == ' ')
wordCount++;
// Initialize Array And A Count That Will Be Used As An Index
String[] words = new String[wordCount];
int count = 0;
// Put Each Word In An Array
for (int i = 0; i < sentence.length(); i++)
{
if (sentence.charAt(i) == ' ')
{
words[count] = sentence.substring(startingPoint,i);
startingPoint = i + 1;
count++;
}
}
// Put Last Word In Sentence In Array
words[wordCount - 1] = sentence.substring(startingPoint, sentence.length());
// Put Array Elements Into A Set. This Will Remove Duplicates
HashSet<String> wordsInSet = new HashSet<String>(Arrays.asList(words));
// Format Words In Hash Set To Remove Brackets, And Commas, And Convert To String
String wordsString = wordsInSet.toString().replace(",", "").replace("[", "").replace("]", "");
// Print Out None Duplicate Words In Set And Word Count
JOptionPane.showMessageDialog(null, "Words In Sentence:\n" + wordsString + " \n\n" +
"Word Count: " + wordsInSet.size(), "Sentence Information", 2);
}
}
If you know the marks you want to ignore (;, ?, !) you could do a simple String.replace to remove the characters out of the word. You may want to use String.startsWith and String.endsWith to help
Convert your values to lower case for easier matching (String.toLowerCase)
The use of a 'Set' is an excellent idea. If you want to know how many times a particular word appears you could also take advantage of a Map of some kind
You'll need to strip out the punctuation; here's one approach: Translating strings character by character
The above can also be used to normalize the case, although there are probably other utilities for doing so.
Now all of the variations you describe will be converted to the same string, and thus be recognized as such. As pretty much everyone else has suggested, a set would be a good tool for counting the number of distinct words.
Your real problem is that you want a distinct word count, so you should either keep track of which words you have already encountered, or delete them from the text entirely.
Let's say that you choose the first option and store the words you have already encountered in a List; then you can check against that list whether you have already seen that word.
List<String> encounteredWords = new ArrayList<String>();
// continue after that you found out what the word was
if(!encounteredWords.contains(word.toLowerCase())){
encounteredWords.add(word.toLowerCase());
wordCount++;
}
But Antimony made an interesting remark as well: he uses the property of a Set to get the distinct word count. By definition a set can never contain duplicates, so if you add the same word again, the set won't grow in size.
Set<String> wordSet = new HashSet<String>();
// continue after that you found out what the word was
wordSet.add(word.toLowerCase());
// continue after that you scanned trough all words
return wordSet.size();
remove all punctuations
convert all strings to lowercase OR uppercase
put those strings in a set
get the size of the set
As you parse your input string, store it word by word in a map data structure. Just ensure that "word", "word?" "word!" all are stored with the key "word" in the map, and increment the word's count whenever you have to add to the map.
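A minimal sketch of the map idea, assuming words are separated by whitespace and that everything other than letters and digits should be stripped from each word (the class name WordFrequency is invented here):

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordFrequency {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : text.split("\\s+")) {
            // Normalize so that "word", "word?" and "Word!" share one key.
            String key = raw.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (!key.isEmpty()) {
                counts.merge(key, 1, Integer::sum); // insert 1 or add 1 to the existing count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("word word? Word! other");
        System.out.println(counts + ", distinct words: " + counts.size());
    }
}
```

The number of distinct words is then the size of the map, and the values give how often each word occurred.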