I have a basic idea of how to do this task, but I'm not sure I'm doing it right. We have a class WindyString with a method blow. After calling it:
System.out.println(WindyString.blow(
"Abrakadabra! The second chance to pass has already BEGUN! "));
we should obtain something like this:
               e          a  e     a       a  ea y
 br k d br ! Th  s c nd ch nc  t  p ss h s  lr  d  B G N!
A  a a a  a       e o           o       a           E U
So, in a nutshell: in every second word we pick out all the vowels and move them one line up. In the remaining words we move the vowels one line down.
I know I should split the string into tokens with a tokenizer or the split method, but what next? Should I create 3 arrays, one for each row?
Yes, that's probably an easy (not very performant) way to solve the problem.
Create 3 arrays; one is filled with the actual data and 2 arrays are filled (Arrays.fill) with ' '.
Then iterate over the array containing the actual data, and keep an integer of which word you're currently at and a boolean if you already matched whitespace.
While iterating, you check if the character is a vowel or not. If it's a vowel, check the word-count (oddness/evenness) and place it in the first or third array.
When you reach a whitespace character, set the boolean and increase the word count. If you reach another whitespace character, check whether the boolean is already set: if so, just continue. If you match a non-whitespace character, reset the boolean.
Then join all arrays together and append a new-line character between each joined array and return the string.
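The steps above can be sketched roughly as follows. This is only one possible reading of the description: the class name WindyStringSketch is made up for illustration, the vowel set is assumed to be aeiouAEIOU (add y/Y if the sample output in the question should be reproduced exactly), and the input is assumed to be a single line.

```java
import java.util.Arrays;

class WindyStringSketch {
    // Assumption: only these characters count as vowels.
    private static final String VOWELS = "aeiouAEIOU";

    static String blow(String s) {
        char[] middle = s.toCharArray();      // the actual data
        char[] top = new char[middle.length];
        char[] bottom = new char[middle.length];
        Arrays.fill(top, ' ');
        Arrays.fill(bottom, ' ');

        int wordIndex = -1;                   // -1 until the first word starts
        boolean inWhitespace = true;          // true while we are between words

        for (int i = 0; i < middle.length; i++) {
            char c = middle[i];
            if (Character.isWhitespace(c)) {
                inWhitespace = true;
                continue;
            }
            if (inWhitespace) {               // first character of a new word
                wordIndex++;
                inWhitespace = false;
            }
            if (VOWELS.indexOf(c) >= 0) {
                // Words 0, 2, 4, ... drop their vowels to the bottom row,
                // words 1, 3, 5, ... lift them to the top row.
                char[] target = (wordIndex % 2 == 0) ? bottom : top;
                target[i] = c;
                middle[i] = ' ';
            }
        }
        return new String(top) + "\n" + new String(middle) + "\n" + new String(bottom);
    }

    public static void main(String[] args) {
        System.out.println(blow("Abrakadabra! The second chance to pass has already BEGUN! "));
    }
}
```

Because all three arrays have the length of the input, the rows stay aligned however many blanks separate the words.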
The simplest way is to use regex. This should be instructive:
static String blow(String s) {
String vowels = "aeiouAEIOU";
String middle = s.replaceAll("[" + vowels + "]", " ");
int flip = 0;
String[] side = { "", "" };
Scanner sc = new Scanner(s);
for (String word; (word = sc.findInLine("\\s*\\S*")) != null; ) {
side[flip] += word.replaceAll(".", " ");
side[1-flip] += word.replaceAll("[^" + vowels + "]", " ");
flip = 1-flip;
}
return String.format("|%s|%n|%s|%n|%s|", side[0], middle, side[1]);
}
I added the | characters in the output to show that this processes excess whitespaces correctly -- all three lines are guaranteed the same length, taking care of leading blanks, trailing blanks, or even ALL blanks input.
If you're not familiar with regular expressions, this is definitely a good one to start learning with.
The middle is simply the original string with all vowels replaced with spaces.
Then, side[0] and side[1] are the top and bottom lines respectively. We use the Scanner to extract every word (preserving leading and trailing spaces). The way we process each word is that in one side, everything is replaced by blanks; in the other, only non-vowels are replaced by blanks. We flip sides with every word we process.
I need to capitalize the first letter of every word in a string, but it's not as easy as it seems: a word is considered to be any sequence of letters, digits, "_", "-" and "`", while all other chars are considered to be separators, i.e. the next letter after them must be capitalized.
Example what program should do:
For input: "#he&llo wo!r^ld"
Output should be: "#He&Llo Wo!R^Ld"
There are questions that sound similar here, but their solutions really don't help.
This one for example:
String output = Arrays.stream(input.split("[\\s&]+"))
.map(t -> t.substring(0, 1).toUpperCase() + t.substring(1))
.collect(Collectors.joining(" "));
As in my task there can be various separators, this solution doesn't work.
It is possible to split a string and keep the delimiters, so taking into account the requirement for delimiters:
word is considered to be any sequence of letters, digits, "_" , "-", "`" while all other chars are considered to be separators
the pattern which keeps the delimiters in the result array would be: "((?<=[^-`\\w])|(?=[^-`\\w]))":
[^-`\\w]: all characters except -, backtick and word characters \w: [A-Za-z0-9_]
Then, the "words" are capitalized, and delimiters are kept as is:
static String capitalize(String input) {
if (null == input || 0 == input.length()) {
return input;
}
return Arrays.stream(input.split("((?<=[^-`\\w])|(?=[^-`\\w]))"))
.map(s -> s.matches("[-`\\w]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
.collect(Collectors.joining(""));
}
Tests:
System.out.println(capitalize("#he&l_lo-wo!r^ld"));
System.out.println(capitalize("#`he`&l+lo wo!r^ld"));
Output:
#He&L_lo-wo!R^Ld
#`he`&L+Lo Wo!R^Ld
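To see why the delimiters survive, it may help to look at the raw tokens the lookaround split produces before any capitalization happens. The sketch below is illustrative only; the class name SplitKeepDelimiters and the helper tokens are made up for this example.

```java
import java.util.Arrays;

class SplitKeepDelimiters {
    // Zero-width split points on either side of any character that is not
    // '-', '`' or a \w character, so the delimiters stay in the result.
    static final String BOUNDARY = "((?<=[^-`\\w])|(?=[^-`\\w]))";

    static String[] tokens(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // Every separator becomes its own one-character token.
        System.out.println(Arrays.toString(tokens("#he&llo wo!r^ld")));
    }
}
```

For "#he&llo wo!r^ld" this yields the pieces #, he, &, llo, a single space, wo, !, r, ^ and ld. Since nothing is consumed by the split, joining the tokens with an empty separator gives back the original string.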
Update
If you need to process not only the ASCII character set but also other alphabets or character sets (e.g. Cyrillic, Greek, etc.), the Unicode-aware class \\p{IsWord} may be used, and matching of Unicode characters needs to be enabled with the pattern flag (?U):
static String capitalizeUnicode(String input) {
if (null == input || 0 == input.length()) {
return input;
}
return Arrays.stream(input.split("(?U)((?<=[^-`\\p{IsWord}])|(?=[^-`\\p{IsWord}]))"))
.map(s -> s.matches("(?U)[-`\\p{IsWord}]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
.collect(Collectors.joining(""));
}
Test:
System.out.println(capitalizeUnicode("#he&l_lo-wo!r^ld"));
System.out.println(capitalizeUnicode("#привет&`ёж`+дос^βιδ/ως"));
Output:
#He&L_lo-wo!R^Ld
#Привет&`ёж`+Дос^Βιδ/Ως
You can't use split that easily - split will eliminate the separators and give you only the things in between. As you need the separators, no can do.
One real dirty trick is to use something called 'lookahead'. That argument you pass to split is a regular expression. Most 'characters' in a regexp have the property that they consume the matching input. If you do input.split("\\s+") then that doesn't 'just' split on whitespace, it also consumes them: The whitespace is no longer part of the individual entries in your string array.
However, consider ^ and $, or \\b. These still match things but don't consume anything. You don't consume 'end of string'. In fact, ^^^hello$$$ matches the string "hello" just as well. You can do this yourself, using lookahead: It matches when the lookahead is there but does not consume it:
String[] args = "Hello World$Huh   Weird".split("(?=[\\s_$-]+)");
for (String arg : args) System.out.println("*" + arg + "*");
Unfortunately, this 'works', in that it saves your separators, but isn't getting you all that much closer to a solution:
*Hello*
* World*
*$Huh*
* *
* *
* Weird*
You can go with lookbehind as well, but it's limited; they don't do variable length, for example.
The conclusion should rapidly become: Actually, doing this with split is a mistake.
Then, once split is off the table, you should no longer use streams, either: Streams don't do well once you need to know stuff about the previous element in a stream to do the job: A stream of characters doesn't work, as you need to know if the previous character was a non-letter or not.
In general, "I want to do X, and use Y" is a mistake. Keep an open mind. It's akin to asking: "I want to butter my toast, and use a hammer to do it". Oookaaaaayyyy, you can probably do that, but, eh, why? There are butter knives right there in the drawer, just.. put down the hammer, that's toast. Not a nail.
Same here.
A simple loop can take care of this, no problem:
private static final String BREAK_CHARS = "&-_`";
public String toTitleCase(String input) {
StringBuilder out = new StringBuilder();
boolean atBreak = true;
for (char c : input.toCharArray()) {
out.append(atBreak ? Character.toUpperCase(c) : c);
atBreak = Character.isWhitespace(c) || (BREAK_CHARS.indexOf(c) > -1);
}
return out.toString();
}
Simple. Efficient. Easy to read. Easy to modify. For example, if you want to go with 'any non-letter counts', trivial: atBreak = !Character.isLetter(c);.
Contrast to the stream solution which is fragile, weird, far less efficient, and requires a regexp that needs half a page's worth of comment for anybody to understand it.
Can you do this with streams? Yes. You can butter toast with a hammer, too. Doesn't make it a good idea though. Put down the hammer!
You can use a simple FSM as you iterate over the characters in the string, with two states, either in a word, or not in a word. If you are not in a word and the next character is a letter, convert it to upper case, otherwise, if it is not a letter or if you are already in a word, simply copy it unmodified.
boolean isWord(int c) {
return c == '`' || c == '_' || c == '-' || Character.isLetter(c) || Character.isDigit(c);
}
String capitalize(String s) {
StringBuilder sb = new StringBuilder();
boolean inWord = false;
for (int c : s.codePoints().toArray()) {
if (!inWord && Character.isLetter(c)) {
sb.appendCodePoint(Character.toUpperCase(c));
} else {
sb.appendCodePoint(c);
}
inWord = isWord(c);
}
return sb.toString();
}
Note: I have used codePoints(), appendCodePoint(int), and int so that characters outside the basic multilingual plane (with code points greater than 64k) are handled correctly.
I need to capitalize first letter in every word
Here is one way to do it. Admittedly this is a mite longer, but your requirement to change the first letter to upper case (not the first digit or first non-letter) required a helper method. Otherwise it would have been easier. Some others seem to have missed this point.
Establish word pattern, and test data.
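String wordPattern = "[\\w_-`]+"; // note: inside the class, _-` is a range (underscore through backtick), so a literal '-' is not matched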
Pattern p = Pattern.compile(wordPattern);
String[] inputData = { "#he&llo wo!r^ld", "0hel`lo-w0rld" };
Now this simply finds each successive word in the string based on the established regular expression. As each word is found, it changes the first letter in the word to upper case and then puts it in a string buffer in the correct position where the match was found.
for (String input : inputData) {
StringBuilder sb = new StringBuilder(input);
Matcher m = p.matcher(input);
while (m.find()) {
sb.replace(m.start(), m.end(),
upperFirstLetter(m.group()));
}
System.out.println(input + " -> " + sb);
}
prints
#he&llo wo!r^ld -> #He&Llo Wo!R^Ld
0hel`lo-w0rld -> 0Hel`lo-W0rld
Since words may start with digits and the requirement was to convert the first letter (not the first character) to upper case, this helper method finds the first letter, converts it to upper case and returns the new string. So 01_hello would become 01_Hello.
public static String upperFirstLetter(String word) {
char[] chs = word.toCharArray();
for (int i = 0; i < chs.length; i++) {
if (Character.isLetter(chs[i])) {
chs[i] = Character.toUpperCase(chs[i]);
break;
}
}
return String.valueOf(chs);
}
Given the following string:
String text = "The woods are\nlovely,\t\tdark and deep.";
I want all whitespace treated as a single character. So for instance, the \n is 1 char. The \t\t should also be 1 char. With that logic, I count 36 characters and 7 words. But when I run this through the following code:
String text = "The woods are\nlovely,\t\tdark and deep.";
int numNewCharacters = 0;
for(int i=0; i < text.length(); i++)
if(!Character.isWhitespace(text.charAt(i)))
numNewCharacters++;
int numNewWords = text.split("\\s").length;
// Prints "30"
System.out.println("Chars:" + numNewCharacters);
// Prints "8"
System.out.println("Words:" + numNewWords);
It's telling me that there are 30 characters and 8 words. Any ideas as to why? Thanks in advance.
You are matching on individual whitespaces. Instead you could match on one or more:
text.split("\\s+")
You are counting only non-whitespace characters in the first loop, so spaces, tabs and newlines are not counted at all; 30 is therefore the right answer for that loop. As for the second: split("\\s") treats consecutive whitespace characters as separate delimiters, so there is an empty string (not null) between the two tabs, and it is counted as a word.
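A small sketch makes the difference between the two patterns visible. The class and method names here are invented for the example; the text is the one from the question.

```java
import java.util.Arrays;

class SplitWhitespaceDemo {
    // Splits on every single whitespace character.
    static String[] splitSingle(String text) {
        return text.split("\\s");
    }

    // Splits on runs of one or more whitespace characters.
    static String[] splitRuns(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "The woods are\nlovely,\t\tdark and deep.";
        // The first array contains an empty string where the two tabs meet.
        System.out.println(Arrays.toString(splitSingle(text)));
        System.out.println(Arrays.toString(splitRuns(text)));
    }
}
```

With the text from the question, the first call returns 8 elements (one of them empty) and the second returns 7.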
Reimueus has already solved your word count problem:
text.split("\\s+")
And your character count is correct. Newlines \n and tabs \t are considered whitespace. If you don't want them to be, you can implement your own isWhitespace function.
Here is the complete solution to counting words and characters:
System.out.println("Characters: " + text.replaceAll("\\s+", " ").length());
Matcher m = Pattern.compile("[^\\s]+", Pattern.MULTILINE).matcher(text);
int wordCount = 0;
while (m.find()) {
wordCount ++;
}
System.out.println("Words: "+ wordCount);
The character count is obtained by replacing each group of whitespace characters with a single space and taking the resulting string's length.
For the word count we create a pattern that matches any run of characters containing no whitespace. You could use the \\w+ pattern here, but it matches only alphanumeric characters and underscores. Note that Pattern.MULTILINE only changes how ^ and $ match, so it has no effect on this pattern and can be omitted.
I am trying to create a String[] which contains only words that are made up of certain characters. For example I have a dictionary containing a number of words like so:
arm
army
art
as
at
attack
attempt
attention
attraction
authority
automatic
awake
baby
back
bad
bag
balance
I want to narrow the list down so that it only contains words with the characters a, b and g. Therefore the list should only contain the word 'bag' in this example.
Currently I am trying to do this using regexes but having never used them before I can't seem to get it to work.
Here is my code:
public class LetterJugglingMain {
public static void main(String[] args) {
String dictFile = "/Users/simonrhillary/Desktop/Dictionary(3).txt";
fileReader fr = new fileReader();
fr.openFile(dictFile);
String[] dictionary = fr.fileToArray();
String regx = "able";
String[] newDict = createListOfValidWords(dictionary, regx);
printArray(newDict);
}
public static String[] createListOfValidWords(String[] d, String regex){
List<String> narrowed = new ArrayList<String>();
for(int i = 0; i<d.length; i++){
if(d[i].matches(regex)){
narrowed.add(d[i]);
System.out.println("added " + d[i]);
}
}
String[] narrowArray = narrowed.toArray(new String[0]);
return narrowArray;
}
however the array returned is always empty unless the String regex is the exact word! Any ideas? I can post more code if needed...I think I must be trying to initialise the regex wrong.
The narrowed down list must contain ONLY the characters from the regex.
Frankly, I'm not an expert in regexes, but I don't think it's the best tool to do what you want. I would use a method like the following:
public boolean containsAll(String s, Set<Character> chars) {
Set<Character> copy = new HashSet<Character>();
for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
char c = s.charAt(i);
if (chars.contains(c)) {
copy.add(c);
}
}
return copy.size() == chars.size();
}
The regex able will match only the string "able". However, if you want a regular expression to match any one of the characters a, b, l or e, the regex you're looking for is [able] (in brackets). If you want words made up of several such characters, add a + to repeat the pattern: [able]+.
The OP wants words that contain every character. Not just one of them.
And other characters are not a problem.
If this is the case, I think the simplest way would be to loop through the entire string, character by character, and check to see if it contains all of the characters you want. Keep flags to check and see if every character has been found.
If this isn't the case.... :
Try using the regex:
^[able]+$
Here's what it does:
^ matches the beginning of the string and $ matches the end of the string. This makes sure that you're not getting a partial match.
[able] matches the characters you want the string to consist of, in this case a, b, l, and e. + Makes sure that there are 1 or more of these characters in the string.
Note: This regex will match a string that consists only of these 4 letters (in any order, repeated any number of times, and not necessarily all of them). For example, it will match:
able, albe, aeble, aaaabbblllleeee
and will not match
qable, treatable, and abled.
Here is a sample regex that keeps only the words containing at least one occurrence of every character in a set. This will match any English word (case-insensitive) that contains at least one occurrence of each of the characters a, b, g:
(?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
Example of strings that match would be bag, baggy, grab.
Example of strings that don't match would be big, argument, nothing.
The (?i) turns on the case-insensitive flag.
You need to append as many (?=.*<character>) as the number of characters in the set, for each of the characters.
I assume a word only contains English alphabet, so I specify [a-z]. Specify more if you need space, hyphen, etc.
I assume you use the matches(String regex) method of the String class, which has to match the entire string, so I omitted the ^ and $.
The performance may be bad, since in the worst case (the characters are found at the end of the words), I think that the regex engine may go through the string for around n times where n is the number of characters in the set. It may not be an actual concern at all, since the words are very short, but if it turns out that this is a bottleneck, you may consider doing simple looping.
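As a rough sketch of how such a regex could be assembled for an arbitrary set of letters and applied to the dictionary: the class ContainsAllLetters and its methods are hypothetical names, and the letters are assumed to be plain a-z characters that need no regex escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ContainsAllLetters {
    // For "abg" this builds (?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
    // Assumption: every character in letters is a plain letter a-z.
    static Pattern containsAll(String letters) {
        StringBuilder regex = new StringBuilder("(?i)");
        for (char c : letters.toCharArray()) {
            regex.append("(?=.*").append(c).append(")");
        }
        regex.append("[a-z]+");
        return Pattern.compile(regex.toString());
    }

    // Keeps the words that contain every one of the given letters.
    static List<String> filter(String[] dictionary, String letters) {
        Pattern p = containsAll(letters);
        List<String> narrowed = new ArrayList<>();
        for (String word : dictionary) {
            if (p.matcher(word).matches()) {   // matches() tests the whole word
                narrowed.add(word);
            }
        }
        return narrowed;
    }

    public static void main(String[] args) {
        String[] dictionary = { "arm", "awake", "baby", "back", "bad", "bag", "balance" };
        System.out.println(filter(dictionary, "abg"));
    }
}
```

Compiling the pattern once and reusing it for every word avoids recompiling the regex on each String.matches call.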
Here is my word count program in Java. I need to rework it so that something, something; something? something! and something count as one word. That means it should not count the same word twice, regardless of case and punctuation.
import java.util.Scanner;
public class WordCount1
{
public static void main(String[]args)
{
final int Lines=6;
Scanner in=new Scanner (System.in);
String paragraph = "";
System.out.println( "Please input "+ Lines + " lines of text.");
for (int i=0; i < Lines; i+=1)
{
paragraph=paragraph+" "+in.nextLine();
}
System.out.println(paragraph);
String word="";
int WordCount=0;
for (int i=0; i<paragraph.length()-1; i+=1)
{
if (paragraph.charAt(i) != ' ' || paragraph.charAt(i) !=',' || paragraph.charAt(i) !=';' || paragraph.charAt(i) !=':' )
{
word= word + paragraph.charAt(i);
if(paragraph.charAt(i+1)==' ' || paragraph.charAt(i) ==','|| paragraph.charAt(i) ==';' || paragraph.charAt(i) ==':')
{
WordCount +=1;
word="";
}
}
}
System.out.println("There are "+WordCount +" words ");
}
}
Since this is homework, here are some hints and advice.
There is a clever little method called String.split that splits a string into parts, using a separator specified as a regular expression. If you use it the right way, this will give you a one line solution to the "word count" problem. (If you've been told not to use split, you can ignore that ... though it is the simple solution that a seasoned Java developer would consider first.)
Format / indent your code properly ... before you show it to other people. If your instructor doesn't deduct marks for this, he / she isn't doing his / her job properly.
Use standard Java naming conventions. The capitalization of Lines is incorrect. It could be LINES for a manifest constant or lines for variable, but a mixed case name starting with a capital letter should always be a class name.
Be consistent in your use of white space characters around operators (including the assignment operator).
It is a bad idea (and completely unnecessary) to hard wire the number of lines of input that the user must supply. And you are not dealing with the case where he / she supplies fewer than 6 lines.
You should just remove punctuation and change to a single case before doing further processing. (Be careful with locales and unicode)
Once you have broken the input into words, you can count the number of unique words by passing them into a Set and checking the size of the set.
Here You Go. This Works. Just Read The Comments And You Should Be Able To Follow.
import java.util.Arrays;
import java.util.HashSet;
import javax.swing.JOptionPane;
// Program Counts Words In A Sentence. Duplicates Are Not Counted.
public class WordCount
{
public static void main(String[]args)
{
// Initialize Variables
String sentence = "";
int wordCount = 1, startingPoint = 0;
// Prompt User For Sentence
sentence = JOptionPane.showInputDialog(null, "Please input a sentence.", "Input Information Below", 2);
// Remove All Punctuations. To Check For More Punctuations Just Add Another Replace Statement.
sentence = sentence.replace(",", "").replace(".", "").replace("?", "");
// Convert All Characters To Lowercase - Must Be Done To Compare Upper And Lower Case Words.
sentence = sentence.toLowerCase();
// Count The Number Of Words
for (int i = 0; i < sentence.length(); i++)
if (sentence.charAt(i) == ' ')
wordCount++;
// Initialize Array And A Count That Will Be Used As An Index
String[] words = new String[wordCount];
int count = 0;
// Put Each Word In An Array
for (int i = 0; i < sentence.length(); i++)
{
if (sentence.charAt(i) == ' ')
{
words[count] = sentence.substring(startingPoint,i);
startingPoint = i + 1;
count++;
}
}
// Put Last Word In Sentence In Array
words[wordCount - 1] = sentence.substring(startingPoint, sentence.length());
// Put Array Elements Into A Set. This Will Remove Duplicates
HashSet<String> wordsInSet = new HashSet<String>(Arrays.asList(words));
// Format Words In Hash Set To Remove Brackets, And Commas, And Convert To String
String wordsString = wordsInSet.toString().replace(",", "").replace("[", "").replace("]", "");
// Print Out None Duplicate Words In Set And Word Count
JOptionPane.showMessageDialog(null, "Words In Sentence:\n" + wordsString + " \n\n" +
"Word Count: " + wordsInSet.size(), "Sentence Information", 2);
}
}
If you know the marks you want to ignore (;, ?, !) you could do a simple String.replace to remove those characters from the word. You may want to use String.startsWith and String.endsWith to help.
Convert your values to lower case for easier matching (String.toLowerCase).
The use of a Set is an excellent idea. If you want to know how many times a particular word appears, you could also take advantage of a Map of some kind.
You'll need to strip out the punctuation; here's one approach: Translating strings character by character
The above can also be used to normalize the case, although there are probably other utilities for doing so.
Now all of the variations you describe will be converted to the same string, and thus be recognized as such. As pretty much everyone else has suggested, a set would be a good tool for counting the number of distinct words.
Your real problem is that you want a distinct word count, so you should either keep track of which words you have already encountered, or delete them from the text entirely.
Let's say that you choose the first option and store the words you have already encountered in a List; then you can check against that list whether you have already seen a word.
List<String> encounteredWords = new ArrayList<String>();
// once you have found out what the word is
if (!encounteredWords.contains(word.toLowerCase())) {
encounteredWords.add(word.toLowerCase());
wordCount++;
}
But Antimony made an interesting remark as well: he uses the property of a Set to get the distinct word count. By definition a set can never contain duplicates, so if you add the same word again, the set won't grow in size.
Set<String> wordSet = new HashSet<String>();
// once you have found out what the word is
wordSet.add(word.toLowerCase());
// after you have scanned through all the words
return wordSet.size();
remove all punctuations
convert all strings to lowercase OR uppercase
put those strings in a set
get the size of the set
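The four steps above might look like this in code. It is only a sketch: DistinctWordCount is a made-up name, "punctuation" is taken to mean anything that is not a letter, a digit or whitespace, and Locale.ROOT is used so the lowercasing does not depend on the default locale.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class DistinctWordCount {
    static int countDistinct(String text) {
        // 1. Remove everything that is not a letter, a digit or whitespace.
        String cleaned = text.replaceAll("[^\\p{L}\\p{Nd}\\s]", "");
        // 2. Normalize the case.
        cleaned = cleaned.toLowerCase(Locale.ROOT);
        // 3. Put the words in a set; duplicates are dropped automatically.
        Set<String> words = new HashSet<>();
        for (String word : cleaned.trim().split("\\s+")) {
            if (!word.isEmpty()) {          // split of an empty string yields one empty token
                words.add(word);
            }
        }
        // 4. The size of the set is the number of distinct words.
        return words.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct("something, something; Something? something! and something"));
    }
}
```

Note that removing punctuation outright turns a word such as don't into dont; whether that is acceptable depends on the assignment.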
As you parse your input string, store it word by word in a map data structure. Just ensure that "word", "word?" "word!" all are stored with the key "word" in the map, and increment the word's count whenever you have to add to the map.
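A minimal sketch of that map-based idea, assuming that the key for a token is obtained by dropping every character that is not a letter or digit and lowercasing the rest (WordFrequency is a hypothetical name):

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordFrequency {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();   // TreeMap keeps the keys sorted
        for (String token : text.split("\\s+")) {
            // "word", "word?" and "Word!" all normalize to the key "word".
            String key = token.replaceAll("[^\\p{L}\\p{Nd}]", "").toLowerCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue;                                 // token was only punctuation or empty
            }
            counts.merge(key, 1, Integer::sum);           // insert 1 or add 1 to the old count
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("word word? Word! other"));
    }
}
```

The number of distinct words is then the size of the map, and each value tells how often that word occurred.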
Can anyone give me a Java regex to identify repeated characters in a string? I am only looking for characters that are repeated immediately and they can be letters or digits.
Example:
abccde <- looking for this (immediately repeating c's)
abcdce <- not this (c's separated by another character)
Try "(\\w)\\1+"
The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.
(Note that I gave the regex as a Java string, i.e. with the backslashes already doubled for you)
String stringToMatch = "abccdef";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
System.out.println("Duplicate character " + m.group(1));
}
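If every repeated run is wanted rather than just the first one, the same pattern can be applied in a loop. The sketch below uses invented names (RepeatedRuns, findRuns) and collects the full matched runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedRuns {
    // A word character followed by one or more copies of itself.
    private static final Pattern RUN = Pattern.compile("(\\w)\\1+");

    static List<String> findRuns(String s) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(s);
        while (m.find()) {
            runs.add(m.group());    // the whole run, e.g. "cc"; m.group(1) is the single character
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(findRuns("abccdeeef"));
        System.out.println(findRuns("abcdce"));
    }
}
```

Since \\w also matches an underscore, ([A-Za-z0-9])\\1+ could be used instead if only ASCII letters and digits should count.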
Regular Expressions are expensive. You would probably be better off just storing the last character and checking to see if the next one is the same.
Something along the lines of:
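String s = "abccde";                  // example input
char previous = s.charAt(0);          // assumes s is not empty
for (int i = 1; i < s.length(); i++) {
    char current = s.charAt(i);
    if (current == previous) {
        System.out.println("Duplicate character " + current);
    }
    previous = current;
}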