If I have a file containing random characters e.g:
sdo8kd oko ala la654
"sdo8kd", "oko", "ala" and "la654" would be considered words.
How can I represent a word (a run of characters containing no whitespace) specifically using the method Character.isWhitespace(c), where c is the character being checked?
You can use String's split(regex) method to put the words into an array, then do whatever you want with them.
String sentence = "sdo8kd oko ala la654";
String[] words = sentence.split("\\s+");
for(String word : words){
System.out.println("'" + word + "'"); //'sdo8kd', 'oko', 'ala', 'la654'
}
System.out.println(words.length); //4
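To check that a string consists only of letters and digits, try the regex "[a-zA-Z0-9]+". Note that matches() tests the whole string, not a substring.
boolean isAlphaNumeric = s.matches("[a-zA-Z0-9]+");
a-zA-Z all ASCII Latin letters (lower and upper case)
0-9 all digits
[a-zA-Z0-9]+ one or more of the characters inside the brackets.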
If you want to split the words no matter how many spaces are between them, you can use
"sdo8kd oko ala la654".split(" +");
This will return String[] with values "sdo8kd", "oko", "ala" and "la654"
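None of the answers above uses Character.isWhitespace(c), which is what the question asks about. The following is a minimal sketch of that approach, not a definitive implementation: it walks the text once and treats every maximal run of non-whitespace characters as a word. The class name WhitespaceWords and the helper splitWords are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceWords {
    // Hypothetical helper: collects maximal runs of non-whitespace characters.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current word, if one is in progress.
                if (current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the last word when the text does not end in whitespace.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitWords("sdo8kd oko ala la654")); // [sdo8kd, oko, ala, la654]
    }
}
```

Unlike split, this never produces empty strings for leading or repeated whitespace, because a word is only added once it contains at least one character.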
I have made a program that counts the frequency of a word in a very long string. My problem is that the program is counting for example "*it" (consider * a quotation mark) and "it" as different words and therefore putting them in different categories.
I tried to replace all the punctuation marks I know of with the following code:
text = text.replace("\n", " ");
text = text.replaceAll("\\p{Punct}", " ");
text = text.replace("\"", "");
text = text.replace("–", "");
text = text.replace("\t", "");
Unfortunately, the code didn't work. I think that is because Unicode has many different quotation marks that I can't tell apart. Is there a way to remove all Unicode characters except letters and whitespace with the String.replaceAll method, or do I have to make a char array and continue from there?
Thanks a lot, any help would be appreciated.
I think this might do it
text = text.replaceAll("[^a-zA-Z0-9 ]", "");
which will remove every character that is not an ASCII letter, a digit, or a space.
EDIT: As suggested by @npinti, use \\p{L}, which matches letters from any script and not just a-z:
text = text.replaceAll("[^\\p{L}0-9 ]", "");
This will replace each run of non-letter/digit characters with a single space, so you don't get multiple consecutive spaces:
text = text.replaceAll("[^\\p{L}\\d]+", " ");
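This will remove everything that is not an ASCII letter or whitespace. Strings are immutable, so assign the result back:
text = text.replaceAll("[^\\sa-zA-Z]", "");
Legend:
^ - negates the character class, so only characters NOT listed in it are matched (and removed)
\\s - any whitespace character (\n, \t, ' ', ...)
a-zA-Z - all ASCII letters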
Example:
String in="12ASxA sdr5%";
System.out.println(in.replaceAll("[^\\sa-zA-Z]", "")); // ASxA sdr
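To tie the answers back to the original word-frequency problem, here is a hedged sketch that combines \\p{L} with a frequency map. It assumes that words should be compared case-insensitively and that digits are not part of words; neither assumption is stated in the question. The class WordFrequency and the method countWords are hypothetical names.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordFrequency {
    // Hypothetical helper: replaces every run of characters that are neither
    // Unicode letters nor whitespace with a space, then counts the words.
    static Map<String, Integer> countWords(String text) {
        String cleaned = text.replaceAll("[^\\p{L}\\s]+", " ").trim();
        Map<String, Integer> counts = new TreeMap<>();
        if (cleaned.isEmpty()) {
            return counts; // avoid counting the single empty token split would return
        }
        for (String word : cleaned.split("\\s+")) {
            // Assumption: "It" and "it" are the same word.
            counts.merge(word.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // \u201C, \u201D and \u2013 are curly quotes and an en dash, which
        // \p{Punct} does not match by default because it covers ASCII only.
        System.out.println(countWords("\u201Cit\u201D is what it is \u2013 really"));
        // {is=2, it=2, really=1, what=1}
    }
}
```

Replacing the unwanted characters with a space rather than an empty string keeps words such as "end.Start" from being glued together.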
I have a string, String a=" *|** || |**|** ";
The spaces separate this string into three groups. How can I convert these three groups into an array with three elements?
I tried to use split,
String a=" *|** || |**|** ";
String names[]=a.trim().split(" ");
System.out.println(names.length);
The expected output is 3, but it prints 8. Can anyone tell me how to do this? Thanks.
String a=" *|** || |**|** ";
String names[]=a.trim().split("\\s+");
System.out.println(names.length);
This splits on any run of whitespace characters (regular spaces, tabs, etc.).
The regex "\\s+" matches one or more consecutive whitespace characters.
Try splitting on runs of whitespace; your code splits on every single space character. \s+ is the regex for one or more whitespace characters.
String a = " *|** || |**|** ";
String names[] = a.trim().split("\\s+");
System.out.println(names.length);
System.out.println(Arrays.toString(names)); // requires import java.util.Arrays;
Output
3
[*|**, ||, |**|**]
How can I split this string power:110V;220V;Color:Pink;White;Type:1;2;Condition:New;Used;
into these 4 strings
power:110V;220V;
Color:Pink;White;
Type:1;2;
Condition:New;Used;
Split your input according to the below regex.
string.split("(?<=;)(?=\\w+:)");
The above regex matches every zero-width boundary that is preceded by a semicolon and followed by one or more word characters and a colon. Because only lookarounds are used, no characters are consumed by the split.
OR
string.split("(?<=;)(?=[^;:]*:)");
Example:
String s = "power:110V;220V;Color:Pink;White;Type:1;2;Condition:New;Used;";
String[] parts = s.split("(?<=;)(?=\\w+:)");
for(String i: parts)
{
System.out.println(i);
}
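As a quick check of both patterns from this answer, the sketch below runs them on the string from the question. The class LookaroundSplit and its two method names are hypothetical. The second pattern is looser: it accepts any key that contains no ';' or ':', so it also handles keys with spaces.

```java
import java.util.Arrays;

class LookaroundSplit {
    // Splits at each zero-width position that follows ';' and precedes "key:",
    // where the key consists of word characters only.
    static String[] splitByKey(String input) {
        return input.split("(?<=;)(?=\\w+:)");
    }

    // Looser variant: the key may contain any characters except ';' and ':'.
    static String[] splitByKeyLoose(String input) {
        return input.split("(?<=;)(?=[^;:]*:)");
    }

    public static void main(String[] args) {
        String s = "power:110V;220V;Color:Pink;White;Type:1;2;Condition:New;Used;";
        // Both lines print:
        // [power:110V;220V;, Color:Pink;White;, Type:1;2;, Condition:New;Used;]
        System.out.println(Arrays.toString(splitByKey(s)));
        System.out.println(Arrays.toString(splitByKeyLoose(s)));
    }
}
```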
Given the following string:
String text = "The woods are\nlovely,\t\tdark and deep.";
I want all whitespace treated as a single character. So for instance, the \n is 1 char. The \t\t should also be 1 char. With that logic, I count 36 characters and 7 words. But when I run this through the following code:
String text = "The woods are\nlovely,\t\tdark and deep.";
int numNewCharacters = 0;
for(int i=0; i < text.length(); i++)
if(!Character.isWhitespace(text.charAt(i)))
numNewCharacters++;
int numNewWords = text.split("\\s").length;
// Prints "30"
System.out.println("Chars:" + numNewCharacters);
// Prints "8"
System.out.println("Words:" + numNewWords);
It's telling me that there are 30 characters and 8 words. Any ideas as to why? Thanks in advance.
You are matching on individual whitespaces. Instead you could match on one or more:
text.split("\\s+")
Your first loop counts only non-whitespace characters, so spaces, tabs and newlines are not counted at all; 30 is therefore the right answer for that loop. As for the second, split("\\s") treats every single whitespace character as a separate delimiter, so there is an empty string (not null) between the two tabs, and that empty string is counted as a word.
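The empty token can be made visible by printing both arrays side by side. This is only an illustrative sketch; the class SplitComparison and its method names are hypothetical.

```java
import java.util.Arrays;

class SplitComparison {
    // Splits on every single whitespace character.
    static String[] splitEach(String text) {
        return text.split("\\s");
    }

    // Splits on every run of one or more whitespace characters.
    static String[] splitRuns(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "The woods are\nlovely,\t\tdark and deep.";
        // [The, woods, are, lovely,, , dark, and, deep.]  (note the empty element)
        System.out.println(Arrays.toString(splitEach(text)));
        // [The, woods, are, lovely,, dark, and, deep.]
        System.out.println(Arrays.toString(splitRuns(text)));
        System.out.println(splitEach(text).length + " vs " + splitRuns(text).length); // 8 vs 7
    }
}
```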
Reimueus has already solved your word count problem:
text.split("\\s+")
And your character count is correct for the code you wrote: newlines \n and tabs \t are whitespace, so Character.isWhitespace skips them. If you don't want them to be treated that way, you can implement your own isWhitespace function.
Here is the complete solution to counting words and characters:
System.out.println("Characters: " + text.replaceAll("\\s+", " ").length());
Matcher m = Pattern.compile("[^\\s]+", Pattern.MULTILINE).matcher(text);
int wordCount = 0;
while (m.find()) {
wordCount ++;
}
System.out.println("Words: "+ wordCount);
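The character count is obtained by replacing each group of whitespace with a single space and taking the resulting string's length.
For the word count we create a pattern that matches any run of characters that contains no whitespace ("[^\\s]+" is equivalent to "\\S+"). You could use the \\w+ pattern here, but it matches only alphanumeric characters and underscores. Note that Pattern.MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern and can be omitted. The code needs import java.util.regex.Matcher; and import java.util.regex.Pattern;.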
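Since the question already loops with Character.isWhitespace, the same two numbers can also be computed in a single pass without regular expressions. The following is a sketch under that assumption, not the answerer's method; TextStats and count are hypothetical names.

```java
class TextStats {
    // Hypothetical helper: returns {characters, words}, where each run of
    // whitespace counts as one character.
    static int[] count(String text) {
        int chars = 0;
        int words = 0;
        boolean inWord = false;
        boolean inWhitespace = false;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (!inWhitespace) {
                    chars++; // only the first character of a whitespace run counts
                }
                inWhitespace = true;
                inWord = false;
            } else {
                chars++;
                if (!inWord) {
                    words++; // the first character of a word starts a new word
                }
                inWord = true;
                inWhitespace = false;
            }
        }
        return new int[] {chars, words};
    }

    public static void main(String[] args) {
        int[] stats = count("The woods are\nlovely,\t\tdark and deep.");
        System.out.println("Characters: " + stats[0]); // Characters: 36
        System.out.println("Words: " + stats[1]);      // Words: 7
    }
}
```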
Can anyone give me a Java regex to identify repeated characters in a string? I am only looking for characters that are repeated immediately and they can be letters or digits.
Example:
abccde <- looking for this (immediately repeating c's)
abcdce <- not this (c's separated by another character)
Try "(\\w)\\1+"
The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.
(Note that I gave the regex as a Java string, i.e. with the backslashes already doubled for you)
String stringToMatch = "abccdef";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
System.out.println("Duplicate character " + m.group(1));
}
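The snippet above reports only the first repeated character. To list every run, m.find() can be called in a loop; this is a small sketch with hypothetical names (RepeatedRuns, findRuns). Because the question asks for letters or digits only, "\\w" could be swapped for "[a-zA-Z0-9]" to exclude the underscore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedRuns {
    // Compiled once and reused; a Pattern is immutable and safe to share.
    private static final Pattern REPEAT = Pattern.compile("(\\w)\\1+");

    // Hypothetical helper: returns every run of an immediately repeated word character.
    static List<String> findRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = REPEAT.matcher(input);
        while (m.find()) {
            // group() is the whole run; group(1) would be the single repeated character.
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(findRuns("abccde"));   // [cc]
        System.out.println(findRuns("abcdce"));   // []
        System.out.println(findRuns("aabbbc11")); // [aa, bbb, 11]
    }
}
```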
Regular expressions can be relatively expensive. For a check this simple you would probably be better off just storing the last character and checking whether the next one is the same.
Something along the lines of:
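String s = "abccde";
char previous = s.charAt(0); // assumes s is not empty
for (int i = 1; i < s.length(); i++) {
    char current = s.charAt(i);
    // Two neighbouring characters are equal: an immediate repeat.
    if (current == previous) {
        System.out.println("Duplicate character " + current);
    }
    previous = current;
}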