Java (Regex) - Get all words in a sentence - java

I need to split a java string into an array of words. Let's say the string is:
"Hi!! I need to split this string, into a serie's of words?!"
At the moment I have tried String[] strs = str.split("(?!\\w)"), however it keeps symbols such as ! in the array and it also keeps strings like "Hi!" in the array. The string I am splitting will always be lowercase. What I would like is for an array to be produced that looks like:
{"hi", "i", "need", "to", "split", "this", "string", "into", "a", "serie's", "of", "words"} - Note the apostrophe is kept.
How could I change my regex to not include the symbols in the array?
Apologies, I would define a word as a sequence of alphanumeric characters only, but with the ' character included if it is used in the above context such as "it's", not if it is used to quote a word such as "'its'". Also, in this context "hi," or "hi-person" are not words but "hi" and "person" are. I hope that clarifies the question.

You can remove all the !, ? and , symbols and then split into words:
str = str.replaceAll("[!?,]", "");
String[] words = str.split("\\s+");
Result:
Hi, I, need, to, split, this, string, into, a, serie's, of, words

Should work for what you want.
String line = "Hi!! I need to split this string, into a serie's of words?! but not '' or ''' word";
String regex = "([^a-zA-Z']+)'*\\1*";
String[] split = line.split(regex);
System.out.println(Arrays.asList(split));
Gives
[Hi, I, need, to, split, this, string, into, a, serie's, of, words, but, not, or, word]

If you define a word as a sequence of non-whitespace characters (whitespace character as defined by \s), then you can split along space characters:
str.split("\\s+")
Note that ";.';.##$>?>#4", "very,bad,punctuation", and "'goodbye'" are words under the definition above.
Then the other approach is to define a word as a sequence of characters from a set of allowed characters. If you want to allow a-z, A-Z, and ' as part of a word, you can split along everything else:
str.split("[^a-zA-Z']+")
This will still allow "''''''" to be defined as a word, though.
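One way around the "''''''" problem is to match words instead of splitting on delimiters. The following is only a sketch under the asker's stated definition (alphanumerics, with ' allowed only between them); the class name WordExtractor, the method words, and the decision to lowercase the input are my own assumptions, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects words by matching them rather than splitting.
final class WordExtractor {
    // One or more letters/digits, optionally followed by groups of
    // "'" plus more letters/digits, so an apostrophe only counts inside a word.
    private static final Pattern WORD = Pattern.compile("[a-z0-9]+(?:'[a-z0-9]+)*");

    static List<String> words(String sentence) {
        List<String> result = new ArrayList<>();
        // Locale.ROOT keeps lowercasing independent of the default locale.
        Matcher m = WORD.matcher(sentence.toLowerCase(Locale.ROOT));
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

With this pattern "serie's" stays one word, "'its'" yields "its", "hi-person" yields "hi" and "person", and a run of bare apostrophes yields nothing.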

So what you want is to split on anything that is not a word character [a-zA-Z] and is not a '.
This regex will do that: "[^a-zA-Z']+"
There will be a problem if the string contains text that is quoted with '.
I usually use this page for testing my regexes:
http://www.regexplanet.com/advanced/java/index.html

I would use str.split("[\\s,?!]+"). You can add whatever character you want to split with inside the brackets [].

You could filter out the characters you deem as "non-word" characters:
String[] strs = str.split("[,!? ]+");

myString.replaceAll("[^a-zA-Z'\\s]","").toLowerCase().split("\\s+");
The replaceAll("[^a-zA-Z'\\s]","") call replaces every character that is not a-z, A-Z, ' or whitespace with nothing (""), and then toLowerCase makes all the characters returned by replaceAll lower case. Finally we split the string on whitespace. A more readable version:
myString = myString.replaceAll("[^a-zA-Z'\\s]","");
myString = myString.toLowerCase();
String[] strArr = myString.split("\\s+");

Related

Java regex - split string with leading special characters

I am trying to split a string that contains whitespaces and special characters. The string starts with special characters.
When I run the code, the first array element is an empty string.
String s = ",hm ..To?day,.. is not T,uesday.";
String[] sArr = s.split("[^a-zA-Z]+\\s*");
Expected result is ["hm", "To", "day", "is", "not", "T", "uesday"]
Can someone explain how this is happening?
Actual result is ["", "hm", "To", "day", "is", "not", "T", "uesday"]
Split is behaving as expected by splitting off a zero-length string at the start before the first comma.
To fix, first remove all splitting chars from the start:
String[] sArr = s.replaceAll("^([^a-zA-Z]*\\s*)*", "").split("[^a-zA-Z]+\\s*");
Note that I’ve altered the removal regex to trim any sequence of spaces and non-letters from the front.
You don’t need to remove from the tail because split discards empty trailing elements from the result.
I would simplify it by making it a two-step process rather than trying to achieve a pure regex split() operation:
s.replaceAll("[^a-zA-Z]+", " ").trim().split(" ")
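The strip-then-split idea from the answers above can be wrapped in a small helper. This is a sketch only; the names LeadingDelimiters and letterRuns are hypothetical. It also guards one case the one-liners do not: splitting an empty string returns an array containing a single empty string, so input made up entirely of delimiters needs an explicit check.

```java
// Hypothetical helper: splits into runs of letters without a leading empty element.
final class LeadingDelimiters {
    static String[] letterRuns(String s) {
        // Drop any non-letters at the very start so split() sees no leading delimiter.
        String trimmed = s.replaceFirst("^[^a-zA-Z]+", "");
        if (trimmed.isEmpty()) {
            // "".split(...) would return [""], so return an empty array instead.
            return new String[0];
        }
        // Trailing empty strings are discarded by split() itself.
        return trimmed.split("[^a-zA-Z]+");
    }
}
```

Called with ",hm ..To?day,.. is not T,uesday." this returns the seven expected elements, starting with "hm".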

Java Split regex

Given a string S, find the number of words in that string. For this problem a word is defined by a string of one or more English letters.
Note: Space or any of the special characters like ![,?._'#+] will act as a delimiter.
Input Format: The string will only contain lower case English letters, upper case English letters, spaces, and these special characters: ![,?._'#+].
Output Format: On the first line, print the number of words in the string. The words don't need to be unique. Then, print each word in a separate line.
My code:
Scanner sc = new Scanner(System.in);
String str = sc.nextLine();
String regex = "( |!|[|,|?|.|_|'|#|+|]|\\\\)+";
String[] arr = str.split(regex);
System.out.println(arr.length);
for(int i = 0; i < arr.length; i++)
System.out.println(arr[i]);
When I submit the code, it works for just over half of the test cases. I do not know what the test cases are. I'm asking for help with the Murphy's law. What are the situations where the regex I implemented won't work?
You don't escape some special characters in your regex. Let's start with []. Since you don't escape them, the part [|,|?|.|_|'|#|+|] is treated like a set of characters |,?._'#+. This means that your regex doesn't split on [ and ].
For example x..]y+[z is split to x, ]y and [z.
You can fix that by escaping those characters. That will force you to escape more of them and you end up with a proper definition:
String regex = "( |!|\\[|,|\\?|\\.|_|'|#|\\+|\\])+";
Note that instead of defining alternatives, you could use a set which will make your regex easier to read:
String regex = "[ !\\[,?._'#+\\]]+";
In this case you only need to escape [ and ].
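To see the difference on the x..]y+[z example, here is a small sketch that runs the question's regex next to a character-class version. The class and constant names are mine, and the character class is my reading of the delimiter list (space included), so treat it as an assumption rather than the answerer's exact pattern.

```java
// Hypothetical demo class comparing the two delimiter patterns.
final class DelimiterSetDemo {
    // The regex from the question: the unescaped [ ... ] forms a character class,
    // so the literal brackets themselves are never treated as delimiters.
    static final String ORIGINAL = "( |!|[|,|?|.|_|'|#|+|]|\\\\)+";

    // A character class listing the space and each delimiter; only [ and ] are escaped.
    static final String FIXED = "[ !\\[,?._'#+\\]]+";

    static String[] splitWith(String regex, String input) {
        return input.split(regex);
    }
}
```

splitWith(DelimiterSetDemo.ORIGINAL, "x..]y+[z") gives x, ]y and [z, while the FIXED pattern gives x, y and z.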
UPDATE:
There's also a problem with a leading special character (as in your example ".Hi?there[broski.]#####"). You need to split on it, but that produces an empty string in the results. I don't think there's a way to use the split function without producing it, but you can mitigate it by removing the leading group before splitting, using the same regex anchored to the start of the string (without the ^ anchor, replaceFirst would remove the first delimiter anywhere in the string and glue two words together):
String[] arr = str.replaceFirst("^" + regex, "").split(regex);

Splitting a string in Java using multiple delimiters

I have a string like
String myString = "hello world~~hello~~world";
I am using the split method like this
String[] temp = myString.split("~|~~|~~~");
I want the array temp to contain only the strings separated by ~, ~~ or ~~~.
However, the temp array thus created has length 5, the 2 additional 'strings' being empty strings.
I want it to ONLY contain my non-empty string. Please help. Thank you!
You should use quantifier with your character:
String[] temp = myString.split("~+");
String#split() takes a regex. ~+ will match 1 or more ~, so it will split on ~, or ~~, or ~~~, and so on.
Also, if you just want to split on ~, ~~, or ~~~, then you can limit the repetition by using {m,n} quantifier, which matches a pattern from m to n times:
String[] temp = myString.split("~{1,3}");
When you split it the way you are doing, it will split a~~b twice on ~, and thus the middle element will be an empty string.
You could also have solved the problem by reversing the order of your delimiter like this:
String[] temp = myString.split("~~~|~~|~");
That will first try to split on ~~~, then on ~~, before splitting on ~, and will work fine. But you should use the first approach.
Just turn the pattern around:
String myString = "hello world~~hello~~world";
String[] temp = myString.split("~~~|~~|~");
Try this:
myString.split("~~~|~~|~");
It will definitely work. In your code, what actually happens is that when ~ occurs for the first time, it counts as the first separator and the string is split at that point. So the regex never gets to match ~~ or ~~~ anywhere in your string, even though they are there. Like:
[hello world]~[]~[hello]~[]~[world]
The square brackets mark the 5 different string values the input is split into.
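A quick way to compare the three patterns discussed above is to print what each produces. This is just an illustrative sketch; TildeSplitDemo and show are made-up names.

```java
import java.util.Arrays;

// Hypothetical demo: renders the result of a split so empty strings are visible.
final class TildeSplitDemo {
    static String show(String input, String regex) {
        // Arrays.toString prints an empty element as nothing between two commas.
        return Arrays.toString(input.split(regex));
    }
}
```

For "hello world~~hello~~world", the pattern "~|~~|~~~" gives [hello world, , hello, , world], while both "~+" and "~~~|~~|~" give [hello world, hello, world]. The reordered alternation still breaks on longer runs: "a~~~~b" gives [a, , b] with "~~~|~~|~" but [a, b] with "~+", which is one reason to prefer the quantifier.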

java split string with regex

I want to split string by setting all non-alphabet as separator.
String[] word_list = line.split("[^a-zA-Z]");
But with the following input
11:11 Hello World
word_list contains many empty strings before "Hello" and "World".
Please kindly tell me why. Thank You.
Because your regular expression matches each individual non-alpha character. It would be like separating
",,,,,,Hello,World"
on commas.
You will want an expression that matches an entire sequence of non-alpha characters at once such as:
line.split("[^a-zA-Z][^a-zA-Z]*")
I still think you will get one leading empty string with your example since it would be like separating ",Hello,World" if comma were your separator.
Here's your string, where each ^ character shows a match for [^a-zA-Z]:
11:11 Hello World
^^^^^^     ^
The split method finds each of these matches, and basically returns all substrings between the ^ characters. Since there are six matches before any useful data, you end up with six empty substrings (one before the first match and one between each pair of adjacent matches) before you get the string "Hello".
To prevent this, you can manually filter the result to ignore any empty strings.
Will the following do?
String[] word_list = line.replaceAll("[^a-zA-Z ]","").replaceAll(" +", " ").trim().split("[^a-zA-Z]");
What I am doing here is removing all non-alphabet characters before doing the split and then replacing multiple spaces by a single space.
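The "filter the result" suggestion above can be done with a stream. A minimal sketch, assuming Java 8 or later; NonEmptyTokens is a hypothetical name, and I split on runs of non-letters so at most one empty string (the leading one) needs filtering.

```java
import java.util.Arrays;

// Hypothetical helper: splits on runs of non-letters and drops empty tokens.
final class NonEmptyTokens {
    static String[] split(String line) {
        return Arrays.stream(line.split("[^a-zA-Z]+"))
                // A leading delimiter (or an empty input) leaves an empty string behind.
                .filter(token -> !token.isEmpty())
                .toArray(String[]::new);
    }
}
```

For "11:11 Hello World" this returns just Hello and World.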

Matching strings and storing into an array with regex in java

I'm making a program that takes a file and finds identifiers. So far I have removed any words in quotes and any words that start with a number, and I have removed all the non-word characters.
Is there a way to find words that don't match words in an array and store those words into another array using regex? I can't figure it out. I was trying to use the split method but it's not working right when I try to split by spaces. This is what I did to split it:
String[] SplitString = newLine.split("[\\s]");
Use
String[] SplitString = newLine.split("\\s");
if you don't want to combine multiple spaces/tabs, etc., but use
String[] SplitString = newLine.split("\\s+");
if you do. For example, if your string is:
"a  b c"
the first will give you four tokens: "a", "", "b", and "c", and the second will give you three: "a", "b", and "c".
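Here is a small sketch of that difference, plus one more case worth knowing: even with \\s+, leading whitespace still produces an empty first token, so trimming first can help. The class and method names are invented for illustration.

```java
// Hypothetical demo of splitting on single whitespace characters vs. runs.
final class WhitespaceSplitDemo {
    // Splits at every single whitespace character.
    static String[] splitEach(String line) {
        return line.split("\\s");
    }

    // Splits at runs of whitespace.
    static String[] splitRuns(String line) {
        return line.split("\\s+");
    }

    // Trims first so leading whitespace does not yield an empty first token.
    static String[] splitTrimmed(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

With a string that has two spaces between a and b, splitEach returns four tokens and splitRuns returns three; for "  a b", splitRuns still returns an empty string first, while splitTrimmed does not.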
You can do it simply in one line by first removing the known words, then splitting:
String[] unknownWords = newLine.replaceAll("\\b(apple|orange|banana)\\b", "").split("\\s+");
Notes:
Your regex [\s] is equivalent to \s, so I simplified it
You should probably split on any number of spaces: \s+
\b means "word boundary" - this means the removal regex won't match applejack
The regex (A|B|C|etc) is the syntax for "OR" logic
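If the known words live in an array, as the question describes, the alternation can be built at run time. The following is a sketch under my own assumptions: the names UnknownWords and unknownWords are hypothetical, the array is assumed to be non-empty, and each known word is replaced with a space (rather than nothing) so neighbouring words are never joined.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: returns the words of a line that are not in knownWords.
final class UnknownWords {
    static String[] unknownWords(String line, String[] knownWords) {
        // Quote each known word so any regex metacharacters in it are taken literally.
        String alternatives = Arrays.stream(knownWords)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // \b on both sides keeps "apple" from matching inside "applejack".
        String remaining = line.replaceAll("\\b(?:" + alternatives + ")\\b", " ").trim();
        // An empty remainder would otherwise split into a single empty string.
        return remaining.isEmpty() ? new String[0] : remaining.split("\\s+");
    }
}
```

For the line "apple foo applejack banana bar" and the known words apple, orange and banana, this returns foo, applejack and bar.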
