Splitting a sentence when a hyphen is encountered in Java


I have the following code in my program. It splits a line when a hyphen is encountered and stores each word in the String array 'tokens'. But I want the hyphen to be stored in 'tokens' as well when it occurs in a sentence.
String[] tokens = line.split("-");
The above code splits the sentence but drops the hyphen from the resulting array.
What can I do to store the hyphen in the resulting array as well?

Edit:
It seems you want to split on both whitespace and hyphens while keeping only the hyphen in the array (as I infer from your line "stores each word in the String Array"). You can use this:
String[] tokens = "abc this is-a hyphen def".split("((?<=-)|(?=-))|\\s+");
System.out.println(Arrays.toString(tokens));
Output: -
[abc, this, is, -, a, hyphen, def]
To handle spaces before and after a hyphen, first remove those spaces with replaceAll, and then split:
String[] tokens = "abc this is - a hyphen def".replaceAll("[ ]*-[ ]*", "-")
        .split("((?<=-)|(?=-))|\\s+");
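The normalize-then-split approach can be wrapped in one helper. This is only a minimal sketch: the class name HyphenTokenizer and the method tokenize are hypothetical names chosen for illustration, and it assumes the input has no leading whitespace.

```java
import java.util.Arrays;

class HyphenTokenizer {
    // Hypothetical helper: splits on whitespace and keeps each hyphen as its own token.
    static String[] tokenize(String line) {
        return line
                .replaceAll("[ ]*-[ ]*", "-")     // drop spaces around every hyphen
                .split("((?<=-)|(?=-))|\\s+");    // split around '-' or on whitespace runs
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("abc this is - a hyphen def")));
        // prints [abc, this, is, -, a, hyphen, def]
    }
}
```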
Previous answer : -
You can use this: -
String[] tokens = "abc-efg".split("((?<=-)|(?=-))");
System.out.println(Arrays.toString(tokens));
OUTPUT : -
[abc, -, efg]
It splits at the zero-width positions immediately before and after the hyphen (-), so the hyphen itself is kept as a separate token.

I suggest using a regular expression together with Java's Pattern and Matcher classes. Example:
String line = "a-b-c-d-e-f-";
Pattern p = Pattern.compile("[^-]+|-");
Matcher m = p.matcher(line);
while (m.find())
{
String match = m.group();
System.out.println("match:" + match);
}
To test your regular expression, you could use an online regex tester.
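If you need the tokens in a collection rather than printed, the same loop can fill a list. A minimal sketch, assuming the same pattern as above; the class name HyphenMatcher and the method tokens are made up for this example. Note that this pattern does not split on whitespace, so spaces stay inside the non-hyphen tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenMatcher {
    // Either a run of non-hyphen characters, or a single hyphen.
    private static final Pattern TOKEN = Pattern.compile("[^-]+|-");

    // Hypothetical helper: collects every match into a list, hyphens included.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a-b-c-"));
        // prints [a, -, b, -, c, -]
    }
}
```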

Related

Tokenize words separated by non-word characters except single quote

I have the following method I'm trying to implement: parses the input into “word tokens”: sequences of word characters separated by non-word characters. However, non-word characters can become part of a token if they are quoted (in single quotes).
I want to use regex but have trouble getting my code just right:
public static List<String> wordTokenize(String input) {
Pattern pattern = Pattern.compile ("\\b(?:(?<=\')[^\']*(?=\')|\\w+)\\b");
Matcher matcher = pattern.matcher (input);
ArrayList ans = new ArrayList();
while (matcher.find ()){
ans.add (matcher.group ());
}
return ans;
}
My regex fails to recognize that a quote opening in the middle of a word, with no space before it, does not start a new token. Examples:
The input: this-string 'has only three tokens' // works
The input:
"this*string'has only two#tokens'"
Expected :[this, stringhas only two#tokens]
Actual :[this, string, has only two#tokens]
The input: "one'two''three' '' four 'twenty-one'"
Expected :[onetwothree, , four, twenty-one]
Actual :[one, two, three, four, twenty-one]
How do I fix the spaces?

You want to match one or more occurrences of a word char or a substring between the closest single straight apostrophes, and remove all those apostrophes from the tokens.
Use the following regex and .replace("'", "") on the matches:
(?:\w|'[^']*')+
See the regex demo. Details:
(?: - start of a non-capturing group
\w - a word char
| - or
' - a straight single quotation mark
[^']* - any 0+ chars other than a straight single quotation mark
' - a straight single quotation mark
)+ - end of the group, 1+ occurrences.
See the Java demo:
// String s = "this*string'has only two#tokens'"; // => [this, stringhas only two#tokens]
String s = "one'two''three' '' four 'twenty-one'"; // => [onetwothree, , four, twenty-one]
Pattern pattern = Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = pattern.matcher(s);
List<String> tokens = new ArrayList<>();
while (matcher.find()){
tokens.add(matcher.group(0).replace("'", ""));
}
Note that Pattern.UNICODE_CHARACTER_CLASS is added so that the \w pattern matches all Unicode letters and digits.

.split() and [\\W] creates an additional empty string?

I'm creating a small program to split a string into tokens (runs of consecutive English alphabet characters), then output the number of tokens as well as the tokens themselves. The problem is that an extra empty string element is created wherever there is a comma followed by a space.
I've read up on regular expressions and understand that \W is anything that is not a word character.
String str = sc.nextLine();
// creating an array of tokens
String tokens[] = str.split("[\\W]");
int len = tokens.length;
System.out.println(len);
for (int i = 0; i < len; i++) {
System.out.println(tokens[i]);
}
Input:
Hello, World.
Expected output:
2
Hello
World
Actual output:
3
Hello
World
Note: this is my first stack overflow post, if I've done anything wrong please let me know, thanks

Try str.split("\\W+")
It means one or more non-word characters.
\W matches only one character, so it splits at the comma and then splits again at the space.
That's why it gives you back an extra empty string.
\W+ matches ", " as a single delimiter, so it splits only once and you get back only the tokens. (It works for any number of tokens, not just two, so "hello, world, again" gives you [hello, world, again].)
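To see the difference side by side, here is a small sketch using the input from the question (the class name NonWordSplitDemo is just for illustration):

```java
import java.util.Arrays;

class NonWordSplitDemo {
    public static void main(String[] args) {
        String str = "Hello, World.";

        // One delimiter per non-word char: the comma and the space each split once,
        // leaving an empty string between them.
        System.out.println(Arrays.toString(str.split("\\W")));   // prints [Hello, , World]

        // A whole run of non-word chars is treated as one delimiter.
        System.out.println(Arrays.toString(str.split("\\W+")));  // prints [Hello, World]
    }
}
```

The trailing "." does not add an element in either case, because split() discards trailing empty strings by default.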

If you use .split("\\W") you will get empty items if:
- non-word chars appear at the start of the string
- non-word chars appear in succession: \W matches one non-word char and splits the string, and then the next non-word char splits it again, producing an empty string.
There are two ways out.
Either remove all non-word chars at the start and then split with \W+:
String tokens[] = str.replaceFirst("^\\W+", "").split("\\W+");
Or, match the chunks of word chars with \w+ pattern:
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(" abc=-=123");
List<String> tokens = new ArrayList<>();
while(m.find()) {
tokens.add(m.group());
}
System.out.println(tokens);
See the online demo.
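The leading-delimiter case can be checked with a short sketch like the one below. The class name LeadingEmptyDemo and the helper words are hypothetical; the helper simply applies the replaceFirst-then-split option from above.

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    // Hypothetical helper: strip leading non-word chars, then split on non-word runs.
    static String[] words(String s) {
        return s.replaceFirst("^\\W+", "").split("\\W+");
    }

    public static void main(String[] args) {
        String s = " abc=-=123";

        // The leading space is a delimiter, so the first element is an empty string.
        System.out.println(Arrays.toString(s.split("\\W+")));  // prints [, abc, 123]

        // With the leading delimiter removed first, only the tokens remain.
        System.out.println(Arrays.toString(words(s)));         // prints [abc, 123]
    }
}
```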

Try this
Scanner inputter = new Scanner(System.in);
System.out.print("Please enter your thoughts : ");
final String words = inputter.nextLine();
final String[] tokens = words.split("\\W+");
Arrays.stream(tokens).forEach(System.out::println);

RegEx: Matching n-char long sequence of repeating character

I want to split off the leading part of a text string that might look like this:
(((Hello! --> ((( and Hello!
or
########No? --> ######## and No?
At the beginning I have the same special character n times, and I want to match the longest possible sequence of it.
What I have at the moment is this regex:
([^a-zA-Z0-9])\\1+([a-zA-Z].*)
This one would return for the first example
( (only 1 time) and Hello!
and for the second
# and No?
How do I tell the regex that I want the longest possible repetition of the matching character?
I am using RegEx as part of a Java program in case this matters.

I suggest the following solution with two regexes. First, (?s)(\\W)\\1+\\w.* checks whether the string starts with the same non-word symbol repeated. If it does, split with the simple (?<=\\W)(?=\\w) pattern (the position between a non-word and a word character); otherwise, just return a list containing the whole string, unsplit:
String ptrn = "(?<=\\W)(?=\\w)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
if (str.matches("(?s)(\\W)\\1+\\w.*")) {
System.out.println(Arrays.toString(str.split(ptrn)));
}else { System.out.println(Arrays.asList(str)); }
}
See IDEONE demo
Result:
[(((, Hello!]
[########, No?]
[$%^&^Hello!]
Also, your original regex can be modified to fit the requirement like this:
String ptrn = "(?s)((\\W)\\2+)(\\w.*)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
Pattern p = Pattern.compile(ptrn);
Matcher m = p.matcher(str);
if (m.matches()) {
System.out.println(Arrays.asList(m.group(1), m.group(3)));
}
else {
System.out.println(Arrays.asList(str));
}
}
See another IDEONE demo
That regex matches:
(?s) - DOTALL inline modifier (if the string has newline characters, .* will also match them).
((\\W)\\2+) - Capture group 1 matching and capturing into Group 2 a non-word character followed by the same character (since a backreference \2 is used) 1 or more times.
(\\w.*) - matches and captures into Group 3 a word character followed by zero or more characters.
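As a rough illustration of why the pattern from the question reports only a single character: the repetition \1+ sits outside the capturing group there, so group 1 holds just one symbol, whereas wrapping the whole run in an outer group captures all of it. The class name RepeatedPrefixDemo and the helper splitPrefix are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedPrefixDemo {
    private static final Pattern ORIGINAL = Pattern.compile("([^a-zA-Z0-9])\\1+([a-zA-Z].*)");
    private static final Pattern WRAPPED = Pattern.compile("(?s)((\\W)\\2+)(\\w.*)");

    // Hypothetical helper: returns {prefix, rest}, or {input} if the pattern does not match.
    static String[] splitPrefix(String input) {
        Matcher m = WRAPPED.matcher(input);
        return m.matches() ? new String[]{m.group(1), m.group(3)} : new String[]{input};
    }

    public static void main(String[] args) {
        String input = "(((Hello!";

        Matcher m = ORIGINAL.matcher(input);
        if (m.matches()) {
            System.out.println(m.group(1));  // prints (
        }

        String[] parts = splitPrefix(input);
        System.out.println(parts[0] + " and " + parts[1]);  // prints ((( and Hello!
    }
}
```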

splitting string and keep characters (regex pattern)

I would like to split a String, but I am struggling with the regex pattern.
I need to split a string like this: Hi I want "to split" this (String) into a String array like this:
String[] array = {"Hi", "I", "want", "\"", "to", "split", "\"", "this", "(", "String", ")"};
This is what I have tried, but it deletes the delimiter.
public static void main(String[] args) {
String string = "Hi \"why should\" (this work)";
String[] array;
array = string.split("\\s"
+ "|\\s(?=\")"
+ "|\\w(?=\")"
+ "|\"(?=\\w)"
+ "|\\s(?=\\()"
+ "|\\w(?=\\))"
+ "|\\((?=\\w)");
for (String str : array) {
System.out.println(str);
}
}
Result:
Hi
why
shoul
"
this
wor
)

You can match the tokens with the regex \w+|[^\w\s], assuming that you want each punctuation character to end up in its own token:
String input = "Hi I want \"to split\" this (String).";
Matcher matcher = Pattern.compile("\\w+|[^\\w\\s]").matcher(input);
List<String> out = new ArrayList<>();
while (matcher.find()) {
out.add(matcher.group());
}
The output ArrayList contains:
[Hi, I, want, ", to, split, ", this, (, String, ), .]
You might want to use the (?U) flag to make \w and \s follow the Unicode definitions of word and whitespace characters. By default, \w and \s only recognize word and whitespace characters in the ASCII range.
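To illustrate the effect of (?U), here is a small sketch that runs the same pattern with and without the flag on a string containing accented letters. The class name UnicodeTokenDemo, the helper tokens, and the sample input are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeTokenDemo {
    // Hypothetical helper: collects all matches of the given regex.
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "H\u00e9llo w\u00f6rld!";  // "Hello world!" with e-acute and o-umlaut

        // ASCII-only \w: each accented letter is treated as punctuation and breaks the word.
        System.out.println(tokens("\\w+|[^\\w\\s]", input).size());      // prints 7

        // With (?U), the accented letters count as word characters.
        System.out.println(tokens("(?U)\\w+|[^\\w\\s]", input).size());  // prints 3
    }
}
```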
For the sake of completeness, here is the solution in split(), which works on Java 8 and above. There will be an extra empty string at the beginning in Java 7.
String tokens[] = input.split("\\s+|(?<![\\w\\s])(?=\\w)|(?<=\\w)(?![\\w\\s])|(?<=[^\\w\\s])(?=[^\\w\\s])");
The regex is rather complex, since the zero-width splits between a punctuation character and a word character need to avoid the positions already split by \s+.
Since the regex in the split solution is quite a mess, please use the match solution instead.

What language are you trying to write this in?
You could write regex groups something like: (.+)(\s)
This would match any quantity of characters followed by a space

How can I split a string except when the delimiter is protected by quotes or brackets?

I asked How to split a string with conditions. Now I know how to ignore the delimiter if it is between two characters.
How can I check multiple groups of two characters instead of one?
I found Regex for splitting a string using space when not surrounded by single or double quotes, but I don't understand where to change '' to []. Also, it works with two groups only.
Is there a regex that will split using , but ignore the delimiter if it is between "" or [] or {}?
For instance:
// Input
"text1":"text2","text3":"text,4","text,5":["text6","text,7"],"text8":"text9","text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}
// Output
"text1":"text2"
"text3":"text,4"
"text,5":["text6","text,7"]
"text8":"text9"
"text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}

You can use:
String text = "\"text1\":\"text2\",\"text3\":\"text,4\",\"text,5\":[\"text6\",\"text,7\"],\"text8\":\"text9\",\"text10\":{\"text11\":\"text,12\",\"text13\":\"text14\",\"text,15\":[\"text,16\",\"text17\"],\"text,18\":\"text19\"}";
String[] toks = text.split("(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(?![^{]*})(?![^\\[]*\\]),+");
for (String tok: toks)
System.out.printf("%s%n", tok);
OUTPUT:
"text1":"text2"
"text3":"text,4"
"text,5":["text6","text,7"]
"text8":"text9"
"text10":{"text11":"text,12","text13":"text14","text,15":["text,16","text17"],"text,18":"text19"}
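As an alternative to the regex (not the approach used above), the same split can be done with a small hand-written scanner that tracks whether it is inside double quotes and how deeply it is nested in [] or {}. This is only a sketch under stated assumptions: quotes are never escaped inside strings, brackets are balanced, and the names TopLevelSplitter and splitTopLevel are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelSplitter {
    // Splits on commas that are outside double quotes and outside any [] or {} nesting.
    // Assumes no escaped quotes inside strings and balanced brackets.
    static List<String> splitTopLevel(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int depth = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && (c == '[' || c == '{')) {
                depth++;
            } else if (!inQuotes && (c == ']' || c == '}')) {
                depth--;
            }
            if (c == ',' && !inQuotes && depth == 0) {
                parts.add(current.toString());  // top-level comma: close the current part
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());  // last part has no trailing comma
        return parts;
    }

    public static void main(String[] args) {
        String text = "\"a\":\"b,c\",\"d\":[\"e\",\"f,g\"],\"h\":{\"i\":\"j,k\"}";
        for (String part : splitTopLevel(text)) {
            System.out.println(part);  // one top-level entry per line
        }
    }
}
```

Unlike the lookahead-based regex, this makes a single pass over the input and its behavior on nested brackets is easier to reason about, at the cost of a few more lines of code.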
