Splitting a sentence

Splitting a sentence - java

I'm trying to split a string. Runs of characters such as !!!, ?? or ... denote the end of a sentence, so I want anything after them to go on a new line. For example, the sentence hey.. hello split !!! example me. should be turned into:
hey..
hello split !!!
example me.
What I tried:
String myStr= "hey.. hello split !!! example me.";
String [] split = myStr.split("(?<=\\.{2,})");
This works fine when I have multiple dots but doesn't work for anything else. I can't add exclamation marks to this expression either: "(?<=[\\.{2,}!{2,}])" splits after each dot and each exclamation mark. Is there any way to combine those?
Ideally I wanted the app to split after a SINGLE dot too (anything that denotes the end of the sentence), but I don't think this is possible in a single pass. Thanks

You can do it like this:
String [] split = myStr.split("(?<=([?!.])\\1+)");
or
String [] split = myStr.split("(?<=([?!.])\\1{1,99})");
This captures the first character from the list [?.!] and expects the same character to be repeated one or more times. If so, the split occurs right after the run.
or
String[] split = s.split("(?<=\\.{2,}+)|(?<=\\?{2,}+)|(?<=!{2,}+)");

Ideally I wanted the app to split after a SINGLE dot too (anything that denotes the end of the sentence)
To do this, you first have to decide which cases you consider to be the end of a sentence. Multiple special symbols are not a standard way of ending a sentence (as far as I know).
But if you want to account for careless or malicious users whose input makes runs of special symbols look like the end of a sentence, then at least make a list of such cases before you proceed.
In your situation you want to split the string on multiple special symbols. A lookbehind with a backreference won't be of much help because, as Wiktor noted:
The problem is in the backreference whose length is not known from the start.
So we need to find the zero-width position where the split should happen, and the following regex does that.
Regex:
(?<=[.!?])(?=[^.!?])
(?<=[.!?]) (?=[^.!?])
Note the space between the two assertions in the second regex. Use it if you want to consume the space so that the next line does not start with one.
Explanation:
This splits at the zero-width position that is preceded by a special symbol and followed by a character that is not one.
hey..¦ hello split !!!¦ example me. ( ¦ denotes the zero-width)
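As a rough illustration of the second (space-consuming) regex above, here is a minimal sketch. The class name SentenceSplitSketch and the helper splitSentences are made up for this example and do not come from the answer.

```java
import java.util.Arrays;

// Hypothetical demo class; only the regex is taken from the answer above.
class SentenceSplitSketch {
    // Split at a space that follows a terminator (. ! ?) and precedes a
    // non-terminator. The space is consumed, so no part starts with one.
    static String[] splitSentences(String text) {
        return text.split("(?<=[.!?]) (?=[^.!?])");
    }

    public static void main(String[] args) {
        String[] parts = splitSentences("hey.. hello split !!! example me.");
        // Prints:
        // hey..
        // hello split !!!
        // example me.
        Arrays.stream(parts).forEach(System.out::println);
    }
}
```

Because the lookbehind only inspects one character, a single dot followed by a space also triggers a split, which covers the "single dot" wish from the question. The same property means abbreviations such as "Inc. " would be split as well.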

A lookbehind, combined with a negative lookahead to prevent a split inside the run of punctuation:
String[] lines = s.split("(?<=[?!.]{2,3})(?![?!.])");
Some test code:
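import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String s = "hey..hello split !!!example me.";
        // Split after 2-3 terminators, but only where the run of terminators ends.
        String[] lines = s.split("(?<=[?!.]{2,3})(?![?!.])");
        Arrays.stream(lines).forEach(System.out::println);
    }
}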
Output:
hey..
hello split !!!
example me.

Related

Is it possible to split by a substring without removing it?

I am trying to do the following:
I have some strings that I need to separate. They have this form:
node:info:sequence(id:ASDF,LMD)
node:info:sequence:id:QWES
Those are the possible individual string formats.
Now I have to separate them when they come concatenated by commas, like this:
node:info:sequence(id:ASDF,LMD),node:info:sequence:id:QWES
So I tried
entries.split(",node");
Which kind of works, but of course it cuts the "node" part off the next string. Is there any way I can detect a comma followed by node but split on the comma only?

You may use
s.split(",(?=node\\b)")
The positive lookahead (?=node\b) makes sure that only commas followed by the whole word node are matched (\b is a word boundary). The lookahead does not consume node, so it stays at the start of the next element.
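A small sketch of that split applied to the sample input from the question. The class and method names (NodeSplitSketch, splitEntries) are placeholders chosen for this example.

```java
import java.util.Arrays;

// Hypothetical demo class; the regex is the one suggested in the answer.
class NodeSplitSketch {
    // Only a comma that is directly followed by the whole word "node" is a
    // delimiter. The lookahead keeps "node" in the following element.
    static String[] splitEntries(String entries) {
        return entries.split(",(?=node\\b)");
    }

    public static void main(String[] args) {
        String entries = "node:info:sequence(id:ASDF,LMD),node:info:sequence:id:QWES";
        // Prints:
        // node:info:sequence(id:ASDF,LMD)
        // node:info:sequence:id:QWES
        Arrays.stream(splitEntries(entries)).forEach(System.out::println);
    }
}
```

The comma inside (id:ASDF,LMD) is left alone because it is followed by LMD rather than node.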

Split on period but ignore some words which contains period in java

I want to split a document into paragraphs. For that, I am using:
String paragraphs[] = documentData.split("\\.\n");
But this removes .\n from the actual document, and I don't want to lose those tokens. Also, words like Inc., etc. and Jr. should not be split by the regular expression.

Here is a very basic script just to point you in the right direction:
String input = "Sentence one in paragraph one. Sentence two in paragraph one.\n Sentence one in paragraph two. Sentence two in paragraph two.";
String[] parts = input.split("(?<=\\.\n)\\s*");
for (String part : parts) {
System.out.println(part);
}
One way to split on a dot without consuming it is to use lookarounds. Lookarounds match but do not consume, which makes them ideal for what you have in mind. In this case, I am splitting on the following pattern:
(?<=\.\n)\s*
This asserts that what comes before the current position is a full stop followed by a newline. It then consumes any whitespace which might separate paragraphs. The loop prints each resulting paragraph to the console.
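The script above does not cover the abbreviation part of the question. One possible extension, offered only as a sketch, is to add a negative lookbehind that lists abbreviations which should not end a paragraph. The list (Inc, etc, Jr) comes from the question; the class name, method name and sample text are assumptions made for this example.

```java
// Hypothetical demo class; the abbreviation list is taken from the question.
class ParagraphSplitSketch {
    // Split after ".\n", unless the full stop belongs to a listed abbreviation.
    // Both lookbehinds have a bounded length, which Java requires.
    static String[] splitParagraphs(String document) {
        return document.split("(?<=\\.\n)(?<!\\b(?:Inc|etc|Jr)\\.\n)\\s*");
    }

    public static void main(String[] args) {
        String document = "Acme Inc.\nmakes tools.\nSecond paragraph.";
        String[] paragraphs = splitParagraphs(document);
        // The break after "Inc." is ignored, so two paragraphs remain.
        System.out.println(paragraphs.length);
        for (String paragraph : paragraphs) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

A fixed list like this only catches the abbreviations you name, so it is worth testing against real documents before relying on it.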

Regex - Split String on `:` but not inside if statements

I am unable to figure out a solution myself on this one. Is it even possible in regex?
I have this sample:
...:...:If...Then...:...:Else...:...:...:IfEnd:...
I want to split my string on colons, but not the colons inside my if-statements.
Something like:
...
...
If...Then...:...:Else...:...:...:IfEnd
...
I searched other questions, but all of them have one-character delimiters, which you can handle with [^set]. Is this case possible with regex?
I don't even have a half-working regex solution because nothing I tried worked.
In case you want to know what I'm doing: I'm attempting to make an application that parses a script. The system will separate statements and parse each one individually. But splitting inside an if-statement, while-loop or for-loop is not going to work properly, for obvious reasons. Is my approach unconventional, or simply wrong?

Assuming your input doesn't have newlines, you can use:
String str = "...:...:If...Then...:...:Else...:...:...:IfEnd:...";
String[] toks = str.replaceAll("(\\bIf\\b.*?\\bIfEnd\\b):?|:", "$1\n").split("\\n+");
for (String tok: toks) {
System.err.printf("%s%n", tok);
}
Output:
...
...
If...Then...:...:Else...:...:...:IfEnd
...
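This regex first tries to match the text from If to IfEnd (plus an optional trailing colon) and captures the block in group #1; otherwise it matches a single colon. The replacement writes back group #1 followed by \n, so a bare colon, for which group #1 is empty, simply becomes a newline.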
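A different way to get the same tokens, shown here only as a sketch, is to match the statements directly with Matcher.find() instead of replacing and then splitting. This is not the approach from the answer. The class name StatementTokenizerSketch and the method tokenize are invented for the example, and nested If blocks are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; an alternative to the replaceAll + split approach.
class StatementTokenizerSketch {
    // Either a whole If...IfEnd block (lazy, so it stops at the first IfEnd)
    // or a run of characters that contains no colon.
    private static final Pattern STATEMENT =
            Pattern.compile("\\bIf\\b.*?\\bIfEnd\\b|[^:]+");

    static List<String> tokenize(String script) {
        List<String> statements = new ArrayList<>();
        Matcher matcher = STATEMENT.matcher(script);
        while (matcher.find()) {
            statements.add(matcher.group());
        }
        return statements;
    }

    public static void main(String[] args) {
        String script = "...:...:If...Then...:...:Else...:...:...:IfEnd:...";
        // Prints four lines; the third is the complete If...IfEnd block.
        tokenize(script).forEach(System.out::println);
    }
}
```

Empty statements between two consecutive colons are skipped, which matches the behaviour of split("\\n+") in the answer.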

Regular Expression to find words separated with space, backtracking

I have to find words separated by spaces. What is the best practice to do it with the least backtracking?
I found this solution:
Regex: \d+\s([a-zA-Z]+\s{0,1}){1,} in a sentence
Input: 1234 this is words in a sentence
So "this is words" has to be matched by ([a-zA-Z]+\s{0,1}){1,}, and "in a sentence" has to be matched by the constant words in the regex.
But in this case regex101.com reports 4156 steps in the debugger, which is catastrophic backtracking. Is there any way to avoid it?
I have another, more complicated example, where it takes 86000 steps and does not validate.
The main problem is that I have to find all words separated by spaces, but at the same time the regex contains constant words separated by spaces. This is where I get catastrophic backtracking.
I have to do this using Java.

You want to find words separated by spaces, so you should require one or more whitespace characters after each word. You can use this instead, which takes just 37 steps:
\d+\s([a-zA-Z]+\s+)+in a sentence
See demo:
https://regex101.com/r/tD0dU9/4
For Java, double every backslash, i.e. \d becomes \\d.
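For reference, this is roughly how the pattern looks once the backslashes are doubled for a Java string literal. The class name WordsBeforeConstantSketch and the method isValid are placeholders for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the one from the answer above.
class WordsBeforeConstantSketch {
    // Every word must be followed by at least one whitespace character, so the
    // repeated group cannot carve up the same text in many different ways.
    private static final Pattern LINE =
            Pattern.compile("\\d+\\s([a-zA-Z]+\\s+)+in a sentence");

    static boolean isValid(String input) {
        return LINE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1234 this is words in a sentence")); // true
        System.out.println(isValid("1234 in a sentence"));               // false: no word before the constant
    }
}
```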

You could try splitting the String into a String array, then counting the members of the array that match your definition of a word (i.e. skipping whitespace or punctuation).
int words = 0;
String[] mySplitString = myOriginalString.split(" ");
for (int x = 0; x < mySplitString.length; x++) {
    // Replace "\\w.*" with your own regex for a word.
    if (mySplitString[x].matches("\\w.*")) words++;
}
mySplitString is an array of Strings split from the original string. The separating spaces are removed, and the substrings before, after, or between them are placed into the new String array. The for-loop runs through the array, checks that each member starts with a word character (a letter, digit or underscore), and adds it to the total word count.

If I understood correctly, you want to match any word separated by spaces, plus the phrase "in a sentence".
You can try the following solution:
(in a sentence)|(\S+)
As seen in the example on regex101, the regex matches in 61 steps.
You might have problems with punctuation after the "in a sentence" phrase, so run some tests.
I hope this was helpful.
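A minimal sketch of how that alternation could be used from Java with Matcher.find(). The class name PhraseOrWordSketch and the method tokens are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the one from the answer above.
class PhraseOrWordSketch {
    // The constant phrase is tried first; otherwise one run of
    // non-whitespace characters is taken as a word.
    private static final Pattern TOKEN = Pattern.compile("(in a sentence)|(\\S+)");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [1234, this, is, words, in a sentence]
        System.out.println(tokens("1234 this is words in a sentence"));
    }
}
```

Each alternative can only match in one way at a given position, so there is little for the engine to backtrack over.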

java regex tricky pattern

I've been stuck for a while on a regex that does the following:
split my sentences with this: "[\W+]"
but if it finds a word like this: "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), the word isn't split but kept whole.
Basically, I want to split a sentence into words, but also treat "aaa-aa" as a word. I have successfully done that by creating two separate functions, one for splitting with \w and another to find words like "aaa-aa". Finally, I add both results and subtract the parts of each compound word.
For example, the sentence:
"Hello my-name is Richard"
First I collect {Hello, my, name, is, Richard}
then I collect {my-name}
then I add {my-name} to {Hello, my, name, is, Richard}
then I take out {my} and {name} from {Hello, my, name, is, Richard}.
Result: {Hello, my-name, is, Richard}
This approach does what I need, but for parsing large files it becomes too heavy, because too many copies are needed for each sentence. So my question is: is there anything I can do to include everything in one pattern? Like:
"split the text using this pattern "[\W+]", but if you find a word like "aaa-aa", consider it one word and not two words."

If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want:
[\s-]{2,}|\s
To break that down, you first split on two or more whitespaces and/or hyphens - so a single '-' won't match so 'one-two' will be left alone but something like 'one--two', 'one - two' or even 'one - --- - two' will be split into 'one' and 'two'.
That still leaves the 'normal' case of a single whitespace - 'one two' - unmatched, so we add an or ('|') followed by a single whitespace (\s).
Note that the order of the alternatives is important - RE subexpressions separated by '|' are evaluated left-to-right so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, when presented with something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
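Here is a short sketch of that split in Java, using the sentence from the question plus a few of the edge cases described above. The class name HyphenAwareSplitSketch and the method splitWords are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; the regex is the one from the answer above.
class HyphenAwareSplitSketch {
    // First alternative: a run of two or more whitespace/hyphen characters.
    // Second alternative: a single whitespace character.
    static String[] splitWords(String text) {
        return text.split("[\\s-]{2,}|\\s");
    }

    public static void main(String[] args) {
        // Prints: [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(splitWords("Hello my-name is Richard")));
        // Prints: [one, two, one, two, one-two]
        System.out.println(Arrays.toString(splitWords("one - two one--two one-two")));
    }
}
```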
If you want to interactively play around with Java REs I can thoroughly recommend http://myregexp.com/signedJar.html which allows you to edit the RE and see it matching against a sample string as you edit the RE.

Why not use the pattern \\s+? This does exactly what you want without any tricks: it splits the text into words separated by whitespace.

Your description isn't clear enough, but why not just split it up by spaces?

I am not sure whether this pattern works, because I don't have developer tools for Java, but you might try it. It uses character class subtraction, which Java expresses with the && intersection operator:
[\W&&[^-]]+
It matches characters that are in both [\W] and [^-], that is, non-word characters other than the hyphen.
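Since the pattern above was posted untested, here is a small sketch that tries it with String.split. The class name IntersectionSplitSketch, the method splitWords and the sample sentence with extra punctuation are assumptions for this example.

```java
import java.util.Arrays;

// Hypothetical demo class; the regex is the one from the answer above.
class IntersectionSplitSketch {
    // Delimiter: one or more non-word characters, excluding the hyphen.
    static String[] splitWords(String text) {
        return text.split("[\\W&&[^-]]+");
    }

    public static void main(String[] args) {
        // Prints: [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(splitWords("Hello, my-name is Richard!")));
        // Prints: [aaa, -, aa] because a free-standing hyphen is kept as its own token
        System.out.println(Arrays.toString(splitWords("aaa - aa")));
    }
}
```

Note that this keeps every hyphen, so "aaa--aaa-aa" stays in one piece and "aaa - aa" yields a lone "-" token, which may not be what the question asks for.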

Almost the same regular expression as in your previous question:
String sentence = "Hello my-name is Richard";
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
System.out.println(matcher.group());
}
I just added the optional group (...)? so that non-hyphenated words also match.
