How to substring before nth occurence of a separator?

How to substring before nth occurence of a separator? - java

first;snd;3rd;4th;5th;6th;...
How can I split the above after the third occurence of the ; separator? Especially without having to value.split(";") the whole string as an array, as I won't need the values separated. Just the first part of the string up until nth occurence.
Desired output would be:
first;snd;3rd.
I just need that as a string substring, not as split separated values.

Use StringUtils.ordinalIndexOf() from Apache
Finds the n-th index within a String, handling null. This method uses String.indexOf(String).
Parameters:
str - the String to check, may be null
searchStr - the String to find, may be null
ordinal - the n-th searchStr to find
Returns:
the n-th index of the search String, -1 (INDEX_NOT_FOUND) if no match or null string input
Or this way, no libraries required:
public static int ordinalIndexOf(String str, String substr, int n) {
int pos = str.indexOf(substr);
while (--n > 0 && pos != -1)
pos = str.indexOf(substr, pos + 1);
return pos;
}

I would go with this, easy and basic:
String test = "first;snd;3rd;4th;5th;6th;";
int result = 0;
for (int i = 0; i < 3; i++) {
result = test.indexOf(";", result) +1;
}
System.out.println(test.substring(0, result-1));
Output:
first;snd;3rd
You can ofc change the 3 in the loop with the number of arguments you need

If you want to use regular expressions, it is pretty straightforward:
import re
value = "first;snd;3rd;4th;5th;6th;"
reg = r'^([\w]+;[\w]+;[\w]+)'
re.match(reg, value).group()
Outputs:
"first;snd;3rd"
More options here .

You could use a regex that uses a negated character class to match from the start of the string not a semicolon.
Then repeat a grouping structure 2 times that matches a semicolon followed by not a semicolon 1+ times.
^[^;]+(?:;[^;]+){2}
Explanation
^ Assert the start of the string
[^;]+ Negated character class to match not a semicolon 1+ times
(?: Start non capturing group
;[^;]+ Match a semicolon and 1+ times not a semi colon
){2} Close non capturing group and repeat 2 times
For example:
String regex = "^[^;]+(?:;[^;]+){2}";
String string = "first;snd;3rd;4th;5th;6th;...";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(matcher.group(0)); // first;snd;3rd
}
See the Java demo

If you don't want to use split, just use indexOf in a for loop to know the index of the 3rd and 4th ";" then do a substring between these index.
Also you can do a split with a regex that match the 3rd ; but it's probably not the best solution.

If you need to do this frequently it is best to compile the regex upfront in a static Pattern instance:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class NthOccurance {
static Pattern pattern=Pattern.compile("^(([^;]*;){3}).*");
public static void main(String[] args) {
String in="first;snd;3rd;4th;5th;6th;";
Matcher m=pattern.matcher(in);
if (m.matches())
System.out.println(m.group(1));
}
}
Replace the '3' by the number of elements you want.

Below code find index of 3rd occurence of ';' character and make substring.
String s = "first;snd;3rd;4th;5th;6th;";
String splitted = s.substring(0, s.indexOf(";", s.indexOf(";", s.indexOf(";") + 1) + 1));

Related

Splitting a String that has a particular structure

I have a string that goes something like this
"330 Daniel T92435"
Now I need to obtain the name "Daniel", and I could simply just type
string.substring(4,11);
But the position where a name ("Daniel") is placed could vary.
And I don't want to use the split[] method.
I was thinking if there was a way to make the substring method read data until a whitespace is found.

If input string always has the following string structure "someSymbols Name someSymbols" you can use the following regular expression to extract the name:
"[^\\s]+\\s+(\\p{Alpha}+)\\s+[^\\s]+"
\\p{Alpha} - alphabetic character;
\\s - white space;
[^\\s] - any symbol apart from the white space.
In the code below Pattern is as object representing the regular expression. In turn, Matcher is a special object that is responsible for navigation over the given string and allows discovering the parts of this string that match the pattern.
public static String findName(String source) {
Pattern pattern = Pattern.compile("[^\\s]+\\s+(\\p{Alpha}+)\\s+[^\\s]+");
Matcher matcher = pattern.matcher(source);
String result = "no match was found";
if (matcher.find()) {
result = matcher.group(1); // group 1 corresponds to the first element enclosed in parentheses (\\p{Alpha}+)
}
return result;
}
main()
public static void main(String[] args) {
System.out.println(findName("330 Daniel T92435"));
}
Output
Daniel

You can use the str.indexOf(" ") function.
int start = string.indexOf(" ")+1;
string.substring(start,start + 7);
Edit: You can use
int start = string.indexOf(" ")+1;
int end = string.indexOf(" ", start+1);
string.substring(start,end >= 0 ? end : string.length());
if you want to select the first word and don't know how long it will be.

How to find first occurance of whitespace(tab+space+etc) in java?

So I have something like this
System.out.println(some_string.indexOf("\\s+"));
this gives me -1
but when I do with specific value like \t or space
System.out.println(some_string.indexOf("\t"));
I get the correct index.
Is there any way I can get the index of the first occurrence of whitespace without using split, as my string is very long.
PS - if it helps, here is my requirement. I want the first number in the string which is separated from the rest of the string by a tab or space ,and i am trying to avoid split("\\s+")[0]. The string starts with that number and has a space or tab after the number ends

The point is: indexOf() takes a char, or a string; but not a regular expression.
Thus:
String input = "a\tb";
System.out.println(input);
System.out.println(input.indexOf('\t'));
prints 1 because there is a TAB char at index 1.
System.out.println(input.indexOf("\\s+"));
prints -1 because there is no substring \\s+ in your input value.
In other words: if you want to use the powers of regular expressions, you can't use indexOf(). You would be rather looking towards String.match() for example. But of course - that gives a boolean result; not an index.
If you intend to find the index of the first whitespace, you have to iterate the chars manually, like:
for (int index = 0; index < input.length(); index++) {
if (Character.isWhitespace(input.charAt(index))) {
return index;
}
}
return -1;

Something of this sort might help? Though there are better ways to do this.
class Sample{
public static void main(String[] args) {
String s = "1110 001";
int index = -1;
for(int i = 0; i < s.length(); i++ ){
if(Character.isWhitespace(s.charAt(i))){
index = i;
break;
}
}
System.out.println("Required Index : " + index);
}
}

Well, to find with a regular expression, you'll need to use the regular expression classes.
Pattern pat = Pattern.compile("\\s");
Matcher m = pat.matcher(s);
if ( m.find() ) {
System.out.println( "Found \\s at " + m.start());
}
The find method of the Matcher class locates the pattern in the string for which the matcher was created. If it succeeds, the start() method gives you the index of the first character of the match.
Note that you can compile the pattern only once (even create a constant). You just have to create a Matcher for every string.

Java Regex replacing every digit in beginning

How can I replace with regex each digit at the beginning of word with the underscore character, as well as in the rest part of the word to replace all characters except letters, digits, dashes and dots to underscores?
I tried this regex:
^(\d+)|[^\w-.]
However, it replaces all digits in the beginning with a single underscore character.
So, 34567fgf-kl.)*/676hh is converted to _fgf-kl.___676hh while I need every digit in the beginning to be replaced with one underscore character like _____fgf-kl.___676hh.
Is it possible to achieve using a regex?

You can do it like this with Matcher.appendReplacement used with Matcher.find:
String fileText = "34567fgf-kl.)*/676hh";
String pattern = "^\\d+|[^\\w.-]+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(fileText);
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, repeat("_", m.group(0).length()));
}
m.appendTail(sb); // append the rest of the contents
System.out.println(sb);
And the repeat is
public static String repeat(String s, int n) {
if(s == null) {
return null;
}
final StringBuilder sb = new StringBuilder(s.length() * n);
for(int i = 0; i < n; i++) {
sb.append(s);
}
return sb.toString();
}
See IDEONE demo
Also, repeat can be replaced with String repeated = StringUtils.repeat("_", m.group(0).length()); using Commons Lang StringUtils.repeat().

You can use a negative-lookbehind to individually match each leading digit, i.e. any digit that doesn't have a non-digit before it.
(?<!\D.{0,999})\d|[^\w-.]
Due to constraints in lookbehind, it cannot be unlimited. The above code can handle at most 999 leading digits.

You can also use replaceAll() with regex:
(^\d)|(?<=\d\G)\d|[^-\w.\n]
which means match:
(^\d) - digit on beginning of a line,
| - or
(?<=\d\G)\d - digit if it is preceded by previously matched digit,
| - or
[^-\w.\n] - not dash, word character (\w is [A-Za-z_0-9]), point or
new line (\n). As a [^-\w.\n] is rather broad category, maybe you will like to add some more characters, or character groups, to exclude from matching, it is enough to add it inside brackets,
DEMO
\n is added if string could be multiline. If there is just one-line string, \n is redundant.
Example in Java:
public class Test {
public static void main(String[] args) {
String example = "34567fgf-kl.)*/676hh";
System.out.println(example.replaceAll("(^\\d)|(?<=\\d\\G)\\d|[^\\w.-]", "_"));
}
}
with output:
_____fgf-kl.___676hh

Extracting both matching and not matching regex

I have a String like this one abc3a de'f gHi?jk I want to split it into the substrings abc3a, de'f, gHi, ? and jk. In other terms, I want to return Strings that match the regular expression [a-zA-Z0-9'] and the Strings that do not match this regular expression. If there is a way to tell whether each resulting substring is a match or not, this will be a plus.
Thanks!

import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class HelloWorld{
public static void main(String []args){
Pattern pattern = Pattern.compile("([a-zA-Z0-9']*)?([^a-zA-Z0-9']*)?");
String str = "abc3a de'f gHi?jk";
Matcher matcher = pattern.matcher(str);
while(matcher.find()){
if(matcher.group(1).length() > 0)
System.out.println("Match:" + matcher.group(1));
if(matcher.group(2).length() > 0)
System.out.println("Miss: `" + matcher.group(2) + "`");
}
}
}
Output:
Match:abc3a
Miss: ` `
Match:de'f
Miss: ` `
Match:gHi
Miss: `?`
Match:jk
If you don't want white space.
Pattern pattern = Pattern.compile("([a-zA-Z0-9']*)?([^a-zA-Z0-9'\\s]*)?");
Output:
Match:abc3a
Match:de'f
Match:gHi
Miss: `?`
Match:jk

You can use this regex:
"[a-zA-Z0-9']+|[^a-zA-Z0-9' ]+"
Will give:
["abc3a", "de'f", "gHi", "?", "jk"]
Online Demo: http://regex101.com/r/xS0qG4
Java code:
Pattern p = Pattern.compile("[a-zA-Z0-9']+|[^a-zA-Z0-9' ]+");
Matcher m = p.matcher("abc3a de'f gHi?jk");
while (m.find())
System.out.println(m.group());
OUTPUT
abc3a
de'f
gHi
?
jk

myString.split("\\s+|(?<=[a-zA-Z0-9'])(?=[^a-zA-Z0-9'\\s])|(?<=[^a-zA-Z0-9'\\s])(?=[a-zA-Z0-9'])")
splits at all the boundaries between runs of characters in that charset.
The lookbehind (?<=...) matches after a character in a run, while the lookahead (?=...) matches before a character in a run of characters outside the set.
The \\s+ is not a boundary match, and matches a run of whitespace characters. This has the effect of removing white-space from the result entirely.
The | allows causing splitting to happy at either boundary or at a run of white-space.
Since the lookbehind and lookahead are both positive, the boundaries will not match at the start or end of the string, so there's no need to ignore empty strings in the output unless there is white-space there.

You can use anchors to split
private static String[] splitString(final String s) {
final String [] arr = s.split("(?=[^a-zA-Z0-9'])|(?<=[^a-zA-Z0-9'])");
final ArrayList<String> strings = new ArrayList<String>(arr.length);
for (final String str : arr) {
if(!"".equals(str.trim())) {
strings.add(str);
}
}
return strings.toArray(new String[strings.size()]);
}
(?=xxx) means xxx will follow here and (?<=xxx) mean xxx precedes this position.
As you did not want to include all-whitespace-matches into the result you need to filter the Array given by split.

How to find the text between ( and )

I have a few strings which are like this:
text (255)
varchar (64)
...
I want to find out the number between ( and ) and store that in a string. That is, obviously, store these lengths in strings.
I have the rest of it figured out except for the regex parsing part.
I'm having trouble figuring out the regex pattern.
How do I do this?
The sample code is going to look like this:
Matcher m = Pattern.compile("<I CANT FIGURE OUT WHAT COMES HERE>").matcher("text (255)");
Also, I'd like to know if there's a cheat sheet for regex parsing, from where one can directly pick up the regex patterns

I would use a plain string match
String s = "text (255)";
int start = s.indexOf('(')+1;
int end = s.indexOf(')', start);
if (end < 0) {
// not found
} else {
int num = Integer.parseInt(s.substring(start, end));
}
You can use regex as sometimes this makes your code simpler, but that doesn't mean you should in all cases. I suspect this is one where a simple string indexOf and substring will not only be faster, and shorter but more importantly, easier to understand.

You can use this pattern to match any text between parentheses:
\(([^)]*)\)
Or this to match just numbers (with possible whitespace padding):
\(\s*(\d+)\s*\)
Of course, to use this in a string literal, you have to escape the \ characters:
Matcher m = Pattern.compile("\\(\\s*(\\d+)\\s*\\)")...

Here is some example code:
import java.util.regex.*;
class Main
{
public static void main(String[] args)
{
String txt="varchar (64)";
String re1=".*?"; // Non-greedy match on filler
String re2="\\((\\d+)\\)"; // Round Braces 1
Pattern p = Pattern.compile(re1+re2,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String rbraces1=m.group(1);
System.out.print("("+rbraces1.toString()+")"+"\n");
}
}
}
This will print out any (int) it finds in the input string, txt.
The regex is \((\d+)\) to match any numbers between ()

int index1 = string.indexOf("(")
int index2 = string.indexOf(")")
String intValue = string.substring(index1+1, index2-1);

Matcher m = Pattern.compile("\\((\\d+)\\)").matcher("text (255)");
if (m.find()) {
int len = Integer.parseInt (m.group(1));
System.out.println (len);
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to substring before nth occurence of a separator? - java

If you want to use regular expressions, it is pretty straightforward: import re value = "first;snd;3rd;4th;5th;6th;" reg = r'^([\w]+;[\w]+;[\w]+)' re.match(reg, value).group() Outputs: "first;snd;3rd" More options here .

If you don't want to use split, just use indexOf in a for loop to know the index of the 3rd and 4th ";" then do a substring between these index. Also you can do a split with a regex that match the 3rd ; but it's probably not the best solution.

Below code find index of 3rd occurence of ';' character and make substring. String s = "first;snd;3rd;4th;5th;6th;"; String splitted = s.substring(0, s.indexOf(";", s.indexOf(";", s.indexOf(";") + 1) + 1));

Related

Splitting a String that has a particular structure

How to find first occurance of whitespace(tab+space+etc) in java?

Java Regex replacing every digit in beginning

Extracting both matching and not matching regex

How to find the text between ( and )

Categories

Resources