Regex: skip first occurrence of pattern - java

Let's say I have a String, foo, with values separated by whitespace:
[value 1] [value 2] [value 3] ... [value n]
What would the regular expression be to split(...) foo, such that the resulting String array contained all values except value 1? Is this even possible?
Here's what I have thus far:
String[] array = foo.split("\\s");
And that's not much, I know.
EDIT:
I'm looking to accomplish this purely through regular expressions. If this is not possible, please let me know!

Once you've split your string into an array of values, loop through the array and do whatever you need, skipping the first iteration.
for (int i = 1; i < array.length; i++) {
//Act on the data value
}

Your delimiter could be "either a whitespace sequence OR a chunk of non-whitespace at the beginning of the string", but this leaves you with an empty string at the front:
Arrays.toString("abc def ghi jkl".split("\\s+|^\\S+\\s+"))
produces
[, def, ghi, jkl]
That is the problem with split -- you will, I think, always get something at the beginning of your array. Unfortunately I think you need to whack off the front of the string before splitting, or use Java's Arrays.copyOfRange() or similar to post-process.
Dropping the beginning can be done with replaceFirst:
import java.util.Arrays;
public class SplitExample {
public static final String data = "abc def ghi";
public static void main(String[] args) {
System.out.println(Arrays.toString(data.split("\\s+")));
System.out.println(Arrays.toString(data.split("\\s+|^\\S+\\s+")));
System.out.println(Arrays.toString(data.replaceFirst("^\\S+\\s+", "").split("\\s+")));
}
}
The final line is as close as I can get, because split produces matches AROUND your delimiters. How can you avoid the blank string at the front with a single split? I am not sure there is a way....
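The Arrays.copyOfRange() post-processing route mentioned above can be sketched as follows. This is only an illustration: the class and method names (SkipFirstValue, allButFirst) are made up for the example, and it assumes that dropping the first token after a plain whitespace split is acceptable.

```java
import java.util.Arrays;

class SkipFirstValue {
    // Splits on runs of whitespace, then drops the first token (if there is one).
    static String[] allButFirst(String foo) {
        String[] all = foo.trim().split("\\s+");
        // Nothing is left after removing the first token from a 0- or 1-element array.
        if (all.length <= 1) {
            return new String[0];
        }
        // Copy indices 1 .. length-1, i.e. everything except the first value.
        return Arrays.copyOfRange(all, 1, all.length);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allButFirst("abc def ghi jkl"))); // [def, ghi, jkl]
    }
}
```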

Since the consensus seems to be that it is impossible, I propose this solution:
Assuming you only have ONE space in the 'junk' value,
int ix = theString.indexOf(" ");
ix = theString.indexOf(" ", ix + 1);
String[] array = theString.substring(ix + 1).split("\\s");
This gets the substring from the second space in the string (the first space after the space in the 'junk' value) then splits it.

Would you be able to do this?
String[] array = foo.split("\\S*\\s", 2)[1].split("\\s");
It's 2 regex splits instead of one, but it's neater than looping later. The first split uses a limit of 2, so it separates the string into whatever precedes the first "any number of non-whitespace characters followed by a whitespace" match (an empty string) and everything else. You can then split everything else by whitespace only, and you'll be left with an array excluding the first element. Note that this throws ArrayIndexOutOfBoundsException if foo contains no whitespace at all.
Edit: Yeah, it can't be done with a single split since the only way to have anything other than "" as the first element in your array is to have something you're not removing with the split at the front of the string.
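If a single regular expression matters more than using split itself, one alternative is to match the values instead of splitting around them. The sketch below is not a split-based solution. It uses Pattern/Matcher with a lookbehind so that only tokens preceded by whitespace are found, and it assumes foo has no leading whitespace. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokensAfterFirst {
    // A token is accepted only if whitespace comes directly before it,
    // which is never the case for a token at the very start of the string.
    private static final Pattern NOT_FIRST = Pattern.compile("(?<=\\s)\\S+");

    static List<String> valuesAfterFirst(String foo) {
        List<String> out = new ArrayList<>();
        Matcher m = NOT_FIRST.matcher(foo);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(valuesAfterFirst("abc def ghi jkl")); // [def, ghi, jkl]
    }
}
```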

Related

Split a java string among < > brackets, including the brackets, but only if no space between brackets

I need to be able to turn a string, for instance "This and <those> are.", into a string array of the form ["This and ", "<those>", " are."]. I have been trying to use the String.split() command, and I've gotten this regex:
"(?=[<>])"
However, this just gets me ["This and ", "<those", "> are."]. I can't figure out a good regex to get the brackets all on the same element, and I also can't have spaces between those brackets. So for instance, "This and <hey there> are." Should be simply split to ["This and <hey there> are."]. Ideally I'd like to just rely solely on the split command for this operation. Can anyone point me in the right direction?
Not really feasible with split alone. The 'separator' needs to match 0 characters, so it has to be built entirely from lookahead/lookbehind, and Java's lookbehind needs a bounded length. To decide whether to split right after a '>' you would have to look back arbitrarily far to check that no space occurred since the opening '<'. So a pure split solution to what you want is not practical.
Just write a regexp that FINDS the construct you want, that's a lot simpler. Simply Pattern.compile("<\\w+>") (taking a select few liberties on what you intend a thing-in-brackets to look like. If truly it can be ANYTHING except spaces and the closing brace, "<[^ >]+>" is what you want).
Then, just loop through, finding as you go:
private static final Pattern TOKEN_FINDER = Pattern.compile("<\\w+>");
List<String> parse(String in) {
Matcher m = TOKEN_FINDER.matcher(in);
if (!m.find()) return List.of(in);
var out = new ArrayList<String>();
int pos = 0;
do {
int s = m.start();
if (s > pos) out.add(in.substring(pos, s));
out.add(m.group());
pos = m.end();
} while (m.find());
if (pos < in.length()) out.add(in.substring(pos));
return out;
}
Let's try it:
System.out.println(parse("This and <those> are."));
System.out.println(parse("This and <hey there> are."));
System.out.println(parse("<edgecase>"));
System.out.println(parse("<edgecase>2"));
System.out.println(parse("3<edgecase>"));
prints:
[This and , <those>,  are.]
[This and <hey there> are.]
[<edgecase>]
[<edgecase>, 2]
[3, <edgecase>]
seems like what you wanted.

Extract number in a String java

I have a fairly simple question. I have a string such as 0-1000,
say str = "0-1000"
I can successfully extract 0 and 1000 by using str.split("-").
Now I need to handle the case where either of the two numbers can be negative.
If I keep using str.split("-"), the negative signs are consumed as separators as well.
Could anyone suggest a method for this?
Since String.split() uses regular expressions to split, you could do something like this:
String[] nos = "-1000--1".split("(?<=\\d)-");
This means you split at minus characters that follow a digit, i.e. must be an operator.
Note that the positive look-behind (?<=\d) needs to be used since you only want to match the minus character. String.split() removes all matching separators and thus something like \d- would remove digits as well.
To parse the numbers you'd then iterate over the array elements and call Integer.valueOf(element) or Integer.parseInt(element).
Note that this assumes the input string to be valid. Depending on what you want to achieve, you might first have to check the input for a match, e.g. by using -?\d+--?\d+ to check whether the string is in the format x-y where x and y can be positive or negative integers.
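Putting the validation and the split together might look like the sketch below. The class and method names (RangeParser, parseRange) are hypothetical, and it assumes both numbers fit in an int. Integer.parseInt throws NumberFormatException otherwise.

```java
class RangeParser {
    // Hypothetical helper: parses "x-y", where x and y may each be negative.
    static int[] parseRange(String s) {
        // Validate first: optional minus, digits, separator minus, optional minus, digits.
        if (!s.matches("-?\\d+--?\\d+")) {
            throw new IllegalArgumentException("Not in x-y format: " + s);
        }
        // Split only at a minus that directly follows a digit (the separator).
        String[] parts = s.split("(?<=\\d)-");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) {
        int[] r = parseRange("-1000--1");
        System.out.println(r[0] + " and " + r[1]); // -1000 and -1
    }
}
```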
You can use a regex like this, which works for negative numbers on either side:
public static void main(String[] args) {
String s = "-500--578";
String[] arr = s.split("(?<=\\d)-"); // split on "-" only if it is preceded by a digit
for (String str : arr)
System.out.println(str);
}
O/P:
-500
-578

Create String[] containing only certain characters

I am trying to create a String[] which contains only words that consist of certain characters. For example I have a dictionary containing a number of words like so:
arm
army
art
as
at
attack
attempt
attention
attraction
authority
automatic
awake
baby
back
bad
bag
balance
I want to narrow the list down so that it only contains words with the characters a, b and g. Therefore the list should only contain the word 'bag' in this example.
Currently I am trying to do this using regexes but having never used them before I can't seem to get it to work.
Here is my code:
public class LetterJugglingMain {
public static void main(String[] args) {
String dictFile = "/Users/simonrhillary/Desktop/Dictionary(3).txt";
fileReader fr = new fileReader();
fr.openFile(dictFile);
String[] dictionary = fr.fileToArray();
String regx = "able";
String[] newDict = createListOfValidWords(dictionary, regx);
printArray(newDict);
}
public static String[] createListOfValidWords(String[] d, String regex){
List<String> narrowed = new ArrayList<String>();
for(int i = 0; i<d.length; i++){
if(d[i].matches(regex)){
narrowed.add(d[i]);
System.out.println("added " + d[i]);
}
}
String[] narrowArray = narrowed.toArray(new String[0]);
return narrowArray;
}
however the array returned is always empty unless the String regex is the exact word! Any ideas? I can post more code if needed...I think I must be trying to initialise the regex wrong.
The narrowed down list must contain ONLY the characters from the regex.
Frankly, I'm not an expert in regexes, but I don't think it's the best tool to do what you want. I would use a method like the following:
public boolean containsAll(String s, Set<Character> chars) {
Set<Character> copy = new HashSet<Character>();
for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
char c = s.charAt(i);
if (chars.contains(c)) {
copy.add(c);
}
}
return copy.size() == chars.size();
}
The regex able will match only the string "able". However, if you want a regular expression to match any one of the characters a, b, l or e, the regex you're looking for is [able] (in brackets). If you want words made up of several such characters, add a + to repeat the pattern: [able]+.
The OP wants words that contain every character. Not just one of them.
And other characters are not a problem.
If this is the case, I think the simplest way would be to loop through the entire string, character by character, and check whether it contains all of the characters you want. Keep flags to track whether every character has been found.
If this isn't the case.... :
Try using the regex:
^[able]+$
Here's what it does:
^ matches the beginning of the string and $ matches the end of the string. This makes sure that you're not getting a partial match.
[able] matches the characters you want the string to consist of, in this case a, b, l, and e. + Makes sure that there are 1 or more of these characters in the string.
Note: This regex will match a string that consists only of these 4 letters, in any order and with repeats. For example, it will match:
able, albe, aeble, aaaabbblllleeee
and will not match
qable, treatable, and abled.
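A quick way to try the ^[able]+$ pattern against those examples is sketched below. The class and method names are invented for the illustration. It uses Matcher.find() so that the ^ and $ anchors are what rule out partial matches.

```java
import java.util.regex.Pattern;

class OnlyTheseLetters {
    private static final Pattern ONLY_ABLE = Pattern.compile("^[able]+$");

    // True when the word is made up exclusively of the letters a, b, l and e.
    static boolean usesOnlyAble(String word) {
        return ONLY_ABLE.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"able", "albe", "aeble", "qable", "treatable", "abled"}) {
            System.out.println(w + " -> " + usesOnlyAble(w));
        }
    }
}
```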
Here is a sample regex that selects words containing at least one occurrence of every character in a set. This will match any English word (case-insensitive) that contains at least one occurrence of each of the characters a, b, g:
(?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
Example of strings that match would be bag, baggy, grab.
Example of strings that don't match would be big, argument, nothing.
The (?i) turns on the case-insensitive flag.
You need to append as many (?=.*<character>) as the number of characters in the set, for each of the characters.
I assume a word only contains English alphabet, so I specify [a-z]. Specify more if you need space, hyphen, etc.
I assume matches(String regex) method in String class, so I omitted the ^ and $.
The performance may be bad, since in the worst case (the characters are found at the end of the words), I think that the regex engine may go through the string for around n times where n is the number of characters in the set. It may not be an actual concern at all, since the words are very short, but if it turns out that this is a bottleneck, you may consider doing simple looping.
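The lookahead regex can also be generated from the set of required letters rather than written by hand. The sketch below is one way to do that. The names (ContainsAllLetters, buildPattern, filter) are made up for the example, and it assumes the required characters are plain ASCII letters, so no regex escaping is applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ContainsAllLetters {
    // Builds (?i)(?=.*a)(?=.*b)...[a-z]+ with one lookahead per required letter.
    // Assumption: 'required' holds only ASCII letters (nothing that needs escaping).
    static Pattern buildPattern(String required) {
        StringBuilder sb = new StringBuilder("(?i)");
        for (char c : required.toCharArray()) {
            sb.append("(?=.*").append(c).append(")");
        }
        sb.append("[a-z]+");
        return Pattern.compile(sb.toString());
    }

    // Keeps the words that contain every required letter at least once.
    static List<String> filter(String[] dictionary, String required) {
        Pattern p = buildPattern(required);
        List<String> out = new ArrayList<>();
        for (String word : dictionary) {
            if (p.matcher(word).matches()) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] dict = {"arm", "baby", "back", "bad", "bag", "balance"};
        System.out.println(filter(dict, "abg")); // [bag]
    }
}
```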

split method leaving space in array

{
ArrayList<String> node_array = new ArrayList<String>();
String allValues[] = node.split("[(,)]");
for(String value : allValues){
node_array.add(value);
}
node is a string, for example: (3,4,5,6,3)
for some reason when I verify the content of the arraylist the split seems to leave a trail of space as elements, specifically where ( and ) is supposed to be. What am I doing wrong?
You're asking split() to split at parentheses and commas. In your string, there is an empty substring right before the first separator, the opening parenthesis. split() keeps that empty substring and returns it as the zeroth element of the resulting array.
There are plenty of examples in the documentation that illustrate how the function works.
To work around this, you can either ignore the empty strings, or flip the regex on its head and match the numbers instead of splitting at the punctuation characters.
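The "match the numbers instead" option could look like the sketch below. It assumes the node values are non-negative integers, as in (3,4,5,6,3). The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NodeValues {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Collects every run of digits, so parentheses and commas never produce empty entries.
    static List<String> values(String node) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(node);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(values("(3,4,5,6,3)")); // [3, 4, 5, 6, 3]
    }
}
```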
You have defined a separator that is also the first character in your String, so an empty string "" will show up in your ArrayList, because that is what occurs before the first separator. However, for your application you can easily fix it like this:
ArrayList<String> node_array = new ArrayList<String>();
String allValues[] = node.split("[(,)]");
for(String value : allValues){
if(!value.equals("")) node_array.add(value);
}
return node_array;
node.replace("(","").replace(")","").split(",");
or
node.substring(1,node.length()-1).split(",");

Finding multiple substrings using boundaries in Java

Alright so here is my problem. Basically I have a string with 4 words in it, with each word separated by a #. What I need to do is use the substring method to extract each word and print it out. I am having trouble figuring out the parameters for it though. I can always get the first one right, but the following ones generally have problems.
Here is the first piece of the code:
word = format.substring( 0 , format.indexOf('#') );
Now from what I understand this basically means start at the beginning of the string, and end right before the #. So using the same logic, I tried to extract the second word like so:
wordTwo = format.substring ( wordlength + 1 , format.indexOf('#') );
//The plus one so I don't start at the #.
But with this I continually get errors saying it doesn't exist. I figured that the compiler was trying to read the first # before the second word, so I rewrote it like so:
wordTwo = format.substring (wordlength + 1, 1 + wordLength + format.indexOf('#') );
And with this it just completely screws it up, either not printing the second word or not stopping in the right place. If I could get any help on the formatting of this, it would be greatly appreciated. Since this is for a class, I am limited to using very basic methods such as indexOf, length, substring etc. so if you could refrain from using anything too complex that would be amazing!
If you have to use substring then you need to use the variant of indexOf that takes a start index. This means you can look for the second # by starting the search after the first one, i.e.
wordTwo = format.substring ( wordlength + 1 , format.indexOf('#', wordlength + 1 ) );
There are however much better ways of splitting a string on a delimiter like this. You can use a StringTokenizer. This is designed for splitting strings like this. Basically:
StringTokenizer tok = new StringTokenizer(format, "#");
String word = tok.nextToken();
String word2 = tok.nextToken();
String word3 = tok.nextToken();
Or you can use the String.split method which is designed for splitting strings. e.g.
String[] parts = format.split("#");
String word = parts[0];
String word2 = parts[1];
String word3 = parts[2];
You can go with split() for this kind of formatting strings.
For instance if you have string like,
String text = "Word1#Word2#Word3#Word4";
You can use delimiter as,
String delimiter = "#";
Then create an string array like,
String[] temp;
For splitting string,
temp = text.split(delimiter);
You can then get the words like this,
// temp[0] is "Word1"
// temp[1] is "Word2"
// temp[2] is "Word3"
// temp[3] is "Word4"
Use split() method to do this with "#" as the delimiter
String s = "hi#vivek#is#good";
String temp = new String();
String[] arr = s.split("#");
for(String x : arr){
temp = temp + x;
}
Or if you want to extract each word... you have it already in arr
arr[0] ---> First Word
arr[1] ---> Second Word
arr[2] ---> Third Word
I suggest that you have a look at the Javadoc for String before you proceed further.
Since this is your homework, I'll give you a couple of hints and maybe you can solve it yourself:
The signature of substring is public String substring(int beginIndex, int endIndex). As per the javadoc for this method:
Returns a new string that is a substring of this string. The substring
begins at the specified beginIndex and extends to the character at
index endIndex - 1. Thus the length of the substring is
endIndex-beginIndex.
Note that if you have to use this method, you'll have to shift your beginIndex and endIndex each time, because in your situation you have multiple words separated by #.
However if you look closely, there's another method in String class that might be helpful to you. That's the public String[] split(String regex) method. The javadoc for this one states:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
The split() method looks pretty interesting for your case. You can split your String with the delimiter that you have as the parameter to this method, get the String array and work with that.
Hope this helps you to understand your problem and get started towards a solution :)
Since this is homework, it may be better to try to write it yourself. But I will give a clue.
Clue:
The indexOf method has another overload: int indexOf(int ch,
int fromIndex), which finds the first occurrence of the character ch
in the string, starting the search at fromIndex.
http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html
From this clue, the program will look something like this:
Find the index of the first '#' from the start of the string.
Extract the word from 0th character to that index.
Find the index of the first '#' from the character AFTER the first '#'.
Extract the word from just after the first '#' up to that index.
... Just do it until you get 4 words or the string ends.
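The steps above can be sketched as a loop over indexOf(ch, fromIndex) and substring. This is only one possible shape. The class and method names are made up, and the results are collected in a List purely to keep the example easy to check (printing each word inside the loop works just as well).

```java
import java.util.ArrayList;
import java.util.List;

class HashWords {
    // Walks the string from left to right, cutting out the text between '#' characters.
    static List<String> words(String format) {
        List<String> out = new ArrayList<>();
        int start = 0;                          // index where the current word begins
        int hash = format.indexOf('#', start);  // next '#' at or after start
        while (hash != -1) {
            out.add(format.substring(start, hash));
            start = hash + 1;                   // skip past the '#'
            hash = format.indexOf('#', start);
        }
        out.add(format.substring(start));       // last word: no '#' follows it
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("one#two#three#four")); // [one, two, three, four]
    }
}
```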
Hope this helps.
I don't know why you're forced to use String#substring, but as others have mentioned, it seems like the wrong method for the kind of functionality you need.
String#split(String regex) is what you would use for such a problem, or, if your input sequence is something you don't control, I would suggest you look at the overloaded method String#split(String regex, int limit); this way you can impose a limit on the amount of matches you make, controlling your resulting array.
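As a small illustration of the limit argument, the sketch below shows a positive limit capping the number of pieces and a negative limit keeping trailing empty strings. The class and method names are invented for the example.

```java
import java.util.Arrays;

class SplitLimit {
    // With limit 2 the pattern is applied at most once: first word, then the untouched rest.
    static String[] headAndRest(String s) {
        return s.split("#", 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(headAndRest("Word1#Word2#Word3#Word4")));
        // [Word1, Word2#Word3#Word4]

        // Limit 0 (the one-argument form) drops trailing empty strings; a negative limit keeps them.
        System.out.println(Arrays.toString("a#b##".split("#")));     // [a, b]
        System.out.println(Arrays.toString("a#b##".split("#", -1))); // [a, b, , ]
    }
}
```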
