Split a string when the separator can be nested

I'm having some trouble while trying to split a string with a nested separator.
My String would be like "(a,b(1,2,3),c,d(a,b,c))".
How could I get an array ["a","b(1,2,3)","c","d(a,b,c)"] ?
I obviously can't use .split(","), since it would also split my sub-strings.

Here is a straightforward non-recursive function that splits your string the way you want:
private String[] specialSplit(String s) {
List<String> result = new ArrayList<>();
StringBuilder sb = new StringBuilder();
int parenCount = 0;
for (int i = 1; i < s.length() - 1; i++) { // go from 1 to length -1 to discard the surrounding ()
char c = s.charAt(i);
if (c == '(') parenCount++;
else if (c == ')') parenCount--;
if (parenCount == 0 && c == ',') {
result.add(sb.toString());
sb.setLength(0); // clear string builder
} else {
sb.append(c);
}
}
result.add(sb.toString());
return result.toArray(new String[0]);
}
Basically, we iterate through all the characters of the string, keeping track of the parentheses. The first and last parentheses are not considered. We only split the string when we have seen the same number of opening and closing parentheses and the current character is ','.
Because it makes a single pass over the string, this method will likely run faster than a regex-based approach.

A recursive function should work here, just not with plain split(). Try parsing your string character by character and act whenever you encounter a comma or parenthesis: , means you create a new element, ( means you start a new nested list, ) means you finish the current nested list. This should even work with a more "unrolled" approach (i.e. no recursion, but handling the nesting in a data structure).
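The "unrolled" variant described above could be sketched roughly as follows. The class name NestedListParser and the choice to represent the result as List<Object> (strings and nested lists mixed) are assumptions made for illustration, not something the answer specifies. Note that a name and the list that follows it, such as b and (1,2,3), end up as siblings, and empty elements are dropped.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class NestedListParser {
    // Parses "(a,b(1,2,3),c)" into nested lists without recursion.
    static List<Object> parse(String s) {
        Deque<List<Object>> stack = new ArrayDeque<>(); // enclosing lists
        List<Object> root = new ArrayList<>();
        List<Object> current = root;
        StringBuilder token = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                flush(token, current);
                List<Object> child = new ArrayList<>();
                current.add(child);   // start a new nested list
                stack.push(current);
                current = child;
            } else if (c == ')') {
                flush(token, current);
                if (stack.isEmpty()) {
                    throw new IllegalArgumentException("Unbalanced ')' at index " + i);
                }
                current = stack.pop(); // finish the current nested list
            } else if (c == ',') {
                flush(token, current); // a comma ends the current element
            } else {
                token.append(c);
            }
        }
        if (!stack.isEmpty()) {
            throw new IllegalArgumentException("Missing ')'");
        }
        flush(token, current);
        return root;
    }

    // Moves the pending token, if any, into the target list.
    private static void flush(StringBuilder token, List<Object> target) {
        if (token.length() > 0) {
            target.add(token.toString());
            token.setLength(0);
        }
    }

    public static void main(String[] args) {
        // prints [[a, b, [1, 2, 3], c, d, [a, b, c]]]
        System.out.println(parse("(a,b(1,2,3),c,d(a,b,c))"));
    }
}
```

The outer pair of parentheses produces the single outer list in the result; if only a flat top-level split is needed, the counter-based function in the first answer is simpler.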

Related

How to preserve the punctuation when converting words to Pig Latin?

I've been working on a Java program to convert English words to Pig Latin. I've done all the basic rules such as appending -ay, -way, etc., and special cases like question -> estionquay, rhyme -> ymerhay, and I also dealt with capitalization (Thomas -> Omasthay). However, I have one problem that I can't seem to solve: I need to preserve before-and-after punctuation. For example, What? -> Atwhay? Oh!->Ohway! "hello" -> "ellohay" and "Hello!" -> "Ellohay!" This is not a duplicate by the way, I've checked tons of pig latin questions and I cannot seem to figure out how to do it.
Here is my code so far (I've removed all the punctuation but can't figure out how to put it back in):
public static String scrub(String s)
{
String punct = ".,?!:;\"(){}[]<>";
String temp = "";
String pigged = "";
int index, index1, index2, index3 = 0;
for(int i = 0; i < s.length(); i++)
{
if(punct.indexOf(s.charAt(i)) == -1) //if s has no punctuation
{
temp+= s.charAt(i);
}
} //temp equals word without punctuation
pigged = pig(temp); //pig is the piglatin-translator method that I have already written,
//didn't want to put it here because it's almost 200 lines
for(int x = 0; x < s.length(); x++)
{
if(s.indexOf(punct)!= -1)//punctuation exists
{
index = x;
}
}
}
I get that in theory you could search the string for punctuation and that it should be near the beginning or end, so you would have to store the index and put it back after the word is "piglatinized", but I keep getting confused about the for loop part. If you do index = x inside the for loop, you're just replacing index every time the loop runs.
Help would be appreciated greatly! Also, please keep in mind I can't use shortcuts, I can use String methods and such but not things like Collections or ArrayLists (not that you'd need them here), I have to reinvent the wheel, basically. By the way, in case it wasn't clear, I already have the translating-to-piglatin thing down. I only need to preserve the punctuation before and after translating.

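If you are allowed to use regular expressions, you can use the following code.
// requires java.util.regex.Matcher and java.util.regex.Pattern
String pigSentence(String sentence) {
    Matcher m = Pattern.compile("\\p{L}+").matcher(sentence); // one match per run of letters
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        // quoteReplacement keeps any '$' or '\' in the translated word literal
        m.appendReplacement(sb, Matcher.quoteReplacement(pig(m.group())));
    }
    m.appendTail(sb); // copy whatever follows the last word
    return sb.toString();
}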
In plain English, the above code is:
for each word in the sentence:
replace it with pig(word)
But if regular expressions are forbidden, you can try this:
String pigSentence(String sentence) {
char[] chars = sentence.toCharArray();
int i = 0, len = chars.length;
StringBuilder sb = new StringBuilder();
while (i < len) {
while (i < len && !Character.isLetter(chars[i]))
sb.append(chars[i++]);
int wordStart = i;
while (i < len && Character.isLetter(chars[i]))
i++;
int wordEnd = i;
if (wordStart != wordEnd) {
String word = sentence.substring(wordStart, wordEnd); // substring takes (begin, end), not (begin, length)
sb.append(pig(word));
}
}
return sb.toString();
}

What you need to do is: remove punctuation if it exists, convert to pig latin, add punctuation back.
Assuming punctuation is always at the end of the string, you can check for punctuation with the following:
String punctuation = "";
for (int i = str.length() - 1; i >= 0; i--) { // i >= 0 so the first character is checked too
if (!Character.isLetter(str.charAt(i))) {
punctuation = str.charAt(i) + punctuation;
} else {
break; // Found all punctuation
}
}
str = str.substring(0, str.length() - punctuation.length()); // Remove punctuation
// Convert str to pig latin
// Append punctuation to str

I'd find it troublesome to handle punctuation separately from the translation. For punctuation at the very beginning or very end, you can save it and tag it back on after translating.
But if you remove punctuation from the middle of the word, it will be rather difficult to put it back in the correct location. The indices change from the original word to the pigged word, and by a variable amount. (For a random example, consider "Hel'lo" and "Quest'ion". The apostrophe shifts left by either 1 or 2, and you won't know which.)
How does your translation method handle punctuation? Do you really need to remove all punctuation before passing it to the translator? I'd suggest having your pigging method handle at least the punctuation in the middle of the word.
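Saving the leading and trailing punctuation and tagging it back on could look roughly like the sketch below, which uses only String and Character methods. The class name PunctuationKeeper and the method pigKeepingPunctuation are made up for illustration, and pig here is a deliberately naive placeholder (it moves only the first letter), not the asker's real translator. Punctuation in the middle of the word is left inside the part handed to pig.

```java
final class PunctuationKeeper {
    // Hypothetical stand-in for the asker's translator: moves the first
    // letter to the end and appends "ay". Replace with the real method.
    static String pig(String word) {
        return word.isEmpty() ? word : word.substring(1) + word.charAt(0) + "ay";
    }

    static String pigKeepingPunctuation(String s) {
        int start = 0;
        int end = s.length();
        // Skip non-letters at the front.
        while (start < end && !Character.isLetter(s.charAt(start))) {
            start++;
        }
        // Skip non-letters at the back.
        while (end > start && !Character.isLetter(s.charAt(end - 1))) {
            end--;
        }
        String leading = s.substring(0, start);
        String core = s.substring(start, end);
        String trailing = s.substring(end);
        return leading + pig(core) + trailing;
    }

    public static void main(String[] args) {
        System.out.println(pigKeepingPunctuation("\"hello!\"")); // prints "ellohay!"
    }
}
```

Two index variables are enough because only the first and last letter positions matter, which avoids the problem of overwriting a single index on every loop iteration.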

String manipulation of function names

For this Kata, I am given random function names in the PEP8 format and I have to convert them to camelCase.
(input)get_speed == (output)getSpeed ....
(input)set_distance == (output)setDistance
I have an understanding of one way of doing this, written in pseudo-code:
loop through the word,
if the letter is an underscore
then delete the underscore
then get the next letter and change to a uppercase
endIf
endLoop
return the resultant word
But I'm unsure of the best way of doing this. Would it be more efficient to create a char array and loop through the elements, and on finding an underscore delete that element, get the next index and change it to uppercase?
Or would it be better to use recursion:
function camelCase takes a string
if the length of the string is 0,
then return the string
endIf
if the character is a underscore
then change to nothing,
then find next character and change to uppercase
return the string taking away the character
endIf
finally return the function taking the first character away
Any thoughts please? I'm looking for a good, efficient way of handling this problem. Thanks :)

I would go with this:
divide given String by underscore to array
from second word until end take first letter and convert it to uppercase
join to one word
This runs in O(n), making about three passes over the name. For the first step, use this function:
str.split("_");
For the uppercase step, use this:
String newName = str.substring(0, 1).toUpperCase() + str.substring(1);
But make sure you check the length of the string first...
Edited - added implementation
It would look like this:
public String camelCase(String str) {
    if (str == null || str.trim().length() == 0) return str;
    String[] split = str.split("_");
    String newStr = split[0]; // the first word keeps its case
    for (int i = 1; i < split.length; i++) {
        if (split[i].isEmpty()) continue; // consecutive underscores produce empty parts
        newStr += split[i].substring(0, 1).toUpperCase() + split[i].substring(1);
    }
    return newStr;
}
for inputs:
"test"
"test_me"
"test_me_twice"
it returns:
"test"
"testMe"
"testMeTwice"

It would be simpler to iterate over the string instead of recursing.
String pep8 = "do_it_again";
StringBuilder camelCase = new StringBuilder();
for(int i = 0, l = pep8.length(); i < l; ++i) {
if(pep8.charAt(i) == '_' && (i + 1) < l) {
camelCase.append(Character.toUpperCase(pep8.charAt(++i)));
} else {
camelCase.append(pep8.charAt(i));
}
}
System.out.println(camelCase.toString()); // prints doItAgain

The question you pose is whether to use an iterative or a recursive approach. For this case I'd go for the iterative approach because it's straightforward, easy to understand and doesn't require many resources (only one array, no new stack frames, etc.), though that doesn't really matter for this example.
Recursion is good for divide-and-conquer problems, and although it's possible here, I don't see it fitting this case well.
An iterative implementation of the algorithm you described could look like the following:
StringBuilder buf = new StringBuilder(input);
for(int i = 0; i < buf.length(); i++){
if(buf.charAt(i) == '_'){
buf.deleteCharAt(i);
if(i != buf.length()){ //check for end of string
buf.setCharAt(i, Character.toUpperCase(buf.charAt(i)));
}
}
}
return buf.toString();
The check for the end of the string is not part of the given algorithm and could be omitted if the input string never ends with '_'.
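For comparison, the recursive pseudo-code from the question could be written along these lines. This is only a sketch; the class name RecursiveCamelCase is made up for illustration. It creates a new substring on every call and uses one stack frame per character, which is why the iterative versions are usually preferable.

```java
final class RecursiveCamelCase {
    static String camelCase(String s) {
        if (s.isEmpty()) {
            return s; // base case: nothing left to convert
        }
        char first = s.charAt(0);
        if (first == '_') {
            if (s.length() == 1) {
                return ""; // trailing underscore: drop it
            }
            // Drop the underscore and uppercase the character after it.
            return Character.toUpperCase(s.charAt(1)) + camelCase(s.substring(2));
        }
        return first + camelCase(s.substring(1));
    }

    public static void main(String[] args) {
        System.out.println(camelCase("get_speed"));    // prints getSpeed
        System.out.println(camelCase("set_distance")); // prints setDistance
    }
}
```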

Deleting all regex instances starting with char '[' and ending with char ']' from a String

I need to take a String and delete every substring in it that starts with the character '[' and ends with the character ']'.
I don't know how to tackle this problem. I tried converting the String to a character array, putting empty characters everywhere from each opening '[' to its closing ']', and then converting it back to a String using the toString() method.
MyCode:
char[] lyricsArray = lyricsParagraphElements.get(1).text().toCharArray();
for (int i = 0;i < lyricsArray.length;i++)
{
if (lyricsArray[i] == '[')
{
lyricsArray[i] = ' ';
for (int j = i + 1;j < lyricsArray.length;j++)
{
if (lyricsArray[j] == ']')
{
lyricsArray[j] = ' ';
i = j + 1;
break;
}
lyricsArray[j] = ' ';
}
}
}
String songLyrics = lyricsArray.toString();
System.out.println(songLyrics);
But in the print line of songLyrics i get weird stuff like
[C@71bc1ae4
[C@6ed3ef1
[C@2437c6dc
[C@1f89ab83
[C@e73f9ac
[C@61064425
[C@7b1d7fff
[C@299a06ac
[C@383534aa
[C@6bc168e5
I guess there is a simple method for it. Any help will be very appreciated.
For clarification:
converting "abcd[dsadsadsa]efg[adf%#1]d" Into "abcdefgd"

Or simply use a regular expression to replace all occurrences of \\[.*?\\] with nothing:
String songLyrics = text.replaceAll("\\[.*?\\]", "");
where text is, of course:
String text = lyricsParagraphElements.get(1).text();
What does \\[.*?\\] mean?
The first parameter of replaceAll is a string describing a regular expression. A regular expression defines a pattern to match in a string.
So let's split it up:
\\[ matches exactly the character [. Since [ has a special meaning within a regular expression, it needs to be escaped (twice: once for the regex and once for the Java string literal).
.*? combines ., which matches any character, with the lazy zero-or-more operator *?, so it matches as few characters as possible until it finally finds:
\\], which matches the character ]. Note the escaping again.
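To see why the lazy operator matters, here is a small sketch comparing the greedy and lazy patterns on the example string from the question. The class and method names are made up for illustration. A third variant with a negated character class is included as a common alternative; it is not part of the answer above.

```java
final class BracketStripper {
    // Greedy: .* runs from the first '[' to the LAST ']' in the text.
    static String stripGreedy(String text) {
        return text.replaceAll("\\[.*\\]", "");
    }

    // Lazy: .*? stops at the first ']' after each '['.
    static String stripLazy(String text) {
        return text.replaceAll("\\[.*?\\]", "");
    }

    // Negated class: "anything except ']'" between the brackets.
    static String stripNegated(String text) {
        return text.replaceAll("\\[[^\\]]*\\]", "");
    }

    public static void main(String[] args) {
        String text = "abcd[dsadsadsa]efg[adf%#1]d";
        System.out.println(stripGreedy(text));  // prints abcdd
        System.out.println(stripLazy(text));    // prints abcdefgd
        System.out.println(stripNegated(text)); // prints abcdefgd
    }
}
```

By default . does not match line terminators, so the lazy pattern will not remove a bracketed section that spans several lines, whereas the negated-class version will.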

Your code below calls toString() on the char array. Arrays do not override toString(), so you get a description of the array object (its type and hash code) rather than its contents, and that is what you print.
String songLyrics = lyricsArray.toString();
System.out.println(songLyrics);
Replace the above two lines with
String songLyrics = new String(lyricsArray);
System.out.println(songLyrics);
Another way, without converting to a char array and back to a string:
String lyricsParagraphElements = "asdasd[asd]";
String songLyrics = lyricsParagraphElements.replaceAll("\\[.*?\\]", ""); // lazy .*? so each [...] group is removed separately
System.out.println(songLyrics);

You're printing a char[], and a Java char[] does not override toString(). Also, a Java String is immutable, but Java does have StringBuilder, which is mutable (and StringBuilder.delete(int, int) can remove arbitrary substrings). You could use it like this:
String songLyrics = lyricsParagraphElements.get(1).text();
StringBuilder sb = new StringBuilder(songLyrics);
int p = 0;
while ((p = sb.indexOf("[", p)) >= 0) {
    int e = sb.indexOf("]", p + 1);
    if (e < 0) {
        break; // no closing bracket left, keep the rest as-is
    }
    sb.delete(p, e + 1); // stay at p: the next '[' may now start right here
}
System.out.println(sb);

You are getting "weird stuff" because you are printing the string representation of the array, not converting the array to a String.
Instead of lyricsArray.toString(), use
new String(lyricsArray);
But if you do this, you will find that you are not actually removing characters from the string, just replacing them with spaces.
Instead, you can shift all of the characters left in the array, and construct the new String only up to the right number of characters:
int src = 0, dst = 0;
while (src < lyricsArray.length) {
while (src < lyricsArray.length && lyricsArray[src] != '[') {
lyricsArray[dst++] = lyricsArray[src++];
}
if (src < lyricsArray.length) {
++src;
while (src - 1 < lyricsArray.length && lyricsArray[src - 1] != ']') {
src++;
}
}
}
String lyricsString = new String(lyricsArray, 0, dst);

This regex matches exactly your case:
\\[([\\w\\%\\#]+)\\]
It gets harder when your plain string contains special symbols. I can't find a shorter regex that doesn't list each special symbol explicitly as an exception.
Reference: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#cg
================
I've read your new case. If the string contains the symbol "-" or anything else in
!"#$%&'()*+,-./:;<=>?#\^_`{|}~
add them (each prefixed with "\\") after \\# in my regex string.

Split by a comma that is not inside parentheses, skipping anything inside them

I know it might be another topic about regexes, but although I searched, I couldn't find a clear answer. So here is my problem: I have a string like this:
{1,2,{3,{4},5},{5,6}}
I'm removing the outermost parentheses (they are there from input, and I don't need them), so now I have this:
1,2,{3,{4},5},{5,6}
And now, I need to split this string into an array of elements, treating everything inside these parentheses as one, "seamless" element:
Arr[0] 1
Arr[1] 2
Arr[2] {3,{4},5}
Arr[3] {5,6}
I have tried doing it using lookahead but so far, I'm failing (miserably). What would be the neatest way of dealing with those things in terms of regex?

You cannot do this if elements like this should be kept together: {{1},{2}}. The reason is that a regex for this is equivalent to parsing the balanced parenthesis language. This language is context-free and cannot be parsed using a regular expression. The best way to handle this is not to use regex but use a for loop with a stack (the stack gives power to parse context-free languages). In pseudo code we could do:
last = 0
for char in input
    if stack is empty and char is ','
        add substring(last, current index) to output array
        last = current index + 1
    if char is '{'
        push '{' on stack
    if char is '}'
        pop from stack
add substring(last, end of input) to output array
This pseudo code will construct the array as desired. Note that it's best to loop over the indexes of the chars in the given string, as you'll need those to determine the boundaries of the substrings to add to the array.
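In Java, the loop above could be sketched as follows. Because there is only one kind of bracket, an integer depth counter stands in for the stack here; that substitution, along with the class name TopLevelSplitter, is an assumption of this sketch rather than part of the answer.

```java
import java.util.ArrayList;
import java.util.List;

final class TopLevelSplitter {
    // Splits on commas that are not enclosed in any { } pair.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int depth = 0; // number of currently open '{'
        int last = 0;  // start index of the element being read
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    throw new IllegalArgumentException("Unmatched '}' at index " + i);
                }
            } else if (c == ',' && depth == 0) {
                parts.add(input.substring(last, i));
                last = i + 1; // skip the comma itself
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Unmatched '{'");
        }
        parts.add(input.substring(last)); // the final element has no trailing comma
        return parts;
    }

    public static void main(String[] args) {
        // prints [1, 2, {3,{4},5}, {5,6}]
        System.out.println(split("1,2,{3,{4},5},{5,6}"));
    }
}
```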

This is close to the requirement, but I'm running out of time and will complete the rest later (one comma is still split incorrectly).
Regex: ,(?=[^}]*(?:{|$))
To check regex validity: Go to http://regexr.com/
To implement this pattern in Java, there is a slight difference: \\ needs to be added before { and }.
Hence, regex for Java Input: ,(?=[^\\}]*(?:\\{|$))
String numbers = "{1,2,{3,{4},5},{5,6}}";
numbers = numbers.substring(1, numbers.length()-1);
String[] separatedValues = numbers.split(",(?=[^\\}]*(?:\\{|$))");
System.out.println(separatedValues[0]);

I could not figure out a regex solution, but here's a non-regex solution. It involves parsing numbers (not in curly braces) before each comma (unless it's the last number in the string) and parsing strings (in curly braces) until the closing curly brace of the group is found.
If a regex solution is found, I'd love to see it.
public static void main(String[] args) throws Exception {
String data = "1,2,{3,{4},5},{5,6},-7,{7,8},{8,{9},10},11";
List<String> list = new ArrayList<>();
for (int i = 0; i < data.length(); i++) {
if ((Character.isDigit(data.charAt(i))) ||
// Include negative numbers
(data.charAt(i) == '-') && (i + 1 < data.length() && Character.isDigit(data.charAt(i + 1)))) {
// Get the number before the comma, unless it's the last number
int commaIndex = data.indexOf(",", i);
String number = commaIndex > -1
? data.substring(i, commaIndex)
: data.substring(i);
list.add(number);
i += number.length();
} else if (data.charAt(i) == '{') {
// Get the group of numbers until you reach the final
// closing curly brace
StringBuilder sb = new StringBuilder();
int openCount = 0;
int closeCount = 0;
do {
if (data.charAt(i) == '{') {
openCount++;
} else if (data.charAt(i) == '}') {
closeCount++;
}
sb.append(data.charAt(i));
i++;
} while (closeCount < openCount);
list.add(sb.toString());
}
}
for (int i = 0; i < list.size(); i++) {
System.out.printf("Arr[%d]: %s\r\n", i, list.get(i));
}
}
Results:
Arr[0]: 1
Arr[1]: 2
Arr[2]: {3,{4},5}
Arr[3]: {5,6}
Arr[4]: -7
Arr[5]: {7,8}
Arr[6]: {8,{9},10}
Arr[7]: 11

How to split this "Tree-like" string in Java regex?

This is the string:
String str = "(S(B1)(B2(B21)(B22)(B23))(B3)())";
The content of a child () may be "", or just a string value, or follow the same pattern recursively, so each nested () is a sub-tree.
Expected result:
str1 is "(S(B1))"
str2 is "(B2(B21)(B22)(B23))" //don't expand sons of a son
str3 is "(B3)"
str4 is "()"
str1-4 are e.g. elements in an Array
How to split the string?
I have a similar question: How to split this string in Java regex? But its answer is not good enough for this one.

Regexes do not have sufficient power to parse balanced/nested brackets. This is essentially the same problem as parsing markup languages such as HTML where the consistent advice is to use special parsers, not regexes.
You should parse this as a tree. In overall terms:
Create a stack.
when you hit a "(" push the next chunk onto the stack.
when you hit a ")" pop the stack.
This takes a few minutes to write and will check that your input is well-formed.
This will save you time almost immediately. Trying to manage regexes for this will become more and more complex and will almost inevitably break down.
UPDATE: If you are only concerned with one level then it can be simpler (NOT debugged):
List<String> subTreeList = new ArrayList<String>();
String s = getMyString();
int level = 0;
int lastOpenBracket = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '(') {
level++;
if (level == 1) {
lastOpenBracket = i;
}
} else if (c == ')') {
if (level == 1) {
subTreeList.add(s.substring(lastOpenBracket, i + 1)); // i + 1 so the closing bracket is included
}
level--;
}
}
I haven't checked it works, and you should debug it. You should also put in checks to make sure you don't have hanging brackets at the end or strange characters at level == 1.
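A stack-based variant that also checks for unbalanced brackets might look like the sketch below. It keeps a stack of the indexes of open brackets and collects the direct children of the outermost (...). The names SubTreeSplitter and children are made up for illustration. Note that for the string in the question it returns "(B1)" as the first element rather than the asker's "(S(B1))"; prefixing the root label would have to be added separately.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class SubTreeSplitter {
    // Returns the direct children of the outermost "(...)" as substrings,
    // without expanding their own children.
    static List<String> children(String s) {
        List<String> result = new ArrayList<>();
        Deque<Integer> openIndexes = new ArrayDeque<>(); // positions of unmatched '('
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                openIndexes.push(i);
            } else if (c == ')') {
                if (openIndexes.isEmpty()) {
                    throw new IllegalArgumentException("Unmatched ')' at index " + i);
                }
                int open = openIndexes.pop();
                // Only the root is still open: we just closed one of its children.
                if (openIndexes.size() == 1) {
                    result.add(s.substring(open, i + 1));
                }
            }
        }
        if (!openIndexes.isEmpty()) {
            throw new IllegalArgumentException("Unmatched '(' at index " + openIndexes.peek());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [(B1), (B2(B21)(B22)(B23)), (B3), ()]
        System.out.println(children("(S(B1)(B2(B21)(B22)(B23))(B3)())"));
    }
}
```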
