How to split this "Tree-like" string in Java regex?

How to split this "Tree-like" string in Java regex? - java

This is the string:
String str = "(S(B1)(B2(B21)(B22)(B23))(B3)())";
Content in a son-() may be "", or just the value of str, or like that pattern, recursively, so a sub-() is a sub-tree.
Expected result:
str1 is "(S(B1))"
str2 is "(B2(B21)(B22)(B23))" //don't expand sons of a son
str3 is "(B3)"
str4 is "()"
str1-4 are e.g. elements in an Array
How to split the string?
I have a fimiliar question: How to split this string in Java regex? But its answer is not good enough for this one.

Regexes do not have sufficient power to parse balanced/nested brackets. This is essentially the same problem as parsing markup languages such as HTML where the consistent advice is to use special parsers, not regexes.
You should parse this as a tree. In overall terms:
Create a stack.
when you hit a "(" push the next chunk onto the stack.
when you hit a ")" pop the stack.
This takes a few minutes to write and will check that your input is well-formed.
This will save you time almost immediately. Trying to manage regexes for this will become more and more complex and will almost inevitably break down.
UPDATE: If you are only concerned with one level then it can be simpler (NOT debugged):
List<String> subTreeList = new ArrayList<String>();
String s = getMyString();
int level = 0;
int lastOpenBracket = -1
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '(') {
level++;
if (level == 1) {
lastOpenBracket = i;
}
} else if (c == ')') {
if (level == 1) {
subStreeList.add(s.substring(lastOpenBracket, i);
}
level--;
}
}
I haven't checked it works, and you should debug it. You should also put checks to make sure you
don't have hanging brackets at the end or strange characters at level == 1;

Related

Split a string when the separator can be nested

I'm having some trouble while trying to split a string with a nested separator.
My String would be like "(a,b(1,2,3),c,d(a,b,c))".
How could I get an array ["a","b(1,2,3)","c","d(a,b,c)"] ?
I obviously can't use .split(","), since it would also split my sub-strings.

Here is a straight forward non-recursive function that splits your string the way you want:
private String[] specialSplit(String s) {
List<String> result = new ArrayList<>();
StringBuilder sb = new StringBuilder();
int parenCount = 0;
for (int i = 1; i < s.length() - 1; i++) { // go from 1 to length -1 to discard the surrounding ()
char c = s.charAt(i);
if (c == '(') parenCount++;
else if (c == ')') parenCount--;
if (parenCount == 0 && c == ',') {
result.add(sb.toString());
sb.setLength(0); // clear string builder
} else {
sb.append(c);
}
}
result.add(sb.toString());
return result.toArray(new String[0]);
}
Basically, we iterate through all the characters of the string keeping track of the parentheses. The first and last parentheses are not considered. We only split the string when we have seen the same amount of opening and closing parentheses and when the current character is ','.
This method will likely run much faster than any regex you may find.

A recursive function should work here, just not with plain split(). Try parsing your string character by character and act whenever you encounter a comma or paranthesis: , means you create a new element, ( you start a new nested list, ) means you finish the current nested list. This should even work with a more "unrolled" approach (i.e. no recursion but handling the nesting in a data structure).

String manipulation of function names

For this Kata, i am given random function names in the PEP8 format and i am to convert them to camelCase.
(input)get_speed == (output)getSpeed ....
(input)set_distance == (output)setDistance
I have a understanding on one way of doing this written in pseudo-code:
loop through the word,
if the letter is an underscore
then delete the underscore
then get the next letter and change to a uppercase
endIf
endLoop
return the resultant word
But im unsure the best way of doing this, would it be more efficient to create a char array and loop through the element and then when it comes to finding an underscore delete that element and get the next index and change to uppercase.
Or would it be better to use recursion:
function camelCase takes a string
if the length of the string is 0,
then return the string
endIf
if the character is a underscore
then change to nothing,
then find next character and change to uppercase
return the string taking away the character
endIf
finally return the function taking the first character away
Any thoughts please, looking for a good efficient way of handing this problem. Thanks :)

I would go with this:
divide given String by underscore to array
from second word until end take first letter and convert it to uppercase
join to one word
This will work in O(n) (go through all names 3 time). For first case, use this function:
str.split("_");
for uppercase use this:
String newName = substring(0, 1).toUpperCase() + stre.substring(1);
But make sure you check size of the string first...
Edited - added implementation
It would look like this:
public String camelCase(String str) {
if (str == null ||str.trim().length() == 0) return str;
String[] split = str.split("_");
String newStr = split[0];
for (int i = 1; i < split.length; i++) {
newStr += split[i].substring(0, 1).toUpperCase() + split[i].substring(1);
}
return newStr;
}
for inputs:
"test"
"test_me"
"test_me_twice"
it returns:
"test"
"testMe"
"testMeTwice"

It would be simpler to iterate over the string instead of recursing.
String pep8 = "do_it_again";
StringBuilder camelCase = new StringBuilder();
for(int i = 0, l = pep8.length(); i < l; ++i) {
if(pep8.charAt(i) == '_' && (i + 1) < l) {
camelCase.append(Character.toUpperCase(pep8.charAt(++i)));
} else {
camelCase.append(pep8.charAt(i));
}
}
System.out.println(camelCase.toString()); // prints doItAgain

The question you pose is whether to use an iterative or a recursive approach. For this case I'd go for the recursive approach because it's straightforward, easy to understand doesn't require much resources (only one array, no new stackframe etc), though that doesn't really matter for this example.
Recursion is good for divide-and-conquer problems, but I don't see that fitting the case well, although it's possible.
An iterative implementation of the algorithm you described could look like the following:
StringBuilder buf = new StringBuilder(input);
for(int i = 0; i < buf.length(); i++){
if(buf.charAt(i) == '_'){
buf.deleteCharAt(i);
if(i != buf.length()){ //check fo EOL
buf.setCharAt(i, Character.toUpperCase(buf.charAt(i)));
}
}
}
return buf.toString();
The check for the EOL is not part of the given algorithm and could be ommitted, if the input string never ends with '_'

Split by a comma that is not inside parentheses, skipping anything inside them

I know it might be another topic about regexes, but despite I searched it, I couldn't get the clear answer. So here is my problem- I have a string like this:
{1,2,{3,{4},5},{5,6}}
I'm removing the most outside parentheses (they are there from input, and I don't need them), so now I have this:
1,2,{3,{4},5},{5,6}
And now, I need to split this string into an array of elements, treating everything inside these parentheses as one, "seamless" element:
Arr[0] 1
Arr[1] 2
Arr[2] {3,{4},5}
Arr[3] {5,6}
I have tried doing it using lookahead but so far, I'm failing (miserably). What would be the neatest way of dealing with those things in terms of regex?

You cannot do this if elements like this should be kept together: {{1},{2}}. The reason is that a regex for this is equivalent to parsing the balanced parenthesis language. This language is context-free and cannot be parsed using a regular expression. The best way to handle this is not to use regex but use a for loop with a stack (the stack gives power to parse context-free languages). In pseudo code we could do:
for char in input
if stack is empty and char is ','
add substring(last, current position) to output array
last = current index
if char is '{'
push '{' on stack
if char is '}'
pop from stack
This pseudo code will construct the array as desired, note that it's best to loop over the indexes of the chars in the given string as you'll need those to determine the boundaries of the substrings to add to the array.

Almost near to the requirement. Running out of time. Will complete rest later (A single comma is incorrect).
Regex: ,(?=[^}]*(?:{|$))
To check regex validity: Go to http://regexr.com/
To implement this pattern in Java, there is a slight difference. \ needs to be added before { and }.
Hence, regex for Java Input: ,(?=[^\\}]*(?:\\{|$))
String numbers = {1,2,{3,{4},5},{5,6}};
numbers = numbers.substring(1, numbers.length()-1);
String[] separatedValues = numbers.split(",(?=[^\\}]*(?:\\{|$))");
System.out.println(separatedValues[0]);

Could not figure out a regex solution, but here's a non-regex solution. It involves parsing numbers (not in curly braces) before each comma (unless its the last number in the string) and parsing strings (in curly braces) until the closing curly brace of the group is found.
If regex solution is found, I'd love to see it.
public static void main(String[] args) throws Exception {
String data = "1,2,{3,{4},5},{5,6},-7,{7,8},{8,{9},10},11";
List<String> list = new ArrayList();
for (int i = 0; i < data.length(); i++) {
if ((Character.isDigit(data.charAt(i))) ||
// Include negative numbers
(data.charAt(i) == '-') && (i + 1 < data.length() && Character.isDigit(data.charAt(i + 1)))) {
// Get the number before the comma, unless it's the last number
int commaIndex = data.indexOf(",", i);
String number = commaIndex > -1
? data.substring(i, commaIndex)
: data.substring(i);
list.add(number);
i += number.length();
} else if (data.charAt(i) == '{') {
// Get the group of numbers until you reach the final
// closing curly brace
StringBuilder sb = new StringBuilder();
int openCount = 0;
int closeCount = 0;
do {
if (data.charAt(i) == '{') {
openCount++;
} else if (data.charAt(i) == '}') {
closeCount++;
}
sb.append(data.charAt(i));
i++;
} while (closeCount < openCount);
list.add(sb.toString());
}
}
for (int i = 0; i < list.size(); i++) {
System.out.printf("Arr[%d]: %s\r\n", i, list.get(i));
}
}
Results:
Arr[0]: 1
Arr[1]: 2
Arr[2]: {3,{4},5}
Arr[3]: {5,6}
Arr[4]: -7
Arr[5]: {7,8}
Arr[6]: {8,{9},10}
Arr[7]: 11

RegEx to find URLs in HTML takes 25 seconds in Java/Android

In Android/Java, given a website's HTML source code, I would like to extract all XML and CSV file paths.
What I am doing (with RegEx) is this:
final HashSet<String> urls = new HashSet<String>();
final Pattern urlRegex = Pattern.compile(
"[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|].(xml|csv)");
final Matcher url = urlRegex.matcher(htmlString);
while (url.find()) {
urls.add(makeAbsoluteURL(url.group(0)));
}
public String makeAbsoluteURL(String url) {
if (url.startsWith("http://") || url.startsWith("http://")) {
return url;
}
else if (url.startsWith("/")) {
return mRootURL+url.substring(1);
}
else {
return mBaseURL+url;
}
}
Unfortunately, this runs for about 25 seconds for an average website with normal length. What is going wrong? Is my RegEx just bad? Or is RegEx just so slow?
Can I find the URLs faster without RegEx?
Edit:
The source for the valid characters was (roughly) this answer. However, I think the two character classes (square brackets) must be swapped so that you have a more limited character set for the first char of the URL and a broader character class for all remaining chars. This was the intention.

Your regex is written in a way that makes it slow for long inputs.
The * operator is greedy.
For instance for input:
http://stackoverflow.com/questions/19019504/regex-to-find-urls-in-html-takes-25-seconds-in-java-android.xml
The [-a-zA-Z0-9+&##/%?=~_|!:,.;]* part of the regex will consume the whole string. It will then try to match the next character group, which will fail (since whole string is consumed). It will then backtrack in match of first part of the regex by one character and try to match the second character group again. It will match. Then it will try to match the dot and fail because the whole string is consumed. Another backtrack etc...
In essence your regex is forcing a lot of backtracking to match anything. It will also waste a lot of time on matches that have no way of succeeding.
For word forest it will first consume whole word in the first part of expression and then repeatedly backtrack after failing to match the rest of expression. Huge waste of time.
Also:
the . in regex is unescaped and it will match ANY character.
url.group(0) is redundant. url.group() has same meaning
In order to speed up the regex you need to figure out a way to reduce the amount of backtracking and it would also help if you had a less general start of the match. Right now every single word will cause matching to start and generally fail. For instance typically in html all the links are inside 2 ". If that's the case you can start your matching at " which will speed it up tremendously. Try to find a better start of the expression.

I've nothing the say in the theoretical overview that U Mad did, he highlighted everything I'd noticed.
What I would like to suggest you, considering what are you look for with the RE, is to change the point of view of your RE :)
You are looking for xml and csv files, so why don't you reverse the html string, for example using:
new StringBuilder("bla bla bla foo letme/find.xml bla bla").reverse().toString()
after that you could look for the pattern:
final Pattern urlRegex = Pattern.compile(
"(vsc|lmx)\\.[-a-zA-Z0-9+&##/%=~_|][-a-zA-Z0-9+&##/%?=~_|!:,.;]*");
urlRegex pattern could be refined as U Mad has already suggested. But in this way you could reduce the number of failed matches.

I had my doubts, if there can be a String really long enough to take 25 seconds for parsing. So I tried and must admit now, that with about 27MB of text, it takes around 25 seconds to parse it with the given regular expression.
Being curious I changed the little test program with #FabioDch's approach (so, please vote for him, if you want to vote anywhere :-)
The result is quite impressing: Instead of 25 Seconds, #FabioDch's approach needed less then 1 second (100ms to 800ms) + 70ms to 85ms for reversing!
Here's the code I used. It reads text from the largest text file I've found and copies it 10 time to get 27MB of text. Then runs the regex against it and prints out the results.
#Test
public final void test() throws IOException {
final Pattern urlRegex = Pattern.compile("(lmx|vsc)\\.[-a-zA-Z0-9+&##/%=~_|][-a-zA-Z0-9+&##/%?=~_|!:,.;]*");
printTimePassed("initialized");
List<String> lines = Files.readAllLines(Paths.get("testdata", "Aster_Express_User_Guide_0500.txt"), Charset.defaultCharset());
StringBuilder sb = new StringBuilder();
for(int i=0; i<10; i++) { // Copy 10 times to get more useful data
for(String line : lines) {
sb.append(line);
sb.append('\n');
}
}
printTimePassed("loaded: " + lines.size() + " lines, in " + sb.length() + " chars");
String html = sb.reverse().toString();
printTimePassed("reversed");
int i = 0;
final Matcher url = urlRegex.matcher(html);
while (url.find()) {
System.out.println(i++ + ": FOUND: " + new StringBuilder(url.group()).reverse() + ", " + url.start() + ", " + url.end());
}
printTimePassed("ready");
}
private void printTimePassed(String msg) {
long current = System.currentTimeMillis();
System.out.printf("%s: took %d ms\n", msg, (current - ms));
ms = current;
}

Would suggest only using the regex to find file extensions (.xml or .csv). This should be a lot faster and when found, you can look backwards, examining each character before and stop when you reach one that couldn't be in a URL - see below:
final HashSet<String> urls = new HashSet<String>();
final Pattern fileExtRegex = Pattern.compile("\\.(xml|csv)");
final Matcher fileExtMatcher = fileExtRegex.matcher(htmlString);
// Find next occurrence of ".xml" or ".csv" in htmlString
while (fileExtMatcher.find()) {
// Go backwards from the character just before the file extension
int dotPos = fileExtMatcher.start() - 1;
int charPos = dotPos;
while (charPos >= 0) {
// Break if current character is not a valid URL character
char chr = htmlString.charAt(charPos);
if (!((chr >= 'a' && chr <= 'z') ||
(chr >= 'A' && chr <= 'Z') ||
(chr >= '0' && chr <= '9') ||
chr == '-' || chr == '+' || chr == '&' || chr == '#' ||
chr == '#' || chr == '/' || chr == '%' || chr == '?' ||
chr == '=' || chr == '~' || chr == '|' || chr == '!' ||
chr == ':' || chr == ',' || chr == '.' || chr == ';')) {
break;
}
charPos--;
}
// Extract/add URL if there are valid URL characters before file extension
if ((dotPos > 0) && (charPos < dotPos)) {
String url = htmlString.substring(charPos + 1, fileExtMatcher.end());
urls.add(makeAbsoluteURL(url));
}
}
Small disclaimer: I used part of your original regex for valid URL characters: [-a-zA-Z0-9+&##/%?=~_|!:,.;]. Haven't verified if this is comprehensive and there are perhaps further improvements that could be made, e.g. it would currently find local file paths (e.g. C:\TEMP\myfile.xml) as well as URLs. Wanted to keep the code above simple to demonstrate the technique so haven't tackled this.
EDIT Following the comment about effiency I've modified to no longer use a regex for checking valid URL characters. Instead, it compares the character against valid ranges manually. Uglier code but should be faster...

I know people love to use regex to parse html, but have you considered using jsoup?

For sake of clarity I created a separate answer for this regex:
Edited to escape the dot and remove reluctant quant.
(?<![-a-zA-Z0-9+&##/%=~_|])[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|]‌\\.(xml|csv)
Please try this one and tell me how it goes.
Also here's a class which will enable you to search a reversed string without actually reversing it:
public class ReversedString implements CharSequence {
public ReversedString(String input) {
this.s = input;
this.len = s.length();
}
private final String s;
private final int len;
#Override
public CharSequence subSequence(final int start, final int end) {
return new CharSequence() {
#Override
public CharSequence subSequence(int start, int end) {
throw new UnsupportedOperationException();
}
#Override
public int length() {
return end-start;
}
#Override
public char charAt(int index) {
return s.charAt(len-start-index-1);
}
#Override
public String toString() {
StringBuilder buf = new StringBuilder(end-start);
for(int i = start;i < end;i++) {
buf.append(s.charAt(len-i-1));
}
return buf.toString();
}
};
}
#Override
public int length() {
return len;
}
#Override
public char charAt(int index) {
return s.charAt(len-1-index);
}
}
You can use this class as such:
pattern.matcher(new ReversedString(inputString));

Efficiently removing specific characters (some punctuation) from Strings in Java?

In Java, what is the most efficient way of removing given characters from a String? Currently, I have this code:
private static String processWord(String x) {
String tmp;
tmp = x.toLowerCase();
tmp = tmp.replace(",", "");
tmp = tmp.replace(".", "");
tmp = tmp.replace(";", "");
tmp = tmp.replace("!", "");
tmp = tmp.replace("?", "");
tmp = tmp.replace("(", "");
tmp = tmp.replace(")", "");
tmp = tmp.replace("{", "");
tmp = tmp.replace("}", "");
tmp = tmp.replace("[", "");
tmp = tmp.replace("]", "");
tmp = tmp.replace("<", "");
tmp = tmp.replace(">", "");
tmp = tmp.replace("%", "");
return tmp;
}
Would it be faster if I used some sort of StringBuilder, or a regex, or maybe something else? Yes, I know: profile it and see, but I hope someone can provide an answer of the top of their head, as this is a common task.

Although \\p{Punct} will specify a wider range of characters than in the question, it does allow for a shorter replacement expression:
tmp = tmp.replaceAll("\\p{Punct}+", "");

Here's a late answer, just for fun.
In cases like this, I would suggest aiming for readability over speed. Of course you can be super-readable but too slow, as in this super-concise version:
private static String processWord(String x) {
return x.replaceAll("[][(){},.;!?<>%]", "");
}
This is slow because everytime you call this method, the regex will be compiled. So you can pre-compile the regex.
private static final Pattern UNDESIRABLES = Pattern.compile("[][(){},.;!?<>%]");
private static String processWord(String x) {
return UNDESIRABLES.matcher(x).replaceAll("");
}
This should be fast enough for most purposes, assuming the JVM's regex engine optimizes the character class lookup. This is the solution I would use, personally.
Now without profiling, I wouldn't know whether you could do better by making your own character (actually codepoint) lookup table:
private static final boolean[] CHARS_TO_KEEP = new boolean[];
Fill this once and then iterate, making your resulting string. I'll leave the code to you. :)
Again, I wouldn't dive into this kind of optimization. The code has become too hard to read. Is performance that much of a concern? Also remember that modern languages are JITted and after warming up they will perform better, so use a good profiler.
One thing that should be mentioned is that the example in the original question is highly non-performant because you are creating a whole bunch of temporary strings! Unless a compiler optimizes all that away, that particular solution will perform the worst.

You could do something like this:
static String RemovePunct(String input)
{
char[] output = new char[input.length()];
int i = 0;
for (char ch : input.toCharArray())
{
if (Character.isLetterOrDigit(ch) || Character.isWhitespace(ch))
{
output[i++] = ch;
}
}
return new String(output, 0, i);
}
// ...
String s = RemovePunct("This is (a) test string.");
This will likely perform better than using regular expressions, if you find them to slow for your needs.
However, it could get messy fast if you have a long, distinct list of special characters you'd like to remove. In this case regular expressions are easier to handle.
http://ideone.com/mS8Irl

Strings are immutable so its not good to try and use them very dynamically try using StringBuilder instead of String and use all of its wonderful methods! It will let you do anything you want. Plus yes if you have something your trying to do, figure out the regex for it and it will work a lot better for you.

Use String#replaceAll(String regex, String replacement) as
tmp = tmp.replaceAll("[,.;!?(){}\\[\\]<>%]", "");
System.out.println(
"f,i.l;t!e?r(e)d {s}t[r]i<n>g%".replaceAll(
"[,.;!?(){}\\[\\]<>%]", "")); // prints "filtered string"

Right now your code will iterate over all characters of tmp and compare them with all possible characters that you want to remove, so it will use
number of tmp characters x number or characters you want to remove comparisons.
To optimize your code you could use short circuit OR || and do something like
StringBuilder sb = new StringBuilder();
for (char c : tmp.toCharArray()) {
if (!(c == ',' || c == '.' || c == ';' || c == '!' || c == '?'
|| c == '(' || c == ')' || c == '{' || c == '}' || c == '['
|| c == ']' || c == '<' || c == '>' || c == '%'))
sb.append(c);
}
tmp = sb.toString();
or like this
StringBuilder sb = new StringBuilder();
char[] badChars = ",.;!?(){}[]<>%".toCharArray();
outer:
for (char strChar : tmp.toCharArray()) {
for (char badChar : badChars) {
if (badChar == strChar)
continue outer;// we skip `strChar` since it is bad character
}
sb.append(strChar);
}
tmp = sb.toString();
This way you will iterate over every tmp characters but number of comparisons for that character can decrease if it is not % (because it will be last comparison, if character would be . program would get his result in one comparison).
If I am not mistaken this approach is used with character class ([...]) so maybe try it this way
Pattern p = Pattern.compile("[,.;!?(){}\\[\\]<>%]"); //store it somewhere so
//you wont need to compile it again
tmp = p.matcher(tmp).replaceAll("");

You can do this:
tmp.replaceAll("\\W", "");
to remove punctuation

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.