I need to identify {scope} in text, such as source code.
I'm starting with just a single line and will later expand the search to multiple lines and exclude comments. I already have working code using Pattern and Matcher, but I would like a critique of how to improve such a search.
String line = "{{outside{inside}{inside2}}};";
String scopeOf = "outside";
findscope(line,scopeOf);
private static void findscope(String line,
String scopeOf) {
int layer = 1;
Pattern p = Pattern.compile(scopeOf);
Matcher m = p.matcher(line);
if (m.find()) {
int scopestart = m.start();
int scopeEnd = Integer.MIN_VALUE;
m.usePattern(Pattern.compile("\\{|\\}"));
while (m.find()) {
String group = m.group();
if (group.equals("{")) {
layer++;
} else if (group.equals("}")) {
layer--;
}
if (layer == 0) {
scopeEnd = m.start();
break;
}
}
System.out.println("Scope of " + scopeOf + " starts at " + scopestart +
" finishes at " + scopeEnd);
}
}
Well, you are using the wrong tool for the job (assuming you are also looking for nested scopes).
Note that regex (in its traditional form) stands for Regular Expression, which is a way to describe a regular language.
However, the language L = { all words with legal scopings } is not regular, and thus cannot be recognized by a traditional regex.
This language is actually a context-free language, and can be represented by a context-free grammar.
For parsing:
For relatively simple languages (scoping is among them), a deterministic push-down automaton is enough to verify them.
Some languages require a non-deterministic push-down automaton, which cannot be simulated as efficiently, but there is a dynamic programming algorithm to parse them as well.
As a side note, there are tools such as JavaCC that you can use to parse (and generate code/output). Have a look at them, but if you are only interested in the scoping issue, they are probably overkill.
Edit - pseudo code:
curr <- 0
count <- 0 // integer imitates the stack for this simple usage
l <- string.length()
while (curr < l):
    if string.charAt(curr) == '{':
        count++;
    else if string.charAt(curr) == '}':
        if count <= 0: // closing brace without a matching opening brace
            return ERROR;
        count--;
    curr++;
if count != 0:
    return ERROR;
return SUCCESS;
Note that here an integer is enough to imitate the stack: an increment is effectively a push() and a decrement is a pop().
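A Java sketch of the counter idea, applied to the example line from the question. The class and method names (BraceScopes, isBalanced, findClosingBrace) are made up for illustration, and the sketch assumes the text contains no braces inside comments or string literals.

```java
final class BraceScopes {

    /** Returns true if every '{' has a matching '}' and no '}' comes too early. */
    static boolean isBalanced(String text) {
        int depth = 0; // the counter stands in for the stack
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                depth++;              // push
            } else if (c == '}') {
                if (depth == 0) {
                    return false;     // pop on an empty stack
                }
                depth--;              // pop
            }
        }
        return depth == 0;
    }

    /** Returns the index of the '}' that closes the '{' at openIndex, or -1 if there is none. */
    static int findClosingBrace(String text, int openIndex) {
        if (openIndex < 0 || openIndex >= text.length() || text.charAt(openIndex) != '{') {
            throw new IllegalArgumentException("no '{' at index " + openIndex);
        }
        int depth = 0;
        for (int i = openIndex; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "{{outside{inside}{inside2}}};";
        System.out.println(isBalanced(line)); // true
        // The scope that contains "outside" opens at the nearest '{' to its left.
        int open = line.lastIndexOf('{', line.indexOf("outside"));
        System.out.println(findClosingBrace(line, open)); // 26
    }
}
```

The end index 26 agrees with what the code in the question reports for "outside".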
As the title indicates, how do I capture unpaired brackets or parentheses with a regex in Java? I am new to Java. For instance, suppose I have the string below:
Programming is productive, (achieving a lot, and getting good results), it is often 1) demanding and 2) costly.
How do I capture 1) and 2).
I have tried:
([^\(\)][\)])
But, the result I am getting includes s) as below, instead of 1) and 2):
s), 1) and 2)
I have checked the link: Regular expression to match balanced parentheses, but the question seems to refer to recursive or nested structures, which is quite different from my situation.
My situation is to match the right parenthesis or right bracket, along with any associated text that does not have an associated left parenthesis or bracket.
Maybe,
\b\d+\)
might simply return the desired output, I guess.
Another way is to look at the left boundary you might have (in this case, I see digits), then at which other characters could appear before the closing parenthesis, and then design some other simple expression similar to:
\b\d[^)]*\)
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression{
public static void main(String[] args){
final String regex = "\\b\\d[^)]*\\)";
final String string = "Programming is productive, (achieving a lot, and getting good results), it is often 1) demanding and 2) costly.\n\n"
+ "Programming is productive, (achieving a lot, and getting good results), it is often 1a b) demanding and 2a a) costly.\n\n\n"
+ "Programming is productive, (achieving a lot, and getting good results), it is often 1b) demanding and 2b) costly.\n\n"
+ "It is not supposed to match ( s s 1) \n";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
Output
Full match: 1)
Full match: 2)
Full match: 1a b)
Full match: 2a a)
Full match: 1b)
Full match: 2b)
Full match: 1)
This is not a regex solution (obviously), but I can't think of a good way to do it with a regex. It simply uses a counter to keep track of the parens and a stack to hold the text in front of each closing one.
For the input String "(*(**)**) first) second) (**) (*ksks*) third) ** fourth)( **)"
It prints out
first)
second)
third)
fourth)
All other parentheses are ignored because they are matched.
String s =
"(*(**)**) first) second) (**) (*ksks*) third) ** fourth)( **)";
List<String> found = new ArrayList<>();
Stack<Character> tokens = new Stack<>();
int pcount = 0;
for (char c : s.toCharArray()) {
switch (c) {
case ' ':
tokens.clear();
break;
case '(':
pcount++;
break;
case ')':
pcount--;
if (pcount == -1) {
String v = ")";
while (!tokens.isEmpty()) {
v = tokens.pop() + v;
}
found.add(v);
pcount = 0;
}
break;
default:
tokens.push(c);
}
}
found.forEach(System.out::println);
Note: Integrating square brackets into the above would be a challenge (though not impossible), because one would need to handle constructs like ( [ ) ], where the intended interpretation is unclear. That's why requirements of this sort need to be spelled out precisely.
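To illustrate why the interpretation matters, here is a minimal sketch that picks one possible rule: strict nesting, where a closer must match the most recently opened bracket. The class and method names are hypothetical. Under this rule ( [ ) ] is rejected at the ")"; a different requirement (for example, counting each bracket type independently) would accept it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class MixedBrackets {

    /** Returns the index of the first offending bracket, or -1 if all are properly nested. */
    static int firstMismatch(String text) {
        Deque<Integer> open = new ArrayDeque<>(); // indices of openers not yet closed
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(' || c == '[') {
                open.push(i);
            } else if (c == ')' || c == ']') {
                char expected = (c == ')') ? '(' : '[';
                if (open.isEmpty() || text.charAt(open.peek()) != expected) {
                    return i; // closer without a matching opener on top of the stack
                }
                open.pop();
            }
        }
        // Anything left on the stack was never closed; report the innermost one.
        return open.isEmpty() ? -1 : open.peek();
    }

    public static void main(String[] args) {
        System.out.println(firstMismatch("([])"));    // -1
        System.out.println(firstMismatch("( [ ) ]")); // 4
        System.out.println(firstMismatch("first)"));  // 5
    }
}
```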
I have a use case where I have a line of text containing nesting tokens (like { and }), and I wish to transform certain substrings nested at specific depths.
Example, capitalize the word moo at depth 1:
moo [moo [moo moo]] moo ->
moo [MOO [moo moo]] moo
Achieved by:
replaceTokens(input, 1, "[", "]", "moo", String::toUpperCase);
Or, a real-world example: wrap every "--option" that is not already colored in the cyan color sequence:
#|blue --ignoreLog|# works, but --ignoreOutput silences everything. ->
#|blue --ignoreLog|# works, but #|cyan --ignoreOutput|# silences everything.
Achieved by:
replaceTokens(input, 0, "#|", "|#", "--\\w*", s -> format("#|cyan %s|#", s));
I have implemented this logic and though I feel pretty good about it (except performance probably), I also feel I reinvented the wheel. Here's how I implemented it:
set currentPos to zero
while (input line not fully consumed) {
take the remaining line
if the open token is matched, add to output, increase counter and advance pos accordingly
else if the close token is matched, add to output, decrease counter and advance pos accordingly
else if the counter matches provided depth and given regex matches, invoke replacer function and advance pos accordingly
else just record the next character and advance pos by 1
}
Here's the actual implementation:
public static String replaceNestedTokens(String lineWithTokens, int nestingDepth, String tokenOpen, String tokenClose, String tokenRegexToReplace, Function<String, String> tokenReplacer) {
final Pattern startsWithOpen = compile(quote(tokenOpen));
final Pattern startsWithClose = compile(quote(tokenClose));
final Pattern startsWithTokenToReplace = compile(format("(?<token>%s)", tokenRegexToReplace));
final StringBuilder lineWithTokensReplaced = new StringBuilder();
int countOpenTokens = 0;
int pos = 0;
while (pos < lineWithTokens.length()) {
final String remainingLine = lineWithTokens.substring(pos);
if (startsWithOpen.matcher(remainingLine).lookingAt()) {
countOpenTokens++;
lineWithTokensReplaced.append(tokenOpen);
pos += tokenOpen.length();
} else if (startsWithClose.matcher(remainingLine).lookingAt()) {
countOpenTokens--;
lineWithTokensReplaced.append(tokenClose);
pos += tokenClose.length();
} else if (countOpenTokens == nestingDepth) {
Matcher startsWithTokenMatcher = startsWithTokenToReplace.matcher(remainingLine);
if (startsWithTokenMatcher.lookingAt()) {
String matchedToken = startsWithTokenMatcher.group("token");
lineWithTokensReplaced.append(tokenReplacer.apply(matchedToken));
pos += matchedToken.length();
} else {
lineWithTokensReplaced.append(lineWithTokens.charAt(pos++));
}
} else {
lineWithTokensReplaced.append(lineWithTokens.charAt(pos++));
}
assumeTrue(countOpenTokens >= 0, "Unbalanced token sets: closed token without open token\n\t" + lineWithTokens);
}
assumeTrue(countOpenTokens == 0, "Unbalanced token sets: open token without closed token\n\t" + lineWithTokens);
return lineWithTokensReplaced.toString();
}
I couldn't make it work with a regex like this or this (or Scanner) solution, but I feel I'm reinventing the wheel and could solve this with (vanilla Java) out-of-the-box classes with less code. Also, I'm pretty sure this is a performance nightmare with all the inline patterns/matcher instances and substrings.
Suggestions?
You could use a parser generator like ANTLR to write a grammar that describes your language or syntax, then use a listener or visitor to interpret the tokens.
A sample of the grammar would be like this (what I can infer from your code):
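grammar Expr;
prog: (expr NEWLINE)* ;
expr: WORD '[' expr ']'
    | '#|' expr '|#'
    | '--ignoreLog' expr
    | '--ignoreOutput' expr
    | WORD
    ;
// Character sets are only allowed in lexer rules, which start with an uppercase letter.
WORD : [a-zA-Z0-9]+ ;
NEWLINE : [\r\n]+ ;
WS : [ \t]+ -> skip ;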
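If a parser generator is more than the task needs, the hand-written loop from the question can also be kept and made cheaper. The following is only a sketch under assumptions: NestedTokenReplacer and replaceAtDepth are hypothetical names, the literal open/close tokens are compared with String.startsWith(prefix, offset) instead of a regex, and a single Matcher is moved along the line with region() so that no substring is created per position. Empty regex matches are skipped so that the position always advances.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestedTokenReplacer {

    static String replaceAtDepth(String line, int depth, String open, String close,
                                 String regex, Function<String, String> replacer) {
        // One matcher for the whole line; region() moves its window without copying text.
        Matcher token = Pattern.compile(regex).matcher(line);
        StringBuilder out = new StringBuilder(line.length());
        int level = 0;
        int pos = 0;
        while (pos < line.length()) {
            if (line.startsWith(open, pos)) {
                level++;
                out.append(open);
                pos += open.length();
            } else if (line.startsWith(close, pos)) {
                if (--level < 0) {
                    throw new IllegalArgumentException("close token without open token at " + pos);
                }
                out.append(close);
                pos += close.length();
            } else if (level == depth
                    && token.region(pos, line.length()).lookingAt()
                    && token.end() > pos) { // ignore empty matches so pos always advances
                out.append(replacer.apply(token.group()));
                pos = token.end();
            } else {
                out.append(line.charAt(pos++));
            }
        }
        if (level != 0) {
            throw new IllegalArgumentException("open token without close token");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAtDepth("moo [moo [moo moo]] moo", 1, "[", "]",
                "moo", String::toUpperCase));
        System.out.println(replaceAtDepth(
                "#|blue --ignoreLog|# works, but --ignoreOutput silences everything.",
                0, "#|", "|#", "--\\w*", s -> "#|cyan " + s + "|#"));
    }
}
```

With the default (opaque, anchoring) region bounds, the regex sees the region start as the start of input, which mirrors what matching against a substring did in the original code.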
I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient for hundreds of keywords that I will search for in thousands of documents.
What can I do to find the keywords efficiently (without falsely matching a longer word that merely contains the keyword)?
For example:
String keyword="ac";
String document = "..."; // a file a few pages long
If I use:
if(document.contains(keyword) ){
//do something
}
it will also return true if the document contains a word like "account",
so I tried to use a regular expression as follows:
String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
//do something
}
Summary (hopefully it will be useful to someone else):
My regular expression would work in principle, but it is extremely impractical when working with big data (it didn't terminate).
#anubhava perfected the regular expression. It was easy to understand and implement, and it managed to terminate, which is a big thing, but it was still a bit slow (roughly 240 seconds).
#Tomalak's solution is a bit more complex to implement and understand, but it was the fastest solution (18 seconds), so hats off, mate.
So #Tomalak's solution was roughly 13 times faster than #anubhava's.
I don't think you need .* in your regex.
Try this regex:
String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";
Here \\b matches a word boundary. If the keyword can start or end with a non-word character (such as + or #), the word boundary next to that character may fail to match.
Also, you must use Pattern.quote if your keyword contains special regex characters.
EDIT: You might use this regex if your keywords are separated by whitespace.
String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
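A short sketch of how the two patterns behave, assuming each keyword's Pattern is compiled once and then reused across all documents. KeywordSearch, wordPattern, and tokenPattern are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

final class KeywordSearch {

    /** Whole-word pattern; suitable when the keyword starts and ends with a word character. */
    static Pattern wordPattern(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
    }

    /** Whitespace-delimited pattern; also works for keywords such as "c++". */
    static Pattern tokenPattern(String keyword) {
        return Pattern.compile("(?<=\\s|^)" + Pattern.quote(keyword) + "(?=\\s|$)");
    }

    public static void main(String[] args) {
        Pattern ac = wordPattern("ac"); // compile once, reuse for every document
        System.out.println(ac.matcher("Open an account today.").find()); // false
        System.out.println(ac.matcher("The ac unit is broken.").find()); // true

        // "c++" ends with non-word characters, so \b after it cannot match before a space.
        String doc = "I like c++ a lot";
        System.out.println(wordPattern("c++").matcher(doc).find());  // false
        System.out.println(tokenPattern("c++").matcher(doc).find()); // true
    }
}
```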
One of the fastest ways to find a literal substring in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {
public static void main(String[] args) {
String input = "There are longer strings than this not very long one.";
String search = "long";
int index = indexOfWord(input, search);
if (index > -1) {
System.out.println("Hit for \"" + search + "\" at position " + index + ".");
} else {
System.out.println("No hit for \"" + search + "\".");
}
}
public static int indexOfWord(String input, String word) {
String nonWord = "^\\W?$", before, after;
int index, before_i, after_i = 0;
while (true) {
index = input.indexOf(word, after_i);
if (index == -1 || word.isEmpty()) break;
before_i = index - 1;
after_i = index + word.length();
before = "" + (before_i > -1 ? input.charAt(before_i) : "");
after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
if (before.matches(nonWord) && after.matches(nonWord)) {
return index;
}
}
return -1;
}
}
This would print:
Hit for "long" at position 44.
This should perform better than a pure regular expressions approach.
Consider whether ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost" matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.
I've created a Gist with an alternative implementation that does that.
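The Gist itself is not reproduced here. As a rough sketch of the idea (not the author's actual implementation), the boundary check can be done with Character.isLetterOrDigit instead of a regex. The names below are hypothetical, and the definition of a "word character" is an assumption: unlike \W, this version treats the underscore as a non-word character.

```java
final class WordIndex {

    /** Returns the index of the first whole-word occurrence of word in input, or -1. */
    static int indexOfWord(String input, String word) {
        if (word.isEmpty()) {
            return -1;
        }
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) {
                return -1;
            }
            int end = index + word.length();
            // A hit counts only if it is not glued to a letter or digit on either side.
            boolean startOk = index == 0 || !Character.isLetterOrDigit(input.charAt(index - 1));
            boolean endOk = end == input.length() || !Character.isLetterOrDigit(input.charAt(end));
            if (startOk && endOk) {
                return index;
            }
            from = index + 1; // keep searching after this rejected hit
        }
    }

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        System.out.println(indexOfWord(input, "long"));   // 44
        System.out.println(indexOfWord("account", "ac")); // -1
    }
}
```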
I even searched through page 3 of the Google results for this problem, but there seems to be no proper solution.
The following string
"zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\'polo'"
should be split by comma in Java. The quotes can be double or single quotes. I tried the following regex
,(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)
but because of the escaped quote at 'marc o\'polo' it fails...
Can somebody help me out?
Code for tryout:
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc \'opolo'";
Pattern COMMA_PATTERN = Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");
String[] splits = COMMA_PATTERN.split(checkString);
for (String split : splits) {
System.out.println(split);
}
You can do it like this:
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);
while(m.find()) {
result.add(m.group());
}
Splitting CSV with regex is not the right solution... which is probably why you are struggling to find one with split/csv/regex search terms.
Using a dedicated library with a state machine is typically the best solution. There are a number of them:
This closed question seems relevant: https://stackoverflow.com/questions/12410538/which-is-the-best-csv-parser-in-java
I have used opencsv in the past, and I believe the Apache CSV tool is good too. I am sure there are others. I am specifically not linking any library because you should do your own research on what to use.
I have been involved in a number of commercial projects where the CSV parser was custom-built, but I see no reason why that should still be done.
What I can say is that regex and CSV get very, very complicated relatively quickly (as you have discovered), and that for performance reasons alone, a 'raw' parser is better.
If you are parsing CSV (or something very similar), then using one of the established frameworks is normally a good idea, as they cover most corner cases and have been tested by a wider audience through usage in different projects.
If, however, libraries are not an option, you could go with something like this:
import java.util.ArrayList;
import java.util.List;

public class Curios {
public static void main(String[] args) {
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
List<String> result = splitValues(checkString);
System.out.println(result);
System.out.println(splitValues("zhg\\,wi\\'mö,'astor wohnideen','multistore 2002',\"yo\\\"nza\",'asdf, saflk\\\\','marc o\\'polo',"));
}
public static List<String> splitValues(String checkString) {
List<String> result = new ArrayList<String>();
// Used for reporting errors and detecting quotes
int startOfValue = 0;
// Used to mark the next character as being escaped
boolean charEscaped = false;
// Is the current value quoted?
boolean quoted = false;
// Quote-character in use (only valid when quoted == true)
char quote = '\0';
// All characters read from current value
final StringBuilder currentValue = new StringBuilder();
for (int i = 0; i < checkString.length(); i++) {
final char charAt = checkString.charAt(i);
if (i == startOfValue && !quoted) {
// We have not yet decided if this is a quoted value, but we are right at the beginning of the next value
if (charAt == '\'' || charAt == '"') {
// This will be a quoted String
quote = charAt;
quoted = true;
startOfValue++;
continue;
}
}
if (!charEscaped) {
if (charAt == '\\') {
charEscaped = true;
} else if (quoted && charAt == quote) {
if (i + 1 == checkString.length()) {
// So we don't throw an exception
quoted = false;
// Last value will be added to result outside loop
break;
} else if (checkString.charAt(i + 1) == ',') {
// Ensure we don't parse , again
i++;
// Add the value to the result
result.add(currentValue.toString());
// Prepare for next value
currentValue.setLength(0);
startOfValue = i + 1;
quoted = false;
} else {
throw new IllegalStateException(String.format(
"Value was quoted with %s but prematurely terminated at position %d " +
"maybe a \\ is missing before this %s or a , after? " +
"Value up to this point: \"%s\"",
quote, i, quote, checkString.substring(startOfValue, i + 1)));
}
} else if (!quoted && charAt == ',') {
// Add the value to the result
result.add(currentValue.toString());
// Prepare for next value
currentValue.setLength(0);
startOfValue = i + 1;
} else {
// a boring character
currentValue.append(charAt);
}
} else {
// So we don't forget to reset for next char...
charEscaped = false;
// Here we can do interpolations
switch (charAt) {
case 'n':
currentValue.append('\n');
break;
case 'r':
currentValue.append('\r');
break;
case 't':
currentValue.append('\t');
break;
default:
currentValue.append(charAt);
}
}
}
if(charEscaped) {
throw new IllegalStateException("Input ended with a stray \\");
} else if (quoted) {
throw new IllegalStateException("Last value was quoted with "+quote+" but there is no terminating quote.");
}
// Add the last value to the result
result.add(currentValue.toString());
return result;
}
}
Why not simply a regular expression?
Regular expressions don't understand nesting very well. While the regular expression by Casimir certainly does a good job, the differences between quoted and unquoted values are easier to model in some form of state machine. You can see how difficult it was to ensure you don't accidentally match an escaped or quoted ",". Also, since you are already evaluating every character, it is easy to interpret escape sequences like \n.
What to watch out for?
My function was not written for whitespace around values (this can be changed)
My function will interpret the escape-sequences \n, \r, \t, \\ like most C-style language interpreters while reading \x as x (this can easily be changed)
My function accepts quotes and escapes inside unquoted values (this can easily be changed)
I did only a few tests and tried my best to exhibit a good memory-management and timing, but you will need to see if it fits your needs.
I have a huge, programmatically assembled regex, like this:
(A)|(B)|(C)|...
Each sub-pattern is in its own capturing group. When I get a match, how do I figure out which group matched without linearly testing each group(i) to see whether it returns a non-null string?
If your regex is programmatically generated, why not programmatically generate n separate regexes and test each of them in turn? Unless they share a common prefix and the Java regex engine is clever, all alternatives get tested anyway.
Update: I just looked through the Sun Java source, in particular, java.util.regex.Pattern$Branch.match(), and that does also simply do a linear search over all alternatives, trying each in turn. The other places where Branch is used do not suggest any kind of optimization of common prefixes.
You can use non-capturing groups, instead of:
(A)|(B)|(C)|...
replace with
((?:A)|(?:B)|(?:C))
The non-capturing groups (?:) will not be included in the group count, but the result of the branch will be captured in the outer () group.
Break up your regex into three:
String[] regexes = new String[] { "pattern1", "pattern2", "pattern3" };
for(int i = 0; i < regexes.length; i++) {
Pattern pattern = Pattern.compile(regexes[i]);
Matcher matcher = pattern.matcher(inputStr);
if(matcher.matches()) {
//process, optionally break out of loop
}
}
public int getMatchedGroupIndex(Matcher matcher) {
    // Group 0 is the whole match, so capturing groups run from 1 to groupCount() inclusive.
    for (int i = 1; i <= matcher.groupCount(); i++) {
        if (matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
            return i;
        }
    }
    return -1;
}
The alternative is:
for (int i = 1; i <= matcher.groupCount(); i++) {
    if (matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
        //process, optionally break out of loop
    }
}
I don't think you can get around the linear search, but you can make it a lot more efficient by using start(int) instead of group(int).
static int getMatchedGroupIndex(Matcher m)
{
    for (int i = 1, n = m.groupCount(); i <= n; i++)
    {
        // start(i) is -1 when group i did not take part in the match
        if (m.start(i) != -1)
        {
            return i;
        }
    }
    return -1;
}
This way, instead of generating a substring for every group, you just query an int value representing its starting index.
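A small, self-contained sketch of how such a lookup might be used while scanning input. The pattern and the names GroupLookup and matchedGroup are made up for illustration; the method returns the group number rather than the start offset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupLookup {

    /** Returns the number of the first capturing group that took part in the match, or -1. */
    static int matchedGroup(Matcher m) {
        for (int i = 1, n = m.groupCount(); i <= n; i++) {
            if (m.start(i) != -1) { // no substring is created, only an int is read
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d+)|([a-z]+)|(\\s+)");
        Matcher m = p.matcher("abc 42");
        while (m.find()) {
            // Prints the group number followed by the matched text.
            System.out.println(matchedGroup(m) + " -> \"" + m.group() + "\"");
        }
    }
}
```

For the input "abc 42" the three matches belong to groups 2, 3, and 1, in that order.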
From the various comments, it seems that the simple answer is "no", and that using separate regexes is a better idea. To improve on that approach, you might need to figure out the common pattern prefixes when you generate them, or use your own regex (or other) pattern matching engine. But before you go to all of that effort, you need to be sure that this is a significant bottleneck in your system. In other words, benchmark it and see if the performance is acceptable for realistic input data, and if not, profile it to see where the real bottlenecks are.