Lexing a String in Java with Regex

For some reason the while loop is only going through one time, picking up a NUMBER and then exiting. Does anyone have any idea why it isn't lexing the rest of the String? All I had was an input of 1 + 2. Any help is much appreciated!!
public Lexer(String input) throws TokenMismatchException {
    tokens = new ArrayList<Token>();
    // Lexing logic begins here
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (Type type : Type.values())
        tokenPatternsBuffer.append(String.format("|(?<%s>%s)", type.name(), type.pattern));
    Pattern tokenPatterns = Pattern.compile(new String(tokenPatternsBuffer.substring(1)));
    // Begin matching tokens
    Matcher matcher = tokenPatterns.matcher(input.replaceAll(" ", ""));
    while (matcher.find()) {
        if (matcher.group(Type.NUMBER.name()) != null) {
            tokens.add(new Token(Type.NUMBER, matcher.group(Type.NUMBER.name())));
            continue;
        } else if (matcher.group(Type.OPERATOR.name()) != null) {
            tokens.add(new Token(Type.OPERATOR, matcher.group(Type.OPERATOR.name())));
            continue;
        } else if (matcher.group(Type.UNIT.name()) != null) {
            tokens.add(new Token(Type.UNIT, matcher.group(Type.UNIT.name())));
            continue;
        } else if (matcher.group(Type.PARENTHESES.name()) != null) {
            tokens.add(new Token(Type.PARENTHESES, matcher.group(Type.PARENTHESES.name())));
            continue;
        } else {
            throw new TokenMismatchException();
        }
    }
}
enum Type {
    NUMBER("[0-9]+.*[0-9]*"), OPERATOR("[*|/|+|-]"), UNIT("[in|pt]"), PARENTHESES("[(|)]");
    public final String pattern;
    private Type(String pattern) {
        this.pattern = pattern;
    }
}

This pattern:
"[0-9]+.*[0-9]*"
matches one or more digits, followed by zero or more of any character, followed by zero or more digits. The dot is a special character in regexes that means "any character", and .* is greedy, so on "1+2" the first NUMBER match swallows the whole input, which is why the loop runs only once. If you're trying to match a decimal point, you need to put a backslash before the dot:
"[0-9]+\\.*[0-9]*"
(The backslash is doubled because it's in a Java string literal.) It appears to work on "1 + 2" if that one fix is made, although \\.* still accepts any number of dots; \\.? would allow at most one decimal point. However, some of your other patterns show some misunderstanding of what [] does in a regex. This is a "character class" that matches any one of the characters you list between the brackets, except that - can be used for a range of characters (like 0-9). So
"[*|/|+|-]"
matches any of the characters *, |, /, +, - (the | does not mean "or" inside square brackets). - isn't treated as a range operator here since it's last, but it's probably best to get in the habit of using \ in front of it anyway, so you want
"[*/+\\-]"
Similarly,
"[in|pt]"
matches one of the five characters i, n, |, p, t--certainly not what you want. You probably want
"(in|pt)"
which matches either "in" or "pt"; the parentheses may not be necessary in your case, but in a different case, they may be necessary to prevent some other characters from being included in one of the alternatives when the pattern is included in a larger string.
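Putting these corrections together, a minimal sketch of the lexer might look like the following. The class name LexerSketch, the "TYPE:text" string tokens (used in place of the question's Token class, which isn't shown), and the stricter NUMBER pattern are assumptions made for illustration, not the asker's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LexerSketch {
    enum Type {
        NUMBER("[0-9]+(?:\\.[0-9]+)?"), // digits, optionally followed by one decimal part
        OPERATOR("[*/+\\-]"),           // character class: any single one of * / + -
        UNIT("in|pt"),                  // alternation: the whole word "in" or "pt"
        PARENTHESES("[()]");

        final String pattern;

        Type(String pattern) {
            this.pattern = pattern;
        }
    }

    // Returns tokens as "TYPE:text" strings (a stand-in for a real Token class).
    static List<String> lex(String input) {
        StringBuilder alternatives = new StringBuilder();
        for (Type type : Type.values()) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            // One named group per token type, e.g. (?<NUMBER>...)
            alternatives.append(String.format("(?<%s>%s)", type.name(), type.pattern));
        }
        Matcher matcher = Pattern.compile(alternatives.toString())
                .matcher(input.replace(" ", ""));
        List<String> tokens = new ArrayList<>();
        // Note: find() silently skips characters that match no token type.
        while (matcher.find()) {
            for (Type type : Type.values()) {
                String text = matcher.group(type.name());
                if (text != null) {
                    tokens.add(type + ":" + text);
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("1 + 2")); // prints [NUMBER:1, OPERATOR:+, NUMBER:2]
        System.out.println(lex("(1.5in + 2pt) * 3"));
    }
}
```

Looping over Type.values() inside the match loop replaces the chain of else-if branches, so adding a token type only requires a new enum constant.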

Related

How to split a string by a string in Java, considering escaped ones

I have code that needs to split incoming strings on the character ';'. However, some semicolons may be escaped with \.
What I am doing is iterating over the string character by character looking for ;, skipping any that is preceded by a \, and then replacing each escaped \; in the results with ;.
This all seems a bit cumbersome to me. Is there a better way to do this?
public void parse(String line, Player player) {
    if (line.contains(";")) { // check split sign
        String subString;
        int previousIndex = 0; // location of the first letter
        String search = "\\;";
        String replace = ";";
        // let's search for semicolons
        int index = line.indexOf(';');
        while (index >= 0) {
            // check if the previous letter is a \ so we know it's escaped
            if (index == 0 || line.charAt(index - 1) != '\\') {
                // get a substring for the current segment:
                subString = line.substring(previousIndex, index);
                if (subString.contains("/Command/")) { // Check if line is an actual command line
                    // replace escaped semicolons (literal replace, not regex) and execute command
                    parseCommand(subString.replace(search, replace), player);
                } else if (subString.contains("/Output/")) {
                    parseOutput(subString.replace(search, replace), player);
                } else {
                    Main.logDebugInfo(Level.WARNING, "Command parsing: No command or output tag found!");
                }
                previousIndex = index + 1; // the next segment starts after the ;
            }
            index = line.indexOf(';', index + 1); // next letter
        }
    } else {
        Main.logDebugInfo(Level.WARNING, "Command parsing: No ; found.");
    }
}
An alternative that comes to mind would be to first replace all \; with a very specific substring (e.g. "%%€€"), then split into a list on ;, and finally substitute ; back in for the placeholder. There is a tiny risk that this causes issues.
I am wondering if there is some standard routine/best practice to deal with escaped characters?

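split takes a regex as its parameter, so you can use a negative lookbehind:
String[] split = foo.split("(?<!\\\\);");
Yes, that's four backslashes in a row: the regex needs \\ to match one literal backslash, and each of those two backslashes must itself be escaped in the Java string literal.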
I am wondering if there is some standard routine/best practice to deal with escaped characters?
Quote your values, or use a separator that doesn't appear in actual content. Or better yet, use some well-defined format for transmitting data, such as JSON.
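As a minimal sketch of the lookbehind approach applied to the question's input, the following splits on unescaped semicolons and then unescapes the remaining ones. The class and method names are made up for illustration, and the sketch assumes a backslash can never itself be escaped.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedSemicolonSplit {
    // Splits on ';' unless it is directly preceded by a backslash,
    // then turns each escaped "\;" back into a plain ';'.
    static List<String> splitUnescaped(String line) {
        List<String> parts = new ArrayList<>();
        for (String part : line.split("(?<!\\\\);")) {
            // String.replace works on literal text, so no regex escaping is needed here
            parts.add(part.replace("\\;", ";"));
        }
        return parts;
    }

    public static void main(String[] args) {
        // The Java literal "a\\;b" is the text a\;b
        System.out.println(splitUnescaped("/Command/say a\\;b;/Output/done"));
        // prints [/Command/say a;b, /Output/done]
    }
}
```

Note that String.split drops trailing empty strings by default, so an input ending in ; does not produce an empty last segment.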

Capitalize first letters in words in the string with different separators using java 8 stream

I need to capitalize the first letter of every word in a string, but it's not as easy as it seems: a word is any sequence of letters, digits, "_", "-", and "`", while all other characters are separators, i.e. the next letter after them must be capitalized.
Example what program should do:
For input: "#he&llo wo!r^ld"
Output should be: "#He&Llo Wo!R^Ld"
There are questions here that sound similar, but their solutions really don't help.
This one for example:
String output = Arrays.stream(input.split("[\\s&]+"))
.map(t -> t.substring(0, 1).toUpperCase() + t.substring(1))
.collect(Collectors.joining(" "));
As in my task there can be various separators, this solution doesn't work.

It is possible to split a string and keep the delimiters, so taking into account the requirement for delimiters:
word is considered to be any sequence of letters, digits, "_" , "-", "`" while all other chars are considered to be separators
the pattern which keeps the delimiters in the result array would be: "((?<=[^-`\\w])|(?=[^-`\\w]))":
[^-`\\w]: all characters except -, backtick and word characters \w: [A-Za-z0-9_]
Then, the "words" are capitalized, and delimiters are kept as is:
static String capitalize(String input) {
if (null == input || 0 == input.length()) {
return input;
}
return Arrays.stream(input.split("((?<=[^-`\\w])|(?=[^-`\\w]))"))
.map(s -> s.matches("[-`\\w]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
.collect(Collectors.joining(""));
}
Tests:
System.out.println(capitalize("#he&l_lo-wo!r^ld"));
System.out.println(capitalize("#`he`&l+lo wo!r^ld"));
Output:
#He&L_lo-wo!R^Ld
#`he`&L+Lo Wo!R^Ld
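To see what the split step alone produces, here is a small sketch that prints the pieces for the question's sample input. The class name is hypothetical, and the behaviour shown assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element.

```java
class SplitKeepDelimiters {
    // Splits at every position that has a separator directly before or after it,
    // so both words and separators survive as array elements.
    static String[] tokens(String input) {
        return input.split("((?<=[^-`\\w])|(?=[^-`\\w]))");
    }

    public static void main(String[] args) {
        // Brackets make the single-space token visible in the output
        for (String token : tokens("#he&llo wo!r^ld")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because nothing is consumed by the lookarounds, joining the tokens back together reproduces the original string, which is what lets the stream version rebuild the output with Collectors.joining("").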
Update
If the input is not limited to ASCII and other alphabets or character sets (e.g. Cyrillic, Greek) need to be handled, the class \\p{IsWord} may be used, with Unicode-aware matching enabled by the pattern flag (?U):
static String capitalizeUnicode(String input) {
    if (null == input || 0 == input.length()) {
        return input;
    }
    return Arrays.stream(input.split("(?U)((?<=[^-`\\p{IsWord}])|(?=[^-`\\p{IsWord}]))"))
        .map(s -> s.matches("(?U)[-`\\p{IsWord}]+") ? Character.toUpperCase(s.charAt(0)) + s.substring(1) : s)
        .collect(Collectors.joining(""));
}
Test:
System.out.println(capitalizeUnicode("#he&l_lo-wo!r^ld"));
System.out.println(capitalizeUnicode("#привет&`ёж`+дос^βιδ/ως"));
Output:
#He&L_lo-wo!R^Ld
#Привет&`ёж`+Дос^Βιδ/Ως

You can't use split that easily - split will eliminate the separators and give you only the things in between. As you need the separators, no can do.
One real dirty trick is to use something called 'lookahead'. That argument you pass to split is a regular expression. Most 'characters' in a regexp have the property that they consume the matching input. If you do input.split("\\s+") then that doesn't 'just' split on whitespace, it also consumes them: The whitespace is no longer part of the individual entries in your string array.
However, consider ^ and $, or \\b. These still match things but don't consume anything. You don't consume 'end of string'. In fact, ^^^hello$$$ matches the string "hello" just as well. You can do this yourself, using lookahead: it matches when the lookahead is there but does not consume it:
String[] args = "Hello World$Huh   Weird".split("(?=[\\s_$-]+)");
for (String arg : args) System.out.println("*" + arg + "*");
Unfortunately, this 'works', in that it saves your separators, but isn't getting you all that much closer to a solution:
*Hello*
* World*
*$Huh*
* *
* *
* Weird*
You can go with lookbehind as well, but it's limited; they don't do variable length, for example.
The conclusion should rapidly become: Actually, doing this with split is a mistake.
Then, once split is off the table, you should no longer use streams, either: Streams don't do well once you need to know stuff about the previous element in a stream to do the job: A stream of characters doesn't work, as you need to know if the previous character was a non-letter or not.
In general, "I want to do X, and use Y" is a mistake. Keep an open mind. It's akin to asking: "I want to butter my toast, and use a hammer to do it". Oookaaaaayyyy, you can probably do that, but, eh, why? There are butter knives right there in the drawer, just.. put down the hammer, that's toast. Not a nail.
Same here.
A simple loop can take care of this, no problem:
private static final String BREAK_CHARS = "&-_`";
public String toTitleCase(String input) {
StringBuilder out = new StringBuilder();
boolean atBreak = true;
for (char c : input.toCharArray()) {
out.append(atBreak ? Character.toUpperCase(c) : c);
atBreak = Character.isWhitespace(c) || (BREAK_CHARS.indexOf(c) > -1);
}
return out.toString();
}
Simple. Efficient. Easy to read. Easy to modify. For example, if you want to go with 'any non-letter counts as a break', that's trivial: atBreak = !Character.isLetter(c);.
Contrast to the stream solution which is fragile, weird, far less efficient, and requires a regexp that needs half a page's worth of comment for anybody to understand it.
Can you do this with streams? Yes. You can butter toast with a hammer, too. Doesn't make it a good idea though. Put down the hammer!

You can use a simple FSM as you iterate over the characters in the string, with two states, either in a word, or not in a word. If you are not in a word and the next character is a letter, convert it to upper case, otherwise, if it is not a letter or if you are already in a word, simply copy it unmodified.
boolean isWord(int c) {
return c == '`' || c == '_' || c == '-' || Character.isLetter(c) || Character.isDigit(c);
}
String capitalize(String s) {
StringBuilder sb = new StringBuilder();
boolean inWord = false;
for (int c : s.codePoints().toArray()) {
if (!inWord && Character.isLetter(c)) {
sb.appendCodePoint(Character.toUpperCase(c));
} else {
sb.appendCodePoint(c);
}
inWord = isWord(c);
}
return sb.toString();
}
Note: I have used codePoints(), appendCodePoint(int), and int so that characters outside the basic multilingual plane (with code points greater than 64k) are handled correctly.

I need to capitalize first letter in every word
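Here is one way to do it. Admittedly this is a mite longer, but your requirement to change the first letter to upper case (not the first digit or first non-letter) required a helper method. Otherwise it would have been easier. Some others seem to have missed this point.
Establish the word pattern and test data. Note that inside the brackets _-` is parsed as a range from _ to `, so this class matches \w and the backtick but not the hyphen; write [\\w`-]+ if the hyphen should count as a word character, in which case 0hel`lo-w0rld is treated as a single word.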
String wordPattern = "[\\w_-`]+";
Pattern p = Pattern.compile(wordPattern);
String[] inputData = { "#he&llo wo!r^ld", "0hel`lo-w0rld" };
Now this simply finds each successive word in the string based on the established regular expression. As each word is found, it changes the first letter in the word to upper case and then puts it in a string buffer in the correct position where the match was found.
for (String input : inputData) {
StringBuilder sb = new StringBuilder(input);
Matcher m = p.matcher(input);
while (m.find()) {
sb.replace(m.start(), m.end(),
upperFirstLetter(m.group()));
}
System.out.println(input + " -> " + sb);
}
prints
#he&llo wo!r^ld -> #He&Llo Wo!R^Ld
0hel`lo-w0rld -> 0Hel`lo-W0rld
Words may start with digits, and the requirement was to convert the first letter (not the first character) to upper case. This method finds the first letter, converts it to upper case, and returns the new string, so 01_hello would become 01_Hello.
public static String upperFirstLetter(String word) {
char[] chs = word.toCharArray();
for (int i = 0; i < chs.length; i++) {
if (Character.isLetter(chs[i])) {
chs[i] = Character.toUpperCase(chs[i]);
break;
}
}
return String.valueOf(chs);
}

Regular Expression to restrict some special characters

I am trying to write a regular expression to restrict some characters. The characters to restrict are based on requirements from various users.
I am trying to use this regex - [(char1|char2|char3|...)$]
Note: Each char will come from the requirement.
If the user-entered string matches any of the characters, I'll return true. Now,
what I want to know is whether this expression will work for all conditions.
For example - requirement1 = .:, requirement2 = .:&%
I will concatenate | between each char and then generate the regular expression in Java. This is working for my requirement1 but not for requirement2.
my sample java code
String requirement = ":>&%";
String regExp1 = null;
for (int i = 0; i < requirement.length(); i++) {
regExp1 = "[(" + requirement.charAt(i);
if (i - 1 != requirement.length()) {
regExp1.concat("|");
}
}
if (regExp1 != null) {
regExp1.concat(")]$");
}
Pattern p = Pattern.compile(regExp);
Matcher m = p.matcher(arg);
if (m.find())
return true;
else
return false;
How can I generate standard regular expression?

If you want "one of these characters", the brackets are enough. No need for parentheses and pipes.
Something like [.:,] and [.:&%] should work. If you want them one or more times, you have to add + at the end of your regex (i.e. [.:&%]+).
As said in the comments, beware of special characters. The dot, which means "any character" elsewhere in a regex, is literal inside brackets, but characters such as ], \, ^ and - still have a special meaning there and need escaping.
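A hedged sketch of how the character class could be generated from an arbitrary requirement string follows. The class and method names are hypothetical; the sketch escapes every non-alphanumeric character with a backslash (which Java's Pattern allows for any non-alphabetic character) and assumes the requirement contains only characters from the Basic Multilingual Plane.

```java
import java.util.regex.Pattern;

class RestrictedChars {
    // Builds a character class such as [\.\:\&\%] from the forbidden characters.
    static Pattern forbiddenPattern(String forbidden) {
        StringBuilder characterClass = new StringBuilder("[");
        for (char c : forbidden.toCharArray()) {
            // Letters and digits must NOT be escaped (\d, \w, \1 ... have other meanings)
            if (!Character.isLetterOrDigit(c)) {
                characterClass.append('\\');
            }
            characterClass.append(c);
        }
        return Pattern.compile(characterClass.append(']').toString());
    }

    // True if the input contains at least one of the forbidden characters.
    static boolean containsForbidden(String input, String forbidden) {
        if (forbidden.isEmpty()) {
            return false; // an empty class "[]" would not compile
        }
        return forbiddenPattern(forbidden).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsForbidden("a.b", ".:&%")); // prints true
        System.out.println(containsForbidden("abc", ".:&%")); // prints false
        System.out.println(containsForbidden("a]b", "]^-"));  // prints true
    }
}
```

Using find() rather than matches() means the check succeeds as soon as any forbidden character appears anywhere in the input.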

inserting parentheses and asterisks into string according to some conditions

I have the following method which is used to insert parentheses and asterisks into a boolean expression when dealing with multiplication. For instance, an input of A+B+AB will give A+B+(A*B).
However, I also need to take into account the primes (apostrophes). The following are some examples of input/output:
A'B'+CD should give (A'*B')+(C*D)
A'B'C'D' should give (A'*B'*C'*D')
(A+B)'+(C'D') should give (A+B)'+(C'*D')
I have tried the following code but seems to have errors. Any thoughts?
public static String modify(String expression)
{
String temp = expression;
StringBuilder validated = new StringBuilder();
boolean inBrackets=false;
for(int idx=0; idx<temp.length()-1; idx++)
{
//no prime
if((Character.isLetter(temp.charAt(idx))) && (Character.isLetter(temp.charAt(idx+1))))
{
if(!inBrackets)
{
inBrackets = true;
validated.append("(");
}
validated.append(temp.substring(idx,idx+1));
validated.append("*");
}
//first prime
else if((Character.isLetter(temp.charAt(idx))) && (temp.charAt(idx+1)=='\'') && (Character.isLetter(temp.charAt(idx+2))))
{
if(!inBrackets)
{
inBrackets = true;
validated.append("(");
}
validated.append(temp.substring(idx,idx+2));
validated.append("*");
idx++;
}
//second prime
else if((Character.isLetter(temp.charAt(idx))) && (temp.charAt(idx+2)=='\'') && (Character.isLetter(temp.charAt(idx+1))))
{
if(!inBrackets)
{
inBrackets = true;
validated.append("(");
}
validated.append(temp.substring(idx,idx+1));
validated.append("*");
idx++;
}
else
{
validated.append(temp.substring(idx,idx+1));
if(inBrackets)
{
validated.append(")");
inBrackets=false;
}
}
}
validated.append(temp.substring(temp.length()-1));
if(inBrackets)
{
validated.append(")");
inBrackets=false;
}
return validated.toString();
}
Your help will greatly be appreciated. Thank you in advance! :)

I would suggest starting with the positions of the + characters in your string. If they differ by 1, you don't do anything. If they differ by two, then there are two possibilities: AB or A'. So you check for it. If they differ by more than 2, then just check for the ' symbol and insert the required symbols.

You can do it in 2 passes using regular expressions:
StringBuilder input = new StringBuilder("A'B'+(CDE)+A'B");
Pattern pattern1 = Pattern.compile("[A-Z]'?(?=[A-Z]'?)");
Matcher matcher1 = pattern1.matcher(input);
while (matcher1.find()) {
input.insert(matcher1.end(), '*');
matcher1.region(matcher1.end() + 1, input.length());
}
Pattern pattern2 = Pattern.compile("([A-Z]'?[*])+[A-Z]'?");
Matcher matcher2 = pattern2.matcher(input);
while (matcher2.find()) {
int start = matcher2.start();
int end = matcher2.end();
if (start==0||input.charAt(start-1) != '(') {
input.insert(start, '(');
end++;
}
if (input.length() == end || input.charAt(end) != ')') {
input.insert(end, ')');
end++;
}
matcher2.region(end, input.length());
}
It works as follows: the regex [A-Z]'? matches a capital letter from A to Z that may be followed by an optional apostrophe, so it conveniently handles both cases for us. The regex [A-Z]'?(?=[A-Z]'?) then means "look for a capital letter followed by an optional apostrophe, and then look ahead for (but don't consume) a capital letter followed by an optional apostrophe". These are all the places after which you want to put an asterisk. We then create a Matcher and find every match, inserting an asterisk after each one. Since we modified the string, we need to update the Matcher's region for it to keep working properly.
In the second pass, we use the regex ([A-Z]'?[*])+[A-Z]'?, which looks for "a capital letter followed by an optional apostrophe and then an asterisk, at least once, and then a capital letter followed by an optional apostrophe". These are all the groups that need parentheses around them. So we create a Matcher and find the matches. We then check whether there is already a parenthesis there (making sure not to go out of bounds). If not, we add one. We again need to update the Matcher since we inserted characters. Once this is over, we have our final string.
for more on regex:
Pattern documentation
Regex tutorial
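For the first pass only, a shorter variant is possible with String.replaceAll, which avoids updating the matcher's region by hand. This is a sketch under the same assumption as the answer (variables are single capital letters); the class and method names are invented for illustration, and it does not add the parentheses of the second pass.

```java
class ImplicitProduct {
    // Inserts '*' after every variable (with optional prime) that is
    // immediately followed by another variable.
    static String insertAsterisks(String expression) {
        // Group 1 is the variable; the lookahead checks the next letter without consuming it
        return expression.replaceAll("([A-Z]'?)(?=[A-Z])", "$1*");
    }

    public static void main(String[] args) {
        System.out.println(insertAsterisks("A'B'+CD"));        // prints A'*B'+C*D
        System.out.println(insertAsterisks("A'B'C'D'"));       // prints A'*B'*C'*D'
        System.out.println(insertAsterisks("(A+B)'+(C'D')"));  // prints (A+B)'+(C'*D')
    }
}
```

Because the lookahead consumes nothing, the second letter of each pair is still available as the start of the next match, so chains such as ABC become A*B*C.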

How to split a comma separated String while ignoring escaped commas?

I need to write an extended version of the StringUtils.commaDelimitedListToStringArray function which takes an additional parameter: the escape char.
So calling my:
commaDelimitedListToStringArray("test,test\\,test\\,test,test", "\\")
should return:
["test", "test,test,test", "test"]
My current attempt is to use String.split() to split the String using regular expressions:
String[] array = str.split("[^\\\\],");
But the returned array is:
["tes", "test\,test\,tes", "test"]
Any ideas?

The regular expression
[^\\],
means "match a character which is not a backslash followed by a comma" - this is why patterns such as t, are matching, because t is a character which is not a backslash.
I think you need to use some sort of negative lookbehind, to capture a , which is not preceded by a \ without capturing the preceding character, something like
(?<!\\),
(BTW, note that I have purposefully not doubly-escaped the backslashes to make this more readable)

Try:
String array[] = str.split("(?<!\\\\),");
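Basically this is saying: split on a comma, except where that comma is preceded by a backslash. (The four backslashes in the Java string literal become two in the regex, which match one literal backslash.) This is called a negative lookbehind, a zero-width assertion.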

For future reference, here is the complete method I ended up with:
public static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
// these characters need to be escaped in a regular expression
String regularExpressionSpecialChars = "/.*+?|()[]{}\\";
String escapedEscapeChar = escapeChar;
// if the escape char for our comma separated list needs to be escaped
// for the regular expression, escape it using the \ char
if(regularExpressionSpecialChars.indexOf(escapeChar) != -1)
escapedEscapeChar = "\\" + escapeChar;
// see http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas
String[] temp = str.split("(?<!" + escapedEscapeChar + "),", -1);
// remove the escapeChar for the end result
String[] result = new String[temp.length];
for(int i=0; i<temp.length; i++) {
result[i] = temp[i].replaceAll(escapedEscapeChar + ",", ",");
}
return result;
}
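The hand-maintained list of special characters above omits some regex metacharacters (for example ^ and $). A sketch of the same idea using Pattern.quote, which quotes any literal text for use in a regex, could look like this. The class and method names are made up, and like the method above it does not handle an escaped escape character.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuotedEscapeSplit {
    static String[] split(String str, String escapeChar) {
        // Pattern.quote makes the escape character safe inside the lookbehind,
        // whatever it is; the limit -1 keeps trailing empty elements.
        String[] parts = str.split("(?<!" + Pattern.quote(escapeChar) + "),", -1);
        for (int i = 0; i < parts.length; i++) {
            // Literal (non-regex) replacement of the escaped comma
            parts[i] = parts[i].replace(escapeChar + ",", ",");
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("test,test\\,test\\,test,test", "\\")));
        // prints [test, test,test,test, test]
        System.out.println(Arrays.toString(split("a,b^,c", "^")));
        // prints [a, b,c]
    }
}
```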

As matt b said, [^\\], will interpret the character preceding the comma as a part of the delimiter.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,tes" , "test"]
As drvdijk said, (?<!\\), will misinterpret escaped backslashes.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,test" , "test"]
-(unescape commas)->
["test\\\\,test\\,test,test" , "test"]
I would expect being able to escape backslashes as well...
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\" , "test\\,test" , "test"]
-(unescape commas and backslashes)->
["test\\,test\\" , "test,test" , "test"]
drvdijk suggested (?<=(?<!\\\\)(\\\\\\\\){0,100}), which works well for lists with elements ending with up to 100 backslashes. This is far enough... but why a limit? Is there a more efficient way (isn't lookbehind greedy)? What about invalid strings?
I searched for a while for a generic solution, then I wrote the thing myself... The idea is to split following a pattern that matches the list elements (instead of matching the delimiter).
My answer does not take the escape character as a parameter.
public static List<String> commaDelimitedListStringToStringList(String list) {
// Check the validity of the list
// ex: "te\\st" is not valid, backslash should be escaped
if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
// Could also raise an exception
return null;
}
// Matcher for the list elements
Matcher matcher = Pattern
.compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
// Unescape the list element
result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
}
return result;
}
Description for the pattern (unescaped):
(?<=(^|,)) behind is the start of the string or a ,
([^\\,]|\\,|\\\\)* the element, composed of \,, \\ or characters which are neither \ nor ,
(?=(,|$)) ahead is the end of the string or a ,
The pattern may be simplified.
Even with the 3 parsings (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It can still be optimized by writing a specific parser.
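The "specific parser" mentioned above could be a single pass over the characters. The following is only a sketch of that idea, not the author's code: the class name is hypothetical, it throws an exception on invalid input instead of returning null, and it returns one empty element for an empty input.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedListParser {
    static List<String> parse(String list) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true right after an unconsumed backslash
        for (int i = 0; i < list.length(); i++) {
            char c = list.charAt(i);
            if (escaped) {
                // Only \, and \\ are valid escape sequences
                if (c != ',' && c != '\\') {
                    throw new IllegalArgumentException("Invalid escape at index " + (i - 1));
                }
                current.append(c);
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == ',') {
                // Unescaped comma: the current element is complete
                result.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (escaped) {
            throw new IllegalArgumentException("Dangling backslash at end of input");
        }
        result.add(current.toString());
        return result;
    }

    public static void main(String[] args) {
        // Same sample as above: elements are test\,test\ then test,test then test
        System.out.println(parse("test\\\\\\,test\\\\,test\\,test,test"));
    }
}
```

Each character is examined once and unescaping happens during the same pass, so the separate matches, find and replaceAll steps are no longer needed.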
Also, what is the need of having an escape character if only one character is special, it could simply be doubled...
public static List<String> commaDelimitedListStringToStringList2(String list) {
if (!list.matches("^(([^,]|,,)*(,|$))+")) {
return null;
}
Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
result.add(matcher.group().replaceAll(",,", ","));
}
return result;
}

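In JavaScript, split(/(?<!\\),/g) worked for me, but the accepted answer's string form did not, because String.prototype.split treats a string argument as a literal separator rather than a regex:
> var x = "test,test\\,test\\,test,test"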
undefined
> x.split(/(?<!\\),/g)
[ 'test', 'test\\,test\\,test', 'test' ]
> x.split("(?<!\\\\),")
[ 'test,test\\,test\\,test,test' ]

It's probably not a "super fancy" solution, but it may be more time-efficient. Escaping the escape character is also supported, and it works in browsers that don't support lookbehind.
function splitByDelimiterIfItIsNotEscaped (text, delimiter, escapeCharacter) {
const splittedText = []
let numberOfDelimitersBeforeOtherCharacter = 0
let nextSplittedTextPartIndex = 0
for (let characterIndex = 0, character = text[0]; characterIndex < text.length; characterIndex++, character = text[characterIndex]) {
if (character === escapeCharacter) {
numberOfDelimitersBeforeOtherCharacter++
} else if (character === delimiter && (!numberOfDelimitersBeforeOtherCharacter || !(numberOfDelimitersBeforeOtherCharacter % 2))) {
splittedText.push(text.substring(nextSplittedTextPartIndex, characterIndex))
nextSplittedTextPartIndex = characterIndex + 1
} else {
numberOfDelimitersBeforeOtherCharacter = 0
}
}
if (nextSplittedTextPartIndex <= text.length) {
splittedText.push(text.substring(nextSplittedTextPartIndex, text.length))
}
return splittedText
}
function onChange () {
console.log(splitByDelimiterIfItIsNotEscaped(inputBox.value, ',', '\\'))
}
addEventListener('change', onChange)
onChange()
After making a change unfocus the input box (use tab for example).
<input id="inputBox" value="test,test\,test\,test,test"/>
