RegEx help for chess moves (SAN)

I'm writing a program that should be able to read and parse chess moves (SAN).
Here are some examples of moves that should be accepted:
e4
Nf3
Nbd2
Nb1c3
R1a3
d8=Q
exd5
Nbxd2
...
I first wrote the NFA, then converted it to a grammar, and then converted that to a regular expression.
With my conventions, this is how it looks:
pln + plxln + plnxln + plnln + plln + pxln + lxln=(B+R+Q+N) + lxln + lnxln=(B+R+Q+N) + lnxln + lnln=(B+R+Q+N) + lnln + ln=(B+R+Q+N) + ln + pnxln + pnln
where:
p is a character from the set {B,R,Q,N,K} (think of it as (B+R+Q+N+K), i.e. [BRQNK])
l is a character in the interval [a-h] (case sensitive)
n is a digit in the interval [1-8]
+ represents the union operation; if I got it right, (B+R+Q+N) is [BRQN] in programming-language regex syntax
= is just a normal character; in chess moves it's used for promotion (e.g. e8=Q)
x is a normal character too; it's used when the moving piece captures an opponent's piece on the destination square
( ): grouping, as in math
I tried to write the first part, pln, as [BRQN][a-h][1-8] in an online Java regex tester, and it worked for a move like Nf3. I couldn't work out how to express the union of composite expressions (like pln+plxln). Also, how can I label the parts of the regex so that, when a move is matched, I can retrieve each piece of information? I tried reading the docs about it but couldn't figure it out.
Any advice?

The + in your notation is | in regexes, and a union of single characters such as (B+R+Q+N) is the character class [BRQN]. So you could use the regex
[BRQNK][a-h][1-8]|[BRQNK][a-h]x[a-h][1-8]|[BRQNK][a-h][1-8]x[a-h][1-8]|[BRQNK][a-h][1-8][a-h][1-8]|[BRQNK][a-h][a-h][1-8]|[BRQNK]x[a-h][1-8]|[a-h]x[a-h][1-8]=[BRQN]|[a-h]x[a-h][1-8]|[a-h][1-8]x[a-h][1-8]=[BRQN]|[a-h][1-8]x[a-h][1-8]|[a-h][1-8][a-h][1-8]=[BRQN]|[a-h][1-8][a-h][1-8]|[a-h][1-8]=[BRQN]|[a-h][1-8]|[BRQNK][1-8]x[a-h][1-8]|[BRQNK][1-8][a-h][1-8]
This is, clearly, a bit ugly. I can think of two ways to make it nicer:
With the COMMENTS flag (Pattern.COMMENTS in Java), whitespace in the pattern is ignored, so you can lay the regex out over several lines.
Combine the possibilities together in a nicer way. For example, [BRQNK][a-h]x[a-h][1-8]|[BRQNK][a-h][1-8]x[a-h][1-8] can be rewritten as [BRQNK][a-h][1-8]?x[a-h][1-8].
I also know of one other improvement, which isn't available in Java (and perhaps not in many languages, but you can do it in Perl). The subexpression (?1) (likewise (?2), etc.) is a bit like \1, except that instead of matching the exact string that matched the first capture group, it matches any string that could have matched that capture group. In other words, it's equivalent to writing the capture group out again. So you could (in Perl) replace the first [BRQNK] with ([BRQNK]), then replace all subsequent occurrences with (?1).
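On the question of labelling the parts of a match: Java (7 and later) supports named capturing groups, written (?<name>...) and read back with Matcher.group("name"). The sketch below combines that with the COMMENTS flag. The class name SanMoveParser and all group names are made up for illustration, and the pattern is deliberately more permissive than the union in the question (for instance, it would accept a promotion suffix on a piece move), so treat it as a starting point rather than a validated SAN parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SanMoveParser {
    // Pattern.COMMENTS makes the engine ignore the whitespace used for layout.
    // All group names are hypothetical labels chosen for this sketch.
    static final Pattern MOVE = Pattern.compile(
          "(?<piece>[BRQNK])?   "       // optional piece letter (absent for pawns)
        + "(?<fromFile>[a-h])?  "       // optional disambiguating file
        + "(?<fromRank>[1-8])?  "       // optional disambiguating rank
        + "(?<capture>x)?       "       // optional capture marker
        + "(?<to>[a-h][1-8])    "       // destination square
        + "(?:=(?<promotion>[BRQN]))?", // optional promotion piece
        Pattern.COMMENTS);

    /** Returns a labelled summary of the move, or null if it does not match. */
    static String describe(String san) {
        Matcher m = MOVE.matcher(san);
        if (!m.matches()) {
            return null;
        }
        // Groups that did not take part in the match are reported as null.
        return "piece=" + m.group("piece")
             + " fromFile=" + m.group("fromFile")
             + " fromRank=" + m.group("fromRank")
             + " capture=" + (m.group("capture") != null)
             + " to=" + m.group("to")
             + " promotion=" + m.group("promotion");
    }

    public static void main(String[] args) {
        for (String san : new String[] {"e4", "Nbxd2", "Nb1c3", "d8=Q", "Zz9"}) {
            System.out.println(san + " -> " + describe(san));
        }
    }
}
```

Because every optional part has its own name, the calling code does not need to know which alternative matched; it only checks which groups are non-null.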

/^([NBRQK])?([a-h])?([1-8])?(x)?([a-h][1-8])(=[NBRQK])?(\+|#)?$|^O-O(-O)?$/
This was unit tested against 2599 cases. See below for unit tests
const assert = require('assert');

describe('Importer/Game', function() {
  let Importer, Game;
  beforeEach(function() {
    Importer = require(`${moduleDir}/import`).Importer;
    Game = require(`${moduleDir}/import`).Game;
  });
  describe('moveRegex', function() {
    describe('non-castling', function() {
      // ([NBRQK])? ([a-h])? ([1-8])? (x)? ([a-h][1-8]) (=[NBRQK])? (\+|#)?
      // unitType? startFile? startRank? capture? end promotion? checkState?
      for (let unitType of ['', 'N', 'B', 'R', 'Q', 'K']) {
        for (let startFile of ['', 'b']) {
          for (let startRank of ['', '3']) {
            for (let capture of ['', 'x']) {
              for (let promotion of ['', '=Q']) {
                for (let checkState of ['', '+', '#']) {
                  //TODO: castling
                  const dest = 'e4';
                  const san = unitType + startFile + startRank + capture + dest + promotion + checkState;
                  testPositive(san);
                  //TODO: negative substitutions here.
                  // Each negative case corrupts exactly one component of the move.
                  testNegative('Y' + startFile + startRank + capture + dest + promotion + checkState);
                  testNegative(unitType + 'i' + startRank + capture + dest + promotion + checkState);
                  testNegative(unitType + startFile + '9' + capture + dest + promotion + checkState);
                  testNegative(unitType + startFile + startRank + 'X' + dest + promotion + checkState);
                  testNegative(unitType + startFile + startRank + capture + 'i9' + promotion + checkState);
                  // testNegative(unitType + startFile + startRank + capture + '' + promotion + checkState);
                  testNegative(unitType + startFile + startRank + capture + dest + '=' + checkState);
                  testNegative(unitType + startFile + startRank + capture + dest + 'Q' + checkState);
                  testNegative(unitType + startFile + startRank + capture + dest + promotion + '++');
                }
              }
            }
          }
        }
      }
    });
    describe('castling', function() {
      testPositive('O-O');
      testPositive('O-O-O');
      testNegative('OOO');
      testNegative('OO');
      testNegative('O-O-');
      testNegative('O-O-O-O');
      testNegative('O');
    });
    // Registers a test asserting that the regex matches the given SAN string.
    function testPositive(san) {
      it(`should handle this san: ${san}`, function(done) {
        const matches = san.match(Importer.moveRegex);
        assert(matches);
        done();
      });
    }
    // Registers a test asserting that the regex rejects the given string.
    function testNegative(san) {
      it(`should not match this: ${san}`, function(done) {
        const matches = san.match(Importer.moveRegex);
        assert(!matches);
        done();
      });
    }
  });
});

Re: /^([NBRQK])?([a-h])?([1-8])?(x)?([a-h][1-8])(=[NBRQK])?(\+|#)?$|^O-O(-O)?$/
It's both underinclusive and overinclusive.
It excludes the possibly legal moves O-O+, O-O-O+, O-O#, and O-O-O#.
It includes many strings that can never be legal: e8=K, Kaa4, Nf5=B, Qa1xb7, and so on.
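These claims are easy to check mechanically. The sketch below ports the quoted regex to a Java string literal (backslashes doubled) and probes it with the examples named above; the class name SanRegexProbe and the accepts helper are made up for illustration.

```java
import java.util.regex.Pattern;

class SanRegexProbe {
    // The regex under discussion, with backslashes doubled for a Java string literal.
    static final Pattern MOVE = Pattern.compile(
        "^([NBRQK])?([a-h])?([1-8])?(x)?([a-h][1-8])(=[NBRQK])?(\\+|#)?$|^O-O(-O)?$");

    /** True if the whole string is accepted by the regex. */
    static boolean accepts(String san) {
        return MOVE.matcher(san).matches();
    }

    public static void main(String[] args) {
        // Castling with a check or mate suffix, then strings that can never be legal moves.
        String[] probes = {"O-O", "O-O+", "O-O-O#", "e8=K", "Kaa4", "Nf5=B", "Qa1xb7"};
        for (String san : probes) {
            System.out.println(san + " -> " + accepts(san));
        }
    }
}
```

Plain O-O is accepted, while O-O+ and O-O-O# are rejected because the check suffix belongs only to the first alternative; the four impossible moves are all accepted.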

I've made this one:
/(^([PNBRQK])?([a-h])?([1-8])?(x|X|-)?([a-h][1-8])(=[NBRQ]| ?e\.p\.)?|^O-O(-O)?)(\+|\#|\$)?$/
It accepts castling with a check or mate suffix: O-O+, O-O-O+, O-O# and O-O-O#.
It also accepts the e.p. suffix, N-f6 or NXf6, and Pe4 or Pe5xd6.
Update:
Thanks to @Toto for improving my version of the regex above:
^([PNBRQK]?[a-h]?[1-8]?[xX-]?[a-h][1-8](=[NBRQ]| ?e\.p\.)?|^O-O(?:-O)?)[+#$]?$

I have been using this for a while in my web portal.
[BRQNK][a-h][1-8]| [a-h][1-8]|[BRQNK][a-h][a-h][1-8]|O-O|0-0-0|[BRQNK]x[a-h][1-8]|[a-h]x[a-h][1-8]|1\/2-1\/2|1\/-O|O-\/1

Related

"Find Word Count" - My code doesn't work properly

"Find Word Count" - Instructions:
Given an input string (assume it's essentially a paragraph of text) and a
word to find, return the number of times in the input string that the word is
found. The search should be case-insensitive and should ignore spaces, commas, full stops, quotes, tabs, etc. when finding the matching word.
=======================
My code doesn't work properly.
`
String input = " It can hardly be a coincidence that no language on" +
" Earth has ever produced the expression as pretty as an airport." +
" Airports are ugly. Some are very ugly. Some attain a degree of ugliness" +
" that can only be the result of a special effort. This ugliness arises " +
"because airports are full of people who are tired, cross, and have just " +
"discovered that their luggage has landed in Murmansk (Murmansk airport " +
"is the only known exception to this otherwise infallible rule), and architects" +
" have on the whole tried to reflect this in their designs. They have sought" +
" to highlight the tiredness and crossness motif with brutal shapes and nerve" +
" jangling colors, to make effortless the business of separating the traveller" +
" for ever from his or her luggage or loved ones, to confuse the traveller with" +
" arrows that appear to point at the windows, distant tie racks, or the current " +
"position of Ursa Minor in the night sky, and wherever possible to expose the " +
"plumbing on the grounds that it is functional, and conceal the location of the" +
"departure gates, presumably on the grounds that they are not.";
input = input.toLowerCase();
String whichWord = "be";
whichWord = whichWord.toLowerCase();
int lastIndex = 0;
int count = 0;
while(lastIndex != -1){
lastIndex = input.indexOf(whichWord,lastIndex);
if(lastIndex != -1){
count ++;
lastIndex += whichWord.length();
}
}
System.out.println(count);
`

Your code does not check for complete words: indexOf finds every substring occurrence, so it counts both 'be' and the 'be' inside 'because'. Try the solution below, which uses a regex that requires a non-word character (or the start or end of the input) on each side of the word:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCount {
    public static void main(String[] args) {
        String input = " It can hardly be a coincidence that no language on" +
            " Earth has ever produced the expression as pretty as an airport." +
            " Airports are ugly. Some are very ugly. Some attain a degree of ugliness" +
            " that can only be the result of a special effort. This ugliness arises " +
            "because airports are full of people who are tired, cross, and have just " +
            "discovered that their luggage has landed in Murmansk (Murmansk airport " +
            "is the only known exception to this otherwise infallible rule), and architects" +
            " have on the whole tried to reflect this in their designs. They have sought" +
            " to highlight the tiredness and crossness motif with brutal shapes and nerve" +
            " jangling colors, to make effortless the business of separating the traveller" +
            " for ever from his or her luggage or loved ones, to confuse the traveller with" +
            " arrows that appear to point at the windows, distant tie racks, or the current " +
            "position of Ursa Minor in the night sky, and wherever possible to expose the " +
            "plumbing on the grounds that it is functional, and conceal the location of the" +
            "departure gates, presumably on the grounds that they are not.";
        input = input.toLowerCase();
        String whichWord = "be";
        whichWord = whichWord.toLowerCase();
        int count = 0;
        // Require a non-word character, or the start/end of the input, on each side of the word.
        String regex = "(\\W|^)" + whichWord + "(\\W|$)";
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            count++;
        }
        System.out.println(count);
    }
}
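One caveat with the (\W|^)...(\W|$) form is that each match consumes the separators around the word, so two occurrences separated by a single space (as in "be be") would be counted once. A possible variant, sketched below, uses the zero-width word boundary \b instead and quotes the search word so that any regex metacharacters in it are taken literally. The class and method names are made up for illustration, and \b only behaves as intended when the search word starts and ends with a letter, digit, or underscore.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordCounter {
    /** Counts case-insensitive whole-word occurrences of word in text. */
    static int countWord(String text, String word) {
        // \b is zero-width, so adjacent occurrences do not steal each other's separators.
        // Pattern.quote treats the word as a literal rather than as regex syntax.
        Pattern pattern = Pattern.compile(
            "\\b" + Pattern.quote(word) + "\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWord("To be or not to be, because BE.", "be"));
        System.out.println(countWord("be be", "be"));
    }
}
```

With the CASE_INSENSITIVE flag set, the explicit toLowerCase() calls on the input and the search word are no longer needed.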

Android toLowerCase() issue with accented characters

My app has a feature to filter content based on some keywords.
The filter is case insensitive, so I first call String.toLowerCase() on the source content.
The issue arises when the source is in upper case and contains accented characters, as in the French word "INVITÉ".
When this word is converted to lower case using the device's default locale, the result is "invité".
The problem is that the last character is not the same as the single lowercase character "é".
Instead, it is a combination of two chars:
"e" (code point 101) and
the combining acute accent (code point 769)
Because of this, the decomposed "invité" does not match the precomposed "invité".
How can I solve this? I would prefer not to remove accented characters altogether.

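You should normalize the string to NFC, which composes "e" followed by the combining accent into the single character "é":
import java.text.Normalizer;

// "E" followed by U+0301 (combining acute accent), i.e. the decomposed form
String upper = "INVITE\u0301";
System.out.println(upper + " length=" + upper.length());
String lower = upper.toLowerCase();
System.out.println(lower + " length=" + lower.length());
String normalized = Normalizer.normalize(lower, Normalizer.Form.NFC);
System.out.println(normalized + " length=" + normalized.length());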
output:
INVITÉ length=7
invité length=7
invité length=6
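It also works for Japanese.
// U+304B (か) followed by U+3099 (combining voiced sound mark), i.e. the decomposed form of が
String japanese = "\u304b\u3099";
System.out.println(japanese + " length=" + japanese.length());
String normalizedJapanese = Normalizer.normalize(japanese, Normalizer.Form.NFC);
System.out.println(normalizedJapanese + " length=" + normalizedJapanese.length());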
output:
が length=2
が length=1
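For the keyword filter in the question, one way to apply this is to bring both the source text and the keyword to the same canonical form before comparing. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and Locale.ROOT is used only to keep the lowercasing independent of the device locale.

```java
import java.text.Normalizer;
import java.util.Locale;

class AccentAwareFilter {
    /** Composes combining sequences (NFC) and lowercases, so equivalent spellings compare equal. */
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
    }

    /** True if source contains keyword, ignoring case and composed/decomposed differences. */
    static boolean containsKeyword(String source, String keyword) {
        return canonical(source).contains(canonical(keyword));
    }

    public static void main(String[] args) {
        String source = "INVITE\u0301";   // decomposed: E + combining acute accent
        String keyword = "invit\u00e9";   // precomposed: single character é
        // Without normalization the lowercase forms differ; with it they match.
        System.out.println(source.toLowerCase(Locale.ROOT).contains(keyword));
        System.out.println(containsKeyword(source, keyword));
    }
}
```

Normalizing both sides matters because the keyword itself may arrive in decomposed form, for example when pasted from another application.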

Why does the Java regular expression "|" find a matching substring for any input string?

I am trying to understand why a regular expression ending with "|" (or simply "|" itself) finds a matching substring whose start index is 0 and whose end (the "offset after the last character matched", per the JavaDoc for Matcher) is also 0.
The following code demonstrates this:
public static void main(String[] args) {
    String regExp = "|";
    String toMatch = "A";
    Matcher m = Pattern.compile(regExp).matcher(toMatch);
    System.out.println("RegExp: " + regExp +
        " found " + toMatch + "(" + m.find() + ") " +
        " start: " + m.start() +
        " end: " + m.end());
}
Output is:
RegExp: | found A(true) start: 0 end: 0
I'm confused by the fact that it is even a valid regular expression, and further confused by the fact that start and end are both 0.
Hoping someone can explain this to me.
Hoping someone can explain this to me.

The pipe in a regular expression means "or." So your regular expression is basically "(empty string) or (empty string)". It successfully finds an empty string at the beginning of the string, and an empty string has a length of 0.
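To see the empty alternative at work, it can help to list every match rather than only the first one. The sketch below does that; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyAlternativeDemo {
    /** Returns every match of regex in input as a "start-end" string. */
    static List<String> matchSpans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> spans = new ArrayList<>();
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // "|" is (empty) or (empty): a zero-length match at every position.
        System.out.println(matchSpans("|", "A"));   // prints [0-0, 1-1]
        // "A|" is "A" or (empty): the real match first, then an empty one at the end.
        System.out.println(matchSpans("A|", "A"));  // prints [0-1, 1-1]
    }
}
```

A zero-length match has equal start and end offsets, which is why the question's output shows 0 for both.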

PatternSyntaxException while using string.match()

I'm getting a pattern syntax exception in this regular expression:
[^c]*[c]{freq}[^c]*
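It is meant to check that the letter c occurs freq times in a row, where freq is a variable holding the required count.

You cannot reference the freq variable inside the regex text like this: the regex engine sees the literal characters {freq}, which is not a valid quantifier. Build your regex as a String instead: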
String regex = "[^c]*c{" + freq + "}[^c]*";
If c is also a variable then use:
String regex = "[^" + c + "]*" + c + "{" + freq + "}[^" + c + "]*";
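Wrapped in a small helper, the same string building might look like the sketch below. The class and method names are hypothetical, and the helper assumes c is a letter or digit so that it needs no escaping either inside or outside the character class; other characters would need additional handling.

```java
class RunLengthMatcher {
    /** Builds [^c]*c{freq}[^c]* for the given character and count. */
    static String buildRegex(char c, int freq) {
        // Assumption of this sketch: only letters and digits, which need no escaping.
        if (!Character.isLetterOrDigit(c)) {
            throw new IllegalArgumentException("unsupported character: " + c);
        }
        if (freq < 0) {
            throw new IllegalArgumentException("freq must not be negative");
        }
        return "[^" + c + "]*" + c + "{" + freq + "}[^" + c + "]*";
    }

    /** True if text contains c exactly freq times, all in one consecutive run. */
    static boolean matches(String text, char c, int freq) {
        return text.matches(buildRegex(c, freq));
    }

    public static void main(String[] args) {
        System.out.println(buildRegex('c', 2));
        System.out.println(matches("aaccbb", 'c', 2));
        System.out.println(matches("abcccd", 'c', 2));
    }
}
```

Note that String.matches requires the whole input to match, so any extra occurrence of c outside the run makes the test fail.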

Using setText for more strings

I'm currently working on an Android app. This is my code:
FieldSolution.setText("y =","(Double.toString(m))","x + ", "(Double.toString(b))");
I'm trying to print "y = mx + b", where m and b are doubles. Somehow I'm getting exceptions.
Where is my mistake?

fieldSolution.setText("y =" + Double.toString(m) + " x + " + Double.toString(b));
or simply
fieldSolution.setText("y =" + m + " x + " + b);
Aside: Use Java naming conventions for variable names (fieldSolution rather than FieldSolution).

You can use String.format:
FieldSolution.setText(String.format("y = %fx + %f", m, b));
You can use modifiers on the %f format specifier to control precision and width of the output. You can also, if appropriate, supply a locale as an argument to format().
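As an illustration of those modifiers, the sketch below fixes the output at two decimal places and passes an explicit locale so the decimal separator does not vary with the device settings. The class and method names, the choice of two decimals, and the choice of Locale.US are all assumptions made for the example.

```java
import java.util.Locale;

class LineEquationFormatter {
    /** Formats "y = mx + b" with two decimal places, using '.' as the decimal separator. */
    static String format(double m, double b) {
        // %.2f: fixed-point notation with exactly two digits after the decimal point.
        return String.format(Locale.US, "y = %.2fx + %.2f", m, b);
    }

    public static void main(String[] args) {
        System.out.println(format(2.0, 3.5)); // prints y = 2.00x + 3.50
    }
}
```

The resulting string can then be passed to setText as a single argument.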
