Regex to find if the partial input is valid JSON

I have a scenario where I need to check whether partial input (see below) is valid JSON so far, that is, whether it is a prefix of some valid JSON document. I referred to this answer to check whether a complete string is valid JSON.
Example input:
{
"JSON": [{
"foo":"bar",
"details": {
"name":"bar",
"id":"bar",
What I have tried so far:
/ (?(DEFINE)
(?<number> -? (?= [1-9]|0(?!\d) ) \d+ (\.\d+)? ([eE] [+-]? \d+)? )
(?<boolean> true | false | null )
(?<string> " ([^"\n\r\t\\\\]* | \\\\ ["\\\\bfnrt\/] | \\\\ u [0-9a-f]{4} )* " )
(?<array> \[ (?: (?&json) (?: , (?&json) )* )? \s* \]{0,1} )
(?<pair> \s* (?&string) \s* : (?&json) )
(?<object> \{ (?: (?&pair) (?: , (?&pair) )* )? \s* \}{0,1} )
(?<json> \s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) \s* )
) \A (?&json)\,{0,1} \Z /six
I made the closing of arrays and objects optional (zero or one occurrence). But there are cases where this fails: for example, when you open an object without closing the previous one (shown below), the regex still finds a match.
Invalid, but still matches:
{
"JSON": [{
"foo":"bar",
"details": {
"name":"bar",
"id":"bar",{
How to validate the partial JSON input?
EDIT:
As mentioned by @ntahdh in the comments, this regex won't work with java.util.regex, which does not support recursion. So now I need a regex that works without recursion.

This is not quite an answer to your question; it would have been posted as a comment if the character limit allowed.
JSON is not a regular language and cannot therefore be recognized solely by a regular expression engine (if you are programming in Python, the regex package provides extensions that might make it possible to accomplish your task, but what I said is generally true).
If a parser generator is not available for your preferred language, you might consider creating a simple recursive descent parser. The regular expressions you have already defined will serve you well for creating the tokens that will be the input to that parser. Of course, you will expect that a parsing error will occur -- but it should occur on the input token being the end-of-file token. A parsing error that occurs before the end-of-file token has been scanned suggests you do not have a prefix of valid JSON. If you are working with a bottom-up, shift-reduce parser such as one generated with YACC, then this would be a shift error on something other than the end-of-file token.
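The recursive descent idea above can be sketched roughly as follows. This is a minimal sketch, not a full JSON parser: the class name JsonPrefixChecker and all of its methods are made up for illustration, and it scans characters directly instead of building a separate token stream. The key point is that running out of input raises one kind of exception (the text is still a valid prefix), while hitting an unexpected character raises another (the text can never become valid).

```java
/** Sketch: checks whether a string is a prefix of some valid JSON document. */
class JsonPrefixChecker {
    /** Signals that the input ran out while the grammar still expected more. */
    private static final class EndOfInput extends RuntimeException {}

    private final String s;
    private int pos = 0;

    private JsonPrefixChecker(String s) { this.s = s; }

    static boolean isValidPrefix(String input) {
        JsonPrefixChecker p = new JsonPrefixChecker(input);
        try {
            p.value();
            p.skipWs();
            return p.pos == input.length(); // complete value: nothing may follow it
        } catch (EndOfInput e) {
            return true;                    // the only problem was running out of input
        } catch (IllegalStateException e) {
            return false;                   // a real character violated the grammar
        }
    }

    private void skipWs() {
        while (pos < s.length() && " \t\r\n".indexOf(s.charAt(pos)) >= 0) pos++;
    }
    private char next() {
        if (pos >= s.length()) throw new EndOfInput();
        return s.charAt(pos++);
    }
    private char peek() {
        skipWs();
        if (pos >= s.length()) throw new EndOfInput();
        return s.charAt(pos);
    }
    private static IllegalStateException bad() {
        return new IllegalStateException("unexpected character");
    }
    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    private void value() {
        char c = peek();
        if (c == '{') members('}', true);
        else if (c == '[') members(']', false);
        else if (c == '"') string();
        else if (c == '-' || isDigit(c)) number();
        else literal();
    }

    /** Parses an object (keyed = true) or an array (keyed = false). */
    private void members(char close, boolean keyed) {
        pos++;                              // consume '{' or '['
        if (peek() == close) { pos++; return; }
        while (true) {
            if (keyed) {
                if (peek() != '"') throw bad();
                string();
                if (peek() != ':') throw bad();
                pos++;
            }
            value();
            char c = peek();
            pos++;
            if (c == close) return;
            if (c != ',') throw bad();
        }
    }

    private void string() {
        pos++;                              // consume the opening quote
        while (true) {
            char c = next();
            if (c == '"') return;
            if (c < 0x20) throw bad();      // raw control characters are not allowed
            if (c == '\\') {
                char e = next();
                if (e == 'u') {
                    for (int i = 0; i < 4; i++)
                        if ("0123456789abcdefABCDEF".indexOf(next()) < 0) throw bad();
                } else if ("\"\\/bfnrt".indexOf(e) < 0) throw bad();
            }
        }
    }

    private void digits() { while (pos < s.length() && isDigit(s.charAt(pos))) pos++; }

    private void number() {
        if (s.charAt(pos) == '-') pos++;
        char first = next();
        if (!isDigit(first)) throw bad();
        if (first != '0') digits();         // no leading zeros
        if (pos < s.length() && s.charAt(pos) == '.') {
            pos++;
            if (!isDigit(next())) throw bad();
            digits();
        }
        if (pos < s.length() && (s.charAt(pos) == 'e' || s.charAt(pos) == 'E')) {
            pos++;
            char c = next();
            if (c == '+' || c == '-') c = next();
            if (!isDigit(c)) throw bad();
            digits();
        }
    }

    private void literal() {
        for (String word : new String[] {"true", "false", "null"}) {
            if (s.charAt(pos) == word.charAt(0)) {
                for (int i = 0; i < word.length(); i++)
                    if (next() != word.charAt(i)) throw bad();
                return;
            }
        }
        throw bad();
    }
}
```

With this sketch, the first example from the question (ending in "id":"bar",) is accepted because the parser only fails by reaching the end of the input, while the second one (ending in "id":"bar",{) is rejected because '{' appears where a key string is required.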

Why not let a parser like Gson do it for you? With its streaming API you read the input as a stream and work at the token level.
import java.io.IOException;
import java.io.StringReader;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;
public class Main
{
public static void main(String[] args) throws Exception
{
String json = "{'id': 1001,'firstName': 'Lokesh','lastName': 'Gupta','email': null}";
JsonReader jsonReader = new JsonReader(new StringReader(json));
jsonReader.setLenient(true);
try
{
while (jsonReader.hasNext())
{
JsonToken nextToken = jsonReader.peek();
if (JsonToken.BEGIN_OBJECT.equals(nextToken)) {
jsonReader.beginObject();
} else if (JsonToken.NAME.equals(nextToken)) {
String name = jsonReader.nextName();
System.out.println("Token KEY >>>> " + name);
} else if (JsonToken.STRING.equals(nextToken)) {
String value = jsonReader.nextString();
System.out.println("Token Value >>>> " + value);
} else if (JsonToken.NUMBER.equals(nextToken)) {
long value = jsonReader.nextLong();
System.out.println("Token Value >>>> " + value);
} else if (JsonToken.NULL.equals(nextToken)) {
jsonReader.nextNull();
System.out.println("Token Value >>>> null");
} else if (JsonToken.END_OBJECT.equals(nextToken)) {
jsonReader.endObject();
}
}
} catch (IOException e) {
e.printStackTrace();
} finally {
jsonReader.close();
}
}
}
source: https://howtodoinjava.com/gson/jsonreader-streaming-json-parser/

I know that using regex to validate strings with nested structures is not easy, if it is feasible at all.
You will probably have more luck using an existing JSON parser:
Use a stack to keep track of the objects and arrays that are still open.
Append the required closing curly and square brackets.
Ask the JSON parser whether the resulting string is valid JSON.
You will probably have to do some work to handle commas and quotes too, but you get the idea.
With a code sample:
import com.google.gson.JsonParser;
import com.google.gson.JsonSyntaxException;
import java.util.Stack;
public class Main {
public static void main(String[] args) {
String valid = "{\n" +
"\"JSON\": [{\n" +
" \"foo\":\"bar\",\n" +
" \"details\": {\n" +
" \"name\":\"bar\",\n" +
" \"id\":\"bar\"";
System.out.println("Is valid?:\n" + valid + "\n" + validate(valid));
String invalid = "{ \n" +
" \"JSON\": [{\n" +
" \"foo\":\"bar\",\n" +
" \"details\": {\n" +
" \"name\":\"bar\",\n" +
" \"id\":\"bar\",{";
System.out.println("Is valid?:\n" + invalid + "\n" + validate(invalid));
}
public static boolean validate(String input) {
Stack<String> closings = new Stack<>();
for (char ch: input.toCharArray()) {
switch(ch) {
case '{':
closings.push("}");
break;
case '[':
closings.push("]");
break;
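case '}':
case ']':
if (closings.empty()) {
return false; // a closer with nothing open can never become valid
}
closings.pop();
break;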
}
}
StringBuilder closingBuilder = new StringBuilder();
while (! closings.empty()) {
closingBuilder.append(closings.pop());
}
String fullInput = input + closingBuilder.toString();
JsonParser parser = new JsonParser();
try{
parser.parse(fullInput);
}
catch(JsonSyntaxException jse){
return false;
}
return true;
}
}
Which results in:
Is valid?:
{
"JSON": [{
"foo":"bar",
"details": {
"name":"bar",
"id":"bar"
true
Is valid?:
{
"JSON": [{
"foo":"bar",
"details": {
"name":"bar",
"id":"bar",{
false
Note that adding a comma after the "bar" line in the valid example makes it invalid (because "bar",}}]} is invalid JSON).
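One part of the "handle quotes" work mentioned above is that a brace or bracket inside a string value must not be counted. Here is a minimal sketch of just the stack step that skips string contents and also rejects mismatched closers. The class and method names (ClosingSuffix, closingSuffix) are hypothetical, and the result would still need to be appended to the input and passed to a real JSON parser as in the code above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ClosingSuffix {
    /**
     * Returns the characters needed to close everything still open in the
     * partial text, or null if a closer does not match the innermost opener.
     */
    static String closingSuffix(String partial) {
        Deque<Character> open = new ArrayDeque<>();
        boolean inString = false;
        boolean escaped = false;
        for (char ch : partial.toCharArray()) {
            if (inString) {
                // Inside a string only escapes and the closing quote matter.
                if (escaped) escaped = false;
                else if (ch == '\\') escaped = true;
                else if (ch == '"') inString = false;
                continue;
            }
            switch (ch) {
                case '"': inString = true; break;
                case '{': open.push('}'); break;
                case '[': open.push(']'); break;
                case '}':
                case ']':
                    if (open.isEmpty() || open.pop() != ch) return null;
                    break;
                default: break;
            }
        }
        StringBuilder suffix = new StringBuilder();
        if (inString) {
            if (escaped) suffix.append('\\'); // turn a dangling backslash into an escaped backslash
            suffix.append('"');               // close the unterminated string
        }
        while (!open.isEmpty()) suffix.append(open.pop());
        return suffix.toString();
    }
}
```

For the valid example above this returns }}]} and for a text that is cut off in the middle of a string it also supplies the missing quote. Trailing commas are still not handled here.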

Related

Camel-Case to Sentence-Case in Java

I have the following code to convert a camel-case phrase to sentence-case. It works fine for almost all cases, but it can't handle acronyms. How can this code be corrected to work with acronyms?
private static final Pattern UPPERCASE_LETTER = Pattern.compile("([A-Z]|[0-9]+)");
static String toSentenceCase(String camelCaseString) {
return camelCaseString.substring(0, 1).toUpperCase()
+ UPPERCASE_LETTER.matcher(camelCaseString.substring(1))
.replaceAll(matchResult -> " " + (matchResult.group(1).toLowerCase()));
}
JUnit5 test:
@ParameterizedTest(name = "#{index}: Convert {0} to sentence case")
@CsvSource(value = {"testOfAcronymUSA:Test of acronym USA"}, delimiter = ':')
void shouldSentenceCaseAcronym(String input, String expected) {
//TODO: currently fails
assertEquals(expected, toSentenceCase(input));
}
Output:
org.opentest4j.AssertionFailedError:
Expected :Test of acronym USA
Actual :Test of acronym u s a
I thought to add (?=[a-z]) to the end of the regex, but then it doesn't handle the spacing correctly.
I'm on Java 14.

Change the regex to (?<=[a-z])[A-Z]+|[A-Z](?=[a-z])|[0-9]+ where
(?<=[a-z])[A-Z]+ specifies positive lookbehind for [a-z]
[A-Z](?=[a-z]) specifies positive lookahead for [a-z]
Note that you do not need any capturing group.
Demo:
import java.util.regex.Pattern;
public class Main {
private static final Pattern UPPERCASE_LETTER = Pattern.compile("(?<=[a-z])[A-Z]+|[A-Z](?=[a-z])|[0-9]+");
static String toSentenceCase(String camelCaseString) {
return camelCaseString.substring(0, 1).toUpperCase() + UPPERCASE_LETTER.matcher(camelCaseString.substring(1))
.replaceAll(matchResult -> !matchResult.group().matches("[A-Z]{2,}")
? " " + matchResult.group().toLowerCase()
: " " + matchResult.group());
}
public static void main(String[] args) {
System.out.println(toSentenceCase("camelCaseString"));
System.out.println(toSentenceCase("USA"));
System.out.println(toSentenceCase("camelCaseStringUSA"));
}
}
Output:
Camel case string
USA
Camel case string USA

To fix your immediate issue you may use
private static final Pattern UPPERCASE_LETTER = Pattern.compile("([A-Z]{2,})|([A-Z]|[0-9]+)");
static String toSentenceCase(String camelCaseString) {
return camelCaseString.substring(0, 1).toUpperCase()
+ UPPERCASE_LETTER.matcher(camelCaseString.substring(1))
.replaceAll(m -> m.group(1) != null ? " " + m.group(1) : " " + m.group(2).toLowerCase() );
}
See the Java demo.
Details
([A-Z]{2,})|([A-Z]|[0-9]+) regex matches and captures into Group 1 two or more uppercase letters, or captures into Group 2 a single uppercase letter or 1+ digits
.replaceAll(m -> m.group(1) != null ? " " + m.group(1) : " " + m.group(2).toLowerCase() ) replaces with space + Group 1 if Group 1 matched, else with a space and Group 2 turned to lower case.

java grouping regex fail to match string+text

I wrote this test
@Test
public void removeRequestTextFromRouteError() throws Exception {
String input = "Failed to handle request regression_4828 HISTORIC_TIME from=s:33901510 tn:27825741 bd:false st:Winifred~Dr to=s:-1 d:false f:-1.0 x:-73.92752 y:40.696857 r:-1.0 cd:-1.0 fn:-1 tn:-1 bd:true 1 null false null on subject RoutingRequest";
final String answer = stringUtils.removeRequestTextFromError(input);
String expected = "Failed to handle request _ on subject RoutingRequest";
assertThat(answer, equalTo(expected));
}
which runs this method, but fails
public String removeRequestTextFromError(String answer) {
answer = answer.replaceAll("regression_\\d\\[.*?\\] on subject", "_ on subject");
return answer;
}
The input text stays the same; the request text is not replaced with "_".
How can I change the pattern to fix this?

You are using the wrong regex. You are escaping [ and ], which makes the pattern require literal square brackets that the input does not contain, and you are using \\d instead of \\d+. Also, you can use a positive lookahead instead of matching and re-inserting the string "on subject".
Use:
public static void main(String[] args) {
String input = "Failed to handle request regression_4828 HISTORIC_TIME from=s:33901510 tn:27825741 bd:false st:Winifred~Dr to=s:-1 d:false f:-1.0 x:-73.92752 y:40.696857 r:-1.0 cd:-1.0 fn:-1 tn:-1 bd:true 1 null false null on subject RoutingRequest";
final String answer = input.replaceAll("regression_.* (?=on subject)", "_ ");
System.out.println(answer);
String expected = "Failed to handle request _ on subject RoutingRequest";
System.out.println(answer.equals(expected));
}
Output:
Failed to handle request _ on subject RoutingRequest
true

As an alternative to the answer given by @TheLostMind, you can try breaking your input into 3 pieces, the second piece being what you want to match and then remove.
Each quantity in parentheses, if matched, will be available as a capture group. Here is the regex with the capture groups labelled:
(.*)(regression_\\d+.* on subject)(.*)
$1 $2 $3
You want to retain $1 and $3:
public String removeRequestTextFromError(String answer) {
return answer.replaceAll("(.*)(regression_\\d+.* on subject)(.*)", "$1$3");
}
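Note that keeping only $1 and $3 also drops the literal " on subject", so the result differs from the expected string in the question. A variation on the same capture-group idea that keeps that tail and inserts the "_" placeholder might look like the sketch below. The class and method names are hypothetical, and the lazy quantifiers are my own choice so that the match stops at the first " on subject".

```java
class RouteErrorCleaner {
    /** Replaces the request details with "_" and keeps the " on subject ..." tail. */
    static String removeRequestText(String message) {
        // Group 1: everything before "regression_<digits>".
        // Group 2: the literal " on subject" and whatever follows it.
        return message.replaceAll("(.*?)regression_\\d+.*?( on subject.*)", "$1_$2");
    }

    public static void main(String[] args) {
        String input = "Failed to handle request regression_4828 HISTORIC_TIME "
                + "from=s:33901510 bd:true 1 null false null on subject RoutingRequest";
        // prints: Failed to handle request _ on subject RoutingRequest
        System.out.println(removeRequestText(input));
    }
}
```

If the pattern does not match at all, replaceAll returns the message unchanged.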

Antlr : beginner 's mismatched input expecting ID

As a beginner learning ANTLR4 from The Definitive ANTLR 4 Reference, I tried to run my modified version of the exercise from Chapter 7:
/**
* to parse properties file
* this example demonstrates using embedded actions in code
*/
grammar PropFile;
@header {
import java.util.Properties;
}
@members {
Properties props = new Properties();
}
file
:
{
System.out.println("Loading file...");
}
prop+
{
System.out.println("finished:\n"+props);
}
;
prop
: ID '=' STRING NEWLINE
{
props.setProperty($ID.getText(),$STRING.getText());//add one property
}
;
ID : [a-zA-Z]+ ;
STRING :(~[\r\n])+; //if use STRING : '"' .*? '"' everything is fine
NEWLINE : '\r'?'\n' ;
Since Java properties are just key-value pairs, I use STRING to match everything except NEWLINE (I don't want it to support only strings in double quotes). When running it on the following input, I got:
D:\Antlr\Ex\PropFile\Prop1>grun PropFile prop -tokens
driver=mysql
^Z
[@0,0:11='driver=mysql',<3>,1:0]
[@1,12:13='\r\n',<4>,1:12]
[@2,14:13='<EOF>',<-1>,2:14]
line 1:0 mismatched input 'driver=mysql' expecting ID
When I use STRING : '"' .*? '"' instead, it works.
I would like to know where I was wrong so that I can avoid similar mistakes in the future.
Please give me some suggestion, thank you!

Since both ID and STRING can match the input text starting with "driver", the lexer will choose the longest possible match, even though the ID rule comes first.
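To see the longest-match rule in isolation, here is a small simulation in plain Java. It is only a sketch of the selection rule, not how ANTLR is implemented: the class name LongestMatchDemo is made up, the three patterns are regex stand-ins for the ID, STRING and NEWLINE rules from the question, and the implicit '=' token is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Regex stand-ins for the three named lexer rules, kept in grammar order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ID", Pattern.compile("[a-zA-Z]+"));
        RULES.put("STRING", Pattern.compile("[^\\r\\n]+"));
        RULES.put("NEWLINE", Pattern.compile("\\r?\\n"));
    }

    /** Returns "RULE:text" for the first token of the input, or null if no rule matches. */
    static String firstToken(String input) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // Only a strictly longer match replaces the current best,
            // so on a tie the rule listed first wins.
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("driver=mysql\r\n")); // prints STRING:driver=mysql
        System.out.println(firstToken("driver\r\n"));       // prints ID:driver
    }
}
```

The first call shows the situation from the question: STRING swallows the whole line because its match is longer than the one for ID. Rule order only decides the winner when both matches have the same length, as in the second call.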
So, you have several choices here. The most direct is to remove the ambiguity between ID and STRING (which is how your alternative works) by requiring the string to start with the equals sign.
file : prop+ EOF ;
prop : ID STRING NEWLINE ;
ID : [a-zA-Z]+ ;
STRING : '=' (~[\r\n])+;
NEWLINE : '\r'?'\n' ;
You can then use an action to trim the equals sign from the text of the string token.
Alternately, you can use a predicate to disambiguate the rules.
file : prop+ EOF ;
prop : ID '=' STRING NEWLINE ;
ID : [a-zA-Z]+ ;
STRING : { isValue() }? (~[\r\n])+;
NEWLINE : '\r'?'\n' ;
where the isValue method looks backwards on the character stream to verify that it follows an equals sign. Something like:
@members {
public boolean isValue() {
int offset = _tokenStartCharIndex;
for (int idx = offset-1; idx >=0; idx--) {
String s = _input.getText(Interval.of(idx, idx));
if (Character.isWhitespace(s.charAt(0))) {
continue;
} else if (s.charAt(0) == '=') {
return true;
} else {
break;
}
}
return false;
}
}

JavaCC: How can I specify which token(s) are expected in certain context?

I need to make JavaCC aware of a context (current parent token), and depending on that context, expect different token(s) to occur.
Consider the following pseudo-code:
TOKEN <abc> { "abc*" } // recognizes "abc", "abcd", "abcde", ...
TOKEN <abcd> { "abcd*" } // recognizes "abcd", "abcde", "abcdef", ...
TOKEN <element1> { "element1" "[" expectOnly(<abc>) "]" }
TOKEN <element2> { "element2" "[" expectOnly(<abcd>) "]" }
...
So when the generated parser is "inside" a token named "element1" and it encounters "abcdef", it recognizes it as <abc>, but when it is "inside" a token named "element2", it recognizes the same string as <abcd>.
element1 [ abcdef ] // aha! it can only be <abc>
element2 [ abcdef ] // aha! it can only be <abcd>
If I'm not wrong, it would behave similar to more complex DTD definitions of an XML file.
So, how can one specify, in which "context" which token(s) are valid/expected?
NOTE: It would not be enough for my real case to define a kind of "hierarchy" of tokens, so that "abcdef" is always first matched against <abcd> and then <abc>. I really need context-aware tokens.

OK, it seems that you need a technique called lookahead here. Here is a very good tutorial:
Lookahead tutorial
My first attempt was wrong then, but as it works for distinct tokens which define a context I'll leave it here (Maybe it's useful for somebody ;o)).
Let's say we want to have some kind of markup language. All we want to "markup" are:
Expressions consisting of letters (abc...zABC...Z) and whitespaces --> words
Expressions consisting of numbers (0-9) --> numbers
We want to enclose words in <WORDS> tags and numbers in <NUMBER> tags. So if I got you right, this is what you want to do: in the word context (between word tags) the parser should expect letters and whitespace, and in the number context it should expect numbers.
I created the file WordNumber.jj which defines the grammar and the parser to be generated:
options
{
LOOKAHEAD= 1;
CHOICE_AMBIGUITY_CHECK = 2;
OTHER_AMBIGUITY_CHECK = 1;
STATIC = true;
DEBUG_PARSER = false;
DEBUG_LOOKAHEAD = false;
DEBUG_TOKEN_MANAGER = false;
ERROR_REPORTING = true;
JAVA_UNICODE_ESCAPE = false;
UNICODE_INPUT = false;
IGNORE_CASE = false;
USER_TOKEN_MANAGER = false;
USER_CHAR_STREAM = false;
BUILD_PARSER = true;
BUILD_TOKEN_MANAGER = true;
SANITY_CHECK = true;
FORCE_LA_CHECK = false;
}
PARSER_BEGIN(WordNumberParser)
/** Model-tree Parser */
public class WordNumberParser
{
/** Main entry point. */
public static void main(String args []) throws ParseException
{
WordNumberParser parser = new WordNumberParser(System.in);
parser.Input();
}
}
PARSER_END(WordNumberParser)
SKIP :
{
" "
| "\n"
| "\r"
| "\r\n"
| "\t"
}
TOKEN :
{
< WORD_TOKEN : (["a"-"z"] | ["A"-"Z"] | " " | "." | ",")+ > |
< NUMBER_TOKEN : (["0"-"9"])+ >
}
/** Root production. */
void Input() :
{}
{
( WordContext() | NumberContext() )* < EOF >
}
/** WordContext production. */
void WordContext() :
{}
{
"<WORDS>" (< WORD_TOKEN >)+ "</WORDS>"
}
/** NumberContext production. */
void NumberContext() :
{}
{
"<NUMBER>" (< NUMBER_TOKEN >)+ "</NUMBER>"
}
You can test it with a file like that:
<WORDS>This is a sentence. As you can see the parser accepts it.</WORDS>
<WORDS>The answer to life, universe and everything is</WORDS><NUMBER>42</NUMBER>
<NUMBER>This sentence will make the parser sad. Do not make the parser sad.</NUMBER>
The last line will cause the parser to throw an exception like this:
Exception in thread "main" ParseException: Encountered " <WORD_TOKEN> "This sentence will make the parser sad. Do not make the parser sad. "" at line 3, column 9.
Was expecting:
<NUMBER_TOKEN> ...
That is because the parser did not find what it expected.
I hope that helps.
Cheers!
P.S.: The parser can't "be" inside a token as a token is a terminal symbol (correct me if I'm wrong) which can't be replaced by production rules any further. So all the context aspects have to be placed inside a production rule (non terminal) like "WordContext" in my example.

You need to use lexer states. Your example becomes something like:
<DEFAULT> TOKEN: { <ELEMENT1: "element1">: IN_ELEMENT1 }
<DEFAULT> TOKEN: { <ELEMENT2: "element2">: IN_ELEMENT2 }
<IN_ELEMENT1> TOKEN: { <ABC: "abc" (...)*>: DEFAULT }
<IN_ELEMENT2> TOKEN: { <ABCD: "abcd" (...)*>: DEFAULT }
Please note that the (...)* are not proper JavaCC syntax, but your example is not either so I can only guess.

How to match a comment unless it's in a quoted string?

So I have some string:
//Blah blah blach
// sdfkjlasdf
"Another //thing"
And I'm using java regex to replace all the lines that have double slashes like so:
theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll("");
And it works for the most part, but the problem is it removes all the occurrences and I need to find a way to have it not remove the quoted occurrence. How would I go about doing that?

Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only those parts you're interested in, you could use some 3rd party tool like ANTLR.
ANTLR has the ability to define only those tokens you are interested in (and of course the tokens that can mess up your token-stream like multi-line comments and String- and char literals). So you only need to define a lexer (another word for tokenizer) that correctly handles those tokens.
This is called a grammar. In ANTLR, such a grammar could look like this:
lexer grammar FuzzyJavaLexer;
options{filter=true;}
SingleLineComment
: '//' ~( '\r' | '\n' )*
;
MultiLineComment
: '/*' .* '*/'
;
StringLiteral
: '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
;
CharLiteral
: '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
;
Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 here and save it in the same folder as your FuzzyJavaLexer.g file.
Execute the following command:
java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g
which will create a FuzzyJavaLexer.java source class.
Of course you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below in it:
import org.antlr.runtime.*;
public class FuzzyJavaLexerTest {
public static void main(String[] args) throws Exception {
String source =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // foo \n"+
" */ \n"+
" char quote = '\"'; \n"+
" // yes, a comment, finally!!! \n"+
" int i = 0; // another comment \n"+
"} \n";
System.out.println("===== source =====");
System.out.println(source);
System.out.println("==================");
ANTLRStringStream in = new ANTLRStringStream(source);
FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object obj : tokens.getTokens()) {
Token token = (Token)obj;
if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
System.out.println("Found a SingleLineComment on line "+token.getLine()+
", starting at column "+token.getCharPositionInLine()+
", text: "+token.getText());
}
}
}
}
Next, compile your FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:
javac -cp .:antlr-3.2.jar *.java
and finally execute the FuzzyJavaLexerTest.class file:
// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest
or:
// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest
after which you'll see the following being printed to your console:
===== source =====
class Test {
String s = " ... \" // no comment ";
/*
* also no comment: // foo
*/
char quote = '"';
// yes, a comment, finally!!!
int i = 0; // another comment
}
==================
Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!
Found a SingleLineComment on line 8, starting at column 13, text: // another comment
Pretty easy, eh? :)

Use a parser and walk the input character by character.
Kickoff example:
StringBuilder builder = new StringBuilder();
boolean quoted = false;
for (String line : string.split("\\n")) {
for (int i = 0; i < line.length(); i++) {
char c = line.charAt(i);
if (c == '"') {
quoted = !quoted;
}
if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
break;
} else {
builder.append(c);
}
}
builder.append("\n");
}
String parsed = builder.toString();
System.out.println(parsed);
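The kickoff example toggles the quoted flag on every double quote, so an escaped quote (\") inside a string or a char literal such as '"' would confuse it. A slightly extended sketch of the same character-by-character approach is shown below. The class and method names are hypothetical, and it makes the simplifying assumption that block comments (/* ... */) do not occur.

```java
class LineCommentStripper {
    /** Removes // comments that are not inside string or char literals. */
    static String stripLineComments(String source) {
        StringBuilder out = new StringBuilder();
        char quote = 0; // 0 = outside any literal, otherwise the opening quote character
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (quote != 0) {
                out.append(c);
                if (c == '\\' && i + 1 < source.length()) {
                    out.append(source.charAt(++i)); // copy the escaped character verbatim
                } else if (c == quote) {
                    quote = 0;                      // literal closed
                }
            } else if (c == '"' || c == '\'') {
                quote = c;
                out.append(c);
            } else if (c == '/' && i + 1 < source.length() && source.charAt(i + 1) == '/') {
                // Skip the comment but keep the line terminator itself.
                while (i < source.length() && source.charAt(i) != '\n' && source.charAt(i) != '\r') {
                    i++;
                }
                i--; // the loop increment moves back onto the terminator (or past the end)
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For the three lines from the question, the first two become empty lines and "Another //thing" is left untouched.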

(This is in answer to the question @finnw asked in the comment under his answer. It's not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)
Here's my test code:
String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";
String test =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // but no harm \n"+
" */ \n"+
" /* no comment: // much harm */ \n"+
" char quote = '\"'; // comment \n"+
" // another comment \n"+
" int i = 0; // and another \n"+
"} \n"
.replaceAll(" +$", "");
System.out.printf("%n%s%n", test);
System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));
r0 is the edited regex from your answer; it removes only the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.
r1 deals with the newline problem, but it still incorrectly matches // no comment in the string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"); and you only used two of them to match the backslash in the second part.
r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or single-line comments inside the multiline comments. They can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?

The following is from a grep-like program I wrote (in Perl) a few years ago. It has an option to strip java comments before processing the file:
# ============================================================================
# ============================================================================
#
# strip_java_comments
# -------------------
#
# Strip the comments from a Java-like file. Multi-line comments are
# replaced with the equivalent number of blank lines so that all text
# left behind stays on the same line.
#
# Comments are replaced by at least one space .
#
# The text for an entire file is assumed to be in $_ and is returned
# in $_
#
# ============================================================================
# ============================================================================
sub strip_java_comments
{
s!( (?: \" [^\"\\]* (?: \\. [^\"\\]* )* \" )
| (?: \' [^\'\\]* (?: \\. [^\'\\]* )* \' )
| (?: \/\/ [^\n] *)
| (?: \/\* .*? \*\/)
)
!
my $x = $1;
my $first = substr($x, 0, 1);
if ($first eq '/')
{
"\n" x ($x =~ tr/\n//);
}
else
{
$x;
}
!esxg;
}
This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by unicode escapes (\u0022 etc), but you can easily deal with those first if you want to.
As it's Perl, not java, the replacement code will have to change. I'll have a quick crack at producing equivalent java. Stand by...
EDIT: I've just whipped this up. Will probably need work:
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// The trick is to search for both comments and quoted strings.
// That way we won't notice a (partial or full) comment within a quoted string
// or a (partial or full) quoted string within a comment.
// (I may not have translated the backslashes accurately. You'll figure it out.)
static String stripJavaComments(String entireInputFileAsAString) {
    Pattern p = Pattern.compile(
        "( (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )" + // " ... "
        " | (?: ' [^'\\\\]* (?: \\\\. [^'\\\\]* )* ' )" + // or ' ... '
        " | (?: // [^\\n] * )" + // or // ...
        " | (?: /\\* .*? \\* / )" + // or /* ... */
        ")",
        Pattern.DOTALL | Pattern.COMMENTS
    );
    Matcher m = p.matcher(entireInputFileAsAString);
    // appendReplacement/appendTail accept a StringBuilder from Java 9 on;
    // use a StringBuffer on older versions.
    StringBuilder output = new StringBuilder();
    while (m.find())
    {
        if (m.group(1).startsWith("/"))
        {
            // This is a comment. Replace it with a space...
            m.appendReplacement(output, " ");
            // ... or replace it with an equivalent number of newlines
            // (exercise for reader)
        }
        else
        {
            // We matched a quoted string. Put it back
            m.appendReplacement(output, "$1");
        }
    }
    m.appendTail(output);
    return output.toString();
}

You can't tell with a regex alone whether you are inside a double-quoted string. In the end, a regex is just a state machine (sometimes extended a bit). I would use a parser like the one provided by BalusC, or this one.
If you want to know why regexes are limited, read about formal grammars. A Wikipedia article is a good start.
