A little fun with Java this time. I want to write a program that reads code from standard input (line by line, for example), like:
// some comment
class Main {
/* blah */
// /* foo
foo();
// foo */
foo2();
/* // foo2 */
}
finds all comments in it and removes them. I'm trying to use regular expressions, and for now I've done something like this:
private static String ParseCode(String pCode)
{
String MyCommentsRegex = "(?://.*)|(/\\*(?:.|[\\n\\r])*?\\*/)";
return pCode.replaceAll(MyCommentsRegex, " ");
}
but it seems not to work for all the cases, e.g.:
System.out.print("We can use /* comments */ inside a string of course, but it shouldn't start a comment");
Any advice or ideas different from regex?
Thanks in advance.
You may have already given up on this by now but I was intrigued by the problem.
I believe this is a partial solution...
Native regex:
//.*|("(?:\\[^"]|\\"|.)*?")|(?s)/\*.*?\*/
In Java:
String clean = original.replaceAll( "//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/", "$1 " );
This appears to properly handle comments embedded in strings as well as properly escaped quotes inside strings. I threw a few things at it to check but not exhaustively.
There is one compromise in that all "" blocks in the code will end up with space after them. Keeping this simple and solving that problem would be very difficult given the need to cleanly handle:
int/* some comment */foo = 5;
A simple Matcher.find/appendReplacement loop could conditionally check for group(1) before replacing with a space and would only be a handful of lines of code. Still simpler than a full up parser maybe. (I could add the matcher loop too if anyone is interested.)
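The matcher loop mentioned above could look roughly like the sketch below. The class name CommentStripper and the method strip are made up for illustration, and the string-literal alternative is a slightly simplified variant of the regex above (an escaped character or any character other than a quote or backslash), not the exact pattern from this answer. It does not know about char literals such as '"', so treat it as a starting point only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentStripper {
    // Group 1 captures a string literal; the other two alternatives match comments.
    // (?s:...) turns on DOTALL for the block-comment alternative only.
    private static final Pattern TOKEN = Pattern.compile(
            "(\"(?:\\\\.|[^\"\\\\])*\")|//.*|(?s:/\\*.*?\\*/)");

    static String strip(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep string literals verbatim; turn each comment into a single space.
            String replacement = m.group(1) != null ? m.group(1) : " ";
            // quoteReplacement keeps $ and \ inside literals from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("int/* some comment */foo = 5;"));
        System.out.println(strip("print(\"no /* comment */ here\"); // but one here"));
    }
}
```

Because string literals are copied back unchanged, they no longer pick up the extra trailing space that the one-line replaceAll version adds.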
The last example is no problem I think:
/* we comment out some code
System.out.print("We can use */ inside a string of course");
we end the comment */
... because the comment actually ends with "We can use */. This code does not compile.
But I have another problematic case:
int/*comment*/foo=3;
Your pattern will transform this into:
intfoo=3;
...which is invalid code. So it is better to replace comments with " " instead of "".
I think a 100% correct solution using regular expressions is either inhuman or impossible (taking into account escapes, etc.).
I believe the best option would be to use ANTLR; I believe they even provide a Java grammar you can use.
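Short of a full parser, a hand-written character scanner is one non-regex option. The following is only a sketch under stated assumptions: the names CommentScanner and strip are invented here, and it ignores Unicode escapes and Java text blocks ("""), which a real grammar would handle.

```java
class CommentScanner {
    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            boolean hasNext = i + 1 < src.length();
            char next = hasNext ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        i++; // skip the second '/'
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        i++; // skip the '*'
                    } else {
                        if (c == '"') {
                            state = State.STRING;
                        } else if (c == '\'') {
                            state = State.CHAR;
                        }
                        out.append(c);
                    }
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && hasNext) {
                        out.append(next); // copy the escaped character untouched
                        i++;
                    } else if ((state == State.STRING && c == '"')
                            || (state == State.CHAR && c == '\'')) {
                        state = State.CODE;
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n') { // the comment ends here; keep the line break
                        state = State.CODE;
                        out.append(c);
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        state = State.CODE;
                        i++; // skip the '/'
                        out.append(' '); // so int/*c*/foo stays two tokens
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("int/*comment*/foo=3; // trailing\nfoo2();"));
    }
}
```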
I ended up with this solution.
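import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CommentsFun {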
static List<Match> commentMatches = new ArrayList<Match>();
public static void main(String[] args) {
Pattern commentsPattern = Pattern.compile("(//.*?$)|(/\\*.*?\\*/)", Pattern.MULTILINE | Pattern.DOTALL);
Pattern stringsPattern = Pattern.compile("(\".*?(?<!\\\\)\")");
String text = getTextFromFile("src/my/test/CommentsFun.java");
Matcher commentsMatcher = commentsPattern.matcher(text);
while (commentsMatcher.find()) {
Match match = new Match();
match.start = commentsMatcher.start();
match.text = commentsMatcher.group();
commentMatches.add(match);
}
List<Match> commentsToRemove = new ArrayList<Match>();
Matcher stringsMatcher = stringsPattern.matcher(text);
while (stringsMatcher.find()) {
for (Match comment : commentMatches) {
if (comment.start > stringsMatcher.start() && comment.start < stringsMatcher.end())
commentsToRemove.add(comment);
}
}
for (Match comment : commentsToRemove)
commentMatches.remove(comment);
for (Match comment : commentMatches)
text = text.replace(comment.text, " ");
System.out.println(text);
}
//Single-line
// "String? Nope"
/*
* "This is not String either"
*/
//Complex */
///*More complex*/
/*Single line, but */
String moreFun = " /* comment? doubt that */";
String evenMoreFun = " // comment? doubt that ";
static class Match {
int start;
String text;
}
}
Another alternative is to use a library that supports AST parsing; e.g., org.eclipse.jdt.core has all the APIs you need to do this and more. But then that's just one alternative :)
private static String filterString(String code) {
String partialFiltered = code.replaceAll("/\\*.*\\*/", "");
String fullFiltered = partialFiltered.replaceAll("//.*(?=\\n)", "");
return fullFiltered;
}
I tried the code above to remove all comments from a string, but it isn't working. Please help.
Works with both // single and multi-line /* comments */.
String sourceCode =
"/*\n"
+ " * Multi-line comment\n"
+ " * Creates a new Object.\n"
+ " */\n"
+ "public Object someFunction() {\n"
+ " // single line comment\n"
+ " Object obj = new Object();\n"
+ " return obj; /* single-line comment */\n"
+ "}";
System.out.println(sourceCode.replaceAll(
"//.*|/\\*((.|\\n)(?!=*/))+\\*/", ""));
Input :
/*
* Multi-line comment
* Creates a new Object.
*/
public Object someFunction() {
// single line comment
Object obj = new Object();
return obj; /* single-line comment */
}
Output :
public Object someFunction() {
Object obj = new Object();
return obj;
}
How about....
private static String filterString(String code) {
return code.replace("//", "").replace("/*", "").replace("*/", "");
}
Replace the code below:
partialFiltered.replaceAll("//.*(?=\\n)", "");
with:
partialFiltered.replaceAll("//.*?\n","\n");
You need to use (?s) at the start of your partialFiltered regex to allow for comments spanning multiple lines (e.g. see Pattern.DOTALL with String.replaceAll).
But then the .* in the middle of /\\*.*\\*/ uses a greedy match so I'd expect it to replace the whole lot between two separate comment blocks. E.g., given the following:
/* Comment #1 */
for (i = 0; i < 10; i++)
{
i++
}
/* Comment #2 */
Haven't tested this so am risking egg on my face but would expect it to remove the whole lot including the code in the middle rather than just the two comments. One way to prevent would be to use .*? to make the inner matching non-greedy, i.e. to match as little as possible:
String partialFiltered = code.replaceAll("(?s)/\\*.*?\\*/", "");
Since the fullFiltered regex doesn't begin with (?s), it should work without the (?=\\n) (since the replaceAll regex doesn't span multiple lines by default) - so you should be able to change it to:
String fullFiltered = partialFiltered.replaceAll("//.*", "");
There are also possible issues with looking for the characters denoting a comment, e.g. if they appear within a string or regular expression pattern but I'm assuming these aren't important for your application - if they are it's probably the end of the road for using simple regular expressions and you may need a parser instead...
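A small check of the greedy versus non-greedy behaviour described above, using the two-comment example from this answer. The class and method names (GreedyDemo, greedy, reluctant) are made up for the demonstration.

```java
class GreedyDemo {
    static final String CODE =
            "/* Comment #1 */\n"
          + "for (i = 0; i < 10; i++)\n"
          + "{\n"
          + "i++\n"
          + "}\n"
          + "/* Comment #2 */";

    // Greedy .* with (?s) runs from the first /* to the last */.
    static String greedy(String code) {
        return code.replaceAll("(?s)/\\*.*\\*/", "");
    }

    // Reluctant .*? stops at the first */ after each /*.
    static String reluctant(String code) {
        return code.replaceAll("(?s)/\\*.*?\\*/", "");
    }

    public static void main(String[] args) {
        System.out.println("greedy:    [" + greedy(CODE) + "]");
        System.out.println("reluctant: [" + reluctant(CODE) + "]");
    }
}
```

With this input the greedy version returns an empty string, because the text starts with /* and ends with */, while the reluctant version leaves the for loop in place.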
Maybe this can help someone:
return code.replaceAll(
"((['\"])(?:(?!\\2|\\\\).|\\\\.)*\\2)|\\/\\/[^\\n]*|\\/\\*(?:[^*]|\\*(?!\\/))*\\*\\/", "$1");
The same regex without Java string escaping, for trying out in a regex tester: ((['"])(?:(?!\2|\\).|\\.)*\2)|\/\/[^\n]*|\/\*(?:[^*]|\*(?!\/))*\*\/
I have the code below, but it seems to parse keywords incorrectly for Chinese. How can I change it?
OUTPUT:
keyword:test
keyword:中
keyword:文
keyword:U
keyword:I
keyword:素
keyword:材
Should be below:
keyword:test
keyword:中文
keyword:UI
keyword:素材
This is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static final Pattern KEYWORDS_REGEX =
            Pattern.compile("[^\\s,](?:[^,]+[^\\s,])?");

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        String keywords = "test, 中文, UI, 素材";
        Matcher matcher = KEYWORDS_REGEX.matcher(keywords);
        while (matcher.find()) {
            String s = matcher.group();
            System.out.println("keyword:" + s);
        }
    }
}
Thanks!
The problem isn't with Chinese characters; it is with keywords that are two characters long. (That's why it affects UI as well.) This regex:
[^\s,](?:[^,]+[^\s,])?
allows two possibilities:
[^\s,] <-- exactly one character
[^\s,][^,]+[^\s,] <-- three or more characters
so any keywords with two characters will not match, so they get split into single-character keywords.
You could fix your regex by changing [^,]+ to [^,]*, but I'm inclined to agree with the spirit of Kisaro's comment above; I think you'd be better off using Pattern.split:
private static final Pattern KEYWORD_SPLITTER = Pattern.compile("\\s*,\\s*");
for(final String s : KEYWORD_SPLITTER.split(keywords))
System.out.println("keyword:" + s);
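Both fixes can be tried side by side with a small sketch like the one below. The class name KeywordDemo and the helper methods byFind and bySplit are invented for illustration; the patterns are the ones discussed in this answer, with [^,]+ relaxed to [^,]* in the first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordDemo {
    // Same pattern as in the question, with [^,]+ relaxed to [^,]*,
    // so two-character keywords match as a whole.
    static final Pattern KEYWORDS_REGEX = Pattern.compile("[^\\s,](?:[^,]*[^\\s,])?");
    static final Pattern KEYWORD_SPLITTER = Pattern.compile("\\s*,\\s*");

    static List<String> byFind(String keywords) {
        List<String> result = new ArrayList<>();
        Matcher matcher = KEYWORDS_REGEX.matcher(keywords);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    static List<String> bySplit(String keywords) {
        // trim() removes leading and trailing whitespace that the splitter would keep.
        return Arrays.asList(KEYWORD_SPLITTER.split(keywords.trim()));
    }

    public static void main(String[] args) {
        String keywords = "test, 中文, UI, 素材";
        System.out.println(byFind(keywords));
        System.out.println(bySplit(keywords));
    }
}
```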
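Your regex could be \\w+ to match words. Note that in Java \\w only matches ASCII letters, digits and the underscore by default, so for the Chinese keywords the pattern has to be compiled with Pattern.UNICODE_CHARACTER_CLASS to generate the desired output.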
Also since someone suggested explode: Apache Commons
I have a group of strings in an ArrayList.
1) I want to remove all the strings that contain only numbers, and also strings like (0.75%) or $1.5; basically everything that does not contain any characters.
2) I want to remove all special characters from each string before I write it to the console.
"God should be printed: God
&quot;Including should be printed: quoteIncluding
'find should be printed: find
Java boasts a very nice Pattern class that makes use of regular expressions. You should definitely read up on that. A good reference guide is here.
I was going to post a coding solution for you, but styfle beat me to it! The only thing I would have done differently is that, within the for loop, I would have used the Pattern and Matcher classes, like so:
Pattern p = Pattern.compile("[a-zA-Z]"); // compile once, outside the loop
for (int i = 0; i < myArray.size(); i++) {
    Matcher m = p.matcher(myArray.get(i));
    boolean match = m.find(); // true if the string contains at least one letter
    //more code to get the string you want
}
But that is too bulky. styfle's solution is succinct and easy.
When you say "characters," I'm assuming you mean only "a through z" and "A through Z." You probably want to use Regular Expressions (Regex) as D1e mentioned in a comment. Here is an example using the replaceAll method.
import java.util.ArrayList;
public class Test {
public static void main(String[] args) {
ArrayList<String> list = new ArrayList<String>(5);
list.add("\"God");
list.add("&quot;Including");
list.add("'find");
list.add("24No3Numbers97");
list.add("w0or5*d;");
for (String s : list) {
s = s.replaceAll("[^a-zA-Z]",""); //use whatever regex you wish
System.out.println(s);
}
}
}
The output of this code is as follows:
God
quotIncluding
find
NoNumbers
word
The replaceAll method uses a regex pattern and replaces all the matches with the second parameter (in this case, the empty string).
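The first half of the question (dropping entries that contain no letters at all) can be folded into the same loop. This is a sketch only, assuming "characters" means ASCII letters as above; the names LetterFilter and clean are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LetterFilter {
    // Strips every non-letter, then drops entries that end up empty,
    // i.e. entries that contained no ASCII letter to begin with.
    static List<String> clean(List<String> input) {
        List<String> result = new ArrayList<>();
        for (String s : input) {
            String lettersOnly = s.replaceAll("[^a-zA-Z]", "");
            if (!lettersOnly.isEmpty()) {
                result.add(lettersOnly);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("\"God", "'find", "(0.75%)", "$1.5", "42", "w0or5*d;");
        System.out.println(clean(list));
    }
}
```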
I need help replacing all \n (newline) characters with <br /> in a String, except for those \n inside [code][/code] tags.
My brain is burning, I can't solve this on my own :(
Example:
test test test
test test test
test
test
[code]some
test
code
[/code]
more text
Should be:
test test test<br />
test test test<br />
test<br />
test<br />
<br />
[code]some
test
code
[/code]<br />
<br />
more text<br />
Thanks for your time.
Best regards.
I would suggest a (simple) parser, and not a regular expression. Something like this (bad pseudocode):
stack elementStack;
foreach(char in string) {
if(string-from-char == "[code]") {
elementStack.push("code");
string-from-char = "";
}
if(string-from-char == "[/code]") {
elementStack.popTo("code");
string-from-char = "";
}
if(char == "\n" && !elementStack.contains("code")) {
char = "<br/>\n";
}
}
You've tagged the question regex, but this may not be the best tool for the job.
You might be better using basic compiler building techniques (i.e. a lexer feeding a simple state machine parser).
Your lexer would identify five tokens: ("[code]", '\n', "[/code]", EOF, :all other strings:) and your state machine looks like:
state token action
------------------------
begin :none: --> out
out [code] OUTPUT(token), --> in
out \n OUTPUT(break), OUTPUT(token)
out * OUTPUT(token)
in [/code] OUTPUT(token), --> out
in * OUTPUT(token)
* EOF --> end
EDIT: I see another poster discussing the possible need for nesting the blocks. This state machine won't handle that. For nested blocks, use a recursive descent parser (not quite so simple but still easy enough and extensible).
EDIT: Axeman notes that this design excludes the use of "[/code]" in the code. An escape mechanism can be used to beat this. For example, add '\' to your tokens and add:
state token action
------------------------
in \ -->esc-in
esc-in * OUTPUT(token), -->in
out \ -->esc-out
esc-out * OUTPUT(token), -->out
to the state machine.
The usual arguments in favor of machine generated lexers and parsers apply.
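In Java, the basic state table above (without the escape states) might be sketched like this. The class name CodeTagStateMachine and the method convert are assumptions for the example, and a regex is used only as the lexer that finds the three special tokens; the text between them is the "all other strings" token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodeTagStateMachine {
    private enum State { OUT, IN }

    // The lexer: "[code]", "[/code]" and "\n" are tokens; anything else is plain text.
    private static final Pattern TOKEN = Pattern.compile("\\[/?code\\]|\\n");

    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        State state = State.OUT;
        Matcher m = TOKEN.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // plain text is always output as-is
            String token = m.group();
            if (state == State.OUT && token.equals("\n")) {
                out.append("<br />");           // OUTPUT(break) before the newline
            } else if (state == State.OUT && token.equals("[code]")) {
                state = State.IN;
            } else if (state == State.IN && token.equals("[/code]")) {
                state = State.OUT;
            }
            out.append(token);                  // OUTPUT(token) in every row of the table
            last = m.end();
        }
        out.append(input, last, input.length()); // trailing text before EOF
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("test\n[code]some\ncode\n[/code]\nmore text\n"));
    }
}
```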
This seems to do it:
private final static String PATTERN = "\\*+";
public static void main(String args[]) {
Pattern p = Pattern.compile("(.*?)(\\[/?code\\])", Pattern.DOTALL);
String s = "test 1 ** [code]test 2**blah[/code] test3 ** blah [code] test * 4 [code] test 5 * [/code] * test 6[/code] asdf **";
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer(); // note: it has to be a StringBuffer not a StringBuilder because of the Pattern API
int codeDepth = 0;
while (m.find()) {
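if (codeDepth == 0) {
// quoteReplacement stops appendReplacement from interpreting $ and \ in the text
m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1).replaceAll(PATTERN, "")));
} else {
m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1)));
}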
if (m.group(2).equals("[code]")) {
codeDepth++;
} else {
codeDepth--;
}
sb.append(m.group(2));
}
if (codeDepth == 0) {
StringBuffer sb2 = new StringBuffer();
m.appendTail(sb2);
sb.append(sb2.toString().replaceAll(PATTERN, ""));
} else {
m.appendTail(sb);
}
System.out.printf("Original: %s%n", s);
System.out.printf("Processed: %s%n", sb);
}
It's not a straightforward regex, but I don't think you can do what you want with a straightforward regex. Not with handling nested elements and so forth.
As mentioned by other posters, regular expressions are not the best tool for the job because their quantifiers are greedy by default. This means that if you tried to match code blocks using something like:
(\[code\].*\[/code\])
then the expression will match everything from the first [code] tag to the last [/code] tag, which is clearly not what you want. While there are ways to get around this, the resulting regular expressions are usually brittle, unintuitive, and downright ugly. Something like the following Python code would work much better.
def add_brs(text):
    return text.replace('\n', '<br/>\n')

def convert(source):
    output = []
    # the first block will *not* have a matching [/code] tag
    blocks = source.split('[code]')
    output.append(add_brs(blocks[0]))
    # for all the rest of the blocks, only add <br/> tags to
    # the segment after the [/code] tag
    for block in blocks[1:]:
        parts = block.split('[/code]')
        if len(parts) != 2:
            raise ValueError('Too many or too few [/code] tags')
        # the segment in the code block is pre, everything
        # after is post
        pre, post = parts
        # split() dropped the tags, so put them back around the code
        output.append('[code]' + pre + '[/code]')
        output.append(add_brs(post))
    # finally join all the processed segments together
    return ''.join(output)
Note the above code was not tested, it's just a rough idea of what you'll need to do.
To get it right, you really need to make three passes:
1. Find [code] blocks and replace them with a unique token + index (saving the original block), e.g., "foo [code]abc[/code] bar[code]efg[/code]" becomes "foo TOKEN-1 barTOKEN-2"
2. Do your newline replacement.
3. Scan for escape tokens and restore the original block.
The code looks something* like:
Matcher m = escapePattern.matcher(input);
while (m.find()) {
    String key = nextKey();
    escaped.put(key, m.group());
    m.appendReplacement(output1, "TOKEN-" + key);
}
m.appendTail(output1);
Matcher m2 = newlinePattern.matcher(output1);
while (m2.find()) {
    m2.appendReplacement(output2, newlineReplacement);
}
m2.appendTail(output2);
Matcher m3 = Pattern.compile("TOKEN-(\\d+)").matcher(output2);
while (m3.find()) {
    // quoteReplacement keeps any $ or \ in the saved block literal
    m3.appendReplacement(finalOutput, Matcher.quoteReplacement(escaped.get(m3.group(1))));
}
m3.appendTail(finalOutput);
That's the quick and dirty way. There are more efficient ways (others have mentioned parser/lexers), but unless you're processing millions of lines and your code is CPU bound (rather than I/O bound, like most webapps) and you've confirmed with a profiler that this is the bottleneck, they probably aren't worth it.
* I haven't run it, this is all from memory. Just check the API and you'll be able to work it out.
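For reference, a self-contained version of the three passes might look like the sketch below. Everything named here (ThreePass, convert, the placeholder format) is an assumption made for the example; in particular the placeholder is wrapped in NUL characters on the assumption that they never occur in real input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreePass {
    private static final Pattern CODE_BLOCK = Pattern.compile("(?s)\\[code\\].*?\\[/code\\]");
    // Hypothetical placeholder format: NUL, "TOKEN-", index, NUL.
    private static final Pattern TOKEN = Pattern.compile("\u0000TOKEN-(\\d+)\u0000");

    static String convert(String input) {
        // Pass 1: swap each [code] block for an indexed placeholder.
        List<String> saved = new ArrayList<>();
        Matcher m1 = CODE_BLOCK.matcher(input);
        StringBuffer pass1 = new StringBuffer();
        while (m1.find()) {
            saved.add(m1.group());
            m1.appendReplacement(pass1, "\u0000TOKEN-" + (saved.size() - 1) + "\u0000");
        }
        m1.appendTail(pass1);

        // Pass 2: newline replacement on what is left.
        String pass2 = pass1.toString().replace("\n", "<br />\n");

        // Pass 3: restore the saved blocks.
        Matcher m3 = TOKEN.matcher(pass2);
        StringBuffer result = new StringBuffer();
        while (m3.find()) {
            String block = saved.get(Integer.parseInt(m3.group(1)));
            // quoteReplacement keeps $ and \ in the code block literal.
            m3.appendReplacement(result, Matcher.quoteReplacement(block));
        }
        m3.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("test\n[code]some\ncode\n[/code]\nmore text\n"));
    }
}
```

Nested [code] blocks are not handled by this sketch, since the reluctant match stops at the first [/code].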
It is hard because, while regexes are good at finding something, they are not so good at matching everything except something... So you have to use a loop; I doubt you can do it in one go.
After searching, I found something close to cletus's solution, except that I assumed code blocks cannot be nested, which leads to simpler code. Choose whichever suits your needs.
import java.util.regex.*;
class Test
{
static final String testString = "foo\nbar\n[code]\nprint'';\nprint{'c'};\n[/code]\nbar\nfoo";
static final String replaceString = "<br>\n";
public static void main(String args[])
{
Pattern p = Pattern.compile("(.+?)(\\[code\\].*?\\[/code\\])?", Pattern.DOTALL);
Matcher m = p.matcher(testString);
StringBuilder result = new StringBuilder();
while (m.find())
{
result.append(m.group(1).replaceAll("\\n", replaceString));
if (m.group(2) != null)
{
result.append(m.group(2));
}
}
System.out.println(result.toString());
}
}
Crude quick test, you need more (null, empty string, no code tag, multiple, etc.).