I am having some difficulty with a syntax highlighter that I've written and that is about 90% complete. It reads the source of a .java file, detects keywords, comments, etc., and writes colorized output to an HTML file. Sample output from it is:
(I couldn't upload a whole HTML page, so this is a screenshot.) As I hope you can see, my program works correctly with keywords, literals and comments (see below), so it can handle almost all programs. But it breaks when a String contains the escape sequence for a double quote, i.e. \". An error case is shown below:
The string literal highlighting doesn't stop at the end of the literal, but continues until it finds another cue, like a keyword or another literal.
So, the question is how do I disguise/hide/remove this \" from within a String?
The stringFilter method of my program is:
public String stringFilter(String line) {
if (line == null || line.equals("")) {
return "";
}
StringBuffer buf = new StringBuffer();
if (line.indexOf("\"") <= -1) {
return keywordFilter(line);
}
int start = 0;
int startStringIndex = -1;
int endStringIndex = -1;
int tempIndex;
//Keep moving through String characters until we want to stop...
while ((tempIndex = line.indexOf("\"")) > -1 && !isInsideString(line, tempIndex)) {
//We found the beginning of a string
if (startStringIndex == -1) {
startStringIndex = 0;
buf.append( stringFilter(line.substring(start,tempIndex)) );
buf.append("</font>");
buf.append(literal).append("\"");
line = line.substring(tempIndex+1);
}
//Must be at the end
else {
startStringIndex = -1;
endStringIndex = tempIndex;
buf.append(line.substring(0,endStringIndex+1));
buf.append("</font>");
buf.append(normal);
line = line.substring(endStringIndex+1);
}
}
buf.append( keywordFilter(line) );
return buf.toString();
}
EDIT
In response to the first few comments and answers, here's what I tried:
A snippet from htmlFilter(String), but it doesn't work :(
//replace '&' i.e. ampersands with HTML escape sequence for ampersand.
line = line.replaceAll("&", "&amp;");
//line = line.replaceAll(" ", "&nbsp;");
line = line.replaceAll("" + (char)35, "&#35;");
// replace less-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll("<", "&lt;");
// replace greater-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll(">", "&gt;");
line = multiLineCommentFilter(line);
//replace the '\\' i.e. escape for backslash with HTML escape sequences.
//fixes a problem when backslashes precede quotes.
//line = line.replaceAll("\\\"", "\"");
//line = line.replaceAll("" + (char)92 + (char)92, "\\");
return line;
My idea is that when a backslash is met, ignore the next character.
String str = "blah\"blah\\blah\n";
int index = 0;
while (true) {
// find the beginning
while (index < str.length() && str.charAt(index) != '\"')
index++;
int beginIndex = index;
if (index == str.length()) // no string found
break;
index++;
// find the ending
while (index < str.length()) {
if (str.charAt(index) == '\\') {
// escape, ignore the next character
index += 2;
} else if (str.charAt(index) == '\"') {
// end of string found
System.out.println(beginIndex + " " + index);
break;
} else {
// plain content
index++;
}
}
if (index >= str.length())
throw new IllegalArgumentException(
"String literal is not properly closed by a double-quote");
index++;
}
Check the character at tempIndex-1; if it is \ then don't treat the quote as the beginning or end of a string.
String originalLine=line;
if ((tempIndex = originalLine.indexOf("\"", tempIndex + 1)) > -1) {
if (tempIndex==0 || originalLine.charAt(tempIndex - 1) != '\\') {
...
Steps to follow:
First replace all \" with some temp string such as
String tempStr="forward_slash_followed_by_double_quote";
line = line.replaceAll("\\\\\"", tempStr);
//line = line.replaceAll("\\\"", tempStr);
Do whatever processing you are already doing.
Finally replace that temp string with \"
line = line.replaceAll(tempStr, "\\\\\"");
//line = line.replaceAll(tempStr, "\\\"");
The trouble with finding a quote and then trying to work out whether it's escaped is that it's not enough to simply look at the previous character to see if it's a backslash - consider
String basedir = "C:\\Users\\";
where the \" isn't an escaped quote, but is actually an escaped backslash followed by an unescaped quote. In general a quote preceded by an odd number of backslashes is escaped, one preceded by an even number of backslashes isn't.
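The odd/even rule can be sketched as a small helper that walks backwards from the quote and counts backslashes. This is only an illustration: the class and method names are made up, and it ignores unicode escapes and anything outside string literals.

```java
class QuoteEscapeCheck {
    /** Returns true if the quote at quoteIndex is preceded by an odd number of backslashes. */
    static boolean isEscapedQuote(String text, int quoteIndex) {
        int backslashes = 0;
        // Walk left from the quote for as long as we keep seeing backslashes.
        for (int i = quoteIndex - 1; i >= 0 && text.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 1;
    }

    public static void main(String[] args) {
        String escaped = "say \\\"hi";           // source text: say \"hi
        String unescaped = "C:\\\\Users\\\\\"";  // source text: C:\\Users\\"
        System.out.println(isEscapedQuote(escaped, escaped.indexOf('"')));     // prints true
        System.out.println(isEscapedQuote(unescaped, unescaped.indexOf('"'))); // prints false
    }
}
```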
A more sensible approach would be to parse through the string one character at a time from left to right rather than trying to jump ahead to quote characters. If you don't want to have to learn a proper parser generator like JavaCC or antlr then you can tackle this case with regular expressions using the \G anchor (to force each subsequent match to start at the end of the previous one with no gaps) - if we assume that str is a substring of your input starting with the character following the opening quote of a string literal then
Pattern p = Pattern.compile("\\G(?:\\\\u[0-9A-Fa-f]{4}|\\\\.|[^\"\\\\])");
StringBuilder buf = new StringBuilder();
Matcher m = p.matcher(str);
while(m.find()) buf.append(m.group());
will leave buf containing the content of the string literal up to but not including the closing quote, and will handle escapes like \", \\ and unicode escapes \uNNNN.
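For a highlighter, the same escape-aware idea can be folded into one pattern that matches a whole literal, opening and closing quote included. The sketch below assumes one line of source at a time and does not know about comments or char literals such as '"'; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringLiteralFinder {
    // Opening quote, then any run of "backslash + any char" pairs or ordinary
    // characters (anything except a quote or backslash), then the closing quote.
    private static final Pattern LITERAL = Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");

    /** Returns every complete string literal found in the line, quotes included. */
    static List<String> findLiterals(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = LITERAL.matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Source text being scanned: String s = "a\"b" + "C:\\";
        String line = "String s = \"a\\\"b\" + \"C:\\\\\";";
        for (String literal : findLiterals(line)) {
            System.out.println(literal);
        }
    }
}
```

m.start() and m.end() give the positions where the literal colour would begin and end, so the escaped quote never has to be hidden or replaced.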
Use double slash "\\"" instead of "\""... Maybe it works...
Related
I am using opencsv 2.3 and it does not appear to be dealing with escape characters as I expect. I need to be able to handle an escaped separator in a CSV file that does not use quoting characters.
Sample test code:
CSVReader reader = new CSVReader(new FileReader("D:/Temp/test.csv"), ',', '"', '\\');
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
for (String string : nextLine) {
System.out.println("Field [" + string + "].");
}
}
and the csv file:
first field,second\,field
and the output:
Field [first field].
Field [second].
Field [field].
Note that if I change the csv to
first field,"second\,field"
then I get the output I am after:
Field [first field].
Field [second,field].
However, in my case I do not have the option of modifying the source CSV.
Unfortunately it looks like opencsv does not support escaping of separator characters unless they're in quotes. The following method (taken from opencsv's source) is called when an escape character is encountered.
protected boolean isNextCharacterEscapable(String nextLine, boolean inQuotes, int i) {
return inQuotes // we are in quotes, therefore there can be escaped quotes in here.
&& nextLine.length() > (i + 1) // there is indeed another character to check.
&& (nextLine.charAt(i + 1) == quotechar || nextLine.charAt(i + 1) == this.escape);
}
As you can see, this method only returns true if the character following the escape character is a quote character or another escape character. You could patch the library to do this, but in its current form, it won't let you do what you're trying to do.
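If patching is not an option and the file really has no quoted fields, a hand-written splitter is one possible workaround. The following is a minimal sketch, not part of opencsv: the class and method names are invented, and it deliberately supports no quoting at all.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedSeparatorSplitter {
    /** Splits a line on separator, treating "escape + any char" as that literal char. */
    static List<String> split(String line, char separator, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                // Skip the escape character and keep the next character literally.
                current.append(line.charAt(++i));
            } else if (c == separator) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // the last field has no trailing separator
        return fields;
    }

    public static void main(String[] args) {
        // File content: first field,second\,field
        for (String field : split("first field,second\\,field", ',', '\\')) {
            System.out.println("Field [" + field + "].");
        }
    }
}
```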
I would like to prefix a regex with two backslashes to avoid a dangling metacharacter exception. I am using the following piece of code:
if (pattern != null) {
pattern = pattern.trim();
if (pattern.length() > 0) {
if (pattern.indexOf("+[") >= 0) {
pattern = "\\\\" + pattern;
}
String formatterPattern ="formatter.setPattern(Pattern.compile(\"" + pattern + "\"));";
System.out.println("formatterPattern : " + formatterPattern);
}
}
But, I am getting,
formatter.setPattern(Pattern.compile("\+[0-9]{1,3}-[0-9()+-]{1,30}"));
instead of
formatter.setPattern(Pattern.compile("\\+[0-9]{1,3}-[0-9()+-]{1,30}"));
Even when I tried more backslashes, it still adds only one backslash to the regex.
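One possible way to approach this, assuming the goal is to emit the pattern as a Java string literal in generated source, is to escape every backslash and double quote of the runtime value instead of prepending a fixed prefix. The helper name below is hypothetical, and the runtime value of pattern is an assumption based on the output shown in the question.

```java
class JavaLiteralEscaper {
    /** Wraps value in double quotes, escaping backslashes and quotes for Java source. */
    static String toJavaStringLiteral(String value) {
        // String.replace works on literal text, so no regex escaping rules apply here.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // Assumed runtime value: \+[0-9]{1,3}-[0-9()+-]{1,30} (one leading backslash)
        String pattern = "\\+[0-9]{1,3}-[0-9()+-]{1,30}";
        System.out.println(
            "formatter.setPattern(Pattern.compile(" + toJavaStringLiteral(pattern) + "));");
    }
}
```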
I have a string containing a space, and I want to replace that space with "\_". For example, here is my code:
String example = "Bill Gates";
example = example.replaceAll(" ","\\_");
And the result of example is: "Bill_Gates" not "Bill\_Gates". When I try to do like this
String example = "Bill Gates";
example = example.replaceAll(" ","\\\\_");
The result of example is: "Bill\\_Gates" not "Bill\_Gates"
You need to use replaceAll(" ","\\\\_") instead of replaceAll(" ","\\_"), because "\\" in Java source is a string literal that compiles to a single backslash. When you pass that to replaceAll, the method treats the backslash as an escape character for the "_" that follows it. If you look inside the replaceAll implementation:
while (cursor < replacement.length()) {
char nextChar = replacement.charAt(cursor);
if (nextChar == '\\') {
cursor++;
if (cursor == replacement.length())
throw new IllegalArgumentException(
"character to be escaped is missing");
nextChar = replacement.charAt(cursor);
result.append(nextChar);
cursor++;
When it finds a single backslash, it drops the backslash and appends the character that follows it literally. So you have to pass "\\\\_" to replaceAll, which compiles to \\_. The method sees the first backslash, appends the second backslash literally, and then appends the underscore.
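Two alternatives avoid counting backslashes by hand: String.replace, which treats both arguments as plain text, and Matcher.quoteReplacement, which escapes a replacement string for replaceAll. Both are standard JDK methods; the class and method names in this sketch are just for illustration.

```java
import java.util.regex.Matcher;

class LiteralReplacementDemo {
    // String.replace takes literal text, so "\\_" (backslash + underscore) is inserted as-is.
    static String withReplace(String s) {
        return s.replace(" ", "\\_");
    }

    // replaceAll interprets backslashes in the replacement; quoteReplacement neutralises that.
    static String withQuoteReplacement(String s) {
        return s.replaceAll(" ", Matcher.quoteReplacement("\\_"));
    }

    public static void main(String[] args) {
        System.out.println(withReplace("Bill Gates"));          // prints Bill\_Gates
        System.out.println(withQuoteReplacement("Bill Gates")); // prints Bill\_Gates
    }
}
```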
Try:
String example = "Bill Gates";
example = example.replaceAll(" ","\\\\_");
System.out.println(example);
public static void main(String[] args) {
String example = "Bill Gates";
example = example.replaceAll(" ", "\\\\_");
System.out.println(example);
}
output
Bill\_Gates
I'm having some difficulties in excluding part of strings after the "#" symbol.
Let me explain:
This is a sample input text a user could insert in a textbox:
Some Text
Some Text again #A comment
#A comment line
Another Text
Another Text again#Comment
I need to read this text and ignore all text after "#" symbol.
This should be the expected output:
Some Text;Some Text again;Another Text;Another Text again
As for now here's the code:
This replaces all newlines with ";"
readText = userInputTextArea.getText();
readTextAllInALine = readText.replaceAll("\\n", ";");
so the output after this is:
Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment
This code is to ignore all characters after the first "#" but works fine just for the first line if we read it all sequentially.
int startIndex = inputCommandText.indexOf("#");
int endIndex = inputCommandText.indexOf(";");
String toBeReplaced = inputCommandText.substring(startIndex, endIndex);
readTextAllInALine.replace(toBeReplaced, "");
I'm stuck in finding a way for having the expected output. I was thinking of using a StringTokenizer, processing every line, removing text after "#" or ignoring the whole line if it starts with "#", and then printing all tokens (i.e. all lines) separating them with ";" but I cannot make it work.
Any help will be appreciated.
Thank you very much in advance.
Regards.
Just call this replace command on your pure string, retrieved from the text input. The regex #[^;]* grabs everything, starting at the hash until it reads a semicolon. Afterwards it replaces it with an empty string.
public static void main(String[] args) {
String text = "Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment";
System.out.println(text);
text = text.replaceAll("#[^;]*", "");
System.out.println(text);
}
A regex is useful here but it's tricky because your pattern is moderately complex. The comments are end-of-line comments, so they can appear in more than one arrangement.
I came up with the following which is a two-pass:
replaceAll(" *(#.*(?=\\n|$))", "").replaceAll("\\n+", ";");
The two-pass circumvents the fact that sometimes you get a duplicate line break. The first expression replaces comments but not new line characters and the second expression replaces multiple new line characters with a single semicolon.
The individual parts of the expression in the first pass are the following:
" *"
This includes zero or more leading spaces in the comment match, i.e. in "...again #A...", we want to remove that space between n and #.
"(#.* )"
The start of the comment match: matches a # followed by zero or more characters. (Typically the . matches any character except a new line.)
"(?= )"
This is a positive lookahead and where the regex starts to get tricky. It looks for whatever is inside this expression but doesn't include it in the text that's matched. It asserts that the #.* is followed by a certain string but doesn't replace that certain string.
"\\n|$"
The lookahead finds a new line or the end anchor. This will find a comment ended with a new line character or a comment that is at the end of the String. But again, since it's inside the lookahead, the new line doesn't get replaced.
So given the input:
String text = (
"Some Text" + '\n' +
"Some Text again #A comment" + '\n' +
"#A comment line" + '\n' +
"Another Text" + '\n' +
"Another Text again#Comment"
);
System.out.println(
text.replaceAll(" *(#.*(?=\\n|$))", "").replaceAll("\\n+", ";")
);
The output is:
Some Text;Some Text again;Another Text;Another Text again
readText = userInputTextArea.getText();
readText = readText.replaceAll("\\s*#[^\n]*", "");
readText = readText.replaceAll("\n+", ";");
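The line-by-line idea from the question can also be written without any comment-matching regex. This is a rough sketch under the assumption that lines are separated by \n or \r\n and that surrounding whitespace should be trimmed; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class HashCommentStripper {
    /** Drops everything after '#' on each line and joins the non-empty remainders with ';'. */
    static String stripComments(String text) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            int hash = line.indexOf('#');
            // Keep only the part before the first '#', if there is one.
            String content = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!content.isEmpty()) {
                kept.add(content); // skip blank lines and comment-only lines
            }
        }
        return String.join(";", kept);
    }

    public static void main(String[] args) {
        String text = "Some Text\nSome Text again #A comment\n#A comment line\n"
                + "Another Text\nAnother Text again#Comment";
        System.out.println(stripComments(text));
    }
}
```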
Just to make it clear, Coxer's reply is the way to go. Far more precise and clean. But in any case, if you fancy experimenting here is a recursive solution that will work:
public class IgnoreHash {
@Test
public void test() {
String readTextAllInALine = "Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment;";
String actualResult = removeHashComments(readTextAllInALine);
Assert.assertEquals(actualResult, "Some Text;Some Text again ;Another Text;Another Text again");
}
private String removeHashComments(String input) {
StringBuffer result = new StringBuffer();
int hashIndex = input.indexOf("#");
int endIndex = input.indexOf(";");
if(hashIndex != -1){
result.append(input.substring(0, hashIndex));
//first line
if(hashIndex < endIndex ) {
result.append(removeHashComments(input.substring(endIndex)));
} // the case of ;#
else if (endIndex == hashIndex-1) {
int endIndex2 = input.indexOf(";", hashIndex+1);
result.append(removeHashComments(input.substring(endIndex2+1)));
}
else {
result.append(removeHashComments(input.substring(hashIndex)));
}
}
return result.toString();
}
}
I am trying to split a String into tokens.
The token delimiters are not single characters, some delimiters are contained in others (for example, & and &&), and I need the delimiters themselves returned as tokens.
StringTokenizer cannot deal with multi-character delimiters. I presume it's possible with String.split, but I can't work out the regular expression that would suit my needs.
Any ideas?
Example:
Token delimiters: "&", "&&", "=", "=>", " "
String to tokenize: a & b&&c=>d
Expected result: a string array containing "a", " ", "&", " ", "b", "&&", "c", "=>", "d"
--- Edit ---
Thanks to all for your help, Dasblinkenlight gives me the solution. Here is the "ready to use" code I wrote with his help:
private static String[] wonderfulTokenizer(String string, String[] delimiters) {
// First, create a regular expression that matches the union of the delimiters
// Be aware that, when one delimiter contains another (for example && and &),
// the longer one must come before the shorter one (&& before &), or the regex
// engine will recognize && as two separate &.
Arrays.sort(delimiters, new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
return -o1.compareTo(o2);
}
});
// Build a string that will contain the regular expression
StringBuilder regexpr = new StringBuilder();
regexpr.append('(');
for (String delim : delimiters) { // For each delimiter
if (regexpr.length() != 1) regexpr.append('|'); // Add union separator if needed
// Quote the whole delimiter so regex metacharacters are matched literally.
// (Blindly prefixing every character with a backslash would turn letters
// and digits into regex escapes such as \d or \t.)
regexpr.append(Pattern.quote(delim));
}
regexpr.append(')'); // Close the union
Pattern p = Pattern.compile(regexpr.toString());
// Now, search for the tokens
List<String> res = new ArrayList<String>();
Matcher m = p.matcher(string);
int pos = 0;
while (m.find()) { // While there's a delimiter in the string
if (pos != m.start()) {
// If there's something between the current and the previous delimiter
// Add it to the tokens list
res.add(string.substring(pos, m.start()));
}
res.add(m.group()); // add the delimiter
pos = m.end(); // Remember end of delimiter
}
if (pos != string.length()) {
// If it remains some characters in the string after last delimiter
// Add this to the token list
res.add(string.substring(pos));
}
// Return the result
return res.toArray(new String[res.size()]);
}
It could be optimized, if you have many strings to tokenize, by creating the Pattern only once.
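A sketch of that optimization, assuming a small wrapper class (the name DelimiterTokenizer is made up) that compiles the pattern once in its constructor and reuses it for every input. It sorts by length so that longer delimiters win, and uses Pattern.quote for escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterTokenizer {
    private final Pattern delimiterPattern;

    DelimiterTokenizer(String... delimiters) {
        String[] sorted = delimiters.clone(); // leave the caller's array untouched
        // Longest first, so "&&" is tried before "&" and "=>" before "=".
        Arrays.sort(sorted, (a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder regex = new StringBuilder();
        for (String delimiter : sorted) {
            if (regex.length() > 0) regex.append('|');
            regex.append(Pattern.quote(delimiter)); // match the delimiter literally
        }
        delimiterPattern = Pattern.compile(regex.toString());
    }

    /** Returns the tokens of input in order, with the delimiters included as tokens. */
    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = delimiterPattern.matcher(input);
        int pos = 0;
        while (m.find()) {
            if (pos != m.start()) tokens.add(input.substring(pos, m.start()));
            tokens.add(m.group());
            pos = m.end();
        }
        if (pos != input.length()) tokens.add(input.substring(pos));
        return tokens;
    }

    public static void main(String[] args) {
        DelimiterTokenizer tokenizer = new DelimiterTokenizer("&", "&&", "=", "=>", " ");
        System.out.println(tokenizer.tokenize("a & b&&c=>d"));
    }
}
```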
You can use the Pattern and a simple loop to achieve the results that you are looking for:
List<String> res = new ArrayList<String>();
Pattern p = Pattern.compile("([&]{1,2}|=>?| +)");
String s = "s=a&=>b";
Matcher m = p.matcher(s);
int pos = 0;
while (m.find()) {
if (pos != m.start()) {
res.add(s.substring(pos, m.start()));
}
res.add(m.group());
pos = m.end();
}
if (pos != s.length()) {
res.add(s.substring(pos));
}
for (String t : res) {
System.out.println("'"+t+"'");
}
This produces the result below:
's'
'='
'a'
'&'
'=>'
'b'
Split won't do it for you as it removes the delimiter. You probably need to tokenize the string on your own (i.e. with a for-loop) or use a framework like
http://www.antlr.org/
Try this:
String test = "a & b&&c=>d=A";
String regEx = "(&[&]?|=[>]?)";
String[] res = test.split(regEx);
for(String s : res){
System.out.println("Token: "+s);
}
I added the '=A' at the end to show that that is also parsed.
As mentioned in another answer, if you need the atypical behaviour of keeping the delimiters in the result, you will probably need to write your parser yourself... but in that case you really have to think about what a "delimiter" is in your code.