Control code 0x6 causing XML error - java

I have a Java application running which fetches data by XML, but once in a while i have some data consisting some sort of control code?
An invalid XML character (Unicode: 0x6) was found in the CDATA section.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x6) was found in the CDATA section.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at domain.Main.processLogFromUrl(Main.java:342)
at domain.Main.<init>(Main.java:67)
at domain.Main.main(Main.java:577)
Can anyone explain what this control code exactly does as i cannot find much info?
Thanks in advance.

You need to write a FilterInputStream to filter the data before the SAX parser gets it. It must either remove or recode the bad data.
Apache have a super-flexible example. You may wish to put together a much simpler one.
Here's one of mine which does other cleaning up but I am sure it will be a good start.
/* Cleans up often very bad xml.
*
* 1. Strips leading white space.
* 2. Recodes £ etc to &#...;.
* 3. Recodes lone & as &amp.
*
*/
public class XMLInputStream extends FilterInputStream {
private static final int MIN_LENGTH = 2;
// Everything we've read.
StringBuilder red = new StringBuilder();
// Data I have pushed back.
StringBuilder pushBack = new StringBuilder();
// How much we've given them.
int given = 0;
// How much we've read.
int pulled = 0;
public XMLInputStream(InputStream in) {
super(in);
}
public int length() {
// NB: This is a Troll length (i.e. it goes 1, 2, many) so 2 actually means "at least 2"
try {
StringBuilder s = read(MIN_LENGTH);
pushBack.append(s);
return s.length();
} catch (IOException ex) {
log.warning("Oops ", ex);
}
return 0;
}
private StringBuilder read(int n) throws IOException {
// Input stream finished?
boolean eof = false;
// Read that many.
StringBuilder s = new StringBuilder(n);
while (s.length() < n && !eof) {
// Always get from the pushBack buffer.
if (pushBack.length() == 0) {
// Read something from the stream into pushBack.
eof = readIntoPushBack();
}
// Pushback only contains deliverable codes.
if (pushBack.length() > 0) {
// Grab one character
s.append(pushBack.charAt(0));
// Remove it from pushBack
pushBack.deleteCharAt(0);
}
}
return s;
}
// Returns false at eof.
// Might not actually push back anything but usually will.
private boolean readIntoPushBack() throws IOException {
// File finished?
boolean eof = false;
// Next char.
int ch = in.read();
if (ch >= 0) {
// Discard whitespace at start?
if (!(pulled == 0 && isWhiteSpace(ch))) {
// Good code.
pulled += 1;
// Parse out the &stuff;
if (ch == '&') {
// Process the &
readAmpersand();
} else {
// Not an '&', just append.
pushBack.append((char) ch);
}
}
} else {
// Hit end of file.
eof = true;
}
return eof;
}
// Deal with an ampersand in the stream.
private void readAmpersand() throws IOException {
// Read the whole word, up to and including the ;
StringBuilder reference = new StringBuilder();
int ch;
// Should end in a ';'
for (ch = in.read(); isAlphaNumeric(ch); ch = in.read()) {
reference.append((char) ch);
}
// Did we tidily finish?
if (ch == ';') {
// Yes! Translate it into a &#nnn; code.
String code = XML.hash(reference);
if (code != null) {
// Keep it.
pushBack.append(code);
} else {
throw new IOException("Invalid/Unknown reference '&" + reference + ";'");
}
} else {
// Did not terminate properly!
// Perhaps an & on its own or a malformed reference.
// Either way, escape the &
pushBack.append("&").append(reference).append((char) ch);
}
}
private void given(CharSequence s, int wanted, int got) {
// Keep track of what we've given them.
red.append(s);
given += got;
log.finer("Given: [" + wanted + "," + got + "]-" + s);
}
#Override
public int read() throws IOException {
StringBuilder s = read(1);
given(s, 1, 1);
return s.length() > 0 ? s.charAt(0) : -1;
}
#Override
public int read(byte[] data, int offset, int length) throws IOException {
int n = 0;
StringBuilder s = read(length);
for (int i = 0; i < Math.min(length, s.length()); i++) {
data[offset + i] = (byte) s.charAt(i);
n += 1;
}
given(s, length, n);
return n > 0 ? n : -1;
}
#Override
public String toString() {
String s = red.toString();
String h = "";
// Hex dump the small ones.
if (s.length() < 8) {
Separator sep = new Separator(" ");
for (int i = 0; i < s.length(); i++) {
h += sep.sep() + Integer.toHexString(s.charAt(i));
}
}
return "[" + given + "]-\"" + s + "\"" + (h.length() > 0 ? " (" + h + ")" : "");
}
private boolean isWhiteSpace(int ch) {
switch (ch) {
case ' ':
case '\r':
case '\n':
case '\t':
return true;
}
return false;
}
private boolean isAlphaNumeric(int ch) {
return ('a' <= ch && ch <= 'z')
|| ('A' <= ch && ch <= 'Z')
|| ('0' <= ch && ch <= '9');
}
}

Quite why you've got that character will depend on what the data is meant to represent. (Apparently it's ACK, but that's odd to represent in a file...) However, the important point is that it makes the XML invalid - you simply can't represent that character in XML.
From the XML 1.0 spec, section 2.2:
Character Range
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
| [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note how this excludes Unicode values below U+0020 other than U+0009 (tab), U+000A (line-feed) and U+000D (carriage return).
If you have any influence over the data coming back, you should change it to return valid XML. If not, you'll have to do some preprocessing on it before parsing it as XML. Quite what you'll want to do with unwanted control characters depends on what meaning they have in your situation.

Try to define your XML as version 1.1:
<?xml version="1.1"?>

Related

Shortening the representation of a string by adding the number of consecutive characters

Given a random character string not including (0-9), I need to shorten the representation of that string by adding the number of consecutive characters. For e.g: ggee will result in g2e2 being displayed.
I managed to implement the program and tested it (works correctly) through various inputs. I have run into the issue where I cannot seem to understand how the character "e" is displayed given the input above.
I have traced my code multiple times but I don't see when/how "e" is displayed when "i" is 2/3.
String input = new String("ggee");
char position = input.charAt(0);
int accumulator = 1;
for (int i = 1; i < input.length(); i++)
{
// Correction. Was boolean lastIndexString = input.charAt(i) == (input.charAt(input.length() - 1));
boolean lastIndexString = i == (input.length() - 1);
if (position == input.charAt(i))
{
accumulator++;
if (lastIndexOfString)
System.out.print(accumulator); // In my mind, I should be printing
// (input.charAt(i) + "" + accumulator); here
}
else //(position != input.charAt(i))
{
if (accumulator > 1)
{
System.out.print(position + "" + accumulator);
}
else
{
System.out.print(position + "");
}
position = input.charAt(i);
accumulator = 1;
if (lastIndexOfString)
System.out.print(input.charAt(i)); // This is always printing when
// I am at the last index of my string,
// even ignoring my condition of
// (position == input.charAt(i))
}
}
In Java 9+, using regular expression to find consecutive characters, the following will do it:
static String shorten(String input) {
return Pattern.compile("(.)\\1+").matcher(input)
.replaceAll(r -> r.group(1) + r.group().length());
}
Test
System.out.println(shorten("ggggeecaaaaaaaaaaaa"));
System.out.println(shorten("ggggee😀😀😀😁😁😁😁"));
Output
g4e2ca12
g4e2😀6😁8
However, as you can see, that code doesn't work if input string contains Unicode characters from the supplemental planes, such as Emoji characters.
Small modification will fix that:
static String shorten(String input) {
return Pattern.compile("(.)\\1+").matcher(input)
.replaceAll(r -> r.group(1) + r.group().codePointCount(0, r.group().length()));
}
Or:
static String shorten(String input) {
return Pattern.compile("(.)\\1+").matcher(input)
.replaceAll(r -> r.group(1) + input.codePointCount(r.start(), r.end()));
}
Output
g4e2ca12
g4e2😀3😁4
Basically you want each char with no of repeats.
*******************************************************************************/
public class Main
{
public static void main(String[] args) {
String s="ggggggeee";
StringBuilder s1=new
StringBuilder("") ;
;
for(int i=0;i<s.length();i++)
{
int count=0,j;
for( j=i+1;j<s.length();j++)
{
if(s.charAt(i)==s.charAt(j))
count++;
else
{
break;}
}
i=j-1;
s1=s1.append(s.charAt(i)+""+(count+1));
}
System.out.print(s1);
}}
Output

Replace nested string with some rules

There are 3 rules in the string:
It contains either word or group (enclosed by parentheses), and group can be nested;
If there is a space between word or group, those words or groups should append with "+".
For example:
"a b" needs to be "+a +b"
"a (b c)" needs to be "+a +(+b +c)"
If there is a | between word or group, those words or groups should be surround with parentheses.
For example:
"a|b" needs to be "(a b)"
"a|b|c" needs to be "(a b c)"
Consider all the rules, here is another example:
"aa|bb|(cc|(ff gg)) hh" needs to be "+(aa bb (cc (+ff +gg))) +hh"
I have tried to use regex, stack and recursive descent parser logic, but still cannot fully solve the problem.
Could anyone please share the logic or pseudo code on this problem?
New edited:
One more important rule: vertical bar has higher precedence.
For example:
aa|bb hh cc|dd (a|b) needs to be +(aa bb) +hh +(cc dd) +((a b))
(aa dd)|bb|cc (ee ff)|(gg hh) needs to be +((+aa +dd) bb cc) +((+ee +ff) (+gg +hh))
New edited:
To solve the precedence problem, I find a way to add the parentheses before calling Sunil Dabburi's methods.
For example:
aa|bb hh cc|dd (a|b) will be (aa|bb) hh (cc|dd) (a|b)
(aa dd)|bb|cc (ee ff)|(gg hh) will be ((aa dd)|bb|cc) ((ee ff)|(gg hh))
Since the performance is not a big concern to my application, this way at least make it work for me. I guess the JavaCC tool may solve this problem beautifully. Hope someone else can continue to discuss and contribute this problem.
Here is my attempt. Based on your examples and a few that I came up with I believe it is correct under the rules. I solved this by breaking the problem up into 2 parts.
Solving the case where I assume the string only contains words or is a group with only words.
Solving words and groups by substituting child groups out, use the 1) part and recursively repeating 2) with the child groups.
private String transformString(String input) {
Stack<Pair<Integer, String>> childParams = new Stack<>();
String parsedInput = input;
int nextInt = Integer.MAX_VALUE;
Pattern pattern = Pattern.compile("\\((\\w|\\|| )+\\)");
Matcher matcher = pattern.matcher(parsedInput);
while (matcher.find()) {
nextInt--;
parsedInput = matcher.replaceFirst(String.valueOf(nextInt));
String childParam = matcher.group();
childParams.add(Pair.of(nextInt, childParam));
matcher = pattern.matcher(parsedInput);
}
parsedInput = transformBasic(parsedInput);
while (!childParams.empty()) {
Pair<Integer, String> childGroup = childParams.pop();
parsedInput = parsedInput.replace(childGroup.fst.toString(), transformBasic(childGroup.snd));
}
return parsedInput;
}
// Transform basic only handles strings that contain words. This allows us to simplify the problem
// and not have to worry about child groups or nested groups.
private String transformBasic(String input) {
String transformedBasic = input;
if (input.startsWith("(")) {
transformedBasic = input.substring(1, input.length() - 1);
}
// Append + in front of each word if there are multiple words.
if (transformedBasic.contains(" ")) {
transformedBasic = transformedBasic.replaceAll("( )|^", "$1+");
}
// Surround all words containing | with parenthesis.
transformedBasic = transformedBasic.replaceAll("([\\w]+\\|[\\w|]*[\\w]+)", "($1)");
// Replace pipes with spaces.
transformedBasic = transformedBasic.replace("|", " ");
if (input.startsWith("(") && !transformedBasic.startsWith("(")) {
transformedBasic = "(" + transformedBasic + ")";
}
return transformedBasic;
}
Verified with the following test cases:
#ParameterizedTest
#CsvSource({
"a b,+a +b",
"a (b c),+a +(+b +c)",
"a|b,(a b)",
"a|b|c,(a b c)",
"aa|bb|(cc|(ff gg)) hh,+(aa bb (cc (+ff +gg))) +hh",
"(aa(bb(cc|ee)|ff) gg),(+aa(bb(cc ee) ff) +gg)",
"(a b),(+a +b)",
"(a(c|d) b),(+a(c d) +b)",
"bb(cc|ee),bb(cc ee)",
"((a|b) (a b)|b (c|d)|e),(+(a b) +((+a +b) b) +((c d) e))"
})
void testTransformString(String input, String output) {
Assertions.assertEquals(output, transformString(input));
}
#ParameterizedTest
#CsvSource({
"a b,+a +b",
"a b c,+a +b +c",
"a|b,(a b)",
"(a b),(+a +b)",
"(a|b),(a b)",
"a|b|c,(a b c)",
"(aa|bb cc|dd),(+(aa bb) +(cc dd))",
"(aa|bb|ee cc|dd),(+(aa bb ee) +(cc dd))",
"aa|bb|cc|ff gg hh,+(aa bb cc ff) +gg +hh"
})
void testTransformBasic(String input, String output) {
Assertions.assertEquals(output, transformBasic(input));
}
I tried to solve the problem. Not sure if it works in all cases. Verified with the inputs given in the question and it worked fine.
We need to format the pipes first. That will help add necessary parentheses and spacing.
The spaces generated as part of pipe processing can interfere with actual spaces that are available in our expression. So used $ symbol to mask them.
To process spaces, its tricky as parantheses need to be processed individually. So the approach I am following is to find a set of parantheses starting from outside and going inside.
So typically we have <left_part><parantheses_code><right_part>. Now left_part can be empty, similary right_part can be empty. we need to handle such cases.
Also, if the right_part starts with a space, we need to add '+' to left_part as per space requirement.
NOTE: I am not sure what's expected of (a|b). If the result should be ((a b)) or (a b). I am going with ((a b)) purely by the definition of it.
Now here is the working code:
public class Test {
public static void main(String[] args) {
String input = "aa|bb hh cc|dd (a|b)";
String result = formatSpaces(formatPipes(input)).replaceAll("\\$", " ");
System.out.println(result);
}
private static String formatPipes(String input) {
while (true) {
char[] chars = input.toCharArray();
int pIndex = input.indexOf("|");
if (pIndex == -1) {
return input;
}
input = input.substring(0, pIndex) + '$' + input.substring(pIndex + 1);
int first = pIndex - 1;
int closeParenthesesCount = 0;
while (first >= 0) {
if (chars[first] == ')') {
closeParenthesesCount++;
}
if (chars[first] == '(') {
if (closeParenthesesCount > 0) {
closeParenthesesCount--;
}
}
if (chars[first] == ' ') {
if (closeParenthesesCount == 0) {
break;
}
}
first--;
}
String result;
if (first > 0) {
result = input.substring(0, first + 1) + "(";
} else {
result = "(";
}
int last = pIndex + 1;
int openParenthesesCount = 0;
while (last <= input.length() - 1) {
if (chars[last] == '(') {
openParenthesesCount++;
}
if (chars[last] == ')') {
if (openParenthesesCount > 0) {
openParenthesesCount--;
}
}
if (chars[last] == ' ') {
if (openParenthesesCount == 0) {
break;
}
}
last++;
}
if (last >= input.length() - 1) {
result = result + input.substring(first + 1) + ")";
} else {
result = result + input.substring(first + 1, last) + ")" + input.substring(last);
}
input = result;
}
}
private static String formatSpaces(String input) {
if (input.isEmpty()) {
return "";
}
int startIndex = input.indexOf("(");
if (startIndex == -1) {
if (input.contains(" ")) {
String result = input.replaceAll(" ", " +");
if (!result.trim().startsWith("+")) {
result = '+' + result;
}
return result;
} else {
return input;
}
}
int endIndex = startIndex + matchingCloseParenthesesIndex(input.substring(startIndex));
if (endIndex == -1) {
System.out.println("Invalid input!!!");
return "";
}
String first = "";
String last = "";
if (startIndex > 0) {
first = input.substring(0, startIndex);
}
if (endIndex < input.length() - 1) {
last = input.substring(endIndex + 1);
}
String result = formatSpaces(first);
String parenthesesStr = input.substring(startIndex + 1, endIndex);
if (last.startsWith(" ") && first.isEmpty()) {
result = result + "+";
}
result = result + "("
+ formatSpaces(parenthesesStr)
+ ")"
+ formatSpaces(last);
return result;
}
private static int matchingCloseParenthesesIndex(String input) {
int counter = 1;
char[] chars = input.toCharArray();
for (int i = 1; i < chars.length; i++) {
char ch = chars[i];
if (ch == '(') {
counter++;
} else if (ch == ')') {
counter--;
}
if (counter == 0) {
return i;
}
}
return -1;
}
}

Recursively decompressing a String

I have this assignment that needs me to decompress a previously compressed string.
Examples of this would be
i4a --> iaaaa
q3w2ai2b --> qwwwaaibb
3a --> aaa
Here's what I've written so far:
public static String decompress(String compressedText)
{
char c;
char let;
int num;
String done = "";
String toBeDone = "";
String toBeDone2 = "";
if(compressedText.length() <= 1)
{
return compressedText;
}
if (Character.isLetter(compressedText.charAt(0)))
{
done = compressedText.substring(0,1);
toBeDone = compressedText.substring(1);
return done + decompress(toBeDone);
}
else
{
c = compressedText.charAt(0);
num = Character.getNumericValue(c);
let = compressedText.charAt(1);
if (num > 0)
{
num--;
toBeDone = num + Character.toString(let);
toBeDone2 = compressedText.substring(2);
return Character.toString(let) + decompress(toBeDone) + decompress(toBeDone2);
}
else
{
toBeDone2 = compressedText.substring(2);
return Character.toString(let) + decompress(toBeDone2);
}
}
}
My return values are absolutely horrendous.
"ab" yields "babb" somehow.
"a" or any 1 letter string string yields the right result
"2a" yields "aaaaaaaaaaa"
"2a3b" gives me "aaaabbbbbbbbbbbbbbbbbbbbbbbbbbaaabbbbaaaabbbbbbbbbbbbbbbbbbbbbbbbbb"
The only place I can see a mistake in would probably be the last else section, since I wasn't entirely sure on what to do once the number reaches 0 and I have to stop using recursion on the letter after it. Other than that, I can't really see a problem that gives such horrifying outputs.
I reckon something like this would work:
public static String decompress(String compressedText) {
if (compressedText.length() <= 1) {
return compressedText;
}
char c = compressedText.charAt(0);
if (Character.isDigit(c)) {
return String.join("", Collections.nCopies(Character.digit(c, 10), compressedText.substring(1, 2))) + decompress(compressedText.substring(2));
}
return compressedText.charAt(0) + decompress(compressedText.substring(1));
}
As you can see, the base case is when the compressed String has a length less than or equal to 1 (as you have it in your program).
Then, we check if the first character is a digit. If so, we substitute in the correct amount of characters, and continue with the recursive process until we reach the base case.
If the first character is not a digit, then we simply append it and continue.
Keep in mind that this will only work with numbers from 1 to 9; if you require higher values, let me know!
EDIT 1: If the Collections#nCopies method is too complex, here is an equivalent method:
if (Character.isDigit(c)) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < Character.digit(c, 10); i++) {
sb.append(compressedText.charAt(1));
}
return sb.toString() + decompress(compressedText.substring(2));
}
EDIT 2: Here is a method that uses a recursive helper-method to repeat a String:
public static String decompress(String compressedText) {
if (compressedText.length() <= 1) {
return compressedText;
}
char c = compressedText.charAt(0);
if (Character.isDigit(c)) {
return repeatCharacter(compressedText.charAt(1), Character.digit(c, 10)) + decompress(compressedText.substring(2));
}
return compressedText.charAt(0) + decompress(compressedText.substring(1));
}
public static String repeatCharacter(char character, int counter) {
if (counter == 1) {
return Character.toString(character);
}
return character + repeatCharacter(character, counter - 1);
}

Can Java String.toUpperCase() ever fail?

Situation: There is a Java ESB, which is taking input (family name) from a Vaadin web form, and should guarantee upper-casing it before saving it into DB.
I was assigned to investigate a reported issue, that lower-case characters sometimes appear in DB. I have learned, that the program is using String.toUpperCase() just before saving data through EntityManager (it is the only place that is modifying received data).
So what I wonder is, whether this shall be enough. So far I havent found any "well-known" problems related to toUpperCase() function, but I wanna be sure.
So the question - Does String.toUpperCase() always do its job? Or are there any possible characters or circumstances when error may occur and the letters may not be upper-cased?
Can Java String.toUpperCase() ever fail?
It depends on whether you are passing in locale sensitive Strings (see below).
In the implementation for Java.lang.String, it simply uses the default locale:
public String toUpperCase() {
return toUpperCase(Locale.getDefault());
}
toUpperCase(Locale) converts all of the characters in this String to upper case using the rules of the given Locale. Case mapping is based on the Unicode Standard version specified by the Character class. Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String.
This method is locale sensitive, and may produce unexpected results if used for strings that are intended to be interpreted locale independently. Examples are programming language identifiers, protocol keys, and HTML tags.
To obtain correct results for locale insensitive strings, use toUpperCase(Locale.ENGLISH).
In case you are interested on how toUpperCase(Locale) was implemented:
public String toUpperCase(Locale locale) {
if (locale == null) {
throw new NullPointerException();
}
int firstLower;
final int len = value.length;
/* Now check if there are any characters that need to be changed. */
scan: {
for (firstLower = 0 ; firstLower < len; ) {
int c = (int)value[firstLower];
int srcCount;
if ((c >= Character.MIN_HIGH_SURROGATE)
&& (c <= Character.MAX_HIGH_SURROGATE)) {
c = codePointAt(firstLower);
srcCount = Character.charCount(c);
} else {
srcCount = 1;
}
int upperCaseChar = Character.toUpperCaseEx(c);
if ((upperCaseChar == Character.ERROR)
|| (c != upperCaseChar)) {
break scan;
}
firstLower += srcCount;
}
return this;
}
/* result may grow, so i+resultOffset is the write location in result */
int resultOffset = 0;
char[] result = new char[len]; /* may grow */
/* Just copy the first few upperCase characters. */
System.arraycopy(value, 0, result, 0, firstLower);
String lang = locale.getLanguage();
boolean localeDependent =
(lang == "tr" || lang == "az" || lang == "lt");
char[] upperCharArray;
int upperChar;
int srcChar;
int srcCount;
for (int i = firstLower; i < len; i += srcCount) {
srcChar = (int)value[i];
if ((char)srcChar >= Character.MIN_HIGH_SURROGATE &&
(char)srcChar <= Character.MAX_HIGH_SURROGATE) {
srcChar = codePointAt(i);
srcCount = Character.charCount(srcChar);
} else {
srcCount = 1;
}
if (localeDependent) {
upperChar = ConditionalSpecialCasing.toUpperCaseEx(this, i, locale);
} else {
upperChar = Character.toUpperCaseEx(srcChar);
}
if ((upperChar == Character.ERROR)
|| (upperChar >= Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
if (upperChar == Character.ERROR) {
if (localeDependent) {
upperCharArray =
ConditionalSpecialCasing.toUpperCaseCharArray(this, i, locale);
} else {
upperCharArray = Character.toUpperCaseCharArray(srcChar);
}
} else if (srcCount == 2) {
resultOffset += Character.toChars(upperChar, result, i + resultOffset) - srcCount;
continue;
} else {
upperCharArray = Character.toChars(upperChar);
}
/* Grow result if needed */
int mapLen = upperCharArray.length;
if (mapLen > srcCount) {
char[] result2 = new char[result.length + mapLen - srcCount];
System.arraycopy(result, 0, result2, 0, i + resultOffset);
result = result2;
}
for (int x = 0; x < mapLen; ++x) {
result[i + resultOffset + x] = upperCharArray[x];
}
resultOffset += (mapLen - srcCount);
} else {
result[i + resultOffset] = (char)upperChar;
}
}
return new String(result, 0, len + resultOffset);
}
Without any further information which charater (you descide to be lowercase) is stored in the database I would guess the origin is similar to cases which are explained in those blogs
by Heinz Kabutz
http://www.javaspecialists.eu/archive/Issue209.html
http://www.javaspecialists.eu/archive/Issue211.html
by Elliotte Rusty Harold
http://cafe.elharo.com/blogroll/turkish/
edit It could be that in the database is a character stored which looks similar (based on the font) to a Latin character and for which no uppercase letter exists.
One example is the GREEK LETTER YOT which looks similar to the LATIN SMALL LETTER J and has no uppercase letter.
Small snippet for demonstration.
int[] codePoints = { 0x03F3, 0x006A};
for (int codePoint : codePoints) {
char lowerCase = (char) Character.toLowerCase(codePoint);
char upperCase = (char) Character.toUpperCase(codePoint);
System.out.printf("Unicode name: %s%n", Character.getName(codePoint));
System.out.printf("lowercase : %s%n", lowerCase);
System.out.printf("uppercase : %s (%s)%n", upperCase,
Character.isUpperCase(upperCase));
}
The output is
Unicode name: GREEK LETTER YOT
lowercase : ϳ
uppercase : ϳ (false)
Unicode name: LATIN SMALL LETTER J
lowercase : j
uppercase : J (true)

Removing all types of comments in java file

I have a java project and i have used comments in many location in various java files in the project. Now i need to remove all type of comments : single line , multiple line comments .
Please provide automation for removing comments. using tools or in eclipse etc.
Currently i am manually trying to remove all commetns
You can remove all single- or multi-line block comments (but not line comments with //) by searching for the following regular expression in your project(s)/file(s) and replacing by $1:
^([^"\r\n]*?(?:(?<=')"[^"\r\n]*?|(?<!')"[^"\r\n]*?"[^"\r\n]*?)*?)(?<!/)/\*[^\*]*(?:\*+[^/][^\*]*)*?\*+/
It's possible that you have to execute it more than once.
This regular expression avoids the following pitfalls:
Code between two comments /* Comment 1 */ foo(); /* Comment 2 */
Line comments starting with an asterisk: //***NOTE***
Comment delimiters inside string literals: stringbuilder.append("/*");; also if there is a double quote inside single quotes before the comment
To remove all single-line comments, search for the following regular expression in your project(s)/file(s) and replace by $1:
^([^"\r\n]*?(?:(?<=')"[^"\r\n]*?|(?<!')"[^"\r\n]*?"[^"\r\n]*?)*?)\s*//[^\r\n]*
This regular expression also avoids comment delimiters inside double quotes, but does NOT check for multi-line comments, so /* // */ will be incorrectly removed.
I had to write somehting to do this a few weeks ago. This should handle all comments, nested or otherwise. It is long, but I haven't seen a regex version that handled nested comments properly. I didn't have to preserve javadoc, but I presume you do, so I added some code that I belive should handle that. I also added code to support the \r\n and \r line separators. The new code is marked as such.
public static String removeComments(String code) {
StringBuilder newCode = new StringBuilder();
try (StringReader sr = new StringReader(code)) {
boolean inBlockComment = false;
boolean inLineComment = false;
boolean out = true;
int prev = sr.read();
int cur;
for(cur = sr.read(); cur != -1; cur = sr.read()) {
if(inBlockComment) {
if (prev == '*' && cur == '/') {
inBlockComment = false;
out = false;
}
} else if (inLineComment) {
if (cur == '\r') { // start untested block
sr.mark(1);
int next = sr.read();
if (next != '\n') {
sr.reset();
}
inLineComment = false;
out = false; // end untested block
} else if (cur == '\n') {
inLineComment = false;
out = false;
}
} else {
if (prev == '/' && cur == '*') {
sr.mark(1); // start untested block
int next = sr.read();
if (next != '*') {
inBlockComment = true; // tested line (without rest of block)
}
sr.reset(); // end untested block
} else if (prev == '/' && cur == '/') {
inLineComment = true;
} else if (out){
newCode.append((char)prev);
} else {
out = true;
}
}
prev = cur;
}
if (prev != -1 && out && !inLineComment) {
newCode.append((char)prev);
}
} catch (IOException e) {
e.printStackTrace();
}
return newCode.toString();
}
you can try it with the java-comment-preprocessor:
java -jar ./jcp-6.0.0.jar --i:/sourceFolder --o:/resultFolder -ef:none --r
source
I made a open source library and uploaded to github, its called CommentRemover you can remove single line and multiple line Java Comments.
It supports remove or NOT remove TODO's.
Also it supports JavaScript , HTML , CSS , Properties , JSP and XML Comments too.
There is a little code snippet how to use it (There is 2 type usage):
First way InternalPath
public static void main(String[] args) throws CommentRemoverException {
// root dir is: /Users/user/Projects/MyProject
// example for startInternalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc.. goes like that
.removeTodos(false) // Do Not Touch Todos (leave them alone)
.removeSingleLines(true) // Remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startInternalPath("src.main.app") // Starts from {rootDir}/src/main/app , leave it empty string when you want to start from root dir
.setExcludePackages(new String[]{"src.main.java.app.pattern"}) // Refers to {rootDir}/src/main/java/app/pattern and skips this directory
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}
Second way ExternalPath
public static void main(String[] args) throws CommentRemoverException {
// example for externalInternalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc..
.removeTodos(true) // Remove todos
.removeSingleLines(false) // Do not remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startExternalPath("/Users/user/Projects/MyOtherProject")// Give it full path for external directories
.setExcludePackages(new String[]{"src.main.java.model"}) // Refers to /Users/user/Projects/MyOtherProject/src/main/java/model and skips this directory.
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}
This is an old post but this may help someone who enjoys working on command line like myself:
The perl one-liner below will remove all comments:
perl -0pe 's|//.*?\n|\n|g; s#/\*(.|\n)*?\*/##g;' test.java
Example:
cat test.java
this is a test
/**
*This should be removed
*This should be removed
*/
this should not be removed
//this should be removed
this should not be removed
this should not be removed //this should be removed
Output:
perl -0pe 's#/\*\*(.|\n)*?\*/##g; s|//.*?\n|\n|g' test.java
this is a test
this should not be removed
this should not be removed
this should not be removed
If you want get rid of multiple blank lines as well:
perl -0pe 's|//.*?\n|\n|g; s#/\*(.|\n)*?\*/##g; s/\n\n+/\n\n/g' test.java
this is a test
this should not be removed
this should not be removed
this should not be removed
EDIT: Corrected regex
Dealing with source code is hard unless you know more on the writing of comment.
In the more general case, you could have // or /* in text constants. So your really need to parse the file at a syntaxic level, not only lexical. IMHO the only bulletproof solution would be to start for example with the java parser from openjdk.
If you know that your comments are never deeply mixed with the code (in my exemple comments MUST be full lines), a python script could help
multiple = False
for line in text:
stripped = line.strip()
if multiple:
if stripped.endswith('*/'):
multiple = False
continue
elif stripped.startswith('/*'):
multiple = True
elif stripped.startswith('//'):
pass
else:
print(line)
If you are using Eclipse IDE, you could make regex do the work for you.
Open the search window (Ctrl+F), and check 'Regular Expression'.
Provide the expression as
/\*\*(?s:(?!\*/).)*\*/
Prasanth Bhate has explained it in Tool to remove JavaDoc comments?
public class TestForStrings {
/**
* The main method.
*
* #param args
* the arguments
* #throws Exception
* the exception
*/
public static void main(String args[]) throws Exception {
String[] imports = new String[100];
String fileName = "Menu.java";
// This will reference one API at a time
String line = null;
try {
FileReader fileReader = new FileReader(fileName);
// Always wrap FileReader in BufferedReader.
BufferedReader bufferedReader = new BufferedReader(fileReader);
int startingOffset = 0;
// This will reference one API at a time
List<String> lines = Files.readAllLines(Paths.get(fileName),
Charset.forName("ISO-8859-1"));
// remove single line comments
for (int count = 0; count < lines.size(); count++) {
String tempString = lines.get(count);
lines.set(count, removeSingleLineComment(tempString));
}
// remove multiple lines comment
for (int count = 0; count < lines.size(); count++) {
String tempString = lines.get(count);
removeMultipleLineComment(tempString, count, lines);
}
for (int count = 0; count < lines.size(); count++) {
System.out.println(lines.get(count));
}
} catch (FileNotFoundException ex) {
System.out.println("Unable to open file '" + fileName + "'");
} catch (IOException ex) {
System.out.println("Error reading file '" + fileName + "'");
} catch (Exception e) {
}
}
/**
* Removes the multiple line comment.
*
* #param tempString
* the temp string
* #param count
* the count
* #param lines
* the lines
* #return the string
*/
private static List<String> removeMultipleLineComment(String tempString,
int count, List<String> lines) {
try {
if (tempString.contains("/**") || (tempString.contains("/*"))) {
int StartIndex = count;
while (!(lines.get(count).contains("*/") || lines.get(count)
.contains("**/"))) {
count++;
}
int endIndex = ++count;
if (StartIndex != endIndex) {
while (StartIndex != endIndex) {
lines.set(StartIndex, "");
StartIndex++;
}
}
}
} catch (Exception e) {
// Do Nothing
}
return lines;
}
/**
* Remove single line comments .
*
* #param line
* the line
* #return the string
* #throws Exception
* the exception
*/
private static String removeSingleLineComment(String line) throws Exception {
try {
if (line.contains(("//"))) {
int startIndex = line.indexOf("//");
int endIndex = line.length();
String tempoString = line.substring(startIndex, endIndex);
line = line.replace(tempoString, "");
}
if ((line.contains("/*") || line.contains("/**"))
&& (line.contains("**/") || line.contains("*/"))) {
int startIndex = line.indexOf("/**");
int endIndex = line.length();
String tempoString = line.substring(startIndex, endIndex);
line = line.replace(tempoString, "");
}
} catch (Exception e) {
// Do Nothing
}
return line;
}
}
This is what I came up with yesterday.
This is actually homework I got from school so if anybody reads this and finds a bug before I turn it in, please leave a comment =)
ps. 'FilterState' is a enum class
public static String deleteComments(String javaCode) {
FilterState state = FilterState.IN_CODE;
StringBuilder strB = new StringBuilder();
char prevC=' ';
for(int i = 0; i<javaCode.length(); i++){
char c = javaCode.charAt(i);
switch(state){
case IN_CODE:
if(c=='/')
state = FilterState.CAN_BE_COMMENT_START;
else {
if (c == '"')
state = FilterState.INSIDE_STRING;
strB.append(c);
}
break;
case CAN_BE_COMMENT_START:
if(c=='*'){
state = FilterState.IN_COMMENT_BLOCK;
}
else if(c=='/'){
state = FilterState.ON_COMMENT_LINE;
}
else {
state = FilterState.IN_CODE;
strB.append(prevC+c);
}
break;
case ON_COMMENT_LINE:
if(c=='\n' || c=='\r') {
state = FilterState.IN_CODE;
strB.append(c);
}
break;
case IN_COMMENT_BLOCK:
if(c=='*')
state=FilterState.CAN_BE_COMMENT_END;
break;
case CAN_BE_COMMENT_END:
if(c=='/')
state = FilterState.IN_CODE;
else if(c!='*')
state = FilterState.IN_COMMENT_BLOCK;
break;
case INSIDE_STRING:
if(c == '"' && prevC!='\\')
state = FilterState.IN_CODE;
strB.append(c);
break;
default:
System.out.println("unknown case");
return null;
}
prevC = c;
}
return strB.toString();
}
private static int find(String s, String t, int start) {
int ret = s.indexOf(t, start);
return ret < 0 ? Integer.MAX_VALUE : ret;
}
private static int findSkipEsc(String s, String t, int start) {
while(true) {
int ret = find(s, t, start);
if( ret == Integer.MAX_VALUE) return -1;
int esc = find(s, "\\", start);
if( esc > ret) return ret;
start += 2;
}
}
private static String removeLineCommnt(String s) {
int i, start = 0;
while (0 <= (i = find(s, "//", start))) { //Speed it up
int j = find(s, "'", start);
int k = find(s, "\"", start);
int first = min(i, min(j, k));
if (first == Integer.MAX_VALUE) return s;
if (i == first) return s.substring(0, i);
//skipp quoted string
start = first+1;
if (k == first) { // " asdas\"dasd "
start = findSkipEsc(s,"\"",start);
if (start < 0) return s;
start++;
continue;
}
//if j == first ' asda\'sasd ' --- not in JSON
start = findSkipEsc(s,"'\"'",start);
if (start < 0) return s;
start++;
}
return s;
}
static String removeLineCommnts(String s) {
if (!s.contains("//")) return s; //Speed it up
return Arrays.stream(s.split("[\\n\\r]+")).
map(Common::removeLineCommnt).
collect(Collectors.joining("\n"));
}

Categories

Resources