regex and substitution in java


I am trying to strip and replace a text string that looks as follows in the most elegant way possible:
element {"item"} {text {
} {$i/child::itemno}
To look like:
<item> {$i/child::itemno}
That is, drop the keyword element, turn the braces around the name into < and >, and remove text together with its accompanying braces.
I believe the appropriate regex to do this is:
/element\s*\{"([^"]+)"\}\s*{text\s*{\s*}\s*({[^}]*})/
but I am unsure how many backslashes to use in Java, and also how to complete the final substitution, which takes my group(1) and wraps it with < at its start and > at its end.
So far I have this (although perhaps I would be better off with a full rewrite?):
Pattern p = Pattern.compile("/element\\s*\\{\"([^\"]+)\"\\}\\s*{text\\s*{\\s*}\\s*({[^}]*})/ ");
// Split input with the pattern
Matcher m = p.matcher("element {\"item\"} {text {\n" +
" } {$i/child::itemno} text { \n" +
" } {$i/child::description} text {\n" +
" } element {\"high_bid\"} {{max($b/child::bid)}} text {\n" +
" }} ");
// Next for each instance of group 1, replace it with < > at the start
I think I've stumbled across a problem. What I am trying to do is somewhat harder than I previously stated. With the solution I have below:
element {"item"} {text { } {$i/child::itemno} text { } {$i/child::description} text { } element {"high_bid"} {{max($b/child::bid)}} text { }}
GIVES:
<item> {$i/child::itemno} text { } {$i/child::description} text { } element {"high_bid"} {{max($b/child::bid)}} text { }}
When I expected:
<item>{$i/child::itemno}{$i/child::description}<high_bid>{fn:max($b/child::bid)}</high_bid></item>

Java regexes are written without delimiters, so lose the forward slashes;
every backslash has to be doubled inside a Java string literal, so \s becomes \\s;
every { needs to be escaped as \\{, while } needs no escape (although it doesn't hurt if you do escape it).
Try:
String text = "element {\"item\"} {text { } {$i/child::itemno}";
System.out.println(text.replaceAll("element\\s*\\{\"([^\"]+)\"}\\s*\\{text\\s*\\{\\s*}\\s*(\\{[^}]*})", "<$1> $2"));
Output:
<item> {$i/child::itemno}
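If the same substitution has to run more than once, the pattern can be compiled a single time and applied through a Matcher. The following is only a minimal sketch of that idea, reusing the regex from above; the class name ElementRewriter and the method rewrite are made up for illustration.

```java
import java.util.regex.Pattern;

class ElementRewriter {
    // Same regex as above: no delimiters, doubled backslashes, escaped "{".
    private static final Pattern ELEMENT = Pattern.compile(
            "element\\s*\\{\"([^\"]+)\"}\\s*\\{text\\s*\\{\\s*}\\s*(\\{[^}]*})");

    // Replaces every match with "<name> {expr}"; $1 and $2 refer to the two capturing groups.
    static String rewrite(String input) {
        return ELEMENT.matcher(input).replaceAll("<$1> $2");
    }

    public static void main(String[] args) {
        // \s* also matches the line break inside "text {\n }".
        System.out.println(rewrite("element {\"item\"} {text {\n } {$i/child::itemno}"));
        // prints: <item> {$i/child::itemno}
    }
}
```

This only covers the single-element case from the original question, not the nested input from the follow-up.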

Related

How to replace text using ReplaceFirst() without case sensitivity

I'm trying to create a method that highlights text in a JLabel according to the search text the user enters. It works fine except that it is case sensitive. I used the regex flag (?i) to ignore case, but it is still case sensitive.
private void jTextField1KeyReleased(java.awt.event.KeyEvent evt) {
String SourceText = "this is a sample text";
String SearchText = jTextField1.getText();
if (SourceText.contains(SearchText)) {
String OutPut = "<html>" + SourceText.replaceFirst("(?i)" + SearchText, "<span style=\"background-color: #d5f4e6;\">" + SearchText + "</span>") + "</html>";
jLabel1.setText(OutPut);
} else {
jLabel1.setText(SourceText);
}
}
How can I fix this?
Update
contains is case sensitive.
How to check if a String contains another String in a case insensitive manner in Java

You have not used the matched text in the replacement; you hard-coded the same string you used in the search. Since you wrap the whole match with HTML tags, you need to use the $0 backreference in the replacement (it refers to the whole match, which resides in Group 0).
Besides, you have not escaped ("quoted") the search term, which may cause trouble if SearchText contains regex metacharacters.
You can fix the code using:
String OutPut = "<html>" + SourceText.replaceFirst("(?i)" + Pattern.quote(SearchText), "<span style=\"background-color: #d5f4e6;\">$0</span>") + "</html>";
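Since contains is case sensitive as well (see the update in the question), the containment check can be replaced by the matcher itself. Below is a minimal sketch under that assumption; the class HighlightSketch and the method highlight are hypothetical names, and the Swing wiring is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HighlightSketch {
    // Wraps the first case-insensitive occurrence of `search` in a highlighted span.
    static String highlight(String source, String search) {
        if (search.isEmpty()) {
            return source; // design choice: an empty search term highlights nothing
        }
        // Pattern.quote treats the user input as literal text, not as a regex.
        // CASE_INSENSITIVE alone folds ASCII letters only; add UNICODE_CASE if needed.
        Matcher m = Pattern.compile(Pattern.quote(search), Pattern.CASE_INSENSITIVE)
                           .matcher(source);
        if (!m.find()) {
            return source; // replaces the case-sensitive contains() check
        }
        // $0 inserts the text as it appears in the source, preserving its case.
        return "<html>"
                + m.replaceFirst("<span style=\"background-color: #d5f4e6;\">$0</span>")
                + "</html>";
    }

    public static void main(String[] args) {
        System.out.println(highlight("this is a sample text", "SAMPLE"));
    }
}
```

In the key listener, the result would simply be passed to jLabel1.setText.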

How can I get non-matching groups using a Matcher in Java?

I'm trying to write a java regex to catch some groups of words from a String using a Matcher.
Say I have this string: "Hello, we are #happy# to see you today".
I would like to get 2 groups of matches, one having
Hello, we are
to see you today
and the other
happy
So far, I was only able to match the word between the #s using this Pattern:
Pattern p = Pattern.compile("#(.+?)#");
I've read about negative lookahead and lookaround, played a bit with it but without success.
I assume I should do some sort of negation of the regex so far, but I couldn't come up with anything.
Any help would be really appreciated, thank you.

From comment:
I may incur in a string where I got more than one instances of words wrapped by #, such as "#Hello# kind #stranger#"
From comment:
I need to apply some different style format to both the text inside and outside.
Since you need to apply different styling, the code needs to process each block of text separately, and needs to know whether the text is inside or outside a #..# section.
Note that the following code silently skips the last # if there is an odd number of them.
String input = ...
for (Matcher m = Pattern.compile("([^#]+)|#([^#]+)#").matcher(input); m.find(); ) {
if (m.start(1) != -1) {
String outsideText = m.group(1);
System.out.println("Outside: \"" + outsideText + "\"");
} else {
String insideText = m.group(2);
System.out.println("Inside: \"" + insideText + "\"");
}
}
Output for input = "Hello, we are #happy# to see you today"
Outside: "Hello, we are "
Inside: "happy"
Outside: " to see you today"
Output for input = "#Hello# kind #stranger#"
Inside: "Hello"
Outside: " kind "
Inside: "stranger"
Output for input = "This #text# has unpaired # characters"
Outside: "This "
Inside: "text"
Outside: " has unpaired "
Outside: " characters"
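To show how the same loop could apply the two stylings rather than just print the segments, here is a small sketch. The <b> markup is only a placeholder for whatever styling is actually needed, and HashStyler / style are invented names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashStyler {
    // Group 1: text outside #..#; group 2: text between a pair of #.
    private static final Pattern SEGMENT = Pattern.compile("([^#]+)|#([^#]+)#");

    static String style(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = SEGMENT.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.append(m.group(1)); // outside text: copied unchanged
            } else {
                // inside text: wrapped in placeholder markup, the # signs are dropped
                out.append("<b>").append(m.group(2)).append("</b>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(style("Hello, we are #happy# to see you today"));
        // prints: Hello, we are <b>happy</b> to see you today
    }
}
```

As with the loop above, an unpaired # is silently skipped.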

The best I could do is splitting the string into three parts (before, between, and after the #s), then merging groups 1 and 4:
(^.*)(\#(.+?)\#)(.*)
EDIT: Taking remarks from the comments :
(^[^\#]*)(?:\#(.+?)\#)([^\#]*)
Thanks to @Lino we don't capture the useless group with # anymore, and we capture anything except #, instead of any non-whitespace character, in the 1st and 3rd groups.

Is this solution fine?
Pattern pattern = Pattern.compile("([^#]+)|#([^#]*)#");
Matcher matcher = pattern.matcher("Hello, we are #happy# to see you today");
List<String> notBetween = new ArrayList<>(); // not surrounded by #
List<String> between = new ArrayList<>(); // surrounded by #
while (matcher.find()) {
    if (Objects.nonNull(matcher.group(1))) notBetween.add(matcher.group(1));
    if (Objects.nonNull(matcher.group(2))) between.add(matcher.group(2));
}
System.out.println("Printing group 1");
for (String string : notBetween) {
    System.out.println(string);
}
System.out.println("Printing group 2");
for (String string : between) {
    System.out.println(string);
}

Unescaped java not matching in regex matcher.find()

I have the following code that basically matches "Match this:" and keeps the first sentence. However, Unicode characters sometimes get passed into the text, and they cause backtracking problems in other, more complicated regexes. Escaping seems to alleviate the backtracking index-out-of-range exceptions. However, now the regex isn't matching.
What I would like to know is why this regex isn't matching when escaped. If you comment out the escapeJava/unescapeJava lines, everything works.
String text = "Keep this\n\n"
+ "Match this:\n\nDelete 📱 this";
text = org.apache.commons.lang.StringEscapeUtils.escapeJava(text);
Pattern PATTERN = Pattern.compile("^Match this:$",
Pattern.MULTILINE);
Matcher m = PATTERN.matcher(text);
if (m.find()) {
text = text.substring(0, m.start()).replaceAll("[\\n]+$", "");
}
text = org.apache.commons.lang.StringEscapeUtils.unescapeJava(text);
System.out.println(text);

What i would like to know is why this regex isn't matching when escaped?
When you escape a string like "foo\nbar", which when printed looks like
foo
bar
you get "foo\\nbar", which when printed looks like
foo\nbar
This happens because StringEscapeUtils.escapeJava also escapes \n, replacing it with \\n, so it is no longer a line separator but a plain two-character literal, and ^ and $ can no longer match next to it.
A possible solution is to replace "\\n" with "\n" again after StringEscapeUtils.escapeJava. You need to be careful here not to unescape a real "\\n" (a backslash followed by n in the original text), which after escaping becomes "\\\\n" and when printed looks like \\n. So maybe use
text = org.apache.commons.lang3.StringEscapeUtils.escapeJava(text);
text = text.replaceAll("(?<!\\\\)\\\\n", "\n"); // unescape `\n`
// only if it is not preceded by `\`
// do your job
// and now you can unescape your text (a real \n will stay \n)
text = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(text);
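The effect can be reproduced without Commons Lang. In the sketch below, escapeNewlines is only a stand-in for the newline part of escapeJava (an assumption made to keep the example dependency-free), and restoreNewlines is the lookbehind replacement from the snippet above; the class and method names are invented.

```java
import java.util.regex.Pattern;

class EscapedNewlineDemo {
    static final Pattern MATCH_LINE = Pattern.compile("^Match this:$", Pattern.MULTILINE);

    // Stand-in for escapeJava, limited to newlines: line break -> backslash + 'n'.
    static String escapeNewlines(String s) {
        return s.replace("\n", "\\n");
    }

    // Turns backslash + 'n' back into a line break unless another backslash precedes it.
    static String restoreNewlines(String s) {
        return s.replaceAll("(?<!\\\\)\\\\n", "\n");
    }

    public static void main(String[] args) {
        String raw = "Keep this\n\nMatch this:\n\nDelete this";
        String escaped = escapeNewlines(raw); // now a single line of text

        System.out.println(MATCH_LINE.matcher(raw).find());                      // true
        System.out.println(MATCH_LINE.matcher(escaped).find());                  // false
        System.out.println(MATCH_LINE.matcher(restoreNewlines(escaped)).find()); // true
    }
}
```

Once the line breaks are real characters again, ^ and $ in MULTILINE mode have line boundaries to anchor on.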
Another option could be creating your own implementation similar to StringEscapeUtils.escapeJava. If you take a look at this method body you will see
return ESCAPE_JAVA.translate(input);
Where ESCAPE_JAVA is
CharSequenceTranslator ESCAPE_JAVA =
new LookupTranslator(
new String[][] {
{"\"", "\\\""},
{"\\", "\\\\"},
}).with(
new LookupTranslator(EntityArrays.JAVA_CTRL_CHARS_ESCAPE())
).with(
UnicodeEscaper.outsideOf(32, 0x7f)
);
and EntityArrays.JAVA_CTRL_CHARS_ESCAPE() returns clone of
String[][] JAVA_CTRL_CHARS_ESCAPE = {
{"\b", "\\b"},
{"\n", "\\n"},
{"\t", "\\t"},
{"\f", "\\f"},
{"\r", "\\r"}
};
array. So if you provide your own table here, one that explicitly maps \n to itself, the translator will leave \n unchanged.
So this is what your own implementation could look like:
private static CharSequenceTranslator translatorIgnoringLineSeparators =
new LookupTranslator(
new String[][] {
{ "\"", "\\\"" },
{ "\\", "\\\\" },
}).with(
new LookupTranslator(new String[][] {
{ "\b", "\\b" },
{ "\n", "\n" },//this will handle `\n` and will not change it
{ "\r", "\r" },//this will handle `\r` and will not change it
{ "\t", "\\t" },
{ "\f", "\\f" },
})).with(UnicodeEscaper.outsideOf(32, 0x7f));
public static String myJavaEscaper(CharSequence input) {
return translatorIgnoringLineSeparators.translate(input);
}
This method will prevent escaping \r and \n.
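For readers without Commons Lang on the classpath, the same idea can be approximated with a plain loop. This is not the library's implementation, just a hand-written sketch that follows the table above: quotes, backslashes, \b, \t and \f are escaped, \n and \r are kept, and every UTF-16 code unit outside 32..0x7f becomes a \uXXXX escape. LineSafeEscaper and escape are hypothetical names.

```java
class LineSafeEscaper {
    static String escape(CharSequence input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\b': out.append("\\b");  break;
                case '\t': out.append("\\t");  break;
                case '\f': out.append("\\f");  break;
                case '\n':
                case '\r': out.append(c);      break; // line separators stay untouched
                default:
                    if (c < 32 || c > 0x7f) {
                        // one escape per UTF-16 code unit, so an emoji yields two escapes
                        out.append(String.format("\\u%04X", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Match this:\n\nDelete \uD83D\uDCF1 this"));
    }
}
```

The escaped text can still be matched with ^ and $ in MULTILINE mode, because the line breaks are never rewritten.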

JAVA - Ignore part of strings containing "#"

I'm having some difficulty excluding the part of a string that follows the "#" symbol.
Let me explain:
This is a sample input text a user could insert in a textbox:
Some Text
Some Text again #A comment
#A comment line
Another Text
Another Text again#Comment
I need to read this text and ignore all text after "#" symbol.
This should be the expected output:
Some Text;Some Text again;Another Text;Another Text again
As for now here's the code:
This replaces all newlines with ";"
readText = userInputTextArea.getText();
readTextAllInALine = readText.replaceAll("\\n", ";");
so the output after this is:
Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment
This code is meant to ignore all characters after a "#", but it only works for the first comment when the text is read sequentially.
int startIndex = inputCommandText.indexOf("#");
int endIndex = inputCommandText.indexOf(";");
String toBeReplaced = inputCommandText.substring(startIndex, endIndex);
readTextAllInALine.replace(toBeReplaced, "");
I'm stuck in finding a way for having the expected output. I was thinking of using a StringTokenizer, processing every line, removing text after "#" or ignoring the whole line if it starts with "#", and then printing all tokens (i.e. all lines) separating them with ";" but I cannot make it work.
Any help will be appreciated.
Thank you very much in advance.
Regards.

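Just call this replace command on your plain string, retrieved from the text input. The regex #[^;]* matches everything from the hash up to (but not including) the next semicolon, and replaceAll replaces it with an empty string. Note that this leaves the space in front of a trailing comment in place, and leaves an empty entry (;;) where a whole line was a comment.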
public static void main(String[] args) {
String text = "Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment";
System.out.println(text);
text = text.replaceAll("#[^;]*", "");
System.out.println(text);
}

A regex is useful here but it's tricky because your pattern is moderately complex. The comments are end-of-line comments, so they can appear in more than one arrangement.
I came up with the following which is a two-pass:
replaceAll(" *(#.*(?=\\n|$))", "").replaceAll("\\n+", ";");
The two-pass circumvents the fact that sometimes you get a duplicate line break. The first expression replaces comments but not new line characters and the second expression replaces multiple new line characters with a single semicolon.
The individual parts of the expression in the first pass are the following:
" *"
This includes zero or more leading spaces in the comment match, i.e., in "...again #A...", we want to remove that space between n and #.
"(#.* )"
The start of the comment match: matches a # followed by zero or more characters. (Typically the . matches any character except a new line.)
"(?= )"
This is a positive lookahead and where the regex starts to get tricky. It looks for whatever is inside this expression but doesn't include it in the text that's matched. It asserts that the #.* is followed by a certain string but doesn't replace that certain string.
"\\n|$"
The lookahead finds a new line or the end anchor. This will find a comment ended with a new line character or a comment that is at the end of the String. But again, since it's inside the lookahead, the new line doesn't get replaced.
So given the input:
String text = (
"Some Text" + '\n' +
"Some Text again #A comment" + '\n' +
"#A comment line" + '\n' +
"Another Text" + '\n' +
"Another Text again#Comment"
);
System.out.println(
text.replaceAll(" *(#.*(?=\\n|$))", "").replaceAll("\\n+", ";")
);
The output is:
Some Text;Some Text again;Another Text;Another Text again

readText = userInputTextArea.getText();
readText = readText.replaceAll("\\s*#[^\n]*", "");
readText = readText.replaceAll("\n+", ";");
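The per-line approach described in the question (process every line, cut it at the "#", skip lines that end up empty, join with ";") can also be written without lookaheads. A minimal sketch, assuming Java 8 or later for streams and the \R line-break matcher; CommentStripper and its methods are invented names.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class CommentStripper {
    // Cuts the line at the first '#', if any, and trims surrounding whitespace.
    static String stripLine(String line) {
        int hash = line.indexOf('#');
        return (hash >= 0 ? line.substring(0, hash) : line).trim();
    }

    // Splits on any line break, strips comments, drops empty lines, joins with ';'.
    static String stripComments(String text) {
        return Arrays.stream(text.split("\\R"))
                .map(CommentStripper::stripLine)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.joining(";"));
    }

    public static void main(String[] args) {
        String text = "Some Text\n"
                + "Some Text again #A comment\n"
                + "#A comment line\n"
                + "Another Text\n"
                + "Another Text again#Comment";
        System.out.println(stripComments(text));
        // prints: Some Text;Some Text again;Another Text;Another Text again
    }
}
```

Compared with the regex versions this is longer, but each step (cut, trim, filter, join) can be changed on its own.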

Just to make it clear, Coxer's reply is the way to go. Far more precise and clean. But in any case, if you fancy experimenting here is a recursive solution that will work:
public class IgnoreHash {
@Test
public void test() {
String readTextAllInALine = "Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment;";
String actualResult = removeHashComments(readTextAllInALine);
Assert.assertEquals(actualResult, "Some Text;Some Text again ;Another Text;Another Text again");
}
private String removeHashComments(String input) {
StringBuffer result = new StringBuffer();
int hashIndex = input.indexOf("#");
int endIndex = input.indexOf(";");
if(hashIndex != -1){
result.append(input.substring(0, hashIndex));
//first line
if(hashIndex < endIndex ) {
result.append(removeHashComments(input.substring(endIndex)));
} // the case of ;#
else if (endIndex == hashIndex-1) {
int endIndex2 = input.indexOf(";", hashIndex+1);
result.append(removeHashComments(input.substring(endIndex2+1)));
}
else {
result.append(removeHashComments(input.substring(hashIndex)));
}
}
return result.toString();
}
}

Is it possible to reverse escape string?

I know both NetBeans and Eclipse have an option where, if you paste a multi-line, unescaped string into a string variable, it will automatically add escape characters and line breaks. Is there a way to reverse the process?
For example:
function ShowHideOptions(trigger, element) {
if( trigger ) {
document.getElementById( element ).style.display = "";
} else {
document.getElementById( element ).style.display = "none";
}
}
If pasted into a string, it becomes:
private static final String LABEL_JAVASCRIPT = "function ShowHideOptions(trigger, element) {\n"
+ " if( trigger ) {\n"
+ " document.getElementById( element ).style.display = \"\";\n"
+ " } else {\n"
+ " document.getElementById( element ).style.display = \"none\";\n"
+ " }\n"
+ "}";
I want to reverse this process.

I believe your question warrants another question. Why?
If you reverse this, the quotes would not be escaped and therefore you would get errors. Example:
System.out.println("System.out.println("Test");");
^
Error, everything after this quote is
considered code
Notice the quotes aren't escaped. This code would generate an error where I marked it because the quotes apparently mean the string should end.
Also, if the newlines are reversed, this example:
System.out.println("test");
System.out.println("test2");
Would become:
System.out.println("test");System.out.println("test2");
The following code works fine. Please clarify the problem.
System.out.println(LABEL_JAVASCRIPT);
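To illustrate that last point: the escapes exist only in the source code. The compiled string already contains real line breaks and quote characters, so printing it (or writing it to a file) gives back the unescaped text. A minimal sketch with a shortened, hypothetical constant:

```java
class EscapedLiteralDemo {
    // In source: \n and \" are escape sequences. At runtime: a line break and a quote.
    static final String SNIPPET = "if( trigger ) {\n"
            + "  show = \"\";\n"
            + "}";

    public static void main(String[] args) {
        // Prints three lines, with plain quotes and no backslashes.
        System.out.println(SNIPPET);
    }
}
```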
