Java String replacement with custom regex - java

I have a Java application which streams Twitter data.
Assuming that I have a String text = tweet.getText() variable.
In a text we can have one or more #MentionedUser. I'd like to delete not just the # but the username too.
How can I do this with replaceAll and without touching the rest of the string?
Thank you.

I would use (^|\s)#\w+($|\s), because your input can contain email-like text, for example:
a #twitter username and a simple#email.com another #twitterUserName
So you can use:
String text = "a #twitter username and a simple#email.com another #twitterUserName";
text = text.replaceAll("(^|\\s)#\\w+($|\\s)", "$1$2");
// Output: "a  username and a simple#email.com another " (the whitespace around each removed mention is kept, so spaces can double up)
Details:
(^|\s) matches the start of the string (^) or a whitespace character (\s)
#\w+ matches # followed by one or more word characters; \w is equivalent to [A-Za-z0-9_]
($|\s) matches the end of the string ($) or a whitespace character (\s)
If you want to go further and enforce the actual syntax of Twitter usernames, I read an article that gives some helpful information:
Your username cannot be longer than 15 characters. Your name can be longer (50 characters), but usernames are kept shorter for the sake of ease.
A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. ...
From these rules you can also use this regex:
(?i)(^|\s)#[a-z0-9_]{1,15}($|\s)
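As a rough sketch of how the length-limited pattern behaves, the snippet below wraps it in a small helper. The class name MentionStripper and the method stripMentions are made up for illustration; only the regex and the "$1$2" replacement come from the answer.

```java
import java.util.regex.Pattern;

public class MentionStripper {
    // Pattern from the answer: '#', then 1 to 15 letters, digits or underscores,
    // delimited by whitespace or by the start/end of the string.
    private static final Pattern MENTION =
            Pattern.compile("(?i)(^|\\s)#[a-z0-9_]{1,15}($|\\s)");

    // Removes each mention but keeps the delimiters captured in groups 1 and 2.
    static String stripMentions(String text) {
        return MENTION.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String text = "a #twitter username and a simple#email.com another #twitterUserName";
        // Brackets make the doubled and trailing spaces visible.
        System.out.println("[" + stripMentions(text) + "]");
        // A 16-character name exceeds the limit, so it is left untouched.
        System.out.println("[" + stripMentions("x #abcdefghijklmnop y") + "]");
    }
}
```

Note that both delimiters are put back, so a mention in the middle of a sentence leaves two spaces behind.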

Here is an alternative which does not produce doubled whitespaces and also does not capture emails:
String str = "a #twitter #user username and a john.doe#gmail.com another #twitterUserName #test jane#doe.com";
System.out.println(str.replaceAll("(?<=[^\\w])#[^#\\s]+(\\s+|$)", ""));
Output:
a username and a john.doe#gmail.com another jane#doe.com
Explanation of the parts of the regex (?<=[^\w])#[^#\s]+(\s+|$):
(?<=[^\w])# - Find the '#' character and look back to check that the character before it is not a word character (a zero-width positive lookbehind). Because the lookbehind needs a preceding character, a '#' at the very start of the string is not matched.
[^#\s]+ - Match one or more characters that are neither '#' nor whitespace
(\s+|$) - Match one or more whitespace characters or the end of the input
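One possible variation, not part of the answer above, is to swap the positive lookbehind for a negative one, (?<!\w), so that a mention at the very start of the text is removed as well. The class and method names below are hypothetical; this is only a sketch of the difference.

```java
public class MentionLookbehind {
    // Pattern from the answer: requires a non-word character before '#'.
    static String stripPositive(String s) {
        return s.replaceAll("(?<=[^\\w])#[^#\\s]+(\\s+|$)", "");
    }

    // Assumed variation: only requires that no word character precedes '#',
    // which also holds at index 0 of the string.
    static String stripNegative(String s) {
        return s.replaceAll("(?<!\\w)#[^#\\s]+(\\s+|$)", "");
    }

    public static void main(String[] args) {
        String s = "#first a #twitter b jane#doe.com";
        System.out.println(stripPositive(s)); // keeps the leading #first
        System.out.println(stripNegative(s)); // removes it as well
    }
}
```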

Related

Replace repeated xml tags value using regex

Input -
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
I tried the following to mask the values inside the accntNo tags:
String op = ipXmlString.replaceAll("<accntNo>(.+?)</accntNo>", "######");
But the code above replaces each whole element, tags included:
<root><accntNoGrp>######</accntNoGrp><accntNoGrp>######</accntNoGrp></root>
Expected Output:
<root><accntNoGrp><accntNo>#####67</accntNo></accntNoGrp><accntNoGrp><accntNo>#####23</accntNo></accntNoGrp></root>
How can I achieve this using Java regex? Could someone help?
Your replacement is wrong, you need to include the <accntNo> tag in the actual replacement. Also, it appears that you want to show the last two characters/numbers of the account number. In this case, we can capture this information during the match and use it in the replacement.
Code:
String op = ipXmlString.replaceAll("<accntNo>(?:.+?)(.{2})</accntNo>", "<accntNo>######$1</accntNo>");
Explanation:
<accntNo> match an opening tag
(?:.+?) match, but do not capture, everything up to the last two characters
(.{2}) match the two characters before the closing tag (and capture them)
</accntNo> match a closing tag
Note here that by using ?: inside the parentheses in the pattern, we tell the regex engine not to capture that group. There is no point in capturing anything before the last two characters of the account number because we don't want to use it.
The $1 quantity in the replacement refers to the first capture group. In this case, it is the last two characters of the account number. Hence, we build the replacement string you want this way.
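If the mask should contain one # per hidden character, as the expected output in the question suggests (five # for a seven-digit number), a fixed replacement string is not enough. A minimal sketch using a Matcher loop follows; the class name AccountMasker, the method mask, and the use of [^<] instead of . (to keep a match from running across tags) are my own assumptions, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccountMasker {
    // Group 1: everything except the last two characters; group 2: the last two.
    private static final Pattern ACCOUNT =
            Pattern.compile("<accntNo>([^<]*?)([^<]{2})</accntNo>");

    static String mask(String xml) {
        Matcher m = ACCOUNT.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Build one '#' for each character that is being hidden.
            StringBuilder hashes = new StringBuilder();
            for (int i = 0; i < m.group(1).length(); i++) {
                hashes.append('#');
            }
            String replacement = "<accntNo>" + hashes + m.group(2) + "</accntNo>";
            // quoteReplacement keeps any '$' or '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String ipXmlString = "<root>"
                + "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
                + "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
                + "</root>";
        System.out.println(mask(ipXmlString));
    }
}
```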
Try this code:
public static void main(String[] args) {
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
String replaceAll = ipXmlString.replaceAll("\\d+", "######");
System.out.println(replaceAll);
}
Prints:
<root><accntNoGrp><accntNo>######</accntNo></accntNoGrp><accntNoGrp><accntNo>######</accntNo></accntNoGrp></root>

Split and replace Java string

I am trying to read a text file, split the contents as explained below, and append the split parts to a Java List.
The error is in the splitting part.
Existing String:
a1(X1, UniqueVar1), a2(X2, UniqueVar1), a3(UniqueVar1, UniqueVar2)
Expected—to split them and append them to Java list:
a1(X1, UniqueVar1)
a2(X2, UniqueVar1)
a3(UniqueVar1, UniqueVar2)
Code:
subSplit = obj.split("\\), ");
for (String subObj: subSplit)
{
System.out.println(subObj.trim());
}
Result:
a1(X1, UniqueVar1
a2(X2, UniqueVar1
...
Please suggest how to correct this.
Use a positive lookbehind in your regular expression:
String[] subSplit = obj.split("(?<=\\)), ");
This expression matches a comma and space preceded by a ). Because the lookbehind part (?<=\\)) is zero-width, the ) is not consumed as part of the split separator, so it stays at the end of each piece.
More information about lookaround assertions and non-capturing groups can be found in the javadoc of the Pattern class.
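A small sketch of the lookbehind split, collecting the pieces into a List as the question asks. The class name SplitKeepParen and the method splitTerms are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitKeepParen {
    // Splits on ", " only where it follows ')', so the ')' stays with its term.
    static List<String> splitTerms(String obj) {
        return new ArrayList<>(Arrays.asList(obj.split("(?<=\\)), ")));
    }

    public static void main(String[] args) {
        String obj = "a1(X1, UniqueVar1), a2(X2, UniqueVar1), a3(UniqueVar1, UniqueVar2)";
        for (String term : splitTerms(obj)) {
            System.out.println(term);
        }
    }
}
```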

Escape special characters using Regex in java [duplicate]

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.
Since Java 1.5, yes:
Pattern.quote("$5");
The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
It may be too late to respond, but you can also use Pattern.LITERAL, which makes the pattern treat all special characters as literals when it is compiled:
Pattern.compile(textToFormat, Pattern.LITERAL);
I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.
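To see the suggestions above side by side, here is a short sketch that searches for the literal text "$5" three ways. The class and method names are hypothetical; Pattern.quote and Pattern.LITERAL are the standard java.util.regex facilities mentioned in the answers.

```java
import java.util.regex.Pattern;

public class LiteralMatchDemo {
    // Treats the input as a regex: "$5" means "end of input, then 5".
    static boolean findsRaw(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Wraps the input in \Q...\E so every character is matched literally.
    static boolean findsQuoted(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    // Same effect through the LITERAL compile flag.
    static boolean findsLiteralFlag(String text, String literal) {
        return Pattern.compile(literal, Pattern.LITERAL).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "price: $5";
        System.out.println(findsRaw(text, "$5"));
        System.out.println(findsQuoted(text, "$5"));
        System.out.println(findsLiteralFlag(text, "$5"));
        System.out.println(Pattern.quote("$5"));
    }
}
```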
First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, then it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
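The failure and the fix can be reproduced in a few lines. This is only a sketch: the class name, the helper methods, and the sample message are made up, and the placeholder pattern is simplified to "<userInput />".

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    // Safe version: the user's text is inserted literally, whatever it contains.
    static String fill(String msg, String userInput) {
        return msg.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }

    // Unsafe version: reports whether the raw input makes replaceAll() throw.
    static boolean unquotedThrows(String msg, String userInput) {
        try {
            msg.replaceAll("<userInput />", userInput);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // "$3" was read as a reference to capture group 3, which does not exist.
            return true;
        }
    }

    public static void main(String[] args) {
        String msg = "Amount due: <userInput />";
        System.out.println(unquotedThrows(msg, "$3.50"));
        System.out.println(fill(msg, "$3.50"));
    }
}
```

When the search text is itself a fixed string rather than a pattern, String.replace(CharSequence, CharSequence) avoids both kinds of quoting, since it treats both arguments literally.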
To build a protected pattern, you can put a backslash in front of every character that is not a letter or a digit. After that you can insert your own special symbols into the protected pattern, so that it works as a real pattern of your own design rather than as plain quoted text, while any special symbols typed by the user stay literal.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}
Pattern.quote("blabla") works nicely.
Pattern.quote() works nicely. It encloses the text between "\Q" and "\E", and it also handles any "\E" that the text already contains.
However, if you need real regular-expression escaping (or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
This prints: Some/s/wText\*/\,\*\*
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: " + Pattern.quote(someText));
System.out.println("Full escape: " + someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
Inside a character class, a leading ^ (negation) is used to match any character that is not in the class.

How to skip a specific word when replace the String using regexp in java

Consider the string
String s = "H_ello pe_rfec_t wor_ld";
I want to replace all '_' symbols with something, let's say '1', except those inside the word 'pe_rfec_t'.
I could not find any solution even to just skip the word 'pe_rfec_t':
s = s.replaceAll("(?<=pe_rfec_t).*|.*(?=pe_rfec_t)", "1");
looks nice at a glance but results in:
11pe_rfec_t1 //instead of 1111111pe_rfec_t1111111
Ideally I need the following result:
Hello pe_rfec_t world
Could anyone help me please?
You can use alternation and captured group:
String str = "H_ello pe_rfec_t wor_ld";
String repl = str.replaceAll("(pe_rfec_t)|_", "$1");
//=> Hello pe_rfec_t world
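Here the alternation first tries to match pe_rfec_t and captures it in group #1. In the replacement we put $1 (a back-reference to group #1) back. When a lone _ matches instead, group #1 did not take part in the match, so $1 expands to nothing and the underscore is removed.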
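If the underscores should be replaced with some text (the '1' from the question) instead of being removed, a plain "$1" replacement no longer works, because the replacement has to differ per match. One way to sketch it, assuming a helper named replaceOutside that is not part of the answer, is a Matcher loop:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipWordReplace {
    // Replaces every '_' with the given text, except inside the protected word.
    static String replaceOutside(String input, String protectedWord, String replacement) {
        // The protected word is tried first; a lone '_' only matches elsewhere.
        Pattern p = Pattern.compile("(" + Pattern.quote(protectedWord) + ")|_");
        Matcher m = p.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Group 1 is non-null only when the protected word itself matched.
            String r = (m.group(1) != null) ? m.group(1) : replacement;
            m.appendReplacement(sb, Matcher.quoteReplacement(r));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "H_ello pe_rfec_t wor_ld";
        System.out.println(replaceOutside(s, "pe_rfec_t", "1"));
        System.out.println(replaceOutside(s, "pe_rfec_t", ""));
    }
}
```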

How to split by every space preceded by a dot or comma?

I've got a string in Java: Acidum acetylsalic. Acid.ascorb, Calcium which I want to split. The string has to be cut at every space preceded by a dot or comma: ,[space] or .[space]
As a result I need three strings: Acidum acetylsalic, Acid.ascorb, Calcium
I know I need some regex, and based on other answers I tried "\, |\. " but I suspect that's not how regex works.
Split by
"[,.] "
[,.] - character set with one comma or dot
The problem with your original regex is that in a Java string literal you need to escape the dot once to make it a literal dot in the regex, and a second time to escape the backslash that escapes it. The comma needs no escaping at all. It will also work if you change it to:
", |\\. "
Try this:
str.split("[\\.,]\\s")
....
Use split("(\\.\\s)|(,\\s)")
You need to escape special characters; see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Try this
String str="Acidum acetylsalic. Acid.ascorb, Calcium";
String[] resStr= str.split("[.,][\\s]");
for (String res : resStr) {
System.out.println(res);
}
Output :
Acidum acetylsalic
Acid.ascorb
Calcium
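The patterns suggested in the answers above can be compared side by side. The sketch below (class and method names are hypothetical) runs each of them, written with the escaping a Java string literal needs, against the sample input; each one should yield the same three parts.

```java
import java.util.Arrays;

public class SplitDotComma {
    static String[] splitWith(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String str = "Acidum acetylsalic. Acid.ascorb, Calcium";
        // Each pattern means "a dot or a comma, followed by a space/whitespace".
        String[] patterns = { "[,.] ", ", |\\. ", "[\\.,]\\s", "(\\.\\s)|(,\\s)" };
        for (String p : patterns) {
            System.out.println(p + " -> " + Arrays.toString(splitWith(str, p)));
        }
    }
}
```

The dot in Acid.ascorb survives in every case because it is not followed by a space.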
