regex does not like out# - java

I wrote the following code to remove all hashtag words from my text:
public static void main(String[] args) {
System.out
.println(removeHashtag("Got an infection in my eye. Pharmacist thinks something bitten me. This wouldn't have happened under Simeone. Wenger a#sarcasm #wengerin"));
}
public static String removeHashtag(String commentstr) {
String arrWord[] = commentstr.split(" ");
String sentenceWithoutHash = commentstr;
System.out.println(sentenceWithoutHash);
for (int i = 0; i < arrWord.length; i++) {
if (arrWord[i].contains("#")) {
String regex = "\\s*\\" + arrWord[i] + "\\b\\s*";
sentenceWithoutHash = sentenceWithoutHash.replaceAll(regex, "");
}
}
return sentenceWithoutHash;
}
But this code does not work with this text:
Got an infection in my eye. Pharmacist thinks something bitten me. This wouldn't have happened under Simeone. Wenger out#sarcasm #wengerin
It seems that the regex does not like out#.
Can anyone help?

You can use this regex to remove any word containing #:
String rep = str.replaceAll("\\s*\\w*#\\w*\\s*", "");

This will work as per your condition
((?:[^\s]+)?#[^\s]+)
String x = str.replaceAll("((?:[^\\s]+)?#[^\\s]+)", "");
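A likely reason the original loop trips over out#sarcasm: the pattern is built as "\\s*\\" + word, so a backslash lands directly in front of the word's first letter. For out#sarcasm that produces \o, which, as far as I can tell, java.util.regex rejects as an unsupported escape, whereas \# in front of #wengerin is harmless. The sketch below avoids splicing input text into a pattern at all. The class and method names are made up for illustration, and it collapses the leftover whitespace rather than deleting it, so neighbouring words are not glued together.

```java
class HashtagRemover {
    // Removes every whitespace-delimited token that contains '#'.
    static String removeHashtagWords(String text) {
        return text.replaceAll("\\S*#\\S*", "") // drop tokens containing '#'
                   .replaceAll("\\s+", " ")      // collapse all whitespace runs, including the gaps left behind
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(removeHashtagWords(
                "This wouldn't have happened under Simeone. Wenger out#sarcasm #wengerin"));
    }
}
```

Note that this also normalises any other run of whitespace in the text to a single space, which may or may not be what you want.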

Related

Splitting string on spaces unless in double quotes but double quotes can have a preceding string attached

I need to split a string in Java (first remove whitespaces between quotes and then split at whitespaces.)
"abc test=\"x y z\" magic=\" hello \" hola"
becomes:
firstly:
"abc test=\"xyz\" magic=\"hello\" hola"
and then:
abc
test="xyz"
magic="hello"
hola
Scenario :
I am getting a string like the one above as input, and I want to break it into parts as shown. One approach is to first remove the spaces between quotes and then split at spaces, although the string attached before the opening quote complicates this. Another is to split at spaces unless they are inside quotes, and then remove the spaces from each piece. I tried capturing the quoted parts with "\"([^\"]+)\"", but I'm not able to capture just the spaces inside the quotes. I tried a few more things, but no luck.
We can do this using a formal pattern matcher. The secret sauce of the answer below is the not-much-used Matcher#appendReplacement method. We pause at each match and append a custom replacement for whatever appears inside a pair of quotes. The custom method removeSpaces() strips all whitespace from each quoted term.
public static String removeSpaces(String input) {
return input.replaceAll("\\s+", "");
}
String input = "abc test=\"x y z\" magic=\" hello \" hola";
Pattern p = Pattern.compile("\"(.*?)\"");
Matcher m = p.matcher(input);
StringBuffer sb = new StringBuffer("");
while (m.find()) {
m.appendReplacement(sb, "\"" + removeSpaces(m.group(1)) + "\"");
}
m.appendTail(sb);
String[] parts = sb.toString().split("\\s+");
for (String part : parts) {
System.out.println(part);
}
abc
test="xyz"
magic="hello"
hola
The big caveat here, as the above comments hinted at, is that we are really using a regex engine as a rudimentary parser. To see where my solution would fail fast, just remove one of the quotes by accident from a quoted term. But if you are sure your input is well formed, as you have shown us, this answer might work for you.
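One cheap way to catch the malformed case mentioned above is to check that the input contains an even number of double quotes before running the regex. This is only a sketch: the class and method names are hypothetical, and it assumes quotes are never escaped.

```java
class QuoteCheck {
    // Returns true when the number of double quotes is even,
    // i.e. every opening quote has a closing partner.
    static boolean hasBalancedQuotes(String input) {
        long quotes = input.chars().filter(c -> c == '"').count();
        return quotes % 2 == 0;
    }

    public static void main(String[] args) {
        System.out.println(hasBalancedQuotes("abc test=\"x y z\" hola"));
        System.out.println(hasBalancedQuotes("abc test=\"x y z hola"));
    }
}
```

If the check fails, it is probably safer to reject the input than to let the pattern pair up the wrong quotes.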
I wanted to mention Java 9's Matcher.replaceAll lambda extension:
// Find quoted strings and remove their whitespace:
s = Pattern.compile("\"[^\"]*\"").matcher(s)
.replaceAll(mr -> mr.group().replaceAll("\\s", ""));
// Turn the remaining whitespace into commas and wrap everything in braces.
s = '{' + s.trim().replaceAll("\\s+", ", ") + '}';
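For completeness, here is a self-contained sketch of the same Java 9+ idea that returns the four tokens from the question instead of a braced string. The class name QuotedSplit is made up, and the call to Matcher.quoteReplacement is my addition: the string returned by the lambda is treated as a replacement template, so a literal $ or \ inside the quotes would otherwise be misread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedSplit {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String input) {
        // Strip whitespace inside each quoted section (requires Java 9+).
        String compact = QUOTED.matcher(input)
                .replaceAll(mr -> Matcher.quoteReplacement(mr.group().replaceAll("\\s", "")));
        // Whatever whitespace is left now separates the tokens.
        return Arrays.asList(compact.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        split("abc test=\"x y z\" magic=\" hello \" hola").forEach(System.out::println);
    }
}
```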
The other answer is probably better, but I have written this one, so I will post it here ;) It takes a different approach:
public static void main(String[] args) {
String test="abc test=\"x y z\" magic=\" hello \" hola";
Pattern pattern = Pattern.compile("([^\\\"]+=\\\"[^\\\"]+\\\" )");
Matcher matcher = pattern.matcher(test);
int lastIndex=0;
while(matcher.find()) {
String[] parts=matcher.group(0).trim().split("=");
boolean newLine=false;
for (String string : parts[0].split("\\s+")) {
if(newLine)
System.out.println();
newLine=true;
System.out.print(string);
}
System.out.println("="+parts[1].replaceAll("\\s",""));
lastIndex=matcher.end();
}
System.out.println(test.substring(lastIndex).trim());
}
Result is
abc
test="xyz"
magic="hello"
hola
It sounds like you want to write a basic parser/tokenizer. My bet is that after you make something that can deal with pretty-printing in this structure, you will soon want to start validating that there aren't any mismatched "'s.
But in essence, you have a few stages for this particular problem, and Java has a built-in tokenizer that can prove useful.
import java.util.LinkedList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.stream.Collectors;
public class Q50151376{
private static class Whitespace{
Whitespace(){ }
@Override
public String toString() {
return "\n";
}
}
private static class QuotedString {
public final String string;
QuotedString(String string) {
this.string = "\"" + string.trim() + "\"";
}
@Override
public String toString() {
return string;
}
}
public static void main(String[] args) {
String test = "abc test=\"x y z\" magic=\" hello \" hola";
StringTokenizer tokenizer = new StringTokenizer(test, "\"");
boolean inQuotes = false;
List<Object> out = new LinkedList<>();
while (tokenizer.hasMoreTokens()) {
final String token = tokenizer.nextToken();
if (inQuotes) {
out.add(new QuotedString(token));
} else {
out.addAll(TokenizeWhitespace(token));
}
inQuotes = !inQuotes;
}
System.out.println(joinAsStrings(out));
}
private static String joinAsStrings(List<Object> out) {
return out.stream()
.map(Object::toString)
.collect(Collectors.joining());
}
public static List<Object> TokenizeWhitespace(String in){
List<Object> out = new LinkedList<>();
StringTokenizer tokenizer = new StringTokenizer(in, " ", true);
boolean ignoreWhitespace = false;
while (tokenizer.hasMoreTokens()){
String token = tokenizer.nextToken();
boolean whitespace = token.equals(" ");
if(!whitespace){
out.add(token);
ignoreWhitespace = false;
} else if(!ignoreWhitespace) {
out.add(new Whitespace());
ignoreWhitespace = true;
}
}
return out;
}
}

masking of email address in java

I am trying to mask an email address with "*", but I am bad at regex.
input : nileshxyzae@gmail.com
output : nil********@gmail.com
My code is
String maskedEmail = email.replaceAll("(?<=.{3}).(?=[^@]*?.@)", "*");
but it gives me the output nil*******e@gmail.com. I don't understand what is going wrong here. Why is the last character not converted?
Also, can someone explain what all of this regex means?
Your look-ahead (?=[^@]*?.@) requires at least 1 character to be there in front of @ (see the dot before @).
If you remove it, you will get all the expected symbols replaced:
(?<=.{3}).(?=[^@]*?@)
Here is the regex demo (replace with *).
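To see the effect of that one dot, the sketch below runs both look-aheads on the address from the question, written with the usual @ separator. The class and method names are illustrative only. In both patterns, (?<=.{3}) demands three characters before the one being replaced, and the look-ahead demands that an @ still lies ahead.

```java
class LookaheadDemo {
    // Replaces every character matched by the given regex with '*'.
    static String mask(String email, String regex) {
        return email.replaceAll(regex, "*");
    }

    public static void main(String[] args) {
        String email = "nileshxyzae@gmail.com";
        // Extra dot: at least one character must remain between the match and '@'.
        System.out.println(mask(email, "(?<=.{3}).(?=[^@]*?.@)")); // nil*******e@gmail.com
        // Without the dot: the character right before '@' is masked as well.
        System.out.println(mask(email, "(?<=.{3}).(?=[^@]*?@)"));  // nil********@gmail.com
    }
}
```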
However, the regex is not a proper regex for the task. You need a regex that will match each character after the first 3 characters up to the first @:
(^[^@]{3}|(?!^)\G)[^@]
See another regex demo, replace with $1*. Here, [^@] matches any character that is not @, so we do not match addresses like abc@example.com. Only those emails will be masked that have 4+ characters in the username part.
See IDEONE demo:
String s = "nileshkemse@gmail.com";
System.out.println(s.replaceAll("(^[^@]{3}|(?!^)\\G)[^@]", "$1*"));
If you're bad at regular expressions, don't use them :) I don't know if you've ever heard the quote:
Some people, when confronted with a problem, think
"I know, I'll use regular expressions." Now they have two problems.
(source)
You might get a working regular expression here, but will you understand it today? tomorrow? in six months' time? And will your colleagues?
An easy alternative is using a StringBuilder, and I'd argue that it's a lot more straightforward to understand what is going on here:
StringBuilder sb = new StringBuilder(email);
for (int i = 3; i < sb.length() && sb.charAt(i) != '@'; ++i) {
sb.setCharAt(i, '*');
}
email = sb.toString();
"Starting at the fourth character (index 3), replace the characters with a * until you reach the end of the string or @."
(You don't even need to use StringBuilder: you could simply manipulate the elements of email.toCharArray(), then construct a new string at the end).
Of course, this doesn't work correctly for email addresses where the local part is shorter than 3 characters - it would actually then mask the domain.
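One way to avoid masking the domain is to locate the separator first and only loop up to it. The following is a minimal sketch under that assumption; the class and method names are mine, and an address with three or fewer characters before the @ (or with no @ at all) is simply returned unchanged.

```java
class EmailMasker {
    // Masks everything between the first three characters and the '@'.
    static String maskEmail(String email) {
        int at = email.indexOf('@');
        if (at <= 3) {
            return email; // no '@', or nothing to mask after the first three characters
        }
        StringBuilder sb = new StringBuilder(email);
        for (int i = 3; i < at; i++) {
            sb.setCharAt(i, '*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskEmail("nileshxyzae@gmail.com"));
        System.out.println(maskEmail("ab@x.com"));
    }
}
```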
Your look-ahead is kind of complicated. Try this code:
public static void main(String... args) throws Exception {
String s = "nileshkemse@gmail.com";
s = s.replaceAll("(?<=.{3}).(?=.*@)", "*");
System.out.println(s);
}
Output:
nil********@gmail.com
I like this one because I just want to hide 4 characters; it also dynamically decreases the number of hidden characters to 2 if the email address is too short:
public static String maskEmailAddress(final String email) {
final String mask = "*****";
final int at = email.indexOf("@");
if (at > 2) {
final int maskLen = Math.min(Math.max(at / 2, 2), 4);
final int start = (at - maskLen) / 2;
return email.substring(0, start) + mask.substring(0, maskLen) + email.substring(start + maskLen);
}
return email;
}
Sample outputs:
my.email@gmail.com > my****il@gmail.com
info@mail.com > i**o@mail.com
//In Kotlin
val email = "nileshkemse@gmail.com"
val maskedEmail = email.replace(Regex("(?<=.{3}).(?=.*@)"), "*")
public static string GetMaskedEmail(string emailAddress)
{
string _emailToMask = emailAddress;
try
{
if (!string.IsNullOrEmpty(emailAddress))
{
var _splitEmail = emailAddress.Split(Char.Parse("@"));
var _user = _splitEmail[0];
var _domain = _splitEmail[1];
if (_user.Length > 3)
{
var _maskedUser = _user.Substring(0, 3) + new String(Char.Parse("*"), _user.Length - 3);
_emailToMask = _maskedUser + "@" + _domain;
}
else
{
_emailToMask = new String(Char.Parse("*"), _user.Length) + "@" + _domain;
}
}
}
}
catch (Exception) { }
return _emailToMask;
}

Java - End for loop at end of word

I'm trying to make a Minecraft Bukkit plugin, and it involves making hashtags and such. I have it so that when you do #hashtag-goes-here it'll highlight it. The only problem is, you must have another word after it (to have a space) for it to work. This is my code so far:
try{
for(int i = index; i < message.length(); i++){
String str = Character.toString(message.charAt(i));
String sbString = sb.toString().trim();
System.out.println(sbString);
if(str.equals(" ")){
str.replace(str, str + ChatColor.RESET);
String hName = sb.toString().replaceFirst("#", "").trim();
String newMessage = message.replaceAll("#", ChatColor.AQUA + "#").replace(str, ChatColor.RESET + str);
event.setMessage(newMessage);
logHashtag(event.getPlayer(), event.getMessage(), hName);
break;
}else{
sb.append(str);
}
}
}catch(Exception e){
throw new HashtagException("Failed to change hashtag colors in message!");
}
Edit: (Already answered like, last year, I know; this is so that people who read this know what I was asking) My question was so that it could work in all situations that the hashtag could be found in. Thanks to halfbit for helping me :)
If you want to find strings in a text and highlight them, then regular expressions can be quite useful. You might use code similar to the following:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HashTagColorizer {
public static void main(String[] args) {
String AQUA = "<AQUA>", RESET = "<RESET>";
String message = "Aaa #hashtag-goes-here bbb #another-hashtag ccc";
Pattern pattern = Pattern.compile("#([A-Za-z0-9-]+)");
Matcher matcher = pattern.matcher(message);
StringBuilder sb = new StringBuilder(message.length());
int position = 0;
while (matcher.find(position)) {
sb.append(message.substring(position, matcher.start()));
sb.append(AQUA);
System.out.println("event for " + matcher.group(1));
sb.append(matcher.group().substring(1));
sb.append(RESET);
position = matcher.end();
}
sb.append(message.substring(position));
System.out.println(sb);
// Aaa <AQUA>hashtag-goes-here<RESET> bbb <AQUA>another-hashtag<RESET> ccc
}
}
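If the per-hashtag event is not needed, the same highlighting can probably be done with a single replaceAll, using $0 to refer to the whole match. This sketch keeps the # inside the coloured span and takes the colour codes as plain strings (placeholders for ChatColor.AQUA and ChatColor.RESET). The class name is hypothetical. Because the pattern does not depend on a following space, a hashtag at the very end of the message is handled too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagHighlighter {
    private static final Pattern HASHTAG = Pattern.compile("#[A-Za-z0-9-]+");

    // Wraps every hashtag in the given "on" and "off" colour codes.
    static String colorize(String message, String on, String off) {
        // quoteReplacement guards against '$' or '\' inside the colour codes.
        String replacement = Matcher.quoteReplacement(on) + "$0" + Matcher.quoteReplacement(off);
        return HASHTAG.matcher(message).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(colorize("Aaa #hashtag-goes-here bbb #another-hashtag", "<AQUA>", "<RESET>"));
    }
}
```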
Instead of doing your logic in the if on sb, do it in the else on the str (or in the sb after appending str, but you will repeat a lot of processing in the sections of the sb already processed).
You are welcome.
BTW:
str.replace(str, str + ChatColor.RESET);
does nothing because String instances are immutable
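A small sketch of that point, with made-up method names: String.replace returns a new string, so the result has to be assigned or returned for the change to be visible.

```java
class ImmutableStringDemo {
    // The replaced copy is thrown away, so the caller gets the original text back.
    static String discardResult(String s) {
        s.replace(" ", "_");
        return s;
    }

    // The replaced copy is returned, so the change is visible.
    static String keepResult(String s) {
        return s.replace(" ", "_");
    }

    public static void main(String[] args) {
        System.out.println(discardResult("a b")); // a b
        System.out.println(keepResult("a b"));    // a_b
    }
}
```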

How to remove matched words from end of String

I want to remove the following words from the end of a String: 'PTE', 'LTD', 'PRIVATE' and 'LIMITED'.
I tried some code but then got stuck. This is what I tried:
String[] str = {"PTE", "LTD", "PRIVATE", "LIMITED"};
String company = "Basit LTD";
for(int i=0;i<str.length;i++) {
if (company.endsWith(str[i])) {
int position = company.lastIndexOf(str[i]);
company = company.substring(0, position);
}
}
System.out.println(company.replaceAll("\\s",""));
It worked. But suppose the company is Basit LIMITED PRIVATE LTD PTE or Basit LIMITED PRIVATE PTE LTD, or any combination of the four words at the end. Then the above code removes only the last word, i.e. PTE or PRIVATE and so on, and the output is BasitLIMITEDPRIVATELTD.
I want the output to be just Basit.
How can I do it?
Thanks
---------------Edit---
Please note that the company name here is just an example; it will not always be the same. I may have a name like
String company = "Masood LIMITED LTD PTE PRIVATE";
or any name that has the above-mentioned words at the end.
Thanks
You can do this in a single line, with no need to loop. Just use String#replaceAll(regex, str).
company = company.replaceAll("PTE$*?|LTD$*?|PRIVATE$*?|LIMITED$*?","");
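Note that a pattern like the one above is not tied to the end of the string, so it removes the words wherever they occur. If only trailing words should go, one option is to anchor a repeated group at the end. This is a sketch rather than a drop-in fix: the class name is made up, and it assumes the words are separated from the name by whitespace.

```java
import java.util.regex.Pattern;

class TrailingWords {
    // One or more of the listed words, each preceded by whitespace, at the very end.
    private static final Pattern TRAILING =
            Pattern.compile("(?:\\s+(?:PTE|LTD|PRIVATE|LIMITED))+\\s*$");

    static String strip(String company) {
        return TRAILING.matcher(company).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Basit LIMITED PRIVATE LTD PTE")); // Basit
        System.out.println(strip("PTE Basit LTD"));                 // PTE Basit
    }
}
```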
If you place the unwanted words in a map, they will be omitted from the resulting string:
HashMap<String, String> map = new HashMap<>();
map.put("PTE", "");
map.put("LTD", "");
map.put("PRIVATE", "");
map.put("LIMITED", "");
String company = "Basit LTD PRIVATE PTE";
String words[] = company.split(" ");
String resultantStr = "";
for(int k = 0; k < words.length; k++){
if(map.get(words[k]) == null) {
resultantStr += words[k] + " ";
}
}
resultantStr = resultantStr.trim();
System.out.println(" Trimmed String: "+ resultantStr);
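Since only the keys of the map are ever used, a Set expresses the same idea a little more directly. The sketch below is an illustrative variant with hypothetical names. Like the map version, it drops the unwanted words wherever they appear, not only at the end.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class WordFilter {
    private static final Set<String> UNWANTED =
            new HashSet<>(Arrays.asList("PTE", "LTD", "PRIVATE", "LIMITED"));

    // Keeps every word that is not in the unwanted set and re-joins with single spaces.
    static String dropUnwanted(String company) {
        return Arrays.stream(company.trim().split("\\s+"))
                .filter(word -> !UNWANTED.contains(word))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(dropUnwanted("Basit LTD PRIVATE PTE")); // Basit
    }
}
```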
If you want to remove these suffixes only at the end of the string, then you could introduce a while loop:
String[] str = {"PTE", "LTD", "PRIVATE", "LIMITED"};
boolean foundSuffix = true;
String company = "Basit LTD";
while (foundSuffix) {
foundSuffix = false;
for(int i=0;i<str.length;i++) {
if (company.endsWith(str[i])) {
foundSuffix = true;
int position = company.lastIndexOf(str[i]);
company = company.substring(0, position).trim(); // trim so a trailing space does not hide the next suffix
}
}
}
System.out.println(company.replaceAll("\\s",""));
If you don't mind transforming PTE Basit LIMITED INC to Basit (and also remove the first PTE), then replaceAll should work, as explained by others.
I was trying to do exactly the same thing for one of my projects. I wrote this code a few days ago. I was looking for a better way to do it, which is how I found this question. But after seeing the other answers, I decided to share my version of the code.
Collection<String> stopWordSet = Arrays.asList("PTE", "LTD", "PRIVATE", "LIMITED");
String company = "Basit LTD"; //Or Anything
String[] tokens = company.split("[\\@\\]\\\\\\_\\^\\[\"\\#\\ \\!\\&\\'\\`\\$\\%\\*\\+\\(\\)\\.\\/\\,\\-\\;\\~\\:\\}\\|\\{\\?\\>\\=\\<]+");
Stack<String> tokenStack = new Stack<>();
tokenStack.addAll(Arrays.asList(tokens));
while (!tokenStack.isEmpty()) {
String token = tokenStack.peek();
if (stopWordSet.contains(token))
tokenStack.pop();
else
break;
}
String formattedCompanyName = StringUtils.join(tokenStack.toArray());
Try this :
public static void main(String a[]) {
String[] str = {"PTE", "LTD", "PRIVATE", "LIMITED"};
String company = "Basit LIMITED PRIVATE LTD PTE";
for(int i=0;i<str.length;i++) {
company = company.replaceAll(str[i], "");
}
System.out.println(company.replaceAll("\\s",""));
}
All you need to do is use trim() and call your function recursively, or, each time you remove a substring from the end, reset your i to 0.
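The "reset i" variant might look like the sketch below. The names are hypothetical, and it assumes none of the suffixes is an empty string (otherwise the loop would never finish). Setting i to -1 rather than 0 matters, because the loop's own i++ runs right after the reset. Like the original loop, endsWith does not check word boundaries, so a name ending in UNLIMITED would lose its LIMITED part.

```java
class SuffixStripper {
    // Repeatedly strips any of the given suffixes from the end of the name.
    static String stripSuffixes(String company, String[] suffixes) {
        String result = company.trim();
        for (int i = 0; i < suffixes.length; i++) {
            if (result.endsWith(suffixes[i])) {
                // Cut the suffix off and trim, so the next check sees the previous word.
                result = result.substring(0, result.length() - suffixes[i].length()).trim();
                i = -1; // the loop's i++ turns this into 0, restarting the scan
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] suffixes = {"PTE", "LTD", "PRIVATE", "LIMITED"};
        System.out.println(stripSuffixes("Basit LIMITED PRIVATE LTD PTE", suffixes)); // Basit
    }
}
```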
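import java.util.StringTokenizer;

public class StringMatchRemove {
    public static void main(String[] args) {
        String str = "my name is noorus khan";
        String search = "noorus";
        String newString = "";
        // Replace the unwanted word with a space, which leaves extra whitespace behind
        String word = str.replace(search, " ");
        // Re-join the remaining tokens with single spaces
        StringTokenizer st = new StringTokenizer(word, " ");
        while (st.hasMoreTokens()) {
            newString = newString + st.nextToken() + " ";
        }
        System.out.println(newString);
    }
}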
First, using the replace method, we get word = my name is...khan (here each . represents a space). To get rid of these extra spaces, we build a new string by adding the tokens back one at a time.
Output: my name is khan

Split a quoted string with a delimiter

I want to split a string using whitespace as the delimiter, but it should handle quoted strings intelligently. E.g. for a string like
"John Smith" Ted Barry
It should return three strings John Smith, Ted and Barry.
After messing around with it, you can use Regex for this. Run the equivalent of "match all" on:
((?<=("))[\w ]*(?=("(\s|$))))|((?<!")\w+(?!"))
A Java Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Test
{
public static void main(String[] args)
{
String someString = "\"Multiple quote test\" not in quotes \"inside quote\" \"A work in progress\"";
Pattern p = Pattern.compile("((?<=(\"))[\\w ]*(?=(\"(\\s|$))))|((?<!\")\\w+(?!\"))");
Matcher m = p.matcher(someString);
while(m.find()) {
System.out.println("'" + m.group() + "'");
}
}
}
Output:
'Multiple quote test'
'not'
'in'
'quotes'
'inside quote'
'A work in progress'
The regular expression breakdown with the example used above can be viewed here:
http://regex101.com/r/wM6yT9
With all that said, regular expressions should not be the go-to solution for everything; I was just having fun. This example has a lot of edge cases, such as the handling of Unicode characters, symbols, etc. You would be better off using a tried and true library for this sort of task. Take a look at the other answers before using this one.
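For comparison, a simpler pattern that avoids look-arounds is an alternation of "quoted run" and "run of non-whitespace", picking whichever group matched. This is a different technique from the regex above, offered only as a sketch. The class name is made up, escaped quotes are not supported, and an unbalanced quote ends up inside an ordinary token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokens {
    // Group 1: text inside double quotes. Group 2: a bare run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"John Smith\" Ted Barry")); // [John Smith, Ted, Barry]
    }
}
```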
Try this ugly bit of code.
String str = "hello my dear \"John Smith\" where is Ted Barry";
List<String> list = Arrays.asList(str.split("\\s"));
List<String> resultList = new ArrayList<String>();
StringBuilder builder = new StringBuilder();
for(String s : list){
if(s.startsWith("\"")) {
builder.append(s.substring(1)).append(" ");
} else {
resultList.add((s.endsWith("\"")
? builder.append(s.substring(0, s.length() - 1))
: builder.append(s)).toString());
builder.delete(0, builder.length());
}
}
System.out.println(resultList);
Well, I made a small snippet that does what you want and a bit more. Since you did not specify more conditions, I did not go to the trouble of handling them. I know this is a dirty way and you can probably get better results with something that is already made, but for the fun of programming here is the example:
String example = "hello\"John Smith\" Ted Barry lol\"Basi German\"hello";
int wordQuoteStartIndex=0;
int wordQuoteEndIndex=0;
int wordSpaceStartIndex = 0;
int wordSpaceEndIndex = 0;
boolean foundQuote = false;
for(int index=0;index<example.length();index++) {
if(example.charAt(index)=='\"') {
if(foundQuote==true) {
wordQuoteEndIndex=index+1;
//Print the quoted word
System.out.println(example.substring(wordQuoteStartIndex, wordQuoteEndIndex));//here you can remove quotes by changing to (wordQuoteStartIndex+1, wordQuoteEndIndex-1)
foundQuote=false;
if(index+1<example.length()) {
wordSpaceStartIndex = index+1;
}
}else {
wordSpaceEndIndex=index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordQuoteStartIndex=index;
foundQuote = true;
}
}
if(foundQuote==false) {
if(example.charAt(index)==' ') {
wordSpaceEndIndex = index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordSpaceStartIndex = index+1;
}
if(index==example.length()-1) {
if(example.charAt(index)!='\"') {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, example.length()));
}
}
}
}
This also checks for words that are not separated by a space before or after the quotes, such as the words "hello" before "John Smith" and after "Basi German".
When the string is changed to "John Smith" Ted Barry, the output is three strings:
1) "John Smith"
2) Ted
3) Barry
The string in the example is hello"John Smith" Ted Barry lol"Basi German"hello and prints
1)hello
2)"John Smith"
3)Ted
4)Barry
5)lol
6)"Basi German"
7)hello
Hope it helps
This is my own version, cleaned up from http://pastebin.com/aZngu65y (posted in the comment).
It can take care of Unicode. It will clean up all excess spaces (even inside quotes), which can be good or bad depending on the need. There is no support for escaped quotes.
private static String[] parse(String param) {
String[] output;
param = param.replaceAll("\"", " \" ").trim();
String[] fragments = param.split("\\s+");
int curr = 0;
boolean matched = fragments[curr].matches("[^\"]*");
if (matched) curr++;
for (int i = 1; i < fragments.length; i++) {
if (!matched)
fragments[curr] = fragments[curr] + " " + fragments[i];
if (!fragments[curr].matches("(\"[^\"]*\"|[^\"]*)"))
matched = false;
else {
matched = true;
if (fragments[curr].matches("\"[^\"]*\""))
fragments[curr] = fragments[curr].substring(1, fragments[curr].length() - 1).trim();
if (fragments[curr].length() != 0)
curr++;
if (i + 1 < fragments.length)
fragments[curr] = fragments[i + 1];
}
}
if (matched) {
return Arrays.copyOf(fragments, curr);
}
return null; // Parameter failure (double-quotes do not match up properly).
}
Sample input for comparison:
"sdfskjf" sdfjkhsd "hfrif ehref" "fksdfj sdkfj fkdsjf" sdf sfssd
asjdhj sdf ffhj "fdsf fsdjh"
日本語 中文 "Tiếng Việt" "English"
dsfsd
sdf " s dfs fsd f " sd f fs df fdssf "日本語 中文"
"" "" ""
" sdfsfds " "f fsdf
(The 2nd line is empty, the 3rd line is spaces, and the last line is malformed.)
Please judge against your own expected output, since it may vary, but the baseline is that the 1st case should return [sdfskjf, sdfjkhsd, hfrif ehref, fksdfj sdkfj fkdsjf, sdf, sfssd].
commons-lang has a StrTokenizer class to do this for you, and there is also the java-csv library.
Example with StrTokenizer:
String params = "\"John Smith\" Ted Barry";
// Initialize tokenizer with input string, delimiter character, quote character
StrTokenizer tokenizer = new StrTokenizer(params, ' ', '"');
for (String token : tokenizer.getTokenArray()) {
System.out.println(token);
}
Output:
John Smith
Ted
Barry
