Regular expressions: all words after my current one are gone - java

I need to remove all strings from my text file, such as:
flickr:user=32jdisffs
flickr:user=acssd
flickr:user=asddsa89
I'm currently using fields[i] = fields[i].replaceAll(" , flickr:user=.*", "");
However, the issue with this approach is that every word after flickr:user= is removed from the content, even words that come after the space.
thanks

You probably need
replaceAll("flickr:user=[0-9A-Za-z]+", "");

flickr:user=\w+ should do it:
String noFlickerIdsHere = stringWithIds.replaceAll("flickr:user=\\w+", "");
Reference:
\w = A word character: [a-zA-Z_0-9]

Going by the question as stated, chances are that you want:
fields[i] = fields[i].replaceAll(" , flickr:user=[^ ]* ", ""); // or " "
This will match the string, including the value of user up to but not including the first space, followed by a space, and replace it either by a blank string, or a single space. However this will (barring the comma) net you an empty result with the input you showed. Is that really what you want?
I'm also not sure where the " , " at the beginning fits into the example you showed.
The reason for your difficulties is that an unbounded .* will match everything from that point up until the end of the input (even if that amounts to nothing; that's what the * is for). For a line-based regular expression parser, that's to the end of the line.
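To make the difference concrete, here is a minimal sketch comparing the greedy pattern from the question with the bounded \w+ pattern from the answers above. The class name, method names, and sample input are made up for illustration; only the two patterns come from this thread.

```java
public class FlickrTagDemo {
    // Greedy: ".*" runs to the end of the line, so the words after the tag are lost too.
    static String stripGreedy(String s) {
        return s.replaceAll("flickr:user=.*", "");
    }

    // Bounded: "\\w+" stops at the first non-word character (here, the space).
    static String stripBounded(String s) {
        return s.replaceAll("flickr:user=\\w+", "");
    }

    public static void main(String[] args) {
        String sample = "sunset, flickr:user=32jdisffs beach"; // hypothetical input
        // Brackets make leading/trailing spaces visible in the output.
        System.out.println("[" + stripGreedy(sample) + "]");
        System.out.println("[" + stripBounded(sample) + "]");
    }
}
```

Note that the bounded version leaves the spaces on both sides of the removed tag in place, so you may want to collapse double spaces afterwards.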

Related

Replace a nth character using regex in Java

I'm trying to learn regex in Java.
So far, I've been trying some little mini challenges and I'm wondering if there is a way to define a nth character.
For instance, let's say I have this string: todayiwasnotagoodday
If I want to replace the third (or fourth, or seventh) character, how can I define a regex that changes a specific "index"? In this example, that would mean replacing the 'd' with an empty space.
I've been searching about it, but so far my implementations match from the first element to the third: ^[a-z]{3}
Is it possible to define this regex?
Thanks in advance.
If you want to replace the third character with a space via regex, you could try a regex replace all:
String input = "todayiwasnotagoodday";
String output = input.replaceAll("^(.{2}).(.*)$", "$1 $2");
System.out.println(output); // to ayiwasnotagoodday
Note that you could also avoid regex here, and just use substring operations:
String output = input.substring(0, 2) + " " + input.substring(3);
System.out.println(output); // to ayiwasnotagoodday
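If the position is not fixed, the same idea can be generalized by building the quantifier from a parameter. This is only a sketch: replaceNth is a hypothetical helper name, positions are assumed to be 1-based, and (?s) is added so that the dot also counts line breaks as characters.

```java
import java.util.regex.Matcher;

public class NthCharDemo {
    // Replaces the n-th character (1-based) of s.
    // Returns s unchanged if s has fewer than n characters.
    static String replaceNth(String s, int n, String replacement) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Group 1 captures the first n-1 characters; the final dot is the one being replaced.
        String regex = "(?s)^(.{" + (n - 1) + "}).";
        // quoteReplacement keeps '$' and '\' in the replacement from being interpreted.
        return s.replaceFirst(regex, "$1" + Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceNth("todayiwasnotagoodday", 3, " "));
    }
}
```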

How to check and replace a sequence of characters in a String?

Here is what the program is expecting as the output:
if originalString = "CATCATICATAMCATCATGREATCATCAT";
Output should be "I AM GREAT".
The code must find the sequence of characters (CAT in this case), and remove them. Plus, the resulting String must have spaces in between words.
String origString = remixString.replace("CAT", "");
I figured out I have to use String.replace, but what could be the logic for finding whatever is not CAT and producing the resulting string with spaces in between the words?
First off, you probably want to use the replaceAll method instead, because it accepts a regular expression (replace also replaces every occurrence, but only of a literal string). Then, you want to introduce spaces, so instead of an empty String, replace "CAT" with " " (space).
As pointed out by the comment below, there might be multiple spaces between words, so we use a regular expression to replace one or more consecutive instances of "CAT" with a single space. The '+' symbol means "one or more".
Finally, trim the String to get rid of leading and trailing white space.
remixString.replaceAll("(CAT)+", " ").trim()
You can use replaceAll which accepts a regular expression:
String remixString = "CATCATICATAMCATCATGREATCATCAT";
String origString = remixString.replaceAll("(CAT)+", " ").trim();
Note: the naming of replace and replaceAll is very confusing. They both replace all instances of the matching string; the difference is that replace takes a literal text as an argument, while replaceAll takes a regular expression.
Maybe this will help
String result = remixString.replaceAll("(CAT){1,}", " ");
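For comparison, here is a small sketch that runs both approaches on the string from the question. The class and method names are made up for illustration. It shows that replace already handles every occurrence, and that the regex version is only needed to collapse runs of "CAT" into one space.

```java
public class RemixDemo {
    // Literal replacement: every "CAT" becomes its own space, so runs leave double spaces.
    static String literalReplace(String s) {
        return s.replace("CAT", " ").trim();
    }

    // Regex replacement: a whole run of "CAT"s becomes a single space.
    static String collapseRuns(String s) {
        return s.replaceAll("(CAT)+", " ").trim();
    }

    public static void main(String[] args) {
        String remixString = "CATCATICATAMCATCATGREATCATCAT";
        System.out.println("[" + literalReplace(remixString) + "]");
        System.out.println("[" + collapseRuns(remixString) + "]");
    }
}
```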

Search on a particular line using Regular Expression in Java

I am new to regular expressions, so my question might be a very basic one.
I want to create a regular expression that can search for an expression on a particular line number.
eg.
I have data
"\nerferf erferfre erferf 12545" +
"\ndsf erf" +
"\nsdsfd refrf refref" +
"\nerferf erferfre erferf 12545" +
"\ndsf erf" +
"\nsdsfd refrf refref" +
"\nerferf erferfre erferf 12545" +
"\ndsf erf" +
"\nsdsfd refrf refref" +
"\nerferf erferfre erferf 12545" +
And I want to search for the number 1234 on the 7th line. It may or may not be present on other lines as well.
I have tried with
"\\n.*\\n.*\\n.*\\n.*\\n.*\\n.*\\d{4}"
but am not getting the result.
Please help me out with the regular expression.
Firstly, your newline character should be placed at the end of the lines. That way, picturing a particular line would be easier. The explanation below is based on this modification.
Now, to get to the 7th line, you first need to skip the first 6 lines, which you can do with the {n} quantifier. You don't need to write .*\n 6 times. So, that would look like this:
(.*\n){6}
And then you are at 7th line, where you can match your required digit. That part would be something like this:
.*?1234
And then match rest of the text, using .*
So, your final regex would look like:
(?s)(.*\n){6}.*?1234.*
So, just use String#matches(regex) method with this regex.
P.S. (?s) enables single-line (DOTALL) mode, since by default the dot (.) does not match the newline character.
To print something you matched, you can use capture groups:
(?s)(?:.*\n){6}.*?(1234).*
This will capture 1234 in group 1 if it is matched. It is unusual, though, to capture an exact string that you are matching: capturing 1234 makes little sense here, since you already know what you matched. It would be different if you were matching against \\d, in which case you might be interested in exactly which digits were found.
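One thing worth checking with the (?s) form: once the dot is allowed to match newlines, the group (.*\n) can swallow more than one line, so the pattern no longer pins the match to a specific line. A possible alternative, sketched below, leaves (?s) off and anchors at the start of the input with \A. The helper names onLine and dotAllMatches are made up for illustration, lines are assumed to end in a plain \n, and line numbers are 1-based.

```java
import java.util.regex.Pattern;

public class LineSearchDemo {
    // True only if needle occurs on the given 1-based line.
    // Without (?s), '.' cannot cross a newline, so each "(?:.*\n)" consumes exactly one line.
    static boolean onLine(String text, int lineNumber, String needle) {
        String regex = "\\A(?:.*\\n){" + (lineNumber - 1) + "}.*" + Pattern.quote(needle);
        return Pattern.compile(regex).matcher(text).find();
    }

    // The DOTALL form, for comparison: here '.' may also match '\n'.
    static boolean dotAllMatches(String text, int lineNumber, String needle) {
        String regex = "(?s)(.*\\n){" + (lineNumber - 1) + "}.*?" + Pattern.quote(needle) + ".*";
        return text.matches(regex);
    }

    public static void main(String[] args) {
        String text = "a\nb\nc 1234\n"; // 1234 is on line 3 only
        System.out.println(onLine(text, 3, "1234"));
        System.out.println(onLine(text, 2, "1234"));
        System.out.println(dotAllMatches(text, 2, "1234"));
    }
}
```

In this sample the DOTALL version also reports a match when asked about line 2, because the group is free to absorb two lines before the lazy part finds 1234 on line 3.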
Try
Pattern p = Pattern.compile("^(\\n.*){6}\\n.*\\d{4}" );
System.out.println(p.matcher(s).find());
This problem is better not solved with regex alone. Start by splitting the string on a newline character, to get an array of lines:
String[] lines = data.split("\\n");
Then, to execute the regex on line 7:
try {
String line7 = lines[6];
// do something with it
} catch (IndexOutOfBoundsException ex) {
System.err.println("Line not found");
}
Hope this is a start for you.
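A slightly fuller sketch of the split-then-match idea, with a bounds check instead of the try/catch. The helper name lineMatches is made up for illustration, and line numbers are assumed to be 1-based.

```java
import java.util.regex.Pattern;

public class SplitLineDemo {
    // Splits the data into lines and runs the regex against one line only.
    static boolean lineMatches(String data, int lineNumber, String regex) {
        String[] lines = data.split("\n", -1); // limit -1 keeps trailing empty lines
        if (lineNumber < 1 || lineNumber > lines.length) {
            return false; // line not found
        }
        return Pattern.compile(regex).matcher(lines[lineNumber - 1]).find();
    }

    public static void main(String[] args) {
        String data = "x 12\ny\nz 1234"; // hypothetical input
        System.out.println(lineMatches(data, 3, "\\d{4}"));
        System.out.println(lineMatches(data, 1, "\\d{4}"));
    }
}
```

Keep in mind that the data in the question starts with "\n", so after splitting, index 0 would be an empty string and the first visible line would be at index 1.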
Edit: I'm not a pro in Regex but I would try with this one:
"(\\n.*){5}(.*)"
Sorry if this isn't the correct Java syntax but this should capture 5 new lines + data first, so that's six lines gone, and the data itself should be available in the second capture group (including newline). If you want to exclude the newline in front:
"(\\n.*){5}\\n(.*)"
You can use:
(^.*\r\n)(^.*\r\n)(^.*\r\n)(^.*\r\n)(^.*\r\n)(^.*\r\n)(^.*)(1234)

Removing all standalone occurrences of a word from a string with regular expressions in Java

Need advice on how to replace a sub-string like: #sometext, but not replace "#someothertext#somemail.com" sub-string.
For example, when I've got a string something like:
An example with #sometext and also with "#someothertext#somemail.com" sometextafter
And the result, after replacing sub-strings in string above should look like:
An example with and also with "#someothertext#somemail.com" sometextafter
After getting string from a field, I'm using:
String textMod = someText.replaceAll("( |^)[^\"]#[^#]+?( |$)","");
someText = textMod + "#\"" + someone.getEmail() + "\" ";
And then I'm setting this string into field.
You can match a standalone occurrence with a regex this way
\b#sometext\b
Putting the \b in front and in the back of the #sometext will make sure that it's a standalone word, not part of another word like #someothertext#sometext.com. Then if it's found the result will be put inside $match, now you can do whatever you want with $match
Hope this helps
From https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
The \b in the pattern indicates a word boundary, so only the distinct word "web" is matched, and not a word partial like "webbing" or "cobweb"
if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
echo "A match was found.";
}
^ PHP example but you get the point
If there is always a space before and behind the tags to replace, this might suffice.
/\s(#\w+)\s/g
Try this
(?<!\w)#[^#\s]+(?!\S)
See it here on Regexr
Match on a # but only if there is no word character \w before (?<!\w). Then match a sequence of characters that are not # and not whitespace \s but only if its not followed by a non whitespace \S
(?<!\w) is called a negative lookbehind assertion
[^#\s] is called a negated character class, means match anything that is not part of the class
(?!\S) is a negative lookahead assertion
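Here is how that lookaround pattern could be tried in Java against the sentence from the question. This is a sketch only: the class and method names are made up for illustration, and the backslashes are doubled because the pattern sits inside a Java string literal.

```java
public class StandaloneTagDemo {
    // Removes "#word" tokens that are not preceded by a word character
    // and that run up to whitespace or the end of the input.
    static String removeStandaloneTags(String s) {
        return s.replaceAll("(?<!\\w)#[^#\\s]+(?!\\S)", "");
    }

    public static void main(String[] args) {
        String input = "An example with #sometext and also with "
                + "\"#someothertext#somemail.com\" sometextafter";
        System.out.println(removeStandaloneTags(input));
    }
}
```

The quoted "#someothertext#somemail.com" part is left alone. The spaces on both sides of the removed tag stay in place, so the result has a double space where #sometext used to be.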
This should correspond to your needs:
str = str.replaceAll("#\\w+[^#]", "");
(c#, regex based)
//match #xxx sequences, but only if i can look back and NOT see a #xxx immediately preceding me, and if I don't end with a #
string input = @"[An example with #hello and also with ""##hello#somemail.com"" sometext #lastone";
var pattern = @"(?<!#\w+)(?>#\w+)(?!#)";
var matches = Regex.Matches(input, pattern);
Simply adding spaces before and after "#sometext" would not work if "#sometext" is at the start or end of a sentence. However, just adding a pattern checking for start or end of sentence would not work either, as when you match "#sometext " at the start of a sentence and leave a space " ", this will make the resulting string look strange. Same goes for the end of a sentence.
We need to split the regex replace into two actions and perform two separate regex replaces:
str = str.replaceAll(" #sometext ", " ");
str = str.replaceAll("^#sometext | #sometext$|(?:#sometext ){2,}", "");
^ means start of line, $ means end of line.
EDIT: Added corner case handling of when several #sometext's are after each other.
myString = myString.replaceAll(" #hello ", " ");
If #hello is a single word, then it has spaces before and after, right? So you should find all #hellos with space before and after and replace it with a space.
If you need to remove not only #hello but all words that start with # and contain no other #, use this:
myString = myString.replaceAll(" #[^#]+? ", " ");
[^#] is any symbol except #. +? means match at least one character until reaching the first space.
If you want to remove words with only alphanumeric characters, use \\w instead of [^#]
EDIT:
Yeah, ohaal's right. To make it match at the start and the end of string use this pattern:
( |^)#[^#]+?( |$)
myString = myString.replaceAll("( |^)#hello( |$)", " ");

Text cleaning and replacement: delete \n from a text in Java

I'm cleaning an incoming text in my Java code. The text includes a lot of "\n", but not as in a new line, but literally "\n". I was using replaceAll() from the String class, but haven't been able to delete the "\n".
This doesn't seem to work:
String string;
string = string.replaceAll("\\n", "");
Neither does this:
String string;
string = string.replaceAll("\n", "");
I guess this last one is identified as an actual new line, so all the new lines from the text would be removed.
Also, what would be an effective way to remove different patterns of wrong text from a String. I'm using regular expressions to detect them, stuff like HTML reserved characters, etc. and replaceAll, but everytime I use replaceAll, the whole String is read, right?
UPDATE: Thanks for your great answers. I've extended this question here:
Text replacement efficiency
I'm asking specifically about efficiency :D
Hooknc is right. I'd just like to post a little explanation:
"\\n" translates to "\n" after the compiler is done (since you escape the backslash). So the regex engine sees "\n" and thinks new line, and would remove those (and not the literal "\n" you have).
"\n" is translated to a real new line by the compiler, so the new line character is sent to the regex engine.
"\\\\n" is ugly, but right. The compiler removes the escape sequences, so the regex engine sees "\\n". The regex engine sees the two backslashes and knows that the first one escapes it so that translates to checking for the literal characters '\' and 'n', giving you the desired result.
Java is nice (it's the language I work in) but having to think to basically double-escape regexes can be a real challenge. For extra fun, it seems StackOverflow likes to try to translate backslashes too.
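The three cases can be checked side by side with a small sketch like the one below. The class name, method names, and sample string are made up for illustration. The sample contains both a literal backslash followed by n and one real line feed.

```java
public class EscapeDemo {
    // In Java source, "\\n" is two characters (backslash, n); "\n" is one line feed.
    // So SAMPLE is: a, backslash, n, b, line feed, c
    static final String SAMPLE = "a\\nb\nc";

    // The regex engine receives a raw line feed character: removes the real newline.
    static String oneBackslash(String s) {
        return s.replaceAll("\n", "");
    }

    // The regex engine receives \n, its escape for a newline: also removes the real newline.
    static String twoBackslashes(String s) {
        return s.replaceAll("\\n", "");
    }

    // The regex engine receives \\n, an escaped backslash followed by n: removes the literal text.
    static String fourBackslashes(String s) {
        return s.replaceAll("\\\\n", "");
    }

    public static void main(String[] args) {
        System.out.println(oneBackslash(SAMPLE));
        System.out.println(twoBackslashes(SAMPLE));
        System.out.println(fourBackslashes(SAMPLE));
    }
}
```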
I think you need to add a couple more slashies...
String string;
string = string.replaceAll("\\\\n", "");
Explanation:
The number of slashies has to do with the fact that "\n" by itself is a control character in Java.
So to get the real characters of "\n" somewhere we need to use "\\n", which if printed out will give us: "\n"
You're looking to replace all "\n" in your file, but you're not looking to replace the control "\n". So you tried "\\n", which will be converted into the characters "\n". Great, but maybe not so much. My guess is that the replaceAll method will actually create a Regular Expression now using the "\n" characters, which will be misread as the control character "\n".
Whew, almost done.
Using replaceAll("\\\\n", "") will first convert "\\\\n" -> "\\n", which will be used by the Regular Expression. The "\\n" will then be used in the Regular Expression and actually represents your text of "\n", which is what you're looking to replace.
Instead of String.replaceAll(), which uses regular expressions, you might be better off using String.replace(), which does simple string substitution (if you are using at least Java 1.5).
String replacement = string.replace("\\n", "");
should do what you want.
string = string.replaceAll(""+(char)10, " ");
Try this. Hope it helps.
raw = raw.replaceAll("\t", "");
raw = raw.replaceAll("\n", "");
raw = raw.replaceAll("\r", "");
The other answers have sufficiently covered how to do this with replaceAll, and how you need to escape backslashes as necessary.
Since Java 1.5, there is also String.replace(CharSequence, CharSequence), which performs literal string replacement. This can greatly simplify many string replacement problems, because there is no need to escape any regular expression metacharacters like ., *, |, and yes, \ itself.
Thus, given a string that can contain the substring "\n" (not '\n'), we can delete them as follows:
String before = "Hi!\\n How are you?\\n I'm \n good!";
System.out.println(before);
// Hi!\n How are you?\n I'm
// good!
String after = before.replace("\\n", "");
System.out.println(after);
// Hi! How are you? I'm
// good!
Note that if you insist on using replaceAll, you can prevent the ugliness by using Pattern.quote:
System.out.println(
before.replaceAll(Pattern.quote("\\n"), "")
);
// Hi! How are you? I'm
// good!
You should also use Pattern.quote when you're given an arbitrary string that must be matched literally instead of as a regular expression pattern.
I used this solution to solve that problem:
String replacement = str.replaceAll("[\n\r]", "");
Normally \n works fine. Otherwise you can opt for multiple replaceAll statements: first apply one replaceAll to the text, and then apply replaceAll again to the result. That should do what you are looking for.
I believe replaceAll() is an expensive operation. The below solution will probably perform better:
String temp = "Hi \n Wssup??";
System.out.println(temp);
StringBuilder result = new StringBuilder();
StringTokenizer t = new StringTokenizer(temp, "\n");
while (t.hasMoreTokens()) {
result.append(t.nextToken().trim()).append("");
}
String result_of_temp = result.toString();
System.out.println(result_of_temp);
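On the efficiency part of the question: String.replaceAll compiles its regex on every call, so one option is to compile a Pattern once and reuse it, folding several unwanted sequences into a single alternation so the text is scanned only once. This is a sketch under assumptions, not a benchmarked recommendation: the class name TextCleaner and the choice of what counts as noise are made up for illustration.

```java
import java.util.regex.Pattern;

public class TextCleaner {
    // Compiled once and reused. Matches a literal backslash followed by 'n',
    // or a real carriage return, line feed, or tab.
    private static final Pattern NOISE = Pattern.compile("\\\\n|[\\r\\n\\t]");

    // Removes every match of NOISE in a single pass over the input.
    static String clean(String raw) {
        return NOISE.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "Hi!\\n How\tare\r\nyou"; // hypothetical input
        System.out.println(clean(raw));
    }
}
```

Whether this is measurably faster than chained replaceAll calls depends on the input size and the patterns, so it is worth measuring on your own data.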
