Text cleaning and replacement: delete \n from a text in Java

I'm cleaning an incoming text in my Java code. The text contains many occurrences of "\n": not actual line breaks, but the literal two characters backslash and n. I have been using replaceAll() from the String class, but haven't been able to delete them.
This doesn't seem to work:
String string;
string = string.replaceAll("\\n", "");
Neither does this:
String string;
string = string.replaceAll("\n", "");
I guess this last one is identified as an actual new line, so all the new lines from the text would be removed.
Also, what would be an effective way to remove different patterns of wrong text from a String? I'm using regular expressions to detect them (HTML reserved characters and so on) together with replaceAll, but every time I use replaceAll, the whole String is read, right?
UPDATE: Thanks for your great answers. I've extended this question here:
Text replacement efficiency
I'm asking specifically about efficiency :D

Hooknc is right. I'd just like to post a little explanation:
"\\n" translates to "\n" after the compiler is done (since you escape the backslash). So the regex engine sees "\n", interprets it as a newline, and removes real newlines (not the literal "\n" you have).
"\n" is translated to a real newline by the compiler, so the newline character itself is sent to the regex engine.
"\\\\n" is ugly, but right. The compiler processes the escape sequences, so the regex engine sees "\\n". The regex engine sees the two backslashes, knows that the first one escapes the second, and so checks for the literal characters '\' and 'n', giving you the desired result.
Java is nice (it's the language I work in) but having to think to basically double-escape regexes can be a real challenge. For extra fun, it seems StackOverflow likes to try to translate backslashes too.
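To make the two escaping levels concrete, here is a minimal sketch (the class and method names are made up for illustration) that runs both patterns against an input containing a literal backslash-n and a real line feed:

```java
final class EscapeLevels {
    // Removes the two-character sequence backslash + 'n'; real line breaks are left alone.
    static String stripLiteralBackslashN(String text) {
        // Java source "\\\\n" -> regex \\n -> matches a literal backslash followed by 'n'
        return text.replaceAll("\\\\n", "");
    }

    // Removes real line feeds; the literal two-character sequence is left alone.
    static String stripRealNewlines(String text) {
        // Java source "\\n" -> regex \n -> matches the line-feed character
        return text.replaceAll("\\n", "");
    }

    public static void main(String[] args) {
        String input = "a\\nb\nc"; // a, backslash, n, b, line feed, c
        System.out.println(stripLiteralBackslashN(input).equals("ab\nc")); // true
        System.out.println(stripRealNewlines(input).equals("a\\nbc"));     // true
    }
}
```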

I think you need to add a couple more slashies...
String string;
string = string.replaceAll("\\\\n", "");
Explanation:
The number of slashies has to do with the fact that "\n" by itself is a control character in Java.
So to get the real characters of "\n" somewhere we need to use "\\n". Which, if printed out, will give us: \n
You're looking to replace all "\n" in your file. But you're not looking to replace the control character "\n". So you tried "\\n", which will be converted into the characters \n. Great, but maybe not so much. My guess is that the replaceAll method will then create a regular expression from the characters \n, which will be read as the control character "\n".
Whew, almost done.
Using replaceAll("\\\\n", "") will first convert "\\\\n" -> "\\n", which is what the regular expression receives. The regular expression then reads \\n as a backslash followed by 'n', which is your text of "\n". Which is what you're looking to replace.

Instead of String.replaceAll(), which uses regular expressions, you might be better off using String.replace(), which does simple string substitution (if you are using at least Java 1.5).
String replacement = string.replace("\\n", "");
should do what you want.

string = string.replaceAll(""+(char)10, " ");

Try this. Hope it helps.
raw = raw.replaceAll("\t", "");
raw = raw.replaceAll("\n", "");
raw = raw.replaceAll("\r", "");

The other answers have sufficiently covered how to do this with replaceAll, and how you need to escape backslashes as necessary.
Since Java 1.5, there is also String.replace(CharSequence, CharSequence), which performs literal string replacement. This can greatly simplify many string replacement problems, because there is no need to escape any regular expression metacharacters like ., *, |, and yes, \ itself.
Thus, given a string that can contain the substring "\n" (not '\n'), we can delete them as follows:
String before = "Hi!\\n How are you?\\n I'm \n good!";
System.out.println(before);
// Hi!\n How are you?\n I'm
// good!
String after = before.replace("\\n", "");
System.out.println(after);
// Hi! How are you? I'm
// good!
Note that if you insist on using replaceAll, you can prevent the ugliness by using Pattern.quote:
System.out.println(
before.replaceAll(Pattern.quote("\\n"), "")
);
// Hi! How are you? I'm
// good!
You should also use Pattern.quote when you're given an arbitrary string that must be matched literally instead of as a regular expression pattern.
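As a rough sketch of that last point, the helper below (a hypothetical name, not part of the JDK) treats both the target and the replacement as plain text. It pairs Pattern.quote with Matcher.quoteReplacement, since the replacement string has its own special characters (\ and $):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplace {
    // Replaces every occurrence of target with replacement, treating both as plain text.
    static String replaceLiterally(String text, String target, String replacement) {
        return text.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // The dot is matched literally and "$1" is inserted literally.
        System.out.println(replaceLiterally("a.b|c", ".", "$1")); // a$1b|c
    }
}
```

For a fixed target, text.replace(target, replacement) gives the same result with less ceremony.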

I used this solution to solve that problem:
String replacement = str.replaceAll("[\n\r]", "");

Normally \n works fine. Otherwise you can opt for multiple replaceAll statements:
first apply one replaceAll to the text, and then apply another replaceAll to the result. That should do what you are looking for.

I believe replaceAll() is an expensive operation. The below solution will probably perform better:
String temp = "Hi \n Wssup??";
System.out.println(temp);
StringBuilder result = new StringBuilder();
StringTokenizer t = new StringTokenizer(temp, "\n");
while (t.hasMoreTokens()) {
result.append(t.nextToken().trim());
}
String result_of_temp = result.toString();
System.out.println(result_of_temp);
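Another option, if several unwanted sequences have to go, is to compile one pattern up front and remove them all in a single pass. This is only a sketch: the class name and the particular set of sequences are assumptions, and whether it is actually faster than repeated replaceAll calls would need measuring.

```java
import java.util.regex.Pattern;

final class TextCleaner {
    // Compiled once and reused. Matches a literal backslash followed by n, r or t,
    // or a real tab, line feed or carriage return.
    private static final Pattern JUNK = Pattern.compile("\\\\[nrt]|[\\t\\n\\r]");

    static String clean(String text) {
        return JUNK.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("Hi!\\n How\tare\r\nyou\\t?")); // Hi! Howareyou?
    }
}
```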

Related

how to convert one line containing several sentences into lines according to dot(.) [duplicate]

I am wondering if I am going about splitting a string on a . the right way? My code is:
String[] fn = filename.split(".");
return fn[0];
I only need the first part of the string, that's why I return the first item. I ask because I noticed in the API that . means any character, so now I'm stuck.

split() accepts a regular expression, so you need to escape . to not consider it as a regex meta character. Here's an example :
String[] fn = filename.split("\\.");
return fn[0];

I see only solutions here but no full explanation of the problem so I decided to post this answer
Problem
You need to know a few things about text.split(delim). The split method:
accepts as its argument a regular expression (regex) that describes the delimiter on which we want to split,
if delim exists at the end of text, as in a,b,c,, (where the delimiter is ,), split will at first create an array like ["a", "b", "c", "", ""], but since in most cases we don't really need these trailing empty strings, it also removes them automatically for us. So it creates another array without these trailing empty strings and returns it.
You also need to know that the dot . is a special character in regex. It represents any character (except line terminators, but this can be changed with the Pattern.DOTALL flag).
So for a string like "abc", if we split on ".", the split method will
create an array like ["", "", "", ""],
but since this array contains only empty strings and they are all trailing, they will be removed (as described in the second point above),
which means we get an empty array [] as the result (with no elements, not even an empty string), so we can't use fn[0] because there is no index 0.
Solution
To solve this problem you simply need to create a regex that represents a literal dot. To do so we need to escape that dot. There are a few ways to do it, but the simplest is probably to use \ (which in a String needs to be written as "\\" because \ is also special there and requires another \ to escape it).
So the solution to your problem may look like
String[] fn = filename.split("\\.");
Bonus
You can also use other ways to escape that dot like
using character class split("[.]")
wrapping it in quote split("\\Q.\\E")
using proper Pattern instance with Pattern.LITERAL flag
or simply use split(Pattern.quote(".")) and let regex do escaping for you.
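The variants listed above can be compared side by side in a small sketch (the class name and sample file name are made up for illustration):

```java
import java.util.regex.Pattern;

final class SplitOnDot {
    // Pattern.LITERAL treats the pattern text as plain characters, so "." is just a dot.
    private static final Pattern DOT = Pattern.compile(".", Pattern.LITERAL);

    static String[] parts(String s) {
        return DOT.split(s);
    }

    public static void main(String[] args) {
        String filename = "report.txt";
        // Unescaped dot: every character is a delimiter, all pieces are empty and trailing.
        System.out.println(filename.split(".").length);            // 0
        // Equivalent ways of splitting on a literal dot.
        System.out.println(filename.split("\\.")[0]);              // report
        System.out.println(filename.split("[.]")[0]);              // report
        System.out.println(filename.split("\\Q.\\E")[0]);          // report
        System.out.println(filename.split(Pattern.quote("."))[0]); // report
        System.out.println(parts(filename)[0]);                    // report
    }
}
```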

Split uses regular expressions, where '.' is a special character meaning anything. You need to escape it if you actually want it to match the '.' character:
String[] fn = filename.split("\\.");
(one '\' to escape the '.' in the regular expression, and the other to escape the first one in the Java string)
Also, I wouldn't suggest returning fn[0]: if you have a file named something.blabla.txt, which is a valid name, you won't be returning the actual file name. Instead I think it's better to use:
int idx = filename.lastIndexOf('.');
return idx >= 0 ? filename.substring(0, idx) : filename; // no dot: keep the name as-is
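A slightly more defensive sketch of the lastIndexOf approach, wrapped in a hypothetical helper. Treating a leading dot (as in ".bashrc") as part of the name rather than as an extension separator is an assumption here, not something the question specifies:

```java
final class BaseName {
    // Returns the name without its last extension; names without a dot are returned unchanged.
    static String stripExtension(String filename) {
        int idx = filename.lastIndexOf('.');
        // idx == -1: no dot at all; idx == 0: a dot-file such as ".bashrc" (kept whole here)
        return idx > 0 ? filename.substring(0, idx) : filename;
    }

    public static void main(String[] args) {
        System.out.println(stripExtension("something.blabla.txt")); // something.blabla
        System.out.println(stripExtension("README"));               // README
    }
}
```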

The String#split(String) method uses regular expressions.
In regular expressions, the "." character means "any character".
You can avoid this behavior by either escaping the "."
filename.split("\\.");
or telling the split method to split at a character class:
filename.split("[.]");
Character classes are collections of characters. You could write
filename.split("[-.;ld7]");
and filename would be split at every "-", ".", ";", "l", "d" or "7". Inside character classes, the "." is not a special character ("metacharacter").
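A small sketch of the character-class approach, using a made-up helper that splits on "-", "." or ";" (the hyphen is placed first so that it is taken literally rather than as a range):

```java
final class CharClassSplit {
    // Splits at every '-', '.' or ';'. Inside [...] the dot needs no escaping.
    static String[] splitOnPunctuation(String s) {
        return s.split("[-.;]");
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitOnPunctuation("a-b.c;d"))); // a|b|c|d
    }
}
```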

Because the dot (.) is a special character and the split method of String expects a regular expression, you need to do it like this:
String[] fn = filename.split("\\.");
return fn[0];
In a regex, special characters need to be escaped with a "\", but since "\" is also a special character in Java string literals, you need to escape it again with another "\".

String str="1.2.3";
String[] cats = str.split(Pattern.quote("."));

Wouldn't it be more efficient to use
filename.substring(0, filename.indexOf("."))
if you only want what's up to the first dot?

Usually it's NOT a good idea to escape it by hand. There is a method in the java.util.regex.Pattern class for this task:
static String quote(String s)

split takes a regex as its argument. Simply change "." to "\\."

The solution that worked for me is the following
String[] fn = filename.split("[.]");

Note: Further care should be taken with this snippet, even after the dot is escaped!
If filename is just the string ".", then fn will still end up with length 0 and fn[0] will still throw an exception!
This is because, if the pattern matches at least once, split discards all trailing empty strings (and therefore also the one before the dot!) from the array, leaving an empty array to be returned.
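One way around this, sketched below with a hypothetical helper, is the two-argument split(regex, limit): a negative limit keeps trailing empty strings, so the result always has at least one element.

```java
final class SafeFirstPart {
    // Returns the text before the first dot; does not throw for inputs such as "." or "".
    static String firstPart(String filename) {
        // A negative limit disables the removal of trailing empty strings.
        return filename.split("\\.", -1)[0];
    }

    public static void main(String[] args) {
        System.out.println(".".split("\\.").length);     // 0
        System.out.println(".".split("\\.", -1).length); // 2
        System.out.println(firstPart("a.b"));            // a
    }
}
```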

Using Apache Commons IO it's simplest:
File file = ...
FilenameUtils.getBaseName(file.getName());
Note, it also extracts a filename from full path.

split takes a regex as its argument. So you should pass "\\." instead of "." because "." is a metacharacter in regex.

Remove everything from a string up to a certain character and optionally a string if it follows too

I am looking to write a regex that can remove any characters up to the first &emsp and, if there is a (new section) following &emsp, remove that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*( (\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));

Have you tried String.split? This creates an array of strings from a string, based on a delimiter.
Once you have split the string, just select the elements of the array that you need for the print statement.

Your regex is very long and I do not want to debug it. However, the tip is that some characters have special meaning in regular expressions. For example, . means "any character", parentheses define groups, square brackets define character classes, etc. Such characters must be escaped if you want them to be interpreted as plain characters and not as regex commands. To escape a special character you write \ in front of it. But \ is the escape character in Java string literals too, so it has to be doubled.
For example, to replace every literal dot with the letter A you should write str.replaceAll("\\.", "A"). (An ampersand, by contrast, is not special outside a character class, so str.replaceAll("&", "A") needs no escaping.)
Now you have all information you need. Try to start from simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML using regular expressions is possible but strongly discouraged. Use a dedicated parser for such formats.

Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\ (\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\ (\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It will remove everything up to " " and optionally, the following "(new section)" text if any.
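The same idea as a self-contained sketch. It assumes the separator appears in the data as the literal text "&emsp;"; if the data actually contains the em-space character itself, the marker in the pattern would be "\u2003" instead. The class name and sample strings are made up for illustration.

```java
final class StripPrefix {
    // Assumption: the marker is the literal text "&emsp;".
    // (?s) lets . cross line breaks; .*? is lazy, so it stops at the FIRST marker;
    // the trailing group removes "(new section)" only when it directly follows.
    private static final String PREFIX_REGEX = "(?s)^.*?&emsp;(?:\\(new section\\))?";

    static String strip(String text) {
        return text.replaceFirst(PREFIX_REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("Sec. 431&emsp;(new section)Chemotherapy services.")); // Chemotherapy services.
        System.out.println(strip("intro&emsp;Cancer treatment."));                      // Cancer treatment.
    }
}
```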

How can I write a regex in Java that will perform a .replaceFirst on a group that is not in a comment?

So I need to return modified String where it replaces the first instance of a token with another token while skipping comments. Here's an example of what I'm talking about:
This whole quote is one big String
-- I don't want to replace this ##
But I want to replace this ##!
Being a former .NET developer, I thought this was easy. I'd just do a negative lookbehind like this:
(?<!--.*)##
But then I learned Java can't do this. So upon learning that the curly braces are okay, I tried this:
(?<!--.{0,9001})##
That didn't throw an exception, but it did match the ## in the comment.
When I test this regex with a Java regex tester, it works as expected. About the only thing I can think of is that I'm using Java 1.5. Is it possible that Java 1.5 has a bug in its regex engine? Assuming it does, how do I get Java 1.5 to do what I want it to do without breaking up my string and reassembling it?
EDIT I changed the # to the -- operator since it looks like the regex will be more complex with two chars instead of one. I originally did not reveal that I was modifying a query in order to avoid off topic discussion on "Well you shouldn't modify queries that way!" I have a very good reason for doing this. Please don't discuss query modification good practices. Thanks

You really don't need a negative look-behind here. You can do it without that too.
It would be like this:
String str = "I don't want to replace this ##";
str = str.replaceAll("^([^#].*?)##", "$1");
So, it replaces the first occurrence of ## in a string that does not start with # with the part of the string before ##. So, ## is removed. Here replaceAll works because the pattern is anchored with ^ and uses a reluctant quantifier, .*?, so it stops at the first ##.
As correctly pointed out by @nhahtdh in the comments, this might fail if your comment is at the end of the line. So you can use this one instead:
String str = "I don't want to # replace this ##";
str = str.replaceAll("^([^#]*?)##", "$1");
This one works in either case. In the given example it won't replace the ##, as it is part of the comment.
If your comment start is denoted by two characters, then a negated character class won't work. You would need to use a negative look-ahead like this:
String str = "This whole quote ## is one big String -- asdf ##\n" +
"-- I don't want to replace this ##\n" +
"But I want to replace this ##!";
str = str.replaceAll("(?m)^(((?!--).)*?)##", "$1");
System.out.println(str);
Output:
This whole quote is one big String -- asdf ##
-- I don't want to replace this ##
But I want to replace this !
(?m) at the beginning of the pattern enables MULTILINE mode, so ^ matches at the start of each line rather than only at the start of the entire input.
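Since the question asks for replaceFirst, here is a sketch of the same look-ahead pattern used that way. The class name, method name and the "@@" replacement token are placeholders, and "--" is assumed to be the only comment syntax that matters:

```java
import java.util.regex.Matcher;

final class FirstOutsideComment {
    // (?m) makes ^ match at every line start. (?:(?!--).)*? consumes characters lazily
    // but refuses to step onto the start of a "--" comment, so a ## inside or after
    // a comment on the same line is never reached.
    private static final String REGEX = "(?m)^((?:(?!--).)*?)##";

    static String replaceFirstToken(String text, String newToken) {
        // $1 puts back the text before the token; quoteReplacement keeps newToken literal.
        return text.replaceFirst(REGEX, "$1" + Matcher.quoteReplacement(newToken));
    }

    public static void main(String[] args) {
        String text = "-- skip ##\nkeep ## -- not ##\nlater ##";
        // Only the ## after "keep " changes; the comment and later lines are untouched.
        System.out.println(replaceFirstToken(text, "@@"));
    }
}
```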

You can use something like this:
String string = "This whole quote is one big String\n" +
"# I don't want to replace this ##\n" +
"And I also # don't want to replace this ##\n" +
"But I want to replace this ##!\n" +
"But not this ##!";
Matcher m =
Pattern.compile (
"^((?:[^##]|#[^#]|#[^\n]*)*)##", Pattern.MULTILINE).
matcher (string);
StringBuffer result = new StringBuffer ();
if (m.find ())
m.appendReplacement (result, "$1FOO");
m.appendTail (result);
System.out.println (result.toString ());

Java replace " with \"

I am trying to replace " with \" in a string. Below is the program I tried:
String s="\"/test /string\"";
s = s.replaceAll("\"", "\\\"");
System.out.println(s);
But I get the same output as the input string: "/test /string". Why is my replace call not working? If I do
s = s.replaceAll("\"", "\\\\\"");
then I get the output I want: \"/test /string\". Why doesn't the former work, even though in the code I am trying to replace " with \"?

You're using String.replaceAll, which treats both of its arguments specially: the first is a regular expression, and the replacement has its own escape syntax. As documented in Matcher.replaceAll():
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string.
You're really just trying to do a straight replace with no regexes involved, so use String.replace instead:
s = s.replace("\"", "\\\"");
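If replaceAll has to stay for some reason, Matcher.quoteReplacement escapes the replacement for you. A small sketch comparing the two (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;

final class QuoteEscaper {
    // Literal replacement: no regex and no replacement-string escaping involved.
    static String withReplace(String s) {
        return s.replace("\"", "\\\"");
    }

    // Regex-based, with the replacement escaped so that its backslash survives.
    static String withReplaceAll(String s) {
        return s.replaceAll("\"", Matcher.quoteReplacement("\\\""));
    }

    public static void main(String[] args) {
        String s = "\"/test /string\"";
        System.out.println(withReplace(s));    // \"/test /string\"
        System.out.println(withReplaceAll(s)); // \"/test /string\"
    }
}
```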

Regular expressions: all words after my current one are gone

I need to remove all strings from my text file, such as:
flickr:user=32jdisffs
flickr:user=acssd
flickr:user=asddsa89
I'm currently using fields[i] = fields[i].replaceAll(" , flickr:user=.*", "");
However, the issue with this approach is that everything after flickr:user= is removed from the content, even the words after the next space.
thanks

You probably need
replaceAll("flickr:user=[0-9A-Za-z]+", "");

flickr:user=\w+ should do it:
String noFlickerIdsHere = stringWithIds.replaceAll("flickr:user=\\w+", "");
Reference:
\w = A word character: [a-zA-Z_0-9]

Going by the question as stated, chances are that you want:
fields[i] = fields[i].replaceAll(" , flickr:user=[^ ]* ", ""); // or " "
This will match the string, including the value of user up to but not including the first space, followed by a space, and replace it either by a blank string, or a single space. However this will (barring the comma) net you an empty result with the input you showed. Is that really what you want?
I'm also not sure where the " , " at the beginning fits into the example you showed.
The reason for your difficulties is that an unbounded .* will match everything from that point up until the end of the input (even if that amounts to nothing; that's what the * is for). For a line-based regular expression parser, that's to the end of the line.
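The difference between the greedy .* and a bounded match can be seen in a short sketch. It assumes the user IDs consist only of word characters, as in the samples from the question; the class name and sample text are made up.

```java
final class FlickrTagRemover {
    // \w+ stops at the first non-word character, so text after the ID survives.
    // The leading \s* also swallows the whitespace in front of the tag.
    static String removeUserTags(String text) {
        return text.replaceAll("\\s*flickr:user=\\w+", "");
    }

    public static void main(String[] args) {
        String text = "sunset flickr:user=acssd beach";
        System.out.println(text.replaceAll("flickr:user=.*", "")); // prints "sunset " (rest of the line is gone)
        System.out.println(removeUserTags(text));                  // sunset beach
    }
}
```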
