How can I replace non-printable Unicode characters in Java?


The following will replace ASCII control characters (shorthand for [\x00-\x1F\x7F]):
my_string.replaceAll("\\p{Cntrl}", "?");
The following will replace all ASCII non-printable characters (shorthand for [\p{Graph}\x20]), including accented characters:
my_string.replaceAll("[^\\p{Print}]", "?");
However, neither works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?

my_string.replaceAll("\\p{C}", "?");
See more about Unicode regex. java.util.regex.Pattern and String.replaceAll support them.
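To make the difference from \p{Cntrl} concrete, here is a minimal sketch. The class and method names are made up for illustration, and the sample string is my own; it relies only on the default behaviour of java.util.regex, where \p{Cntrl} is the ASCII-only POSIX class and \p{C} is the Unicode general category "Other".

```java
final class ControlCharDemo {
    // Replaces every code point in the Unicode general category "Other" with '?'.
    static String maskOther(String s) {
        return s.replaceAll("\\p{C}", "?");
    }

    // For comparison: the POSIX class Cntrl covers only ASCII control characters by default.
    static String maskAsciiControls(String s) {
        return s.replaceAll("\\p{Cntrl}", "?");
    }

    public static void main(String[] args) {
        // Sample input: a zero-width space (category Cf) and the BEL control character (category Cc).
        String sample = "a\u200Bb\u0007c";
        System.out.println(maskOther(sample));
        System.out.println(maskAsciiControls(sample));
    }
}
```

With this input, maskOther replaces both invisible characters, while maskAsciiControls leaves the zero-width space in place.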

Op De Cirkel is mostly right. His suggestion will work in most cases:
myString.replaceAll("\\p{C}", "?");
But if myString might contain non-BMP codepoints then it's more complicated. \p{C} contains the surrogate codepoints of \p{Cs}. The replacement method above will corrupt non-BMP codepoints by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.
Using the other constituent categories is an option:
myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");
However, solitary surrogate characters not part of a pair (each surrogate character has an assigned codepoint) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:
StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
int codePoint = myString.codePointAt(offset);
offset += Character.charCount(codePoint);
// Replace invisible control characters and unused code points
switch (Character.getType(codePoint))
{
case Character.CONTROL: // \p{Cc}
case Character.FORMAT: // \p{Cf}
case Character.PRIVATE_USE: // \p{Co}
case Character.SURROGATE: // \p{Cs}
case Character.UNASSIGNED: // \p{Cn}
newString.append('?');
break;
default:
newString.append(Character.toChars(codePoint));
break;
}
}
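For reuse, the same idea can be wrapped in a method. This is only a sketch: the names OtherCategoryCleaner, isOther and replaceOther are hypothetical, and it assumes Java 8 or later for String.codePoints(), which passes unpaired surrogates through as individual code points.

```java
final class OtherCategoryCleaner {
    // True for the five general categories that together make up \p{C}.
    static boolean isOther(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:      // Cc
            case Character.FORMAT:       // Cf
            case Character.PRIVATE_USE:  // Co
            case Character.SURROGATE:    // Cs (only unpaired surrogates reach this point)
            case Character.UNASSIGNED:   // Cn
                return true;
            default:
                return false;
        }
    }

    // Walks the string code point by code point, so surrogate pairs are never split.
    static String replaceOther(String s, char replacement) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (isOther(cp)) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

A valid non-BMP character such as an emoji is kept intact, while a lone surrogate or a NUL character becomes the replacement character.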

The methods below may help with your goal:
public static String removeNonAscii(String str)
{
return str.replaceAll("[^\\x00-\\x7F]", "");
}
public static String removeNonPrintable(String str) // All Control Char
{
return str.replaceAll("[\\p{C}]", "");
}
public static String removeSomeControlChar(String str) // Some Control Char
{
return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}
public static String removeFullControlChar(String str)
{
return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}

You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).
In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.

I have used this simple function for this:
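// requires: import java.util.regex.Pattern;
// Matches any character outside the printable ASCII range (space through tilde)
private static final Pattern pattern = Pattern.compile("[^ -~]");
private static String cleanTheText(String text) {
    // replaceAll removes every match, not just occurrences of the first one found
    return pattern.matcher(text).replaceAll("");
}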
Hope this is useful.

Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning: 1. trimming leading or trailing whitespaces, 2. dos2unix, 3. mac2unix, 4. removing all "invisible Unicode characters" except whitespaces:
myString.trim.replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")
Tested with Scala REPL.
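For plain Java, the same chain might look like the sketch below (note the parentheses on trim()). The class and method names are hypothetical; the regular expressions are the ones from the line above.

```java
final class StringCleaner {
    static String clean(String s) {
        return s.trim()                     // 1. drop leading and trailing whitespace
                .replaceAll("\r\n", "\n")   // 2. dos2unix
                .replaceAll("\r", "\n")     // 3. mac2unix
                // 4. remove Cc, Cf, Co and Cn code points, except those that are whitespace
                .replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "");
    }
}
```

Inside a Java character class, && is an intersection, so tabs and newlines survive step 4 even though they belong to \p{Cc}.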

I suggest handling characters outside the BMP (supplementary code points) as below; each one is replaced with "?":
private String removeNonBMPCharacters(final String input) {
StringBuilder strBuilder = new StringBuilder();
input.codePoints().forEach((i) -> {
if (Character.isSupplementaryCodePoint(i)) {
strBuilder.append("?");
} else {
strBuilder.append(Character.toChars(i));
}
});
return strBuilder.toString();
}

With multi-language support:
public static String cleanUnprintableChars(String text, boolean multilanguage)
{
String regex = multilanguage ? "[^\\x00-\\xFF]" : "[^\\x00-\\x7F]";
// strips off all non-ASCII characters
text = text.replaceAll(regex, "");
// erases all the ASCII control characters
text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
// removes non-printable characters from Unicode
text = text.replaceAll("\\p{C}", "");
return text.trim();
}

I have redesigned the code for phone numbers +9 (987) 124124
Extract digits from a string in Java
public static String stripNonDigitsV2( CharSequence input ) {
if (input == null)
return null;
if ( input.length() == 0 )
return "";
char[] result = new char[input.length()];
int cursor = 0;
CharBuffer buffer = CharBuffer.wrap( input );
int i=0;
while ( i< buffer.length() ) { //buffer.hasRemaining()
char chr = buffer.get(i);
if (chr=='u'){
i=i+5;
chr=buffer.get(i);
}
if ( chr > 39 && chr < 58 )
result[cursor++] = chr;
i=i+1;
}
return new String( result, 0, cursor );
}

Related

How to check if String contains Latin letters without regex

I want to check whether a String contains only Latin letters, but it may also contain numbers and other symbols such as _/+), etc.
The String utm_source=google should pass, and utm_source=google&2019_and_2020! should pass too. But utm_ресурс=google should not pass (because of the Cyrillic letters). I know how to do it with a regex, but how can I do it without a regex or a classic for loop, maybe with Streams and the Character class?

Use this code
public static boolean isValidUsAscii (String s) {
return Charset.forName("US-ASCII").newEncoder().canEncode(s);
}

For restricted "latin" (no é etcetera), it must be either US-ASCII (7 bits), or ISO-8859-1 but without accented letters.
boolean isBasicLatin(String s) {
return s.codePoints().allMatch(cp -> cp < 128 || (cp < 256 && !Character.isLetter(cp)));
}
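If accented Latin letters such as é should be accepted as well, a different technique is to look at the script of each letter instead of its numeric range. This is a sketch using Character.UnicodeScript (available since Java 7); the class and method names are made up, and it is not the approach of the answers above.

```java
final class LatinCheck {
    // Non-letters (digits, punctuation, symbols) are ignored;
    // every letter must belong to the Latin script.
    static boolean hasOnlyLatinLetters(String s) {
        return s.codePoints()
                .filter(Character::isLetter)
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN);
    }
}
```

Both sample strings from the question pass this check, and the one with Cyrillic letters does not.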

Less of a neat single line approach but really all you need to do is check whether the numeric value of the character is within certain limits like so:
public boolean isQwerty(String text) {
int length = text.length();
for(int i = 0; i < length; i++) {
char character = text.charAt(i);
int ascii = character;
if(ascii<32||ascii>126) {
return false;
}
}
return true;
}
Test Run
ä returns false
abc returns true

Split the String by \ which contains following string "abc\u12345. "

Before posting I tried split("\u"), \\u and \u; none of them works, because \u is interpreted as the start of a Unicode character escape, which is not what it is in this case.

As already mentioned, \u1234 in "abc\u12345" is a Unicode escape and is therefore handled as a single character (followed by the digit 5).
If you have these in your string, it's already too late. If you get this from a file or over the network, you could read your input and escape each \ or \u you encounter before storing it in your string variable and working on it.
If you elaborate on the context of your task a little more, perhaps we could find other solutions for you.

Java understands it as a Unicode character, so the right thing to do is to update the source to read it properly and avoid passing Unicode to Java if it is not needed. One workaround is to convert the entire string into a character array and check whether each character is 128 or greater; once one is found, the rest of the array is appended to a separate StringBuilder. See if the code below helps:
public static void tryMee(String input)
{
StringBuilder b1 = new StringBuilder();
StringBuilder b2 = new StringBuilder();
boolean isUni = false;
for (char c : input.toCharArray())
{
if (c >= 128)
{
b2.append("\\u").append(String.format("%04X", (int) c));
isUni = true;
}
else if(isUni) b2.append(c);
else b1.append(c);
}
System.out.println("B1: "+b1);
System.out.println("B2: "+b2);
}

Try this. You did not escape properly
split("\\\\u")
or
split(Pattern.quote("\\u"))
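A small sketch of the second variant follows. It only applies when the string really contains a backslash followed by u at run time (for example, text read from a file), not when the compiler has already turned the escape into a single character. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class SplitOnBackslashU {
    static String[] splitEscaped(String s) {
        // Pattern.quote turns its argument into a literal pattern,
        // so the backslash needs no extra regex escaping.
        return s.split(Pattern.quote("\\u"));
    }
}
```

Called with the Java literal "abc\\u12345" (a real backslash in the data), this yields the two parts abc and 12345.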

import java.util.Arrays;
public class Example {
public static void main (String[]args){
String str = "abc\u12345";
// first replace \\u with something else, for example with -u
char [] chars = str.toCharArray();
StringBuilder sb = new StringBuilder();
for(char c: chars){
if(c >= 128){
sb.append("-u").append(Integer.toHexString(c | 0x10000).substring(1) );
}else{
sb.append(c);
}
}
String replaced = sb.toString();
// now you can split by -u
String [] splited = sb.toString().split("-u");
System.out.println(replaced);
System.out.println(Arrays.toString(splited));
}
}

Java String index out of range error

I'm running into some issues with some java code that I do not know how to fix. I was wondering if I could get some help with figuring out why I keep getting
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
Here's the code snippet where the problem is popping up (it's part of a larger package for an assignment):
public class MyMapper extends Mapper {
@Override
//method takes docName and data as string
public void map(String documentID, String document) {
//this string array hold all the delimiters for our split
//String[] separators = {",", ".", "!", "?", ";", ":", "-", "' "," "};
//splits the string 'document' according to delimiters
String[] words = document.split(",|\\.|\\!|\\?|\\;|\\:|\\-|\\' |\\ |\\'.");
// for each word in String[] words, check that each word is legitimate
for (String word : words) {
if (isAlpha(word)){
//System.out.println(word);
emit(word.substring(0, 1).toUpperCase() , "1");
}
else;
}
}
// private helper method to check that each word is legitimate (alphas-only)
private boolean isAlpha(String name) {
char[] chars = name.toCharArray();
for (char c : chars) {
if(!Character.isLetter(c)) {
return false;
}
}
return true;
}
}
What I am trying to do is take in a document (stored in string form through bufferedReader) and seize the first letter of each word in the doc, and capitalize them.
***** Updated Code*****
I decided to go with the suggested check for the empty "word" in my private helper method. Everything works now.
Here is the updated code for documentation purposes:
// private helper method to check that each word is legitimate (alphas-only)
private boolean isAlpha(String name) {
if (name.equals(""))
return false;
char[] chars = name.toCharArray();
for (char c : chars) {
if(!Character.isLetter(c)) {
return false;
}
}
return true;
}

Looks like sometimes your word is empty. Make a check first to see that you've got something to work with:
if (isAlpha(word)){
if(!word.isEmpty()){ //you could also use 'if(word.length() != 0)'
emit(word.substring(0, 1).toUpperCase() , "1");
}
}
Alternatively, make that check in your isAlpha() method.

If your word is empty just return a false from your isAlpha() like this
private boolean isAlpha(String name) {
if (name.equals(""))
return false;
char[] chars = name.toCharArray();
for (char c : chars) {
if(!Character.isLetter(c)) {
return false;
}
}
return true;
}
}

For some strings, your split regex can produce empty strings, for example in the not-at-all unusual case that a comma is followed by a space, e.g., the string document = "Some words, with comma."; will be split into [Some, words, , with, comma].
Instead of enumerating all the non-word characters that you can think of, I suggest using the \W character class (non-alphanumeric character) and also allowing multiple of those, i.e. words = document.split("\\W+");. This gives you [Some, words, with, comma].
If you need more control about the characters to split by and don't want to use a character class, you can still put the characters into [...]+ to shorten the regex and to split by groups of those, too, using words = document.split("[|.!?,;:' -]+"). (Inside [...], you do not need to escape all of those, as long as the - is last, so it's unambiguous.)
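Here is a rough sketch of the \W+ split combined with an empty-string guard. The class and method names are made up, and it collects the capitalized first letters into a list instead of calling emit(), which belongs to the asker's framework; the isAlpha() filter is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

final class FirstLetters {
    static List<String> firstLetters(String document) {
        List<String> result = new ArrayList<>();
        for (String word : document.split("\\W+")) {
            // A leading separator still produces one empty string, so guard against it.
            if (!word.isEmpty()) {
                result.add(word.substring(0, 1).toUpperCase());
            }
        }
        return result;
    }
}
```

The guard matters because split only drops trailing empty strings; input that starts with a separator still yields an empty first element.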

Would something like this do?
String text = "es saß ein wiesel, auf einem kiesel.";
String[] parts = text.split("\\s+");
StringBuilder resultingString = new StringBuilder();
for (String part : parts) {
part = Character.toUpperCase(part.charAt(0))
+ part.substring(1, part.length());
resultingString.append(part + " ");
}
text = resultingString.toString().substring(0,
resultingString.length() - 1);
System.out.println(text);

Reading new line as two characters

I have written a small program
class Test {
public static void main(String[] args) {
String s = "\n";
System.out.println(s.length());
for (int i = 0; i < s.length(); i++) {
System.out.println(s.charAt(i));
}
}
}
The program gives the length as 1 and treats \n as single new line character.
My requirement is to treat \n as normal string so with 2 characters (First character \ and second character n), what can be done to achieve it?
NOTE: 1) We can't change the string to add additional escape character.
2) We don't want to use any additional 3rd Party library

You can use the StringEscapeUtils utility class from commons-lang.
String s = "\n";
s = StringEscapeUtils.escapeJava(s);
System.out.println(s.length());
for (int i = 0; i < s.length(); i++) {
System.out.println(s.charAt(i));
}
Output:
2
\
n
If you absolutely can't use a library like commons-lang, then you can write your own method to do it. You can browse through the code of the above class to see an example of how you can escape the string to account for different special characters.
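A hand-written version could look roughly like this. It is a minimal sketch that covers only a few escapes (newline, carriage return, tab and the backslash itself) and is not meant to reproduce everything escapeJava does; the class and method names are hypothetical.

```java
final class Escaper {
    static String escapeControl(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\n': out.append("\\n");  break; // newline becomes backslash + n
                case '\r': out.append("\\r");  break; // carriage return becomes backslash + r
                case '\t': out.append("\\t");  break; // tab becomes backslash + t
                case '\\': out.append("\\\\"); break; // a backslash is doubled
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

For the string from the question, the result has length 2: a backslash followed by n.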

As far as I know, you can't. The issue is that "\n" is one character. The single backslash is an escape.
char ch = '\n'; // <-- not two characters. it's one.

It's as simple as that:
Once you go past the line
String s = "\n";
s will contain a single new line character, and there's nothing you can do about it.
You can obviously create a new String and replace all new line characters by "\n", but I don't think that's what you wanted.

We can't change the string to add additional escape character.
I guess that is not possible because \n has a special meaning when used in String.
Escape the backslash with double backslash like this \\n
This shall give you length as 2
String str = "\\n";
System.out.println(str.length());
Or try using apache commons-lang's
StringEscapeUtils#escapeJava()

You could search through the string and replace every newline character (\n) with the two characters \ and n (written "\\n" in Java source).
String s = convertNewLineChars("\n");
public String convertNewLineChars(String s)
{
    // replace each newline character with a backslash followed by 'n'
    return s.replace("\n", "\\n");
}
Edit
Use an enum for all your possible special characters
public enum SpecialCharacter
{
NEWLINE('\n', "\\\\n"), //see note at the bottom of the answer for why
RETURN('\r', "\\\\r"); //there are four backslashes.
private char character;
private String charAsString;
private SpecialCharacter(char character, String charAsString)
{
this.character = character;
this.charAsString = charAsString;
}
public char getCharacter()
{
return this.character;
}
public String getCharAsString()
{
return this.charAsString;
}
public static SpecialCharacter[] getAllCharacters()
{
return new SpecialCharacter[] {NEWLINE, RETURN}; //etc...
}
}
Create a static method for removing these characters
public static String removeSpecialCharacters(String s)
{
String returnString = s;
for (SpecialCharacter character : SpecialCharacter.getAllCharacters())
{
returnString = returnString.replaceAll("["+character.getCharacter()+"]", character.getCharAsString());
}
return returnString;
}
Then you can say something like:
String s = removeSpecialCharacters("\nfdafhoean\noasd\r\rjfoi");
System.out.println(s);
This will work for any SpecialCharacter you add to the enum.
*Note that replaceAll() will consume the extra backslash... if you simply call System.out.println(SpecialCharacter.NEWLINE.getCharAsString()); you will receive the output of \\n

Just put another \ character in front of the \n, i.e. write \\n, to turn \n (a new line) into the two characters \ and n.

Removing duplicate same characters in a row

I am trying to create a method which will either remove all duplicates from a string or only keep the same 2 characters in a row based on a parameter.
For example:
helllllllo -> helo
or
helllllllo -> hello - This keeps double letters
Currently I remove duplicates by doing:
private String removeDuplicates(String word) {
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < word.length(); i++) {
char letter = word.charAt(i);
if (buffer.length() == 0 || letter != buffer.charAt(buffer.length() - 1)) {
buffer.append(letter);
}
}
return buffer.toString();
}
If I want to keep double letters I was thinking of having a method like private String removeDuplicates(String word, boolean doubleLetter)
When doubleLetter is true it will return hello not helo
I'm not sure of the most efficient way to do this without duplicating a lot of code.

why not just use a regex?
public class RemoveDuplicates {
public static void main(String[] args) {
System.out.println(new RemoveDuplicates().result("hellllo", false)); //helo
System.out.println(new RemoveDuplicates().result("hellllo", true)); //hello
}
public String result(String input, boolean doubleLetter){
String pattern = null;
if(doubleLetter) pattern = "(.)(?=\\1{2})";
else pattern = "(.)(?=\\1)";
return input.replaceAll(pattern, "");
}
}
(.) --> matches any character and puts in group 1.
?= --> this is called a positive lookahead.
?=\\1 --> positive lookahead for the first group
So overall, this regex looks for any character that is followed (positive lookahead) by itself. For example aa or bb, etc. It is important to note that only the first character is part of the match actually, so in the word 'hello', only the first l is matched (the part (?=\1) is NOT PART of the match). So the first l is replaced by an empty String and we are left with helo, which does not match the regex
The second pattern is the same thing, but this time we look ahead for TWO occurrences of the first group, for example helllo. On the other hand 'hello' will not be matched.
Look here for a lot more: Regex
P.S. Feel free to accept the answer if it helped.

try
String s = "helllllllo";
System.out.println(s.replaceAll("(\\w)\\1+", "$1"));
output
helo

Taking this previous SO example as a starting point, I came up with this:
String str1= "Heelllllllllllooooooooooo";
String removedRepeated = str1.replaceAll("(\\w)\\1+", "$1");
System.out.println(removedRepeated);
String keepDouble = str1.replaceAll("(\\w)\\1{2,}", "$1");
System.out.println(keepDouble);
It yields:
Helo
Heelo
What it does:
(\\w)\\1+ will match any word character (letter, digit, or underscore) and place it in a regex capture group. This group is later accessed through the \\1+, meaning that it will match one or more repetitions of the previous character.
(\\w)\\1{2,} is the same as above, the only difference being that it only matches characters that are followed by two or more repetitions, i.e. that appear three or more times in a row. This leaves the double characters untouched.
EDIT:
Re-read the question and it seems that you want to replace multiple characters by doubles. To do that, simply use this line:
String keepDouble = str1.replaceAll("(\\w)\\1+", "$1$1");
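If you would rather avoid regex, a single loop with a run-length limit covers both cases from the question without duplicating code. This is just a sketch; limitRuns and the class name are hypothetical, and removeDuplicates mirrors the signature proposed in the question.

```java
final class RunLimiter {
    // Keeps at most maxRun consecutive copies of each character.
    static String limitRuns(String word, int maxRun) {
        StringBuilder out = new StringBuilder(word.length());
        int run = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            // Extend the current run if this character repeats the previous one, else start over.
            run = (i > 0 && c == word.charAt(i - 1)) ? run + 1 : 1;
            if (run <= maxRun) {
                out.append(c);
            }
        }
        return out.toString();
    }

    static String removeDuplicates(String word, boolean doubleLetter) {
        return limitRuns(word, doubleLetter ? 2 : 1);
    }
}
```

With the example from the question, helllllllo becomes helo when doubleLetter is false and hello when it is true.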

Try this, this will be most efficient way[Edited after comment]:
public static String removeDuplicates(String str) {
int checker = 0;
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < str.length(); ++i) {
int val = str.charAt(i) - 'a';
if ((checker & (1 << val)) == 0)
buffer.append(str.charAt(i));
checker |= (1 << val);
}
return buffer.toString();
}
I am using bits to identify uniqueness.
EDIT:
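The whole logic is that once a character has been seen, its corresponding bit is set; the next time that character comes up, it is not added to the StringBuffer because the corresponding bit is already set. Note that this removes every later occurrence of a character, not only consecutive ones, and that the 32-bit checker is only reliable for the lowercase letters a to z.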
