Identify valid characters of a character array - java

I am receiving a character array that contains both valid and invalid characters, and I need to retrieve only the valid ones. How can I do this?
char[] ch = str.toCharArray();
for (int i = 0; i < ch.length; i++) {
if (!Character.isLetter(ch[i]))
System.out.println("not valid");
else
System.out.println("valid");
}
I don't think the above code works, because it reports "not valid" for both the valid and the invalid characters in the array.
By valid characters I mean all alphanumeric and special characters.
Note: the values come from a server, which is why the character array contains both valid and invalid characters.

Try the following method:
// assume input is not too large string
public static String extractPrintableChars(final String input) {
String out = input;
if(input != null) {
char[] charArr = input.toCharArray();
StringBuilder sb = new StringBuilder();
for (char ch : charArr) {
if(Character.isDefined(ch) && !Character.isISOControl(ch)) {
sb.append(ch);
}
}
out = sb.toString();
}
return out;
}
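As a side note on why the loop in the question rejects so many characters: Character.isLetter returns false for digits, punctuation and spaces, so only letters pass that test. The sketch below contrasts it with the isDefined/isISOControl check used above. The class and method names are made up for illustration, and it assumes "valid" simply means any defined, non-control character.

```java
class CharCheckSketch {
    // Hypothetical helper: true when ch is a defined, non-control character.
    static boolean isPrintable(char ch) {
        return Character.isDefined(ch) && !Character.isISOControl(ch);
    }

    // Keeps only the printable characters of the incoming array.
    static char[] keepPrintable(char[] in) {
        StringBuilder sb = new StringBuilder(in.length);
        for (char ch : in) {
            if (isPrintable(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString().toCharArray();
    }

    public static void main(String[] args) {
        // Sample data standing in for what the server might send (an assumption).
        char[] fromServer = {'a', '7', '#', (char) 0, '\n', 'Z'};
        for (char ch : fromServer) {
            System.out.println((int) ch + " isLetter=" + Character.isLetter(ch)
                    + " printable=" + isPrintable(ch));
        }
        System.out.println(new String(keepPrintable(fromServer))); // prints a7#Z
    }
}
```

Here '7' and '#' fail isLetter but survive the printable filter, while the NUL and newline characters are dropped.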

Have a look at the Matcher class in java.util.regex. I don't think it is wise to loop over every character unless there is a maximum cap on the input length.
/[^0-9]+/g, ''
The regex above (written in JavaScript notation) replaces every character that is not a digit with nothing; in Java the equivalent call is str.replaceAll("[^0-9]+", ""). Modify the character class to suit your needs.
regards
Punith
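For reference, a Java version of the regex idea in the answer above might look like the following. This is only a sketch: the class and method names are invented, and the character class keeps ASCII digits only, so it would need adjusting for whatever "valid" means in your case.

```java
class DigitsOnlySketch {
    // Removes every run of characters that are not ASCII digits.
    static String keepDigits(String input) {
        return input.replaceAll("[^0-9]+", "");
    }

    public static void main(String[] args) {
        System.out.println(keepDigits("ab12-c3 4")); // prints 1234
    }
}
```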

Related

How to validate a string containing only some special characters is allowed in java?

I want to validate a string in Java. It should not allow any special characters except these three: #._ (at, dot, underscore). No spaces are allowed, and only English letters are allowed.
You can use regex (Regular Expressions):
public static boolean isValid(String s)
{
return s.matches("[a-zA-Z#._]*");
}
Explanation:
[a-zA-Z#._]: Matches a single character that is either in the English
alphabet, or the three special characters '#', '.' and '_'.
*: Matches the previous expression between zero and unlimited times.
Note: If empty strings are not valid either, use [a-zA-Z#._]+ as the regex
instead, or additionally test whether s.isEmpty() or has length of 0.
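To make the difference between the * and + quantifiers concrete, here is a small sketch with both variants side by side. The class name and the second method name are made up for illustration.

```java
class HashDotUnderscoreSketch {
    // Zero or more allowed characters: the empty string passes.
    static boolean isValid(String s) {
        return s.matches("[a-zA-Z#._]*");
    }

    // One or more allowed characters: the empty string is rejected.
    static boolean isValidNonEmpty(String s) {
        return s.matches("[a-zA-Z#._]+");
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc#._"));    // true
        System.out.println(isValid("ab c"));      // false, space is not allowed
        System.out.println(isValid("abc1"));      // false, digits are not allowed
        System.out.println(isValid(""));          // true
        System.out.println(isValidNonEmpty(""));  // false
    }
}
```

Inside a character class the dot is a literal dot, so it needs no escaping there.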
I'll assume you're reading user input with a Scanner object and can use the methods of the Character wrapper class. This is about as easy to understand as it gets:
Scanner stdin = new Scanner(System.in);
String userInput = stdin.nextLine();
boolean valid = true;
for(int i = 0; i < userInput.length(); i++){
char ch = userInput.charAt(i); // current character
// note: isDigit and isLetter also accept digits and non-English letters
if(!Character.isDigit(ch) && !Character.isLetter(ch) && ch != '#' && ch !='_' && ch != '.'){
valid = false;
break;
}
}
// Code that deals with result.

most efficient way to check if a string contains specific characters

I have a string that should contain only specific characters: {}()[]
I've created a validate method that checks if the string contains forbidden characters (by forbidden characters I mean everything that is not {}()[] )
Here is my code:
private void validate(String string) {
char [] charArray = string.toCharArray();
for (Character c : charArray) {
if (!"{}()[]".contains(c.toString())){
throw new IllegalArgumentException("The string contains forbidden characters");
}
}
}
I'm wondering if there are better ways to do it since my approach doesn't seem right.
If I kept your approach, I would personally modify it as follows:
private static void validate(String str) {
for (char c : str.toCharArray()) {
if ("{}()[]".indexOf(c) < 0){
throw new IllegalArgumentException("The string contains forbidden characters");
}
}
}
The changes are as follows:
Not declaring a temporary variable for the char array.
Using indexOf to find a character instead of converting c to String to use .contains().
Looping on the primitive char since you no longer need toString().
Not naming the parameter string as this can cause confusion and is not good practice.
Note: contains calls indexOf(), so this does also technically save you a method call each iteration.
I'd suggest using a Stream if you are on Java 8.
This lets you skip converting each char to a String.
private void validate_stream(String str) {
// throw if any character is NOT one of the allowed brackets
if(str.chars().anyMatch(a -> !(a==125||a==123||a==93||a==91||a==41||a==40)))
throw new IllegalArgumentException("The string contains forbidden characters");
}
The numbers are the ASCII codes of the allowed characters; you can replace them with char literals if you prefer:
(a -> !(a=='{'||a=='}'||a=='['||a==']'||a=='('||a==')'))
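An equivalent way to phrase the stream check is allMatch over the set of allowed characters, which avoids listing each code by hand. The following is a sketch only; the class name and the ALLOWED constant are assumptions introduced for illustration.

```java
class BracketStreamSketch {
    // The only characters the string may contain.
    private static final String ALLOWED = "{}()[]";

    static void validate(String str) {
        // chars() yields the UTF-16 code units as ints; indexOf(int) looks each one up in ALLOWED.
        boolean onlyAllowed = str.chars().allMatch(c -> ALLOWED.indexOf(c) >= 0);
        if (!onlyAllowed) {
            throw new IllegalArgumentException("The string contains forbidden characters");
        }
    }

    public static void main(String[] args) {
        validate("{[()]}"); // passes silently
        try {
            validate("{a}");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that allMatch is true for an empty stream, so an empty string is accepted by this version.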
I hope this works for you. I have added my code alongside yours.
I used a regex pattern in which \\ escapes the brackets, since they have special meaning in regex, and placed them in a character class followed by * so that any sequence of those characters is accepted. The matches method of String tests the whole string against the given pattern. Because of the negation (!), a string like "{}()[]as" takes the if branch and prints "not matched", while a string like "{}()[]" takes the else branch. You can change this as you like, for example by throwing an exception.
private static void validate(String string)
{
// character class: any number of the six bracket characters, in any order
// (without the enclosing [...]* the pattern would match only the literal text {}()[])
String pattern = "[\\{\\}\\(\\)\\[\\]]*";
if(!string.matches(pattern)) {
System.out.println("not matched:"+string);
}
else {
System.out.println("data matched:"+string);
}
char [] charArray = string.toCharArray();
for (Character c : charArray) {
if (!"{}()[]".contains(c.toString())){
throw new IllegalArgumentException("The string contains forbidden characters");
}
}
}
All the brackets are Meta characters, referenced here:
http://tutorials.jenkov.com/java-regex/index.html
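Since String.matches has to match the entire input, a pattern made of the six escaped brackets in sequence accepts only that exact text, whereas a character class accepts any mix of them. A minimal sketch of the difference, using a precompiled Pattern (class, constant and method names are invented for illustration):

```java
import java.util.regex.Pattern;

class BracketRegexSketch {
    // Character class of the six brackets; inside [...] only [ and ] need escaping here.
    private static final Pattern ONLY_BRACKETS = Pattern.compile("[{}()\\[\\]]*");

    static boolean onlyBrackets(String s) {
        return ONLY_BRACKETS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(onlyBrackets("(([]))"));    // true
        System.out.println(onlyBrackets("{}()[]as"));  // false
        // The literal-sequence pattern matches one specific string only.
        System.out.println("(([]))".matches("\\{\\}\\(\\)\\[\\]")); // false
        System.out.println("{}()[]".matches("\\{\\}\\(\\)\\[\\]")); // true
    }
}
```

Compiling the pattern once is a design choice that avoids recompiling the regex on every call, which String.matches would otherwise do.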

Split the String by \ which contains following string "abc\u12345. "

Before posting, I tried split("\u"), \\u and \u, but none of them work, because \u is treated as the start of a Unicode escape, whereas here it is not meant as one.
As already mentioned, \u1234 is a Unicode escape and is therefore handled as a single character (the escape takes exactly four hex digits, so the trailing 5 is a separate character).
If these are already in your string, it is too late. If the data comes from a file or over the network, you could read the input and escape each \ or \u you encounter before storing it in your string variable and working on it.
If you elaborate on the context of your task a little more, perhaps we could find other solutions for you.
Java interprets it as a Unicode character, so the right thing to do is to fix the source so that it reads the data properly and avoids passing Unicode escapes to Java when they are not needed. One workaround is to convert the entire string into a character array and check whether each character is 128 or greater; once one is found, the rest of the array is appended to a separate StringBuilder. See if the code below helps:
public static void tryMee(String input)
{
StringBuilder b1 = new StringBuilder();
StringBuilder b2 = new StringBuilder();
boolean isUni = false;
for (char c : input.toCharArray())
{
if (c >= 128)
{
b2.append("\\u").append(String.format("%04X", (int) c));
isUni = true;
}
else if(isUni) b2.append(c);
else b1.append(c);
}
System.out.println("B1: "+b1);
System.out.println("B2: "+b2);
}
Try this. You did not escape properly
split("\\\\u")
or
split(Pattern.quote("\\u"))
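The two split calls above only help when the string really contains a backslash followed by the letter u, for example when it was read from a file rather than written as an escape in source code. A small sketch of that case, with an invented class and method name:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class BackslashUSplitSketch {
    // Splits on a literal backslash followed by the letter u.
    static String[] splitOnBackslashU(String raw) {
        return raw.split("\\\\u");
    }

    public static void main(String[] args) {
        // The doubled backslash makes this a plain ten-character string, not an escape.
        String raw = "abc\\u12345";
        System.out.println(raw.length());                                 // 10
        System.out.println(Arrays.toString(splitOnBackslashU(raw)));      // [abc, 12345]
        // Pattern.quote turns the two-character delimiter into a literal pattern.
        System.out.println(Arrays.toString(raw.split(Pattern.quote("\\u")))); // [abc, 12345]
        // Written as a real escape, the compiler has already replaced it with one character.
        System.out.println("abc\u12345".length());                        // 5
    }
}
```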
import java.util.Arrays;
public class Example {
public static void main (String[]args){
String str = "abc\u12345";
// first replace \\u with something else, for example with -u
char [] chars = str.toCharArray();
StringBuilder sb = new StringBuilder();
for(char c: chars){
if(c >= 128){
sb.append("-u").append(Integer.toHexString(c | 0x10000).substring(1) );
}else{
sb.append(c);
}
}
String replaced = sb.toString();
// now you can split by -u
String [] splited = sb.toString().split("-u");
System.out.println(replaced);
System.out.println(Arrays.toString(splited));
}
}

Java Get first character values for a string

I have inputs like
AS23456SDE
MFD324FR
I need to get First Character values like
AS, MFD
The prefix is not always the first two or three characters, because the input can vary. I need the characters that come before the first number.
Thank you.
Edit : This is what I have tried.
public static String getPrefix(String serial) {
StringBuilder prefix = new StringBuilder();
for(char c : serial.toCharArray()){
if(Character.isDigit(c)){
break;
}
else{
prefix.append(c);
}
}
return prefix.toString();
}
Here is a nice one line solution. It uses a regex to match the first non numeric characters in the string, and then replaces the input string with this match.
public String getFirstLetters(String input) {
return new String("A" + input).replaceAll("^([^\\d]+)(.*)$", "$1")
.substring(1);
}
System.out.println(getFirstLetters("AS23456SDE"));
System.out.println(getFirstLetters("1AS123"));
Output:
AS
(empty)
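The "A" prefix in the one-liner is there so that the regex always has at least one non-digit to match. An alternative that avoids the trick is to ask a Matcher for the leading run of non-digits directly. This is a sketch under that assumption; the class, constant and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingPrefixSketch {
    // ^\D* matches the (possibly empty) run of non-digits at the start of the input.
    private static final Pattern LEADING_NON_DIGITS = Pattern.compile("^\\D*");

    static String prefix(String input) {
        Matcher m = LEADING_NON_DIGITS.matcher(input);
        // The pattern can match the empty string, so find() succeeds for any input.
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(prefix("AS23456SDE")); // AS
        System.out.println(prefix("MFD324FR"));   // MFD
        System.out.println(prefix("1AS123").isEmpty()); // true
    }
}
```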
A simple solution could be like this:
public static void main (String[]args) {
String str = "MFD324FR";
char[] characters = str.toCharArray();
for(char c : characters){
if(Character.isDigit(c))
break;
else
System.out.print(c);
}
}
Use the following function to get the required output:
public String getFirstChars(String str){
int zeroAscii = '0'; int nineAscii = '9';
String result = "";
for (int i=0; i< str.length(); i++){
int ascii = str.charAt(i);
if(ascii >= zeroAscii && ascii <= nineAscii){
// first digit found: everything collected so far is the prefix
return result;
}else{
result = result + str.charAt(i);
}
}
// no digit at all: the whole string is the prefix
return str;
}
Pass your string as the argument.
I think this can be done with a simple regex that matches digits, combined with Java's String.split. This approach should be more efficient than the methods that use more complicated regexes.
Something as below will work
String inp = "ABC345.";
String beginningChars = inp.split("[\\d]+",2)[0];
System.out.println(beginningChars); // only if you want to print.
The regex I used "[\\d]+" is escaped for java already.
What it does?
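It matches one or more digits (\d). In Java, \d matches only the ASCII digits 0-9 by default; it matches digits of other scripts (for example Arabic-Indic digits) only when the Pattern.UNICODE_CHARACTER_CLASS flag, or the embedded (?U) flag, is enabled.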
What does String beginningChars = inp.split("[\\d]+",2)[0] do?
It applies this regex and separates the string into string arrays where ever a match is found. The [0] at the end selects the first result from that array, since you wanted the starting chars.
What is the second parameter to .split(regex,int) which I supplied as 2?
This is the Limit parameter. This means that the regex will be applied on the string till 1 match is found. Once 1 match is found the string is not processed anymore.
From the Strings javadoc page:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
This will be efficient if your string is huge.
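The effect of the limit argument is easiest to see by comparing the arrays with and without it. A short sketch (the class and method names are invented; the sample input with a second run of digits is made up to show the difference):

```java
import java.util.Arrays;

class SplitLimitSketch {
    // Returns whatever precedes the first run of ASCII digits.
    static String beforeFirstDigits(String s) {
        return s.split("[0-9]+", 2)[0];
    }

    public static void main(String[] args) {
        String sample = "AS23456SDE9X";
        // No limit: the pattern is applied as often as possible.
        System.out.println(Arrays.toString(sample.split("[0-9]+")));    // [AS, SDE, X]
        // Limit 2: the pattern is applied at most once, the rest stays in the last entry.
        System.out.println(Arrays.toString(sample.split("[0-9]+", 2))); // [AS, SDE9X]
        System.out.println(beforeFirstDigits("MFD324FR"));              // MFD
    }
}
```

If the input starts with a digit, the first array entry is the empty string, which matches the "no prefix" case.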
Another possible regex, if you want to make it explicit that only the English digits 0-9 act as separators:
"[0-9]+"
public static void main(String[] args) {
String testString = "MFD324FR";
int index = 0;
for (Character i : testString.toCharArray()) {
if (Character.isDigit(i))
break;
index++;
}
System.out.println(testString.substring(0, index));
}
This prints the characters that precede the first digit.

How can I replace non-printable Unicode characters in Java?

The following will replace ASCII control characters (shorthand for [\x00-\x1F\x7F]):
my_string.replaceAll("\\p{Cntrl}", "?");
The following will replace all ASCII non-printable characters (shorthand for [\p{Graph}\x20]), including accented characters:
my_string.replaceAll("[^\\p{Print}]", "?");
However, neither works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?
my_string.replaceAll("\\p{C}", "?");
See more about Unicode regex. java.util.regex.Pattern and String.replaceAll support these categories.
Op De Cirkel is mostly right. His suggestion will work in most cases:
myString.replaceAll("\\p{C}", "?");
But if myString might contain non-BMP codepoints then it's more complicated. \p{C} contains the surrogate codepoints of \p{Cs}. The replacement method above will corrupt non-BMP codepoints by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.
Using the other constituent categories is an option:
myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");
However, solitary surrogate characters not part of a pair (each surrogate character has an assigned codepoint) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:
StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
int codePoint = myString.codePointAt(offset);
offset += Character.charCount(codePoint);
// Replace invisible control characters and unused code points
switch (Character.getType(codePoint))
{
case Character.CONTROL: // \p{Cc}
case Character.FORMAT: // \p{Cf}
case Character.PRIVATE_USE: // \p{Co}
case Character.SURROGATE: // \p{Cs}
case Character.UNASSIGNED: // \p{Cn}
newString.append('?');
break;
default:
newString.append(Character.toChars(codePoint));
break;
}
}
The methods below may serve your goal:
public static String removeNonAscii(String str)
{
return str.replaceAll("[^\\x00-\\x7F]", "");
}
public static String removeNonPrintable(String str) // All Control Char
{
return str.replaceAll("[\\p{C}]", "");
}
public static String removeSomeControlChar(String str) // Some Control Char
{
return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}
public static String removeFullControlChar(String str)
{
return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}
You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).
In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.
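To see which category a given character falls into, Character.getType can be compared against the constants for those categories. The sketch below checks a tab (Cc) and a zero width space, U+200B (Cf), and strips both with the regex classes mentioned above. The class and method names are assumptions for illustration.

```java
class UnicodeCategorySketch {
    // Removes characters in the "Other, Control" and "Other, Format" categories.
    static String stripControlAndFormat(String s) {
        return s.replaceAll("[\\p{Cc}\\p{Cf}]", "");
    }

    public static void main(String[] args) {
        // Tab is a control character (Cc).
        System.out.println(Character.getType('\t') == Character.CONTROL);  // true
        // U+200B, zero width space, is a format character (Cf).
        System.out.println(Character.getType(0x200B) == Character.FORMAT); // true
        // A plain letter belongs to neither category.
        System.out.println(Character.getType('a') == Character.LOWERCASE_LETTER); // true

        String sample = "a\u200Bb\tc";
        System.out.println(stripControlAndFormat(sample)); // abc
    }
}
```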
I have used this simple function for this:
private static final Pattern pattern = Pattern.compile("[^ -~]");
private static String cleanTheText(String text) {
// remove every character outside printable ASCII (space through ~),
// not only the occurrences of the first one found
return pattern.matcher(text).replaceAll("");
}
Hope this is useful.
Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning: 1. trimming leading or trailing whitespaces, 2. dos2unix, 3. mac2unix, 4. removing all "invisible Unicode characters" except whitespaces:
myString.trim.replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")
Tested with Scala REPL.
I propose handling supplementary (non-BMP) code points as shown below. This version replaces each of them with ?; append nothing in that branch if you would rather remove them:
private String removeNonBMPCharacters(final String input) {
StringBuilder strBuilder = new StringBuilder();
input.codePoints().forEach((i) -> {
if (Character.isSupplementaryCodePoint(i)) {
strBuilder.append("?");
} else {
strBuilder.append(Character.toChars(i));
}
});
return strBuilder.toString();
}
A version with multilanguage support:
public static String cleanUnprintableChars(String text, boolean multilanguage)
{
String regex = multilanguage ? "[^\\x00-\\xFF]" : "[^\\x00-\\x7F]";
// strips everything outside ASCII, or outside Latin-1 (U+0000 to U+00FF) when multilanguage is true
text = text.replaceAll(regex, "");
// erases the ASCII control characters except \r, \n and \t
text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
// removes the remaining characters in Unicode category C; note that this also removes \r, \n and \t
text = text.replaceAll("\\p{C}", "");
return text.trim();
}
I have reworked the code from "Extract digits from a string in Java" to handle phone numbers such as +9 (987) 124124:
public static String stripNonDigitsV2( CharSequence input ) {
if (input == null)
return null;
if ( input.length() == 0 )
return "";
char[] result = new char[input.length()];
int cursor = 0;
CharBuffer buffer = CharBuffer.wrap( input );
int i=0;
while ( i< buffer.length() ) { //buffer.hasRemaining()
char chr = buffer.get(i);
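if (chr=='u'){
// jump five positions ahead, past the u and the four characters after it
i=i+5;
if (i >= buffer.length()) break; // nothing left to read after the jump
chr=buffer.get(i);
}
if ( chr > 39 && chr < 58 ) // keeps ( ) * + , - . / and the digits 0-9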
result[cursor++] = chr;
i=i+1;
}
return new String( result, 0, cursor );
}
