Check if a String contains encoded characters - java

Hello, I am looking for a way to detect whether a string has been encoded. For example:
String name = "Hellä world";
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
The output of this encoded variable is:
Hellä world
As you can see, there is an A with a tilde and another symbol. Is there a way to check whether the output contains such encoded characters?

Sounds like you want to check if a string that was decoded from bytes in latin1 could have been decoded in UTF-8, too. That's easy because illegal byte sequences are replaced by the character \ufffd:
String recoded = new String(encoded.getBytes("iso-8859-1"), "UTF-8");
return recoded.indexOf('\uFFFD') == -1; // No replacement character found
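Wrapped up as a method, a minimal sketch of that check could look like this (the method name is made up; it assumes the string really was decoded as ISO-8859-1, as in the question):
import java.nio.charset.StandardCharsets;

// Returns true if the Latin-1-decoded string would also have decoded cleanly as UTF-8,
// i.e. no byte sequence had to be replaced by the replacement character U+FFFD.
public static boolean wasValidUtf8(String encoded) {
    byte[] originalBytes = encoded.getBytes(StandardCharsets.ISO_8859_1);
    String recoded = new String(originalBytes, StandardCharsets.UTF_8);
    return recoded.indexOf('\uFFFD') == -1;
}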

Your question doesn't quite make sense. A Java String is a sequence of characters. They don't have an encoding until you convert them into bytes, at which point you need to specify one (although you will see a lot of code that uses the platform default, which is what e.g. String.getBytes() with no arguments does).
I suggest you read this http://kunststube.net/encoding/.

String name = "Hellä world";
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
This code is just a character corruption bug. You take a UTF-16 string, transcode it to UTF-8, pretend it is ISO-8859-1 and transcode it back to UTF-16, resulting in incorrectly encoded characters.
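For contrast, a tiny sketch of a lossless round trip: as long as you decode with the same charset you encoded with, the original string comes back unchanged (UTF-8 is just the example charset here):
import java.nio.charset.StandardCharsets;

String name = "Hellä world";
byte[] utf8Bytes = name.getBytes(StandardCharsets.UTF_8);             // encode to bytes
String roundTripped = new String(utf8Bytes, StandardCharsets.UTF_8);  // decode with the SAME charset
System.out.println(name.equals(roundTripped));                        // true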

If I understood your question correctly, this code may help you. The function isEncoded checks whether its parameter can be encoded as ASCII or whether it contains non-ASCII characters.
import java.nio.charset.Charset;
import org.junit.Assert;
import org.junit.Test;

public boolean isEncoded(String text) {
    Charset charset = Charset.forName("US-ASCII");
    String checked = new String(text.getBytes(charset), charset);
    return !checked.equals(text);
}

@Test
public void testAscii() throws Exception {
    Assert.assertFalse(isEncoded("Hello world"));
}

@Test
public void testNonAscii() throws Exception {
    Assert.assertTrue(isEncoded("Hellä world"));
}
You can also check against other charsets by changing the charset variable or by turning it into a parameter.
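A quick sketch of that parameterised variant (the method name is just illustrative):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Returns true if text contains characters that cannot be represented in the given charset.
public boolean hasCharsOutside(String text, Charset charset) {
    // Unmappable characters are replaced during encoding, so the round-tripped
    // string differs from the original exactly when such characters are present.
    String checked = new String(text.getBytes(charset), charset);
    return !checked.equals(text);
}

// usage:
// hasCharsOutside("Hellä world", StandardCharsets.US_ASCII)   -> true
// hasCharsOutside("Hellä world", StandardCharsets.ISO_8859_1) -> false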

I'm not really sure what you are trying to do or what your problem is.
This line doesn't make any sense:
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
You are encoding name into UTF-8 bytes and then trying to decode those bytes as "iso8859-1".
If you want to encode name as "iso8859-1", just do name.getBytes("iso8859-1").
Please tell us what problem you encountered so that we can help more.

You can check whether your string contains characters of Unicode type OTHER_LETTER (for example CJK characters) with this code:
public boolean isEncoded(String input) {
    char[] charArray = input.toCharArray();
    for (int i = 0, charArrayLength = charArray.length; i < charArrayLength; i++) {
        char c = charArray[i];
        if (Character.getType(c) == Character.OTHER_LETTER) {
            return true;
        }
    }
    return false;
}

Related

How to check whether a "byte[] b" (holding file content) is Base64 encoded or not, in Java

Here I have code which accepts file content in a byte array. I want to check whether it is in Base64 format or not before converting it to Base64 and returning it. Can anyone help me out here?
import sun.misc.BASE64Encoder;

public static String encodeInByteArray(byte[] b)
{
    BASE64Encoder encoder = new BASE64Encoder();
    return encoder.encode(b);
}
Below is the code I tried for checking the Base64 format:
import sun.misc.BASE64Encoder;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.regex.Pattern;

public class Encoder
{
    public static String encodeInByteArray(byte[] b)
    {
        String regex =
            "([A-Za-z0-9+/]{4})*" +
            "([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)";
        Pattern patron = Pattern.compile(regex);
        String s = b.toString();
        if (!patron.matcher(s).matches()) {
            BASE64Encoder encoder = new BASE64Encoder();
            return encoder.encode(b);
        }
        else
            return s;
    }

    public static void main(String[] args) throws FileNotFoundException
    {
        FileInputStream fs = new FileInputStream("Sample.pdf");
        String s = fs.toString();
        byte[] b = s.getBytes();
        encodeInByteArray(b);
    }
}
Calling b.toString() doesn't do what you might expect: the resulting string will be something like [B@106d69c, because arrays don't override toString. (In a similar vein, calling fs.toString() won't give you the contents of the file as a string.)
To get a String from a byte[], use the constructor:
new String(b)
But you probably want to specify a particular charset, e.g.:
new String(b, StandardCharsets.ISO_8859_1)
otherwise you may get different results, depending upon your JVM's configuration.
As a first solution, you could parse the file, or parse it part way (to save resources), and determine whether a line is Base64 encoded. See this answer for the String Base64 encoding check:
How to check whether the string is base64 encoded or not
A second solution would be that, if you have complete control over how the file is saved and encoded, you could place a byte at the head or tail of the file indicating whether it is Base64 encoded or not, which should be faster than the solution above.
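A rough sketch of that marker-byte idea (the class name, flag values and file layout below are invented for illustration; they are not part of any standard):
import java.io.*;

public class MarkedFile {
    private static final int RAW = 0;
    private static final int BASE64 = 1;

    static void write(File f, byte[] content, boolean base64Encoded) throws IOException {
        try (OutputStream out = new FileOutputStream(f)) {
            out.write(base64Encoded ? BASE64 : RAW);   // one marker byte at the head
            out.write(content);
        }
    }

    static boolean isBase64Encoded(File f) throws IOException {
        try (InputStream in = new FileInputStream(f)) {
            return in.read() == BASE64;                // only the first byte is inspected
        }
    }
}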
You can use Base64.isBase64(byte[] arrayOctet) from Apache's commons-codec.
Be aware that whitespace is currently considered valid, as stated in the documentation.
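A minimal sketch of how that could look (assumes commons-codec is on the classpath; the method name encodeIfNeeded is made up):
import org.apache.commons.codec.binary.Base64;

// Encode the content only if it is not already valid Base64.
public static byte[] encodeIfNeeded(byte[] content) {
    if (Base64.isBase64(content)) {
        return content;                   // already Base64 (note: whitespace counts as valid)
    }
    return Base64.encodeBase64(content);  // encode the raw bytes
}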

base64 encoding issue, java

I'm using the Apache library for encoding to Base64, but this time the problem is an unusual one. I have a Base64-encoded string:
MIIHSjCCBjKgAwIBAgIQQuw1emUfNRlPD/euDuzBjDANBgkqhkiG9w0BAQUFADCB"+
"5TELMAkGA1UEBhMCRVMxIDAeBgkqhkiG9w0BCQEWEWFjQGFjYWJvZ2FjaWEub3Jn
It's part of a certificate (.CER) file. I just decode it and encode it again, but the result is slightly different. The resulting string is:
"MIIHSjCCBjKgAwIBAgIQQuw1emUfNRlPD/euDuzBjDANBgkqhkiG9w0BAQUFADA"+ "/5TELMAkGA1UEBhMCRVMxIDAeBgkqhkiG9w0BCQEWEWFjQGFjYWJvZ2FjaWEub3Jn"
The difference is at the end of the first line and the start of the second line: CB is replaced by A/.
This change invalidates my certificate. Where can the problem be?
The problem is in your intermediate string conversion: decoding the raw certificate bytes into a String with the platform default charset, and then taking that String's bytes again, corrupts any byte values that the charset cannot represent. If you work only with the byte array, everything is fine.
import org.apache.commons.codec.binary.Base64;

public static void main(String[] args) {
    String partOfCer = "MIIHSjCCBjKgAwIBAgIQQuw1emUfNRlPD/euDuzBjDANBgkqhkiG9w0BAQUFADCB" + "5TELMAkGA1UEBhMCRVMxIDAeBgkqhkiG9w0BCQEWEWFjQGFjYWJvZ2FjaWEub3Jn";
    byte[] dec1_byte = Base64.decodeBase64(partOfCer.getBytes());
    // String dec1 = new String(dec1_byte);   // this intermediate String is what corrupted the data
    byte[] newBytes = Base64.encodeBase64(dec1_byte);
    String newStr = new String(newBytes);
    System.out.println(partOfCer);
    System.out.println(newStr);
    System.out.println(partOfCer.equals(newStr));
}

Get Multilingual Data from ByteBuffer

I am receiving ByteBuffers in a UDP Java application.
The data in this ByteBuffer can be any string in any language, or any special characters, separated by zeros.
I use the following code to get Strings from it:
public String getString() {
    byte[] remainingBytes = new byte[this.byteBuffer.remaining()];
    this.byteBuffer.slice().get(remainingBytes);
    String dataString = new String(remainingBytes);
    int stringEnd = dataString.indexOf(0);
    if (stringEnd == -1) {
        return null;
    } else {
        dataString = dataString.substring(0, stringEnd);
        this.byteBuffer.position(this.byteBuffer.position() + dataString.getBytes().length + 1);
        return dataString;
    }
}
These strings are stored in a MySQL DB with everything set to UTF-8.
If I run the application on Windows, special characters like ® are displayed but Chinese characters are not.
After adding the VM argument -Dfile.encoding=UTF8, Chinese characters are displayed but characters like ® are shown as ?? etc.
Please help.
Edit:
Input strings in the UDP packet are a variable-length byte field, encoded in UTF-8 and terminated by 0x00.
For JDBC I also use useUnicode=true&characterEncoding=UTF-8
String dataString = new String(remainingBytes); is wrong. You should almost never do that. You should find out what encoding was used to put the bytes into the UDP packet, and use the same encoding on that line:
String dataString = new String(remainingBytes, encoding); // e.g. "UTF-8"
Edit: based on your updated question, encoding should be "UTF-8"
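Putting that together with the 0x00-terminated layout described in the edit, a sketch of a corrected getString() might look like this (finding the terminator in the raw bytes also avoids the getBytes().length mismatch in the original method; the layout comes from the question, the rest is an illustration):
import java.nio.charset.StandardCharsets;

public String getString() {
    byte[] remainingBytes = new byte[this.byteBuffer.remaining()];
    this.byteBuffer.slice().get(remainingBytes);

    // Locate the 0x00 terminator in the raw bytes, so the byte count used to
    // advance the buffer is exact regardless of multi-byte UTF-8 characters.
    int stringEnd = -1;
    for (int i = 0; i < remainingBytes.length; i++) {
        if (remainingBytes[i] == 0) {
            stringEnd = i;
            break;
        }
    }
    if (stringEnd == -1) {
        return null;
    }

    // Decode exactly the bytes before the terminator as UTF-8.
    String dataString = new String(remainingBytes, 0, stringEnd, StandardCharsets.UTF_8);
    this.byteBuffer.position(this.byteBuffer.position() + stringEnd + 1);
    return dataString;
}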
Not sure, but dataString contains only the data up to that zero, because stringEnd points at the first zero position, not past it.
dataString = dataString.substring(0, stringEnd + 1);
or
String specChar = dataString.substring(stringEnd, stringEnd + 1); and it should return only the special character, but as I said at the beginning, I'm not sure...

How best to convert a byte[] array to a string buffer

I have a number of byte[] array variables I need to convert to string buffers.
Is there a method for this type of conversion?
Thanks
Thank you all for your responses. However, I didn't make myself clear...
I'm using some byte[] arrays pre-defined as public static "under" the class declaration
of my Java program. These "fields" are reused during the "life" of the process.
As the program issues status messages (written to a file), I've defined a string buffer
(mesg_data) that is used to format a status message.
So as the program executes
I tried msg2 = String(byte_array2)
I get a compiler error:
cannot find symbol
symbol : method String(byte[])
location: class APPC_LU62.java.LU62XnsCvr
convrsID = String(conversation_ID) ;
example:
public class LU62XnsCvr extends Object
.
.
static String convrsID ;
static byte[] conversation_ID = new byte[8] ;
So I can't use a "dynamic" definition of a string variable, because the same variable is used
in multiple places.
I hope I made myself clear.
Thanks ever so much
Guy
String s = new String(myByteArray, "UTF-8");
StringBuilder sb = new StringBuilder(s);
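Applied to the fields from the question, a minimal sketch (UTF-8 is an assumption here; use whatever charset actually produced conversation_ID, and note that the missing new was the cause of the compiler error):
import java.nio.charset.StandardCharsets;

public class LU62XnsCvr {
    static byte[] conversation_ID = new byte[8];
    static String convrsID;
    static StringBuffer mesg_data = new StringBuffer();

    static void buildMessage() {
        // 'new' is required: String(...) is a constructor, not a method.
        convrsID = new String(conversation_ID, StandardCharsets.UTF_8);
        mesg_data.setLength(0);        // reuse the same buffer for each message
        mesg_data.append(convrsID);
    }
}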
There is a constructor that takes a byte array and an encoding:
byte[] bytes = new byte[200];
//...
String s = new String(bytes, "UTF-8");
In order to translate bytes to characters you need to specify an encoding: the scheme by which sequences (typically of length 1 to 4) of 0-255 values (that is, sequences of bytes) are mapped to characters. UTF-8 is probably the best bet as a default.
You can turn it into a String directly:
byte[] bytearray
....
String mystring = new String(bytearray);
and then convert that to a StringBuffer:
StringBuffer buffer = new StringBuffer(mystring);
You may use
str = new String(bytes)
but what the code above does is create a Java String (internally UTF-16) using the default platform character encoding.
If the byte array was created from a string encoded in the platform default character encoding, this will work well.
If not, you need to specify the correct character encoding (Charset), as in
String str = new String(bytes, charset)
It depends entirely on the character encoding, but you want:
String value = new String(bytes, "US-ASCII");
This would work for US-ASCII values.
See Charset for other valid character encodings (e.g., UTF-8)

Java InputStream encoding/charset

Running the following (example) code
import java.io.*;

public class test {
    public static void main(String[] args) throws Exception {
        byte[] buf = {-27};
        InputStream is = new ByteArrayInputStream(buf);
        BufferedReader r = new BufferedReader(
                new InputStreamReader(is, "ISO-8859-1"));
        String s = r.readLine();
        System.out.println("test.java:9 [byte] (char)" + (char) s.getBytes()[0] +
                " (int)" + (int) s.getBytes()[0]);
        System.out.println("test.java:10 [char] (char)" + (char) s.charAt(0) +
                " (int)" + (int) s.charAt(0));
        System.out.println("test.java:11 string below");
        System.out.println(s);
        System.out.println("test.java:13 string above");
    }
}
gives me this output
test.java:9 [byte] (char)? (int)63
test.java:10 [char] (char)? (int)229
test.java:11 string below
?
test.java:13 string above
How do I retain the correct byte value (-27) in the line-9 printout? And consequently receive the expected output of the System.out.println(s) command (å).
If you want to retain byte values, don't use a Reader at all, ideally. To represent arbitrary binary data in text and convert it back to binary data later, you should use base16 or base64 encoding.
However, to explain what's going on, when you call s.getBytes() that's using the default character encoding, which apparently doesn't include Unicode character U+00E5.
If you call s.getBytes("ISO-8859-1") everywhere instead of s.getBytes() I suspect you'll get back the right byte value... but relying on ISO-8859-1 for this is kinda dirty IMO.
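For illustration, a quick sketch of that explicit-charset variant (assuming java.nio.charset.StandardCharsets is imported); ISO-8859-1 maps U+00E5 back to the single byte -27:
byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);  // instead of the default-charset s.getBytes()
System.out.println("byte value: " + latin1[0]);           // prints -27
System.out.println((char) (latin1[0] & 0xFF));            // 0xE5 -> 'å'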
As noted, getBytes() (no-arguments) uses the Java platform default encoding, which may not be ISO-8859-1. Simply printing it should work, provided your terminal and the default encoding match and support the character. For instance, on my system, the terminal and default Java encoding are both UTF-8. The fact that you're seeing a '?' indicates that yours don't match or å is not supported.
If you want to manually encode to UTF-8 on your system, do:
String s = r.readLine();
byte[] utf8Bytes = s.getBytes("UTF-8");
It should give a byte array with {-61, -91}.
