How do I convert UTF-8 in hex to its code point?

I have a String e2 80 99 which is a Hex representation of a UTF-8 character. The string represents
U+2019 ’ e2 80 99 RIGHT SINGLE QUOTATION MARK
I want to convert e2 80 99 to its corresponding Unicode code point which is U+2019 or even ' (single quotation).
How do I do it?

Basically you need to get a String representation of the character encoded with utf-8, then get the first character of the resulting String (or first + second if the resulting character is represented as two surrogates in UTF-16). This is a proof of concept:
import java.nio.charset.StandardCharsets;

public class Utf8HexToCodePoint {
    public static void main(String[] args) {
        // Convert your hex representation of a char into raw bytes:
        String utf8char = "e2 80 99";
        String[] strNumbers = utf8char.split(" ");
        byte[] rawChars = new byte[strNumbers.length];
        int index = 0;
        for (String strNumber : strNumbers) {
            rawChars[index++] = (byte) Integer.parseInt(strNumber, 16);
        }
        // Decode the bytes as UTF-8 (Java Strings are "encoded" in UTF-16):
        String utf16Char = new String(rawChars, StandardCharsets.UTF_8);
        // codePointAt combines a surrogate pair into a single code point when needed
        int codePoint = utf16Char.codePointAt(0);
        System.out.println("code point: " + Integer.toHexString(codePoint)); // prints "code point: 2019"
    }
}
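If the hex string may hold more than one character, the same idea extends to every code point of the decoded text. The following is only a sketch, not part of the answer above: the class name HexUtf8 and the method toCodePoints are made up for illustration, and the input is assumed to be space-separated hex bytes that form valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class HexUtf8 {
    // Hypothetical helper: decodes space-separated hex bytes as UTF-8
    // and returns the Unicode code points of the resulting text.
    public static int[] toCodePoints(String hexBytes) {
        String[] parts = hexBytes.trim().split("\\s+");
        byte[] raw = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            raw[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        String decoded = new String(raw, StandardCharsets.UTF_8);
        // codePoints() merges surrogate pairs into single code points.
        return decoded.codePoints().toArray();
    }

    public static void main(String[] args) {
        // e2 80 99 is U+2019; f0 9f 98 80 is a character outside the BMP.
        for (int cp : toCodePoints("e2 80 99 f0 9f 98 80")) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Using codePoints() avoids handling surrogate pairs by hand, which is the part that is easy to get wrong with charAt.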

Related

remove unicode characters in given example and print relevant data [duplicate]

Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll() method and replacing letters one by one?
Example:
Input: orčpžsíáýd
Output: orcpzsiayd
It doesn't need to include all letters with accents like the Russian alphabet or the Chinese one.

Start with java.text.Normalizer.
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction
This will separate all of the accent marks from most characters. Then, you just need to compare each character against being a letter and throw out the ones that aren't.
string = string.replaceAll("[^\\p{ASCII}]", "");
If your text is in Unicode, you should use this instead:
string = string.replaceAll("\\p{M}", "");
For Unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.
Thanks to GarretWilson for the pointer and regular-expressions.info for the great Unicode guide.
It is important to note that Normalizer by itself is insufficient to remove diacritics. For example, the following will not replace the accented é with the unaccented e:
import static java.text.Normalizer.normalize;
import static java.text.Normalizer.Form.*;
public class T {
public static void main( final String[] args ) {
final var text = "Brévis";
System.out.println(
normalize( text, NFD ) + " " +
normalize( text, NFC ) + " " +
normalize( text, NFKD ) + " " +
normalize( text, NFKC )
);
}
}
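To actually drop the accent, the combining marks produced by the decomposition still have to be removed afterwards. A minimal sketch that combines the two steps described in the answers above (the class name StripMarks is illustrative only):

```java
import java.text.Normalizer;

public class StripMarks {
    // Decompose with NFD, then delete every combining mark (\p{M}).
    public static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "Br\u00e9vis" is "Brévis" written with a Unicode escape.
        System.out.println(strip("Br\u00e9vis"));
    }
}
```

Letters that have no canonical decomposition, such as Ø or Ł, pass through this unchanged.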

As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):
String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
System.out.println(input);
// Prints "This is a funky String"
Note:
The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article for Ø, I'm not sure it should be replaced with "O": it's a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.

The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:
import java.text.Normalizer;
public class Strip {
public static String flattenToAscii(String string) {
StringBuilder sb = new StringBuilder(string.length());
string = Normalizer.normalize(string, Normalizer.Form.NFD);
for (char c : string.toCharArray()) {
if (c <= '\u007F') sb.append(c);
}
return sb.toString();
}
}
Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:
public static String flattenToAscii(String string) {
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    char[] out = new char[string.length()];
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    // Only the first j chars were written; new String(out) would keep trailing '\u0000' chars
    return new String(out, 0, j);
}
This variation has the advantage of the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).

EDIT: If you're not stuck with a Java version older than 6, and speed is not critical and/or a translation table is too limiting, use the answer by David. The point is to use Normalizer (introduced in Java 6) instead of a translation table inside the loop.
While this is not a "perfect" solution, it works well when you know the range (in our case Latin-1 and Latin-2), it worked before Java 6 (not a real issue though), and it is much faster than the most commonly suggested version (which may or may not be an issue):
/**
* Mirror of the unicode table from 00c0 to 017f without diacritics.
*/
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
"DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
"aaaaaaaceeeeiiii" +
"\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
"AaAaAaCcCcCcCcDd" +
"DdEeEeEeEeEeGgGg" +
"GgGgHhHhIiIiIiIi" +
"IiJjJjKkkLlLlLlL" +
"lLlNnNnNnnNnOoOo" +
"OoOoRrRrRrSsSsSs" +
"SsTtTtTtUuUuUuUu" +
"UuUuWwYyYZzZzZzF";
/**
* Returns string without diacritics - 7 bit approximation.
*
* @param source string to convert
* @return corresponding string without diacritics
*/
public static String removeDiacritic(String source) {
char[] vysl = new char[source.length()];
char one;
for (int i = 0; i < source.length(); i++) {
one = source.charAt(i);
if (one >= '\u00c0' && one <= '\u017f') {
one = tab00c0.charAt((int) one - '\u00c0');
}
vysl[i] = one;
}
return new String(vysl);
}
Tests on my HW with 32bit JDK show that this performs conversion from àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100ms while Normalizer way makes it in 3.7s (37x slower). In case your needs are around performance and you know the input range, this may be for you.
Enjoy :-)

System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
worked for me. The output of the snippet above gives "aee" which is what I wanted, but
System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));
didn't do any substitution.

Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks
https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics
"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."
Removing them might be inherently changing the meaning of the word, or changing the letters into completely different ones.

I faced the same issue in a String equality check: one of the compared strings contained characters with codes in the range 128-255,
for example a non-breaking space [hex A0] instead of a regular space [hex 20].
To show Non-breaking space over HTML. I have used the following spacing entities. Their character and its bytes are like &emsp is very wide space[ ]{-30, -128, -125}, &ensp is somewhat wide space[ ]{-30, -128, -126}, &thinsp is narrow space[ ]{32} , Non HTML Space {}
String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes()));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes()));
Output in Bytes:
S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]
The code below shows different spaces and their byte codes (see the Wikipedia page List_of_Unicode_characters):
String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
System.out.println("Space String :"+ spacing_entities);
byte[] byteArray =
// spacing_entities.getBytes( Charset.forName("UTF-8") );
// Charset.forName("UTF-8").encode( s2 ).array();
{-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
try {
System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
➩ ASCII transliterations of Unicode string for Java. unidecode
String initials = Unidecode.decode( s2 );
➩ using Guava: Google Core Libraries for Java.
String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " );
To URL-encode the space, use the Guava library.
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
➩ To overcome this problem, use String.replaceAll() with a regular expression.
// \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
s2 = s2.replaceAll("\\p{Zs}", " ");
s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
s2 = s2.replaceAll(" ", " ");
➩ Using java.text.Normalizer.Form.
This enum provides constants of the four Unicode normalization forms that are described in Unicode Standard Annex #15 — Unicode Normalization Forms and two methods to access them.
s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
Testing String and outputs on different approaches like ➩ Unidecode, Normalizer, StringUtils.
String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";
// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );
// Following Produce this o/p: Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");
String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );
Using Unidecode is the best choice. My final code is shown below.
public static void main(String[] args) {
String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
String initials = Unidecode.decode( s2 );
if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
System.out.println("Equal Unicode Strings");
} else if( s1.equals( initials ) ) {
System.out.println("Equal Non Unicode Strings");
} else {
System.out.println("Not Equal");
}
}

I suggest Junidecode . It will handle not only 'Ł' and 'Ø', but it also works well for transcribing from other alphabets, such as Chinese, into Latin alphabet.

One of the best ways, using regex and Normalizer, if you have no library is:
public String flattenToAscii(String s) {
if(s == null || s.trim().length() == 0)
return "";
return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
}
This is more efficient than replaceAll("[^\\p{ASCII}]", "") if you only need to remove diacritics (just like in your example).
Otherwise, you have to use the \\p{ASCII} pattern.
Regards.

This solution is already available as StringUtils.stripAccents() from the Maven repository, and it works for Ł, as mentioned by @DavidS.
But I needed it to work for both Ø and Ł, so I modified it as below. It may be helpful for others too.
Update
This is a modified version of StringUtils.stripAccents(String obj) that keeps the old functionality and also handles both the Ø and Ł chars.
public static String stripAccents(final String input) {
if (input == null) {
return null;
}
final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
for (int i = 0; i < decomposed.length(); i++) {
if (decomposed.charAt(i) == '\u0141') {
decomposed.setCharAt(i, 'L');
} else if (decomposed.charAt(i) == '\u0142') {
decomposed.setCharAt(i, 'l');
}else if (decomposed.charAt(i) == '\u00D8') {
decomposed.setCharAt(i, 'O');
}else if (decomposed.charAt(i) == '\u00F8') {
decomposed.setCharAt(i, 'o');
}
}
// Note that this doesn't correctly remove ligatures...
return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}
Input string Ł Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Ø ø
output string L This is a funky String O o

@David Conrad's solution is the fastest I tried using the Normalizer, but it does have a bug. It strips characters which are not accents; for example, Chinese characters and other letters like æ are all stripped.
The characters that we want to strip are non spacing marks, characters which don't take up extra width in the final string. These zero width characters basically end up combined in some other character. If you can see them isolated as a character, for example like this `, my guess is that it's combined with the space character.
public static String flattenToAscii(String string) {
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);
    // Size the buffer from the normalized text: NFD output can be longer than the input
    char[] out = new char[norm.length()];
    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);
        //by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
        if (type != Character.NON_SPACING_MARK) {
            out[j] = c;
            j++;
        }
    }
    // Return only the j chars actually written, without trailing '\u0000' padding
    return new String(out, 0, j);
}

I think the best solution is converting each char to hex and replacing it with another hex value. This is because there are two ways of typing Unicode:
Composite Unicode
Precomposed Unicode
For example "Ồ" written by Composite Unicode is different from "Ồ" written by Precomposed Unicode. You can copy my sample chars and convert them to see the difference.
In Composite Unicode, "Ồ" is combined from 2 char: Ô (U+00d4) and ̀ (U+0300)
In Precomposed Unicode, "Ồ" is single char (U+1ED2)
I developed this feature for some banks to convert the data before sending it to the core banking system (which usually doesn't support Unicode), and faced this issue when end users used different Unicode typing methods to enter the data. So I think converting to hex and replacing it is the most reliable way.
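The difference between the two spellings can be made visible by printing their code points. The sketch below is an illustration only and is not the hex-replacement approach described above: it swaps in java.text.Normalizer to bring both spellings to one form (NFC) before comparing. The class and method names are hypothetical.

```java
import java.text.Normalizer;

public class ComposedVsDecomposed {
    // Renders each code point of s as U+XXXX, separated by spaces.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Normalizes to the composed form (NFC).
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composite = "\u00D4\u0300"; // O with circumflex + combining grave accent
        String precomposed = "\u1ED2";     // the single precomposed character
        System.out.println(describe(composite));
        System.out.println(describe(precomposed));
        // The raw strings differ, but their NFC forms are identical.
        System.out.println(composite.equals(precomposed));
        System.out.println(nfc(composite).equals(nfc(precomposed)));
    }
}
```

Normalizing all input to a single form before any lookup or replacement means the later steps only have to deal with one spelling per character.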

A fast and safer way
public static String removeDiacritics(String str) {
if (str == null)
return null;
if (str.isEmpty())
return "";
int len = str.length();
StringBuilder sb
= new StringBuilder(len);
//iterate string codepoints
for (int i = 0; i < len; ) {
int codePoint = str.codePointAt(i);
int charCount
= Character.charCount(codePoint);
if (charCount > 1) {
for (int j = 0; j < charCount; j++)
sb.append(str.charAt(i + j));
i += charCount;
continue;
}
else if (codePoint <= 127) {
sb.append((char)codePoint);
i++;
continue;
}
sb.append(
java.text.Normalizer
.normalize(
Character.toString((char)codePoint),
java.text.Normalizer.Form.NFD)
.charAt(0));
i++;
}
return sb.toString();
}

Faced the same issue, here's solution using Kotlin extension
val String.stripAccents: String
get() = Regex("\\p{InCombiningDiacriticalMarks}+")
.replace(
Normalizer.normalize(this, Normalizer.Form.NFD),
""
)
usage
val textWithoutAccents = "some accented string".stripAccents

In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies I also use .toUpperCase() and .trim(). Then I call this function:
fun stripAccents(s: String):String{
if (s == null) {
return "";
}
val chars: CharArray = s.toCharArray()
var sb = StringBuilder(s)
var cont: Int = 0
while (chars.size > cont) {
var c: kotlin.Char
c = chars[cont]
var c2:String = c.toString()
//these cover my needs; if you need to convert other accents, just add new entries here
c2 = c2.replace("Ã", "A")
c2 = c2.replace("Õ", "O")
c2 = c2.replace("Ç", "C")
c2 = c2.replace("Á", "A")
c2 = c2.replace("Ó", "O")
c2 = c2.replace("Ê", "E")
c2 = c2.replace("É", "E")
c2 = c2.replace("Ú", "U")
c = c2.single()
sb.setCharAt(cont, c)
cont++
}
return sb.toString()
}
To use this function, call it like this:
var str: String
str = editText.text.toString() //get the text from EditText
str = str.toUpperCase().trim()
str = stripAccents(str) //call the function

Converting text to binary in java

I want to convert every character of a String to a new binary String. Here is what I do :
public static void main(String args[]) {
String MESSAGE = "%";
String binaryResult = "";
for (char c : MESSAGE.toCharArray()){
binaryResult += Integer.toBinaryString( (int) c);
}
System.err.println(binaryResult);
}
For example, with the input "%", I get the following output: "100101"
My problem is that the leading "0" is deleted ...
I want to have : "0100101". Does anyone have ideas?

What you're really saying is "How can I pad my binary string representation of a character to 7 digits"?
Replace this line:
binaryResult += Integer.toBinaryString( (int) c);
With these:
String binString = Integer.toBinaryString( (int) c );
binaryResult += ("0000000" + binString).substring(binString.length());
This presumes that you only have 7-bit characters... if you need more, then add 0's to the "0000000" string to match the length of string (with padded 0s) you want.
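An alternative to the substring trick is an explicit padding loop with a configurable width. This is a minimal sketch; the class name BinaryPad and the method toBinary are illustrative and do not come from the answers:

```java
public class BinaryPad {
    // Appends the binary form of each char, left-padded with zeros to `width` digits.
    // Chars that need more than `width` bits are appended unpadded.
    public static String toBinary(String message, int width) {
        StringBuilder sb = new StringBuilder();
        for (char c : message.toCharArray()) {
            String bits = Integer.toBinaryString(c);
            for (int i = bits.length(); i < width; i++) {
                sb.append('0');
            }
            sb.append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary("%", 7)); // prints "0100101"
    }
}
```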

I would suggest a couple of changes to your existing code. Since you are concatenating to a string inside a loop, a bunch of new String objects get created, because strings are immutable. This can be solved by using a StringBuilder.
public static void main(String args[]) {
String MESSAGE = "%";
StringBuilder binaryResult = new StringBuilder();
for (char c : MESSAGE.toCharArray()) {
StringBuilder curValue = new StringBuilder(Integer.toBinaryString((int)c));
// calculate padding 0 bits to fill to 8 bits
int paddingLength = 8 - curValue.length();
char[] paddingArr = new char[paddingLength];
Arrays.fill(paddingArr, '0');
// insert padding bytes to the front
curValue.insert(0, paddingArr);
// add to stringbuilder for `MESSAGE`
binaryResult.append(curValue);
}
System.err.println(binaryResult.toString());
}

how to detect base64 encoded strings? [duplicate]

I want to decode a Base64 encoded string, then store it in my database. If the input is not Base64 encoded, I need to throw an error.
How can I check if a string is Base64 encoded?

You can use the following regular expression to check if a string constitutes a valid base64 encoding:
^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$
In base64 encoding, the character set is [A-Z, a-z, 0-9, and + /]. If the final group is shorter than 4 characters, it is padded with '=' characters.
^([A-Za-z0-9+/]{4})* means the string starts with 0 or more base64 groups of 4 characters.
([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$ means the string optionally ends with one padded group, in one of two forms: [A-Za-z0-9+/]{3}= or [A-Za-z0-9+/]{2}==.
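As a rough illustration, the pattern above can be applied in Java with java.util.regex. The class and method names here are hypothetical, and the check only tells you that the string has the shape of standard Base64, not that it was produced by encoding:

```java
import java.util.regex.Pattern;

public class Base64Regex {
    // Full groups of 4 characters, optionally followed by one padded group.
    private static final Pattern BASE64 = Pattern.compile(
            "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$");

    public static boolean looksLikeBase64(String s) {
        // matches() requires the whole input to match the pattern.
        return BASE64.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeBase64("QUJD"));
        System.out.println(looksLikeBase64("QUJ"));
    }
}
```

Note that this pattern also accepts the empty string, so add a length check if empty input should be rejected.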

If you are using Java, you can actually use commons-codec library
import org.apache.commons.codec.binary.Base64;
String stringToBeChecked = "...";
boolean isBase64 = Base64.isArrayByteBase64(stringToBeChecked.getBytes());
[UPDATE 1] Deprecation Notice
Use instead
Base64.isBase64(value);
/**
* Tests a given byte array to see if it contains only valid characters within the Base64 alphabet. Currently the
* method treats whitespace as valid.
*
* @param arrayOctet
* byte array to test
* @return {@code true} if all bytes are valid characters in the Base64 alphabet or if the byte array is empty;
* {@code false}, otherwise
* @deprecated 1.5 Use {@link #isBase64(byte[])}, will be removed in 2.0.
*/
@Deprecated
public static boolean isArrayByteBase64(final byte[] arrayOctet) {
return isBase64(arrayOctet);
}

Well you can:
Check that the length is a multiple of 4 characters
Check that every character is in the set A-Z, a-z, 0-9, +, / except for padding at the end which is 0, 1 or 2 '=' characters
If you're expecting that it will be base64, then you can probably just use whatever library is available on your platform to try to decode it to a byte array, throwing an exception if it's not valid base 64. That depends on your platform, of course.
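The two checks listed above can also be written without a regular expression. The following is a sketch under the assumption that the standard alphabet (+ and /) with = padding is expected; the names Base64Shape and hasBase64Shape are made up for this example:

```java
public class Base64Shape {
    public static boolean hasBase64Shape(String s) {
        int len = s.length();
        // Check 1: the length must be a multiple of 4.
        if (len % 4 != 0) {
            return false;
        }
        // Skip at most two '=' padding characters at the end.
        int end = len;
        while (end > 0 && len - end < 2 && s.charAt(end - 1) == '=') {
            end--;
        }
        // Check 2: everything before the padding must be in the alphabet.
        for (int i = 0; i < end; i++) {
            char c = s.charAt(i);
            boolean ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '+' || c == '/';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasBase64Shape("QQ=="));
        System.out.println(hasBase64Shape("Q==="));
    }
}
```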

As of Java 8, you can simply use java.util.Base64 to try and decode the string:
String someString = "...";
Base64.Decoder decoder = Base64.getDecoder();
try {
decoder.decode(someString);
} catch(IllegalArgumentException iae) {
// That string wasn't valid.
}

Try like this for PHP5
//where $json is some data that can be base64 encoded
$json=some_data;
//this will check whether data is base64 encoded or not
if (base64_decode($json, true) !== false)
{
echo "base64 encoded";
}
else
{
echo "not base64 encoded";
}
Use this for PHP7
//$string parameter can be base64 encoded or not
function is_base64_encoded($string){
//this will check if $string is base64 encoded and return true, if it is.
if (base64_decode($string, true) !== false){
return true;
}else{
return false;
}
}

var base64Regex = /^(?:[A-Z0-9+\/]{4})*(?:[A-Z0-9+\/]{2}==|[A-Z0-9+\/]{3}=|[A-Z0-9+\/]{4})$/i;
var isBase64Valid = base64Regex.test(base64Data); // base64Data is the base64 string
if (isBase64Valid) {
// true if in base64 format
console.log('It is base64');
} else {
// false if not in base64 format
console.log('it is not in base64');
}

Try this:
public void checkForEncode(String string) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(string);
if (m.find()) {
System.out.println("true");
} else {
System.out.println("false");
}
}

It is impossible to check if a string is base64 encoded or not. It is only possible to validate if that string is of a base64 encoded string format, which would mean that it could be a string produced by base64 encoding (to check that, string could be validated against a regexp or a library could be used, many other answers to this question provide good ways to check this, so I won't go into details).
For example, string flow is a valid base64 encoded string. But it is impossible to know if it is just a simple string, an English word flow, or is it base 64 encoded string ~Z0
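The point is easy to reproduce with java.util.Base64: the ordinary word flow decodes without any error. A small sketch (the class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbiguousBase64 {
    // Decodes s as standard Base64 and interprets the bytes as ASCII text.
    public static String decodeToAscii(String s) {
        byte[] bytes = Base64.getDecoder().decode(s);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "flow" is an English word, yet it is also well-formed Base64.
        System.out.println(decodeToAscii("flow")); // prints "~Z0"
    }
}
```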

There are many variants of Base64, so consider just determining if your string resembles the variant you expect to handle. As such, you may need to adjust the regex below with respect to the index and padding characters (i.e. +, /, =).
class String
def resembles_base64?
self.length % 4 == 0 && self =~ /^[A-Za-z0-9+\/=]+\Z/
end
end
Usage:
raise 'the string does not resemble Base64' unless my_string.resembles_base64?

Check whether the string's length is a multiple of 4. Afterwards, use this regex to make sure all characters in the string are base64 characters.
\A[a-zA-Z\d\/+]+={,2}\z
If the library you use adds a newline as a way of observing the 76 max chars per line rule, replace them with empty strings.
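In Java, removing the line breaks before validating could look like the sketch below. It assumes the wrapping uses CR and/or LF characters; the class and method names are hypothetical.

```java
import java.util.Base64;

public class WrappedBase64 {
    // Removes CR/LF line breaks, e.g. those inserted by 76-column wrapping.
    public static String unwrap(String s) {
        return s.replace("\r", "").replace("\n", "");
    }

    // True if the unwrapped text is accepted by the standard Base64 decoder.
    public static boolean decodes(String s) {
        try {
            Base64.getDecoder().decode(unwrap(s));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodes("QU\r\nJD"));
        System.out.println(decodes("QU JD"));
    }
}
```

java.util.Base64.getMimeDecoder() is another option for wrapped input, but it also ignores other characters outside the Base64 alphabet, so it is less strict as a validity check.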

/^([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)$/
This regular expression helped me identify Base64 in my Rails application. I only had one problem: it also matches the string "errorDescripcion", which caused an error. To solve it, I additionally validate the length of the string.

For Flutter, I tested couple of the above comments and translated that into dart function as follows
static bool isBase64(dynamic value) {
if (value.runtimeType == String){
final RegExp rx = RegExp(r'^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$',
multiLine: true,
unicode: true,
);
final bool isBase64Valid = rx.hasMatch(value);
if (isBase64Valid == true) {return true;}
else {return false;}
}
else {return false;}
}

In Java below code worked for me:
public static boolean isBase64Encoded(String s) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(s);
return m.find();
}

This works in Python:
import base64
def IsBase64(str):
    try:
        base64.b64decode(str)
        return True
    except Exception as e:
        return False

if IsBase64("ABC"):
    print("ABC is Base64-encoded and its result after decoding is: " + str(base64.b64decode("ABC")).replace("b'", "").replace("'", ""))
else:
    print("ABC is NOT Base64-encoded.")
if IsBase64("QUJD"):
    print("QUJD is Base64-encoded and its result after decoding is: " + str(base64.b64decode("QUJD")).replace("b'", "").replace("'", ""))
else:
    print("QUJD is NOT Base64-encoded.")
Summary: IsBase64("string here") returns true if string here is Base64-encoded, and it returns false if string here was NOT Base64-encoded.

C#
This is performing great:
static readonly Regex _base64RegexPattern = new Regex(BASE64_REGEX_STRING, RegexOptions.Compiled);
private const String BASE64_REGEX_STRING = @"^[a-zA-Z0-9\+/]*={0,3}$";
private static bool IsBase64(this String base64String)
{
var rs = (!string.IsNullOrEmpty(base64String) && !string.IsNullOrWhiteSpace(base64String) && base64String.Length != 0 && base64String.Length % 4 == 0 && !base64String.Contains(" ") && !base64String.Contains("\t") && !base64String.Contains("\r") && !base64String.Contains("\n")) && (base64String.Length % 4 == 0 && _base64RegexPattern.Match(base64String, 0).Success);
return rs;
}

There is no way to distinguish a plain string from a base64-encoded one, unless the strings in your system have some specific limitation or identifying marker.

This snippet may be useful when you know the length of the original content (e.g. a checksum). It checks that encoded form has the correct length.
public static boolean isValidBase64( final int initialLength, final String string ) {
final int padding ;
final String regexEnd ;
switch( ( initialLength ) % 3 ) {
case 1 :
padding = 2 ;
regexEnd = "==" ;
break ;
case 2 :
padding = 1 ;
regexEnd = "=" ;
break ;
default :
padding = 0 ;
regexEnd = "" ;
}
final int encodedLength = ( ( ( initialLength / 3 ) + ( padding > 0 ? 1 : 0 ) ) * 4 ) ;
final String regex = "[a-zA-Z0-9/\\+]{" + ( encodedLength - padding ) + "}" + regexEnd ;
return Pattern.compile( regex ).matcher( string ).matches() ;
}

If the RegEx does not work and you know the format style of the original string, you can reverse the logic, by regexing for this format.
For example, I work with base64-encoded XML files and just check whether the file contains valid XML markup. If it does not, I can assume that it's base64 encoded. This is not very dynamic, but it works fine for my small application.

This works in Python:
import re

def is_base64(string):
    # re.match anchors at the start of the string; \Z anchors at the very end
    return len(string) % 4 == 0 and re.match(r'^[A-Za-z0-9+/=]+\Z', string) is not None

Try this using a previously mentioned regex:
String regex = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
if("TXkgdGVzdCBzdHJpbmc/".matches(regex)){
System.out.println("it's a Base64");
}
...We can also make a simple validation like, if it has spaces it cannot be Base64:
String myString = "Hello World";
if(myString.contains(" ")){
System.out.println("Not B64");
}else{
System.out.println("Could be B64 encoded, since it has no spaces");
}

if when decoding we get a string with ASCII characters, then the string was
not encoded
(RoR) ruby solution:
def encoded?(str)
Base64.decode64(str.downcase).scan(/[^[:ascii:]]/).count.zero?
end
def decoded?(str)
Base64.decode64(str.downcase).scan(/[^[:ascii:]]/).count > 0
end

Function Check_If_Base64(ByVal msgFile As String) As Boolean
Dim I As Long
Dim Buffer As String
Dim Car As String
Check_If_Base64 = True
Buffer = Leggi_File(msgFile)
Buffer = Replace(Buffer, vbCrLf, "")
For I = 1 To Len(Buffer)
Car = Mid(Buffer, I, 1)
If (Car < "A" Or Car > "Z") _
And (Car < "a" Or Car > "z") _
And (Car < "0" Or Car > "9") _
And (Car <> "+" And Car <> "/" And Car <> "=") Then
Check_If_Base64 = False
Exit For
End If
Next I
End Function
Function Leggi_File(PathAndFileName As String) As String
Dim FF As Integer
FF = FreeFile()
Open PathAndFileName For Binary As #FF
Leggi_File = Input(LOF(FF), #FF)
Close #FF
End Function

import java.util.Base64;
public static String encodeBase64(String s) {
return Base64.getEncoder().encodeToString(s.getBytes());
}
public static String decodeBase64(String s) {
try {
if (isBase64(s)) {
return new String(Base64.getDecoder().decode(s));
} else {
return s;
}
} catch (Exception e) {
return s;
}
}
public static boolean isBase64(String s) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(s);
return m.find();
}

For Java flavour I actually use the following regex:
"([A-Za-z0-9+]{4})*([A-Za-z0-9+]{3}=|[A-Za-z0-9+]{2}(==){0,2})?"
This also have the == as optional in some cases.
Best!

I tried this one, and yes, it works:
^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$
but I added a condition to check that the string contains at least one = character:
string.lastIndexOf("=") >= 0

Encode only specific characters in String

I have to encode only some special characters in a string to numeric value.
Say,
String name = "test $#";
I want to encode only characters $ and # in the above string. I tried using below code but it did not work out.
String encode = URLEncoder.encode(StringEscapeUtils.escapeJava(name), "UTF-8");
The encoded value will be like, for white space the encoded value is &#160

How about splitting the String with the String#split method, using a space as the regex? The last item of the returned array contains the symbols you need :)
String name = "test $#";
String[] nameSplittedArr = name.split(" ");
String yourChars = nameSplittedArr[nameSplittedArr.length - 1]; // indexes start from zero
That should work :)

As per the comments, I think you are after a customized encoding function. Something like:
public static String EncodeString(String text) {
StringBuffer sb = new StringBuffer();
for (char c : text.toCharArray()) {
if (Character.isLetterOrDigit(c)) {
sb.append(c);
} else {
sb.append("&#" + (int)c + ";");
}
}
return sb.toString();
}
An example of this is here.
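If only a fixed set of characters such as $ and # should be encoded, as the question asks, the function above can be narrowed to an explicit list. This is a sketch only; the class name SelectiveEncoder, the method encodeOnly, and its charsToEncode parameter are assumptions made for illustration:

```java
public class SelectiveEncoder {
    // Replaces only the characters listed in charsToEncode with a
    // numeric character reference (&#NN;); everything else is kept as-is.
    public static String encodeOnly(String text, String charsToEncode) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (charsToEncode.indexOf(c) >= 0) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeOnly("test $#", "$#")); // prints "test &#36;&#35;"
    }
}
```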

How to convert ASCII to hexadecimal values in java

How to convert ASCII to hexadecimal values in java.
For example:
ASCII: 31 32 2E 30 31 33
Hex: 12.013

You did not convert ASCII to hexadecimal value. You had char values in hexadecimal, and you wanted to convert it to a String is how I'm interpreting your question.
String s = new String(new char[] {
0x31, 0x32, 0x2E, 0x30, 0x31, 0x33
});
System.out.println(s); // prints "12.013"
If perhaps you're given the string, and you want to print its char as hex, then this is how to do it:
for (char ch : "12.013".toCharArray()) {
System.out.print(Integer.toHexString(ch) + " ");
} // prints "31 32 2e 30 31 33 "
You can also use the %H format string:
for (char ch : "12.013".toCharArray()) {
System.out.format("%H ", ch);
} // prints "31 32 2E 30 31 33 "
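If the hex values arrive as one space-separated string rather than as literals in the source, they can be parsed first. A minimal sketch of both directions, assuming each value fits in a single char; the class and method names are made up for this example:

```java
public class HexText {
    // "31 32 2E 30 31 33" -> "12.013"
    public static String fromHex(String hex) {
        StringBuilder sb = new StringBuilder();
        for (String part : hex.trim().split("\\s+")) {
            sb.append((char) Integer.parseInt(part, 16));
        }
        return sb.toString();
    }

    // "12.013" -> "31 32 2E 30 31 33"
    public static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", (int) ch));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fromHex("31 32 2E 30 31 33")); // prints "12.013"
        System.out.println(toHex("12.013"));              // prints "31 32 2E 30 31 33"
    }
}
```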

It's not entirely clear what you are asking, since your "hex" string is actually in decimal. I believe you are trying to take an ASCII string representing a double and to get its value in the form of a double, in which case using Double.parseDouble should be sufficient for your needs. If you need to output a hex string of the double value, then you can use Double.toHexString. Note you need to catch NumberFormatException, whenever you invoke one of the primitive wrapper class's parse functions.
byte[] ascii = {(byte)0x31, (byte)0x32, (byte)0x2E, (byte)0x30, (byte)0x31, (byte)0x33};
String decimalstr = new String(ascii,"US-ASCII");
double val = Double.parseDouble(decimalstr);
String hexstr = Double.toHexString(val);
