Binary to text in Java

I have a String containing binary digits (1110100) and I want to get the text out so I can print it (1110100 should print "t"). I tried the following, which is similar to what I used to turn my text into binary, but it is not working at all:
public static String toText(String info)throws UnsupportedEncodingException{
byte[] encoded = info.getBytes();
String text = new String(encoded, "UTF-8");
System.out.println("print: "+text);
return text;
}
Any corrections or suggestions would be much appreciated.
Thanks!

You can use Integer.parseInt with a radix of 2 (binary) to convert the binary string to an integer:
int charCode = Integer.parseInt(info, 2);
Then if you want the corresponding character as a string:
String str = Character.toString((char) charCode); // the Character(char) constructor is deprecated since Java 9
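A minimal sketch of this approach, assuming the whole string encodes a single character (the class and method names are made up for illustration):

```java
public class SingleCharDecoder {
    // Hypothetical helper: parses a binary string such as "1110100" as one character code.
    public static String toText(String bits) {
        // Integer.parseInt throws NumberFormatException if the input is not binary digits.
        int charCode = Integer.parseInt(bits, 2);
        return String.valueOf((char) charCode);
    }

    public static void main(String[] args) {
        // 1110100 in binary is 116 in decimal, which is the code for 't'.
        System.out.println(toText("1110100"));
    }
}
```

Leading zeros do not matter here, so "01110100" gives the same result.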

This is my one (Working fine on Java 8):
String input = "01110100"; // Binary input as String
StringBuilder sb = new StringBuilder(); // Some place to store the chars
Arrays.stream( // Create a Stream
input.split("(?<=\\G.{8})") // Splits the input string into 8-character sections (each character is encoded as 8 bits = 1 byte here)
).forEach(s -> // Go through each 8-char-section...
sb.append((char) Integer.parseInt(s, 2)) // ...and turn it into an int and then to a char
);
String output = sb.toString(); // Output text (t)
and the compressed method printing to console:
Arrays.stream(input.split("(?<=\\G.{8})")).forEach(s -> System.out.print((char) Integer.parseInt(s, 2)));
System.out.print('\n');
I am sure there are "better" ways to do this but this is the smallest one you can probably get.

I know the OP stated that their binary was in a String format but for the sake of completeness I thought I would add a solution to convert directly from a byte[] to an alphabetic String representation.
As casablanca stated, you basically need to obtain the numerical representation of each character. Anything longer than a single character will probably arrive as a byte[]; instead of converting that to a String and looping to append the character for each byte, you can let ByteBuffer and CharBuffer do the lifting for you:
public static String bytesToAlphabeticString(byte[] bytes) {
CharBuffer cb = ByteBuffer.wrap(bytes).asCharBuffer();
return cb.toString();
}
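N.B. asCharBuffer() reads the bytes as big-endian UTF-16 code units (two bytes per char), so it does not work for single-byte ASCII data.
Alternatively, use the String constructor with an explicit charset: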
String text = new String(bytes, 0, bytes.length, "ASCII");
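To make the difference between the two options concrete, here is a small sketch (the class name, method names and sample bytes are illustrative assumptions). asCharBuffer() pairs the bytes up as big-endian 16-bit chars, whereas the String constructor decodes with whichever charset you name:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BytesToText {
    // Interprets every two bytes as one big-endian UTF-16 code unit.
    public static String viaCharBuffer(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    // Interprets every byte as one ASCII character.
    public static String viaAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] ascii = { 0x74, 0x6F };             // "to" as single-byte ASCII
        byte[] utf16 = { 0x00, 0x74, 0x00, 0x6F }; // "to" as big-endian UTF-16
        System.out.println(viaAscii(ascii));
        System.out.println(viaCharBuffer(utf16));
        // Feeding the ASCII bytes to viaCharBuffer would merge them into the single char U+746F.
    }
}
```

Using a StandardCharsets constant also avoids the checked UnsupportedEncodingException that the charset-name overload declares.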

public static String binaryToText(String binary) {
return Arrays.stream(binary.split("(?<=\\G.{8})"))/* regex to split the bits array by 8*/
.parallel()
.map(eightBits -> (char)Integer.parseInt(eightBits, 2))
.collect(
StringBuilder::new,
StringBuilder::append,
StringBuilder::append
).toString();
}

Here is a helper that splits a string into chunks of a given size, using the same regex as above:
private String[] splitByNumber(String s, int size) {
return s.split("(?<=\\G.{"+size+"})");
}

The other way around (Where "info" is the input text and "s" the binary version of it)
byte[] bytes = info.getBytes();
BigInteger bi = new BigInteger(bytes);
String s = bi.toString(2);
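Note that BigInteger.toString(2) drops leading zero bits, which is why "t" comes out as 1110100 (7 digits) rather than 01110100. If fixed 8-bit groups are wanted, so that the result can be split back into characters, something like the following sketch may help. It assumes the text fits in ISO-8859-1 (one byte per character); the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class TextToBinary {
    // Encodes each byte of the text as exactly 8 binary digits.
    public static String toBinary(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.ISO_8859_1)) {
            // Mask with 0xFF so bytes >= 0x80 are not sign-extended to 32 bits.
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros up to 8 digits.
            for (int i = bits.length(); i < 8; i++) {
                sb.append('0');
            }
            sb.append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary("t"));
    }
}
```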

Look at the parseInt function. You may also need a cast and the Character.toString function.

Also you can use alternative solution without streams and regular expressions (based on casablanca's answer):
public static String binaryToText(String binaryString) {
StringBuilder stringBuilder = new StringBuilder();
int charCode;
for (int i = 0; i < binaryString.length(); i += 8) {
charCode = Integer.parseInt(binaryString.substring(i, Math.min(i + 8, binaryString.length())), 2); // Math.min guards a final group shorter than 8 bits
String returnChar = Character.toString((char) charCode);
stringBuilder.append(returnChar);
}
return stringBuilder.toString();
}
You just need to append each decoded character, as a string, to the character sequence.

Related

How to convert between UTF-8 and native String in Java?

As in the picture, I'd like to convert between the encoded UTF-8 String and the native String in Java.
Does anyone have any suggestions? Thanks a lot!
ps.
For example,
String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
String b = null;
// block A: processing a, and let b = "这是一个例子,this is a example"
How to implement the "block A"?
Apache Commons Lang StringEscapeUtils.unescapeXml(...) is what you want. Depending on where your original string came from, one of the HTML variants may be more appropriate.
Use like so:
String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
String b = StringEscapeUtils.unescapeXml(a);
// block A: processing a, and let b = "这是一个例子,this is a example"
System.out.println(a);
System.out.println(b);
Output:
&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example
这是一个例子,this is a example
There are methods for converting the other way also.
You can use Charset (see the java.nio.charset.Charset documentation):
Charset.forName("UTF-8").encode(text)
Or you can use the getBytes() method of the java.lang.String class:
text.getBytes(Charset.forName("UTF-8"));
Documentation:
public byte[] getBytes(Charset charset)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.
Parameters: charset - The Charset to be used to encode the String
Returns: The resultant byte array
Since: 1.6
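As a rough illustration of what getBytes(Charset) and its inverse do, here is a round-trip sketch (class and method names are assumptions for the example). Note that this converts between a String and raw UTF-8 bytes; it does not produce or parse HTML entities:

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // String -> UTF-8 bytes
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 bytes -> String
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u8FD9\u662F,abc"; // two CJK characters, a comma and three ASCII letters
        byte[] utf8 = encode(text);
        // Each of the two CJK characters takes 3 bytes in UTF-8; the ASCII part takes 1 byte each.
        System.out.println(utf8.length);
        System.out.println(decode(utf8).equals(text));
    }
}
```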
The strings on the right of the picture are hexadecimal numeric HTML entities.
The Apache Commons library has StringEscapeUtils, which can convert from those to a String, but the reverse is less obvious (it should be tried; it might produce named entities).
public static void main(String[] args) {
String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
String b = fromHtmlEntities(a);
System.out.println(b);
String a2 = toHtmlEntities(b);
System.out.println(a2.equals(a));
System.out.println(a);
System.out.println(a2);
}
public static String fromHtmlEntities(String s) {
Pattern numericEntityPattern = Pattern.compile("\\&#[Xx]([0-9A-Fa-f]{1,6});");
Matcher m = numericEntityPattern.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find()) {
int codePoint = Integer.parseInt(m.group(1), 16);
String replacement = new String(new int[] { codePoint }, 0, 1);
m.appendReplacement(sb, Matcher.quoteReplacement(replacement)); // quote so a decoded '$' or '\' is taken literally
}
m.appendTail(sb);
return sb.toString();
}
// Uses java 8
public static String toHtmlEntities(String s) {
int[] codePoints = s.codePoints().flatMap(
(cp) -> cp < 128 // ASCII?
? IntStream.of(cp)
: String.format("&#x%X;", cp).codePoints())
.toArray();
return new String(codePoints, 0, codePoints.length);
}

Decode an escaped string from VBScript in Java

I tried to decode the following string,
String str = "AT%26amp%3BT%20Network%20Client%20%u2013%20IBM";
System.out.println(StringEscapeUtils.unescapeHtml(str));
try {
System.out.println("res:"+java.net.URLDecoder.decode(str, "UTF-8"));
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Both methods fail as below,
AT%26amp%3BT%20Network%20Client%20%u2013%20IBM
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u2"
at java.net.URLDecoder.decode(URLDecoder.java:173)
at decrypt.DecryptHtml.main(DecryptHtml.java:19)
The source of the string is a VBS script that uses the Escape function. How can I decode this string?
Unfortunately, from reading the documentation, it appears that Microsoft Has Done It Again (tm): "non standard xxx", where here "xxx" is "escaping format".
Specifically, in the documentation of the VBScript function, it is said that:
[...]Unicode characters that have a value greater than 255 are stored using the %uxxxx format.
(Hey, MS: there is no such thing as "Unicode characters"; those are called code points)
Great. So you need your own decoding function.
Fortunately, we use Java. This proprietary escape sequence only covers Unicode code points in the Basic Multilingual Plane (U+0000 to U+FFFF), a char is a UTF-16 code unit, and every BMP code point is encoded as exactly one UTF-16 code unit, so each %uxxxx escape maps to a single char. This makes our job a little easier.
Here is the code:
import java.nio.CharBuffer;
public final class MSUnescaper
{
private static final char PERCENT = '%';
private static final char NONSTANDARD_PCT_ESCAPE = 'u';
private MSUnescaper()
{
}
public static String unescape(final String input)
{
final StringBuilder sb = new StringBuilder(input.length());
final CharBuffer buf = CharBuffer.wrap(input);
char c;
while (buf.hasRemaining()) {
c = buf.get();
if (c != PERCENT) {
sb.append(c);
continue;
}
if (!buf.hasRemaining())
throw new IllegalArgumentException();
c = buf.get();
sb.append(c == NONSTANDARD_PCT_ESCAPE
? msEscape(buf) : standardEscape(buf, c));
}
return sb.toString();
}
private static char standardEscape(final CharBuffer buf, final char c)
{
if (!buf.hasRemaining())
throw new IllegalArgumentException();
final char[] array = { c, buf.get() };
return (char) Integer.parseInt(new String(array), 16);
}
private static char msEscape(final CharBuffer buf)
{
if (buf.remaining() < 4)
throw new IllegalArgumentException();
final char[] array = new char[4];
buf.get(array);
return (char) Integer.parseInt(new String(array), 16);
}
public static void main(final String... args)
{
final String input = "AT%26amp%3BT%20Network%20Client%20%u2013%20IBM";
System.out.println(unescape(input));
}
}
Output:
AT&amp;T Network Client – IBM
String str = "AT%26amp%3BT%20Network%20Client%20%[here]u[here]2013%20IBM"
I think this string is invalid: %u20 is not a valid escape sequence.
If you remove the u from your string, you can decode it.
For reference: w3schools HTML URL encoding

Decode a string in Java

How do I properly decode the following string in Java
http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru
When I use URLDecoder.decode() I get the following error:
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0"
Thanks,
Dave
According to Wikipedia, "there exist a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value".
Continuing: "This behavior is not specified by any RFC and has been rejected by the W3C".
Your URL contains such tokens, and the Java URLDecoder implementation doesn't support those.
%uXXXX encoding is non-standard and was actually rejected by the W3C, so it is natural that URLDecoder does not understand it.
You can write a small function that fixes it by replacing each occurrence of %uXXYY with %XX%YY in your encoded string. Then you can proceed and decode the fixed string normally.
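A caveat on the rewrite just described: turning %uXXYY into %XX%YY only decodes correctly if the whole string is then treated as UTF-16. A different way to sketch such a function, assuming the ordinary %XX escapes in the input are UTF-8, is to re-encode each %uXXXX escape as standard UTF-8 percent escapes and then hand everything to URLDecoder. The class and method names are hypothetical, and surrogate pairs written as two %u escapes are not handled:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PercentUDecoder {
    private static final Pattern PCT_U = Pattern.compile("%u([0-9A-Fa-f]{4})");

    public static String decode(String input) {
        try {
            Matcher m = PCT_U.matcher(input);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                // The four hex digits are one UTF-16 code unit.
                char c = (char) Integer.parseInt(m.group(1), 16);
                // Re-express it as standard UTF-8 percent escapes, e.g. %E2%80%93.
                String standard = URLEncoder.encode(String.valueOf(c), "UTF-8");
                m.appendReplacement(sb, Matcher.quoteReplacement(standard));
            }
            m.appendTail(sb);
            // Note: URLDecoder follows form encoding, so a literal '+' becomes a space.
            return URLDecoder.decode(sb.toString(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("Network%20Client%20%u2013%20IBM"));
    }
}
```

Input that mixes %uXXXX with stray single bytes that are not valid UTF-8 will still come out with replacement characters, so this is a starting point rather than a complete decoder.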
We started with Vartec's solution but found additional issues. This solution works for UTF-16, but it can be changed to return UTF-8. The replaceAll calls are left in for clarity; you can read more at http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript
static public String unescape(String escaped) throws UnsupportedEncodingException
{
// This code is needed so that the UTF-16 won't be malformed
String str = escaped.replaceAll("%0", "%u000");
str = str.replaceAll("%1", "%u001");
str = str.replaceAll("%2", "%u002");
str = str.replaceAll("%3", "%u003");
str = str.replaceAll("%4", "%u004");
str = str.replaceAll("%5", "%u005");
str = str.replaceAll("%6", "%u006");
str = str.replaceAll("%7", "%u007");
str = str.replaceAll("%8", "%u008");
str = str.replaceAll("%9", "%u009");
str = str.replaceAll("%A", "%u00A");
str = str.replaceAll("%B", "%u00B");
str = str.replaceAll("%C", "%u00C");
str = str.replaceAll("%D", "%u00D");
str = str.replaceAll("%E", "%u00E");
str = str.replaceAll("%F", "%u00F");
// Here we split the 4 byte to 2 byte, so that decode won't fail
String [] arr = str.split("%u");
Vector<String> vec = new Vector<String>();
if(!arr[0].isEmpty())
{
vec.add(arr[0]);
}
for (int i = 1 ; i < arr.length ; i++) {
if(!arr[i].isEmpty())
{
vec.add("%"+arr[i].substring(0, 2));
vec.add("%"+arr[i].substring(2));
}
}
str = "";
for (String string : vec) {
str += string;
}
// Here we return the decoded string
return URLDecoder.decode(str,"UTF-16");
}
After having had a good look at the solution presented by @ariy, I created a Java-based solution that is also resilient against encoded characters that have been chopped into two parts (i.e. half of the encoded character is missing). This happens in my use case, where I need to decode long URLs that are sometimes chopped at 2000 characters. See What is the maximum length of a URL in different browsers?
public class Utils {
private static Pattern validStandard = Pattern.compile("%([0-9A-Fa-f]{2})");
private static Pattern choppedStandard = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
private static Pattern validNonStandard = Pattern.compile("%u([0-9A-Fa-f][0-9A-Fa-f])([0-9A-Fa-f][0-9A-Fa-f])");
private static Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");
public static String resilientUrlDecode(String input) {
String cookedInput = input;
if (cookedInput.indexOf('%') > -1) {
// Transform all existing UTF-8 standard into UTF-16 standard.
cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");
// Discard chopped encoded char at the end of the line (there is no way to know what it was)
cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");
// Handle non standard (rejected by W3C) encoding that is used anyway by some
// See: https://stackoverflow.com/a/5408655/114196
if (cookedInput.contains("%u")) {
// Transform all existing non standard into UTF-16 standard.
cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");
// Discard chopped encoded char at the end of the line
cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
}
}
try {
return URLDecoder.decode(cookedInput,"UTF-16");
} catch (UnsupportedEncodingException e) {
// Will never happen because the encoding is hardcoded
return null;
}
}
}

Transform an unicode plain text to common String

I got a Unicode string from an external server like this:
005400610020007400650020007400ED0020007400FA0020003F0020003A0029
and I have to decode it using Java. I know that the '\u' prefix does the magic (i.e. '\u0054' -> 'T'), but I don't know how to transform it into a regular String.
Thanks in advance.
Edit: Thanks to everybody. All the answers work, but I had to choose only one :(
Again, thanks.
It looks like a UTF-16 encoding. Here is a method to transform it:
public static String decode(String hexCodes, String encoding) throws UnsupportedEncodingException {
if (hexCodes.length() % 2 != 0)
throw new IllegalArgumentException("Illegal input length");
byte[] bytes = new byte[hexCodes.length() / 2];
for (int i = 0; i < bytes.length; i++)
bytes[i] = (byte) Integer.parseInt(hexCodes.substring(2 * i, 2 * i + 2), 16);
return new String(bytes, encoding);
}
public static void main(String[] args) throws UnsupportedEncodingException {
String hexCodes = "005400610020007400650020007400ED0020007400FA0020003F0020003A0029";
System.out.println(decode(hexCodes, "UTF-16"));
}
Your example returns "Ta te tí tú ? :)"
You can simply split the String into substrings of length 4 and then use Integer.parseInt(s, 16) to get the numeric value of each. Cast that to a char and build a String out of the results. For the above example you will get:
Ta te tí tú ? :)
It can be interpreted as UTF-16 or as UCS-2 (a sequence of code points, each coded in 2 bytes and shown in hexadecimal); the two are equivalent as long as we do not fall outside the BMP.
An alternative parsing method:
public static String mydecode(String hexCode) {
StringBuilder sb = new StringBuilder();
for(int i=0;i<hexCode.length();i+=4)
sb.append((char)Integer.parseInt(hexCode.substring(i,i+4),16));
return sb.toString();
}
public static void main(String[] args) {
String hexCodes = "005400610020007400650020007400ED0020007400FA0020003F0020003A0029";
System.out.println(mydecode(hexCodes));
}
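On the remark about staying inside the BMP: if each 4-digit group is treated as a UTF-16 code unit rather than a code point, characters outside the BMP still work, because they arrive as a surrogate pair of two groups. A small sketch under that assumption (class name and sample input are made up for illustration):

```java
public class HexUtf16Decoder {
    // Treats every 4 hex digits as one UTF-16 code unit.
    public static String decode(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("Length must be a multiple of 4");
        }
        char[] units = new char[hex.length() / 4];
        for (int i = 0; i < units.length; i++) {
            units[i] = (char) Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return new String(units);
    }

    public static void main(String[] args) {
        // D83D DE00 is the surrogate pair for U+1F600, which lies outside the BMP.
        String s = decode("0054D83DDE00");
        System.out.println(s.length());                      // number of UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // number of code points
    }
}
```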

Java UTF-8 to ASCII conversion with supplements

We accept all sorts of national characters in a UTF-8 string as input, and we need to convert them to an ASCII string on output for some legacy use. (We don't accept Chinese or Japanese characters, only European languages.)
We have a small utility to get rid of all the diacritics:
public static final String toBaseCharacters(final String sText) {
if (sText == null || sText.length() == 0)
return sText;
final char[] chars = sText.toCharArray();
final int iSize = chars.length;
final StringBuilder sb = new StringBuilder(iSize);
for (int i = 0; i < iSize; i++) {
String sLetter = new String(new char[] { chars[i] });
sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD); // NFD decomposes into base letter + combining marks; NFC would keep accented letters composed
try {
byte[] bLetter = sLetter.getBytes("UTF-8");
sb.append((char) bLetter[0]);
} catch (UnsupportedEncodingException e) {
}
}
return sb.toString();
}
The question is how to replace characters such as the German sharp s (ß), Đ and đ, and other characters that get through the above normalization method, with their supplements (in the case of ß, the supplement would probably be "ss", and in the case of Đ it would be either "D" or "Dj").
Is there some simple way to do it, without million of .replaceAll() calls?
So for example: Đonardan = Djonardan, Blaß = Blass and so on.
We can replace all "problematic" chars with empty space, but would like to avoid this to make the output as similar to the input as possible.
Thank you for your answers,
Bozo
You want to use ICU4J. It includes the com.ibm.icu.text.Transliterator class, which apparently can do what you are looking for.
Here's my converter which uses lucene...
private final KeywordTokenizer keywordTokenizer = new KeywordTokenizer(new StringReader(""));
private final ASCIIFoldingFilter asciiFoldingFilter = new ASCIIFoldingFilter(keywordTokenizer);
private final TermAttribute termAttribute = (TermAttribute) asciiFoldingFilter.getAttribute(TermAttribute.class);
public String process(String line)
{
if (line != null)
{
try
{
keywordTokenizer.reset(new StringReader(line));
if (asciiFoldingFilter.incrementToken())
{
return termAttribute.term();
}
}
catch (IOException e)
{
logger.warn("Failed to parse: " + line, e);
}
}
return null;
}
I'm using something like this:
Transliterator transliterator = Transliterator.getInstance("Any-Latin; Upper; Lower; NFD; [:Nonspacing Mark:] Remove; NFC", Transliterator.FORWARD);
Is there some simple way to do it, without million of .replaceAll() calls?
If you just support European, Latin-based languages, around 100 should be enough; that's definitely doable: Grab the Unicode charts for Latin-1 Supplement and Latin Extended-A and get the String.replace party started. :-)
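A minimal sketch of that table-driven idea, using only the JDK: decompose with Normalizer, drop the combining marks, and look up the few letters that do not decompose in a map. The class name, the handful of map entries and the '?' fallback are illustrative assumptions, not a complete table:

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class AsciiFolder {
    // Letters with no Unicode decomposition need explicit supplements (partial, illustrative list).
    private static final Map<Character, String> SUPPLEMENTS = new HashMap<>();
    static {
        SUPPLEMENTS.put('\u00DF', "ss"); // sharp s
        SUPPLEMENTS.put('\u0110', "Dj"); // D with stroke
        SUPPLEMENTS.put('\u0111', "dj"); // d with stroke
        SUPPLEMENTS.put('\u00D8', "O");  // O with stroke
        SUPPLEMENTS.put('\u00F8', "o");  // o with stroke
        SUPPLEMENTS.put('\u00C6', "AE"); // AE ligature
        SUPPLEMENTS.put('\u00E6', "ae"); // ae ligature
        SUPPLEMENTS.put('\u0141', "L");  // L with stroke
        SUPPLEMENTS.put('\u0142', "l");  // l with stroke
    }

    public static String toAscii(String text) {
        // NFD splits accented letters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c < 128) {
                sb.append(c); // already ASCII
            } else if (Character.getType(c) == Character.NON_SPACING_MARK) {
                // combining diacritic: drop it
            } else {
                String supplement = SUPPLEMENTS.get(c);
                sb.append(supplement != null ? supplement : "?"); // assumed fallback
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u0110onardan")); // the question's first example
        System.out.println(toAscii("Bla\u00DF"));     // the question's second example
    }
}
```

A single pass over the string with a map lookup avoids chaining one replace call per character, and the map can be extended from the Latin-1 Supplement and Latin Extended-A charts as needed.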
