java base64.deocde to python3 - java

I have a piece of java code
enter code here
byte[] random1 = Base64.getDecoder().decode(arr.getString(2));
byte[] test1 = "/bCN99cbY13kwEf+wnRErg".getBytes(StandardCharsets.ISO_8859_1);
System.out.println(test1.length); // #22
System.out.println(Base64.getDecoder().decode(test1).length); // #16
I am trying to use python3 and I get an error.
text = bytes("/bCN99cbY13kwEf+wnRErg", encoding='iso-8859-1')
print(len(base64.b64decode(text)))
# Traceback (most recent call last):
# File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\base64.py", line
# 546, in decodebytes
# return binascii.a2b_base64(s)
# binascii.Error: Incorrect padding
How can I use python3 to implement the functions on the java code to achieve decode length=16

The problem is that your Base64 encoded array is missing the padding, which is not always required.
It is no problem in Java as the Decoder does not require it:
The Base64 padding character '=' is accepted and interpreted as the end of the encoded byte data, but is not required.
In contrast, the Python b64decode(s) method requires the padding and throws an error if it is missing.
A binascii.Error exception is raised if s is incorrectly padded.
I found a simple solution in this answer: Always adding the maximum padding at the end of the byte array (two equal signs ==). The b64decode(s) method simply ignores the padding if it is too long and so it always works.
You only have to change your code slightly for it to work:
text = bytes("/bCN99cbY13kwEf+wnRErg", encoding='iso-8859-1')
padded_text = text + b'=='
result = base64.b64decode(padded_text)
print(len(result))
The output is 16 identical to the Java output.

Related

Encoding issue with JLine

Jline is a module for intercepting user input at a console before the user presses Enter. It uses JNA or similar wizardry.
I'm doing a few experiments with it and I'm getting encoding problems when I input more "exotic" Unicode characters. The OS here is W10 and I'm using Cygwin. Also this is in Groovy but should be obvious to Java people.
def terminal = org.jline.terminal.TerminalBuilder.builder().jna( true ).system( true ).build()
terminal.enterRawMode()
// NB the Terminal I get is class org.jline.terminal.impl.PosixSysTerminal
def reader = terminal.reader()
def bytes = [] // NB class ArrayList
int readInt = -1
while( readInt != 13 && readInt != 10 ) {
readInt = reader.read()
byte convertedByte = (byte)readInt
// see what the binary looks like:
String binaryString = String.format("%8s", Integer.toBinaryString( convertedByte & 0xFF)).replace(' ', '0')
println "binary |$binaryString|"
bytes << (byte)readInt // NB means "append to list"
println ">>> read |$readInt| byte |$convertedByte|"
}
// strip final byte (13 or 10)
bytes = bytes[0..-2]
println "z bytes $bytes, class ${bytes.class.name}"
def response = new String( (byte[])bytes.toArray(), 'UTF-8' )
// to get proper out encoding for Cygwin I then need to do this (I have no idea why!)
def psOut = new PrintStream(System.out, true, 'UTF-8' )
psOut.print( "using PrintStream: |$response|" )
This works fine with one-byte Unicode, and letters like "é" (2-bytes) get handled fine. But it goes wrong with "ẃ":
ẃ --> Unicode U+1E83
UTF-8 HEX: 0xE1 0xBA 0x83 (e1ba83)
BINARY: 11100001:10111010:10000011
Actually the binary it puts out when you enter "ẃ" is 11100001:10111010:10010010.
This translates to U+1E92, which is another Polish character, "Ẓ". And that is indeed what gets printed out in the response String.
Unfortunately the JLine package hands you this reader, which is class org.jline.utils.NonBlocking$NonBlockingInputStreamReader... So I don't really know what I can do to investigate its encoding (I presume UTF-8) or somehow modify it... Can anyone explain what the problem is?
As far as I can tell this relates to a Cygwin-specific problem, as asked and then answered by me a year ago.
There is a solution, in my answer to the question I asked directly after this one... which correctly deals with Unicode input, even when outside the Basic Multilingual Plane, using JLine, ... and using a Cygwin console ... hopefully.

Java bridge code error while converting chinese characters : 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte

We are receiving data in different encoding format, currently we are using below mentioned java encodings
https://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html
we are moving to python so changing this encoding logic into python.
As python is not supporting encoding for Chinese character which is equivalent to java encoding Cp935 we are using
javabridge code as below
`
class String:
new_fn = javabridge.make_new("java/lang/String", "([BLjava/lang/String;)V")
def __init__(self, i, s):
self.new_fn(i, s)
toString = javabridge.make_method("toString", "()Ljava/lang/String;", "Retrieve the string value")
array = numpy.array(list(fielddata) , numpy.uint16)
strobject = String(array,encoding)
convertedstr = strobject.toString() `
however we are getting the error
'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
looking for the help or alternative way of doing this in python.
class JavaEncoder:
# creating new method for java bridge
new_fn = javabridge.make_new("java/lang/String", "([BLjava/lang/String;)V")
def __init__(self, i, s):
i[i == 0] = 64
self.new_fn(i, s)
# creating toString method of JAVA
toString = javabridge.make_method("toString", "()Ljava/lang/String;", "Retrieve the integer value")
While converting data using JAVABRIDGE if field is having size 1 and data contains 00 then numpy.uint8 convert this into 0 considering this as integer because of which, while converting data, we are getting encoding error to avoid this we added above code 64 is space (40 EBCDIC/20 ASCII space) in uint8.

Base64 encode gives different result on linux CentOS terminal and in Java

I am trying to generate some random password on Linux CentOS and store it in database as base64. Password is 'KQ3h3dEN' and when I convert it with 'echo KQ3h3dEN | base64' as a result I will get 'S1EzaDNkRU4K'.
I have function in java:
public static String encode64Base(String stringToEncode)
{
byte[] encodedBytes = Base64.getEncoder().encode(stringToEncode.getBytes());
String encodedString = new String(encodedBytes, "UTF-8");
return encodedString;
}
And result of encode64Base("KQ3h3dEN") is 'S1EzaDNkRU4='.
So, it is adding "K" instead of "=" in this example. How to ensure that I will always get same result when using base64 on linux and base64 encoding in java?
UPDATE: Updated question as I didn't noticed "K" at the end of linux encoded string. Also, here are few more examples:
'echo KQ3h3dENa | base64' => result='S1EzaDNkRU5hCg==', but it should be 'S1EzaDNkRU5h'
echo KQ3h3dENaa | base64' => result='S1EzaDNkRU5hYQo=', but it should be 'S1EzaDNkRU5hYQ=='
Found solution after few hours of experimenting. It seems like new line was added to the string I wanted to encode. Solution would be :
echo -n KQ3h3dEN | base64
Result will be the same as with java base64 encode.
Padding
The '==' sequence indicates that the last group contained only one byte, and '=' indicates that it contained two bytes.
In theory, the padding character is not needed for decoding, since the number of missing bytes can be calculated from the number of Base64 digits. In some implementations, the padding character is mandatory, while for others it is not used.
So it depends on tools and libraries you use. If base64 with padding is the same as without padding for them, there is no problem. As an insurance you can use on linux tool that generates base64 with padding.
Use withoutPadding() of Base64.Encoder class to get Base64.Encoder instance which encodes without adding any padding character at the end.
check the link :
https://docs.oracle.com/javase/8/docs/api/java/util/Base64.Encoder.html#withoutPadding

Converting string to byte[] returns wrong value (encoding?)

I read a byte[] from a file and convert it to a String:
byte[] bytesFromFile = Files.readAllBytes(...);
String stringFromFile = new String(bytesFromFile, "UTF-8");
I want to compare this to another byte[] I get from a web service:
String stringFromWebService = webService.getMyByteString();
byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");
So I read a byte[] from a file and convert it to a String and I get a String from my web service and convert it to a byte[]. Then I do the following tests:
// works!
org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);
// fails!
org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);
Why does the second assertion fail?
Other answers have covered the likely fact that the file is not UTF-8 encoded giving rise to the symptoms described.
However, I think the most interesting aspect of this is not that the byte[] assert fails, but that the assert that the string values are the same passes. I'm not 100% sure why this is, but I think the following trawl through the source code might give us the answer:
Looking at how new String(bytesFromFile, "UTF-8"); works - we see that the constructor calls through to StringCoding.decode()
This in turn, if supplied with tht UTF-8 character set, calls through to StringDecoder.decode()
This calls through to CharsetDecoder.decode() which decides what to do if the character is unmappable (which I guess will be the case if a non-UTF-8 character is presented)
In this case it uses an action defined by
private CodingErrorAction unmappableCharacterAction
= CodingErrorAction.REPORT;
Which means that it still reports the character it has decoded, even though it's technically unmappable.
I think this means that even when the code gets an umappable character, it substitutes its best guess - so I'm guessing that its best guess is correct and hence the String representations are the same under comparison, but the byte[] are no longer the same.
This hypothesis is kind of supported by the fact that the catch block for CharacterCodingException in StringCoding.decode() says:
} catch (CharacterCodingException x) {
// Substitution is always enabled,
// so this shouldn't happen
I don't understand it fully, but here's what I get so fare:
The problem is that the data contains some bytes which are not valid UTF-8 bytes as I know by the following check:
// returns false for my data!
public static boolean isValidUTF8(byte[] input) {
CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
try {
cs.decode(ByteBuffer.wrap(input));
return true;
}
catch(CharacterCodingException e){
return false;
}
}
When I change the encoding to ISO-8859-1 everything works fine. The strange thing (which a don't understand yet) is why my conversion (new String(bytesFromFile, "UTF-8");) doesn't throw any exception (like my isValidUTF8 method), although the data is not valid UTF-8.
However, I think I will go another and encode my byte[] in a Base64 string as I don't want more trouble with encoding.
The real problem in your code is that you don't know what the real file encoding.
When you read the string from the web service you get a sequence of chars; when you convert the string from chars to bytes the conversion is made right because you specify how to transform char in bytes with a specific encoding ("UFT-8"). when you read a text file you face a different problem. You have a sequence of bytes that needs to be converted to chars. In order to do it properly you must know how the chars where converted to bytes i.e. what is the file encoding. For files (unless specified) it's a platform constants; on windows the file are encoded in win1252 (which is very close to ISO-8859-1); on linux/unix it depends, I think UTF8 is the default.
By the way the web service call did a decond operation under the hood; the http call use an header taht defins how chars are encoded, i.e. how to read the bytes form the socket and transform then to chars. So calling a SOAP web service gives you back an xml (which can be marshalled into a Java object) with all the encoding operations done properly.
So if you must read chars from a File you must face the encoding issue; you can use BASE64 as you stated but you lose one of the main benefits of text files: the are human readable, easing debugging and developing.

Java and c++ encrypted results not matching

I have a existing c++ code which will encrypt a string. Now I did the same encryption in .
Some of the encrypted strings are matching . Some are mismatching in one or two characters.
I am unable to figure out why it is happening. I ran both the codes in debug mode until they call their libraries both have the same key, salt, iv string to be encrypted.
I know that even if a single byte padding change will modify encrypted string drastically. But here I am just seeing a one or two characters change. Here is a sample (Bold characters in between stars is the part mis matching)
java:
U2FsdGVkX18xMjM0NTY3OGEL9nxFlHrWvodMqar82NT53krNkqat0rrgeV5FAJFs1vBsZIJPZ08DJVrQ*Pw*yV15HEoyECBeAZ6MTeN+ZYHRitKanY5jiRU2J0KP0Fzola
C++:
U2FsdGVkX18xMjM0NTY3OGEL9nxFlHrWvodMqar82NT53krNkqat0rrgeV5FAJFs1vBsZIJPZ08DJVrQ*jQ*yV15HEoyECBeAZ6MTeN+ZYHRitKanY5jiRU2J0KP0Fzola
I am using AES encryption. provider is SunJCE version 1.6. I tried changing provider to Bouncy Castle. Even then result is same.
Added One More sample:
C++:
U2FsdGVkX18xMjM0NTY3O*I*/BMu11HkHgnkx+dLPDU1lbfRwb+aCRrwkk7e9dy++MK+/94dKLPXaZDDlWlA3gdUNyh/Fxv*oF*STgl3QgpS0XU=
java:
U2FsdGVkX18xMjM0NTY3O*D*/BMu11HkHgnkx+dLPDU1lbfRwb+aCRrwkk7e9dy++MK+/94dKLPXaZDDlWlA3gdUNyh/Fxv*j9*STgl3QgpS0XU=
UPDATE:
As per the comments I feel base 64 encryption is the culprit. I am using Latin-1 char set in both places. Anything else that I can check
Sigh!!
The problem almost certainly is that after you encrypt the data and receive the encrypted data as a byte string, you are doing some sort of character conversion on the data before sending it through Base-64 conversion.
Note that if you encrypt the strings "ABC_D_EFG" and "ABC_G_EFG", the encrypted output will be completely different starting with the 4th character, and continuing to the end. In other words, the Base-64 outputs would be something like (using made-up values):
U2FsdGVkX18xMj
and
U2FsdGXt91mJpz
The fact that, in the above examples, only two isolated Base-64 characters (one byte) are messed up in each case pretty much proves that the corruption occurs AFTER encryption.
The output of an encryption process is a byte sequence, not a character sequence. The corruption observed is consistent with erroneously interpreting the bytes as characters and attempting to perform a code page conversion on them, prior to feeding them into the Base-64 converter. The output from the encryptor should be fed directly into the Base-64 converter without any conversions.
You say you are using the "Latin-1 char set in both places", a clear sign that you are doing some conversion you should not be doing -- there should be no need to muck with char sets.
First a bit of code:
import javax.xml.bind.DatatypeConverter;
...
public static void main(String[] args) {
String s1j = "U2FsdGVkX18xMjM0NTY3OGEL9nxFlHrWvodMqar82NT53krNkqat0rrgeV5FAJFs1vBsZIJPZ08DJVrQPwyV15HEoyECBeAZ6MTeN+ZYHRitKanY5jiRU2J0KP0Fzola";
String s1c = "U2FsdGVkX18xMjM0NTY3OGEL9nxFlHrWvodMqar82NT53krNkqat0rrgeV5FAJFs1vBsZIJPZ08DJVrQjQyV15HEoyECBeAZ6MTeN+ZYHRitKanY5jiRU2J0KP0Fzola";
byte[] bytesj = DatatypeConverter.parseBase64Binary(s1j);
byte[] bytesc = DatatypeConverter.parseBase64Binary(s1c);
int nmax = Math.max(bytesj.length, bytesc.length);
int nmin = Math.min(bytesj.length, bytesc.length);
for (int i = 0; i < nmax; ++i) {
if (i >= nmin) {
boolean isj = i < bytesj.length;
byte b = isj? bytesj[i] : bytesc[i];
System.out.printf("%s [%d] %x%n", (isj? "J" : "C++"), i, (int)b & 0xFF);
} else {
byte bj = bytesj[i];
byte bc = bytesc[i];
if (bj != bc) {
System.out.printf("[%d] J %x != C++ %x%n", i, (int)bj & 0xFF, (int)bc & 0xFF);
}
}
}
}
This delivers
[60] J 3f != C++ 8d
Now 0x3f is the code of the question mark.
The error is, that 0x80 - 0xBF are in Latin-1, officially ISO-8859-1, control characters.
Windows Latin-1, officially Windows-1252, uses these codes for other characters.
Hence you should use "Windows-1252" or "Cp1252" (Code-Page) in Java.
Blundly
In the encryption the original bytes in the range 0x80 .. 0xBF were replaced with a question mark because of some translation to ISO-8859-1 instead of Windows-1252 to byte[].

Categories

Resources