Java Base64 MIME decoding/encoding throws away delimiters

I have a Base64-encoded string that looks like "data:image/png;base64,iVBORw0K". I'm trying to decode it back to binary and later encode it again into Base64 using java.util.Base64. Strangely, after decoding and encoding again, I would lose the delimiters and get back "dataimage/pngbase64iVBORw0I=".
This is how I do the decoding and encoding (written in Scala, but you get the idea):
import java.util.Base64
val b64mime = "data:image/png;base64,iVBORw0K"
val decoder = Base64.getMimeDecoder
val encoder = Base64.getMimeEncoder
println(encoder.encodeToString(decoder.decode(b64mime)))
Here is an example: https://scalafiddle.io/sf/TJY7eeg/0
This also happens with javax.xml.bind.DatatypeConverter. What am I doing wrong? Is this the expected behavior?

The string you are trying to deal with looks like an example of a "data:" URL as specified in RFC 2397.
The correct way to deal with one of these is to parse it into its components, and then decode only the component that is base64 encoded. Here is the syntax:
dataurl := "data:" [ mediatype ] [ ";base64" ] "," data
mediatype := [ type "/" subtype ] *( ";" parameter )
data := *urlchar
parameter := attribute "=" value
So this says that everything up to the comma in your example is non-base64 data. You cannot simply treat the whole string as base64 because it contains characters that are not valid in any standard variant of the base64 encoding scheme.
This Q&A talks about RFC 2397 parsers in Java:
Any RFC 2397 Data URI Parser for Java?
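As a rough illustration of the "parse first, then decode" advice, here is a minimal Java sketch. It is not a full RFC 2397 parser: it assumes the URL has a ";base64" marker right before the comma and ignores other parameters. The class and method names (DataUrlSketch, decodePayload, rebuild) are made up for this example.

```java
import java.util.Base64;

// Hypothetical helper; the names are illustrative and not part of any library.
class DataUrlSketch {

    /** Decodes only the part after the comma, assuming a ";base64" data: URL. */
    static byte[] decodePayload(String dataUrl) {
        int comma = dataUrl.indexOf(',');
        if (!dataUrl.startsWith("data:") || comma < 0) {
            throw new IllegalArgumentException("Not a data: URL");
        }
        String header = dataUrl.substring(0, comma); // e.g. "data:image/png;base64"
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("Payload is not base64 encoded");
        }
        return Base64.getDecoder().decode(dataUrl.substring(comma + 1));
    }

    /** Re-encodes the payload and puts the original header back in front of it. */
    static String rebuild(String originalDataUrl, byte[] payload) {
        String header = originalDataUrl.substring(0, originalDataUrl.indexOf(','));
        return header + "," + Base64.getEncoder().encodeToString(payload);
    }

    public static void main(String[] args) {
        String original = "data:image/png;base64,iVBORw0K";
        byte[] bytes = decodePayload(original);
        System.out.println(rebuild(original, bytes)); // data:image/png;base64,iVBORw0K
    }
}
```

Because the header never goes through the Base64 decoder, the delimiters survive the round trip.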

Base64 doesn't have those characters in it. It looks like the decoder is ignoring those invalid characters.
# decoder.decode(";")
res10: Array[Byte] = Array()
However if you just decode the last part you get what you want.
# decoder.decode("iVBORw0K")
res9: Array[Byte] = Array(-119, 80, 78, 71, 13, 10)
# encoder.encodeToString(res9)
res12: String = "iVBORw0K"
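The same behavior can be reproduced in plain Java. The sketch below contrasts the basic decoder, which rejects the delimiters, with the MIME decoder, which silently skips them; the class and method names are only for illustration.

```java
import java.util.Base64;

// Illustrative names only; this just contrasts two JDK decoders.
class StrictVsMimeSketch {

    /** Returns true if the basic (strict) decoder accepts the input. */
    static boolean basicDecoderAccepts(String s) {
        try {
            Base64.getDecoder().decode(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Decodes with the lenient MIME decoder and encodes the result again. */
    static String mimeRoundTrip(String s) {
        byte[] decoded = Base64.getMimeDecoder().decode(s);
        return Base64.getMimeEncoder().encodeToString(decoded);
    }

    public static void main(String[] args) {
        String input = "data:image/png;base64,iVBORw0K";
        System.out.println(basicDecoderAccepts(input)); // false
        System.out.println(mimeRoundTrip(input));       // dataimage/pngbase64iVBORw0I=
    }
}
```

So the MIME decoder is not "losing" delimiters so much as treating them as noise between Base64 characters, which is what it is documented to do.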

Related

Illegal base64 character "a" using java.util.Base64 from within Scala

Suppose I have the following Base64 encoded String from a github API call to a file:
LyoKICogQ29weXJpZ2h0IDIwMTkgY29tLmdpdGh1Yi50aGVvcnlkdWRlcwog
KgogKiBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNp
b24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKICogeW91IG1heSBub3QgdXNlIHRo
aXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNl
LgogKiBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQK
ICoKICogICAgIGh0dHA6Ly93d3cuYXBhY2hlLm9yZy9saWNlbnNlcy9MSUNF
TlNFLTIuMAogKgogKiBVbmxlc3MgcmVxdWlyZWQgYnkgYXBwbGljYWJsZSBs
YXcgb3IgYWdyZWVkIHRvIGluIHdyaXRpbmcsIHNvZnR3YXJlCiAqIGRpc3Ry
aWJ1dGVkIHVuZGVyIHRoZSBMaWNlbnNlIGlzIGRpc3RyaWJ1dGVkIG9uIGFu
ICJBUyBJUyIgQkFTSVMsCiAqIFdJVEhPVVQgV0FSUkFOVElFUyBPUiBDT05E
SVRJT05TIE9GIEFOWSBLSU5ELCBlaXRoZXIgZXhwcmVzcyBvciBpbXBsaWVk
LgogKiBTZWUgdGhlIExpY2Vuc2UgZm9yIHRoZSBzcGVjaWZpYyBsYW5ndWFn
ZSBnb3Zlcm5pbmcgcGVybWlzc2lvbnMgYW5kCiAqIGxpbWl0YXRpb25zIHVu
ZGVyIHRoZSBMaWNlbnNlLgogKi8KCnBhY2thZ2UgY29tLmdpdGh1Yi50aGVv
cnlkdWRlcy5tb2RlbAoKaW1wb3J0IGNvbS5naXRodWIudGhlb3J5ZHVkZXMu
dXRpbC5LaXZ5UHJldHR5UHJpbnRlcgppbXBvcnQgb3JnLmJpdGJ1Y2tldC5p
bmt5dG9uaWsua2lhbWEuPT0+CmltcG9ydCBvcmcuYml0YnVja2V0Lmlua3l0
b25pay5raWFtYS5yZXdyaXRpbmcuUmV3cml0ZXIuXwppbXBvcnQgb3JnLmJp
dGJ1Y2tldC5pbmt5dG9uaWsua2lhbWEucmV3cml0aW5nLlN0cmF0ZWd5Cgov
KioKICogQmFzZSBUeXBlIGZvciBhbGwgbm9kZXMgb2YgYSBLaXZ5LUFTVAog
Ki8KdHJhaXQgQVNUTm9kZSBleHRlbmRzIEZvbGRhYmxlQVNUIHsgc2VsZiA9
PgogIC8qKgogICAqIFRyYXZlcnNlcyB0aGUgQVNUTm9kZSBhbmQgYXBwbGll
cyBTdHJhdGVneSBgc2Agb250byBgc2VsZmAgYW5kIGFsbCBjaGlsZHJlbiBv
ZiBzZWxmLgogICAqCiAgICogYHNgIGlzIGhlcmVieSBhcHBsaWVkIGJvdHRv
bSB1cCBpbiBsZWZ0IHRvIHJpZ2h0IG9yZGVyLgogICAqCiAgICogQHNlZSBb
W2h0dHBzOi8vYml0YnVja2V0Lm9yZy9pbmt5dG9uaWsva2lhbWEvc3JjLzAz
MjYzMGZhMjFkZGFkNWNmMzNjYmQ2ZWY5YzJmMDI3ODY2MWE2NzUvd2lraS9S
ZXdyaXRpbmcubWRdXQogICAqIEBwYXJhbSBzIHN0cmF0ZWd5IHRoYXQgaXMg
YXBwbGllZCB0byBgc2VsZmAgYW5kIGFsbCBjaGlsZHJlbi4KICAgKiBAcmV0
dXJuIGEgcmV3cml0dGVuIEFTVE5vZGUgYWNjb3JkaW5nIHRvIHRoZSBzdHJh
dGVneSBgc2AKICAgKi8KICBwcml2YXRlW3RoZW9yeWR1ZGVzXSBkZWYgdHJh
dmVyc2VBbmRBcHBseShzOlN0cmF0ZWd5KTpBU1ROb2RlCgogIC8qKgogICAq
IFJld3JpdGUgdGhlIEFTVE5vZGUgYHNlbGZgIGJ5IHRoZSBzcGVjaWZpY2F0
aW9uIG9mIGEgcGFydGlhbCBmdW5jdGlvbiBgZnBgLgogICAqCiAgICogSWYg
d2Ugd2FudCB0byBjaGFuZ2UgYSBzcGVjaWZpYyBbW21vZGVsLlB5dGhvbl1d
LW5vZGUgaW4gdGhlIEFTVCBmb3IgZXhhbXBsZSB3ZSBjb3VsZAogICAqIGFw
cGx5IHRoZSBmb2xsb3dpbmcgcmV3cml0ZSBzdHJhdGVneToKICAgKnt7ewog
ICAqICAgYXN0LnJld3JpdGUoewogICAqICAgIGNhc2UgUHl0aG9uKCJbMSwy
LDNdIikgPT4gUHl0aG9uKCJbMSwyLDMsNF0iKQogICAqICAgfSkKICAgKn19
fQogICAqCiAgICogUGxlYXNlIG5vdGUsIHRoYXQgQVNUTm9kZXMgY2FuIG5v
dCBiZSByZXdyaXR0ZW4gYXJiaXRyYXJpbHkuIFNpbmNlIGVhY2ggQVNUTm9k
ZSBpbXBsaWVzCiAgICogYSBzcGVjaWZpYyBwYXJhbWV0ZXIgbGlzdC4gQW4g
QVNUIGhhcyB0byBzdGF5IHN0cnVjdHVyZS1jb25zaXN0ZW50IGFmdGVyIGFw
cGx5aW5nIHJld3JpdGluZyBydWxlcy4KICAgKiBBIHJld3JpdGluZyBydWxl
IGFzOgogICAqIHt7ewogICAqICAgewogICAqICAgIGNhc2UgUHl0aG9uKHMp
ID0+IFRvcExldmVsKE5pbCkKICAgKiAgIH0KICAgKiB9fX0KICAgKiBpcyBu
b3QgdmFsaWQgYXMgYSBbW21vZGVsLlRvcExldmVsXV0tbm9kZSBjYW4gbm90
IG9jY3VyIGF0IHBvc2l0aW9ucyB3aGVyZSBhIFtbbW9kZWwuUHl0aG9uXV0t
bm9kZSBjYW4uCiAgICoKICAgKiBAc2VlIFtbaHR0cHM6Ly9iaXRidWNrZXQu
b3JnL2lua3l0b25pay9raWFtYS9zcmMvMDMyNjMwZmEyMWRkYWQ1Y2YzM2Ni
ZDZlZjljMmYwMjc4NjYxYTY3NS93aWtpL1Jld3JpdGluZy5tZF1dCiAgICog
QHBhcmFtIGZwIFBhcnRpYWwgZnVuY3Rpb24gdGhhdCBkZWZpbmVzIGhvdyB0
aGUgYXN0IHNob3VsZCBiZSByZXdyaXR0ZW4uCiAgICogQHJldHVybiBBIHJl
d3JpdHRlbiBBU1QgYWNjb3JkaW5nIHRvIHRoZSBzcGVjaWZpY2F0aW9uIGlu
IGBmcGAgb3IgdGhlIHNhbWUgYXN0IGlmIGBmcGAgY291bGQgbm90IGJlIGFw
cGxpZWQuCiAgICovCiAgZGVmIHJld3JpdGUoZnA6QVNUTm9kZSA9PT4gQVNU
Tm9kZSk6IEFTVE5vZGUgPSBzZWxmLnRyYXZlcnNlQW5kQXBwbHkocnVsZShm
cCkpCgogIC8qKgogICAqIFRyYW5zZm9ybXMgYHNlbGZgIGludG8gYSB3ZWxs
IGZvcm1hdHRlZCBraXZ5IHByb2dyYW0gdGhhdCBjYW4gYmUgd3JpdHRlbgog
ICAqIGludG8gYSBmaWxlLgogICAqCiAgICogVGhlIGZvbGxvd2luZyBBU1RO
b2RlIGZvciBleGFtcGxlOgogICAqIHt7ewogICAqICAgVG9wTGV2ZWwoCiAg
ICogICAgTGlzdCgKICAgKiAgICAgIFJvb3QoCiAgICogICAgICAgIFdpZGdl
dCgKICAgKiAgICAgICAgICBQbG90LAogICAqICAgICAgICAgIExpc3QoCiAg
ICogICAgICAgICAgICBXaWRnZXQoCiAgICogICAgICAgICAgICAgIExpbmVH
cmFwaCwKICAgKiAgICAgICAgICAgICAgTGlzdCgKICAgKiAgICAgICAgICAg
ICAgICBQcm9wZXJ0eShiYWNrZ3JvdW5kX25vcm1hbCxMaXN0KCcnKSksCiAg
ICogICAgICAgICAgICAgICAgUHJvcGVydHkoYmFja2dyb3VuZF9jb2xvcixM
aXN0KFswLDAsMCwxXSkpCiAgICogICApKSkpKSkpCiAgICogfX19CiAgICoK
ICAgKiBpcyBwcmludGVkOgogICAqIHt7ewogICAqIFBsb3Q6CiAgICogIExp
bmVHcmFwaDoKICAgKiAgICBiYWNrZ3JvdW5kX25vcm1hbDogJycKICAgKiAg
ICBiYWNrZ3JvdW5kX2NvbG9yOiBbMCwwLDAsMV0KICAgKiB9fX0KICAgKgog
ICAqIEByZXR1cm4gQSBmb3JtYXR0ZWQgQVNUTm9kZSB0aGF0IGNhbiBiZSBp
bnRlcnByZXRlZCBhcyBhIEtpdnkgZmlsZS4KICAgKi8KICBkZWYgcHJldHR5
OlN0cmluZyA9IEtpdnlQcmV0dHlQcmludGVyLmZvcm1hdChzZWxmKS5sYXlv
dXQKfQ==
As far as I see, this encoding is correct and only contains the standard alphabet of characters for a Base64 encoding. If I decode this encoding here, I get a correct translation. However, I tried various approaches to decode it programmatically and did not find a solution yet.
Let contentEncoded be the string containing the encoded file. I tried the following:
java.util.Base64.getDecoder.decode(contentEncoded)
java.util.Base64.getDecoder.decode(contentEncoded.getBytes)
java.util.Base64.getDecoder.decode(contentEncoded.getBytes(StandardCharsets.UTF_8))
java.util.Base64.getUrlDecoder.decode(contentEncoded)
java.util.Base64.getUrlDecoder.decode(contentEncoded.getBytes(StandardCharsets.UTF_8))
java.util.Base64.getMimeDecoder.decode(contentEncoded.replaceAll("\\n", "").replaceAll("\\r", ""))
However, all of them resulted in an error message: java.lang.IllegalArgumentException: Illegal base64 character a.
My question is: Am I not seeing something obvious? Are there some hidden control characters? Has anybody had similar issues and was able to fix them?

Just remove line breaks and it should work.
contentEncoded.replace("\n", "")

The following snippet decodes the encoding correctly:
val decodedWithMime = java.util.Base64.getMimeDecoder.decode(contentEncoded)
val convertedByteArray = new String(decodedWithMime, java.nio.charset.StandardCharsets.UTF_8)
As pointed out in the comments, the a in the error message Illegal base64 character a is the hex value (0x0a) of the newline character \n. The MIME decoder ignores line separators, so it can decode the string without removing the newline characters beforehand.
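For reference, here is a small Java sketch that reproduces the error and shows both fixes. The 60-character line length with "\n" separators is an assumption based on the sample above, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper names; the 60-character wrap mimics the sample in the question.
class WrappedBase64Sketch {

    /** Encodes text as Base64 wrapped at 60 characters with "\n" separators. */
    static String wrap(String text) {
        return Base64.getMimeEncoder(60, new byte[] {'\n'})
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the basic decoder's error message, or null if decoding succeeds. */
    static String basicDecoderError(String encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    /** Fix 1: the MIME decoder skips line separators on its own. */
    static String decodeWithMime(String encoded) {
        return new String(Base64.getMimeDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    /** Fix 2: strip the newlines, then use the basic decoder. */
    static String decodeAfterStripping(String encoded) {
        byte[] bytes = Base64.getDecoder().decode(encoded.replace("\n", ""));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "/* Copyright 2019 com.github.theorydudes, licensed under the Apache License 2.0 */";
        String wrapped = wrap(text);
        System.out.println(basicDecoderError(wrapped)); // Illegal base64 character a
        System.out.println(decodeWithMime(wrapped));
        System.out.println(decodeAfterStripping(wrapped));
    }
}
```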

Java bridge code error while converting chinese characters : 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte

We receive data in several different encodings, and we currently decode it using the Java encodings listed here:
https://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html
We are moving to Python, so we are porting this encoding logic to Python.
Because Python has no codec for Chinese characters that is equivalent to the Java encoding Cp935, we are using
javabridge code as below:
class String:
    new_fn = javabridge.make_new("java/lang/String", "([BLjava/lang/String;)V")
    def __init__(self, i, s):
        self.new_fn(i, s)
    toString = javabridge.make_method("toString", "()Ljava/lang/String;", "Retrieve the string value")
array = numpy.array(list(fielddata), numpy.uint16)
strobject = String(array, encoding)
convertedstr = strobject.toString()
However, we are getting the error:
'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
We are looking for help, or for an alternative way of doing this in Python.

class JavaEncoder:
    # creating new method for java bridge
    new_fn = javabridge.make_new("java/lang/String", "([BLjava/lang/String;)V")
    def __init__(self, i, s):
        i[i == 0] = 64
        self.new_fn(i, s)
    # creating toString method of JAVA
    toString = javabridge.make_method("toString", "()Ljava/lang/String;", "Retrieve the integer value")
When converting data with javabridge, if a field has size 1 and its data is 00, numpy.uint8 converts it to 0 and treats it as an integer, which caused the encoding error during conversion. To avoid this, we added the code above, which replaces every 0 with 64, the value of a space in uint8 (0x40 is the EBCDIC space; the ASCII space is 0x20).

Convert JSON Base64 string to String in Java

I am trying to convert a protobuf stream to JSON object using the com.google.protobuf.util.JsonFormat class as below.
String jsonFormat = JsonFormat.printer().print(data);
As per the documentation https://developers.google.com/protocol-buffers/docs/proto3#json, I am getting the bytes fields as Base64 strings (for example "hashedStaEthMac": "QDOMIxG+tTIRi7wlMA9yGtOoJ1g="). But I would like to get them as readable strings (like "locAlgorithm": "ALGORITHM_ESTIMATION"). Below is a sample output. Is there a way to get the JSON object as plain text, or any workaround to get the actual values?
{
"seq": "71811887",
"timestamp": 1488640438,
"op": "OP_UPDATE",
"topicSeq": "9023777",
"sourceId": "xxxxxxxx",
"location": {
"staEthMac": {
"addr": "xxxxxx"
},
"staLocationX": 1148.1763,
"staLocationY": 980.3377,
"errorLevel": 588,
"associated": false,
"campusId": "n5THo6IINuOSVZ/cTidNVA==",
"buildingId": "7hY/jVh9NRqqxF6gbqT7Jw==",
"floorId": "LV/ZiQRQMS2wwKiKTvYNBQ==",
"hashedStaEthMac": "xxxxxxxxxxx",
"locAlgorithm": "ALGORITHM_ESTIMATION",
"unit": "FEET"
}
}
Expected format is as below.
seq: 85264233
timestamp: 1488655098
op: OP_UPDATE
topic_seq: 10955622
source_id: 00505698749E
location {
sta_eth_mac {
addr: xx:xx:xx:xx:xx:xx
}
sta_location_x: 916.003
sta_location_y: 580.115
error_level: 854
associated: false
campus_id: 9F94C7A3A20836E392559FDC4E274D54
building_id: EE163F8D587D351AAAC45EA06EA4FB27
floor_id: 83144E609EEE3A64BBD22C536A76FF5A
hashed_sta_eth_mac:
loc_algorithm: ALGORITHM_ESTIMATION
unit: FEET
}

Not easily, because the actual values are binary, which is why they're Base64-encoded in the first place.
Try to decode one of these values:
$ echo -n 'n5THo6IINuOSVZ/cTidNVA==' | base64 -D
??ǣ6?U??N'MT
In order to get more readable values, you have to understand what the binary data actually is, and then decide what format you want to use to display it.
The field called staEthMac.addr is 6 bytes and is probably an Ethernet MAC address. It's usually displayed as xx:xx:xx:xx:xx:xx where xx are the hexadecimal values of each byte. So you could decode the Base64 strings into a byte[] and then call a function to convert each byte to hex and delimit them with ':'.
The fields campusId, buildingId, and floorId are 16 bytes (128 bits) and are probably UUIDs. UUIDs are usually displayed as xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx where each x is a hex digit (4 bits). So you could (again) convert the Base64 string to byte[] and then print the hex digits, optionally adding the dashes.
Not sure about sourceId and hashedStaEthMac, but you could just follow the pattern of converting to byte[] and printing as hex. Essentially you're just doing a conversion from base 64 to base 16. You'll wind up with something like this:
$ echo -n 'n5THo6IINuOSVZ/cTidNVA==' | base64 -D | xxd -p
9f94c7a3a20836e392559fdc4e274d54
A point that I'm not sure you are getting is that it's binary data. There is no "readable" version that makes sense like "ALGORITHM_ESTIMATION" does; the best you can do is encode the binary data using letters and numbers so you can at least pronounce it.
Base64 (which encodes binary using 64 different characters) is pronounceable "N five T H lowercase-O six ..." but it's not real friendly because letter case is significant and because it uses letters like O and I that look like numbers. Hex (which encodes binary using just 16 characters) is a little easier to read.
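The Base64-to-hex conversion described above could look roughly like this in Java. This is only a sketch: treating the 16-byte IDs as UUIDs and the 6-byte value as a MAC address is the guess made in this answer, not something the protobuf schema confirms, and the helper names are made up.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

// Hypothetical helpers for displaying binary protobuf fields; names are illustrative.
class BinaryFieldSketch {

    /** Renders bytes as lowercase hex, with an optional delimiter between bytes. */
    static String toHex(byte[] bytes, String delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    /** Base64 string to plain hex, like "base64 -D | xxd -p". */
    static String base64ToHex(String b64) {
        return toHex(Base64.getDecoder().decode(b64), "");
    }

    /** Base64 string to xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, assuming exactly 16 bytes. */
    static String base64ToUuid(String b64) {
        byte[] bytes = Base64.getDecoder().decode(b64);
        if (bytes.length != 16) {
            throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong()).toString();
    }

    public static void main(String[] args) {
        String campusId = "n5THo6IINuOSVZ/cTidNVA==";
        System.out.println(base64ToHex(campusId));  // 9f94c7a3a20836e392559fdc4e274d54
        System.out.println(base64ToUuid(campusId)); // 9f94c7a3-a208-36e3-9255-9fdc4e274d54
        // A 6-byte MAC address would use ":" as the delimiter:
        System.out.println(toHex(new byte[] {0x00, 0x50, 0x56, (byte) 0x98, 0x74, (byte) 0x9e}, ":"));
    }
}
```

You would still have to walk the JSON (or the protobuf message itself) and apply the right formatter per field, since JsonFormat has no way of knowing which bytes fields are MAC addresses and which are IDs.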

Using Amazon AWS Cognito `.well-known/jwks.json` data fails to base64 decode some fields

When using Amazon AWS Cognito Federated Identities, and parsing the data at:
https://cognito-identity.amazonaws.com/.well-known/jwks_uri which looks like:
{"keys":[
{"kty":"RSA",
"alg":"RS512",
"use":"sig",
"kid":"ap-northeast-11",
"n":"AI7mc1assO5n6yB4b7jPCFgVLYPSnwt4qp2BhJVAmlXRntRZ5w4910oKNZDOr4fe/BWOI2Z7upUTE/ICXdqirEkjiPbBN/duVy5YcHsQ5+GrxQ/UbytNVN/NsFhdG8W31lsE4dnrGds5cSshLaohyU/aChgaIMbmtU0NSWQ+jwrW8q1PTvnThVQbpte59a0dAwLeOCfrx6kVvs0Y7fX7NXBbFxe8yL+JR3SMJvxBFuYC+/om5EIRIlRexjWpNu7gJnaFFwbxCBNwFHahcg5gdtSkCHJy8Gj78rsgrkEbgoHk29pk8jUzo/O/GuSDGw8qXb6w0R1+UsXPYACOXM8C8+E=",
"e":"AQAB"},
... }
This works fine decoding the n field using this code (Kotlin calling JDK 8 Base64 class):
Base64.getDecoder().decode(encodedN.toByteArray())
But when using Cognito User Pools which has data at a URL in the form of: https://cognito-idp.${REGION}.amazonaws.com/${POOLID}/.well-known/jwks.json
It has the same type of data, but it will not decode. Instead I end up with errors such as:
Illegal base64 character 5f
Since that is an underscore _ and in the Base64 URL alphabet, I tried changing my decoding to:
Base64.getUrlDecoder().decode(encodedN.toByteArray())
But then the first set of data no longer decodes correctly because it contains / and other invalid characters for Base64 URL encoding.
Is there a method that can handle both of these jwks sets of data with the same decoder?!?
Note: this question is intentionally written and answered by the author (Self-Answered Questions), so that solutions for interesting problems are shared in SO.

The issue is that the Amazon AWS Cognito team is using two different Base64 encoding alphabets for basically the same thing. So you will need to detect which is being used.
If the encoded string ends with = or contains + or / then it is definitely the normal Base64.getDecoder(). If it contains a - or _ then it is definitely the Base64.getUrlDecoder(). Otherwise nothing special is there and it is best to use the Base64.getUrlDecoder() because you do not know if the length would need padding or not.
This translates to (in Kotlin, but logically is applicable to any language):
fun base64SafeDecoder(encoded: String): ByteArray {
val decoder = if (encoded.endsWith('=') || encoded.any { it == '+' || it == '/' }) {
Base64.getDecoder()
}
else {
Base64.getUrlDecoder()
}
return decoder.decode(encoded.toByteArray())
}
This would be a problem in any language that has Base64 decoding: a decoder might be loose and ignore the invalid characters (some do), or it might be strict and throw an exception. Some test websites for Base64 encoding/decoding exhibit both of these behaviors as well, and silently ignoring invalid characters is dangerous, because you would then hit an error later when using the results of the decoding.
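A different way to get a single decoder for both alphabets, sketched here in Java rather than Kotlin, is to map the URL-safe characters back to the standard ones before decoding. This is an alternative to the detection logic above, not the same method; it relies on the JDK basic decoder accepting input with or without trailing padding, and the class and method names are invented for the example.

```java
import java.util.Base64;

// Illustrative alternative: normalize to the standard alphabet, then decode strictly.
class EitherAlphabetSketch {

    /** Decodes standard or URL-safe Base64, with or without '=' padding. */
    static byte[] decodeEitherAlphabet(String encoded) {
        // '-' and '_' only exist in the URL-safe alphabet, so the mapping is unambiguous.
        String standard = encoded.replace('-', '+').replace('_', '/');
        return Base64.getDecoder().decode(standard);
    }

    public static void main(String[] args) {
        byte[] e = decodeEitherAlphabet("AQAB"); // the "e" value from the JWKS sample
        System.out.println(e.length);            // 3
    }
}
```

Any other invalid character still makes the strict decoder throw, so nothing is silently ignored.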

You can try using the apache variant of the Base64 decode (org.apache.commons.codec.binary.Base64).
The decodeBase64(String base64String) method handles both base64 and base64 url safe encodings seamlessly. And the isBase64 method provides a check to detect if a string is encoded in either base64 or base64 url safe.

Base64 encode gives different result on linux CentOS terminal and in Java

I am trying to generate some random password on Linux CentOS and store it in database as base64. Password is 'KQ3h3dEN' and when I convert it with 'echo KQ3h3dEN | base64' as a result I will get 'S1EzaDNkRU4K'.
I have function in java:
public static String encode64Base(String stringToEncode)
{
byte[] encodedBytes = Base64.getEncoder().encode(stringToEncode.getBytes(StandardCharsets.UTF_8));
String encodedString = new String(encodedBytes, StandardCharsets.UTF_8);
return encodedString;
}
And result of encode64Base("KQ3h3dEN") is 'S1EzaDNkRU4='.
So, it is adding "K" instead of "=" in this example. How to ensure that I will always get same result when using base64 on linux and base64 encoding in java?
UPDATE: Updated the question, as I didn't notice the "K" at the end of the Linux-encoded string. Also, here are a few more examples:
'echo KQ3h3dENa | base64' => result='S1EzaDNkRU5hCg==', but it should be 'S1EzaDNkRU5h'
'echo KQ3h3dENaa | base64' => result='S1EzaDNkRU5hYQo=', but it should be 'S1EzaDNkRU5hYQ=='

Found the solution after a few hours of experimenting. It seems a newline was appended to the string I wanted to encode (echo adds one by default). The solution is to suppress it with -n:
echo -n KQ3h3dEN | base64
Result will be the same as with java base64 encode.
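To see the effect from the Java side, this small sketch encodes the password with and without a trailing newline; the class name is made up, and the two results match the values from the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative only: shows how a trailing "\n" changes the Base64 output.
class TrailingNewlineSketch {

    static String encode(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(encode("KQ3h3dEN"));   // S1EzaDNkRU4=  (what echo -n produces)
        System.out.println(encode("KQ3h3dEN\n")); // S1EzaDNkRU4K  (what plain echo produces)
    }
}
```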

Padding
The '==' sequence indicates that the last group contained only one byte, and '=' indicates that it contained two bytes.
In theory, the padding character is not needed for decoding, since the number of missing bytes can be calculated from the number of Base64 digits. In some implementations, the padding character is mandatory, while for others it is not used.
So it depends on the tools and libraries you use. If they treat base64 with padding the same as base64 without padding, there is no problem. To be safe, you can use a tool on Linux that generates base64 with padding.

Use the withoutPadding() method of the Base64.Encoder class to get a Base64.Encoder instance that encodes without adding any padding characters at the end.
See:
https://docs.oracle.com/javase/8/docs/api/java/util/Base64.Encoder.html#withoutPadding
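A minimal sketch of what that looks like, with made-up class and method names. It also shows that the JDK decoder accepts both the padded and the unpadded form, so padding alone does not explain the mismatch in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative names; compares padded and unpadded output of the JDK encoder.
class PaddingSketch {

    static String encodePadded(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    static String encodeUnpadded(String s) {
        return Base64.getEncoder().withoutPadding()
                .encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String b64) {
        return new String(Base64.getDecoder().decode(b64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encodePadded("KQ3h3dENaa"));   // S1EzaDNkRU5hYQ==
        System.out.println(encodeUnpadded("KQ3h3dENaa")); // S1EzaDNkRU5hYQ
        System.out.println(decode("S1EzaDNkRU5hYQ"));     // KQ3h3dENaa
    }
}
```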
