Is there a way to exclude a particular string (II*) while encoding a TIFF file, so that the string is kept as-is, and likewise during decoding?
Or
How can I specify an encoding that always treats (II*) as three separate characters and never combines it with any other characters?
Below is code that replaces the string II* with ( II* ); however, the TIFF was corrupted afterwards.
Path path = Paths.get("D:\\Users\\Vinoth\\workspace\\Testing\\Testing.tiff");
Charset charset = StandardCharsets.UTF_8;
String content = new String(Files.readAllBytes(path), charset);
content = content.replace("II*", " II ");
content = content.replace(" II "," II* ");
Files.write(path, content.getBytes(charset));
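A TIFF file is binary data, not text, so decoding it as UTF-8 is lossy: any byte sequence that is not valid UTF-8 is silently replaced, which is why the file ends up corrupted. As a minimal sketch (with an assumed path), ISO-8859-1 maps every byte value 0x00-0xFF to exactly one char, so the round trip below is lossless. Note that even then, inserting extra characters around II* would shift the byte offsets inside the TIFF structure and still break the file, since the little-endian TIFF header itself begins with the bytes of II* followed by a zero byte.
Path path = Paths.get("Testing.tiff"); // assumed path
byte[] original = Files.readAllBytes(path);
// ISO-8859-1 maps each byte 0x00-0xFF to one char, so nothing is lost
String content = new String(original, StandardCharsets.ISO_8859_1);
byte[] roundTripped = content.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.equals(original, roundTripped)); // prints true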
I am trying to convert codepoints from one charset to another in Java.
For example, the character ř is 248 in windows-1250 and 345 in Unicode.
So I have source charset and source codepoint and target charset and want to calculate target codepoint.
This may sound easy, as windows-1250 is a single-byte charset,
but I want it to work with any charset, like GB2312.
I guess it can be done somehow with the Charset class,
but it seems that it only converts bytes, not actual code points.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; //吧 chinese character
Charset targetCharset = Charset.forName("UTF-8");
int targetCodePoint = ...; //???
I checked the Charset class for codepoint-related methods, but there are only decode and encode, which work with bytes. I tried googling for something related, but without success.
Thanks in advance for any help.
At least in Java, there is no notion of code points for character sets other than Unicode. You have to convert the integer to a byte array and then to Unicode.
Charset sourceCharset = Charset.forName("windows-1250");
int sourceCodePoint = 248; // ř
byte[] bytes = {(byte)sourceCodePoint};
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = ř
targetCodePoint = 345
Chinese characters in GB2312 are represented by 2 bytes, so you need to store them in a byte array of length 2.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; // 吧 chinese character
byte[] bytes = ByteBuffer.allocate(2).putShort((short)sourceCodePoint).array();
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = 吧
targetCodePoint = 21543
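The same idea can be wrapped in a small helper that handles both single-byte and double-byte charsets (a sketch; toUnicodeCodePoint is a hypothetical name, and it assumes the source value fits in at most two bytes, which covers windows-1250 and GB2312 but not three- or four-byte encodings):
static int toUnicodeCodePoint(int sourceCodePoint, Charset sourceCharset) {
    // one byte for values up to 0xFF, two bytes (big-endian) otherwise
    byte[] bytes = sourceCodePoint <= 0xFF
            ? new byte[] {(byte) sourceCodePoint}
            : new byte[] {(byte) (sourceCodePoint >> 8), (byte) sourceCodePoint};
    return new String(bytes, sourceCharset).codePointAt(0);
}
With the examples above, toUnicodeCodePoint(248, Charset.forName("windows-1250")) returns 345, and toUnicodeCodePoint(45257, Charset.forName("GB2312")) returns 21543.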
I am getting PDF content which is Base64-encoded. I tried to decode it using NiFi with the Base64EncodeContent processor, and I am sending the decoded file by mail. Below is a small sample of the output arriving in the mail.
"No data should be available in . ¹ Check if sent . . All documents are sent as pdf to* 9 : ’ ³: > < âA m¬‘²#%é‚ÇŽÇ¢|ÀÈ™$Éز§Uû÷LÒTB¨ l,îåù˜$â´º?6N¬JC¤ŒÃ°‰_Ïg -æ¿;ž‰ìÛÖYl`õ?èÓÌ[ ÿÿ PK"
How can I extract the PDF data as sent by the third party?
I have tried to decode it using Java code, and it fails there as well: the PDF cannot be opened and contains junk characters too.
The ConvertedJPGPDF.pdf file used below contains a Base64-encoded string.
String filePath = "C:\\Users\\xyz\\Desktop\\";
String originalFileName = "ConvertedJPGPDF.pdf";
String newFileName = "test.pdf";
byte[] input_file = Files.readAllBytes(Paths.get(filePath+originalFileName));
// byte[] decodedBytes = Base64.getDecoder().decode(input_file);
byte[] decodedBytes1 = Base64.getMimeDecoder().decode(input_file);
FileOutputStream fos = new FileOutputStream(filePath+newFileName);
fos.write(decodedBytes1);
fos.flush();
fos.close();
You mentioned that the file already contains a Base64-encoded string:
The ConvertedJPGPDF.pdf file used below contains a Base64-encoded string.
So, you don't need to run this line:
byte[] encodedBytes = Base64.getEncoder().encode(input_file);
By doing so, you would be encoding those already-encoded bytes a second time.
Directly decode the input_file array and then save the obtained byte array into a .pdf file.
Update:
The ConvertedJPGPDF.pdf file doesn't really have to be named .pdf; it's really a plain text file, considering that it is Base64 encoded.
Anyway, the following piece of code is working for me:
String filePath = "C:\\Users\\xyz\\Desktop\\";
String originalFileName = "ConvertedJPGPDF.pdf";
String newFileName = "test.pdf";
byte[] input_file = Files.readAllBytes(Paths.get(filePath+originalFileName));
byte[] decodedBytes1 = Base64.getMimeDecoder().decode(input_file);
Files.write(Paths.get(filePath+newFileName), decodedBytes1);
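As a quick sanity check, you can verify that the decoded bytes begin with the PDF magic header %PDF- before mailing the file (a minimal sketch reusing decodedBytes1 from above):
// a valid PDF always starts with the five ASCII bytes "%PDF-"
String header = new String(decodedBytes1, 0, Math.min(5, decodedBytes1.length), StandardCharsets.US_ASCII);
System.out.println(header.equals("%PDF-") ? "Looks like a PDF" : "Not a PDF, starts with: " + header);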
Hope this helps!
I'm trying to save the content of a PDF file in JSON and thought of saving the PDF as a String value converted from a byte[].
byte[] byteArray = feature.convertPdfToByteArray(Paths.get("path.pdf"));
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
The result of the above code is as follows:
true
false
421371 vs 760998
The two Strings are equal while the two byte[]s are not. Why is that, and how do I correctly convert/save a PDF inside JSON?
You are probably using the wrong charset when reading from the PDF file.
For example, the single byte that encodes the character é (e with acute) in ISO-8859-1 is not a valid UTF-8 sequence:
byte[] byteArray = "é".getBytes(StandardCharsets.ISO_8859_1);
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
Output :
true
false
1 vs 3
Why is that
If the byteArray indeed contains a PDF, it most likely is not valid UTF-8. Thus, wherever
String byteString = new String(byteArray, StandardCharsets.UTF_8);
stumbles over a byte sequence which is not valid UTF-8, it replaces it with the Unicode replacement character. I.e., this line damages your data, most likely beyond repair. So the following
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
does not result in the original byte array but instead a damaged version of it.
The newByteArray, on the other hand, is the result of UTF-8 encoding a given string, byteString. Thus, newByteArray is valid UTF-8 and
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
does not need to replace anything outside the UTF-8 mappings; in particular, byteString and secondString are equal.
how do I correctly convert/save a PDF inside JSON?
As @mammago explained in his comment,
JSON is not the appropriate format for binary content (like files). You should probably use something like Base64 to create a string out of your PDF and store that in your JSON object.
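A minimal sketch of that approach (file names assumed): Base64 text is plain ASCII, so it can be embedded safely in a JSON string and decoded back to the identical bytes later.
byte[] pdfBytes = Files.readAllBytes(Paths.get("path.pdf"));
String base64 = Base64.getEncoder().encodeToString(pdfBytes); // safe to store in a JSON field
// ... later, after reading the string back out of the JSON document:
byte[] restored = Base64.getDecoder().decode(base64);
Files.write(Paths.get("restored.pdf"), restored); // byte-identical to the original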
String original = "This is my string valúe";
I'm trying to encode the above string to its UTF-8 equivalent, but replacing only the special character (ú) with "&uacute;" in this case.
I've tried using the below but I get an error:
Input is not proper UTF-8, indicate encoding !Bytes: 0xFA 0x20 0x63 0x61
Code:
String original = new String("This is my string valúe");
byte ptext[] = original.getBytes("UTF-8");
String value = new String(ptext, "UTF-8");
System.out.println("Output : " + value);
This is my string valúe
You could use String.replace(CharSequence, CharSequence) and formatted I/O, like
String original = "This is my string valúe";
System.out.printf("Output : %s%n", original.replace("ú", "&uacute;"));
Which outputs (as I think you wanted)
Output : This is my string val&uacute;e
You seem to want to use XML character entities.
Apache Commons Lang has a method for this (in StringEscapeUtils).
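For example, escapeHtml4 replaces ú with its named entity and leaves plain ASCII untouched (a sketch assuming Commons Lang 3; in newer versions StringEscapeUtils has moved to Apache Commons Text):
import org.apache.commons.lang3.StringEscapeUtils;

String original = "This is my string valúe";
// replaces non-ASCII characters that have named HTML entities
System.out.println(StringEscapeUtils.escapeHtml4(original));
// prints: This is my string val&uacute;e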
I'm trying to encode the above string to its UTF-8 equivalent, but replacing only the special character (ú) with "&uacute;" in this case.
I'm not sure what encoding "&uacute;" is, but have you tried looking at the URLEncoder class? It won't encode the string exactly the way you asked, but it gets rid of the spooky character.
Could you please try the lines below:
byte ptext[] = original.getBytes("UTF8");
String value = new String(ptext, "UTF8");
I retrieved an HTML string from a target site, and within it there is a section
class="f9t" name="Óû§Ãû:ôâÈ»12"
I know it's in GBK encoding, as I can see from the Firefox browser display. But I do not know how to convert that name string into readable GBK text (such as 上海 or 北京).
I am using
String sname = new String(name.getBytes(), "UTF-8");
byte[] gbkbytes = sname.getBytes("gb2312");
String gbkStr = new String( gbkbytes );
System.out.println(gbkStr);
but it does not print as proper GBK text:
???¡ì??:????12
I have no clue how to proceed.
You can try this if you have already read the name with the wrong encoding and got the wrong name value "Óû§Ãû:ôâÈ»12", as @Karol S suggested:
new String(name.getBytes("ISO-8859-1"), "GBK")
Or, if you read a GBK or GB2312 string from the internet or a file, use something like this to get the right string in the first place:
BufferedReader r = new BufferedReader(new InputStreamReader(is, "GBK"));
name = r.readLine();
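A minimal demo of that repair (assuming the displayed string preserved the original bytes exactly, which mojibake often does not):
String garbled = "Óû§Ãû:ôâÈ»12";
// re-encode with the charset that produced the mojibake to recover the
// raw GBK bytes, then decode those bytes as GBK
String repaired = new String(garbled.getBytes("ISO-8859-1"), "GBK");
System.out.println(repaired); // should print the readable name if the bytes survived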
Assuming that name.getBytes() returns the GBK-encoded bytes, it's enough to create a String specifying the encoding of the byte array:
new String(gbkString.getBytes(), "GBK");
According to the documentation, the name of the encoding should be GBK.
Sample code:
String gbkString = "Óû§Ãû:ôâÈ»12";
String utfString = new String(gbkString.getBytes(), "GBK");
System.out.println(utfString);
Result (not 100% sure that it's correct :) ):
脫脙禄搂脙没:么芒脠禄12