â® characters getting converted to question marks while getting back - java

I am running into a strange issue.
I am sending messages to and receiving them from Amazon SQS.
Before sending, I compress and Base64-encode each message like this:
String responseMessageBodyOriginal = gson.toJson(responseData);
String responseMessageBodyCompressed = compressToBase64String(responseMessageBodyOriginal);
AmazonSqsHelper.sendMessage(responseMessageBodyCompressed, queue, null);
The compression and encoding function looks like this:
public static String compressToBase64String(String data) throws IOException {
ByteArrayOutputStream bos = new ByteArrayOutputStream(data.length());
GZIPOutputStream gzip = new GZIPOutputStream(bos);
gzip.write(data.getBytes());
gzip.close();
byte[] compressedBytes = bos.toByteArray();
bos.close();
return new String(Base64.encodeBase64(compressedBytes));
}
On the receiving side, this is the code:
List<Message> sqsMessageList = AmazonSqsHelper.receiveMessages(queueUrl, max_message_read_count,
default_visibility_timeout);
int num_messages = sqsMessageList.size();
if (num_messages > 0) {
for (Message m : sqsMessageList) {
String responseMessageBodyCompressed = m.getBody();
String responseMessageBodyOriginal = decompressFromBase64String(responseMessageBodyCompressed);
}
}
And the function used for decoding and unzipping looks like this:
public static String decompressFromBase64String(String compressedString) throws IOException {
byte[] compressedBytes = Base64.decodeBase64(compressedString);
ByteArrayInputStream bis = new ByteArrayInputStream(compressedBytes);
GZIPInputStream gis = new GZIPInputStream(bis);
BufferedReader br = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
gis.close();
bis.close();
return sb.toString();
}
But the problem is that sometimes, when I pass characters like "â®", they come back as ???? when I print the message after decoding.
I cannot figure out why the encoding and decoding behave this way. Any help would be appreciated.

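The issue is that encoding uses the platform's default charset (data.getBytes()), while decoding uses UTF-8. Characters that the default charset cannot represent are typically replaced with ? at the moment the bytes are produced, so they are already lost before compression.
In compressToBase64String, change data.getBytes() to data.getBytes(StandardCharsets.UTF_8).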
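As a rough illustration of the fix, here is a minimal round-trip sketch that names UTF-8 on both sides. It is not the asker's exact code: it assumes Java 8+ and uses java.util.Base64 in place of the Base64 helper from the question, and the class name GzipBase64 is made up for the example. It also reads the decompressed bytes in chunks instead of line by line, so line breaks inside the message survive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the name is an assumption for this sketch.
final class GzipBase64 {

    // Encode the text as UTF-8, gzip it, then Base64-encode the result.
    static String compressToBase64String(String data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(data.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Reverse the steps: Base64-decode, gunzip, then decode the bytes as UTF-8.
    static String decompressFromBase64String(String compressed) throws IOException {
        byte[] compressedBytes = Base64.getDecoder().decode(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compressedBytes))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gis.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The important part is that the same charset is named explicitly in getBytes(...) and in new String(...), so the result no longer depends on the machine's default.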

Related

Process input json with illegal encoding [duplicate]

This question already has answers here:
JSON Invalid UTF-8 middle byte
(7 answers)
Closed 4 years ago.
My Java app has a consumer that receives JSON files from a server as input, and then I try to convert them using Jackson. But ObjectMapper throws an exception:
com.fasterxml.jackson.databind.JsonMappingException: Invalid UTF-8 middle byte 0x2f
As far as I understand, this is due to incorrect encoding.
Can I somehow recognize the encoding and process the server response?
I needed to recognize the encoding and process the data correctly.
For that I used UniversalDetector from org.mozilla.universalchardet.UniversalDetector:
private static final UniversalDetector DETECTOR = new UniversalDetector(null);
private static String getEncode(byte[] data) throws IOException {
DETECTOR.reset();
byte[] buf = new byte[data.length];
InputStream is = new ByteArrayInputStream(data);
int read;
while ((read = is.read(buf)) > 0 && !DETECTOR.isDone()) {
DETECTOR.handleData(buf, 0, read);
}
is.close();
DETECTOR.dataEnd();
return DETECTOR.getDetectedCharset();
}
And then I read the data with the correct encoding:
private static String readWithEncode(byte[] data, String encoding) throws IOException {
BufferedReader br = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(data), encoding));
StringBuilder result = new StringBuilder();
String s;
while ((s = br.readLine()) != null) {
result.append(s);
}
br.close();
return result.toString();
}
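One detail worth guarding against: getDetectedCharset() returns null when the detector cannot decide. A small sketch of a decode step with a fallback might look like the following. The class name DetectedDecoding and the choice of UTF-8 as the fallback are assumptions for illustration, not part of the answer above. Decoding the byte array directly also keeps line terminators, which readLine() drops.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name and the UTF-8 fallback are assumptions.
final class DetectedDecoding {

    // Decode the bytes with the detected charset name, or fall back to UTF-8
    // when detection returned null or the JVM does not support the charset.
    // Note: Charset.isSupported throws IllegalCharsetNameException for a
    // syntactically illegal name.
    static String decode(byte[] data, String detectedName) {
        Charset charset = StandardCharsets.UTF_8;
        if (detectedName != null && Charset.isSupported(detectedName)) {
            charset = Charset.forName(detectedName);
        }
        return new String(data, charset);
    }
}
```

With the code above, a call would be along the lines of DetectedDecoding.decode(data, getEncode(data)).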

BufferedReader.readline() returning null value

I am writing a method that takes an InputStream as a parameter, but readLine() returns null. While debugging, I can see that the input stream is not empty.
else if (requestedMessage instanceof BytesMessage) {
BytesMessage bytesMessage = (BytesMessage) requestedMessage;
byte[] sourceBytes = new byte[(int) bytesMessage.getBodyLength()];
bytesMessage.readBytes(sourceBytes);
String strFileContent = new String(sourceBytes);
ByteArrayInputStream byteInputStream = new ByteArrayInputStream(sourceBytes);
InputStream inputStrm = (InputStream) byteInputStream;
processMessage(inputStrm, requestedMessage);
}
public void processMessage(InputStream inputStrm, javax.jms.Message requestedMessage) {
String externalmessage = tradeEntryTrsMessageHandler.convertInputStringToString(inputStrm);
}
public String convertInputStringToString(InputStream inputStream) throws IOException {
BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
return sb.toString();
}
Try this:
BufferedReader br = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
I believe the raw data, as read, carries no character-set information, so specifying UTF-8 (Universal Character Set Transformation Format, 8-bit) explicitly might help.
Are you sure you are initializing and passing a valid InputStream to the function?
Also, just FYI maybe you were trying to name your function convertInputStreamToString instead of convertInputStringToString?
Here are two other ways of converting your InputStream to String, try these maybe?
1.
String theString = IOUtils.toString(inputStream, encoding);
2.
public String convertInputStringToString(InputStream is) {
java.util.Scanner s = new java.util.Scanner(is, encoding).useDelimiter("\\A");
return s.hasNext() ? s.next() : "";
}
EDIT:
You needn't explicitly convert ByteArrayInputStream to InputStream. You could do directly:
InputStream inputStrm = new ByteArrayInputStream(sourceBytes);

Java fast stream copy with ISO-8859-1

I have the following code, which reads files in ISO-8859-1, as that's what is required in this application:
private static String readFile(String filename) throws IOException {
String lineSep = System.getProperty("line.separator");
File f = new File(filename);
StringBuffer sb = new StringBuffer();
if (f.exists()) {
BufferedReader br =
new BufferedReader(
new InputStreamReader(
new FileInputStream(filename), "ISO-8859-1"));
String nextLine = "";
while ((nextLine = br.readLine()) != null) {
sb.append(nextLine+ " ");
// note: BufferedReader strips the EOL character.
// sb.append(lineSep);
}
br.close();
}
return sb.toString();
}
The problem is that it is pretty slow. I have this function, which is MUCH faster, but I cannot figure out where to specify the character encoding:
private static String fastStreamCopy(String filename)
{
String s = "";
FileChannel fc = null;
try
{
fc = new FileInputStream(filename).getChannel();
MappedByteBuffer byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
int size = byteBuffer.capacity();
if (size > 0)
{
byteBuffer.clear();
byte[] bytes = new byte[size];
byteBuffer.get(bytes, 0, bytes.length);
s = new String(bytes);
}
fc.close();
}
catch (FileNotFoundException fnfx)
{
System.out.println("File not found: " + fnfx);
}
catch (IOException iox)
{
System.out.println("I/O problems: " + iox);
}
finally
{
if (fc != null)
{
try
{
fc.close();
}
catch (IOException ignore)
{
}
}
}
return s;
}
Does anyone have an idea of where I should be putting the ISO encoding?
From the code you posted, you're not trying to "copy" the stream, but read it into a string.
You can simply provide the encoding in the String constructor:
s = new String(bytes, "ISO-8859-1");
Personally I'd just replace the whole method with a call to the Guava method Files.toString():
String content = Files.toString(new File(filename), StandardCharsets.ISO_8859_1);
If you're using Java 6 or earlier, you'll need to use the Guava field Charsets.ISO_8859_1 instead of StandardCharsets.ISO_8859_1 (which was only introduced in Java 7).
However your use of the term "copy" suggests that you want to write the result to some other file (or stream). If that is true, then you don't need to care about the encoding at all, since you can just handle the byte[] directly and avoid the (unnecessary) conversion to and from String.
Specify the encoding wherever you convert bytes to a String, e.g. s = new String(bytes, encoding);, or vice versa.
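For readers on Java 7 or later who would rather not add a library, the same idea can be sketched with java.nio.file.Files. This is an alternative to both methods in the question, not a drop-in copy of either: the class name Latin1FileReader is invented for the example, and returning an empty string for a missing file is an assumption that mirrors the original readFile.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; the class name is an assumption for this sketch.
final class Latin1FileReader {

    // Read the whole file into memory and decode it as ISO-8859-1.
    static String readFile(String filename) throws IOException {
        Path path = Paths.get(filename);
        if (!Files.exists(path)) {
            return ""; // assumption: same result as the original for a missing file
        }
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Unlike the readLine() loop in the question, this keeps the file's line terminators instead of replacing them with spaces, so adjust the result if the space-joined form is what the application expects.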

Changing encoding in java

I am writing a function that should detect the charset in use and then convert the text to UTF-8. I am using juniversalchardet, which is a Java port of Mozilla's universalchardet.
This is my code:
private List<List<String>> setProperEncoding(List<List<String>> input) {
try {
// Detect used charset
UniversalDetector detector = new UniversalDetector(null);
int position = 0;
while ((position < input.size()) & (!detector.isDone())) {
String row = null;
for (String cell : input.get(position)) {
row += cell;
}
byte[] bytes = row.getBytes();
detector.handleData(bytes, 0, bytes.length);
position++;
}
detector.dataEnd();
Charset charset = Charset.forName(detector.getDetectedCharset());
Charset utf8 = Charset.forName("UTF-8");
System.out.println("Detected charset: " + charset);
// rewrite input using proper charset
List<List<String>> newLines = new ArrayList<List<String>>();
for (List<String> row : input) {
List<String> newRow = new ArrayList<String>();
for (String cell : row) {
//newRow.add(new String(cell.getBytes(charset)));
ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset));
CharBuffer cb = charset.decode(bb);
bb = utf8.encode(cb);
newRow.add(new String(bb.array()));
}
newLines.add(newRow);
}
return newLines;
} catch (Exception e) {
e.printStackTrace();
return input;
}
}
My problem is that when I read a file containing characters from, for example, the Polish alphabet, letters like ł, ą, ć and similar are replaced by ? and other strange characters. What am I doing wrong?
EDIT:
For compilation I am using Eclipse.
The method parameter is the result of reading a MultipartFile: I use a FileInputStream to read every line and then split each line on a separator (it is prepared for xls, xlsx and csv files). Nothing special there.
First of all, your data exists somewhere in binary form. For simplicity, I assume it comes from an InputStream.
You want to write the output as UTF-8 text; I assume the destination can be an OutputStream.
I would recommend creating an AutoDetectInputStream:
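import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;
public class AutoDetectInputStream extends InputStream {
private final InputStream is;
private final byte[] sampleData = new byte[4096];
private final int sampleLen;
private int sampleIndex = 0;
public AutoDetectInputStream(InputStream is) throws IOException {
this.is = is;
// pre-read a sample; read() returns -1 on an empty stream
sampleLen = Math.max(is.read(sampleData), 0);
}
public Charset getCharset() {
// detect the charset from the sample
UniversalDetector detector = new UniversalDetector(null);
detector.handleData(sampleData, 0, sampleLen);
detector.dataEnd();
// getDetectedCharset() returns a charset name, or null if detection failed;
// falling back to the default charset is an assumption, pick what suits your data
String name = detector.getDetectedCharset();
return name != null ? Charset.forName(name) : Charset.defaultCharset();
}
@Override
public int read() throws IOException {
// replay the sampled bytes first, as unsigned values (0-255)
if (sampleIndex < sampleLen) {
return sampleData[sampleIndex++] & 0xFF;
}
return is.read();
}
}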
The second task is quite simple: once the bytes are decoded, Java holds the text as Unicode strings (UTF-16 internally), so an OutputStreamWriter created with UTF-8 encodes it correctly on output. So, here's your code:
// open input with Detector stream
// we use BufferedReader so we could read lines
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();
// here we can use the charset to decode the bytes into characters
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));
// open output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, Charset.forName("UTF-8"));
// copy the whole file
String line;
while((line = rdr.readLine()) != null) {
// readLine() strips the line terminator, so write one back
utf8Writer.append(line).append('\n');
}
// close streams
rdr.close();
utf8Writer.flush();
utf8Writer.close();
After this, the whole text file is transcoded to UTF-8.
Note that the sample buffer must be large enough to give the UniversalDetector enough data to work with.
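If the source charset is already known (or has been detected beforehand), the transcoding step on its own can be sketched as below. This is only an illustration under that assumption: the class and method names are made up, and it copies characters in chunks rather than line by line, so the original line terminators pass through unchanged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are assumptions for this sketch.
final class Transcoder {

    // Decode bytes from `in` using `source`, and write them to `out` as UTF-8.
    // Both streams are closed when the copy finishes.
    static void transcodeToUtf8(InputStream in, Charset source, OutputStream out) throws IOException {
        try (Reader reader = new InputStreamReader(in, source);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        }
    }
}
```

The key point is that the conversion happens on the raw bytes. Once text is already in a String, calling getBytes(charset) and decoding again with the same charset cannot repair characters that were decoded wrongly when the file was first read.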

Convert InputStream to String with encoding given in stream data

My input is an InputStream which contains an XML document. The encoding used in the XML is unknown; it is declared in the first line of the document.
From this InputStream, I want to read the whole document into a String.
To do this, I use a BufferedInputStream to mark the beginning of the stream and read the first line. I use this first line to get the encoding, and then I use an InputStreamReader to generate a String with the correct encoding.
This does not seem to be the best way to achieve this, because it produces an OutOfMemoryError.
Any idea how to do it?
public static String streamToString(final InputStream is) {
String result = null;
if (is != null) {
BufferedInputStream bis = new BufferedInputStream(is);
bis.mark(Integer.MAX_VALUE);
final StringBuilder stringBuilder = new StringBuilder();
try {
// stream reader that handle encoding
final InputStreamReader readerForEncoding = new InputStreamReader(bis, "UTF-8");
final BufferedReader bufferedReaderForEncoding = new BufferedReader(readerForEncoding);
String encoding = extractEncodingFromStream(bufferedReaderForEncoding);
if (encoding == null) {
encoding = DEFAULT_ENCODING;
}
// stream reader that handle encoding
bis.reset();
final InputStreamReader readerForContent = new InputStreamReader(bis, encoding);
final BufferedReader bufferedReaderForContent = new BufferedReader(readerForContent);
String line = bufferedReaderForContent.readLine();
while (line != null) {
stringBuilder.append(line);
line = bufferedReaderForContent.readLine();
}
bufferedReaderForContent.close();
bufferedReaderForEncoding.close();
} catch (IOException e) {
// reset string builder
stringBuilder.delete(0, stringBuilder.length());
}
result = stringBuilder.toString();
}else {
result = null;
}
return result;
}
The call to mark(Integer.MAX_VALUE) is what leads to the OutOfMemoryError: it allows the BufferedInputStream to buffer up to 2 GB, so everything read after the mark is kept in memory.
You can solve this by using an iterative approach. Set the mark readLimit to a reasonable value, say 8K. In 99% of cases this will work, but in pathological cases, e.g. 16K of spaces between the attributes in the declaration, you will need to try again. Thus, have a loop that tries to find the encoding, and if it doesn't find it within the given mark region, tries again with double the mark readLimit.
To be sure you don't advance the input stream past the mark limit, you should read the InputStream yourself, up to the mark limit, into a byte array. You then wrap the byte array in a ByteArrayInputStream and pass that to the constructor of the InputStreamReader assigned to 'readerForEncoding'.
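A simplified variant of this idea is sketched below. Instead of mark/reset with a doubling limit, it reads a fixed-size prefix into a byte array once, looks for the encoding declaration there, and then decodes the prefix plus the rest of the stream. The assumptions are stated in the comments: the declaration sits within the first 8 KB, the encoding is ASCII-compatible (so UTF-16 documents are not handled), and UTF-8 is used when nothing is declared. The class name, the constant and the regular expression are all made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and limits are assumptions for this sketch.
final class XmlStreamToString {

    // Matches encoding="..." inside an XML declaration.
    private static final Pattern ENCODING =
            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // Assumption: the declaration fits in the first 8 KB.
    private static final int PREFIX_LIMIT = 8192;

    static String streamToString(InputStream is) throws IOException {
        // Read up to PREFIX_LIMIT bytes ourselves, so nothing else buffers the stream.
        byte[] prefix = new byte[PREFIX_LIMIT];
        int len = 0;
        int n;
        while (len < prefix.length && (n = is.read(prefix, len, prefix.length - len)) != -1) {
            len += n;
        }

        // ISO-8859-1 maps each byte to one char, so an ASCII declaration stays readable.
        String head = new String(prefix, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        // Charset.forName throws if the declared name is unknown to the JVM.
        Charset charset = m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;

        // Replay the prefix, then continue with the rest of the original stream.
        InputStream all = new SequenceInputStream(new ByteArrayInputStream(prefix, 0, len), is);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(all, charset)) {
            char[] buf = new char[4096];
            int read;
            while ((read = reader.read(buf)) != -1) {
                sb.append(buf, 0, read);
            }
        }
        return sb.toString();
    }
}
```

Note that the resulting String still holds the whole document in memory, so very large inputs can exhaust the heap regardless of how the encoding is detected.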
You can use this method to convert an InputStream to a String; it might help you. Note that it decodes with the platform default charset and drops line terminators:
private String convertStreamToString(InputStream input) throws Exception{
BufferedReader reader = new BufferedReader(new InputStreamReader(input));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
sb.append(line);
}
input.close();
return sb.toString();
}
