I am writing a function that should detect the charset in use and then convert the text to UTF-8. I am using juniversalchardet, which is a Java port of Mozilla's universalchardet.
This is my code:
private List<List<String>> setProperEncoding(List<List<String>> input) {
try {
// Detect used charset
UniversalDetector detector = new UniversalDetector(null);
int position = 0;
while ((position < input.size()) & (!detector.isDone())) {
String row = null;
for (String cell : input.get(position)) {
row += cell;
}
byte[] bytes = row.getBytes();
detector.handleData(bytes, 0, bytes.length);
position++;
}
detector.dataEnd();
Charset charset = Charset.forName(detector.getDetectedCharset());
Charset utf8 = Charset.forName("UTF-8");
System.out.println("Detected charset: " + charset);
// rewrite input using proper charset
List<List<String>> newLines = new ArrayList<List<String>>();
for (List<String> row : input) {
List<String> newRow = new ArrayList<String>();
for (String cell : row) {
//newRow.add(new String(cell.getBytes(charset)));
ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset));
CharBuffer cb = charset.decode(bb);
bb = utf8.encode(cb);
newRow.add(new String(bb.array()));
}
newLines.add(newRow);
}
return newLines;
} catch (Exception e) {
e.printStackTrace();
return input;
}
}
My problem is that when I read a file containing, for example, Polish characters, letters like ł, ą, ć and similar are replaced by ? and other strange symbols. What am I doing wrong?
EDIT:
For compilation I am using Eclipse.
The method parameter is the result of reading a MultipartFile: I use a FileInputStream to get every line and then split each line by some separator (it is prepared for xls, xlsx and csv files). Nothing special there.
First of all, you have your data somewhere in a binary format. For the sake of simplicity, I suppose it comes from an InputStream.
You want to write the output as UTF-8 text; I suppose it can go to an OutputStream.
I would recommend creating an AutoDetectInputStream:
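public class AutoDetectInputStream extends InputStream {
    private final InputStream is;
    private final byte[] sampleData = new byte[4096];
    private final int sampleLen;
    private int sampleIndex = 0;

    public AutoDetectInputStream(InputStream is) throws IOException {
        this.is = is;
        // pre-read a sample; read() returns -1 on an empty stream, so clamp to 0
        sampleLen = Math.max(is.read(sampleData), 0);
    }

    public Charset getCharset() {
        // detect the charset from the sample
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(sampleData, 0, sampleLen);
        detector.dataEnd();
        // getDetectedCharset() returns a charset name, or null if nothing was detected;
        // falling back to the platform default here is an assumption, pick what suits you
        String name = detector.getDetectedCharset();
        return name != null ? Charset.forName(name) : Charset.defaultCharset();
    }

    @Override
    public int read() throws IOException {
        // replay the sampled bytes first, as unsigned values (0..255)
        if (sampleIndex < sampleLen) {
            return sampleData[sampleIndex++] & 0xFF;
        }
        return is.read();
    }
}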
The second task is quite simple: once the bytes are decoded, Java strings hold Unicode characters (UTF-16 internally), so an OutputStreamWriter set to UTF-8 is all you need to encode them. So, here's your code:
// open input with Detector stream
// we use BufferedReader so we could read lines
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();
// here we can use the charset to decode the bytes into characters
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));
// open output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, Charset.forName("UTF-8"));
// copy the whole file line by line
String line;
while((line = rdr.readLine()) != null) {
    utf8Writer.append(line);
    // readLine() strips the line terminator, so write one back
    utf8Writer.append('\n');
}
// close streams
rdr.close();
utf8Writer.flush();
utf8Writer.close();
So, finally, your whole text file is transcoded to UTF-8.
Note that the sample buffer should be big enough to feed the UniversalDetector.
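To illustrate why detection and decoding have to happen on the raw bytes rather than on already-decoded strings, here is a minimal sketch using only the JDK. The class name Transcode and the method toUtf8 are made up for this example, and the source charset is passed in by hand instead of being detected.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Decode the raw bytes with their real source charset, then encode as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset source) {
        return new String(raw, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // U+00E9 (e with acute) in ISO-8859-1
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // 2 (the bytes 0xC3 0xA9)

        // Decoding with the wrong charset is lossy: the byte becomes U+FFFD
        // and no later re-encoding can bring the original character back.
        String wrong = new String(latin1, StandardCharsets.US_ASCII);
        System.out.println(wrong.equals("\uFFFD")); // true
    }
}
```

Once a byte has been turned into a replacement character, the original value is gone, which is why re-encoding a List of already-decoded strings cannot repair them.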
I am having a very weird issue.
I am putting and getting messages from Amazon AWS SQS.
While putting I am compressing and encoding the messages, like this :
String responseMessageBodyOriginal = gson.toJson(responseData);
String responseMessageBodyCompressed = compressToBase64String(responseMessageBodyOriginal);
AmazonSqsHelper.sendMessage(responseMessageBodyCompressed, queue, null);
Compression and encoding function, looks like this :
public static String compressToBase64String(String data) throws IOException {
ByteArrayOutputStream bos = new ByteArrayOutputStream(data.length());
GZIPOutputStream gzip = new GZIPOutputStream(bos);
gzip.write(data.getBytes());
gzip.close();
byte[] compressedBytes = bos.toByteArray();
bos.close();
return new String(Base64.encodeBase64(compressedBytes));
}
On the other hand, while receiving message, this is the code :
List<Message> sqsMessageList = AmazonSqsHelper.receiveMessages(queueUrl, max_message_read_count,
default_visibility_timeout);
int num_messages = sqsMessageList.size();
if (num_messages > 0) {
for (Message m : sqsMessageList) {
String responseMessageBodyCompressed = m.getBody();
String responseMessageBodyOriginal = decompressFromBase64String(responseMessageBodyCompressed);
}
}
And the function used for decoding and unzipping is like this :
public static String decompressFromBase64String(String compressedString) throws IOException {
byte[] compressedBytes = Base64.decodeBase64(compressedString);
ByteArrayInputStream bis = new ByteArrayInputStream(compressedBytes);
GZIPInputStream gis = new GZIPInputStream(bis);
BufferedReader br = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
gis.close();
bis.close();
return sb.toString();
}
But the problem is that, at times, if I pass characters like "â®", they come out as ???? when I print the message after decoding.
I can't figure out why encoding and decoding behave this way. Any help would be appreciated.
The issue is that encoding uses the platform's default charset (data.getBytes()), while decoding uses UTF-8.
In compressToBase64String, change data.getBytes() to data.getBytes(StandardCharsets.UTF_8).
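As a rough sketch of a symmetric round trip, the version below names UTF-8 on both sides. It assumes Java 8 or later and swaps in java.util.Base64 for the Base64 helper used in the question; the class name GzipBase64 and its method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipBase64 {
    // String -> UTF-8 bytes -> gzip -> Base64 text
    static String compress(String data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
                gzip.write(data.getBytes(StandardCharsets.UTF_8));
            }
            return Base64.getEncoder().encodeToString(bos.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Base64 text -> gzip bytes -> UTF-8 bytes -> String
    static String decompress(String base64) {
        try {
            byte[] compressed = Base64.getDecoder().decode(base64);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = gis.read(buf)) != -1) {
                    out.write(buf, 0, n); // copy only the bytes actually read
                }
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u00e2\u00ae first line\nsecond line";
        System.out.println(decompress(compress(original)).equals(original)); // true
    }
}
```

Copying the decompressed bytes instead of going through readLine() also keeps the line breaks, which the readLine() loop in the question silently drops.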
I'm trying to read the first line from a socket stream with a BufferedReader wrapped around a BufferedInputStream. The first line (1) is the size of some content (2), and that content contains the size of another content (3).
1. Reads correctly (with BufferedReader, _bin.readLine())
2. Reads correctly too (with _in.read(byte[] b))
3. Won't read; it seems there is more content than the size I read in (2)
I think the problem is that I'm reading with the BufferedReader and then with the BufferedInputStream. Can anyone help me?
public HashMap<String, byte[]> readHead() throws IOException {
JSONObject json;
try {
HashMap<String, byte[]> map = new HashMap<>();
System.out.println("reading header");
int headersize = Integer.parseInt(_bin.readLine());
byte[] parsable = new byte[headersize];
_in.read(parsable);
json = new JSONObject(new String(parsable));
map.put("id", lTob(json.getLong(SagConstants.KEY_ID)));
map.put("length", iTob(json.getInt(SagConstants.KEY_SIZE)));
map.put("type", new byte[]{(byte)json.getInt(SagConstants.KEY_TYPE)});
return map;
} catch(SocketException | JSONException e) {
_exception = e.getMessage();
_error_code = SagConstants.ERROR_OCCOURED_EXCEPTION;
return null;
}
}
Sorry for my bad English and the poor explanation; I hope you understand the problem.
The file format is:
size1
{json, length is given size1, there is size2 given}
{second json, length is size2}
_in is a BufferedInputStream;
_bin is a BufferedReader over _in;
With _bin, I read the first line (size1) and convert it to an integer.
With _in, I read the next block, whose length is size1 and which contains size2.
Then I try to read the last block, whose size is size2,
something like this:
byte[] b = new byte[secondSize];
_in.read(b);
and nothing happens here; the program just blocks.
can't work with BufferedInputStream and BufferedReader together
That's correct. If you use any buffered stream or reader on a socket [or indeed any data source], you can't use any other stream or reader with it whatsoever. Data will get 'lost', that is to say read-ahead, in the buffer of the buffered stream or reader, and will not be available to the other stream/reader.
You need to rethink your design.
You create a BufferedReader _bin and a BufferedInputStream _in over the same source and read from both. The BufferedReader reads ahead and keeps more than the first line in its own buffer, so those bytes are no longer available to _in. You should read size1 with _in too.
int headersize = Integer.parseInt(readLine(_in, ""));
byte[] parsable = new byte[headersize];
_in.read(parsable);
Use the readLine below to read all data through the BufferedInputStream.
private static final int NL = '\n'; // new line

// Reads bytes up to (not including) the next '\n' or the end of the stream.
// Each byte is treated as one character, which is fine for an ASCII size line.
private static String readLine(BufferedInputStream reader,
        String accumulator) throws IOException {
    StringBuilder sb = new StringBuilder(accumulator);
    int byteRead;
    // read() returns -1 at end of stream, otherwise a value in 0..255
    while ((byteRead = reader.read()) != -1 && byteRead != NL) {
        if (byteRead != '\r') { // tolerate CRLF line endings
            sb.append((char) byteRead);
        }
    }
    return sb.toString();
}
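A single call to read(byte[]) may return fewer bytes than requested, especially on a socket, so a length-prefixed format like this is usually read through one stream with a helper that loops until the block is complete. The following is only a sketch under that assumption: FramedReader, readAsciiLine and readExactly are hypothetical names, and the in-memory stream in main stands in for the socket.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class FramedReader {
    // Reads bytes up to the next '\n' (or end of stream) as an ASCII line.
    static String readAsciiLine(InputStream in) {
        try {
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                if (b != '\r') {
                    sb.append((char) b);
                }
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads exactly n bytes; readFully loops until the array is full
    // and throws EOFException if the stream ends early.
    static byte[] readExactly(InputStream in, int n) {
        try {
            byte[] buf = new byte[n];
            // DataInputStream does not buffer, so wrapping here consumes nothing extra
            new DataInputStream(in).readFully(buf);
            return buf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // size line, then a 7-byte header, then a 5-byte body
        byte[] wire = "7\n{\"n\":5}hello".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(wire);
        int size1 = Integer.parseInt(readAsciiLine(in));
        String header = new String(readExactly(in, size1), StandardCharsets.US_ASCII);
        String body = new String(readExactly(in, 5), StandardCharsets.US_ASCII);
        System.out.println(header + " / " + body);
    }
}
```

Because every read goes through the same stream object, nothing can be stranded in a second buffer.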
I have the following code, which reads files in ISO-8859-1, as that's what is required in this application:
private static String readFile(String filename) throws IOException {
String lineSep = System.getProperty("line.separator");
File f = new File(filename);
StringBuffer sb = new StringBuffer();
if (f.exists()) {
BufferedReader br =
new BufferedReader(
new InputStreamReader(
new FileInputStream(filename), "ISO-8859-1"));
String nextLine = "";
while ((nextLine = br.readLine()) != null) {
sb.append(nextLine+ " ");
// note: BufferedReader strips the EOL character.
// sb.append(lineSep);
}
br.close();
}
return sb.toString();
}
The problem is that it is pretty slow. I have this function, which is MUCH faster, but I cannot work out where to specify the character encoding:
private static String fastStreamCopy(String filename)
{
String s = "";
FileChannel fc = null;
try
{
fc = new FileInputStream(filename).getChannel();
MappedByteBuffer byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
int size = byteBuffer.capacity();
if (size > 0)
{
byteBuffer.clear();
byte[] bytes = new byte[size];
byteBuffer.get(bytes, 0, bytes.length);
s = new String(bytes);
}
fc.close();
}
catch (FileNotFoundException fnfx)
{
System.out.println("File not found: " + fnfx);
}
catch (IOException iox)
{
System.out.println("I/O problems: " + iox);
}
finally
{
if (fc != null)
{
try
{
fc.close();
}
catch (IOException ignore)
{
}
}
}
return s;
}
Does anyone have an idea of where I should be putting the ISO encoding?
From the code you posted, you're not trying to "copy" the stream, but read it into a string.
You can simply provide the encoding in the String constructor:
s = new String(bytes, "ISO-8859-1");
Personally I'd just replace the whole method with a call to the Guava method Files.toString():
String content = Files.toString(new File(filename), StandardCharsets.ISO_8859_1);
If you're using Java 6 or earlier, you'll need to use the Guava field Charsets.ISO_8859_1 instead of StandardCharsets.ISO_8859_1 (which was only introduced in Java 7).
However your use of the term "copy" suggests that you want to write the result to some other file (or stream). If that is true, then you don't need to care about the encoding at all, since you can just handle the byte[] directly and avoid the (unnecessary) conversion to and from String.
Specify the encoding wherever you convert bytes to a string, e.g. s = new String(bytes, encoding);, or vice versa.
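For comparison, a JDK-only sketch of the same idea (Java 7 or later assumed): read all bytes, then decode once with an explicit charset. Latin1File and decodeLatin1 are illustrative names, and main writes a small temporary file only so the example can run on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Latin1File {
    // The charset is named where bytes become characters.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reads the whole file into memory and decodes it as ISO-8859-1.
    static String readFile(String filename) {
        try {
            return decodeLatin1(Files.readAllBytes(Paths.get(filename)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("latin1", ".txt");
        Files.write(tmp, new byte[] {'c', 'a', 'f', (byte) 0xE9});
        System.out.println(readFile(tmp.toString()).length()); // 4
        Files.delete(tmp);
    }
}
```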
I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.
So far I've tried
private String replaceWordChars(String text_in) {
    String s = text_in;
    // note: inside a character class '|' is a literal pipe, so it is not used here
    // smart single quotes and apostrophe
    s = s.replaceAll("[\\u2018\\u2019\\u201A]", "'");
    // smart double quotes
    s = s.replaceAll("[\\u201C\\u201D\\u201E]", "\"");
    // ellipsis
    s = s.replaceAll("\\u2026", "...");
    // dashes
    s = s.replaceAll("[\\u2013\\u2014]", "-");
    // circumflex
    s = s.replaceAll("\\u02C6", "^");
    // open angle bracket
    s = s.replaceAll("\\u2039", "<");
    // close angle bracket
    s = s.replaceAll("\\u203A", ">");
    // spaces
    s = s.replaceAll("[\\u02DC\\u00A0]", " ");
    return s;
}
Which works, but I don't want to hand-map every Windows-1252 character to its UTF-16 equivalent (assuming that's what the default Java character set is).
However our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example
private String replaceWordChars(String text_in) {
String s = text_in;
try {
byte[] b = s.getBytes("Cp1252");
byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
s = new String(encoded, "UTF-16");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return s;
}
But when I watch the encoding happen in the Eclipse debugger, nothing changes.
There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.
Any thoughts?
You could try using java.nio.charset.Charset:
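final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
// 0x91 is the Windows-1252 byte for the left single quotation mark (U+2018)
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
final ByteBuffer utfBuffer = utfCharset.encode(windowsEncoded);
// copy only the encoded bytes; array() may also expose unused capacity
final byte[] utfEncoded = new byte[utfBuffer.remaining()];
utfBuffer.get(utfEncoded);
System.out.println(new String(utfEncoded, utfCharset));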
Use the following steps:
1. Create an InputStreamReader using the source file's encoding (Windows-1252).
2. Create an OutputStreamWriter using the destination file's encoding (UTF-16).
3. Copy the information read from the reader to the writer. You can use BufferedReader and BufferedWriter to write contents line-by-line.
So your code may look like this:
public void reencode(InputStream source, OutputStream dest,
String sourceEncoding, String destEncoding)
throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
String in;
while ((in = reader.readLine()) != null) {
    writer.write(in);
    writer.newLine();
}
// flush the BufferedWriter, otherwise buffered output may never reach dest
writer.flush();
}
This, of course, excludes try/catch stuff and delegates it to the caller.
If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:
public String decode(InputStream source, String sourceEncoding)
throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
StringWriter writer = new StringWriter();
String in;
while ((in = reader.readLine()) != null) {
writer.write(in);
writer.write('\n'); // Java newline should be fine, test this just in case
}
return writer.toString();
}
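A small sketch of what naming the source encoding buys you, and of one common way a ? can appear. It assumes the JDK in use ships the windows-1252 charset; the class name Cp1252Demo and the method decodeCp1252 are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252Demo {
    // Decodes raw Windows-1252 bytes into a Java String.
    static String decodeCp1252(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in Windows-1252
        byte[] raw = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        String text = decodeCp1252(raw);
        System.out.println(text.equals("\u201Chi\u201D")); // true

        // Encoding into a charset that has no curly quotes substitutes '?'
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) latin1[0]); // ?
    }
}
```

If the outgoing message is encoded with a charset that cannot represent these characters, the same substitution happens there, so the output encoding is worth checking as well as the input decoding.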
What seems to work so far for everything I've tested is:
private String replaceWordChars(String text_in) {
    final Charset windowsCharset = Charset.forName("windows-1252");
    final Charset utfCharset = Charset.forName("UTF-16");
    // NOTE: getBytes() uses the platform default charset, so this only behaves
    // as intended where that default can represent the incoming characters
    byte[] incomingBytes = text_in.getBytes();
    final CharBuffer windowsEncoded =
        windowsCharset.decode(ByteBuffer.wrap(incomingBytes));
    final ByteBuffer utfBuffer = utfCharset.encode(windowsEncoded);
    // decode with the same charset that was used to encode
    return utfCharset.decode(utfBuffer).toString();
}
I'm trying to read a resource (asdf.txt), but if the file is bigger than 5000 bytes, null characters (for example, 4700 of them) are appended to the end of the content variable. Is there any way to remove them (or to set the right buffer size)?
Here is the code:
String content = "";
try {
InputStream in = this.getClass().getResourceAsStream("asdf.txt");
byte[] buffer = new byte[5000];
while (in.read(buffer) != -1) {
content += new String(buffer);
}
} catch (Exception e) {
e.printStackTrace();
}
The simplest way is to do the correct thing: Use a Reader to read text data:
public String readFromFile(String filename, String enc) throws Exception {
String content = "";
Reader in = new
InputStreamReader(this.getClass().getResourceAsStream(filename), enc);
StringBuffer temp = new StringBuffer(1024);
char[] buffer = new char[1024];
int read;
while ((read=in.read(buffer, 0, buffer.length)) != -1) {
temp.append(buffer, 0, read);
}
content = temp.toString();
return content;
}
Note that you definitely should define the encoding of the text file you want to read.
And note that both your code and this example code work equally well on Java SE and J2ME.
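The null characters come from ignoring the count returned by read(): the last pass fills only part of the buffer, but the whole buffer is converted. If you prefer to stay at the byte level, a sketch like the following (Java SE assumed; ReadAll, readAllBytes and readText are illustrative names) copies only the bytes actually read and decodes once at the end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReadAll {
    // Drains the stream, copying only the bytes each read() call delivered.
    static byte[] readAllBytes(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[5000];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // n may be smaller than buffer.length
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decoding once at the end also avoids splitting a multi-byte character
    // across two buffer fills.
    static String readText(InputStream in, Charset charset) {
        return new String(readAllBytes(in), charset);
    }

    public static void main(String[] args) {
        byte[] data = new byte[5300];
        java.util.Arrays.fill(data, (byte) 'a');
        String s = readText(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(s.length()); // 5300
    }
}
```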