ANSI to UTF-8 through Java: some lines are lost

I wanted to convert some files from ANSI (Arabic) to UTF-8. The conversion works, but the new file is missing some lines at the end. Any ideas why?
This is the code:
import java.io.*;

public class CustomFileConverter {

    private static final char BYTE_ORDER_MARK = '\uFEFF';

    public void createFile(String inputFile, String outputFile) throws IOException {
        FileInputStream input = new FileInputStream(inputFile);
        InputStreamReader inputStreamReader = new InputStreamReader(input, "windows-1256"); // Arabic
        char[] data = new char[1024];
        int i = inputStreamReader.read(data);
        if (new File(outputFile).exists()) {
            new File(outputFile).delete();
        }
        FileOutputStream output = new FileOutputStream(outputFile, true);
        Writer writer = new OutputStreamWriter(output, "UTF-8");
        String text = "";
        writer.write(BYTE_ORDER_MARK);
        while (i != -1) {
            String str = new String(data, 0, i);
            text = text + str;
            i = inputStreamReader.read(data);
        }
        // System.out.print(text); it is printed completely
        writer.write(text);
        // File lacks some final lines...
        output.close();
        input.close();
    }
}

When you wrap an output stream in a writer and write to the writer, the writer may buffer data before actually forwarding it to the underlying stream.
Since you close the output stream (the file) before flushing the writer, there may be buffered data that can no longer be written to the file, because the output stream is already closed.
Instead of closing the FileOutputStream output, close the Writer writer: that flushes the writer's buffered contents to the file and also closes both the writer itself and the wrapped FileOutputStream.
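Applied to the method above, the closing lines become (a minimal sketch; on Java 7+, a try-with-resources block would handle the flush and close automatically):

    writer.write(text);
    writer.close();            // flushes buffered characters, then closes output as well
    inputStreamReader.close(); // closes the wrapped input stream too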

Related

How to read n base64 encoded characters from a file at a time and decode and write to another file?

Currently I have a source file which holds base64-encoded data (about 20 MB in size). I want to read from this file, decode the data and write it to a .TIF output file. However, I don't want to decode all 20 MB at once; I want to read a specific number of characters/bytes from the source file, decode them and write them to the destination file. I understand that the amount of data I read from the source file has to be a multiple of 4, or else it can't be decoded?
Below is my current code, where I decode it all at once:
public void writeOutput(File file) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(file));
    StringBuilder sb = new StringBuilder();
    String line = br.readLine();
    while (line != null) {
        // Read line by line and append to sb
        sb.append(line);
        line = br.readLine();
    }
    br.close();
    byte[] decoded = Base64.getMimeDecoder().decode(sb.toString());
    File outputFile = new File("output.tif");
    OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFile));
    out.write(decoded);
    out.flush();
}
How can I read a specific number of characters from the source file, decode them and then write them to the output file, so that I don't have to load everything into memory?
Here is a simple method demonstrating this, by wrapping the Base64 decoder around an input stream and reading into an appropriately sized byte array:
public static void readBase64File(File inputFile, File outputFile, int chunkSize) throws IOException {
    FileInputStream fin = new FileInputStream(inputFile);
    FileOutputStream fout = new FileOutputStream(outputFile);
    InputStream base64Stream = Base64.getMimeDecoder().wrap(fin);
    byte[] chunk = new byte[chunkSize];
    int read;
    while ((read = base64Stream.read(chunk)) != -1) {
        fout.write(chunk, 0, read);
    }
    fin.close();
    fout.close();
}
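Because Base64.getMimeDecoder().wrap(fin) decodes on the fly, the multiple-of-4 concern from the question is handled inside the decoder stream; chunkSize only controls how many decoded bytes are processed per iteration. A hypothetical call (the file names are placeholders):

    readBase64File(new File("encoded.b64"), new File("output.tif"), 8192);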

How to read /write XORed txt file UTF8 in java?

What I did so far:
I read a file1 with text, XORed the bytes with a key and wrote the result back to another file2.
My problem: I read for example 'H' from file1; its byte value is 72.
72 XOR -32 = -88
Now I wrote -88 into file2.
When I read file2 I should get -88 as the first byte, but I get -3.
public byte[] readInput(String file) throws IOException {
    Path path = Paths.get(file);
    byte[] data = Files.readAllBytes(path);
    byte[] x = new byte[data.length];
    FileInputStream fis = new FileInputStream(file);
    InputStreamReader isr = new InputStreamReader(fis); // utf8
    Reader in = new BufferedReader(isr);
    int ch;
    int s = 0;
    while ((ch = in.read()) > -1) { // read till EOF
        x[s++] = (byte) ch;
    }
    in.close();
    return x;
}

public void writeOutput(byte[] encrypted, String file) {
    try {
        FileOutputStream fos = new FileOutputStream(file);
        Writer out = new OutputStreamWriter(fos, "UTF-8"); // utf8
        String s = new String(encrypted, "UTF-8");
        out.write(s);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public byte[] DNcryption(byte[] key, byte[] mssg) {
    if (mssg.length == key.length) {
        byte[] encryptedBytes = new byte[key.length];
        for (int i = 0; i < key.length; i++) {
            encryptedBytes[i] = (byte) (mssg[i] ^ key[i]); // XOR
        }
        return encryptedBytes;
    } else {
        return null;
    }
}
You're not reading the file as bytes - you're reading it as characters. The encrypted data isn't valid UTF-8-encoded text, so you shouldn't try to read it as such.
Likewise, you shouldn't be writing arbitrary byte arrays as if they're UTF-8-encoded text.
Basically, your methods have signatures accepting or returning arbitrary binary data - don't use Writer or Reader classes at all. Just write the data straight to the stream. (And don't swallow the exception, either - do you really want to continue if you've failed to write important data?)
I would actually remove both your readInput and writeOutput methods entirely. Instead, use Files.readAllBytes and Files.write.
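A minimal sketch of that approach, reusing the question's own DNcryption method (the path names are placeholders, and key is assumed to be a byte[] of the same length as the file):

    // Read raw bytes, XOR them, write raw bytes back; no Reader/Writer, no charset involved.
    byte[] input = Files.readAllBytes(Paths.get("file1"));
    byte[] encrypted = DNcryption(key, input);
    Files.write(Paths.get("file2"), encrypted);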
In the writeOutput method you convert the encrypted byte array into a UTF-8 String, which changes the actual bytes you later write to the file. Try this code snippet to see what happens when you convert a byte array with negative values to a UTF-8 String:
final String s = new String(new byte[]{-1}, "UTF-8");
System.out.println(Arrays.toString(s.getBytes("UTF-8")));
It will print [-17, -65, -67]: the UTF-8 encoding of the replacement character U+FFFD, which the decoder substitutes for bytes that are not valid UTF-8. Try using an OutputStream to write the bytes to the file instead:
new FileOutputStream(file).write(encrypted);
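The one-liner above never closes the stream; a safer sketch using try-with-resources:

    try (OutputStream os = new FileOutputStream(file)) {
        os.write(encrypted);
    }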

Reading word by word from Android Internal Storage

I'm working on an application that needs to write some data to a settings file and then read it back from that file. So far I've done this with 4 different file names and 2 different methods for each file. Now I want to use a single file and read those 4 written Strings back from it, but I don't know how to find the spaces or any other specific character when I'm reading the data.
This is how I read data:
public String read(String fileName) throws IOException {
    FileInputStream inStream = openFileInput(fileName);
    String content = null;
    byte[] readByte = new byte[inStream.available()];
    while (inStream.read(readByte) != -1)
        content = new String(readByte);
    inStream.close();
    return content;
}
And that's how I write:
public void write(String fileName, String content) throws IOException {
    FileOutputStream outStream = openFileOutput(fileName, Context.MODE_PRIVATE);
    outStream.write(content.getBytes());
    outStream.close();
}
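A minimal sketch of one way to combine the four values in a single file using these two methods, assuming none of the values contains the separator character (a newline here, as an arbitrary choice):

    // Write all four values into one file, separated by newlines.
    write("settings", value1 + "\n" + value2 + "\n" + value3 + "\n" + value4);

    // Read the file back and split on the separator.
    String[] values = read("settings").split("\n");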

Changing encoding in java

I am writing a function that should detect the charset in use and then convert the text to UTF-8. I am using juniversalchardet, which is a Java port of universalchardet by Mozilla.
This is my code:
private List<List<String>> setProperEncoding(List<List<String>> input) {
    try {
        // Detect used charset
        UniversalDetector detector = new UniversalDetector(null);
        int position = 0;
        while ((position < input.size()) & (!detector.isDone())) {
            String row = null;
            for (String cell : input.get(position)) {
                row += cell;
            }
            byte[] bytes = row.getBytes();
            detector.handleData(bytes, 0, bytes.length);
            position++;
        }
        detector.dataEnd();
        Charset charset = Charset.forName(detector.getDetectedCharset());
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println("Detected charset: " + charset);

        // rewrite input using proper charset
        List<List<String>> newLines = new ArrayList<List<String>>();
        for (List<String> row : input) {
            List<String> newRow = new ArrayList<String>();
            for (String cell : row) {
                //newRow.add(new String(cell.getBytes(charset)));
                ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset));
                CharBuffer cb = charset.decode(bb);
                bb = utf8.encode(cb);
                newRow.add(new String(bb.array()));
            }
            newLines.add(newRow);
        }
        return newLines;
    } catch (Exception e) {
        e.printStackTrace();
        return input;
    }
}
My problem is that when I read a file with characters from, for example, the Polish alphabet, letters like ł, ą, ć and similar are replaced by ? and other strange characters. What am I doing wrong?
EDIT:
For compilation I am using Eclipse.
The method parameter is the result of reading a MultipartFile. I just use a FileInputStream to get every line and then split each line by some separator (it is prepared for xls, xlsx and csv files). Nothing special there.
First of all, you have your data somewhere in a binary format. For the sake of simplicity, I suppose it comes from an InputStream.
You want to write the output as a UTF-8 String; I suppose it goes to an OutputStream.
I would recommend creating an AutoDetectInputStream:
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class AutoDetectInputStream extends InputStream {

    private InputStream is;
    private byte[] sampleData = new byte[4096];
    private int sampleLen;
    private int sampleIndex = 0;

    public AutoDetectInputStream(InputStream is) throws IOException {
        this.is = is;
        // pre-read a sample of the data
        sampleLen = is.read(sampleData);
    }

    public Charset getCharset() {
        // detect the charset from the pre-read sample
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(sampleData, 0, sampleLen);
        detector.dataEnd();
        return Charset.forName(detector.getDetectedCharset());
    }

    @Override
    public int read() throws IOException {
        // replay the pre-read sample first, then delegate to the wrapped stream
        if (sampleIndex < sampleLen) {
            return sampleData[sampleIndex++] & 0xFF; // mask to avoid sign extension
        }
        return is.read();
    }
}
The second task is quite simple: Java stores strings as UTF-16 characters internally, so to produce UTF-8 output just use a simple OutputStreamWriter with the UTF-8 charset. So, here's your code:
// open input with the detector stream
// we use a BufferedReader so we can read lines
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();

// here we can use the charset to decode the bytes into characters
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));

// open output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, Charset.forName("UTF-8"));

// copy the whole file line by line
// (readLine() strips line terminators, so write one back after each line)
String line;
while ((line = rdr.readLine()) != null) {
    utf8Writer.append(line).append('\n');
}

// close streams
rdr.close();
utf8Writer.flush();
utf8Writer.close();
So, finally, you have your whole txt file transcoded to UTF-8.
Note that the sample buffer must be big enough to give the UniversalDetector sufficient data for a reliable detection.

How to save string with code page 1250 into RandomAccessFile in java

I have a text file containing strings in code page 1250. I want to save the text into a RandomAccessFile. When I read the bytes back from the RandomAccessFile I get a string with different characters. Some solution...
If you're using writeUTF() then you should read its JavaDoc to learn that it always writes modified UTF-8.
If you want to use another encoding, then you'll have to "manually" do the encoding and somehow store the length of the byte[] as well.
For example:
RandomAccessFile raf = ...;
String writeThis = ...;
byte[] cp1250Data = writeThis.getBytes("cp1250");
raf.writeInt(cp1250Data.length);
raf.write(cp1250Data);
Reading would work similarly:
RandomAccessFile raf = ...;
int length = raf.readInt();
byte[] cp1250Data = new byte[length];
raf.readFully(cp1250Data);
String string = new String(cp1250Data, "cp1250");
This code will write and read a string using the 1250 code page. Of course, you will need to clean it up, check exceptions and close streams properly before putting it in production :)
public static void main(String[] args) throws Exception {
    File file = new File("/toto.txt");
    String myString = "This is a test";

    OutputStreamWriter w = new OutputStreamWriter(new FileOutputStream(file), Charset.forName("windows-1250"));
    w.write(myString);
    w.flush();

    CharBuffer b = CharBuffer.allocate((int) file.length());
    new InputStreamReader(new FileInputStream(file), Charset.forName("windows-1250")).read(b);
    b.flip(); // reset the buffer position before reading its contents
    System.out.println(b.toString());
}
