Extracting UTF-16 encoded file from ZIP archive in Java

In the last section of the code I print what the Reader gives me, but it's just garbage. Where did I go wrong?
public static void read_impl(File file, String targetFile) throws IOException {
    // Create the ZIP input stream
    FileInputStream stream = new FileInputStream(file);
    ZipInputStream zipFile = new ZipInputStream(new BufferedInputStream(stream));
    // I'm looking for a specific file/entry
    while (!zipFile.getNextEntry().getName().equals(targetFile)) {
        zipFile.getNextEntry();
    }
    // The next step in the API requires a reader
    // The target file is a UTF-16 encoded text file
    InputStreamReader reader = new InputStreamReader(zipFile, Charset.forName("UTF-16"));
    // I can't make sense of what this prints
    char buf[] = new char[1];
    while (reader.read(buf, 0, 1) != -1) {
        System.out.print(buf);
    }
}

I'd guess that where you went wrong was believing that the file was UTF-16 encoded.
Can you show a few initial byte values if you don't decode them?

Your use of a char array is a bit pointless, though at first glance it should work. Try this instead:
int c;
while ((c = reader.read()) != -1) {
System.out.print((char)c);
}
If that does not work either, then perhaps you got the wrong file, or the file does not contain what you think it does, or the console can't display the characters it contains.
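One way to answer the request for the raw byte values is to dump the first few bytes of the entry as hex before any decoding. The following is only a minimal sketch, not the asker's code: the class name ZipEntryPeek and the method firstBytesHex are made up for illustration. It calls getNextEntry() once per iteration and stops when it returns null; calling it in both the loop condition and the loop body would skip every other entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ZipEntryPeek {
    /** Returns up to maxBytes of the named entry as hex, or null if the entry is absent. */
    static String firstBytesHex(InputStream zipData, String targetName, int maxBytes)
            throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            // Advance exactly once per iteration; null means no more entries
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals(targetName)) {
                    continue;
                }
                StringBuilder hex = new StringBuilder();
                int b;
                // Each byte contributes three characters ("XX ") to the builder
                while (hex.length() < maxBytes * 3 && (b = zip.read()) != -1) {
                    hex.append(String.format("%02X ", b));
                }
                return hex.toString().trim();
            }
        }
        return null;
    }
}
```

If the output starts with FE FF or FF FE, the file carries a UTF-16 byte order mark; EF BB BF would point to UTF-8 with a BOM instead.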

Related

Why is my DataInputStream only reading 114 bytes?

I'm trying to extract a file from my jar and copy it into the temp directory.
To read the file within the jar, I am using a DataInputStream; to write the file in the temp directory, I am using a DataOutputStream.
The file I am trying to extract has a file size of 310 kilobytes, my copied file only contains 114 bytes after I've called my method (this is also the number of bytes my method prints to the console).
Here is my method:
private static void extractFile(String pathInJar, String fileToCopy) {
File outputFile = new File(System.getProperty("java.io.tmpdir") + "/LDEngine/"+fileToCopy);
boolean couldDirsBeCreated = outputFile.getParentFile().mkdirs();
if(couldDirsBeCreated && !outputFile.exists()) {
int x;
int actualBytesRead = 0;
byte[] tmpByteArray = new byte[4096];
try(
DataOutputStream output = new DataOutputStream(new FileOutputStream(outputFile));
DataInputStream in = new DataInputStream(LibLoader.class.getResourceAsStream("/libs/natives/"+pathInJar))
){
while((x=in.read(tmpByteArray)) != -1) {
output.write(tmpByteArray);
actualBytesRead += x;
}
} catch(Exception e) {
System.err.println("Fatal error: Could not write file!");
System.exit(1);
}
System.out.println(actualBytesRead);
}
}
The file I am trying to copy is a .dll, so it's binary data I'm dealing with.
The question is why is this happening and what am I doing wrong?

This does not explain why your method stops so soon, but you need to take care of it or you will have an even stranger problem with the result data being completely garbled.
From the API doc of DataInputStream.read(byte[]):
Reads some number of bytes from the contained input stream and stores them into the buffer array b. The number of bytes actually read is returned as an integer.
You need to use that return value and call the write() method that takes an offset and a length; otherwise every write emits the whole buffer, including stale bytes left over from earlier reads.
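The advice above can be sketched as a small copy helper. This is an illustration under assumptions, not the asker's method: the class name StreamCopy and the method copy are hypothetical, and plain InputStream/OutputStream are used because the Data* wrappers add nothing for a raw byte copy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StreamCopy {
    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            // Write only the n bytes that this read actually delivered
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```

On Java 7 and later, Files.copy(InputStream, Path) does the same job for a file target, and Java 9 adds InputStream.transferTo(OutputStream).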

Java copy entire file without the double quotes

I have a method that copies an entire file from one location to another using a buffer:
InputStream in = new FileInputStream(src);
OutputStream out = new FileOutputStream(dest);
byte[] buf = new byte[1024];
int len;
while ((len = in.read(buf)) > 0) {
out.write(buf, 0, len);
}
in.close();
out.close();
The file is in csv format:
"2280B_TJ1400_001","TJ1400_Type-7SR","192.168.50.76","Aries SDH","6.0","192.168.0.254",24,"2280B Cyberjaya","Mahadzir Ibrahim"
But as you can see, it has quotes inside it. Is it possible to remove them based on my existing code?
Output should be like this:
2280B_TJ1400_001,TJ1400_Type-7SR,192.168.50.76,Aries SDH,6.0,192.168.0.254,24,2280B Cyberjaya,Mahadzir Ibrahim

If you use a BufferedReader you can use the readLine() function to read the contents of the file as a String. Then you can use the normal functions on String to manipulate it before writing it to the output. By using an OutputStreamWriter you can write the Strings directly.
An advantage of the above is that you never have to bother with the raw bytes, this makes your code easier to read and less prone to mistakes in special cases.
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(src)));
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest)));
String line;
while ((line = in.readLine()) != null) {
    // replace() treats its argument literally, so no regex is involved
    String stringOut = line.replace("\"", "");
    out.write(stringOut);
    // readLine() strips the line terminator, so write one back
    out.newLine();
}
in.close();
out.close();
Note that this removes all " characters, not just the ones at the start and end of each String. To do that, you can use a StringTokenizer, or a more complex replace.
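As one possible reading of "a more complex replace", the sketch below strips only quotes that enclose a whole field. The class name QuoteStripper and the regular expression are my own assumptions, not part of the answer above. A field that itself contains a quote character is left untouched, and a quoted field containing a comma would lose its protection, so this is only suitable for simple data like the sample line in the question.

```java
import java.util.regex.Pattern;

final class QuoteStripper {
    // A quoted field: preceded by start-of-line or a comma, followed by a comma or end-of-line
    private static final Pattern ENCLOSED = Pattern.compile("(^|,)\"([^\"]*)\"(?=,|$)");

    /** Removes the quotes around each fully quoted field of a single CSV line. */
    static String stripEnclosingQuotes(String line) {
        // $1 keeps the leading comma (if any), $2 keeps the field content
        return ENCLOSED.matcher(line).replaceAll("$1$2");
    }
}
```

It would be called once per line inside the readLine() loop, in place of the plain quote removal.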

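Not sure whether it's a good idea, but you can do something like:
while ((len = in.read(buf)) > 0) {
    // Decode only the bytes actually read, not the whole buffer
    String temp = new String(buf, 0, len);
    temp = temp.replace("\"", "");
    // The byte count can differ from temp.length(), so use the array length
    byte[] cleaned = temp.getBytes();
    out.write(cleaned, 0, cleaned.length);
    // Caveat: a multi-byte character split across two reads is decoded incorrectly
}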

Personally, I would first read the whole file into a String, strip the " characters from that string, and then write it to the destination file.
Read the file into a String
I found this simple solution. It may not be the best, depending on the level of error handling you need, but it works well enough ;)
String content = new Scanner(new File("filename")).useDelimiter("\\Z").next();
Strip out the " characters
content = content.replace("\"", "");
Write it to the destination file
Files.write(Paths.get("./duke.txt"), content.getBytes());
This is for Java 7+.
I did not test it, but it should work!

Not necessarily good style, filtering quotes in binary data, but very solid.
Wrap the original InputStream with your own InputStream, filtering out the double quote.
I have added a quirk: in MS Excel a quoted field may contain a quote, which then is self-escaped, represented as two double quotes.
InputStream in = new UnquotingInputStream(new FileInputStream(src));
/**
 * Removes ASCII double quotes from an InputStream.
 * Two consecutive quotes stand for one quote: self-escaping as used
 * by MS Excel.
 */
public class UnquotingInputStream extends InputStream {
    private final InputStream in;
    private boolean justHadAQuote;
    public UnquotingInputStream(InputStream in) {
        this.in = in;
    }
    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c == '\"') {
            if (!justHadAQuote) {
                justHadAQuote = true;
                return read(); // Skip quote
            }
        }
        justHadAQuote = false;
        return c;
    }
}
Works for all encodings that use ASCII as subset. So not: UTF-16 or EBCDIC.

Method read from file using java language

I am enhancing an old algorithm written in Java, and I want it to read text from a file in order to encrypt it. So I wrote a method that reads the text line by line from the file and stores it in an array. The method runs, but the variable "line2" holds the first line only until the next line is read; the second line then overwrites the first. What can I do?
// The code
private byte[] line2;
public byte[] readFromFile (){
try (BufferedReader br = new BufferedReader(new FileReader ("C:\\Users\\just\\Desktop\\message.txt")))
{
String sCurrentLine;
while ((sCurrentLine = br.readLine()) != null) {
line2 = sCurrentLine.getBytes();
}
} catch (IOException e) {
e.printStackTrace();
}
return line2;
}

I do it like this:
public byte[] readFromFile() throws IOException
{
    BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\just\\Desktop\\message.txt"));
    int k = 0;
    int f = 0;
    byte[] line2;
    // -1 means END; I made this loop to count the length of the file
    // (note: it also consumes the reader, so the readLine() loop below starts at end of input)
    while (br.read() != -1)
    {
        f++;
    }
    byte[] array2 = new byte[f];
    String sCurrentLine;
    while ((sCurrentLine = br.readLine()) != null) {
        line2 = sCurrentLine.getBytes();
        for (byte s : line2) {
            array2[k] = s;
            k++;
        }
    }
    return array2;
}
This works for me, but can anyone tell me which is better: this one, the earlier one, or the one provided by "fge"? I would like to use the best one. Thank you all.

You said you will read the file line by line and store them in an array, but you haven't stored them in an array. Below the line line2 = sCurrentLine.getBytes();, store line2 in an array, and then read the next line.
sCurrentLine.getBytes(); returns the content of sCurrentLine as a byte array, so every time this statement is executed it returns the bytes of the current line, and the previous line's contents are lost. You therefore have to store the contents of line2 in another array before reading the next line's bytes.
You could use System.arraycopy() to copy the contents of line2 and append it to the contents of the previous line using this method. You can look at System class docs to find out how to use the System.arraycopy() method. Also have a look at Appending a byte[] to the end of another byte[] to append the contents of array to another array.
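A minimal sketch of the System.arraycopy() approach follows. The class name ByteAppend and the method append are hypothetical helpers introduced only for illustration.

```java
final class ByteAppend {
    /** Returns a new array holding the bytes of first followed by the bytes of second. */
    static byte[] append(byte[] first, byte[] second) {
        byte[] joined = new byte[first.length + second.length];
        // Copy first to the start of the new array
        System.arraycopy(first, 0, joined, 0, first.length);
        // Copy second directly behind it
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }
}
```

Inside the read loop this would be used as line2 = ByteAppend.append(line2, sCurrentLine.getBytes()), with line2 starting out as an empty array. Since every call allocates a new array, a java.io.ByteArrayOutputStream is probably the better fit when there are many lines.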

i want it to read the text from file to encrypt it
Reading as text is a surefire way of getting corrupted data. Read as bytes. More on this below.
With Java 7, it is as simple as:
final Path file = Paths.get("C:\\Users\\just\\Desktop\\message.txt");
final byte[] content = Files.readAllBytes(file);
Why corruption?
first of all, a BufferedReader's .readLine() strips newlines; the content you will encrypt will therefore not be the same;
second, you don't specify an encoding with which to read the file, and you don't specify an encoding to encode to bytes; and the JVM's default encoding can differ from the file's encoding. Imagine what would happen if you read the file as windows-1252 and then encoded the resulting string using UTF-8.
More generally:
when you "read a string" from a file, what is read is not characters; those are bytes. And a CharsetDecoder will then decode this sequence of bytes into a sequence of chars (possibly with information loss);
when you "write a string" to a file, what is written is not characters; again, those are bytes. And a CharsetEncoder will encode this sequence of chars into a sequence of bytes.
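To make the corruption concrete, here is a small sketch of a decode followed by an encode with a different charset. The class name CharsetMismatch and the method recode are made up for this example, and ISO-8859-1 stands in for windows-1252 because every JVM is guaranteed to provide it.

```java
import java.nio.charset.Charset;

final class CharsetMismatch {
    /** Decodes original with readAs, then encodes the resulting string with writeAs. */
    static byte[] recode(byte[] original, Charset readAs, Charset writeAs) {
        String decoded = new String(original, readAs); // bytes -> chars
        return decoded.getBytes(writeAs);              // chars -> bytes
    }
}
```

With the single byte 0xE9 (é in ISO-8859-1), reading as ISO-8859-1 and writing as UTF-8 yields the two bytes 0xC3 0xA9, so the data that gets encrypted is no longer the data that was in the file.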

Reading from InflaterInputStream and parsing the result

I am quite new to Java; I just started yesterday. Since I am a big fan of learning by doing, I am making a small project with it. But I am stuck on this part. I have written a file using this function:
public static boolean writeZippedFile(File destFile, byte[] input) {
try {
// create file if doesn't exist part was here
try (OutputStream out = new DeflaterOutputStream(new FileOutputStream(destFile))) {
out.write(input);
}
return true;
} catch (IOException e) {
// error handling was here
}
}
Now that I have successfully written a compressed file using the above method, I want to read it back to the console. First, I need to be able to read the decompressed content and write a string representation of that content to the console. However, I have a second problem: I don't want to write the characters up to the first \0 null character. Here is how I attempt to read the compressed file:
try (InputStream is = new InflaterInputStream(new FileInputStream(destFile))) {
}
and I am completely stuck here. The question is how to discard the first few characters up to '\0' and then write the rest of the decompressed file to the console.

I understand that your data contains text, since you want to print a string representation. I further assume that the text contains Unicode characters. If this is true, then your console should also support Unicode for the characters to be displayed correctly.
So you should first read the data byte by byte until you encounter the \0 character, and then you can use a BufferedReader to print the rest of the data as lines of text.
try (InputStream is = new InflaterInputStream(new FileInputStream(destFile))) {
    // read the stream a single byte at a time until we encounter '\0'
    int aByte;
    while ((aByte = is.read()) != -1) {
        if (aByte == '\0') {
            break;
        }
    }
    // from now on we want to print the data
    BufferedReader b = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    String line;
    while ((line = b.readLine()) != null) {
        System.out.println(line);
    }
    b.close();
} catch (IOException e) {
    // handle
}

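Skip everything up to and including the first '\0' using InputStream#read(), stopping at end of stream (-1) so the loop cannot spin forever:
int b;
while ((b = is.read()) != -1 && b != '\0');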

Java: reading strings from a random access file with buffered input

I've never had close experiences with Java IO API before and I'm really frustrated now. I find it hard to believe how strange and complex it is and how hard it could be to do a simple task.
My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read lines between these two bytes (including the starting one, not including the ending one) and use them as UTF8 String objects.
For example, in most script languages it would be a very simple 1-2-3-liner like that (in Ruby, but it will be essentially the same for Python, Perl, etc):
f = File.open("file.txt").seek(pos1)
while f.pos < pos2 {
s = f.readline
# do something with "s" here
}
It quickly becomes hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:
RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() returns neither UTF-8 strings nor byte arrays; it returns very strange strings with broken encoding. It also has no buffering, which probably means that every read*() call is translated into a single underlying OS read(), so it is fairly slow.
BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to tell how many bytes have already been read, let alone the current position in the file.
I've tried to use something like:
FileInputStream fis = new FileInputStream(fileName);
FileChannel fc = fis.getChannel();
BufferedReader br = new BufferedReader(
new InputStreamReader(
fis,
CHARSET_UTF8
)
);
... and then using fc.position() to get the current file reading position and fc.position(newPosition) to set one, but it doesn't seem to work in my case: it looks like it returns the position of the buffer pre-filling done by BufferedReader, or something like that. These counters seem to be rounded up in 16K increments.
Do I really have to implement it all by myself, i.e. a file-reading interface which would:
allow me to get/set the position in a file
buffer file reading operations
allow reading UTF-8 strings (or at least allow operations like "read everything till the next \n")
Is there a quicker way than implementing it all myself? Am I overlooking something?

import org.apache.commons.io.input.BoundedInputStream;
FileInputStream file = new FileInputStream(filename);
file.skip(pos1); // may skip fewer bytes than requested; check the return value
BufferedReader br = new BufferedReader(
    new InputStreamReader(new BoundedInputStream(file, pos2 - pos1), StandardCharsets.UTF_8)
);
If you didn't care about pos2, then you wouldn't need Apache Commons IO.

I wrote this code to read UTF-8 lines using a RandomAccessFile:
//File: CyclicBuffer.java
public class CyclicBuffer {
private static final int size = 3;
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(size);
public CyclicBuffer(FileChannel channel) {
this.channel = channel;
}
private int read() throws IOException {
return channel.read(buffer);
}
/**
* Returns the byte read.
*
* @return the byte read, or -1 once end of file is reached
* @throws IOException
*/
public byte get() throws IOException {
if (buffer.hasRemaining()) {
return buffer.get();
} else {
buffer.clear();
int eof = read();
if (eof == -1) {
return (byte) eof;
}
buffer.flip();
return buffer.get();
}
}
}
//File: UTFRandomFileLineReader.java
public class UTFRandomFileLineReader {
private final Charset charset = Charset.forName("utf-8");
private CyclicBuffer buffer;
private ByteBuffer temp = ByteBuffer.allocate(4096);
private boolean eof = false;
public UTFRandomFileLineReader(FileChannel channel) {
this.buffer = new CyclicBuffer(channel);
}
public String readLine() throws IOException {
if (eof) {
return null;
}
byte x = 0;
temp.clear();
while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
if (temp.position() == temp.capacity()) {
temp = addCapacity(temp);
}
temp.put(x);
}
if (x == -1) {
eof = true;
}
temp.flip();
if (temp.hasRemaining()) {
return charset.decode(temp).toString();
} else {
return null;
}
}
private ByteBuffer addCapacity(ByteBuffer temp) {
ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
temp.flip();
t.put(temp);
return t;
}
public static void main(String[] args) throws IOException {
RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt",
"r");
UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file
.getChannel());
int i = 1;
while (true) {
String s = reader.readLine();
if (s == null)
break;
System.out.println("\n line " + i++);
s = s + "\n";
for (byte b : s.getBytes(Charset.forName("utf-8"))) {
System.out.printf("%x", b);
}
System.out.printf("\n");
}
}
}

For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started publishing a tutorial on it.
Also note that this isn't using Java 7's new ARM syntax (which takes care of closing file-based resources); it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.
/*
 * Paths uses the default file system; no exception is thrown at this stage
 * if the file is missing.
 */
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
FileChannel fc = null;
try
{
    /*
     * Files.newByteChannel returns a SeekableByteChannel, which keeps a position
     * that can be queried and changed. For the default file system the returned
     * channel is a FileChannel, so the cast is safe here.
     */
    fc = (FileChannel) Files.newByteChannel(file, StandardOpenOption.READ);
    fc.position(startPosition);
    while (fc.read(readBuffer) != -1)
    {
        readBuffer.flip(); // switch the buffer from filling to draining
        System.out.println(Charset.forName(encoding).decode(readBuffer));
        readBuffer.clear(); // make the whole buffer available for the next read
    }
}
finally
{
    if (fc != null) fc.close();
}

Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.
Then create your BufferedReader using
new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes)))
Then you can call readLine on the BufferedReader.
Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
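Put together, the approach might look like the sketch below. The class name RangeLines and the method readLines are hypothetical, the byte range is assumed to fit into an int-sized array, and UTF-8 is passed explicitly because the question asks for UTF-8 strings.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class RangeLines {
    /** Reads the bytes in [pos1, pos2) and returns them split into UTF-8 lines. */
    static List<String> readLines(String path, long pos1, long pos2) throws IOException {
        byte[] rawBytes = new byte[(int) (pos2 - pos1)];
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(pos1);
            // readFully throws EOFException if the file ends before pos2
            file.readFully(rawBytes);
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Both positions have to fall on character boundaries; a range that starts or ends in the middle of a multi-byte UTF-8 sequence decodes to replacement characters at that edge.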

I think the confusion is caused by the UTF-8 encoding and the possibility of multi-byte characters.
UTF-8 does not use a fixed number of bytes per character: a character takes between one and four bytes. I'm assuming from your post that you are using single-byte characters. In that case 412 bytes would mean 412 characters, but if every character took two bytes, the same 412 bytes would hold only 206 characters.
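The difference between byte counts and character counts is easy to check. The helper below is a hypothetical illustration (the name Utf8Lengths is mine), comparing the number of UTF-8 bytes with what String.length() reports.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    /** Returns the number of bytes the string occupies when encoded as UTF-8. */
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For "abc" both counts are 3. For "\u00e9" (é) the string length is 1 but the byte count is 2, and for "\u20ac" (the euro sign) the byte count is 3. Byte offsets such as pos1 and pos2 therefore cannot be used as character offsets unless the text is pure ASCII.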
The original java.io package didn't deal well with this multi-byte confusion. So, they added more classes to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide for direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.
Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to pass over X characters and then start reading text. Alternatively, I prefer the overloaded read() method since it allows you to grab all the text at one time.
If you assume your "bytes" are individual characters, try something like this:
FileReader fr = new FileReader( new File("x.txt") );
// skip() and read() count characters, not bytes, and either may process fewer than requested
fr.skip( pos1 );
char[] buffer = new char[ (int) (pos2 - pos1) ];
// the second argument is an offset into buffer, not into the file
fr.read( buffer, 0, buffer.length );
...

I'm late to the party here, but I ran across this problem in my own project.
After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.
After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:
FileDescriptor fd = raFile.getFD();
FileReader fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);
Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().
The one thing I'm not sure about is whether UTF8 strings are handled correctly.

The Java IO API is very flexible. Unfortunately, that flexibility sometimes makes it verbose. The main idea is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream. The same goes for output streams.
The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.
Fortunately, some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file, you just have to say
InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
// skip() may skip fewer bytes than requested; its return value says how many
in.skip(1024);
in.read();
It is not as complicated as you fear.
Channels are something different. They are part of the so-called "new IO", or NIO. Its main advantage is that it supports non-blocking IO. You can search the internet for any "nio java tutorial" and read about it. But it is more complicated than regular IO and is not needed for most applications.
