How to read x number of characters from a text file progressively

How to read x number of characters from a text file progressively - java

I am trying to read x characters from a text file at a time, progressively. So if I had: aaaaabbbbbcccccabckcka and im reading 5 at a time I would get, aaaaa, bbbbb,ccccc, abckc and ka. The code I am using is:
status = is.read(bytes);
text = new String(bytes);
where bytes is: bytes = new byte[5], I am calling these two lines of code till status becomes -1, the problem I am facing is, the output is not what I have mentioned above, but I get this:
aaaaa, bbbbb, ccccc, abckc and kackc, notice the last segment 'kackc' is garbage, why is this happening ?
Note: that bytes is initialized once outside the reading loop.

Your current solution works for ASCII, but many characters in other encodings use more than one byte. You should use a Reader and a char[] instead of an InputStream and a byte[], respectively.

It turns out, I need to clear my byte buffer every time I read new input, I just used a for loop to zero it out and it worked

Related

Reading in Partial InputStream

I'm looking to read an InputStream in sections because I need the first n bytes of the file and last m bytes as well as the contents between.
byte[] beginning = inputStream.readNBytes(16);
This works just fine, but to get the last m bytes, I tried the following:
byte[] middle = inputStream.readNBytes(inputStream.available() - 32);
byte[] end = inputStream.readNBytes(inputStream.available());
The end variable looks how I expect it to but not the middle variable, which ends up cutting out part of the stream.
I'm also a bit confused why the buf parameter size in the input stream doesn't seem to be equal to the byte array size when converting one to the other.
Anyway, I assume this isn't working how I want it to because (inputStream.available() - 32) is not adding up to a value compatible with readNBytes, so part of the stream is lost.
Is there a way to go about doing this?
EDIT: What I ended up doing which seemed to work(mostly) is when creating the file, to prepend both pieces I will later be extracting instead of prepending one and appending the other. That way I can just call inputStream.readAllBytes() on the last piece.
I also had to change where I'm writing to the file. I was writing to a CipherOutputStream when I should've been writing to the FileOutputStream and using that to create the Cipher OS.
Even after doing this I still have an extra 16 bytes at the end of the file, which confuses me, but I can easily ignore that last bit if I can't figure out why it's doing that.

Reading first four bytes from ByteBuffer, then writing them back?

I have a ByteBuffer object called msg with the intended message length in the first four bytes, which I read as follows:
int msgLen = msg.getInt();
LOG.debug("Message size: " + msgLen);
If the msgLen is less than some threshold value, I have a partial message and need to cache. In this case, I'd like to put those first four bytes back into the beginning of the message; that is, put the message back together to be identical to pre-reading. For example:
if (msgLen < threshold) {
msg.rewind();
msg.put(msgLen);
Unfortunately, this does not seem to be the correct way to do this. I've tried many combinations of flip, put, and rewind, but must be misunderstanding.
How would I put the bytes back into the write buffer in their original order?

Answer was posted by Andremoniy in comments section. Read operations do not consume bytes in the buffer, so msg.rewind() was adequate. This didn't work in my case because of some other logic in the program, and I incorrectly associated that with a problem at the buffer level.

How do you write any ASCII character to a file in Java?

Basically I'm trying to use a BufferedWriter to write to a file using Java. The problem is, I'm actually doing some compression so I generate ints between 0 and 255, and I want to write the character who's ASCII value is equal to that int. When I try writing to the file, it writes many ? characters, so when I read the file back in, it reads those as 63, which is clearly not what I want. Any ideas how I can fix this?
Example code:
int a = generateCode(character); //a now has an int between 0 and 255
bw.write((char) a);
a is always between 0 and 255, but it sometimes writes '?'

You are really trying to write / read bytes to / from a file.
When you are processing byte-oriented data (as distinct from character-oriented data), you should be using InputStream and OutputStream classes and not Reader and Writer classes.
In this case, you should use FileInputStream / FileOutputStream, and wrap with a BufferedInputStream / BufferedOutputStream if you are doing byte-at-a-time reads and writes.
Those pesky '?' characters are due to issues the encoding/decoding process that happens when Java converts between characters and the default text encoding for your platform. The conversion from bytes to characters and back is often "lossy" ... depending on the encoding scheme used. You can avoid this by using the byte-oriented stream classes.
(And the answers that point out that ASCII is a 7-bit not 8-bit character set are 100% correct. You are really trying to read / write binary octets, not characters.)

You need to make up your mind what are you really doing. Are you trying to write some bytes to a file, or are you trying to write encoded text? Because these are different concepts in Java; byte I/O is handled by subclasses of InputStream and OutputStream, while character I/O is handled by subclasses of Reader and Writer. If what you really want to write is bytes to a file (which I'm guessing from your mention of compression), use an OutputStream, not a Writer.
Then there's another confusion you have, which is evident from your mention of "ASCII characters from 0-255." There are no ASCII characters above 127. Please take 15 minutes to read this: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (by Joel Spolsky). Pay particular attention to the parts where he explains the difference between a character set and an encoding, because it's critical for understanding Java I/O. (To review whether you understood, here's what you need to learn: Java Writers are classes that translate character output to byte output by applying a client-specified encoding to the text, and sending the bytes to an OutputStream.)

Java strings are based on 16 bit wide characters, it tries to perform conversions around that assumption if there is no clear specifications.
The following sample code, write and reads data directly as bytes, meaning 8-bit numbers which have an ASCII meaning associated with them.
import java.io.*;
public class RWBytes{
public static void main(String[] args)throws IOException{
String filename = "MiTestFile.txt";
byte[] bArray1 =new byte[5];
byte[] bArray2 =new byte[5];
bArray1[0]=65;//A
bArray1[1]=66;//B
bArray1[2]=67;//C
bArray1[3]=68;//D
bArray1[4]=69;//E
FileOutputStream fos = new FileOutputStream(filename);
fos.write(bArray1);
fos.close();
FileInputStream fis = new FileInputStream(filename);
fis.read(bArray2);
ByteArrayInputStream bais = new ByteArrayInputStream(bArray2);
for(int i =0; i< bArray2.length ; i++){
System.out.println("As the bytem value: "+ bArray2[i]);//as the numeric byte value
System.out.println("Converted as char to printiong to the screen: "+ String.valueOf((char)bArray2[i]));
}
}
}
A fixed subset of the 7 bit ASCII code is printable, A=65 for example, the 10 corresponds to the "new line" character which steps down one line on screen when found and "printed". Many other codes exist which manipulate a character oriented screen, these are invisible and manipulated the screen representation like tabs, spaces, etc. There are also other control characters which had the purpose of ringing a bell for example.
The higher 8 bit end above 127 is defined as whatever the implementer wanted, only the lower half have standard meanings associated.
For general binary byte handling there are no such qualm, they are number which represent the data. Only when trying to print to the screen the become meaningful in all kind of ways.

How do I convert a file's line number to a byte offset (or get the byte offset of the beginning of each line with a BufferedReader)?

I'm using a FileReader wrapped in a LineNumberReader to index a large text file for speedy access later on. Trouble is I can't seem to find a way to read a specific line number directly. BufferedReader supports the skip() function, but I need to convert the line number to a byte offset (or index the byte offset in the first place).
I took a crack at it using RandomAccessFile, and while it worked, it was horribly slow during the initial indexing. BufferedReader's speed is fantastic, but... well, you see the problem.
Some key info:
The file can be any size (currently 35,000 lines)
It's stored on Android's internal filesystem (via getFilesDir() to be exact)
The formatting is not fixed width, unfortunately (hence the need to read by line)
Any ideas?

Describes an extended RandomAccessFile with buffering semantics

Trouble is I can't seem to find a way to read a specific line number directly
Unless you know the length of each line you can't read it directly
There is no shortcut, you will need to read then entire file up front and calculate the offsets manualy.
I would just use a BufferedReader and then get the length of each string and add 1 (or 2?) for the EOL string.

Consider saving an file index along with the large text file. If this file is something you are generating, either on your server, or on the device, it should be trivial to generate an index once and distribute and/or save it along with the file.
I'd recommend an int[] where each value is the absolute offset in bytes for the n*(index+1) th line. So you could have an array of size 35,000 with the start of each line, or an array of size 350, with the start of every 100th line.
Here's an example assuming you have an index file containing an raw sequence of int values:
public String getLineByNumber(RandomAccessFile index,
RandomAccessFile data,
int lineNum) {
index.seek(lineNum*4);
data.seek(index.readInt());
return data.readLine();
}

I took a crack at it using
RandomAccessFile, and while it worked,
it was horribly slow during the
initial indexing
You've started the hard part already. Now for the harder part.
BufferedReader's speed is fantastic,
but...
Is there something in your use of RandomAccessFile that made it slower than it has to be? How many bytes did you read at a time? If you read one byte at a time it will be sloooooow. IF you read in an array of bytes at a time, you can speed things up and use the byte array as a buffer.

Just wrapping up the previous comments :
Either you use RandomAccessFile to first count byte and second parse what you read to find lines by hand OR you use a LineNumberReader to first read lines by lines and count the bytes of each line of char (2 bytes in utf 16 ?) by hand.

Same loop giving different output. Java IO

I am facing a very strange problem where the same loop keeps giving me different different output on change of value of BUFFER
final int BUFFER = 100;
char[] charArr = new char[BUFFER];
StringBuffer objStringBuffer = new StringBuffer();
while (objBufferedReader.read(charArr, 0,BUFFER) != -1) {
objStringBuffer.append(charArr);
}
objFileWriter.write(objStringBuffer.toString());
When i change BUFFER size to 500 it gives me a file of 7 kb when i change BUFFER size to 100000 it gives a file of 400 kb where the contents are repeated again and again. Please help. What should i do to prevent this?

You must remember the return value of the read() call, because read does not guarantee that the entire buffer has been filled.
You will need to remember that value and use it in the append call to only append that many characters.
Otherwise you'll append un-initialized characters to the StringBuffer that didn't actually come from the Reader (probably either 0 or values written by previous read() calls).

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to read x number of characters from a text file progressively - java

Your current solution works for ASCII, but many characters in other encodings use more than one byte. You should use a Reader and a char[] instead of an InputStream and a byte[], respectively.

It turns out, I need to clear my byte buffer every time I read new input, I just used a for loop to zero it out and it worked

Related

Reading in Partial InputStream

Reading first four bytes from ByteBuffer, then writing them back?

How do you write any ASCII character to a file in Java?

How do I convert a file's line number to a byte offset (or get the byte offset of the beginning of each line with a BufferedReader)?

Same loop giving different output. Java IO

Categories

Resources