Strange FileInputStream/DataInputStream behaviour: seek()ing to odd positions - java

The good:
So, I have this binary data file (exactly 640631 bytes in size), and I'm trying to make Java read it.
I have two interchangeable classes implemented as layers for reading that data. One of them uses RandomAccessFile, which works great and all.
The bad:
Another one (the one this question is mostly about) tries to use FileInputStream and DataInputStream so that the very same data could (at least theoretically) be read on the MIDP 2.0 (CLDC 1.1) Java configuration (which doesn't have RandomAccessFile).
In that class, I open the data file like this:
FileInputStream res = new FileInputStream(new File(filename));
h = new DataInputStream(res);
...and implement seek()/skip() like this (position is a long that keeps track of the current position in the file):
public void seek(long pos) throws java.io.IOException {
    if (! this.isOpen()) {
        throw new java.io.IOException("No file is open");
    }
    if (pos < position) {
        // Seek to the start, then skip some bytes
        this.reset();
        this.skip(pos);
    } else if (pos > position) {
        // skip the remaining bytes until the position
        this.skip(pos - position);
    }
}
and
public void skip(long bytes) throws java.io.IOException {
    if (! this.isOpen()) {
        throw new java.io.IOException("No file is open");
    }
    long skipped = 0, step = 0;
    do {
        step = h.skipBytes((int)(bytes - skipped));
        if (step < 0) {
            throw new java.io.IOException("skip() failed");
        }
        skipped += step;
    } while (skipped < bytes);
    position += bytes;
}
The ugly:
The problem with the second class (the FileInputStream/DataInputStream one) is that sometimes it decides to reset the file position to some strange place in a file :) This happens both when I run this on J2SE (a computer) and J2ME (a mobile phone). Here's an example of the actual usage of that reader class and a bug that occurs:
// Open the data file
Reader r = new Reader(filename);
// r.position = 0, actual position in a file = 0
// Skip to where the data block that is needed starts
// (determined by some other code)
r.seek(189248);
// r.position = 189248, actual position in a file = 189248
// Do some reading...
r.readID(); r.readName(); r.readSurname();
// r.position = 189332, actual position in a file = 189332
// Skip some bytes (an unneeded record)
r.skip(288);
// r.position = 189620, actual position in a file = 189620
// Do some more reading...
r.readID(); r.readName(); r.readSurname();
// r.position = 189673, actual position in a file = 189673
// Skip some bytes (an unneeded record)
r.skip(37);
// AAAAND HERE WE GO:
// r.position = 189710, actual position in a file = 477
I was able to determine that when asked to skip another 37 bytes, Java positioned the file pointer at byte 477 from the very start of the file instead.
"Freshly" (just after opening the file) seeking to position 189710 (and beyond) works OK. However, reopening the file every time I need a seek() is just painfully slow, especially on a mobile phone.
What has happened?

I can see nothing wrong with this. Are you positive of the r.position value before the last skip? Unless there's an underlying bug in the JDK streams, or you have multiple threads using the Reader, the only possibility I can guess at is that something is modifying the position value incorrectly when you read your fields.
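If it helps while debugging, here is a minimal sketch of a seek() that avoids reset() entirely: it reopens the stream only for backward seeks and derives the new position from the seek target instead of accumulating skips. The field names (filename, h, position) mirror the question's Reader and are assumptions about its internals:
public void seek(long pos) throws java.io.IOException {
    if (pos < position) {
        // Backward seek: reopen the stream instead of relying on reset()
        h.close();
        h = new java.io.DataInputStream(new java.io.FileInputStream(filename));
        position = 0;
    }
    long remaining = pos - position;
    while (remaining > 0) {
        long stepped = h.skip(remaining);
        if (stepped <= 0) {
            // treat a non-positive return as failure (e.g. seeking past EOF)
            throw new java.io.IOException("seek() failed at offset " + (pos - remaining));
        }
        remaining -= stepped;
    }
    // Derive the position from the target instead of accumulating it
    position = pos;
}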

Related

How Buffer Streams work internally in Java

I'm reading about buffered streams. I searched and found many answers that clarified my understanding, but I still have a few more questions.
After searching, I have come to understand that a buffer is temporary memory (RAM) which helps a program read data quickly instead of going to the hard disk every time, and that when the buffer is empty the native input API is called.
After reading a little more, I found this answer here:
Reading data from disk byte-by-byte is very inefficient. One way to
speed it up is to use a buffer: instead of reading one byte at a time,
you read a few thousand bytes at once, and put them in a buffer, in
memory. Then you can look at the bytes in the buffer one by one.
I have two points of confusion:
1: How, and by what, is the data filled into the buffer? (How does the native API do it?) As the quote above says, something reads a few thousand bytes at once, but won't that take the same total time? Suppose I have 5 MB of data, and the 5 MB is loaded into the buffer in 5 seconds, and then the program uses this data from the buffer in another 5 seconds: 10 seconds total. But if I skip buffering, the program gets the data directly from the hard disk at 1 MB per 2 seconds, which is the same 10 seconds in total. Please clear up this confusion.
2: The second one: how does this line work?
BufferedReader inputStream = new BufferedReader(new FileReader("xanadu.txt"));
The way I'm thinking about it, FileReader writes data to a buffer, and then BufferedReader reads data from buffer memory? Please explain this as well.
Thanks.
As for the performance of using buffering during read/write, it's probably minimal in impact since the OS will cache too; however, buffering will reduce the number of calls to the OS, which does have an impact.
When you add other operations on top, such as character encoding/decoding or compression/decompression, the impact is greater as those operations are more efficient when done in blocks.
Your second question said:
As I'm thinking FileReader write data to buffer, then BufferedReader read data from buffer memory? Also explain this.
I believe your thinking is wrong. Yes, technically the FileReader will write data to a buffer, but the buffer is not defined by the FileReader, it's defined by the caller of the FileReader.read(buffer) method.
The operation is initiated from outside, when some code calls BufferedReader.read() (any of the overloads). BufferedReader will then check its buffer, and if enough data is available in the buffer, it will return the data without involving the FileReader. If more data is needed, the BufferedReader will call the FileReader.read(buffer) method to get the next chunk of data.
It's a pull operation, not a push, meaning the data is pulled out of the readers by the caller.
All of this is done by a private method named fill(), reproduced here for educational purposes; any Java IDE will also let you look at the source code yourself:
private void fill() throws IOException {
    int dst;
    if (markedChar <= UNMARKED) {
        /* No mark */
        dst = 0;
    } else {
        /* Marked */
        int delta = nextChar - markedChar;
        if (delta >= readAheadLimit) {
            /* Gone past read-ahead limit: Invalidate mark */
            markedChar = INVALIDATED;
            readAheadLimit = 0;
            dst = 0;
        } else {
            if (readAheadLimit <= cb.length) {
                /* Shuffle in the current buffer */
                // here the already-read chars are copied within the in-memory buffer named cb
                System.arraycopy(cb, markedChar, cb, 0, delta);
                markedChar = 0;
                dst = delta;
            } else {
                /* Reallocate buffer to accommodate read-ahead limit */
                char ncb[] = new char[readAheadLimit];
                System.arraycopy(cb, markedChar, ncb, 0, delta);
                cb = ncb;
                markedChar = 0;
                dst = delta;
            }
            nextChar = nChars = delta;
        }
    }
    int n;
    do {
        n = in.read(cb, dst, cb.length - dst);
    } while (n == 0);
    if (n > 0) {
        nChars = dst + n;
        nextChar = dst;
    }
}
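To make the pull model concrete, here is a small sketch (not from the JDK) that counts how many times BufferedReader actually calls down into the underlying Reader; the CountingReader helper and the file name xanadu.txt are only illustrative:
import java.io.*;

public class BufferDemo {
    // Hypothetical helper that counts how often its read(char[], int, int) is called
    static class CountingReader extends FilterReader {
        int calls = 0;
        CountingReader(Reader in) { super(in); }
        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            calls++;
            return super.read(cbuf, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        CountingReader counting = new CountingReader(new FileReader("xanadu.txt"));
        BufferedReader inputStream = new BufferedReader(counting); // default 8192-char buffer
        int chars = 0;
        while (inputStream.read() != -1) {
            chars++; // thousands of single-character reads from the caller's point of view...
        }
        inputStream.close();
        // ...but only a handful of calls actually reach the FileReader underneath.
        System.out.println(chars + " chars read, " + counting.calls + " underlying read() calls");
    }
}
With the default 8192-character buffer, a file of a few hundred kilobytes results in only a few dozen underlying calls, which is the whole point of the buffering layer.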

read(byte[] b, int off, int len) : Reading A File To Its End

PREMISE
I am new to working with streams in Java and am finding that my question at least appears different from those asked previously. Here is a fragment of my code at this juncture (the code is more of a proof of concept):
try {
    // file is initialized to the path contained in the command-line args.
    File file = new File(args[0]);
    inputStream = new FileInputStream(file);
    byte[] byteArray = new byte[(int) file.length()];
    int offset = 0;
    while (inputStream.read(byteArray, offset, byteArray.length - offset) != -1) {
        for (int i = 0; i < byteArray.length; i++) {
            if (byteArray[offset + i] >= 32 && byteArray[offset + i] <= 126) {
                System.out.print((char) byteArray[offset + i]);
            } else {
                System.out.print("#");
            }
            //offset = byteArray.length - offset;
        }
    }
GOAL
Here is my goal: to create a program that reads in only 80 bytes of input (the number is arbitrary - let it be x), decides whether each byte within that segment represents a printable ASCII character, and prints accordingly. The last two portions of the code are "correct" for all intents and purposes: the code already makes that determination and prints accordingly, so this is not the focus of my question.
Let's say the length() of the file is greater than 80 bytes and I want - while only reading in 80 bytes of input at a time - to reach the EOF, i.e. read the file's entire contents. Each line printed to the console can only contain 80 - or x - bytes' worth of content. I know to adjust the offset and have been tinkering with that; however, when I hit the EOF, I don't want the program to crash and burn - to "explode", so to speak.
QUESTION
When encountering EOF, how do I ensure the captured bytes are still read and that the code in the for loop is still executed?
For instance, changing the above inputStream.read() to:
inputStream.read(byteArray, offset, 80)
This would "bomb" were the end of file (EOF) encountered in reading the last bytes within the file. For instance, if I am trying to read 80 bytes and only 10 remain.
The return value from read tells you the number of bytes which were read. This will be <= the value of length. Just because the file is larger than length does not mean that a request for length number of bytes will actually result in that many bytes being read into your byte[].
When -1 is returned, that does indicate EOF. It also indicates that no data was read into your byte[].
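To illustrate, here is a minimal sketch of such a loop, assuming the 80-byte width from the question: the buffer is reused, and only the first read bytes of it are examined on each pass, so a short final read near EOF is handled naturally.
import java.io.*;

public class ChunkedPrint {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            byte[] chunk = new byte[80];
            int read;
            // read() returns how many bytes it actually delivered (1..80), or -1 at EOF
            while ((read = in.read(chunk, 0, chunk.length)) != -1) {
                for (int i = 0; i < read; i++) {
                    // printable ASCII range check, as in the question
                    if (chunk[i] >= 32 && chunk[i] <= 126) {
                        System.out.print((char) chunk[i]);
                    } else {
                        System.out.print("#");
                    }
                }
                System.out.println(); // one console line per chunk of up to 80 bytes
            }
        }
    }
}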

Writing partial region of downloaded file

I am downloading a file from the internet in separate parts - say, 3 regions. Let's say I have to download a file of size 1024 kB and I have set the regions as 0-340 kB, 341-680 kB and 681-1024 kB. I have a separate thread for each section. But the problem I have now is writing the downloaded content into a single file.
Since we have 3 threads, each will download its own section, and the sections need to be written into the file sequentially.
How can I achieve this? I thought of having 3 temporary files and writing into them. Once all the files are written, I would have to read them file by file and write them into a single file. That feels like unnecessary overhead. Is there any other, better way?
Thanks in advance.
To be clear, I am not convinced that this approach will actually improve the download speed. It may give more consistent download speeds if you are downloading the same file from multiple mirrors, though.
First off, if your file isn't too large, you can buffer all of it before you write it out. So allocate a buffer that all your threads can access:
byte[] buf = new byte[fileSize];
Now you create a suitable Thread type:
public class WriterThread extends Thread
{
    byte[] buf;
    int write_pos, write_remaining;

    public WriterThread(byte[] buf, int start, int len)
    {
        this.buf = buf;
        this.write_pos = start;
        this.write_remaining = len;
    }

    @Override
    public void run()
    {
        try (Socket s = yourMethodForSettingUpTheSocketConnection();
             InputStream istream = s.getInputStream()) {
            while (this.write_remaining > 0) {
                int read = istream.read(this.buf, this.write_pos, this.write_remaining);
                if (read == -1) error("Not enough data received");
                this.write_remaining -= read;
                this.write_pos += read;
            }
            // otherwise you are done reading your chunk!
        } catch (IOException e) {
            // run() cannot throw checked exceptions, so report the failure here
            error(e.getMessage());
        }
    }
}
Now you can start as many of these WriterThread objects with suitable starts and lengths. For example, for a file that is 6000 bytes in size:
byte[] buf = new byte[6000];
WriterThread t0 = new WriterThread(buf, 0, 3000);
WriterThread t1 = new WriterThread(buf, 3000, 3000);
t0.start();
t1.start();
t0.join();
t1.join();
// check for errors
Note the important bit here: each of the WriterThreads has a reference to exactly the same buffer, just a different offset that it starts writing at. Of course you have to make sure that yourMethodForSettingUpTheSocketConnection requests data starting at offset this.write_pos; how you do that depends on the networking protocol that you use and is beyond what you asked about.
If your file is too big to fit into memory, this approach won't work. Instead, you'll have to use the (slower) method of first creating a large file and then writing to that. While I haven't tried that, you should be able to use java.nio.file.Files.newByteChannel() to set up a suitable SeekableByteChannel as your output file. If you create such a SeekableByteChannel sbc, you should then be able to do
sbc.position(fileSize - 1); // move to the specified position in the file
sbc.write(java.nio.ByteBuffer.allocate(1)); // grow the file to the expected final size
and then use one distinct SeekableByteChannel object per thread, pointing to the same file on disk, and setting the write start location using the SeekableByteChannel.position(long) method. You'll need a temporary byte[] around which you can wrap a ByteBuffer (via ByteBuffer.wrap()), but otherwise the strategy is analogous to the above:
thread_local_sbc.position(this.write_pos);
and then every thread_local_sbc.write() will write to the file starting at this.write_pos.
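For illustration, a rough sketch of that channel-per-thread idea, assuming the file has already been grown to its final size as described; the class name, the startOffset field and the byte[] standing in for already-downloaded data are all placeholders:
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.*;

class ChunkWriter implements Runnable {
    private final Path file;          // the shared output file
    private final long startOffset;   // where this chunk begins in the file
    private final byte[] data;        // stands in for bytes this thread downloaded

    ChunkWriter(Path file, long startOffset, byte[] data) {
        this.file = file;
        this.startOffset = startOffset;
        this.data = data;
    }

    @Override
    public void run() {
        try (SeekableByteChannel sbc = Files.newByteChannel(file, StandardOpenOption.WRITE)) {
            sbc.position(startOffset);              // jump to this chunk's region
            ByteBuffer buf = ByteBuffer.wrap(data); // wrap the downloaded bytes
            while (buf.hasRemaining()) {
                sbc.write(buf);                     // write() may not consume everything in one call
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}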

Java: ASCII random line file access with state

Is there a better [pre-existing optional Java 1.6] solution than creating a streaming file reader class that will meet the following criteria?
Given an ASCII file of arbitrary large size where each line is terminated by a \n
For each invocation of some method readLine() read a random line from the file
And for the life of the file handle no call to readLine() should return the same line twice
Update:
All lines must eventually be read
Context: the file's contents are created from Unix shell commands that produce a directory listing of all paths contained within a given directory; there are between millions and a billion files (which yields millions to a billion lines in the target file). If there is some way to randomly distribute the paths into the file during creation time, that is an acceptable solution as well.
In order to avoid reading in the whole file, which may not be possible in your case, you may want to use a RandomAccessFile instead of a standard java FileInputStream. With RandomAccessFile, you can use the seek(long position) method to skip to an arbitrary place in the file and start reading there. The code would look something like this.
RandomAccessFile raf = new RandomAccessFile("path-to-file", "rw");
HashMap<Integer, String> sampledLines = new HashMap<Integer, String>();
for (int i = 0; i < numberOfRandomSamples; i++)
{
    // seek to a random point in the file
    raf.seek((long) (Math.random() * raf.length()));

    // skip from the random location to the beginning of the next line
    int nextByte = raf.read();
    while (((char) nextByte) != '\n')
    {
        if (nextByte == -1) raf.seek(0); // wrap around to the beginning of the file if you reach the end
        nextByte = raf.read();
    }

    // read the line into a buffer
    StringBuffer lineBuffer = new StringBuffer();
    nextByte = raf.read();
    while (nextByte != -1 && (((char) nextByte) != '\n'))
    {
        lineBuffer.append((char) nextByte);
        nextByte = raf.read(); // advance to the next byte, otherwise this loop never terminates
    }

    // ensure uniqueness
    String line = lineBuffer.toString();
    if (sampledLines.get(line.hashCode()) != null)
        i--;
    else
        sampledLines.put(line.hashCode(), line);
}
Here, sampledLines should hold your randomly selected lines at the end. You may need to check that you haven't randomly skipped to the end of the file as well to avoid an error in that case.
EDIT: I made it wrap to the beginning of the file in case you reach the end. It was a pretty simple check.
EDIT 2: I made it verify uniqueness of lines by using a HashMap.
Pre-process the input file and remember the offset of each new line. Use a BitSet to keep track of used lines. If you want to save some memory, then remember the offset of every 16th line; it is still easy to jump into the file and do a sequential lookup within a block of 16 lines.
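A rough sketch of that idea (class and method names are illustrative; it indexes every line rather than every 16th, and picks unused lines by rejection sampling against the BitSet; for files with a billion lines you would build the index with a buffered sequential scan rather than the byte-at-a-time loop shown here):
import java.io.*;
import java.util.*;

public class RandomLineIndex implements Closeable {
    private final RandomAccessFile raf;
    private final List<Long> offsets = new ArrayList<Long>(); // offset of each line start
    private final BitSet used = new BitSet();                 // which lines were already returned
    private final Random rnd = new Random();
    private int remaining;

    public RandomLineIndex(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
        long length = raf.length();
        long pos = 0;
        offsets.add(0L);
        int b;
        while ((b = raf.read()) != -1) {
            pos++;
            if (b == '\n' && pos < length) {
                offsets.add(pos); // the next line starts right after this newline
            }
        }
        remaining = offsets.size();
    }

    // Returns a line that has not been returned before, or null once all lines are used
    public String readRandomLine() throws IOException {
        if (remaining == 0) return null;
        int idx;
        do {
            idx = rnd.nextInt(offsets.size());
        } while (used.get(idx)); // rejection sampling; a shuffled index list avoids this loop
        used.set(idx);
        remaining--;
        raf.seek(offsets.get(idx));
        return raf.readLine();
    }

    @Override
    public void close() throws IOException {
        raf.close();
    }
}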
Since you can pad the lines to a fixed size, I would do something along the lines of the code below; you should also note that even then, there may be a limit to how many elements a List can actually hold.
Using a random number each time you want to read a line and adding it to a Set would also work; however, the approach below ensures that the file is read completely:
public class VeryLargeFileReading
    implements Iterator<String>, Closeable
{
    private static Random RND = new Random();
    // List of all line offsets, shuffled into random order
    final List<Long> indices = new ArrayList<Long>();
    final RandomAccessFile fd;

    public VeryLargeFileReading(String fileName, long lineSize)
        throws IOException
    {
        fd = new RandomAccessFile(fileName, "r");
        long nrLines = fd.length() / lineSize;
        for (long i = 0; i < nrLines; i++)
            indices.add(i * lineSize);
        Collections.shuffle(indices);
    }

    // Iterator methods
    @Override
    public boolean hasNext()
    {
        return !indices.isEmpty();
    }

    @Override
    public void remove()
    {
        // Nope
        throw new IllegalStateException();
    }

    @Override
    public String next()
    {
        final long offset = indices.remove(0);
        try {
            fd.seek(offset);
            return fd.readLine().trim();
        } catch (IOException e) {
            // Iterator.next() cannot throw checked exceptions
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() throws IOException
    {
        fd.close();
    }
}
If the number of files is truly arbitrary, it seems like there could be an associated issue with tracking processed files in terms of memory usage (or I/O time if tracking in files instead of a list or set). Solutions that keep a growing list of selected lines also run into timing-related issues.
I'd consider something along the lines of the following:
Create n "bucket" files. n could be determined based on something that takes in to account the number of files and system memory. (If n is large, you could generate a subset of n to keep open file handles down.)
Each file's name is hashed, and goes into an appropriate bucket file, "sharding" the directory based on arbitrary criteria.
Read in the bucket file contents (just filenames) and process as-is (randomness provided by hashing mechanism), or pick rnd(n) and remove as you go, providing a bit more randomosity.
Alternatively, you could pad and use the random access idea, removing indices/offsets from a list as they're picked.
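A minimal sketch of the bucketing step (the bucket count, file naming and output directory are arbitrary choices here): each incoming path is appended to one of n bucket files chosen by hashing the path.
import java.io.*;

public class PathSharder implements Closeable {
    private final BufferedWriter[] buckets;

    public PathSharder(File dir, int n) throws IOException {
        buckets = new BufferedWriter[n];
        for (int i = 0; i < n; i++) {
            buckets[i] = new BufferedWriter(new FileWriter(new File(dir, "bucket-" + i + ".txt")));
        }
    }

    // Append one path to the bucket file selected by its hash
    public void add(String path) throws IOException {
        int bucket = Math.floorMod(path.hashCode(), buckets.length);
        buckets[bucket].write(path);
        buckets[bucket].newLine();
    }

    @Override
    public void close() throws IOException {
        for (BufferedWriter w : buckets) {
            w.close();
        }
    }
}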

How to speed up/optimize file write in my program

Ok. I am supposed to write a program that takes a 20 GB file with 1,000,000,000 records as input and creates some kind of index for faster access. I have basically decided to split the 1 billion records into 10 buckets and 10 sub-buckets within those. I calculate two hash values for each record to locate its appropriate bucket. Now, I create 10*10 files, one for each sub-bucket. As I hash each record from the input file, I decide which of the 100 files it goes to, then append the record's offset to that particular file.
I have tested this with a sample file of 10,000 records and repeated the process 10 times, effectively emulating a 100,000 record file. That takes me around 18 seconds, which means it's going to take forever to do the same for a 1 billion record file.
Is there any way I can speed up / optimize my writes?
And I am going through all this because I can't store all the records in main memory.
import java.io.*;

// PROGRAM DOES THE FOLLOWING
// 1. READS RECORDS FROM A FILE.
// 2. CALCULATES TWO SETS OF HASH VALUES N, M
// 3. APPENDS THE OFFSET OF THAT RECORD IN THE ORIGINAL FILE TO ANOTHER FILE "NM.TXT", I.E. REPLACE THE VALUES OF N AND M.
// 4.
class storage
{
    public static int siz = 10;
    public static FileWriter[][] f;
}

class proxy
{
    static String[][] virtual_buffer;

    public static void main(String[] args) throws Exception
    {
        virtual_buffer = new String[storage.siz][storage.siz]; // TEMPORARY STRING BUFFER TO REDUCE WRITES
        String s, tes;
        for (int y = 0; y < storage.siz; y++)
        {
            for (int z = 0; z < storage.siz; z++)
            {
                virtual_buffer[y][z] = ""; // INITIALISING ALL ELEMENTS TO THE EMPTY STRING
            }
        }
        int offset_in_file = 0;
        long start = System.currentTimeMillis();

        // READING FROM THE SAME INPUT FILE 20 TIMES TO EMULATE A SINGLE BIGGER FILE OF SIZE 20 * INPUT FILE
        for (int h = 0; h < 20; h++) {
            BufferedReader in = new BufferedReader(new FileReader("outTest.txt"));
            while ((s = in.readLine()) != null)
            {
                tes = (s.split(";"))[0];
                int n = calcHash(tes);          // FINDING FIRST HASH VALUE
                int m = calcHash2(tes);         // SECOND HASH
                index_up(n, m, offset_in_file); // METHOD TO WRITE TO THE APPROPRIATE FILE, I.E. NM.TXT
                offset_in_file++;
            }
            in.close();
        }
        System.out.println(offset_in_file);
        long end = System.currentTimeMillis();
        System.out.println((end - start));
    }

    static int calcHash(String s) throws Exception
    {
        char[] charr = s.toCharArray();
        int i, tot = 0;
        for (i = 0; i < charr.length; i++)
        {
            if (i % 2 == 0) tot += (int) charr[i];
        }
        tot = tot % storage.siz;
        return tot;
    }

    static int calcHash2(String s) throws Exception
    {
        char[] charr = s.toCharArray();
        int i, tot = 1;
        for (i = 0; i < charr.length; i++)
        {
            if (i % 2 == 1) tot += (int) charr[i];
        }
        tot = tot % storage.siz;
        if (tot < 0)
            tot = tot * -1;
        return tot;
    }

    static void index_up(int a, int b, int off) throws Exception
    {
        virtual_buffer[a][b] += Integer.toString(off) + "'"; // THIS BUFFER STORES THE DATA TO BE WRITTEN
        if (virtual_buffer[a][b].length() > 2000)            // TO A FILE BEFORE WRITING TO IT, TO REDUCE NO. OF WRITES
        {
            String file = "c:\\adsproj\\" + a + b + ".txt";
            new writethreader(file, virtual_buffer[a][b]);   // DOING THE ACTUAL WRITE PART IN A THREAD.
            virtual_buffer[a][b] = "";
        }
    }
}

class writethreader implements Runnable
{
    Thread t;
    String name, data;

    writethreader(String name, String data)
    {
        this.name = name;
        this.data = data;
        t = new Thread(this);
        t.start();
    }

    public void run()
    {
        try {
            File f = new File(name);
            if (!f.exists()) f.createNewFile();
            FileWriter fstream = new FileWriter(name, true); // APPEND MODE
            fstream.write(data);
            fstream.flush();
            fstream.close();
        }
        catch (Exception e) {}
    }
}
Consider using VisualVM to pinpoint the bottlenecks. Everything else below is based on guesswork - and performance guesswork is often really, really wrong.
I think you have two issues with your write strategy.
The first is that you're starting a new thread on each write; the second is that you're re-opening the file on each write.
The thread problem is especially bad, I think, because I don't see anything preventing one thread writing on a file from overlapping with another. What happens then? Frankly, I don't know - but I doubt it's good.
Consider, instead, creating an array of open files for all 100. Your OS may have a problem with this - but I think probably not. Then create a queue of work for each file. Create a set of worker threads (100 is too many - think 10 or so) where each "owns" a set of files that it loops through, outputting and emptying the queue for each file. Pay attention to the interthread interaction between queue reader and writer - use an appropriate queue class.
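Here is a rough sketch of that design with hypothetical names, as a slight variation in which the workers share one queue and synchronize per file rather than each owning a fixed subset: the bucket files stay open for the whole run, and a small fixed pool of threads drains the queue instead of a new thread being started for every write.
import java.io.*;
import java.util.concurrent.*;

class BucketWriterPool implements Closeable
{
    static class WorkItem {
        final int bucket;
        final String text;
        WorkItem(int bucket, String text) { this.bucket = bucket; this.text = text; }
    }

    private final Writer[] writers; // one writer per bucket file, opened once and kept open
    private final BlockingQueue<WorkItem> queue = new LinkedBlockingQueue<WorkItem>();
    private final ExecutorService pool;

    BucketWriterPool(String dir, int buckets, int threads) throws IOException
    {
        writers = new Writer[buckets];
        for (int i = 0; i < buckets; i++) {
            writers[i] = new BufferedWriter(new FileWriter(new File(dir, i + ".txt"), true));
        }
        pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(new Runnable() { public void run() { drain(); } });
        }
    }

    void submit(int bucket, String text)
    {
        queue.add(new WorkItem(bucket, text));
    }

    private void drain()
    {
        try {
            while (true) {
                WorkItem item = queue.take();
                synchronized (writers[item.bucket]) { // keep two workers from interleaving on one file
                    writers[item.bucket].write(item.text);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // shut down when interrupted
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void close() throws IOException
    {
        pool.shutdownNow();
        for (Writer w : writers) w.close();
    }
}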
I would throw away the entire requirement and use a database.
