what is the better way to substring large text? - java

Suppose my file is 2GB and I want some specific data from one index to another index (the data between the two indexes is about 300MB). What is the best way to do that? I tried substring but it throws an out-of-memory exception. Please suggest a better way to do the same.

In general, assuming that 2GB file is on disk and you want to read some part of it into memory, you absolutely don't have to read the whole 2GB into memory first.
The most straightforward solution is using RandomAccessFile.
The point is that it provides the abstraction of a pointer that can be moved back and forth over a big file, and once it is positioned you can read bytes from the place the pointer points to.
try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
    file.seek(position);        // move the file pointer to the start index
    byte[] bytes = new byte[size];
    file.readFully(bytes);      // read() may return fewer bytes than asked; readFully fills the whole array
}
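If what you ultimately need is a String rather than raw bytes, the slice can then be decoded with an explicit charset; a minimal sketch, assuming the file is UTF-8 text and that the chosen byte range does not split a multi-byte character:
String text = new String(bytes, java.nio.charset.StandardCharsets.UTF_8);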

Reading the file character by character and writing the characters to an output file can also solve the issue, since it won't load the whole file at once.
So the process is: read the input file character by character, skip ahead to the desired substring's start index, then write to an output file until the substring's end index is reached.
If you are getting Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, you can try increasing the heap size, but only if you really need to read the whole file at once and you are sure the String won't exceed the maximum String length.
The following snippet shows the idea described above:
import java.io.*;

public class LargeFileSubstr {
    public static void main(String[] args) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader("/Users/me/Downloads/big.txt"));
             PrintWriter wr = new PrintWriter(new FileWriter("/Users/me/Downloads/big_substr.txt"))) {
            int startIndex = 100;
            int endIndex = 200;
            int pointer = 0;
            int ch;
            while ((ch = r.read()) != -1) {
                if (pointer > endIndex) {
                    break;
                }
                if (pointer >= startIndex) {
                    wr.print((char) ch);
                }
                pointer++;
            }
        }
    }
}
I have tried this to take a 200MB substring out of a 2GB file; it works reasonably fast.

Related

Is reading/writing in an array more efficient than reading/writing a char/byte one by one?

try (FileReader reader = new FileReader("input.txt")) {
    int c;
    while ((c = reader.read()) != -1)
        System.out.print((char) c);
} catch (Exception ignored) { }
In this code, I read char by char. Is it more efficient in some way to read into an array of chars at once? In other words, is there any kind of optimization that happens when reading into arrays?
For example, in this code I have an array of chars called arr and I read into it until there is nothing left to read. Is it more efficient?
try (FileReader reader = new FileReader("input.txt")) {
    int size;
    char[] arr = new char[100];
    while ((size = reader.read(arr)) != -1)
        for (int i = 0; i < size; i++)
            System.out.print(arr[i]);
} catch (Exception ignored) { }
The question applies to both reading and writing, and to both chars and bytes.
It depends on the reader, but the answer can be yes. Whatever Reader or InputStream is the actual 'raw' driver (the one that isn't just wrapping another Reader or InputStream, but the one that is actually talking to the OS to get the data) may well implement the single-character read() method by asking the OS to read a single character.
In the end, you have a disk, and disks return data in blocks. So if you ask for 1 byte, you have 2 options as a computer:
Option 1: Ask the disk for the block that contains the byte that is to be read. Store the block in memory someplace for a while. Return one byte; for the next few moments, if more requests for bytes come in from the same block, return them from the stored data in memory and don't bother asking the disk at all. NOTE: This requires memory! Who allocates it? How much memory is okay? Tricky questions. OSes tend to give low-level tools and don't like just picking values for any of these questions.
Option 2: Ask the disk for the block that contains the byte that is to be read. Find the one byte needed within this block. Ignore the rest of the data, return just that one byte. If in a few moments another byte from that block is asked for... ask the disk, again, for the whole block, and repeat this routine.
Which of the two options you get depends on many factors: for example, what kind of disk it is, what OS you have, and what underlying Java reader you are using. But it is plausible you end up in the second mode, and that is, as you can probably tell, usually incredibly slow, because you end up reading the same block 4000+ times instead of only once.
So, how to fix this? Well, java doesn't really know what the OS is doing either, so the safest bet is to let java do the caching. Then you have no dependencies on whatever the OS is doing.
You could write it yourself, so instead of:
for (int i = in.read(); i != -1; i = in.read()) {
    processOneChar((char) i);
}
you could do:
char[] buffer = new char[4096];
while (true) {
    int r = in.read(buffer);
    if (r == -1) break;
    for (int i = 0; i < r; i++) processOneChar(buffer[i]);
}
It's more code, but now the second scenario (the same block being read off the disk a ton of times) can no longer occur; you have given the OS the freedom to return up to 4096 chars worth of data to you at once.
Or, use a java builtin: BufferedX:
BufferedReader br = new BufferedReader(in);
for (int i = br.read(); i != -1; i = br.read()) {
    processOneChar((char) i);
}
The implementation of BufferedReader guarantees that java will take care of making some reasonably sized buffer to avoid rereads of the same block off of disk.
NB: Note that the FileReader constructor you are using should not be used. It uses platform default encoding (anytime you convert bytes to characters, encoding is involved), and platform default is a recipe for untestable bugs, which are very bad. Use new FileReader(file, StandardCharsets.UTF_8) instead, or better yet, use the new API:
Path p = Paths.get("C:/file.txt");
try (BufferedReader br = Files.newBufferedReader(p)) {
    for (int i = br.read(); i != -1; i = br.read()) {
        processOneChar((char) i);
    }
}
Note that this:
Defaults to UTF-8, because the Files API defaults to UTF-8, unlike most places in the VM.
Makes a BufferedReader immediately; no need to make it yourself.
Properly manages the resource (ensures it is closed regardless of how this code exits, whether normally or via an exception) by using a try-with-resources (ARM) block.
Because a BufferedX is involved, there is no risk of the 'read the same block a lot' performance hole.
NB: The same logic applies when writing; disks such as SSDs can only write a whole block at a time. Writing byte by byte is then not just slow as molasses, it also wears out your disk, as SSDs only get a limited number of writes.
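A minimal sketch of the buffered counterpart on the write side, assuming the same java.nio.file and java.io imports as above and a made-up output path:
Path out = Paths.get("C:/file_copy.txt");
try (BufferedWriter bw = Files.newBufferedWriter(out)) {    // defaults to UTF-8 and is buffered
    for (int i = 0; i < 10_000; i++) {
        bw.write("line " + i);                              // accumulated in memory, flushed in whole blocks
        bw.newLine();
    }
}                                                           // closing flushes whatever is still buffered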

How to read a file in chunks that is too large to be stored in memory

I'm practicing and I ran across a problem about sorting numbers from a file that is too large to fit in memory. I didn't know how to do this, so I thought I would give it a try. I ended up finding external sort, and I'm basically just trying to take the concept and code a solution to this problem. The text file that I'm practicing with is not actually too large to fit into memory; I'm just trying to learn how to accomplish something like this. So far I am reading from the file in 3 chunks of 500 lines each, sorting the chunks, and then writing the sorted chunks each to their own file. This is working... although I'm not sure my implementation is how the external sort process is intended to be implemented:
import java.util.*;
import java.io.*;

public class ExternalSort {
    public static void main(String[] args) {
        File file = new File("Practice/lots_of_numbers.txt");
        final int NUMBER_OF_CHUNKS = 3;
        final int AMOUNT_PER_CHUNK = 500;
        int numbers[][] = new int[NUMBER_OF_CHUNKS][AMOUNT_PER_CHUNK];
        try {
            Scanner scanner = new Scanner(file);
            for (int i = 0; i < NUMBER_OF_CHUNKS; i++) {
                //Just creating a new file name for each chunk
                StringBuilder sortedFileName = new StringBuilder().append("sortedFile").append(i).append(".txt");
                for (int j = 0; j < AMOUNT_PER_CHUNK; j++) {
                    numbers[i][j] = Integer.parseInt(scanner.nextLine());
                }
                Arrays.sort(numbers[i]);
                saveResultsToFile(sortedFileName.toString(), numbers[i]);
            }
            scanner.close();
        } catch (FileNotFoundException e) {
            System.out.println("Error: " + e);
        }
    }

    public static void saveResultsToFile(String fileName, int arr[]) {
        try {
            File file = new File(fileName);
            PrintWriter printer = new PrintWriter(file);
            for (int i : arr)
                printer.println(i);
            printer.close();
        } catch (FileNotFoundException e) {
            System.out.println("Error :" + e);
        }
    }
}
My question is: how am I supposed to break up a file into chunks? I happen to know exactly how many lines of text my file has because I created it, so it was easy to write this code... BUT the problem actually tells you the size of the file in memory, not how many LINES of text it contains. I'm uncertain how to break up the data into "chunks of memory" (and how to size them) instead of lines of text. Also, if there is anything weird about my code, wrong, or bad practice, PLEASE tell me, as I honestly don't know what I'm doing; I'm just trying to learn. As far as merging the sorted files back together, I don't know how to do that either, but I have an idea. I would like to try it before I ask for help on that part. Thanks!
This is how to get the size of the chunks that we want to break the file into:
public static long chunkSize(File file) {
    //We don't want to create more than 1024 temp files for sorting
    final long MAX_AMOUNT_OF_TEMP_FILES = 1024;
    long fileSize = file.length();
    long freeMemory = Runtime.getRuntime().freeMemory();
    //We want to divide the file size by the maximum amount of temp files we will use for sorting
    long chunkSize = fileSize / MAX_AMOUNT_OF_TEMP_FILES;
    //If the chunk size is less than half the available memory, then we can stand to make it larger
    if (chunkSize < freeMemory / 2)
        chunkSize = freeMemory / 2;
    else
        System.out.println("We may potentially run out of memory");
    return chunkSize;
}
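To get from "chunks of lines" to "chunks of memory": keep reading whole lines, but start a new temporary file whenever the bytes collected so far reach the chunk size computed above. A minimal sketch of that idea; the input file name and the sortAndSave helper are illustrative, not part of the original code:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class ChunkedReader {
    public static void main(String[] args) throws IOException {
        File file = new File("Practice/lots_of_numbers.txt");
        // same idea as the chunkSize method above: at most 1024 temp files, at least half the free memory per chunk
        long maxChunkBytes = Math.max(file.length() / 1024, Runtime.getRuntime().freeMemory() / 2);

        List<Integer> chunk = new ArrayList<>();
        long bytesInChunk = 0;
        int chunkIndex = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                chunk.add(Integer.parseInt(line));
                // measure the chunk in bytes (line plus separator), not in lines
                bytesInChunk += line.getBytes(StandardCharsets.UTF_8).length + 1;
                if (bytesInChunk >= maxChunkBytes) {
                    sortAndSave(chunk, "sortedFile" + chunkIndex++ + ".txt");
                    chunk.clear();
                    bytesInChunk = 0;
                }
            }
        }
        if (!chunk.isEmpty()) {
            sortAndSave(chunk, "sortedFile" + chunkIndex + ".txt");
        }
    }

    // sort one in-memory chunk and write it to its own temporary file
    static void sortAndSave(List<Integer> chunk, String fileName) throws IOException {
        Collections.sort(chunk);
        try (PrintWriter pw = new PrintWriter(new FileWriter(fileName))) {
            for (int n : chunk) {
                pw.println(n);
            }
        }
    }
}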

How Buffered Streams work internally in Java

I'm reading about buffered streams. I searched about them and found many answers that cleared up my concepts, but I still have a few more questions.
After searching, I have come to know that a buffer is temporary memory (RAM) which helps a program read data quickly instead of going to the hard disk each time, and that when the buffer is empty the native input API is called.
After reading a little more I got this answer from here:
Reading data from disk byte-by-byte is very inefficient. One way to speed it up is to use a buffer: instead of reading one byte at a time, you read a few thousand bytes at once, and put them in a buffer, in memory. Then you can look at the bytes in the buffer one by one.
I have two points of confusion:
1: How, and by what, is the data filled into the buffer (the native API? how?)? As the quote above says, a few thousand bytes are filled at once, but won't it consume the same time? Suppose I have 5MB of data, and the 5MB is loaded into the buffer in 5 seconds, and then the program uses this data from the buffer in 5 seconds: 10 seconds total. But if I skip buffering, the program gets the data directly from the hard disk at 1MB per 2 seconds, which is the same 10 seconds total. Please clear up this confusion.
2: The second one is how this line works:
BufferedReader inputStream = new BufferedReader(new FileReader("xanadu.txt"));
As I understand it, FileReader writes data to the buffer, and then BufferedReader reads data from the buffer memory? Please explain this as well.
Thanks.
As for the performance of using buffering during read/write, it's probably minimal in impact since the OS will cache too, however buffering will reduce the number of calls to the OS, which will have an impact.
When you add other operations on top, such as character encoding/decoding or compression/decompression, the impact is greater as those operations are more efficient when done in blocks.
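For example, a decompression stream is typically stacked on top of a buffered stream so that the decompressor works on whole blocks instead of single bytes. A minimal sketch, assuming a gzip-compressed UTF-8 text file named data.txt.gz and the usual java.io, java.nio.charset.StandardCharsets, and java.util.zip.GZIPInputStream imports:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new GZIPInputStream(
                        new BufferedInputStream(
                                new FileInputStream("data.txt.gz"))),
                StandardCharsets.UTF_8));
String line;
while ((line = br.readLine()) != null) {
    // each readLine() is served from the in-memory buffers; the disk and the
    // decompressor are only asked for data in whole blocks
    System.out.println(line);
}
br.close();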
Your second question said:
As I understand it, FileReader writes data to the buffer, and then BufferedReader reads data from the buffer memory? Please explain this as well.
I believe your thinking is wrong. Yes, technically the FileReader will write data to a buffer, but the buffer is not defined by the FileReader, it's defined by the caller of the FileReader.read(buffer) method.
The operation is initiated from outside, when some code calls BufferedReader.read() (any of the overloads). BufferedReader will then check its buffer, and if enough data is available in the buffer, it will return the data without involving the FileReader. If more data is needed, the BufferedReader will call the FileReader.read(buffer) method to get the next chunk of data.
It's a pull operation, not a push, meaning the data is pulled out of the readers by the caller.
All of this is done by a private method named fill(), which I show here for educational purposes; any Java IDE lets you look at the source code yourself:
private void fill() throws IOException {
    int dst;
    if (markedChar <= UNMARKED) {
        /* No mark */
        dst = 0;
    } else {
        /* Marked */
        int delta = nextChar - markedChar;
        if (delta >= readAheadLimit) {
            /* Gone past read-ahead limit: Invalidate mark */
            markedChar = INVALIDATED;
            readAheadLimit = 0;
            dst = 0;
        } else {
            if (readAheadLimit <= cb.length) {
                /* Shuffle in the current buffer */
                // here copy the read chars in a memory buffer named cb
                System.arraycopy(cb, markedChar, cb, 0, delta);
                markedChar = 0;
                dst = delta;
            } else {
                /* Reallocate buffer to accommodate read-ahead limit */
                char ncb[] = new char[readAheadLimit];
                System.arraycopy(cb, markedChar, ncb, 0, delta);
                cb = ncb;
                markedChar = 0;
                dst = delta;
            }
            nextChar = nChars = delta;
        }
    }
    int n;
    do {
        n = in.read(cb, dst, cb.length - dst);
    } while (n == 0);
    if (n > 0) {
        nChars = dst + n;
        nextChar = dst;
    }
}
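The cb array in fill() is the in-memory buffer. Its size is whatever was passed to the BufferedReader constructor (8192 chars with the single-argument constructor used in the question). A minimal sketch of choosing the size explicitly, reusing the question's file name:
BufferedReader inputStream = new BufferedReader(new FileReader("xanadu.txt"), 64 * 1024); // 64K char buffer instead of the default 8192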

sorting lines of an enormous file.txt in java

I'm working with a very big text file (755MB).
I need to sort the lines (about 1,890,000 of them) and then write them back to another file.
I already noticed this discussion, which has a starting file really similar to mine:
Sorting Lines Based on words in them as keys
The problem is that I cannot store the lines in a collection in memory because I get a Java Heap Space exception (even after expanding the heap to its maximum; already tried!).
I can't open it with Excel and use the sorting feature either, because the file is too large to be loaded completely.
I thought about using a DB, but I think writing all the lines and then fetching them with a SELECT query would take too long to execute. Am I wrong?
Any hints appreciated.
Thanks in advance
I think the solution here is to do a merge sort using temporary files:
Read the first n lines of the input file (n being the number of lines you can afford to store and sort in memory), sort them, and write them to file 1.tmp (or whatever you call it). Do the same with the next n lines and store them in 2.tmp. Repeat until all lines of the original file have been processed.
Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.
Delete all the temporary files.
This works with arbitrarily large files, as long as you have enough disk space.
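A minimal sketch of the merge step described above, using a priority queue so the smallest current line across all temporary files is always picked next (the file names 1.tmp ... k.tmp, the output name, and plain String ordering are assumptions):
import java.io.*;
import java.util.*;

public class MergeSortedChunks {
    public static void main(String[] args) throws IOException {
        int k = 3; // number of temporary chunk files, assumed to be named 1.tmp ... k.tmp
        // the queue is ordered by each reader's current line
        Comparator<ChunkReader> byLine = Comparator.comparing(c -> c.line);
        PriorityQueue<ChunkReader> queue = new PriorityQueue<>(byLine);
        for (int i = 1; i <= k; i++) {
            ChunkReader cr = new ChunkReader(new BufferedReader(new FileReader(i + ".tmp")));
            if (cr.advance()) queue.add(cr);
        }
        try (PrintWriter out = new PrintWriter(new FileWriter("sorted.txt"))) {
            while (!queue.isEmpty()) {
                ChunkReader cr = queue.poll();       // the chunk with the smallest current line
                out.println(cr.line);
                if (cr.advance()) queue.add(cr);     // re-insert it with its next line
                else cr.reader.close();              // this chunk is exhausted
            }
        }
    }

    static class ChunkReader {
        final BufferedReader reader;
        String line;
        ChunkReader(BufferedReader reader) { this.reader = reader; }
        boolean advance() throws IOException {       // load the next line; false at end of file
            line = reader.readLine();
            return line != null;
        }
    }
}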
You can run the following with
-mx1g -XX:+UseCompressedStrings # on Java 6 update 29
-mx1800m -XX:-UseCompressedStrings # on Java 6 update 29
-mx2g # on Java 7 update 2.
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        generateFile("lines.txt", 755 * 1024 * 1024, 189000);

        List<String> lines = loadLines("lines.txt");

        System.out.println("Sorting file");
        Collections.sort(lines);
        System.out.println("... Sorted file");
        // save lines.

        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
    }

    private static void generateFile(String fileName, int size, int lines) throws FileNotFoundException {
        System.out.println("Creating file to load");
        int lineSize = size / lines;
        StringBuilder sb = new StringBuilder();
        while (sb.length() < lineSize) sb.append('-');
        String padding = sb.toString();

        PrintWriter pw = new PrintWriter(fileName);
        for (int i = 0; i < lines; i++) {
            String text = (i + padding).substring(0, lineSize);
            pw.println(text);
        }
        pw.close();
        System.out.println("... Created file to load");
    }

    private static List<String> loadLines(String fileName) throws IOException {
        System.out.println("Reading file");
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        List<String> ret = new ArrayList<String>();
        String line;
        while ((line = br.readLine()) != null)
            ret.add(line);
        System.out.println("... Read file.");
        return ret;
    }
}
prints
Creating file to load
... Created file to load
Reading file
... Read file.
Sorting file
... Sorted file
Took 4.886 second to read, sort and write to a file
Divide and conquer is the best solution :)
Divide your file into smaller ones, sort each file separately, then merge them back together.
Links:
Sort a file with huge volume of data given memory constraint
http://hackerne.ws/item?id=1603381
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.
1. Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the sorted lines back to a file.
2. Now bring the next chunk into memory and sort it.
3. Once we're done, merge the chunks one by one.
The above algorithm is also known as external sort. Step 3 is known as an N-way merge.
Why don't you try multithreading and increasing the heap size of the program you are running? (This also requires you to use a merge-sort kind of approach, provided you have more memory than 755MB in your system.)
Maybe you can use Perl to format the file and load it into a database like MySQL; that's very fast. Then use an index to query the data and write the result to another file.
You can set the JVM heap size like '-Xms256m -Xmx1024m'. I hope this helps, thanks.

Java heap space error while reading file in byte array

I am getting a Java out-of-heap error while using the following code. Can someone tell me what I am doing wrong here?
On debugging I see that the value of length is 709582875.
In the main function:
File file = new File(fileLocation + fileName);
if (file.exists()) {
    s3Client.upload(bucketName, fileName, getBytesFromFile(file));
}
// Returns the contents of the file in a byte array.
public static byte[] getBytesFromFile(File file) throws IOException {
    InputStream is = new FileInputStream(file);

    // Get the size of the file
    long length = file.length();

    // You cannot create an array using a long type.
    // It needs to be an int type.
    // Before converting to an int type, check
    // to ensure that file is not larger than Integer.MAX_VALUE.
    if (length > Integer.MAX_VALUE) {
        // File is too large
        log.debug("file is too large" + length);
        System.out.println("file is too large" + length);
    }
    if (length < Integer.MIN_VALUE || length > Integer.MAX_VALUE) {
        throw new IOException(length + " cannot be cast to int without changing its value.");
    }

    // return "test".getBytes();
    // Create the byte array to hold the data
    byte[] bytes = null;
    try {
        bytes = new byte[(int) length];
    } catch (OutOfMemoryError e) {
        System.out.println(e.getStackTrace().toString());
    }

    // Read in the bytes
    int offset = 0;
    int numRead = 0;
    while (offset < bytes.length
            && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
        offset += numRead;
    }

    // Ensure all the bytes have been read in
    if (offset < bytes.length) {
        throw new IOException("Could not completely read file " + file.getName());
    }

    // Close the input stream and return bytes
    is.close();
    return bytes;
}
The problem is that the byte array you are allocating is too large and it uses up the heap space.
You may try running your program with the -Xms and -Xmx options to specify the min and max heap space the Java virtual machine uses to run your program.
But I suggest you not read the whole file into a byte array to process it. You can read part of it into a small byte array, process that portion, and continue to the next part. This way uses less heap space.
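A minimal sketch of that chunked approach; the processChunk consumer is a hypothetical stand-in for whatever you do with each portion (for S3 that would typically be a multipart upload rather than one upload call):
import java.io.*;

public class ChunkedFileRead {
    public static void main(String[] args) throws IOException {
        File file = new File("bigfile.bin");
        byte[] buffer = new byte[8 * 1024 * 1024];   // 8MB working buffer, reused for every chunk
        try (InputStream is = new BufferedInputStream(new FileInputStream(file))) {
            int read;
            while ((read = is.read(buffer)) != -1) {
                processChunk(buffer, read);          // only the first 'read' bytes of the buffer are valid
            }
        }
    }

    // hypothetical consumer of one chunk
    static void processChunk(byte[] chunk, int length) {
        System.out.println("got " + length + " bytes");
    }
}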
You are consuming 709582875 bytes (about 677MB) at the moment the byte array in the try block is allocated. This is quite large by conventional personal computing standards, and would consume most (if not all) of the memory of a JVM started with default settings.
Some information on default JVM memory settings can be found here
Try to increase heap size allocated by the Java Virtual Machine (JVM),
something like:
java -Xms<initial heap size> -Xmx<maximum heap size>
For example:
java -Xms64m -Xmx256m HelloWorld
Do not create such a huge byte[] array; your heap may run out of memory. It is a bad idea to create a byte[] array of the file's length for such a large file. Create a small byte array and read the file chunk by chunk.
You need some JVM tuning:
java -Xms256m -Xmx1024m
Is there a particular reason you need to read the whole file at once as a byte[]? Could you use a memory-mapped ByteBuffer instead, as this uses very little heap regardless of the size of the file?
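A minimal sketch of memory-mapping the file with FileChannel.map, which gives you a MappedByteBuffer backed by the file rather than by heap memory (whether the s3Client in the question can accept a ByteBuffer instead of a byte[] is an assumption you'd have to check):
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedFileExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("bigfile.bin"), StandardOpenOption.READ)) {
            // The mapping lives outside the Java heap; the OS pages in only the parts being touched.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            while (buffer.hasRemaining()) {
                byte b = buffer.get();   // read the file through the mapping, byte by byte here
                // ... process b ...
            }
        }
    }
}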
