I need advice from someone who knows Java very well, including its memory issues.
I have a large file (around 1.5GB) and I need to cut it into many smaller files (100 small files, for example).
I generally know how to do it (using a BufferedReader), but I would like to know whether you have any advice regarding memory, or tips on how to do it faster.
My file contains text; it is not binary, and I have about 20 characters per line.
To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign it to variables outside the loop). Just process the output immediately as the input comes in.
It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest; at worst it costs a few percent of performance. The same applies to using NIO: it would only improve scalability, not memory use, and it only becomes interesting when you have hundreds of threads running on the same file.
Just loop through the file, write every line immediately to another file as you read it in, count the lines, and once the count reaches 100, switch to the next file, et cetera.
Kickoff example:
String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;

try {
    reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
    int count = 0;
    for (String line; (line = reader.readLine()) != null;) {
        if (count++ % maxlines == 0) {
            close(writer);
            writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
        }
        writer.write(line);
        writer.newLine();
    }
} finally {
    close(writer);
    close(reader);
}

// Null-safe close helper assumed by the snippet above.
private static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException ignore) {
            // Not much you can do about an exception on close.
        }
    }
}
First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).
Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.
You can consider using memory-mapped files, via FileChannel.
They are generally a lot faster for large files, but there are performance trade-offs that could make them slower, so YMMV.
Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
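A minimal sketch of the memory-mapped approach (the paths are placeholders, and mapping the whole file in one go assumes it stays under the 2GB-per-mapping limit, which a 1.5GB file does):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedCopy {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile in = new RandomAccessFile("bigfile.txt", "r");
             RandomAccessFile out = new RandomAccessFile("copy.txt", "rw")) {
            FileChannel src = in.getChannel();
            FileChannel dst = out.getChannel();
            // Map the source read-only and push it through the destination channel.
            MappedByteBuffer buf = src.map(FileChannel.MapMode.READ_ONLY, 0, src.size());
            while (buf.hasRemaining()) {
                dst.write(buf);
            }
        }
    }
}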
This is a very good article:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
In summary, for great performance, you should:
Avoid accessing the disk.
Avoid accessing the underlying operating system.
Avoid method calls.
Avoid processing bytes and characters individually.
For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
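For instance, a plain copy through a large explicit buffer might look like this (1MB is an arbitrary choice and the paths are placeholders; measure before committing to a size):

import java.io.*;

public class BigBufferCopy {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[1024 * 1024]; // large buffer: fewer disk accesses, no per-byte processing
        try (InputStream in = new FileInputStream("bigfile.txt");
             OutputStream out = new FileOutputStream("copy.txt")) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}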
Does it have to be done in Java? I.e., does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted to, you could execute this command from your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
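If you go that route, invoking it from Java could look roughly like this (this assumes a Unix-like system with split on the PATH; the line count per piece is an arbitrary example):

import java.io.IOException;

public class SplitViaShell {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Split bigfile.txt into pieces of 100000 lines each, named partaa, partab, ...
        Process p = new ProcessBuilder("split", "-l", "100000", "bigfile.txt", "part")
                .inheritIO()
                .start();
        int exit = p.waitFor();
        System.out.println("split exited with code " + exit);
    }
}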
You can use java.nio, which is faster than the classical input/output streams:
http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
Yes.
I also think that using read() with arguments, like read(char[], int offset, int length), is a better way to read such a large file
(e.g. read(buffer, 0, buffer.length)).
I also experienced the problem of missing values when using a BufferedReader instead of a BufferedInputStream for a binary data input stream, so using a BufferedInputStream is much better in a case like this.
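A sketch of that style of reading (the buffer size and file name are arbitrary, and the processing is left as a stub):

import java.io.*;

public class ChunkedCharRead {
    public static void main(String[] args) throws IOException {
        char[] buffer = new char[8192];
        try (Reader reader = new InputStreamReader(new FileInputStream("bigfile.txt"), "UTF-8")) {
            int n;
            while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
                // process buffer[0..n) here
            }
        }
    }
}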
package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;
/**
 * @author Naresh Bhabat
 *
 * The following implementation helps to deal with extra-large files in Java.
 * This program has been tested with a 2GB input file.
 * There are some points where extra logic can be added in the future.
 * Please note: if we want to deal with a binary input file, then instead of reading lines we need to read bytes from the file object.
 * It uses a RandomAccessFile, which is almost like a streaming API.
 * ****************************************
 * Notes regarding the executor framework and its timings.
 * Please note: ExecutorService executor = Executors.newFixedThreadPool(10);
 *
 * For 10 threads: total time required for reading and writing the text: 349.317 seconds
 * For 100: total time required for reading and writing the text: 464.042 seconds
 * For 1000: total time required for reading and writing the text: 466.538 seconds
 * For 10000: total time required for reading and writing the text: 479.701 seconds
 */
public class DealWithHugeRecordsinFile extends TestCase {
static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
static volatile RandomAccessFile fileToWrite;
static volatile RandomAccessFile file;
static volatile String fileContentsIter;
static volatile int position = 0;
public static void main(String[] args) throws IOException, InterruptedException {
long currentTimeMillis = System.currentTimeMillis();
try {
fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles
file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles
seriouslyReadProcessAndWriteAsynch();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Thread currentThread = Thread.currentThread();
System.out.println(currentThread.getName());
long currentTimeMillis2 = System.currentTimeMillis();
double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
System.out.println("Total time required for reading the text in seconds " + time_seconds);
}
/**
 * @throws IOException
 * Something asynchronously serious
 */
public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
while (true) {
String readLine = file.readLine();
if (readLine == null) {
break;
}
Runnable genuineWorker = new Runnable() {
@Override
public void run() {
// Do the hard per-line processing here; this demo burns some time
// and deliberately ignores an exception in the write method.
writeToFile(FILEPATH_WRITE, readLine);
// System.out.println(" :" +
// Thread.currentThread().getName());
}
};
executor.execute(genuineWorker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
file.close();
fileToWrite.close();
}
/**
 * @param filePath
 * @param data
 */
private static void writeToFile(String filePath, String data) {
try {
// fileToWrite.seek(position);
data = "\n" + data;
if (!data.contains("Randomization")) {
return;
}
System.out.println("Let us do something time consuming to make this thread busy"+(position++) + " :" + data);
System.out.println("Lets consume through this loop");
int i=1000;
while(i>0){
i--;
}
fileToWrite.write(data.getBytes());
throw new Exception();
} catch (Exception exception) {
System.out.println("exception was thrown but still we are able to proceeed further"
+ " \n This can be used for marking failure of the records");
//exception.printStackTrace();
}
}
}
Don't use read() without arguments.
It's very slow.
Better to read into a buffer and then move the data to the file quickly.
Use a BufferedInputStream, because it supports binary reading.
And that's all.
Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to start with a file containing 100 lines and write it to 100 different files, one line in each, making the triggering mechanism work on the number of lines written to the current file. That program will easily scale to your situation.
Related
I am running a long-running operation, say 100k jobs. I want to update its progress in a file once every 100 such jobs are completed.
I am opening the file using a BufferedWriter with append mode set to false, writing to it and then closing it. This is done once every 100 jobs are completed, so the file would have been opened and closed 1000 times. Can I optimise it further by opening and closing the file only once?
public static void writeMetaData(String writeDir, JSONObject jsonObject) throws Exception {
String filePath = writeDir.concat("/").concat("metadata.txt");
BufferedWriter metaDataWriter = Files.newBufferedWriter(Paths.get(filePath), StandardCharsets.UTF_8, StandardOpenOption.TRUNCATE_EXISTING);
metaDataWriter.write(jsonObject.toString());
IOUtils.closeQuietly(metaDataWriter);
}
for(int i =0 ; i < 100000; i++) {
// do Something;
if(i % 100 == 0) {
writeMetaData(writeDir, jsonObject);
}
}
The file should only ever contain a single line.
Expected file content after 100 jobs:
progress: 100
Expected file content after 200 jobs:
progress: 200
Can this be optimised further?
First of all, an expression like writeDir.concat("/").concat("metadata.txt") reduces readability and performance. A straightforward writeDir + "/" + "metadata.txt" will perform better. But since you’re constructing a string merely for constructing a Path, it’s even more straightforward not to do the Path’s job in your code but rather use Paths.get(writeDir, "metadata.txt").
You cannot rewind a BufferedWriter, but you can rewind a FileChannel. Therefore, to keep the channel open and rewind it when needed, you have to construct a new writer after rewinding:
public static void writeMetaData(FileChannel ch, JSONObject jsonObj) throws IOException {
ch.position(0);
if(ch.size() > 0) ch.truncate(0);
Writer w = Channels.newWriter(ch, StandardCharsets.UTF_8.newEncoder(), 8192);
w.write(jsonObj.toString());
w.flush();
}
try(FileChannel ch = FileChannel.open(Paths.get(writeDir, "metadata.txt"),
StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
for(int i = 0; i < 100000; i++) {
// do Something;
if(i % 100 == 0) {
writeMetaData(ch, jsonObject);
}
}
}
It’s important that the use of the Writer ends with flush() to force the write of all buffered data, but not close() as that would also close the underlying channel. Note that this code does not wrap the writer into a BufferedWriter; encoding text as UTF-8 is already a buffered operation and by requesting a larger buffer for the encoder, matching BufferedWriter’s default buffer size, we get the same effect of buffering without the copying overhead.
Since writing is not an end in itself, there’s a question left regarding your reading side. If the reader is trying to read the data in some intervals, there’s the risk of overlapping with the write, getting incomplete data.
You could use
public static void writeMetaData(FileChannel ch, JSONObject jsonObj) throws IOException {
try(FileLock lock = ch.lock()) {
ch.position(0);
if(ch.size() > 0) ch.truncate(0);
Writer w = Channels.newWriter(ch, StandardCharsets.UTF_8.newEncoder(), 8192);
w.write(jsonObj.toString());
w.flush();
}
}
to lock the file during the write. But depending on the system, file locking might not be mandatory but only affect readers also trying to get a read lock.
When you use JDK 11 or newer, you may consider using
for(int i = 0; i < 100000; i++) {
// do Something;
if(i % 100 == 0) {
Files.writeString(Paths.get(writeDir, "metadata.txt"), jsonObject.toString());
}
}
which clearly wins on simplicity (yes, that’s the complete code, no additional method required). The default options do already include the desired StandardCharsets.UTF_8 and StandardOpenOption.TRUNCATE_EXISTING.
While it does open and close the file internally, it has some other performance tweaks which may compensate. Especially in the likely case that the string consists of ASCII characters only, as the implementation will simply write the string’s internal array directly to the file then.
A stream does not allow you to go back and rewrite content. A way to achieve what you want is to use a RandomAccessFile.
Its setLength() method will truncate the file if you pass 0.
Here is a simple example:
import java.io.*;
public class Test
{
public static void updateFile(RandomAccessFile raf, String content) throws IOException
{
raf.setLength(0);
raf.write(content.getBytes("UTF-8"));
}
public static void main(String[] args) throws IOException
{
try(RandomAccessFile raf = new RandomAccessFile("metadata.txt", "rw"))
{
updateFile(raf, "progress: 100");
updateFile(raf, "progress: 200");
}
}
}
File operations are typically buffered by the underlying kernel, and so you're unlikely to see much of a performance benefit by keeping an open file descriptor for this kind of low throughput application.
Keeping your code as a single operation that leaves no state behind once it finishes, rather than designing it as a continuously rewindable stream, makes for an elegant, simple and (unless you've specifically requested synchronous IO) sufficiently performant implementation that benefits from the optimizations of all the layers that sit beneath it.
When you do get a measurable performance impact from this, which I suspect you never will, you could use the RandomAccessFile API, or go unnecessarily lower-level by using a FileChannel as others have already described.
I think you shouldn't compromise the simplicity/elegance of your design for this kind of micro-optimization, which in the grand scheme of things, is guaranteed to be insignificant (one tiny write operation per 100 jobs processed).
I have to analyze different log files, which involves retrieving timestamps, URLs, etc.
I am using multithreading for this. Each thread accesses a different log file and does the task.
Program for doing it:
import java.io.*;
import java.util.ArrayList;

public class checkMultithreadedThroughput{
public static void main(String args[]){
ArrayList<String> fileNames = new ArrayList<>();
fileNames.add("log1");
fileNames.add("log2");
fileNames.add("log3");
fileNames.add("log4");
fileNames.add("log5");
fileNames.add("log6");
fileNames.add("log7");
fileNames.add("log8");
fileNames.add("log9");
Thread[] threads = new Thread[fileNames.size()];
try{
for(int i=0; i<fileNames.size(); i++){
threads[i] = new MultithreadedThroughput(fileNames.get(i));
threads[i].start();
}
}catch(Exception e){
e.printStackTrace();
}
}
}
class MultithreadedThroughput extends Thread{
String filename = null;
MultithreadedThroughput(String filename){
this.filename = filename;
}
public void run(){
calculateThroughput();
}
public void calculateThroughput(){
String line = null;
BufferedReader br = null;
try{
br = new BufferedReader(new FileReader(new File(filename)));
while((line = br.readLine())!=null){
//do the analysis on line
}
}catch(Exception e){
e.printStackTrace();
}
}
}
Now, in the MultithreadedThroughput class, which extends Thread, I am reading each file using a BufferedReader. The whole process takes around 15 minutes (each file is big, around 2GB).
I want to optimize the program in such a way that it takes less time.
The solution I thought of: instead of starting threads on all log files at once, take one large log file at a time, split it into chunks (the number of chunks equal to the number of processors) and then start threads on those chunks; OR keep the program as it is but, instead of reading one line at a time, read multiple lines at a time and do the analysis. But I don't know how to do either of them.
Please explain the solution.
In the calculateThroughput method I have to estimate the throughput of a URL per one-hour interval. So if I break the files up depending on the number of processors, a break may fall inside an interval, i.e.:
Suppose an interval starts at 06:00:00 and runs until 07:00:00 (one interval); like this there will be 24 intervals (one day) in each log file. So if I break a large log file, it may break in the middle of an interval, and then I don't know how to calculate that one interval. That's the problem I am facing with splitting the file.
I would not try and split a single file for multiple threads. This will create overhead and can't be better than doing several files in parallel.
Create the BufferedReader with a substantial buffer size, e.g. 64k or bigger. The optimum is system-dependent - you'll have to experiment. Later (in response to a comment from the OP): the buffer size does not affect the application logic - data is read line by line, and the step from one hour into the next must be handled anyway by carrying the line over into the next batch.
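For reference, the buffer size is just the second constructor argument; in calculateThroughput that would be (64 KB here as a starting point, not a measured optimum):

// inside calculateThroughput(), replacing the plain constructor call
br = new BufferedReader(new FileReader(new File(filename)), 64 * 1024);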
There is no point in reading several lines at a time - readLine just fetches a line from the buffer.
Very likely you are losing time in the analysis.
I don't think you can do the job faster because more threads do not help if your processor doesn't have enough cores.
I have a huge CSV file with over 700K lines. I have to parse the lines of that CSV file and do operations on them. I thought of doing it using threading. What I attempt to do at first is simple: every thread should process unique lines of the CSV file. I have limited the number of lines to read to 3000 only. I create three threads. Each thread should read a line of the CSV file. Following is the code:
import java.io.*;
class CSVOps implements Runnable
{
static int lineCount = 1;
static int limit = 3000;
BufferedReader CSVBufferedReader;
public CSVOps(){} // Default constructor
public CSVOps(BufferedReader br){
this.CSVBufferedReader = br;
}
private synchronized void readCSV(){
System.out.println("Current thread "+Thread.currentThread().getName());
String line;
try {
while((line = CSVBufferedReader.readLine()) != null){
System.out.println(line);
lineCount ++;
if(lineCount >= limit){
break;
}
}
}
catch (IOException e) {
e.printStackTrace();
}
}
public void run() {
readCSV();
}
}
class CSVResourceHandler
{
String CSVPath;
public CSVResourceHandler(){ }// default constructor
public CSVResourceHandler(String path){
File f = new File(path);
if(f.exists()){
CSVPath = path;
}
else{
System.out.println("Wrong file path! You gave: "+path);
}
}
public BufferedReader getCSVFileHandler(){
BufferedReader br = null;
try{
FileReader is = new FileReader(CSVPath);
br = new BufferedReader(is);
}
catch(Exception e){
}
return br;
}
}
public class invalidRefererCheck
{
public static void main(String [] args) throws InterruptedException
{
String pathToCSV = "/home/shantanu/DEV_DOCS/Contextual_Work/invalid_domain_kw_site_wise_click_rev2.csv";
CSVResourceHandler csvResHandler = new CSVResourceHandler(pathToCSV);
CSVOps ops = new CSVOps(csvResHandler.getCSVFileHandler());
Thread t1 = new Thread(ops);
t1.setName("T1");
Thread t2 = new Thread(ops);
t1.setName("T2");
Thread t3 = new Thread(ops);
t1.setName("T3");
t1.start();
t2.start();
t3.start();
}
}
The class CSVResourceHandler simply checks whether the passed file exists and then creates a BufferedReader and returns it. This reader is passed to the CSVOps class, which has a method, readCSV, that reads a line of the CSV file and prints it. There is a limit set to 3000 lines.
Now, so that the threads do not mess up the count, I declare both the limit and count variables as static. When I run this program I get weird output: I get only about 1000 records, and sometimes I get 1500, and they are in random order. At the end of the output I get two lines of the CSV file, and the current thread name comes out to be main!
I am very much a novice with threads. I want reading this CSV file to become faster. What can be done?
Ok, first, do not use multiple threads to do parallel I/O from a single mechanical disk. It actually slows down performance because the mechanical head needs to seek the next reading location every time a thread gets a chance to run. Thus you are unnecessarily bouncing the disk's head around which is a costly operation.
Use a single producer multiple consumer model to read lines using a single thread and process them using a pool of workers.
On to your problem:
Shouldn't you actually be waiting for the threads to finish before exiting main?
public class invalidRefererCheck
{
public static void main(String [] args) throws InterruptedException
{
...
t1.start();
t2.start();
t3.start();
t1.join();
t2.join();
t3.join();
}
}
I suggest reading the file in big chunks. Allocate a big buffer object, read a chunk, parse back from the end to find the last EOL char, copy the last bit of the buffer into a temp string, shove a null into the buffer at the EOL+1, queue off the buffer reference, immediately create a new one, copy in the temp string first, then fill up the rest of the buffer and repeat until EOF. Repeat until done. Use a pool of threads to parse/process the buffers.
You have to queue up whole chunks of valid lines. Queueing off single lines will result in the thread comms taking longer than the parsing.
Note that this, and similar, will probably result in the chunks being processed 'out-of-order' by the threads in the pool. If order must be preserved, (for example, the input file is sorted and the output is going into another file which must remain sorted), you can have the chunk-assembler thread insert a sequence-number in each chunk object. The pool threads can then pass processed buffers to yet another thread, (or task), that keeps a list of out-of-order chunks until all previous chunks have come in.
Multithreading does not have to be difficult/dangerous/ineffective. If you use queues/pools/tasks, avoid synchronize/join, don't continually create/terminate/destroy threads and only queue around large buffer objects that only one thread ever gets to work on at a time. You should see a good speedup with next-to-no possibility of deadlocks, false-sharing, etc.
The next step in such a speedup would be to pre-allocate a pool queue of buffers to eliminate continual creation/deletion of the buffers and associated GC and with a (L1 cache size) 'dead-zone' at the start of every buffer to eliminate cache-sharing completely.
That would go plenty quick on a multicore box, (esp. with an SSD!).
Oh, Java, right. I apologise for the 'CplusPlus-iness' of my answer with the null terminator. The rest of the points are OK, though. This should be a language-agnostic answer:)
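In Java terms, the shape of that pipeline is roughly the following (chunk size, queue capacity, pool size and the file name are arbitrary assumptions; the per-line processing is a stub and the sequence-number/reordering part from the last paragraph is omitted):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Single reader thread queues whole chunks of lines; a small pool processes them.
public class ChunkedPipeline {
    static final List<String> POISON = new ArrayList<>(); // end-of-input marker
    static final int CHUNK_LINES = 10_000;

    public static void main(String[] args) throws Exception {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(8);
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    for (List<String> chunk; (chunk = queue.take()) != POISON; ) {
                        for (String line : chunk) {
                            // parse/process the line here
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The producer: read with a big buffer and hand off whole chunks, not single lines.
        try (BufferedReader br = new BufferedReader(new FileReader("big.csv"), 1 << 20)) {
            List<String> chunk = new ArrayList<>(CHUNK_LINES);
            for (String line; (line = br.readLine()) != null; ) {
                chunk.add(line);
                if (chunk.size() == CHUNK_LINES) {
                    queue.put(chunk);
                    chunk = new ArrayList<>(CHUNK_LINES);
                }
            }
            if (!chunk.isEmpty()) queue.put(chunk);
        }
        for (int i = 0; i < workers; i++) queue.put(POISON); // one end marker per worker
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}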
I'm working with a very big text file (755MB).
I need to sort the lines (about 1,890,000 of them) and then write them back to another file.
I already noticed this discussion, which has a starting file really similar to mine:
Sorting Lines Based on words in them as keys
The problem is that I cannot store the lines in a collection in memory, because I get a Java heap space exception (even after expanding the heap to its maximum)... (already tried!)
I can't open it with Excel and use the sorting feature either, because the file is too large to be loaded completely.
I thought about using a DB... but I think that writing all the lines and then using a SELECT query would take too long in terms of execution time. Am I wrong?
Any hints appreciated
Thanks in advance
I think the solution here is to do a merge sort using temporary files:
Read the first n lines of the first file, (n being the number of lines you can afford to store and sort in memory), sort them, and write them to file 1.tmp (or however you call it). Do the same with the next n lines and store it in 2.tmp. Repeat until all lines of the original file has been processed.
Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.
Delete all the temporary files.
This works with arbitrary large files, as long as you have enough disk space.
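A compact sketch of that approach (the chunk size, file names and temporary-file handling are illustrative choices, not tuned values):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// External merge sort: sort fixed-size chunks in memory, then k-way merge them.
public class ExternalSort {
    public static void main(String[] args) throws IOException {
        int chunkLines = 200_000; // tune to what fits comfortably in the heap
        List<Path> chunks = new ArrayList<>();

        // Phase 1: split the input into sorted temporary files.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(chunkLines);
            for (String line; (line = in.readLine()) != null; ) {
                buffer.add(line);
                if (buffer.size() == chunkLines) chunks.add(writeSorted(buffer));
            }
            if (!buffer.isEmpty()) chunks.add(writeSorted(buffer));
        }

        // Phase 2: merge the sorted chunks with a priority queue of readers.
        PriorityQueue<ChunkReader> heap = new PriorityQueue<>(Comparator.comparing((ChunkReader r) -> r.head));
        for (Path p : chunks) {
            ChunkReader r = new ChunkReader(p);
            if (r.head != null) heap.add(r); else r.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"), StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                ChunkReader r = heap.poll();
                out.write(r.head);
                out.newLine();
                if (r.advance()) heap.add(r); else r.close();
            }
        }
        for (Path p : chunks) Files.deleteIfExists(p);
    }

    private static Path writeSorted(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path tmp = Files.createTempFile("chunk", ".tmp");
        Files.write(tmp, buffer, StandardCharsets.UTF_8);
        buffer.clear();
        return tmp;
    }

    // Wraps one temporary file and exposes its current smallest (first) line.
    private static final class ChunkReader {
        final BufferedReader reader;
        String head;
        ChunkReader(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
            head = reader.readLine();
        }
        boolean advance() throws IOException { head = reader.readLine(); return head != null; }
        void close() throws IOException { reader.close(); }
    }
}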
You can run the following with
-mx1g -XX:+UseCompressedStrings # on Java 6 update 29
-mx1800m -XX:-UseCompressedStrings # on Java 6 update 29
-mx2g # on Java 7 update 2.
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class Main {
public static void main(String... args) throws IOException {
long start = System.nanoTime();
generateFile("lines.txt", 755 * 1024 * 1024, 189000);
List<String> lines = loadLines("lines.txt");
System.out.println("Sorting file");
Collections.sort(lines);
System.out.println("... Sorted file");
// save lines.
long time = System.nanoTime() - start;
System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
}
private static void generateFile(String fileName, int size, int lines) throws FileNotFoundException {
System.out.println("Creating file to load");
int lineSize = size / lines;
StringBuilder sb = new StringBuilder();
while (sb.length() < lineSize) sb.append('-');
String padding = sb.toString();
PrintWriter pw = new PrintWriter(fileName);
for (int i = 0; i < lines; i++) {
String text = (i + padding).substring(0, lineSize);
pw.println(text);
}
pw.close();
System.out.println("... Created file to load");
}
private static List<String> loadLines(String fileName) throws IOException {
System.out.println("Reading file");
BufferedReader br = new BufferedReader(new FileReader(fileName));
List<String> ret = new ArrayList<String>();
String line;
while ((line = br.readLine()) != null)
ret.add(line);
System.out.println("... Read file.");
return ret;
}
}
prints
Creating file to load
... Created file to load
Reading file
... Read file.
Sorting file
... Sorted file
Took 4.886 second to read, sort and write to a file
Divide and conquer is the best solution :)
Divide your file into smaller ones, sort each file separately, then regroup.
Links:
Sort a file with huge volume of data given memory constraint
http://hackerne.ws/item?id=1603381
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.
Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the lines back to the file.
Now bring the next chunk into memory and sort.
Once we’re done, merge them one by one.
The above algorithm is also known as an external sort. Step 3 is known as an N-way merge.
Why don't you try multithreading and increasing the heap size of the program you are running? (This also requires you to use something like merge sort, provided you have more memory than 755MB in your system.)
Maybe you can use Perl to format the file and load it into a database like MySQL. It's very fast. Then use an index to query the data and write the results to another file.
You can set the JVM heap size with something like '-Xms256m -Xmx1024m'. I hope this helps. Thanks.
OK. I am supposed to write a program that takes a 20 GB file with 1,000,000,000 records as input and creates some kind of an index for faster access. I have basically decided to split the 1 billion records into 10 buckets and 10 sub-buckets within those. I calculate two hash values for each record to locate its appropriate bucket. Then I create 10*10 files, one for each sub-bucket. As I hash each record from the input file, I decide which of the 100 files it goes to and append the record's offset to that particular file.
I have tested this with a sample file with 10,000 records. I repeated the process 10 times, effectively emulating a 100,000-record file. For this it takes me around 18 seconds. This means it's going to take me forever to do the same for a 1 billion record file.
Is there any way I can speed up / optimize my writing?
I am going through all this because I can't store all the records in main memory.
import java.io.*;
// PROGRAM DOES THE FOLLOWING
// 1. READS RECORDS FROM A FILE.
// 2. CALCULATES TWO SETS OF HASH VALUES N, M
// 3. APPENDING THE OFFSET OF THAT RECORD IN THE ORIGINAL FILE TO ANOTHER FILE "NM.TXT" i.e REPLACE THE VALUES OF N AND M.
// 4.
class storage
{
public static int siz=10;
public static FileWriter[][] f;
}
class proxy
{
static String[][] virtual_buffer;
public static void main(String[] args) throws Exception
{
virtual_buffer = new String[storage.siz][storage.siz]; // TEMPORARY STRING BUFFER TO REDUCE WRITES
String s,tes;
for(int y=0;y<storage.siz;y++)
{
for(int z=0;z<storage.siz;z++)
{
virtual_buffer[y][z]=""; // INITIALISING ALL ELEMENTS TO ZERO
}
}
int offset_in_file = 0;
long start = System.currentTimeMillis();
// READING FROM THE SAME IP FILE 20 TIMES TO EMULATE A SINGLE BIGGER FILE OF SIZE 20*IP FILE
for(int h=0;h<20;h++){
BufferedReader in = new BufferedReader(new FileReader("outTest.txt"));
while((s = in.readLine() )!= null)
{
tes = (s.split(";"))[0];
int n = calcHash(tes); // FINDING FIRST HASH VALUE
int m = calcHash2(tes); // SECOND HASH
index_up(n,m,offset_in_file); // METHOD TO WRITE TO THE APPROPRIATE FILE I.E NM.TXT
offset_in_file++;
}
in.close();
}
System.out.println(offset_in_file);
long end = System.currentTimeMillis();
System.out.println((end-start));
}
static int calcHash(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=0;
for(i=0;i<charr.length;i++)
{
if(i%2==0)tot+= (int)charr[i];
}
tot = tot % storage.siz;
return tot;
}
static int calcHash2(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=1;
for(i=0;i<charr.length;i++)
{
if(i%2==1)tot+= (int)charr[i];
}
tot = tot % storage.siz;
if (tot<0)
tot=tot*-1;
return tot;
}
static void index_up(int a,int b,int off) throws Exception
{
virtual_buffer[a][b]+=Integer.toString(off)+"'"; // THIS BUFFER STORES THE DATA TO BE WRITTEN
if(virtual_buffer[a][b].length()>2000) // TO A FILE BEFORE WRITING TO IT, TO REDUCE NO. OF WRITES
{
String file = "c:\\adsproj\\"+a+b+".txt";
new writethreader(file,virtual_buffer[a][b]); // DOING THE ACTUAL WRITE PART IN A THREAD.
virtual_buffer[a][b]="";
}
}
}
class writethreader implements Runnable
{
Thread t;
String name, data;
writethreader(String name, String data)
{
this.name = name;
this.data = data;
t = new Thread(this);
t.start();
}
public void run()
{
try{
File f = new File(name);
if(!f.exists())f.createNewFile();
FileWriter fstream = new FileWriter(name,true); //APPEND MODE
fstream.write(data);
fstream.flush(); fstream.close();
}
catch(Exception e){}
}
}
Consider using VisualVM to pinpoint the bottlenecks. Everything else below is based on guesswork - and performance guesswork is often really, really wrong.
I think you have two issues with your write strategy.
The first is that you're starting a new thread on each write; the second is that you're re-opening the file on each write.
The thread problem is especially bad, I think, because I don't see anything preventing one thread writing on a file from overlapping with another. What happens then? Frankly, I don't know - but I doubt it's good.
Consider, instead, creating an array of open files for all 100. Your OS may have a problem with this - but I think probably not. Then create a queue of work for each file. Create a set of worker threads (100 is too many - think 10 or so) where each "owns" a set of files that it loops through, outputting and emptying the queue for each file. Pay attention to the interthread interaction between queue reader and writer - use an appropriate queue class.
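A rough sketch of that layout (bucket count, worker count, queue type and file naming are assumptions; flushing and shutdown handling are omitted):

import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One open writer and one queue per bucket file; a small pool of threads
// drains the queues and appends to the files each thread owns.
class BucketWriters {
    static final int BUCKETS = 100;
    static final int WORKERS = 10;
    final FileWriter[] writers = new FileWriter[BUCKETS];
    @SuppressWarnings("unchecked")
    final BlockingQueue<String>[] queues = new BlockingQueue[BUCKETS];

    BucketWriters(String dir) throws IOException {
        for (int i = 0; i < BUCKETS; i++) {
            writers[i] = new FileWriter(dir + "/" + i + ".txt", true); // append mode
            queues[i] = new ArrayBlockingQueue<String>(1024);
        }
        for (int w = 0; w < WORKERS; w++) {
            final int from = w * (BUCKETS / WORKERS);
            final int to = from + BUCKETS / WORKERS;
            new Thread(() -> drain(from, to)).start();
        }
    }

    // Called by the hashing thread instead of opening a file per write.
    void submit(int bucket, String offsets) throws InterruptedException {
        queues[bucket].put(offsets); // blocks if that bucket's queue is full
    }

    private void drain(int from, int to) {
        try {
            while (true) {
                boolean idle = true;
                for (int i = from; i < to; i++) {
                    String batch = queues[i].poll();
                    if (batch != null) {
                        writers[i].write(batch);
                        idle = false;
                    }
                }
                if (idle) Thread.sleep(1); // nothing queued right now; back off briefly
            }
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}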
I would throw away the entire requirement and use a database.