OK. I am supposed to write a program that takes a 20 GB file with 1,000,000,000 records as input and creates some kind of index for faster access. I have decided to split the 1 billion records into 10 buckets, each with 10 sub-buckets, and I calculate two hash values per record to locate its bucket. I create 10*10 files, one for each sub-bucket, and as I hash each record from the input file I decide which of the 100 files it goes to, then append the record's offset to that file.
I have tested this with a sample file of 10,000 records, repeating the process to emulate a larger file. That takes around 18 seconds, which means it would take forever to do the same for a 1 billion record file.
Is there any way I can speed up or optimize my writes?
I am going through all this because I can't store all the records in main memory.
import java.io.*;
// PROGRAM DOES THE FOLLOWING
// 1. READS RECORDS FROM A FILE.
// 2. CALCULATES TWO SETS OF HASH VALUES N, M
// 3. APPENDS THE OFFSET OF THAT RECORD IN THE ORIGINAL FILE TO THE FILE "NM.TXT", WHERE N AND M ARE REPLACED BY THE TWO HASH VALUES.
class storage
{
public static int siz=10;
public static FileWriter[][] f;
}
class proxy
{
static String[][] virtual_buffer;
public static void main(String[] args) throws Exception
{
virtual_buffer = new String[storage.siz][storage.siz]; // TEMPORARY STRING BUFFER TO REDUCE WRITES
String s,tes;
for(int y=0;y<storage.siz;y++)
{
for(int z=0;z<storage.siz;z++)
{
virtual_buffer[y][z]=""; // INITIALISING ALL ELEMENTS TO ZERO
}
}
int offset_in_file = 0;
long start = System.currentTimeMillis();
// READING FROM THE SAME IP FILE 20 TIMES TO EMULATE A SINGLE BIGGER FILE OF SIZE 20*IP FILE
for(int h=0;h<20;h++){
BufferedReader in = new BufferedReader(new FileReader("outTest.txt"));
while((s = in.readLine() )!= null)
{
tes = (s.split(";"))[0];
int n = calcHash(tes); // FINDING FIRST HASH VALUE
int m = calcHash2(tes); // SECOND HASH
index_up(n,m,offset_in_file); // METHOD TO WRITE TO THE APPROPRIATE FILE I.E NM.TXT
offset_in_file++;
}
in.close();
}
System.out.println(offset_in_file);
long end = System.currentTimeMillis();
System.out.println((end-start));
}
static int calcHash(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=0;
for(i=0;i<charr.length;i++)
{
if(i%2==0)tot+= (int)charr[i];
}
tot = tot % storage.siz;
return tot;
}
static int calcHash2(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=1;
for(i=0;i<charr.length;i++)
{
if(i%2==1)tot+= (int)charr[i];
}
tot = tot % storage.siz;
if (tot<0)
tot=tot*-1;
return tot;
}
static void index_up(int a,int b,int off) throws Exception
{
virtual_buffer[a][b]+=Integer.toString(off)+"'"; // THIS BUFFER STORES THE DATA TO BE WRITTEN
if(virtual_buffer[a][b].length()>2000) // TO A FILE BEFORE WRITING TO IT, TO REDUCE NO. OF WRITES
{
String file = "c:\\adsproj\\"+a+b+".txt";
new writethreader(file,virtual_buffer[a][b]); // DOING THE ACTUAL WRITE PART IN A THREAD.
virtual_buffer[a][b]="";
}
}
}
class writethreader implements Runnable
{
Thread t;
String name, data;
writethreader(String name, String data)
{
this.name = name;
this.data = data;
t = new Thread(this);
t.start();
}
public void run()
{
try{
File f = new File(name);
if(!f.exists())f.createNewFile();
FileWriter fstream = new FileWriter(name,true); //APPEND MODE
fstream.write(data);
fstream.flush(); fstream.close();
}
catch(Exception e){}
}
}
Consider using VisualVM to pinpoint the bottlenecks. Everything else below is based on guesswork - and performance guesswork is often really, really wrong.
I think you have two issues with your write strategy.
The first is that you're starting a new thread on each write; the second is that you're re-opening the file on each write.
The thread problem is especially bad, I think, because I don't see anything preventing one thread writing on a file from overlapping with another. What happens then? Frankly, I don't know - but I doubt it's good.
Consider, instead, creating an array of open files for all 100. Your OS may have a problem with this - but I think probably not. Then create a queue of work for each file. Create a set of worker threads (100 is too many - think 10 or so) where each "owns" a set of files that it loops through, outputting and emptying the queue for each file. Pay attention to the interthread interaction between queue reader and writer - use an appropriate queue class.
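A minimal sketch of that design for a single bucket file, assuming a blocking queue per file and a poison-pill value to signal shutdown (the class and file names here are illustrative, not the asker's):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One queue-plus-worker pair: producers enqueue offsets, the worker drains
// the queue and appends to a file that stays open for its whole lifetime,
// instead of reopening the file on every write.
public class QueuedBucketWriter implements Runnable {
    private static final String POISON = "\u0000STOP"; // shutdown sentinel
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Path file;

    public QueuedBucketWriter(Path file) { this.file = file; }

    public void put(String offset) throws InterruptedException { queue.put(offset); }

    public void finish() throws InterruptedException { queue.put(POISON); }

    @Override
    public void run() {
        // The file is opened once in append mode, matching the question's semantics.
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardOpenOption.APPEND)) {
            String item;
            while (!(item = queue.take()).equals(POISON)) {
                w.write(item);
                w.write('\''); // same record separator as the question's code
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("bucket", ".txt");
        QueuedBucketWriter writer = new QueuedBucketWriter(f);
        Thread t = new Thread(writer);
        t.start();
        for (int off = 0; off < 5; off++) writer.put(Integer.toString(off));
        writer.finish();
        t.join();
        System.out.println(Files.readString(f)); // prints 0'1'2'3'4'
    }
}
```

In the full design one worker would own several such queues and loop over them, so 100 files need only ~10 threads.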
I would throw away the entire requirement and use a database.
Related
Currently I have about 12 csv files, each having about 1.5 million records.
I'm using univocity-parsers as my csv reader/parser library.
Using univocity-parsers, I read each file and add all the records to an arraylist with the addAll() method. When all 12 files are parsed and added to the array list, my code prints the size of the arraylist at the end.
for (int i = 0; i < 12; i++) {
myList.addAll(parser.parseAll(getReader("file-" + i + ".csv")));
}
It works fine at first, until I reach my 6th consecutive file; then it seems to take forever in my IntelliJ IDE output window, never printing the arraylist size even after an hour, whereas before the 6th file it was rather fast.
If it helps I'm running on a macbook pro (mid 2014) OSX Yosemite.
I'm the creator of this library. If you want to just count rows, use a
RowProcessor. You don't even need to count the rows yourself as the parser does that for you:
// Let's create our own RowProcessor to analyze the rows
static class RowCount extends AbstractRowProcessor {
long rowCount = 0;
@Override
public void processEnded(ParsingContext context) {
// this returns the number of the last valid record.
rowCount = context.currentRecord();
}
}
public static void main(String... args) throws FileNotFoundException {
// let's measure the time roughly
long start = System.currentTimeMillis();
//Creates an instance of our own custom RowProcessor, defined above.
RowCount myRowCountProcessor = new RowCount();
CsvParserSettings settings = new CsvParserSettings();
//Here you can select the column indexes you are interested in reading.
//The parser will return values for the columns you selected, in the order you defined
//By selecting no indexes here, no String objects will be created
settings.selectIndexes(/*nothing here*/);
//When you select indexes, the columns are reordered so they come in the order you defined.
//By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
settings.setColumnReorderingEnabled(false);
//We instruct the parser to send all rows parsed to your custom RowProcessor.
settings.setRowProcessor(myRowCountProcessor);
//Finally, we create a parser
CsvParser parser = new CsvParser(settings);
//And parse! All rows are sent to your custom RowProcessor (CsvDimension)
//I'm using a 150MB CSV file with 3.1 million rows.
parser.parse(new File("c:/tmp/worldcitiespop.txt"));
//Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
System.out.println("Rows: " + myRowCountProcessor.rowCount);
System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
}
Output
Rows: 3173959
Time taken: 1062 ms
Edit: I saw your comment regarding your need to use the actual data in the rows. In this case, process the rows in the rowProcessed() method of the RowProcessor class, that's the most efficient way to handle this.
Edit 2:
If you want to just count rows use getInputDimension from CsvRoutines:
CsvRoutines csvRoutines = new CsvRoutines();
InputDimension d = csvRoutines.getInputDimension(new File("/path/to/your.csv"));
System.out.println(d.rowCount());
System.out.println(d.columnCount());
In parseAll, they preallocate the list with 10,000 elements.
/**
* Parses all records from the input and returns them in a list.
*
* @param reader the input to be parsed
* @return the list of all records parsed from the input.
*/
public final List<String[]> parseAll(Reader reader) {
List<String[]> out = new ArrayList<String[]>(10000);
beginParsing(reader);
String[] row;
while ((row = parseNext()) != null) {
out.add(row);
}
return out;
}
If you have millions of records (lines in the file, I guess), this is bad for performance and memory allocation, because the list will double its size and copy the contents every time it needs to allocate new space.
You could try to implement your own parseAll method like this:
public List<String[]> parseAll(Reader reader, int numberOfLines) {
List<String[]> out = new ArrayList<String[]>(numberOfLines);
parser.beginParsing(reader);
String[] row;
while ((row = parser.parseNext()) != null) {
out.add(row);
}
return out;
}
And check if it helps.
The problem is that you are running out of memory. When this happens, the computer begins to crawl, since it starts swapping memory to disk and vice versa.
Reading the whole contents into memory is definitely not the best strategy to follow. And since you are only interested in calculating some statistics, you do not even need to use addAll() at all.
The objective in computer science is always to strike an equilibrium between memory spent and execution speed: you can always trade memory for more speed, or speed for memory savings.
So loading the whole files into memory is comfortable for you, but not a solution - not even in the future, when computers will include terabytes of memory.
public int getNumRecords(CsvParser parser, Reader reader, int start) {
int toret = start;
parser.beginParsing(reader);
while (parser.parseNext() != null) {
++toret;
}
return toret;
}
As you can see, no memory is spent in this function (except for each single row); you can use it inside a loop over your CSV files and finish with the total count of rows. The next step is to create a class for all your statistics, substituting that int start with your object.
class Statistics {
public Statistics() {
numRows = 0;
numComedies = 0;
}
public void countRow() {
++numRows;
}
public void countComedies() {
++numComedies;
}
// more things...
private int numRows;
private int numComedies;
}
public void calculateStatistics(CsvParser parser, Reader reader, Statistics stats) {
parser.beginParsing(reader);
while (parser.parseNext() != null) {
stats.countRow();
}
}
Hope this helps.
I have two TSV files to parse, extracting values from each. Each line has 4-5 attributes. The contents of both files are as below:
1 44539 C T 19.44
1 44994 A G 4.62
1 45112 TATGG 0.92
2 43635 Z Q 0.87
3 5672 AAS 0.67
There are some records in each file that have the same first 3 or 4 attributes but a different value. I want to retain the record with the higher value and prepare a new file with all unique records. For example:
1 44539 C T 19.44
1 44539 C T 25.44
I need to retain the one with the higher value; in the above case, the record with value 25.44.
I have drafted code for this; however, after a few minutes the program slows down. I read each record from the file, form a key-value pair with the first 3 or 4 attributes as the key and the last attribute as the value, store it in a HashMap, and use the map to write to a file again. Is there a better solution?
Also, how can I test whether my code is giving correct output in the file?
One file is of size 498 MB with 23822225 records and other is of 515 MB with 24500367 records.
I get Exception in thread "main" java.lang.OutOfMemoryError: Java heap space for the 515 MB file.
Is there a better way I can code this so the program runs efficiently without increasing the heap size?
I might have to deal with larger files in the future; what would be the trick to solving such problems?
import java.io.*;
import java.math.BigDecimal;
import java.util.*;

public class UniqueExtractor {
private int counter = 0;
public static void main(String... aArgs) throws IOException {
UniqueExtractor parser = new UniqueExtractor("/Users/xxx/Documents/xyz.txt");
long startTime = System.currentTimeMillis();
parser.processLineByLine();
parser.writeToFile();
long endTime = System.currentTimeMillis();
long total_time = endTime - startTime;
System.out.println("done in " + total_time/1000 + "seconds ");
}
public void writeToFile()
{
System.out.println("writing to a file");
try {
PrintWriter writer = new PrintWriter("/Users/xxx/Documents/xyz_unique.txt", "UTF-8");
Iterator it = map.entrySet().iterator();
StringBuilder sb = new StringBuilder();
while (it.hasNext()) {
sb.setLength(0);
Map.Entry pair = (Map.Entry)it.next();
sb.append(pair.getKey());
sb.append(pair.getValue());
writer.println(sb.toString());
writer.flush();
it.remove();
}
}
catch(Exception e)
{
e.printStackTrace();
}
}
public UniqueExtractor(String fileName)
{
fFilePath = fileName;
}
private HashMap<String, BigDecimal> map = new HashMap<String, BigDecimal>();
public final void processLineByLine() throws IOException {
try (Scanner scanner = new Scanner(new File(fFilePath))) {
while (scanner.hasNextLine())
{
//System.out.println("ha");
System.out.println(++counter);
processLine(scanner.nextLine());
}
}
}
protected void processLine(String aLine)
{
StringBuilder sb = new StringBuilder();
String[] split = aLine.split(" ");
BigDecimal bd = null;
BigDecimal bd1= null;
for (int i=0; i < split.length-1; i++)
{
//System.out.println(i);
//System.out.println();
sb.append(split[i]);
sb.append(" ");
}
bd= new BigDecimal((split[split.length-1]));
//System.out.print("key is" + sb.toString());
//System.out.println("value is "+ bd);
if (map.containsKey(sb.toString()))
{
bd1 = map.get(sb.toString());
int res = bd1.compareTo(bd);
if (res == -1)
{
System.out.println("replacing ...."+ sb.toString() + bd1 + " with " + bd);
map.put(sb.toString(), bd);
}
}
else
{
map.put(sb.toString(), bd);
}
sb.setLength(0);
}
private String fFilePath;
}
There are a couple of main things you may want to consider to improve the performance of this program.
Avoid BigDecimal
While BigDecimal is very useful, it has a lot of overhead, both in speed and space requirements. According to your examples, you don't have very much precision to worry about, so I would recommend switching to plain floats or doubles. These would take a mere fraction of the space (so you could process larger files) and would probably be faster to work with.
Avoid StringBuilder
This is not a general rule, but applies in this case: you appear to be parsing and then rebuilding aLine in processLine. This is very expensive, and probably unnecessary. You could, instead, use aLine.lastIndexOf('\t') and aLine.substring to cut up the String with much less overhead.
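A sketch of the suggested lastIndexOf/substring approach (the separator character is an assumption; the question's code splits on a space, while a real TSV would use '\t'):

```java
// Split a line into key and value without split() or a StringBuilder:
// the key is everything before the last separator, the value is the
// final field. Assumes the value is always the last field on the line.
public class LineSplitter {
    public static String key(String line, char sep) {
        return line.substring(0, line.lastIndexOf(sep));
    }

    public static double value(String line, char sep) {
        return Double.parseDouble(line.substring(line.lastIndexOf(sep) + 1));
    }

    public static void main(String[] args) {
        String line = "1 44539 C T 19.44";
        System.out.println(key(line, ' '));   // prints 1 44539 C T
        System.out.println(value(line, ' ')); // prints 19.44
    }
}
```

Parsing the value as a double here also follows the "avoid BigDecimal" advice above.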
These two should significantly improve the performance of your code, but don't address the overall algorithm.
Dataset splitting
You're trying to handle enough data that you might want to consider not keeping all of it in memory at once.
For example, you could split up your data set into multiple files based on the first field, run your program on each of the files, and then rejoin the files into one. You can do this with more than one field if you need more splitting. This requires less memory usage because the splitting program does not have to keep more than a single line in memory at once, and the latter programs only need to keep a chunk of the original data in memory at once, not the entire thing.
You may want to try the specific optimizations outlined above, and then see if you need more efficiency, in which case try to do dataset splitting.
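A hedged sketch of the splitting step, assuming each line starts with a space-terminated first field and that the number of distinct first-field values is small enough to keep one open writer per partition (otherwise you would hash into a fixed number of partitions instead):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Splits the input into one file per distinct first field, keeping only a
// single input line in memory at a time. Each partition can then be
// de-duplicated independently with a much smaller map.
public class Partitioner {
    public static void split(Path input, Path outDir) throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                String first = line.substring(0, line.indexOf(' ')); // partition key
                BufferedWriter w = writers.get(first);
                if (w == null) {
                    w = Files.newBufferedWriter(outDir.resolve("part-" + first + ".txt"));
                    writers.put(first, w);
                }
                w.write(line);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers.values()) w.close();
        }
    }
}
```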
I implemented a wordcount program with Java. Basically, the program takes a large file (in my tests, I used a 10 gb data file that contained numbers only), and counts the number of times each 'word' appears - in this case, a number (23723 for example might appear 243 times in the file).
Below is my implementation. I seek to improve it, with mainly performance in mind, but a few other things as well, and I am looking for some guidance. Here are a few of the issues I wish to correct:
Currently, the program is threaded and works properly. However, I pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread proceeds to word-count it. The problem is that the main thread waits for ALL the threads to complete before passing more data to them. It isn't too much of a problem, but there is a period of time where a few threads wait and do nothing. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even a StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer and store the temporary variables as an int, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter is a space or several characters.
I am using a global ConcurrentHashMap to store key-value pairs. For example, if a thread finds the number "24624", it searches for that number in the map. If it exists, it increases the value of that key by one. The values at the end represent the number of occurrences of each key. Is this the proper design? Would I gain performance by giving each thread its own hashmap and then merging them all at the end?
Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? That class only reads into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
I am open to other possibilities as well, this is just what comes to mind.
Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server in which I should not be creating my own files, but if it would really be a performance boost, I might listen.
Other Note: I am new to java threading, as well as new to StackOverflow. Be gentle.
public class BigCount2 {
public static void main(String[] args) throws IOException, InterruptedException {
int num, counter;
long i, j;
String delimiterString = " ";
ArrayList<Character> delim = new ArrayList<Character>();
for (char c : delimiterString.toCharArray()) {
delim.add(c);
}
int counter2 = 0;
num = Integer.parseInt(args[0]);
int bytesToRead = 1024 * 1024 * 1024 / 2; //500 MB, size of loop
int remainder = bytesToRead % num;
int k = 0;
bytesToRead = bytesToRead - remainder;
int byr = bytesToRead / num;
String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
RandomAccessFile file = new RandomAccessFile(filepath, "r");
Thread[] t = new Thread [num];//array of threads
ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
byte[] byteArray = new byte[byr]; // buffer for one thread's slice of the 500MB window
char[] newbyte;
for (i = 0; i < file.length(); i += bytesToRead) {
counter = 0;
for (j = 0; j < bytesToRead; j += byr) {
file.seek(i + j);
file.read(byteArray, 0, byr);
newbyte = new String(byteArray).toCharArray();
t[counter] = new Thread(
new BigCountThread2(counter,
newbyte,
delim,
wordCountMap));//giving each thread t[i] different file fileReader[i]
t[counter].start();
counter++;
newbyte = null;
}
for (k = 0; k < num; k++){
t[k].join(); //main thread continues after ALL threads have finished.
}
counter2++;
System.gc();
}
file.close();
System.exit(0);
}
}
class BigCountThread2 implements Runnable {
private final ConcurrentMap<Integer, Integer> wordCountMap;
char [] newbyte;
private ArrayList<Character> delim;
private int threadId; //use for later
BigCountThread2(int tid,
char[] newbyte,
ArrayList<Character> delim,
ConcurrentMap<Integer, Integer> wordCountMap) {
this.delim = delim;
threadId = tid;
this.wordCountMap = wordCountMap;
this.newbyte = newbyte;
}
public void run() {
int intCheck = 0;
int counter = 0; int i = 0; Integer check; int j =0; int temp = 0; int intbuilder = 0;
for (i = 0; i < newbyte.length; i++) {
intCheck = Character.getNumericValue(newbyte[i]);
if (newbyte[i] == ' ' || intCheck == -1) { //once a delimiter is found, the current tempArray needs to be added to the MAP
check = wordCountMap.putIfAbsent(intbuilder, 1);
if (check != null) { //if returns null, then it is the first instance
wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
}
intbuilder = 0;
}
else {
intbuilder = (intbuilder * 10) + intCheck;
counter++;
}
}
}
}
Some thoughts on most of these points:
.. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.
However, one nice thing about a Thread Pool is it allows one to rather trivially adjust some basic parameters such as number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it could be especially handy if each thread returned a map as the result.
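A sketch of that executor-and-Futures shape, where each task counts a private chunk and returns its own map for the main thread to merge (the chunk type and pool size here are stand-ins for the question's 500MB slices):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each submitted task counts one chunk into a private map; Future.get()
// delivers the per-chunk maps back to the main thread, which merges them.
public class FutureCounts {
    static Map<Integer, Integer> countChunk(int[] chunk) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int n : chunk) counts.merge(n, 1, Integer::sum);
        return counts;
    }

    public static Map<Integer, Integer> countAll(List<int[]> chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // worker count is easy to tune
        try {
            List<Future<Map<Integer, Integer>>> futures = new ArrayList<>();
            for (int[] chunk : chunks) futures.add(pool.submit(() -> countChunk(chunk)));
            Map<Integer, Integer> total = new HashMap<>();
            for (Future<Map<Integer, Integer>> f : futures)
                f.get().forEach((k, v) -> total.merge(k, v, Integer::sum));
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```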
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..
This sounds like an implementation issue. While I would first try a StreamTokenizer (because it's already written), if doing it manually, I would check out the source - a good bit of that can be omitted when simplifying the notion of a "token". (It uses a temporary array to build the token.)
I am using a global ConcurrentHashMap to story key value pairs. .. So is this the proper design? Would I gain in performance by giving each thread it's own hashmap, and then merging them all at the end?
It would reduce locking and may increase performance to use a separate map per thread and merge strategy. Furthermore, the current implementation is broken as wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic and thus the operation might under count. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
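If the shared map is kept, the non-atomic get-then-put can be replaced with ConcurrentHashMap's atomic merge; a small sketch (the record method name is mine):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// merge() is atomic on ConcurrentHashMap, so concurrent increments of the
// same key can no longer under-count. This fixes correctness; it does not
// remove the locking cost of the shared map.
public class AtomicCount {
    static final ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<>();

    static void record(int intbuilder) {
        wordCountMap.merge(intbuilder, 1, Integer::sum); // atomic increment-or-insert
    }
}
```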
Is there any other way of seeking through a file with an offset without using the class RandomAccessMemory? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
Consider using a FileReader (and BufferedReader) per thread on the same file. This will avoid having to first copy the file into the array and slice it out for individual threads which, while the same amount of total reading, avoids having to soak up so much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range.
Also, the original code with the slicing is broken if an integer value is "cut" in half, since each of two threads would read half the word. One workaround is to have each thread skip the first word if it is a continuation from the previous block (i.e., scan one byte sooner) and then read past the end of its range as required to complete the last word.
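One way to sketch that boundary rule, assuming a single space as the delimiter (the helper name and exact convention are mine):

```java
// If a slice begins mid-word, skip forward past that word; the previous
// thread is responsible for reading past its own end to finish it.
public class ChunkBounds {
    // Returns the adjusted start index for a slice of `data` beginning at `start`.
    static int adjustedStart(char[] data, int start) {
        if (start == 0 || data[start - 1] == ' ') return start; // already at a word boundary
        while (start < data.length && data[start] != ' ') start++; // skip the cut word
        return start < data.length ? start + 1 : start; // step past the delimiter
    }
}
```

For "123 456 789", a slice starting at index 5 (inside "456") is moved to index 8, the start of "789"; a slice starting at index 4 (right after a space) is left alone.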
I need the advice from someone who knows Java very well and the memory issues.
I have a large file (something like 1.5GB) and I need to cut this file in many (100 small files for example) smaller files.
I know generally how to do it (using a BufferedReader), but I would like to know if you have any advice regarding the memory, or tips how to do it faster.
My file contains text, it is not binary and I have about 20 character per line.
To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign them to variables outside the loop). Just process the output immediately as soon as the input comes in.
It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest; at worst it will cost a few percent in performance. The same applies to using NIO: it will improve scalability, not memory use, and it only becomes interesting when you have hundreds of threads running on the same file.
Just loop through the file, write every line immediately to the other file as you read it in, count the lines, and when the count reaches 100, switch to the next file, and so on.
Kickoff example:
String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;
try {
reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
int count = 0;
for (String line; (line = reader.readLine()) != null;) {
if (count++ % maxlines == 0) {
close(writer);
writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
}
writer.write(line);
writer.newLine();
}
} finally {
close(writer);
close(reader);
}
// close() here is a small null-safe helper, e.g.:
// static void close(Closeable resource) {
//     if (resource != null) try { resource.close(); } catch (IOException ignore) {}
// }
First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).
Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io,
You can consider using memory-mapped files, via FileChannels .
Generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.
Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
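A minimal sketch of a memory-mapped read via FileChannel (note that a single MappedByteBuffer is limited to 2GB, so a file this large would be mapped in windows; the method here is just a toy accessor):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Maps the whole file read-only and reads a byte directly from the mapping:
// the OS pages the file in on demand instead of copying it through a stream.
public class MappedRead {
    public static byte firstByte(Path p) throws IOException {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(0);
        }
    }
}
```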
This is a very good article:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
In summary, for great performance, you should:
Avoid accessing the disk.
Avoid accessing the underlying operating system.
Avoid method calls.
Avoid processing bytes and characters individually.
For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
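For instance, the large-buffer advice can be as simple as wrapping the stream in a BufferedInputStream with a 1MB buffer, so per-byte reads in code still reach the disk as large sequential reads (the byte-counting method is just a stand-in workload):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// The 1MB buffer (1 << 20) means the underlying file is read in large
// sequential chunks even though the loop consumes one byte at a time.
public class BigBufferRead {
    public static long countBytes(Path p) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(p), 1 << 20)) {
            long n = 0;
            while (in.read() != -1) n++; // per-byte reads hit the buffer, not the disk
            return n;
        }
    }
}
```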
Does it have to be done in Java? I.e., does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command via your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
You can use java.nio which is faster than classical Input/Output stream:
http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
Yes.
I also think that using read() with arguments like read(char[], int offset, int length) is a better way to read such a large file (e.g., read(buffer, 0, buffer.length)).
I also experienced the problem of missing values when using a BufferedReader instead of a BufferedInputStream for a binary data input stream, so using a BufferedInputStream is much better in a case like this.
package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;
/**
* @author Naresh Bhabat
*
Following implementation helps to deal with extra large files in java.
This program is tested for dealing with 2GB input file.
There are some points where extra logic can be added in future.
Please note: if we want to deal with a binary input file, then instead of reading lines we need to read bytes from the file object.
It uses random access file,which is almost like streaming API.
* ****************************************
Notes regarding executor framework and its readings.
Please note :ExecutorService executor = Executors.newFixedThreadPool(10);
* for 10 threads:Total time required for reading and writing the text in
* :seconds 349.317
*
* For 100:Total time required for reading the text and writing : seconds 464.042
*
* For 1000 : Total time required for reading and writing text :466.538
* For 10000 Total time required for reading and writing in seconds 479.701
*
*
*/
public class DealWithHugeRecordsinFile extends TestCase {
static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
static volatile RandomAccessFile fileToWrite;
static volatile RandomAccessFile file;
static volatile String fileContentsIter;
static volatile int position = 0;
public static void main(String[] args) throws IOException, InterruptedException {
long currentTimeMillis = System.currentTimeMillis();
try {
fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles
file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles
seriouslyReadProcessAndWriteAsynch();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Thread currentThread = Thread.currentThread();
System.out.println(currentThread.getName());
long currentTimeMillis2 = System.currentTimeMillis();
double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
System.out.println("Total time required for reading the text in seconds " + time_seconds);
}
/**
* @throws IOException
* Something asynchronously serious
*/
public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
while (true) {
String readLine = file.readLine();
if (readLine == null) {
break;
}
Runnable genuineWorker = new Runnable() {
@Override
public void run() {
// do hard processing here in this thread,i have consumed
// some time and ignore some exception in write method.
writeToFile(FILEPATH_WRITE, readLine);
// System.out.println(" :" +
// Thread.currentThread().getName());
}
};
executor.execute(genuineWorker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
file.close();
fileToWrite.close();
}
/**
* @param filePath
* @param data
* @param position
*/
private static void writeToFile(String filePath, String data) {
try {
// fileToWrite.seek(position);
data = "\n" + data;
if (!data.contains("Randomization")) {
return;
}
System.out.println("Let us do something time consuming to make this thread busy"+(position++) + " :" + data);
System.out.println("Lets consume through this loop");
int i=1000;
while(i>0){
i--;
}
fileToWrite.write(data.getBytes());
throw new Exception();
} catch (Exception exception) {
System.out.println("exception was thrown but still we are able to proceeed further"
+ " \n This can be used for marking failure of the records");
//exception.printStackTrace();
}
}
}
Don't use read() without arguments - it's very slow.
It's better to read into a buffer and move it to the file quickly.
Use BufferedInputStream, because it supports binary reading.
And that's all.
Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to start with a file containing 100 lines and write it to 100 different files, one line in each, making the triggering mechanism work on the number of lines written to the current file. That program will scale easily to your situation.
I have a huge CSV file with over 700K lines. I have to parse the lines of that CSV file and perform operations on them. I thought of using threads. My first attempt is simple: each thread should process unique lines of the CSV file. I have limited the number of lines to read to 3,000. I create three threads, each of which should read lines of the CSV file. Following is the code:
import java.io.*;
class CSVOps implements Runnable
{
    static int lineCount = 1;
    static int limit = 3000;
    BufferedReader CSVBufferedReader;

    public CSVOps(){} // Default constructor

    public CSVOps(BufferedReader br){
        this.CSVBufferedReader = br;
    }

    private synchronized void readCSV(){
        System.out.println("Current thread "+Thread.currentThread().getName());
        String line;
        try {
            while((line = CSVBufferedReader.readLine()) != null){
                System.out.println(line);
                lineCount++;
                if(lineCount >= limit){
                    break;
                }
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void run() {
        readCSV();
    }
}
class CSVResourceHandler
{
    String CSVPath;

    public CSVResourceHandler(){ } // default constructor

    public CSVResourceHandler(String path){
        File f = new File(path);
        if(f.exists()){
            CSVPath = path;
        }
        else{
            System.out.println("Wrong file path! You gave: "+path);
        }
    }

    public BufferedReader getCSVFileHandler(){
        BufferedReader br = null;
        try{
            FileReader is = new FileReader(CSVPath);
            br = new BufferedReader(is);
        }
        catch(Exception e){
            e.printStackTrace(); // don't swallow the failure silently
        }
        return br;
    }
}
public class invalidRefererCheck
{
    public static void main(String [] args) throws InterruptedException
    {
        String pathToCSV = "/home/shantanu/DEV_DOCS/Contextual_Work/invalid_domain_kw_site_wise_click_rev2.csv";
        CSVResourceHandler csvResHandler = new CSVResourceHandler(pathToCSV);
        CSVOps ops = new CSVOps(csvResHandler.getCSVFileHandler());

        Thread t1 = new Thread(ops);
        t1.setName("T1");
        Thread t2 = new Thread(ops);
        t2.setName("T2");
        Thread t3 = new Thread(ops);
        t3.setName("T3");

        t1.start();
        t2.start();
        t3.start();
    }
}
The CSVResourceHandler class simply checks whether the passed file exists, then creates a BufferedReader and returns it. This reader is passed to the CSVOps class, which has a method, readCSV, that reads lines of the CSV file and prints them. There is a limit set at 3000.
Now, so that the threads don't mess up the count, I declare both the limit and count variables as static. When I run this program I get weird output: I get only about 1000 records, sometimes 1500, and they are in random order. At the end of the output I get two lines of the CSV file, and the current thread name comes out to be main!!
I am very much a novice with threads. I want reading this CSV file to become fast. What can be done?
Ok, first, do not use multiple threads to do parallel I/O from a single mechanical disk. It actually slows down performance, because the mechanical head needs to seek to the next reading location every time a different thread gets a chance to run. You are unnecessarily bouncing the disk's head around, which is a costly operation.
Use a single producer multiple consumer model to read lines using a single thread and process them using a pool of workers.
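A minimal sketch of that model, assuming a tiny demo file and a trivial "process" step (all file and variable names here are invented): one thread reads lines sequentially into a BlockingQueue, and a pool of workers drains it. A sentinel object marks end-of-input and is compared by reference, so each worker can hand it on to the next before exiting.

```java
import java.io.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleProducerMultiConsumer {
    public static void main(String[] args) throws Exception {
        // demo input: 10 CSV lines (stand-in for the real 700K-line file)
        File csv = new File("demo.csv");
        try (PrintWriter pw = new PrintWriter(csv)) {
            for (int i = 0; i < 10; i++) pw.println("row" + i + ";data");
        }

        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        String POISON = new String("EOF");        // end-of-input sentinel, compared by reference
        AtomicInteger processed = new AtomicInteger();

        ExecutorService workers = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 3; i++) {
            workers.execute(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON) {
                        processed.incrementAndGet(); // real parsing goes here
                    }
                    queue.put(POISON);               // hand the sentinel to the next worker
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // single producer: one thread reads the file sequentially
        try (BufferedReader br = new BufferedReader(new FileReader(csv))) {
            String line;
            while ((line = br.readLine()) != null) queue.put(line);
        }
        queue.put(POISON);
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(processed.get());
    }
}
```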
On to your problem:
Shouldn't you actually be waiting for the threads to finish before exiting main?
public class invalidRefererCheck
{
    public static void main(String [] args) throws InterruptedException
    {
        ...
        t1.start();
        t2.start();
        t3.start();

        t1.join();
        t2.join();
        t3.join();
    }
}
I suggest reading the file in big chunks. Allocate a big buffer object, read a chunk, and parse back from the end to find the last EOL char. Copy the last bit of the buffer into a temp string and shove a null into the buffer at EOL+1. Queue off the buffer reference and immediately create a new one; copy in the temp string first, then fill up the rest of the buffer. Repeat until EOF. Use a pool of threads to parse/process the buffers.
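The chunking loop described above could be sketched like this (class and file names invented; Strings stand in for the raw buffers, and the demo assumes single-byte text so a chunk boundary can't split a character):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class ChunkReader {
    // Reads a file in large blocks and emits chunks that end on a line boundary.
    // The trailing partial line of each block is carried over into the next chunk.
    public static List<String> readChunks(String path, int chunkSize) throws IOException {
        List<String> chunks = new ArrayList<>();
        String carry = "";
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            byte[] buf = new byte[chunkSize];
            int n;
            while ((n = in.read(buf)) != -1) {
                String block = carry + new String(buf, 0, n, StandardCharsets.UTF_8);
                int lastEol = block.lastIndexOf('\n');       // parse back for the last EOL
                if (lastEol == -1) { carry = block; continue; }
                chunks.add(block.substring(0, lastEol + 1)); // whole lines only
                carry = block.substring(lastEol + 1);        // temp string for the next chunk
            }
        }
        if (!carry.isEmpty()) chunks.add(carry);             // final partial line, if any
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        try (PrintWriter pw = new PrintWriter("chunkdemo.txt")) {
            for (int i = 0; i < 50; i++) pw.println("line-" + i);
        }
        // tiny 64-byte chunks so the carry-over path actually gets exercised
        List<String> chunks = readChunks("chunkdemo.txt", 64);
        int total = 0;
        for (String c : chunks) total += c.split("\n").length;
        System.out.println(total);
    }
}
```

In the real program each chunk, not each line, is what gets queued off to the worker pool.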
You have to queue up whole chunks of valid lines. Queueing off single lines will result in the thread communication overhead taking longer than the parsing.
Note that this, and similar, will probably result in the chunks being processed 'out-of-order' by the threads in the pool. If order must be preserved, (for example, the input file is sorted and the output is going into another file which must remain sorted), you can have the chunk-assembler thread insert a sequence-number in each chunk object. The pool threads can then pass processed buffers to yet another thread, (or task), that keeps a list of out-of-order chunks until all previous chunks have come in.
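If order must be preserved, the sequence-number reassembly could be sketched as below (class and method names invented; the demo drives it single-threaded, with `deliver` synchronized as the pool threads would require). Early chunks wait in a priority queue until every predecessor has arrived.

```java
import java.util.*;

public class Reassembler {
    // Chunks that arrived early, ordered by their sequence number.
    private final PriorityQueue<long[]> pending =
            new PriorityQueue<>(Comparator.comparingLong((long[] c) -> c[0]));
    private long nextSeq = 0;
    private final List<Long> emitted = new ArrayList<>();

    // Called by pool threads as they finish chunks, possibly out of order.
    public synchronized void deliver(long seq, long payload) {
        pending.add(new long[]{seq, payload});
        // emit every chunk whose predecessors have all arrived
        while (!pending.isEmpty() && pending.peek()[0] == nextSeq) {
            emitted.add(pending.poll()[1]);
            nextSeq++;
        }
    }

    public static void main(String[] args) {
        Reassembler r = new Reassembler();
        // chunks finish in scrambled order...
        r.deliver(2, 20); r.deliver(0, 0); r.deliver(1, 10); r.deliver(3, 30);
        // ...but are emitted in sequence
        System.out.println(r.emitted);
    }
}
```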
Multithreading does not have to be difficult, dangerous, or ineffective. If you use queues/pools/tasks, avoid synchronize/join, don't continually create/terminate/destroy threads, and only queue around large buffer objects that only one thread ever works on at a time, you should see a good speedup with next-to-no possibility of deadlocks, false sharing, etc.
The next step in such a speedup would be to pre-allocate a pool queue of buffers to eliminate continual creation/deletion of the buffers and associated GC and with a (L1 cache size) 'dead-zone' at the start of every buffer to eliminate cache-sharing completely.
That would go plenty quick on a multicore box, (esp. with an SSD!).
Oh, Java, right. I apologise for the 'CplusPlus-iness' of my answer with the null terminator. The rest of the points are OK, though. This should be a language-agnostic answer:)