Parse and extract unique values from a text file efficiently - Java

I have two TSV files to parse and extract values from. Each line has 4-5 attributes. The content of both files is as below:
1 44539 C T 19.44
1 44994 A G 4.62
1 45112 TATGG 0.92
2 43635 Z Q 0.87
3 5672 AAS 0.67
Some records in each file have the same first 3 or 4 attributes but a different value. I want to retain the record with the higher value and prepare a new file with only unique records. For example:
1 44539 C T 19.44
1 44539 C T 25.44
In the above case I need to retain the record with the value 25.44.
I have drafted code for this, but after a few minutes the program slows down. I read each record from the file, form a key-value pair with the first 3 or 4 attributes as the key and the last attribute as the value, store it in a HashMap, and then use the map to write the output file. Is there a better solution?
Also, how can I test whether my code is producing the correct output in the file?
One file is 498 MB with 23,822,225 records and the other is 515 MB with 24,500,367 records.
I get Exception in thread "main" java.lang.OutOfMemoryError: Java heap space for the 515 MB file.
Is there a better way to code this so that it runs efficiently without increasing the heap size?
I may have to deal with larger files in the future; what is the trick for solving such problems?
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Scanner;

public class UniqueExtractor {

    private int counter = 0;
    private final HashMap<String, BigDecimal> map = new HashMap<String, BigDecimal>();
    private String fFilePath;

    public UniqueExtractor(String fileName) {
        fFilePath = fileName;
    }

    public static void main(String... aArgs) throws IOException {
        UniqueExtractor parser = new UniqueExtractor("/Users/xxx/Documents/xyz.txt");
        long startTime = System.currentTimeMillis();
        parser.processLineByLine();
        parser.writeToFile();
        long endTime = System.currentTimeMillis();
        long total_time = endTime - startTime;
        System.out.println("done in " + total_time / 1000 + " seconds");
    }

    public void writeToFile() {
        System.out.println("writing to a file");
        try {
            PrintWriter writer = new PrintWriter("/Users/xxx/Documents/xyz_unique.txt", "UTF-8");
            Iterator<Map.Entry<String, BigDecimal>> it = map.entrySet().iterator();
            StringBuilder sb = new StringBuilder();
            while (it.hasNext()) {
                sb.setLength(0);
                Map.Entry<String, BigDecimal> pair = it.next();
                sb.append(pair.getKey());
                sb.append(pair.getValue());
                writer.println(sb.toString());
                writer.flush();
                it.remove();
            }
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public final void processLineByLine() throws IOException {
        try (Scanner scanner = new Scanner(new File(fFilePath))) {
            while (scanner.hasNextLine()) {
                System.out.println(++counter);
                processLine(scanner.nextLine());
            }
        }
    }

    protected void processLine(String aLine) {
        StringBuilder sb = new StringBuilder();
        String[] split = aLine.split(" ");
        BigDecimal bd = null;
        BigDecimal bd1 = null;
        // rebuild the key from all attributes except the last one
        for (int i = 0; i < split.length - 1; i++) {
            sb.append(split[i]);
            sb.append(" ");
        }
        bd = new BigDecimal(split[split.length - 1]);
        if (map.containsKey(sb.toString())) {
            bd1 = map.get(sb.toString());
            int res = bd1.compareTo(bd);
            if (res == -1) {
                System.out.println("replacing ...." + sb.toString() + bd1 + " with " + bd);
                map.put(sb.toString(), bd);
            }
        } else {
            map.put(sb.toString(), bd);
        }
        sb.setLength(0);
    }
}

There are a couple of main things you may want to consider to improve the performance of this program.
Avoid BigDecimal
While BigDecimal is very useful, it has a lot of overhead, both in speed and space requirements. According to your examples, you don't have very much precision to worry about, so I would recommend switching to plain floats or doubles. These would take a mere fraction of the space (so you could process larger files) and would probably be faster to work with.
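As a rough illustration (a sketch only, not your original code; the class and method names here are made up), the map update could use a primitive double instead of BigDecimal:
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: keep the highest value per key using a primitive double
// instead of BigDecimal. Double.parseDouble handles values like 19.44 fine.
class HighestValueMap {
    private final Map<String, Double> values = new HashMap<>();

    void keepHighest(String key, String lastColumn) {
        double value = Double.parseDouble(lastColumn); // far less overhead than new BigDecimal(...)
        Double current = values.get(key);
        if (current == null || current < value) {
            values.put(key, value);
        }
    }
}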
Avoid StringBuilder
This is not a general rule, but applies in this case: you appear to be parsing and then rebuilding aLine in processLine. This is very expensive, and probably unnecessary. You could, instead, use aLine.lastIndexOf('\t') and aLine.substring to cut up the String with much less overhead.
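For example (a sketch only, assuming the key and the value are separated by the line's last tab character; adjust the delimiter if your files are space-separated):
import java.util.Map;

// Sketch: cut the line into key and value with substring instead of
// splitting it and rebuilding the key with a StringBuilder.
static void processLine(String aLine, Map<String, Double> map) {
    int lastTab = aLine.lastIndexOf('\t');
    String key = aLine.substring(0, lastTab);                        // first 3-4 attributes, untouched
    double value = Double.parseDouble(aLine.substring(lastTab + 1)); // last attribute
    Double current = map.get(key);
    if (current == null || current < value) {
        map.put(key, value);
    }
}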
These two should significantly improve the performance of your code, but don't address the overall algorithm.
Dataset splitting
You're trying to handle enough data that you might want to consider not keeping all of it in memory at once.
For example, you could split up your data set into multiple files based on the first field, run your program on each of the files, and then rejoin the files into one. You can do this with more than one field if you need more splitting. This requires less memory usage because the splitting program does not have to keep more than a single line in memory at once, and the latter programs only need to keep a chunk of the original data in memory at once, not the entire thing.
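A rough sketch of the splitting step (file names are placeholders, and it assumes the first field ends at the first tab), after which each partial file can be deduplicated in memory on its own:
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Sketch: write each line to a partial file chosen by its first field,
// keeping only one line in memory at a time.
public class SplitByFirstField {
    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> writers = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.tsv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String firstField = line.substring(0, line.indexOf('\t'));
                PrintWriter out = writers.get(firstField);
                if (out == null) {
                    out = new PrintWriter(new BufferedWriter(new FileWriter("part-" + firstField + ".tsv")));
                    writers.put(firstField, out);
                }
                out.println(line);
            }
        } finally {
            for (PrintWriter out : writers.values()) {
                out.close();
            }
        }
    }
}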
You may want to try the specific optimizations outlined above, and then see if you need more efficiency, in which case try to do dataset splitting.

Related

Subtraction in Java value shortage

I have encountered a difficult problem and am looking for a suggestion on how to approach it. I have three fields in my dataset and I want to perform a subtraction. The problem is as follows.
Time(s)   a     x
1         0.1   0.2
2               0.4
3               0.6
4               0.7
5         0.2   0.9
I need to perform the subtraction (a - x). The way it works is that at time 1 s, a has the value 0.1, so the 1st iteration is (0.1 - 0.2), the 2nd iteration (0.1 - 0.4), the 3rd iteration (0.1 - 0.6), and the 4th iteration (0.1 - 0.7). In the 5th iteration, when a changes to 0.2, it will be (0.2 - 0.9).
This is my problem statement. I want to write this in Java. I don't need the Java code; I can write it myself. I need a suggestion on how to proceed with this approach. One thought is to create an array for each variable, but then I am stuck on the loop: how should it iterate? It is clear that array a stays the same until it gets its next value, which is available at time 5 s.
This will depend on how large your input file is:
If the dataset fits into memory, load it either as 2 separate arrays or as one array of Row objects with a and x as fields. After that it's a simple iteration, remembering the last row that contained a so it can be reused when a is missing.
If the dataset is large, it's better to read it using a BufferedReader and only remember the last encountered a and x. This greatly reduces memory consumption and would be the preferred approach.
If a changes every 4 rows, you can use time / 4 + 1 to index into a small array of a values.
If a does not change every 4 rows, then I suggest using a full array filled with the repeated values.
Now that I see you're not using a database and are just reading from a file, maybe try this:
Just keep the old value of a until a new value overwrites it.
This is memory efficient since it parses line by line.
public static List<Double> parseFile(String myFile) throws IOException {
    List<Double> results = new ArrayList<>();
    try (BufferedReader b = new BufferedReader(new FileReader(myFile))) {
        b.readLine(); // ** skip header?
        String line;
        Integer time = null;
        Double a = null;
        Double x = null;
        for (int lineNum = 0; (line = b.readLine()) != null; lineNum++) {
            // ** split the data on any-and-all-whitespace
            final String[] data = line.split("\\s+");
            if (data.length != 3)
                throw new RuntimeException("Invalid data format on line " + lineNum);
            try {
                time = Integer.valueOf(data[0]);
                if (!data[1].trim().isEmpty()) {
                    a = Double.valueOf(data[1]);
                }
                if (!data[2].trim().isEmpty()) {
                    x = Double.valueOf(data[2]);
                }
            } catch (Exception e) {
                throw new RuntimeException("Couldn't parse line " + lineNum, e);
            }
            if (a == null || x == null) {
                throw new RuntimeException("Values not initialized at line " + lineNum);
            }
            results.add(Double.valueOf(a.doubleValue() - x.doubleValue()));
        }
    }
    // ** finished parsing file, return results
    return results;
}

Parsing multiple large csv files and adding all the records to ArrayList

Currently I have about 12 CSV files, each with about 1.5 million records.
I'm using univocity-parsers as my CSV reader/parser library.
Using univocity-parsers, I read each file and add all the records to an ArrayList with the addAll() method. When all 12 files are parsed and added to the ArrayList, my code prints the size of the list at the end.
for (int i = 0; i < 12; i++) {
    myList.addAll(parser.parseAll(getReader("file-" + i + ".csv")));
}
It works fine at first, until I reach my 6th consecutive file; then it seems to take forever in my IntelliJ IDE output window, never printing out the ArrayList size even after an hour, whereas before the 6th file it was rather fast.
If it helps, I'm running on a MacBook Pro (mid 2014) with OS X Yosemite.
I'm the creator of this library. If you want to just count rows, use a RowProcessor. You don't even need to count the rows yourself, as the parser does that for you:
// Let's create our own RowProcessor to analyze the rows
static class RowCount extends AbstractRowProcessor {

    long rowCount = 0;

    @Override
    public void processEnded(ParsingContext context) {
        // this returns the number of the last valid record.
        rowCount = context.currentRecord();
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    // Creates an instance of our own custom RowProcessor, defined above.
    RowCount myRowCountProcessor = new RowCount();

    CsvParserSettings settings = new CsvParserSettings();

    // Here you can select the column indexes you are interested in reading.
    // The parser will return values for the columns you selected, in the order you defined.
    // By selecting no indexes here, no String objects will be created.
    settings.selectIndexes(/*nothing here*/);

    // When you select indexes, the columns are reordered so they come in the order you defined.
    // By disabling column reordering, you will get the original row, with nulls in the columns you didn't select.
    settings.setColumnReorderingEnabled(false);

    // We instruct the parser to send all rows parsed to your custom RowProcessor.
    settings.setRowProcessor(myRowCountProcessor);

    // Finally, we create a parser.
    CsvParser parser = new CsvParser(settings);

    // And parse! All rows are sent to your custom RowProcessor (RowCount).
    // I'm using a 150MB CSV file with 3.1 million rows.
    parser.parse(new File("c:/tmp/worldcitiespop.txt"));

    // Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Rows: " + myRowCountProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
}
Output
Rows: 3173959
Time taken: 1062 ms
Edit: I saw your comment regarding your need to use the actual data in the rows. In this case, process the rows in the rowProcessed() method of the RowProcessor class; that's the most efficient way to handle this.
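For example (just a sketch of the idea, not code from the original answer; what you do with each row is up to you):
// Sketch: a RowProcessor that handles each parsed row as it arrives,
// so the rows never need to be collected into a huge ArrayList.
static class RowHandler extends AbstractRowProcessor {

    long rowCount = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        rowCount++;
        // use 'row' here (aggregate, filter, write somewhere else)
        // instead of adding it to a list.
    }
}
It is wired into the parser the same way as the RowCount example above, via settings.setRowProcessor(...).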
Edit 2:
If you want to just count rows use getInputDimension from CsvRoutines:
CsvRoutines csvRoutines = new CsvRoutines();
InputDimension d = csvRoutines.getInputDimension(new File("/path/to/your.csv"));
System.out.println(d.rowCount());
System.out.println(d.columnCount());
In parseAll, they preallocate the list with 10,000 elements:
/**
 * Parses all records from the input and returns them in a list.
 *
 * @param reader the input to be parsed
 * @return the list of all records parsed from the input.
 */
public final List<String[]> parseAll(Reader reader) {
    List<String[]> out = new ArrayList<String[]>(10000);
    beginParsing(reader);
    String[] row;
    while ((row = parseNext()) != null) {
        out.add(row);
    }
    return out;
}
If you have millions of records (lines in the file, I guess), this is bad for performance and memory allocation, because the list repeatedly doubles its size and copies its contents whenever it needs more space.
You could try to implement your own parseAll method like this:
public List<String[]> parseAll(Reader reader, int numberOfLines) {
    List<String[]> out = new ArrayList<String[]>(numberOfLines);
    parser.beginParsing(reader);
    String[] row;
    while ((row = parser.parseNext()) != null) {
        out.add(row);
    }
    return out;
}
And check if it helps.
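A possible way to call it (a sketch; getReader comes from the question's own code, and 1,500,000 is just the per-file estimate mentioned there):
// Sketch: preallocate using the rough per-file record count from the question.
List<String[]> myList = new ArrayList<>(12 * 1_500_000);
for (int i = 0; i < 12; i++) {
    myList.addAll(parseAll(getReader("file-" + i + ".csv"), 1_500_000));
}
System.out.println(myList.size());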
The problem is that you are running out of memory. When this happens, the computer begins to crawl, since it starts swapping memory to disk and back.
Reading the whole contents into memory is definitely not the best strategy. And since you are only interested in calculating some statistics, you do not even need to use addAll() at all.
The objective in computer science is always to find an equilibrium between memory spent and execution speed. You can always trade memory for more speed or speed for memory savings.
So loading the whole files into memory is comfortable for you, but not a solution, not even in the future when computers include terabytes of memory.
public int getNumRecords(CsvParser parser, Reader reader, int start) {
    int toret = start;
    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        ++toret;
    }
    return toret;
}
As you can see, this function spends no memory (apart from each single row); you can use it inside a loop over your CSV files and finish with the total count of rows. The next step is to create a class for all your statistics, substituting that int start with your object.
class Statistics {

    public Statistics() {
        numRows = 0;
        numComedies = 0;
    }

    public void countRow() {
        ++numRows;
    }

    public void countComedies() {
        ++numComedies;
    }

    // more things...

    private int numRows;
    private int numComedies;
}
public int calculateStatistics(CsvParser parser, Reader reader, Statistics stats) {
    int toret = 0;
    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        stats.countRow();
        ++toret;
    }
    return toret;
}
Hope this helps.

what is the fastest way to get dimensions of a csv file in java

My usual procedure for getting the dimensions of a CSV file is as follows:
Get how many rows it has:
I use a while loop to read every line and count up on each successful read. The downside is that it takes time to read the whole file just to count how many rows it has.
Then get how many columns it has:
I use String[] temp = lineOfText.split(","); and then take the length of temp.
Is there any smarter method? Something like:
file1 = read.csv;
xDimention = file1.xDimention;
yDimention = file1.yDimention;
I guess it depends on how regular the structure is, and whether you need an exact answer or not.
I could imagine looking at the first few rows (or randomly skipping through the file), and then dividing the file size by average row size to determine a rough row count.
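A rough sketch of that estimate (the file name is a placeholder, and it assumes the first 100 lines are representative and the content is mostly single-byte characters):
import java.io.*;

// Sketch: estimate the row count by sampling the first 100 lines and
// dividing the total file size by the average sampled line length.
public class RowCountEstimate {
    public static void main(String[] args) throws IOException {
        File file = new File("data.csv");
        long sampledBytes = 0;
        int sampledLines = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while (sampledLines < 100 && (line = in.readLine()) != null) {
                sampledBytes += line.length() + 1; // +1 for the line terminator
                sampledLines++;
            }
        }
        if (sampledLines > 0) {
            long estimatedRows = file.length() * sampledLines / sampledBytes;
            System.out.println("Roughly " + estimatedRows + " rows");
        }
    }
}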
If you control how these files get written, you could potentially tag them or add a metadata file next to them containing row counts.
Strictly speaking, the way you're splitting the line doesn't cover all possible cases: "hello, world", 4, 5 should be read as having 3 columns, not 4.
Your approach also won't work with multi-line values (you'll get an invalid number of rows) or quoted values that happen to contain the delimiter (you'll get an invalid number of columns).
You should use a CSV parser such as the one provided by univocity-parsers.
Using the uniVocity CSV parser, the fastest way to determine the dimensions is with the following code. It parses a 150MB file and gives its dimensions in 1.2 seconds:
// Let's create our own RowProcessor to analyze the rows
static class CsvDimension extends AbstractRowProcessor {

    int lastColumn = -1;
    long rowCount = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        rowCount++;
        if (lastColumn < row.length) {
            lastColumn = row.length;
        }
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    // Creates an instance of our own custom RowProcessor, defined above.
    CsvDimension myDimensionProcessor = new CsvDimension();

    CsvParserSettings settings = new CsvParserSettings();

    // This tells the parser that no row should have more than 2,000,000 columns
    settings.setMaxColumns(2000000);

    // Here you can select the column indexes you are interested in reading.
    // The parser will return values for the columns you selected, in the order you defined.
    // By selecting no indexes here, no String objects will be created.
    settings.selectIndexes(/*nothing here*/);

    // When you select indexes, the columns are reordered so they come in the order you defined.
    // By disabling column reordering, you will get the original row, with nulls in the columns you didn't select.
    settings.setColumnReorderingEnabled(false);

    // We instruct the parser to send all rows parsed to your custom RowProcessor.
    settings.setRowProcessor(myDimensionProcessor);

    // Finally, we create a parser.
    CsvParser parser = new CsvParser(settings);

    // And parse! All rows are sent to your custom RowProcessor (CsvDimension).
    // I'm using a 150MB CSV file with 3.1 million rows.
    parser.parse(new FileReader(new File("c:/tmp/worldcitiespop.txt")));

    // Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Columns: " + myDimensionProcessor.lastColumn);
    System.out.println("Rows: " + myDimensionProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
}
The output will be:
Columns: 7
Rows: 3173959
Time taken: 1279 ms
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
IMO, what you are doing is an acceptable way to do it. But here are some ways you could make it faster (a sketch follows this list):
Rather than reading lines, which creates a new String object for each line, just use String.indexOf to find the bounds of your lines.
Rather than using line.split, again use indexOf to count the number of commas.
Multithreading
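A minimal sketch of the indexOf idea for counting columns on a single line (like the split approach, it ignores quoted values that contain commas):
// Sketch: count columns by counting commas with indexOf instead of split,
// avoiding the regex and the String[] allocation.
static int countColumns(String line) {
    int columns = 1;
    int pos = line.indexOf(',');
    while (pos != -1) {
        columns++;
        pos = line.indexOf(',', pos + 1);
    }
    return columns;
}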
I guess these are the options, and the right one will depend on how you use the data:
Store the dimensions of your CSV file when writing the file (in the first row or in an additional file).
Use a more efficient way to count lines, e.g. http://docs.oracle.com/javase/6/docs/api/java/io/LineNumberReader.html
Instead of creating arrays of fixed size (assuming that's what you need the line count for), use array lists; this may or may not be more efficient depending on the size of the file.
To find the number of rows you have to read the whole file; there is nothing you can do about that. However, your method of finding the number of columns is a bit inefficient: instead of split, just count how many times "," appears in the line. You might also handle the special case of fields wrapped in quotes, as mentioned by @Vlad.
The String.split method creates an array of strings as a result and splits using a regexp, which is not very efficient.
I found this short but interesting solution here:
https://stackoverflow.com/a/5342096/4082824
LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")));
lnr.skip(Long.MAX_VALUE);
System.out.println(lnr.getLineNumber() + 1); //Add 1 because line index starts at 0
lnr.close();
My solution simply and correctly processes CSV with multi-line cells or quoted values.
For example, we have this CSV file:
1,"""2""","""111,222""","""234;222""","""""","1
2
3"
2,"""2""","""111,222""","""234;222""","""""","2
3"
3,"""5""","""1112""","""10;2""","""""","1
2"
And my solution snippet is:
import java.io.*;

public class CsvDimension {

    public void parse(Reader reader) throws IOException {
        long cells = 0;
        int lines = 0;
        int c;
        boolean quoted = false;
        while ((c = reader.read()) != -1) {
            if (c == '"') {
                quoted = !quoted;
            }
            if (!quoted) {
                if (c == '\n') {
                    lines++;
                    cells++;
                }
                if (c == ',') {
                    cells++;
                }
            }
        }
        System.out.printf("lines : %d\n cells %d\n cols: %d\n", lines, cells, cells / lines);
        reader.close();
    }

    public static void main(String args[]) throws IOException {
        new CsvDimension().parse(new BufferedReader(new FileReader(new File("test.csv"))));
    }
}

How can parsing a 17MB text file into List cause OutOfMemory with 128MB heap?

In one part of my application, I am parsing a 17MB log file into a list structure, one LogEntry per line. There are approximately 100K lines/log entries, meaning approx. 170 bytes per line. What surprised me is that I run out of heap space, even when I specify 128MB (256MB seems sufficient). How can 10MB of text, turned into a list of objects, cause a tenfold increase in space?
I understand that String objects use at least twice the amount of space compared to ANSI text (Unicode, one char = 2 bytes), but this consumes at least four times that.
What I am looking for is an approximation of how much an ArrayList of n LogEntries will consume, or how my method might create extraneous objects that aggravate the situation (see the comment below on String.trim()).
This is the data part of my LogEntry class
public class LogEntry {
    private Long id;
    private String system, version, environment, hostName, userId, clientIP, wsdlName, methodName;
    private Date timestamp;
    private Long milliSeconds;
    private Map<String, String> otherProperties;
This is the part doing the reading
public List<LogEntry> readLogEntriesFromFile(File f) throws LogImporterException {
    CSVReader reader;
    final String ISO_8601_DATE_PATTERN = "yyyy-MM-dd HH:mm:ss,SSS";
    List<LogEntry> logEntries = new ArrayList<LogEntry>();
    String[] tmp;
    try {
        int lineNumber = 0;
        final char DELIM = ';';
        reader = new CSVReader(new InputStreamReader(new FileInputStream(f)), DELIM);
        while ((tmp = reader.readNext()) != null) {
            lineNumber++;
            if (tmp.length < LogEntry.getRequiredNumberOfAttributes()) {
                String tmpString = concat(tmp);
                if (tmpString.trim().isEmpty()) {
                    logger.debug("Empty string");
                } else {
                    logger.error(String.format(
                        "Invalid log format in %s:L%s. Not enough attributes (%d/%d). Was %s . Continuing ...",
                        f.getAbsolutePath(), lineNumber, tmp.length, LogEntry.getRequiredNumberOfAttributes(), tmpString)
                    );
                }
                continue;
            }
            List<String> values = new ArrayList<String>(Arrays.asList(tmp));
            String system, version, environment, hostName, userId, wsdlName, methodName;
            Date timestamp;
            Long milliSeconds;
            Map<String, String> otherProperties;
            system = values.remove(0);
            version = values.remove(0);
            environment = values.remove(0);
            hostName = values.remove(0);
            userId = values.remove(0);
            String clientIP = values.remove(0);
            wsdlName = cleanLogString(values.remove(0));
            methodName = cleanLogString(stripNormalPrefixes(values.remove(0)));
            timestamp = new SimpleDateFormat(ISO_8601_DATE_PATTERN).parse(values.remove(0));
            milliSeconds = Long.parseLong(values.remove(0));
            /* remaining properties are the key-value pairs */
            otherProperties = parseOtherProperties(values);
            logEntries.add(new LogEntry(system, version, environment, hostName, userId, clientIP,
                wsdlName, methodName, timestamp, milliSeconds, otherProperties));
        }
        reader.close();
    } catch (IOException e) {
        throw new LogImporterException("Error reading log file: " + e.getMessage());
    } catch (ParseException e) {
        throw new LogImporterException("Error parsing logfile: " + e.getMessage(), e);
    }
    return logEntries;
}
Utility function used for populating the map
private Map<String, String> parseOtherProperties(List<String> values) throws ParseException {
    HashMap<String, String> map = new HashMap<String, String>();
    String[] tmp;
    for (String s : values) {
        if (s.trim().isEmpty()) {
            continue;
        }
        tmp = s.split(":");
        if (tmp.length != 2) {
            throw new ParseException("Could not split string into key:value :\"" + s + "\"", s.length());
        }
        map.put(tmp[0], tmp[1]);
    }
    return map;
}
You also have a Map there, where you store other properties. Your code doesn't show how this Map is populated, but keep in mind that Maps can have a hefty memory overhead compared to the memory needed for the entries themselves.
The size of the array that backs the Map (at least 16 entries * 4 bytes) + one key/value pair per entry + the size of the data itself. Two map entries, each using 10 chars for the key and 10 chars for the value, would consume 16*4 + 2*2*4 + 2*10*2 + 2*10*2 + 2*2*8 = 64 + 16 + 40 + 40 + 32 = 192 bytes (1 char = 2 bytes, a String object consumes at least 8 bytes). That alone would almost double the space requirements for the entire log string.
Add to this that the LogEntry contains 12 objects, i.e. at least 96 bytes. Hence the log objects alone would need around 100 bytes, give or take, without the Map and without the actual string data. Plus all the pointers for the references (4 bytes each); I count at least 18 with the Map, meaning 72 bytes.
Adding the data (minus the object references and object "headers" mentioned in the last paragraph):
2 longs = 16 bytes, 1 date stored as a long = 8 bytes, the map = 192 bytes. In addition comes the string content, say 90 chars = 180 bytes. Perhaps a byte or two at each end of the list item when put in the list, so in total somewhere around 100 + 72 + 16 + 8 + 192 + 180 = 568, roughly 600 bytes per log line.
So around 600 bytes per log line, meaning 100K lines would consume around 60MB minimum. That places it at least in the same order of magnitude as the heap size that was set aside. In addition, tmpString.trim() in a loop might be creating copies of strings, and String.format() may also be creating copies. The rest of the application must also fit within this heap space, which might explain where the rest of the memory is going.
Don't forget that each String object consumes space (24 bytes?) for the actual object definition, plus the reference to the char array, the offset (for substring() usage), etc. So representing a line as n strings adds that additional storage requirement. Can you lazily evaluate these instead within your LogEntry class?
(Re. the String offset usage: prior to Java 7u6, String.substring() acts as a window onto an existing char array and consequently needs an offset. This has recently changed, and it may be worth determining whether a later JDK build is more memory efficient.)
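A sketch of the lazy-evaluation idea (not the original class; the field positions are made up for illustration): keep only the raw line and split it when a field is actually requested, so the per-line String objects are created on demand.
// Sketch: store the raw log line and parse fields only when asked for,
// trading repeated parsing work for a smaller resident footprint.
public class LazyLogEntry {

    private final String rawLine;

    public LazyLogEntry(String rawLine) {
        this.rawLine = rawLine;
    }

    public String getSystem() {
        return rawLine.split(";", -1)[0]; // assumed position of 'system'
    }

    public String getHostName() {
        return rawLine.split(";", -1)[3]; // assumed position of 'hostName'
    }
}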

How to speed up/optimize file write in my program

OK. I am supposed to write a program that takes a 20 GB file with 1,000,000,000 records as input and creates some kind of index for faster access. I have basically decided to split the 1 billion records into 10 buckets and 10 sub-buckets within those. I calculate two hash values for each record to locate its appropriate bucket. Then I create 10*10 files, one for each sub-bucket. As I hash each record from the input file, I decide which of the 100 files it goes to, then append the record's offset to that particular file.
I have tested this with a sample file of 10,000 records, repeating the process 10 times to effectively emulate a 100,000-record file. This takes me around 18 seconds, which means it's going to take forever to do the same for a 1-billion-record file.
Is there any way I can speed up / optimize my writing?
I am going through all this because I can't store all the records in main memory.
import java.io.*;

// PROGRAM DOES THE FOLLOWING
// 1. READS RECORDS FROM A FILE.
// 2. CALCULATES TWO SETS OF HASH VALUES N, M
// 3. APPENDING THE OFFSET OF THAT RECORD IN THE ORIGINAL FILE TO ANOTHER FILE "NM.TXT" i.e REPLACE THE VALUES OF N AND M.
// 4.

class storage
{
    public static int siz = 10;
    public static FileWriter[][] f;
}

class proxy
{
    static String[][] virtual_buffer;

    public static void main(String[] args) throws Exception
    {
        virtual_buffer = new String[storage.siz][storage.siz]; // TEMPORARY STRING BUFFER TO REDUCE WRITES
        String s, tes;
        for (int y = 0; y < storage.siz; y++)
        {
            for (int z = 0; z < storage.siz; z++)
            {
                virtual_buffer[y][z] = ""; // INITIALISING ALL ELEMENTS TO ZERO
            }
        }
        int offset_in_file = 0;
        long start = System.currentTimeMillis();
        // READING FROM THE SAME IP FILE 20 TIMES TO EMULATE A SINGLE BIGGER FILE OF SIZE 20*IP FILE
        for (int h = 0; h < 20; h++) {
            BufferedReader in = new BufferedReader(new FileReader("outTest.txt"));
            while ((s = in.readLine()) != null)
            {
                tes = (s.split(";"))[0];
                int n = calcHash(tes);  // FINDING FIRST HASH VALUE
                int m = calcHash2(tes); // SECOND HASH
                index_up(n, m, offset_in_file); // METHOD TO WRITE TO THE APPROPRIATE FILE I.E NM.TXT
                offset_in_file++;
            }
            in.close();
        }
        System.out.println(offset_in_file);
        long end = System.currentTimeMillis();
        System.out.println((end - start));
    }

    static int calcHash(String s) throws Exception
    {
        char[] charr = s.toCharArray();
        int i, tot = 0;
        for (i = 0; i < charr.length; i++)
        {
            if (i % 2 == 0) tot += (int) charr[i];
        }
        tot = tot % storage.siz;
        return tot;
    }

    static int calcHash2(String s) throws Exception
    {
        char[] charr = s.toCharArray();
        int i, tot = 1;
        for (i = 0; i < charr.length; i++)
        {
            if (i % 2 == 1) tot += (int) charr[i];
        }
        tot = tot % storage.siz;
        if (tot < 0)
            tot = tot * -1;
        return tot;
    }

    static void index_up(int a, int b, int off) throws Exception
    {
        virtual_buffer[a][b] += Integer.toString(off) + "'"; // THIS BUFFER STORES THE DATA TO BE WRITTEN
        if (virtual_buffer[a][b].length() > 2000)            // TO A FILE BEFORE WRITING TO IT, TO REDUCE NO. OF WRITES
        {
            String file = "c:\\adsproj\\" + a + b + ".txt";
            new writethreader(file, virtual_buffer[a][b]); // DOING THE ACTUAL WRITE PART IN A THREAD.
            virtual_buffer[a][b] = "";
        }
    }
}

class writethreader implements Runnable
{
    Thread t;
    String name, data;

    writethreader(String name, String data)
    {
        this.name = name;
        this.data = data;
        t = new Thread(this);
        t.start();
    }

    public void run()
    {
        try {
            File f = new File(name);
            if (!f.exists()) f.createNewFile();
            FileWriter fstream = new FileWriter(name, true); // APPEND MODE
            fstream.write(data);
            fstream.flush();
            fstream.close();
        }
        catch (Exception e) {}
    }
}
Consider using VisualVM to pinpoint the bottlenecks. Everything else below is based on guesswork - and performance guesswork is often really, really wrong.
I think you have two issues with your write strategy.
The first is that you're starting a new thread on each write; the second is that you're re-opening the file on each write.
The thread problem is especially bad, I think, because I don't see anything preventing one thread writing on a file from overlapping with another. What happens then? Frankly, I don't know - but I doubt it's good.
Consider, instead, creating an array of open files for all 100. Your OS may have a problem with this - but I think probably not. Then create a queue of work for each file. Create a set of worker threads (100 is too many - think 10 or so) where each "owns" a set of files that it loops through, outputting and emptying the queue for each file. Pay attention to the interthread interaction between queue reader and writer - use an appropriate queue class.
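A minimal sketch of that structure (not the poster's code; the bucket count, file names and thread count are assumptions), with one BlockingQueue per output file and a small pool of writer threads that keep their files open:
import java.io.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch: keep all output files open, queue work per file, and let a few
// long-lived worker threads drain the queues instead of spawning a thread per write.
public class BucketWriters {

    static final int BUCKETS = 100;
    static final int WORKERS = 10;
    static final String POISON = "\u0000STOP"; // marks the end of a queue

    public static void main(String[] args) throws Exception {
        List<BlockingQueue<String>> queues = new ArrayList<>();
        for (int i = 0; i < BUCKETS; i++) {
            queues.add(new LinkedBlockingQueue<String>());
        }

        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        for (int w = 0; w < WORKERS; w++) {
            final int worker = w;
            pool.submit(() -> {
                // each worker "owns" every WORKERS-th bucket and is the only
                // thread that ever writes to those files
                for (int i = worker; i < BUCKETS; i += WORKERS) {
                    try (Writer out = new BufferedWriter(new FileWriter(i + ".txt", true))) {
                        String item;
                        while (!(item = queues.get(i).take()).equals(POISON)) {
                            out.write(item);
                        }
                    }
                }
                return null;
            });
        }

        // producer side (illustrative): push each record's offset onto its bucket's queue
        queues.get(42).put("12345'");

        // no more work: unblock the workers, then shut the pool down
        for (BlockingQueue<String> q : queues) {
            q.put(POISON);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
In a real run the producer would be the hashing loop from the question; the point is only that the files stay open and each file is written by exactly one thread.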
I would throw away the entire requirement and use a database.
