Storing a very large set of numbers in Java

I am trying to store a set of numbers that range from 0 to ~60 billion, where the set starts out empty and gradually becomes denser until it contains every number in the range. The set does not have to be capable of removing numbers. Currently my approach is to represent the set as a very long boolean array and store that array in a text file. I have made a class for this, and have tested both RandomAccessFile and FileChannel with the range of the numbers restricted from 0 to 2 billion, but in both cases the class is much slower at adding and querying numbers than using a regular boolean array.
Here is the current state of my class:
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.*;

public class FileSet {
    private static final int BLOCK = 10_000_000;
    private final long U;
    private final String fname;
    private final FileChannel file;

    public FileSet(long u, String fn) throws IOException {
        U = u;
        fname = fn;
        BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(fname));
        long n = u / 8 + 1;
        for (long rep = 0; rep < n / BLOCK; rep++) out.write(new byte[BLOCK]);
        out.write(new byte[(int) (n % BLOCK)]);
        out.close();
        file = new RandomAccessFile(fn, "rw").getChannel();
    }

    public void add(long v) throws IOException {
        if (v < 0 || v >= U) throw new RuntimeException(v + " out of range [0," + U + ")");
        file.position(v / 8);
        ByteBuffer b = ByteBuffer.allocate(1);
        file.read(b);
        file.position(v / 8);
        file.write(ByteBuffer.wrap(new byte[] {(byte) (b.get(0) | (1 << (v % 8)))}));
    }

    public boolean has(long v) throws IOException {
        if (v < 0 || v >= U) return false;
        file.position(v / 8);
        ByteBuffer b = ByteBuffer.allocate(1);
        file.read(b);
        return ((b.get(0) >> (v % 8)) & 1) != 0;
    }

    public static void main(String[] args) throws IOException {
        long U = 2000_000_000;
        SplittableRandom rnd = new SplittableRandom(1);
        List<long[]> actions = new ArrayList<>();
        for (int i = 0; i < 1000000; i++) actions.add(new long[] {rnd.nextInt(2), rnd.nextLong(U)});

        StringBuilder ret = new StringBuilder();
        {
            System.out.println("boolean[]:");
            long st = System.currentTimeMillis();
            boolean[] b = new boolean[(int) U];
            System.out.println("init time=" + (System.currentTimeMillis() - st));
            st = System.currentTimeMillis();
            for (long[] act : actions)
                if (act[0] == 0) b[(int) act[1]] = true;
                else ret.append(b[(int) act[1]] ? "1" : "0");
            System.out.println("query time=" + (System.currentTimeMillis() - st));
        }

        StringBuilder ret2 = new StringBuilder();
        {
            System.out.println("FileSet:");
            long st = System.currentTimeMillis();
            FileSet fs = new FileSet(U, "FileSet/" + U + "div8.txt");
            System.out.println("init time=" + (System.currentTimeMillis() - st));
            st = System.currentTimeMillis();
            for (long[] act : actions) {
                if (act[0] == 0) fs.add(act[1]);
                else ret2.append(fs.has(act[1]) ? "1" : "0");
            }
            System.out.println("query time=" + (System.currentTimeMillis() - st));
            fs.file.close();
        }

        if (!ret.toString().equals(ret2.toString())) System.out.println("MISMATCH");
    }
}
and the output:
boolean[]:
init time=1248
query time=148
FileSet:
init time=269
query time=3014
Additionally, when increasing the range from 2 billion to 10 billion, there is a large jump in total running time for the queries, even though in theory the total running time should stay roughly constant. When I use the class by itself (since a boolean array no longer works for this big of a range), the query time goes from ~3 seconds to ~50 seconds. When I increase the range to 60 billion, the time increases to ~240 seconds.
My questions are: Is there a faster way of accessing and modifying very large files at arbitrary offsets? And is there an entirely different approach to storing large integer sets that is faster than my current approach?

Boolean arrays are a very inefficient way to store information, as each boolean takes up 8 bits. You should use a BitSet instead. But a BitSet also has the 2-billion limit, since it takes primitive int values as parameters (and Integer.MAX_VALUE limits the size of the internal long array).
A space efficient in-memory alternative that spans beyond 2 billion entries would be to create your own BitSet wrapper that splits the data into subsets and does the indexing for you:
import java.util.BitSet;

public class LongBitSet {
    // One BitSet per Integer.MAX_VALUE-wide slice of the index space.
    private final BitSet[] bitSets = new BitSet[64];
    public LongBitSet() {
        for (int i = 0; i < bitSets.length; i++) bitSets[i] = new BitSet();
    }
    public void set(long index) {
        // TODO: reject negative or out-of-range indices.
        bitSets[(int) (index / Integer.MAX_VALUE)]
                .set((int) (index % Integer.MAX_VALUE));
    }
}
But there are other alternatives too. If you have very dense data, run-length encoding would be a cheap way to increase memory capacity, though that would likely involve a B-tree structure to make access more efficient. These are just pointers; what the correct answer is depends solely on how you actually use the data structure.
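For instance, here is a minimal sketch of that run-length idea (my own illustration with illustrative names, not code from this answer): store the set as disjoint [start, end) ranges in a TreeMap, so dense stretches collapse into a single entry:
import java.util.*;

class RangeSet {
    // start of range -> end of range (exclusive); ranges stay disjoint
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    boolean has(long v) {
        Map.Entry<Long, Long> e = ranges.floorEntry(v);
        return e != null && v < e.getValue();
    }

    void add(long v) {
        if (has(v)) return;
        long start = v, end = v + 1;
        Map.Entry<Long, Long> prev = ranges.floorEntry(v);
        if (prev != null && prev.getValue() == v) start = prev.getKey(); // merge left neighbor
        Long nextEnd = ranges.remove(v + 1);                             // merge right neighbor
        if (nextEnd != null) end = nextEnd;
        ranges.put(start, end);
    }
}
Once the set becomes nearly full, the entire 60-billion range is just a handful of map entries.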

Turns out the simplest solution is to use a 64-bit JVM and increase Java heap space by running my Java program in the terminal with a flag like -Xmx10g. Then I can simply use an array of longs to implicitly store the entire set.
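For reference, a minimal sketch of that layout (my own illustration; the class name is hypothetical, and it assumes a heap at least as large as the flag above allows): one bit per number in a flat long[], so a 60-billion range needs about 937.5 million longs, roughly 7.5 GB.
public class LongArrayBitSet {
    private final long[] words;

    public LongArrayBitSet(long u) {
        // (60e9 + 63) / 64 is about 937,500,000, which still fits an int index
        words = new long[(int) ((u + 63) / 64)];
    }

    public void add(long v) {
        words[(int) (v >>> 6)] |= 1L << (v & 63);
    }

    public boolean has(long v) {
        return (words[(int) (v >>> 6)] & (1L << (v & 63))) != 0;
    }
}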

Related

Parsing multiple large csv files and adding all the records to ArrayList

Currently I have about 12 csv files, each having about 1.5 million records.
I'm using univocity-parsers as my csv reader/parser library.
Using univocity-parsers, I read each file and add all the records to an arraylist with the addAll() method. When all 12 files are parsed and added to the array list, my code prints the size of the arraylist at the end.
for (int i = 0; i < 12; i++) {
myList.addAll(parser.parseAll(getReader("file-" + i + ".csv")));
}
It works fine at first, until I reach my 6th consecutive file; then it seems to take forever in my IntelliJ IDE output window, never printing out the ArrayList size even after an hour, whereas before the 6th file it was rather fast.
If it helps I'm running on a macbook pro (mid 2014) OSX Yosemite.
I'm the creator of this library. If you want to just count rows, use a
RowProcessor. You don't even need to count the rows yourself as the parser does that for you:
// Let's create our own RowProcessor to analyze the rows
static class RowCount extends AbstractRowProcessor {
long rowCount = 0;
@Override
public void processEnded(ParsingContext context) {
// this returns the number of the last valid record.
rowCount = context.currentRecord();
}
}
public static void main(String... args) throws FileNotFoundException {
// let's measure the time roughly
long start = System.currentTimeMillis();
//Creates an instance of our own custom RowProcessor, defined above.
RowCount myRowCountProcessor = new RowCount();
CsvParserSettings settings = new CsvParserSettings();
//Here you can select the column indexes you are interested in reading.
//The parser will return values for the columns you selected, in the order you defined
//By selecting no indexes here, no String objects will be created
settings.selectIndexes(/*nothing here*/);
//When you select indexes, the columns are reordered so they come in the order you defined.
//By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
settings.setColumnReorderingEnabled(false);
//We instruct the parser to send all rows parsed to your custom RowProcessor.
settings.setRowProcessor(myRowCountProcessor);
//Finally, we create a parser
CsvParser parser = new CsvParser(settings);
//And parse! All rows are sent to your custom RowProcessor (CsvDimension)
//I'm using a 150MB CSV file with 3.1 million rows.
parser.parse(new File("c:/tmp/worldcitiespop.txt"));
//Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
System.out.println("Rows: " + myRowCountProcessor.rowCount);
System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
}
Output
Rows: 3173959
Time taken: 1062 ms
Edit: I saw your comment regarding your need to use the actual data in the rows. In this case, process the rows in the rowProcessed() method of the RowProcessor class; that's the most efficient way to handle this.
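A minimal sketch of that approach (the handler name is mine, not from the library):
static class MyRecordHandler extends AbstractRowProcessor {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // use the row's values here; nothing accumulates in memory
    }
}
Register it with settings.setRowProcessor(new MyRecordHandler()); just like the RowCount example above.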
Edit 2:
If you want to just count rows use getInputDimension from CsvRoutines:
CsvRoutines csvRoutines = new CsvRoutines();
InputDimension d = csvRoutines.getInputDimension(new File("/path/to/your.csv"));
System.out.println(d.rowCount());
System.out.println(d.columnCount());
In parseAll they use 10000 elements for preallocation.
/**
* Parses all records from the input and returns them in a list.
*
* @param reader the input to be parsed
* @return the list of all records parsed from the input.
*/
public final List<String[]> parseAll(Reader reader) {
List<String[]> out = new ArrayList<String[]>(10000);
beginParsing(reader);
String[] row;
while ((row = parseNext()) != null) {
out.add(row);
}
return out;
}
If you have millions of records (lines in the file, I guess), this is not good for performance or memory allocation, because the backing array will repeatedly double in size and be copied whenever it runs out of space.
You could try to implement your own parseAll method like this:
public List<String[]> parseAll(Reader reader, int numberOfLines) {
List<String[]> out = new ArrayList<String[]>(numberOfLines);
parser.beginParsing(reader);
String[] row;
while ((row = parser.parseNext()) != null) {
out.add(row);
}
return out;
}
And check if it helps.
The problem is that you are running out of memory. When this happens, the computer begins to crawl, since it starts swapping memory to disk and back.
Reading the whole contents into memory is definitely not the best strategy to follow. And since you are only interested in calculating some statistics, you do not even need to use addAll() at all.
The objective in computer science is always to strike a balance between memory spent and execution speed: you can trade memory for more speed, or speed for memory savings.
So, loading whole files into memory is comfortable for you, but not a solution, not even in the future when computers include terabytes of memory.
public int getNumRecords(CsvParser parser, Reader reader, int start) {
    int toret = start;
    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        ++toret;
    }
    return toret;
}
As you can see, there is no memory spent in this function (except each single row); you can use it inside a loop for your CSV files, and finish with the total count of rows. The next step is to create a class for all your statistics, substituting that int start with your object.
class Statistics {
    public Statistics() {
        numRows = 0;
        numComedies = 0;
    }
    public void countRow() {
        ++numRows;
    }
    public void countComedies() {
        ++numComedies;
    }
    // more things...
    private int numRows;
    private int numComedies;
}
public int calculateStatistics(CsvParser parser, Reader reader, Statistics stats) {
    int toret = 0;
    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        stats.countRow();
        ++toret;
    }
    return toret;
}
Hope this helps.

Storing and comparing a large quantity of Strings in Java

My application stores a large number (about 700,000) of strings in an ArrayList. The strings are loaded from a text file like this:
List<String> stringList = new ArrayList<String>(750_000);
//there's a try catch here but I omitted it for this example
Scanner fileIn = new Scanner(new FileInputStream(listPath), "UTF-8");
while (fileIn.hasNext()) {
String s = fileIn.nextLine().trim();
if (s.isEmpty()) continue;
if (s.startsWith("#")) continue; //ignore comments
stringList.add(s);
}
fileIn.close();
Later on, other strings are compared to this list, using this code:
String example = "Something";
if (stringList.contains(example))
doSomething();
This comparison will happen many hundreds (thousands?) of times.
This all works, but I want to know if there's anything I can do to make it better. I notice that the JVM increases in size from about 100MB to 600MB when it loads the 700K Strings. The strings are mainly about this size:
Blackened Recordings
Divergent Series: Insurgent
Google
Pixels Movie Money
X Ambassadors
Power Path Pro Advanced
CYRFZQ
Is there anything I can do to reduce the memory, or is that to be expected? Any suggestions in general?
An ArrayList is memory-efficient. Your issue is probably caused by java.util.Scanner. Scanner creates a lot of temporary objects during parsing (Patterns, Matchers, etc.) and is not suitable for big files.
Try to replace it with java.io.BufferedReader:
List<String> stringList = new ArrayList<String>();
BufferedReader fileIn = new BufferedReader(
        new InputStreamReader(new FileInputStream(listPath), "UTF-8"));
String line = null;
while ((line = fileIn.readLine()) != null) {
    line = line.trim();
    if (line.isEmpty()) continue;
    if (line.startsWith("#")) continue; //ignore comments
    stringList.add(line);
}
fileIn.close();
See java.util.Scanner source code
To pinpoint the memory issue, attach any memory profiler to your JVM, for example VisualVM from the JDK tools.
Added:
Let's make a few assumptions:
you have 700,000 strings with 20 characters each;
an object reference is 32 bits, an object header 24 bits, an array header 16 bits, a char 16 bits, an int 32 bits.
Then every string will consume 24+32*2+32+(16+20*16) = 456 bits.
The whole ArrayList with the string objects will consume about 700000*(32*2+456) = 364,000,000 bits = 43.4 MB (very roughly).
Not quite an answer, but:
Your scenario uses around 70mb on my machine:
long usedMemory = -(Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
{//
String[] strings = new String[700_000];
for (int i = 0; i < strings.length; i++) {
strings[i] = new String(new char[20]);
}
}//
usedMemory += Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.println(usedMemory / 1_000_000d + " mb");
How did you reach 500 MB there? As far as I know, a String internally has a char[], and each char takes 16 bits. Even taking the Object and String overhead into account, 500 MB is still quite a lot for the strings alone. You may perform some benchmarking tests on your machine.
As others have already mentioned, you should change the data structure for element look-ups/comparisons.
You're likely going to be better off using a HashSet instead of an ArrayList as both add and contains are constant time operations in a HashSet.
However, it does assume that your object's hashCode implementation (which is part of Object, but can be overridden) is evenly distributed.
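A hedged sketch of that swap, reusing the names from the question (the loading loop stays the same; only the collection changes):
Set<String> stringSet = new HashSet<>(1_000_000); // pre-sized to limit rehashing
// ... fill stringSet exactly as stringList was filled, then:
if (stringSet.contains(example)) // average O(1), vs O(n) for ArrayList.contains
    doSomething();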
There is a Trie data structure which can be used as a dictionary; with so many strings, some may occur multiple times, and shared prefixes are stored only once. https://en.wikipedia.org/wiki/Trie . It seems to fit your case.
UPDATE:
An alternative can be a HashSet, or a HashMap of string -> something if you want, for example, occurrence counts for the strings. A hashed collection will be faster than a list for sure.
I would start with HashSet.
Using an ArrayList is a very bad idea for your use case, because it is not sorted, and hence you cannot efficiently search for an entry.
The best built-in type for your case is a TreeSet<String>. It guarantees O(log(n)) performance for add() and contains().
Be aware that TreeSet is not thread-safe in the basic implementation. Use a thread-safe wrapper (see the JavaDocs of TreeSet for this).
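For illustration, a minimal sketch of that setup (the wrapper is the one the JavaDocs point to):
SortedSet<String> stringSet = Collections.synchronizedSortedSet(new TreeSet<String>());
stringSet.add("Google");                      // O(log n)
boolean found = stringSet.contains("Google"); // O(log n)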
Here is a Java 8 approach. It uses the Files.lines() method, which takes advantage of the Stream API and reads all lines from a file lazily as a Stream.
As a consequence, no intermediate collection of String objects is built up before the terminal operation, which here is the static method MyExecutor.doSomething(String).
/**
* Process lines from a file.
* Uses Files.lines() method which take advantage of Stream API introduced in Java 8.
*/
private static void processStringsFromFile(final Path file) {
try (Stream<String> lines = Files.lines(file)) {
lines.map(s -> s.trim())
.filter(s -> !s.isEmpty())
.filter(s -> !s.startsWith("#"))
.filter(s -> s.contains("Something"))
.forEach(MyExecutor::doSomething);
} catch (IOException ex) {
logProcessStringsFailed(ex);
}
}
I conducted an analysis of memory usage in NetBeans; here are the memory results for an empty implementation of doSomething():
public static void doSomething(final String s) {
}
Live Bytes = 6702720 ≈ 6.4MB.

Get random boolean in Java

Okay, I applied this SO question to my code: Return True or False Randomly
But I have strange behavior: I need to run ten instances simultaneously, where every instance returns true or false just once per run. And surprisingly, no matter what I do, every time I get just false.
Is there something to improve in the method so I can have at least roughly a 50% chance of getting true?
To make it more understandable: I have my application built to a JAR file which is then run via batch command
java -jar my-program.jar
pause
Content of the program - to make it as simple as possible:
public class myProgram {
    public static boolean getRandomBoolean() {
        return Math.random() < 0.5;
        // I tried other approaches here, still the same result
    }

    public static void main(String[] args) {
        System.out.println(getRandomBoolean());
    }
}
If I open 10 command lines and run it, I get false as result every time...
I recommend using Random.nextBoolean()
That being said, Math.random() < 0.5 as you have used works too. Here's the behavior on my machine:
$ cat myProgram.java
public class myProgram{
public static boolean getRandomBoolean() {
return Math.random() < 0.5;
//I tried another approaches here, still the same result
}
public static void main(String[] args) {
System.out.println(getRandomBoolean());
}
}
$ javac myProgram.java
$ java myProgram ; java myProgram; java myProgram; java myProgram
true
false
false
true
Needless to say, there are no guarantees of getting different values each time. In your case, however, I suspect that
A) you're not working with the code you think you are (e.g. editing the wrong file),
B) you haven't compiled your different attempts when testing, or
C) you're working with some non-standard, broken implementation.
Have you tried looking at the Java Documentation?
Returns the next pseudorandom, uniformly distributed boolean value from this random number generator's sequence ... the values true and false are produced with (approximately) equal probability.
For example:
import java.util.Random;
Random random = new Random();
random.nextBoolean();
You could also try the nextBoolean() method.
Here is an example: http://www.tutorialspoint.com/java/util/random_nextboolean.htm
Java 8: Use the random generator isolated to the current thread: ThreadLocalRandom's nextBoolean().
Like the global Random generator used by the Math class, a ThreadLocalRandom is initialized with an internally generated seed that may not otherwise be modified. When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention.
java.util.concurrent.ThreadLocalRandom.current().nextBoolean();
Why not use the Random class, which has a method nextBoolean:
import java.util.Random;
/** Generate 10 random booleans. */
public final class MyProgram {
public static final void main(String... args){
Random randomGenerator = new Random();
for (int idx = 1; idx <= 10; ++idx){
boolean randomBool = randomGenerator.nextBoolean();
System.out.println("Generated : " + randomBool);
}
}
}
You can use the following for an unbiased result:
Random random = new Random();
//For 50% chance of true
boolean chance50oftrue = random.nextInt(2) == 0;
Note: random.nextInt(2) means that 2 is the exclusive bound; counting starts at 0, so there are two possible numbers (0 and 1) and hence the probability is 50%!
If you want to give more probability to your result being true (or false), you can adjust the above as follows!
Random random = new Random();
//For 50% chance of true
boolean chance50oftrue = random.nextInt(2) == 0;
//For 25% chance of true
boolean chance25oftrue = random.nextInt(4) == 0;
//For 40% chance of true
boolean chance40oftrue = random.nextInt(5) < 2;
The easiest way to initialize a random number generator is to use the parameterless constructor, for example
Random generator = new Random();
However, in using this constructor you should recognize that algorithmic random number generators are not truly random; they are really algorithms that generate a fixed but random-looking sequence of numbers.
You can make it appear more 'random' by giving the Random constructor a 'seed' parameter, which you can build dynamically, for example by using the system time in milliseconds (which will always be different).
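For example, a one-line illustration of that seeded constructor:
Random generator = new Random(System.currentTimeMillis());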
You could get your clock() value and check whether it is odd or even. I don't know if that gives a 50% chance of true.
And you can custom-create your own random function:
static double s = System.nanoTime(); // set once, when the main applet is instantiated
public static double randoom() {
    s = (double) (((555555555 * s + 444444) % 100000) / (double) 100000);
    return s;
}
The numbers 555555555 and 444444 are large constants chosen to give the function a wide range.
You can also make two random integers and verify whether they are the same; this gives you more control over the probabilities.
Random rand = new Random();
Declare a range to manage random probability.
In this example, there is a 50% chance of being true.
int range = 2;
Generate 2 random integers.
int a = rand.nextInt(range);
int b = rand.nextInt(range);
Then simply compare and return the value.
return a == b;
I also have a class you can use.
RandomRange.java
Words in a text are always a source of randomness. Given a certain word, nothing can be inferred about the next word. For each word, we can take the ASCII codes of its letters, add those codes to form a number. The parity of this number is a good candidate for a random boolean.
Possible drawbacks:
- This strategy is based upon using a text file as a source for the words. At some point, the end of the file will be reached. However, you can estimate how many times you are expected to call the randomBoolean() function from your app; if you need to call it about 1 million times, then a text file with 1 million words will be enough. As a correction, you can use a stream of data from a live source like an online newspaper.
- Using some statistical analysis of the common phrases and idioms in a language, one can estimate the next word in a phrase, given the first words of the phrase, with some degree of accuracy. But statistically such cases, where we can accurately predict the next word, are rare. So, in most cases, the next word is independent of the previous words.
package p01;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class Main {
String words[];
int currentIndex=0;
public static String readFileAsString()throws Exception
{
String data = "";
File file = new File("the_comedy_of_errors");
//System.out.println(file.exists());
data = new String(Files.readAllBytes(Paths.get(file.getName())));
return data;
}
public void init() throws Exception
{
String data = readFileAsString();
words = data.split("\\t| |,|\\.|'|\\r|\\n|:");
}
public String getNextWord() throws Exception
{
    if(currentIndex>words.length-1)
        throw new Exception("out of words; reached end of file");
    String currentWord = words[currentIndex];
    currentIndex++;
    // skip empty tokens, without running past the end of the array
    while(currentWord.isEmpty() && currentIndex<words.length)
    {
        currentWord = words[currentIndex];
        currentIndex++;
    }
    return currentWord;
}
public boolean getNextRandom() throws Exception
{
String nextWord = getNextWord();
int asciiSum = 0;
for (int i = 0; i < nextWord.length(); i++){
char c = nextWord.charAt(i);
asciiSum = asciiSum + (int) c;
}
System.out.println(nextWord+"-"+asciiSum);
return (asciiSum%2==1) ;
}
public static void main(String args[]) throws Exception
{
Main m = new Main();
m.init();
while(true)
{
System.out.println(m.getNextRandom());
Thread.sleep(100);
}
}
}
In Eclipse, in the root of my project, there is a file called 'the_comedy_of_errors' (no extension) - created with File> New > File , where I pasted some content from here: http://shakespeare.mit.edu/comedy_errors/comedy_errors.1.1.html
For a flexible boolean randomizer (translated into valid Java; the original was JavaScript-flavored):
public static boolean rbin(double bias) {
    /* The bias argument lets you put some weight on the TRUE side.
       The higher the bias number (0-100), the more likely the result
       is true; 50 gives an even split. */
    return Math.random() * 100 <= bias;
}
public static boolean rbin() {
    return rbin(50); // default bias, as in the original bias || 50
}
Make sure to use numbers 0 - 100 or you might lower the bias and get more common false values.
PS: I do not know anything about Java other than it has a few features in common with JavaScript. I used my JavaScript knowledge plus my inferring power to construct this code. Expect my answer to not be functional. Y'all can edit this answer to fix any issues I am not aware of.

How to speed up/optimize file write in my program

Ok. I am supposed to write a program that takes a 20 GB file as input with 1,000,000,000 records and creates some kind of index for faster access. I have basically decided to split the 1 billion records into 10 buckets and 10 sub-buckets within those. I am calculating two hash values for each record to locate its appropriate bucket. Now, I create 10*10 files, one for each sub-bucket. As I hash each record from the input file, I decide which of the 100 files it goes to, then append the record's offset to that particular file.
I have tested this with a sample file with 10,000 records, repeating the process 10 times to effectively emulate a 100,000-record file. For this it takes me around 18 seconds. This means it's going to take forever to do the same for a 1 billion record file.
Is there any way I can speed up/optimize my writes?
And I am going through all this because I can't store all the records in main memory.
import java.io.*;
// PROGRAM DOES THE FOLLOWING
// 1. READS RECORDS FROM A FILE.
// 2. CALCULATES TWO SETS OF HASH VALUES N, M
// 3. APPENDING THE OFFSET OF THAT RECORD IN THE ORIGINAL FILE TO ANOTHER FILE "NM.TXT" i.e REPLACE THE VALUES OF N AND M.
// 4.
class storage
{
public static int siz=10;
public static FileWriter[][] f;
}
class proxy
{
static String[][] virtual_buffer;
public static void main(String[] args) throws Exception
{
virtual_buffer = new String[storage.siz][storage.siz]; // TEMPORARY STRING BUFFER TO REDUCE WRITES
String s,tes;
for(int y=0;y<storage.siz;y++)
{
for(int z=0;z<storage.siz;z++)
{
virtual_buffer[y][z]=""; // INITIALISING ALL ELEMENTS TO ZERO
}
}
int offset_in_file = 0;
long start = System.currentTimeMillis();
// READING FROM THE SAME IP FILE 20 TIMES TO EMULATE A SINGLE BIGGER FILE OF SIZE 20*IP FILE
for(int h=0;h<20;h++){
BufferedReader in = new BufferedReader(new FileReader("outTest.txt"));
while((s = in.readLine() )!= null)
{
tes = (s.split(";"))[0];
int n = calcHash(tes); // FINDING FIRST HASH VALUE
int m = calcHash2(tes); // SECOND HASH
index_up(n,m,offset_in_file); // METHOD TO WRITE TO THE APPROPRIATE FILE I.E NM.TXT
offset_in_file++;
}
in.close();
}
System.out.println(offset_in_file);
long end = System.currentTimeMillis();
System.out.println((end-start));
}
static int calcHash(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=0;
for(i=0;i<charr.length;i++)
{
if(i%2==0)tot+= (int)charr[i];
}
tot = tot % storage.siz;
return tot;
}
static int calcHash2(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=1;
for(i=0;i<charr.length;i++)
{
if(i%2==1)tot+= (int)charr[i];
}
tot = tot % storage.siz;
if (tot<0)
tot=tot*-1;
return tot;
}
static void index_up(int a,int b,int off) throws Exception
{
virtual_buffer[a][b]+=Integer.toString(off)+"'"; // THIS BUFFER STORES THE DATA TO BE WRITTEN
if(virtual_buffer[a][b].length()>2000) // TO A FILE BEFORE WRITING TO IT, TO REDUCE NO. OF WRITES
{
String file = "c:\\adsproj\\"+a+b+".txt";
new writethreader(file,virtual_buffer[a][b]); // DOING THE ACTUAL WRITE PART IN A THREAD.
virtual_buffer[a][b]="";
}
}
}
class writethreader implements Runnable
{
Thread t;
String name, data;
writethreader(String name, String data)
{
this.name = name;
this.data = data;
t = new Thread(this);
t.start();
}
public void run()
{
try{
File f = new File(name);
if(!f.exists())f.createNewFile();
FileWriter fstream = new FileWriter(name,true); //APPEND MODE
fstream.write(data);
fstream.flush(); fstream.close();
}
catch(Exception e){}
}
}
Consider using VisualVM to pinpoint the bottlenecks. Everything else below is based on guesswork - and performance guesswork is often really, really wrong.
I think you have two issues with your write strategy.
The first is that you're starting a new thread on each write; the second is that you're re-opening the file on each write.
The thread problem is especially bad, I think, because I don't see anything preventing one thread writing on a file from overlapping with another. What happens then? Frankly, I don't know - but I doubt it's good.
Consider, instead, creating an array of open files for all 100. Your OS may have a problem with this - but I think probably not. Then create a queue of work for each file. Create a set of worker threads (100 is too many - think 10 or so) where each "owns" a set of files that it loops through, outputting and emptying the queue for each file. Pay attention to the interthread interaction between queue reader and writer - use an appropriate queue class.
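A minimal sketch of that shape (class and method names are mine, not from the post): each worker owns several bucket files, each opened exactly once; producers enqueue offset strings instead of spawning a thread per write.
import java.io.*;
import java.util.*;
import java.util.concurrent.*;

class BucketWorker implements Runnable {
    private final List<Writer> writers = new ArrayList<>();
    private final List<BlockingQueue<String>> queues = new ArrayList<>();

    void addBucket(String path) throws IOException {
        writers.add(new BufferedWriter(new FileWriter(path, true))); // open once, append mode
        queues.add(new LinkedBlockingQueue<String>());
    }

    BlockingQueue<String> queueFor(int localIndex) {
        return queues.get(localIndex); // producers put() offset strings here
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                for (int i = 0; i < queues.size(); i++) {
                    String data = queues.get(i).poll(); // drain whatever is queued
                    if (data != null) writers.get(i).write(data);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}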
I would throw away the entire requirement and use a database.

Huge performance difference between Vector and HashSet

I have a program which fetches records from database (using Hibernate) and fills them in a Vector. There was an issue regarding the performance of the operation and I did a test with the Vector replaced by a HashSet. With 300000 records, the speed gain is immense - 45 mins to 2 mins!
So my question is, what is causing this huge difference? Is it just the point that all methods in Vector are synchronized or the point that internally Vector uses an array whereas HashSet does not? Or something else?
The code is running in a single thread.
EDIT:
The code is only inserting the values in the Vector (and in the other case, HashSet).
If it's trying to use the Vector as a set, and checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.
If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.
If you can try to test it with different numbers of records, you could see how each version grows as n increases - that would make it easier to work out what's going on.
EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your different test runs. Here's a little test application:
import java.util.*;
public class Test {
public static void main(String[] args) {
long start = System.currentTimeMillis();
Vector<String> vector = new Vector<String>();
for (int i = 0; i < 300000; i++) {
vector.add("dummy value");
}
long end = System.currentTimeMillis();
System.out.println("Time taken: " + (end - start) + "ms");
}
}
Output:
Time taken: 38ms
Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).
Now, changing the code above to use
vector.add(0, "dummy value"); // Insert item at the beginning
makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.
If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to move them all along, on every insert. The HashSet, on the other hand, doesn't really care or have to do anything.
Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depending on how you use the list) and you will see the difference (synchronized vs. unsynchronized).
Why are you using Vector in a single threaded application at all?
Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.
I don't know if there are reads in your test, but reads are O(1) in both cases: get() by index for Vector, and contains() for HashSet.
Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.
However, I think there is a possible explanation of what might be going on.
First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashcode methods of your record class.
Next, I think you must be pushing very close to filling up the heap.
So the reason that the HashSet solution is so much faster is that most of the records are being discarded as duplicates by the set.add operation. By contrast, the Vector solution is keeping all of the records, and the JVM is spending most of its time trying to squeeze out that last 0.05% of memory by running the GC over and over.
One way to test this theory is to run the Vector version of the application with a much bigger heap.
Irrespective, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.
import java.util.*;
public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < 300000; i++) {
            String value = "dummy value " + i;
            if (!vector.contains(value)) { // O(n) scan before every insert
                vector.add(value);
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
If you check for a duplicate element before inserting it into the Vector, it will take more time, depending on the size of the Vector. The best way to get high performance is to use a HashSet, because a HashSet does not allow duplicates, so there is no need to check for duplicates before inserting.
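A hedged sketch of the HashSet equivalent (same workload as the test above, no pre-insert scan needed):
Set<String> set = new HashSet<>();
for (int i = 0; i < 300000; i++) {
    set.add("dummy value " + i); // add() simply returns false for a duplicate
}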
According to Dr Heinz Kabutz, he said this in one of his newsletters.
The old Vector class implements serialization in a naive way. They simply do the default serialization, which writes the entire Object[] as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.
import java.util.*;
import java.io.*;
public class VectorWritingSize {
public static void main(String[] args) throws IOException {
test(new LinkedList<String>());
test(new ArrayList<String>());
test(new Vector<String>());
}
public static void test(List<String> list) throws IOException {
insertJunk(list);
for (int i = 0; i < 10; i++) {
list.add("hello world");
}
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(baos);
out.writeObject(list);
out.close();
System.out.println(list.getClass().getSimpleName() +
" used " + baos.toByteArray().length + " bytes");
}
private static void insertJunk(List<String> list) {
for(int i = 0; i<1000 * 1000; i++) {
list.add("junk");
}
list.clear();
}
}
When we run this code, we get the following output:
LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes
Vector can use a staggering amount of bytes when being serialized. The lesson here? Don't ever use Vector as Lists in objects that are Serializable. The potential for disaster is too great.
