generating int up to 100 million throws exception - java

I am trying to generate integers up to 100 million, then combine each one with a pre-defined integer/string.
Example: predefined = 1010, generated: gen = 5020315, combined = 10105020315
Then save each combined number to a .txt file, so the text file should have 100 million lines.
Here is the code I wrote:
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
public class exec {
    public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
        int initial = 6618;
        PrintWriter writer = new PrintWriter("variations.txt", "UTF-8");
        for (int a = 0; a < 100000000; a++) {
            int a2 = Integer.parseInt(Integer.toString(initial) + Integer.toString(a));
            writer.println(a2);
        }
        writer.close();
    }
}
But it throws the following error:
Exception in thread "main" java.lang.NumberFormatException: For input
string: "6618100000"
Why does this happen? Where is the problem?

You need a long, and you can use Long.parseLong().
The largest value for an int is 2^31 - 1, but for a long it is 2^63 - 1.

Combining 6618 with the value of a leads to a number too large to be held in an int variable (for example, 6618100000 is too large for an int). The largest value for an int is 2^31 - 1. You can use Long.parseLong() instead.

Whenever you are parsing, you need to make sure that the integer you want to create from the string is no larger than Integer.MAX_VALUE.
Integer.MAX_VALUE is equal to 2147483647, so any value bigger than this will cause an exception.
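For illustration, here is a minimal sketch of the loop from the question rewritten to use long, as the answers suggest (same file name and initial value as in the question):
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

public class exec {
    public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
        int initial = 6618;
        PrintWriter writer = new PrintWriter("variations.txt", "UTF-8");
        for (int a = 0; a < 100000000; a++) {
            // 6618100000 overflows an int but fits comfortably in a long
            long combined = Long.parseLong(Integer.toString(initial) + Integer.toString(a));
            writer.println(combined);
        }
        writer.close();
    }
}
Note that since the result is only written out as text, simply printing the concatenated string would also work without parsing at all.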

Related

Storing a very large set of numbers in Java

I am trying to store a set of numbers that range from 0 to ~60 billion, where the set starts out empty and gradually becomes denser until it contains every number in the range. The set does not have to be capable of removing numbers. Currently my approach is to represent the set as a very long boolean array and store that array in a text file. I have made a class for this, and have tested both RandomAccessFile and FileChannel with the range of the numbers restricted from 0 to 2 billion, but in both cases the class is much slower at adding and querying numbers than using a regular boolean array.
Here is the current state of my class:
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.*;
public class FileSet {
private static final int BLOCK=10_000_000;
private final long U;
private final String fname;
private final FileChannel file;
public FileSet(long u, String fn) throws IOException {
U=u;
fname=fn;
BufferedOutputStream out=new BufferedOutputStream(new FileOutputStream(fname));
long n=u/8+1;
for (long rep=0; rep<n/BLOCK; rep++) out.write(new byte[BLOCK]);
out.write(new byte[(int)(n%BLOCK)]);
out.close();
file=new RandomAccessFile(fn,"rw").getChannel();
}
public void add(long v) throws IOException {
if (v<0||v>=U) throw new RuntimeException(v+" out of range [0,"+U+")");
file.position(v/8);
ByteBuffer b=ByteBuffer.allocate(1); file.read(b);
file.position(v/8);
file.write(ByteBuffer.wrap(new byte[] {(byte)(b.get(0)|(1<<(v%8)))}));
}
public boolean has(long v) throws IOException {
if (v<0||v>=U) return false;
file.position(v/8);
ByteBuffer b=ByteBuffer.allocate(1); file.read(b);
return ((b.get(0)>>(v%8))&1)!=0;
}
public static void main(String[] args) throws IOException {
long U=2000_000_000;
SplittableRandom rnd=new SplittableRandom(1);
List<long[]> actions=new ArrayList<>();
for (int i=0; i<1000000; i++) actions.add(new long[] {rnd.nextInt(2),rnd.nextLong(U)});
StringBuilder ret=new StringBuilder(); {
System.out.println("boolean[]:");
long st=System.currentTimeMillis();
boolean[] b=new boolean[(int)U];
System.out.println("init time="+(System.currentTimeMillis()-st));
st=System.currentTimeMillis();
for (long[] act:actions)
if (act[0]==0) b[(int)act[1]]=true;
else ret.append(b[(int)act[1]]?"1":"0");
System.out.println("query time="+(System.currentTimeMillis()-st));
}
StringBuilder ret2=new StringBuilder(); {
System.out.println("FileSet:");
long st=System.currentTimeMillis();
FileSet fs=new FileSet(U,"FileSet/"+U+"div8.txt");
System.out.println("init time="+(System.currentTimeMillis()-st));
st=System.currentTimeMillis();
for (long[] act:actions) {
if (act[0]==0) fs.add(act[1]);
else ret2.append(fs.has(act[1])?"1":"0");
}
System.out.println("query time="+(System.currentTimeMillis()-st));
fs.file.close();
}
if (!ret.toString().equals(ret2.toString())) System.out.println("MISMATCH");
}
}
and the output:
boolean[]:
init time=1248
query time=148
FileSet:
init time=269
query time=3014
Additionally, when increasing the range from 2 billion to 10 billion, there is a large jump in total running time for the queries, even though in theory the total running time should stay roughly constant. When I use the class by itself (since a boolean array no longer works for this big of a range), the query time goes from ~3 seconds to ~50 seconds. When I increase the range to 60 billion, the time increases to ~240 seconds.
My questions are: is there a faster way of accessing and modifying very large files at arbitrary indices? and is there an entirely different approach to storing large integer sets that is faster than my current approach?
Boolean arrays are a very inefficient way to store this information, as each boolean takes up 8 bits (a full byte). You should use a BitSet instead. But BitSet also has the 2 billion limit, since it uses primitive int values as parameters (and Integer.MAX_VALUE limits the size of the internal long array).
A space-efficient in-memory alternative that spans beyond 2 billion entries would be to create your own BitSet wrapper that splits the data into subsets and does the indexing for you:
import java.util.BitSet;

public class LongBitSet {
    // TODO: add bounds checking; each sub-BitSet covers one block of Integer.MAX_VALUE bits
    private final BitSet[] bitSets = new BitSet[64];

    public LongBitSet() {
        for (int i = 0; i < bitSets.length; i++) bitSets[i] = new BitSet();
    }

    public void set(long index) {
        bitSets[(int) (index / Integer.MAX_VALUE)]
            .set((int) (index % Integer.MAX_VALUE));
    }
}
But there are other alternatives too. If you have very dense data, run-length encoding would be a cheap way to increase effective capacity, though it would likely involve a B-tree-like structure to keep access efficient. These are just pointers; what the correct answer is depends largely on how you actually use the data structure.
Turns out the simplest solution is to use a 64-bit JVM and increase Java heap space by running my Java program in the terminal with a flag like -Xmx10g. Then I can simply use an array of longs to implicitly store the entire set.
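For reference, a minimal sketch of that idea: a plain long[] used as a bit set. This is an illustrative class, not the poster's exact code; for a range of 60 billion it needs roughly range/8 bytes of heap (about 7.5 GB), which is why the larger -Xmx setting is needed.
public class LongArrayBitSet {
    private final long[] words;

    public LongArrayBitSet(long range) {
        // one bit per value, 64 bits per long word; ~range/8 bytes of heap
        words = new long[(int) ((range + 63) / 64)];
    }

    public void add(long v) {
        words[(int) (v >>> 6)] |= 1L << (v & 63);
    }

    public boolean has(long v) {
        return (words[(int) (v >>> 6)] & (1L << (v & 63))) != 0;
    }
}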

Generating millions of random string in Java

I would like to generate millions of passwords randomly, somewhere between 4 million and 50 million of them. The problem is the time it takes the processor to process it.
I would like to know if there is a solution to generate a lot of passwords in only a few seconds (at most 1 minute for 50 million).
I've done this for now, but it takes me more than 3 minutes (with a very good config, and I would like to run it on a small config).
private final static String policy = "azertyuiopqsdfghjklmwxcvbnAZERTYUIOPQSDFGHJKLMWXCVBN1234567890";
private static List<String> names = new ArrayList<String>();

public static void main(String[] args) {
    names.add("de");
    init();
}

private static String generator() {
    String password = "";
    int randomWithMathRandom = (int) ((Math.random() * ( - 6)) + 6);
    for (var i = 0; i < 8; i++) {
        randomWithMathRandom = (int) ((Math.random() * ( - 6)) + 6);
        password += policy.charAt(randomWithMathRandom);
    }
    return password;
}

public static void init() {
    for (int i = 0; i < 40000000; i++) {
        names.add(generator());
    }
}
By the way, I can't use a ready-made list. I think the most 'expensive' waste of time is the insertion into the list.
My current config :
ryzen 7 4800h
rtx 2600
SSD NVME
RAM 3200MHZ
UPDATE :
I tried with 20 million and it displays an error: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
Storing 50 million passwords as Strings in memory could cause problems since the heap may be exhausted. From this point of view, I think the best we can do is to generate a chunk of passwords, store them in a file, generate the next chunk, append it to the file, and so on until the desired number of passwords has been created. I hacked together a small program that generates random Strings of length 32. As the alphabet, I used all ASCII characters between '!' (ASCII value 33) and '~' (ASCII value 126).
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Random;
import java.util.concurrent.TimeUnit;
class Scratch {
private static final int MIN = '!';
private static final int MAX = '~';
private static final Random RANDOM = new Random();
public static void main(final String... args) throws IOException {
final Path passwordFile = Path.of("passwords.txt");
if (!Files.exists(passwordFile)) {
Files.createFile(passwordFile);
}
final DecimalFormat df = new DecimalFormat();
final DecimalFormatSymbols ds = df.getDecimalFormatSymbols();
ds.setGroupingSeparator('_');
df.setDecimalFormatSymbols(ds);
final int numberOfPasswordsToGenerate = 50_000_000;
final int chunkSize = 1_000_000;
final int passwordLength = 32;
int generated = 0;
int chunk = 0;
final long start = System.nanoTime();
while (generated < numberOfPasswordsToGenerate) {
final StringBuilder passwords = new StringBuilder();
for (
int index = chunk * chunkSize;
index < (chunk + 1) * chunkSize && index < numberOfPasswordsToGenerate;
++index) {
final StringBuilder password = new StringBuilder();
for (int character = 0; character < passwordLength; ++character) {
password.append(fetchRandomLetterFromAlphabet());
}
passwords.append(password.toString()).append(System.lineSeparator());
++generated;
if (generated % 500_000 == 0) {
System.out.printf(
"%s / %s%n",
df.format(generated),
df.format(numberOfPasswordsToGenerate));
}
}
++chunk;
Files.writeString(passwordFile, passwords.toString(), StandardOpenOption.APPEND);
}
final long consumed = System.nanoTime() - start;
System.out.printf("Done. Took %d seconds%n", TimeUnit.NANOSECONDS.toSeconds(consumed));
}
private static char fetchRandomLetterFromAlphabet() {
return (char) (RANDOM.nextInt(MAX - MIN + 1) + MIN);
}
}
On my laptop, the program yields good results. It completes in about 33 seconds and all passwords are stored in a single file.
This program is a proof of concept and not production-ready. For example, if a passwords.txt already exists, the content will be appended to it. For me, the file is already 1.7 GB after one run, so be aware of this. Furthermore, the generated passwords are temporarily stored in a StringBuilder, which may present a security risk since a StringBuilder cannot be cleared (i.e. its internal memory structure cannot be zeroed). Performance could be improved further by running the password generation multi-threaded, but I will leave this as an exercise for the reader.
To use the alphabet presented in the question, we can remove static fields MIN and MAX, define one new static field private static final char[] ALPHABET = "azertyuiopqsdfghjklmwxcvbnAZERTYUIOPQSDFGHJKLMWXCVBN1234567890".toCharArray(); and re-implement fetchRandomLetterFromAlphabet as:
private static char fetchRandomLetterFromAlphabet() {
return ALPHABET[RANDOM.nextInt(ALPHABET.length)];
}
We can use the following code-snippet to read-back the n-th (starting at 0) password from the file in constant time:
final int n = ...;
final RandomAccessFile raf = new RandomAccessFile(passwordFile.toString(), "r");
final long start = System.nanoTime();
final byte[] bytes = new byte[passwordLength];
// byte-length of the first n passwords, including line breaks:
final int offset = (passwordLength + System.lineSeparator().toCharArray().length) * n;
raf.seek(offset); // skip the first n passwords
raf.read(bytes);
// reset to the beginning of the file, in case we want to read more passwords later:
raf.seek(0);
System.out.println(new String(bytes));
I can give you some tips to optimize your code and make it faster; you can use them together with the others.
If you know the number of passwords you need, create a String array of that size up front and fill it inside your loop.
If you have to use a dynamically sized data structure, use a linked list.
A linked list is better than an ArrayList when adding elements is your main operation, and worse if you access elements more often than you add them.
Use a StringBuilder instead of the += operator on strings.
The += operator is very 'expensive' in time complexity because it always creates new strings. Using StringBuilder's append method can speed up your code.
Instead of using Math.random() and multiplying the result by your range, create a static Random object and use yourRandomInstance.nextInt(int range).
Consider using the ASCII table (or a char array of your alphabet) to get a random character instead of calling str.charAt(int index); it may speed up your code too, so it is worth checking.
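As a rough illustration (not the original poster's code; class and variable names are made up for the example), here is a minimal sketch that combines these tips: a pre-sized array, a single shared Random, a reusable char buffer instead of string concatenation, and direct indexing into a char[] alphabet. Note that holding 40 million Strings still needs a few GB of heap, so a larger -Xmx may be required.
import java.util.Random;

public class FastPasswordGenerator {
    private static final char[] POLICY =
            "azertyuiopqsdfghjklmwxcvbnAZERTYUIOPQSDFGHJKLMWXCVBN1234567890".toCharArray();
    private static final Random RANDOM = new Random();

    public static void main(String[] args) {
        int count = 40_000_000;
        String[] passwords = new String[count]; // pre-sized, no dynamic resizing
        char[] buf = new char[8];
        for (int i = 0; i < count; i++) {
            for (int j = 0; j < buf.length; j++) {
                buf[j] = POLICY[RANDOM.nextInt(POLICY.length)];
            }
            passwords[i] = new String(buf);
        }
        System.out.println(passwords[0]);
    }
}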

Is there a way to use the Scanner class inputFile.nextLine()); but have the selection be random? [duplicate]

Say there is a file too big to be put to memory. How can I get a random line from it? Thanks.
Update:
I want the probabilities of getting each line to be equal.
Reading the entire file if you want only one line seems a bit excessive. The following should be more efficient:
Use RandomAccessFile to seek to a random byte position in the file.
Seek left and right to the nearest line terminators. Let L be the line between them.
With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at step 1.
This is a variant of rejection sampling.
Line lengths include the line terminator character(s), hence MIN_LINE_LENGTH >= 1. (All the better if you know a tighter bound on line length).
It is worth noting that the runtime of this algorithm does not depend on file size, only on line length, i.e. it scales much better than reading the entire file.
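A minimal sketch of this rejection-sampling idea (an illustrative implementation, assuming a single-byte '\n' line terminator and MIN_LINE_LENGTH = 1; the names are made up for the example):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ThreadLocalRandom;

public class RandomLineSampler {
    private static final int MIN_LINE_LENGTH = 1; // tighten this if you know a better bound

    public static String randomLine(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long len = raf.length();
            ThreadLocalRandom rnd = ThreadLocalRandom.current();
            while (true) {
                long start = rnd.nextLong(len); // random byte position in the file
                // scan left to the previous '\n' (or the start of the file)
                while (start > 0) {
                    raf.seek(start - 1);
                    if (raf.read() == '\n') break;
                    start--;
                }
                raf.seek(start);
                String line = raf.readLine(); // the line containing the random position
                if (line == null) continue;
                long lineLen = line.length() + 1; // +1 for the terminator
                // accept with probability MIN_LINE_LENGTH / lineLen, otherwise retry
                if (rnd.nextDouble() < (double) MIN_LINE_LENGTH / lineLen) {
                    return line;
                }
            }
        }
    }
}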
Here's a solution. Take a look at the choose() method which does the real thing (the main() method repeatedly exercises choose(), to show that the distribution is indeed quite uniform).
The idea is simple: when you read the first line it has a 100% chance of being chosen as the result. When you read the 2nd line it has a 50% chance of replacing the first line as the result. When you read the 3rd line it has a 33% chance of becoming the result. The fourth line has a 25% chance, and so on....
import java.io.*;
import java.util.*;
public class B {
public static void main(String[] args) throws FileNotFoundException {
Map<String,Integer> map = new HashMap<String,Integer>();
for(int i = 0; i < 1000; ++i)
{
String s = choose(new File("g:/temp/a.txt"));
if(!map.containsKey(s))
map.put(s, 0);
map.put(s, map.get(s) + 1);
}
System.out.println(map);
}
public static String choose(File f) throws FileNotFoundException
{
String result = null;
Random rand = new Random();
int n = 0;
for(Scanner sc = new Scanner(f); sc.hasNext(); )
{
++n;
String line = sc.nextLine();
if(rand.nextInt(n) == 0)
result = line;
}
return result;
}
}
Either you
read the file twice - once to count the number of lines, the second time to extract a random line, or
use reservoir sampling
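For completeness, a minimal sketch of the first option (illustrative code: count the lines in one pass, then re-read up to a uniformly chosen index; assumes the file has at least one line):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

public class TwoPassRandomLine {
    public static String randomLine(String path) throws IOException {
        long count = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            while (r.readLine() != null) count++; // first pass: count the lines
        }
        long target = ThreadLocalRandom.current().nextLong(count); // uniform line index
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            String line = r.readLine();
            for (long i = 0; i < target; i++) line = r.readLine(); // second pass: skip to it
            return line;
        }
    }
}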
Looking over Itay's answer, it looks as though it reads the whole file a thousand times over, sampling one line on each pass, whereas true reservoir sampling should only go over the 'tape' once. I've devised some code that goes over the file once using real reservoir sampling, based on this and the various descriptions on the web.
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
public class reservoirSampling {
public static void main(String[] args) throws FileNotFoundException, IOException{
Sampler mySampler = new Sampler();
List<String> myList = mySampler.sampler(10);
for(int index = 0;index<myList.size();index++){
System.out.println(myList.get(index));
}
}
}
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Scanner;
public class Sampler {
public Sampler(){}
public List<String> sampler (int reservoirSize) throws FileNotFoundException, IOException
{
String currentLine=null;
//reservoirList is where our selected lines stored
List <String> reservoirList= new ArrayList<String>(reservoirSize);
// we will use this counter to count the current line number while iterating
int count=0;
Random ra = new Random();
int randomNumber = 0;
Scanner sc = new Scanner(new File("Open_source.html")).useDelimiter("\n");
while (sc.hasNext())
{
currentLine = sc.next();
count ++;
if (count<=reservoirSize)
{
reservoirList.add(currentLine);
}
else if ((randomNumber = (int) ra.nextInt(count))<reservoirSize)
{
reservoirList.set(randomNumber, currentLine);
}
}
return reservoirList;
}
}
The basic premise is that you fill up the reservoir first, and then each subsequent line replaces a randomly chosen slot with a probability that shrinks as more lines are read. I hope this provides more efficient code. Please let me know if this doesn't work for you, as I've literally knocked it up in half an hour.
Use RandomAccessFile:
Construct a RandomAccessFile, file
Get the length of that file, filelen, by calling file.length()
Generate a random number, pos, between 0 and filelen
Call file.seek(pos) to seek to the random position
Call file.readLine() to get to the end of the current line
Read the next line by calling file.readLine() again
Using this method, I've been sampling lines from the Brown Corpus at random, and can easily retrieve 1000 random samples from randomly chosen files in a few seconds. If I tried to do the same by reading through each file line by line it would take much longer.
The same principle can be used for selecting random elements from a list. Rather than reading through the list and stopping at a random place, if you generate a random number between 0 and the length of the list, then you can index directly into the list.
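For the in-memory case described above, that direct indexing is a one-liner; a tiny illustrative helper:
import java.util.List;
import java.util.Random;

public class RandomPick {
    // Pick a uniformly random element by indexing directly into the list.
    static <T> T pick(List<T> list, Random rand) {
        return list.get(rand.nextInt(list.size()));
    }
}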
Reading a random line from a file in java:
public String getRandomLineFromTheFile(String filePathWithFileName) throws Exception {
File file = new File(filePathWithFileName);
final RandomAccessFile f = new RandomAccessFile(file, "r");
final long randomLocation = (long) (Math.random() * f.length());
f.seek(randomLocation);
f.readLine();
String randomLine = f.readLine();
f.close();
return randomLine;
}
Use a BufferedReader and read line wise. Use the java.util.Random object to stop randomly ;)

counting elements in an array imported from a data file

I am writing a program that imports values from a txt file into an array. I then need to count how many of those elements are greater than or equal to 36. The data imports fine, and the total number of values it displays is correct, but I cannot get it to display the number of times a value of 36 or greater is found in the file. Thanks for any help!
public static void main(String[] args) throws Exception {
int[] enrollments = new int [100];
int count;
int FullClass;
double ClassPercentage;
return count (number of data items)
count = CreateArray(enrollments);
System.out.println (count );
FullClass = AddValues (enrollments);
System.out.println (FullClass)
ClassPercentage= FullClass/count;
System.out.print(ClassPercentage +"% of classes are full");
}//end main
/**
*
* @param classSizes
*/
public static int CreateArray(int[] classSizes) throws Exception{
int count = 0;
File enrollments = new File("enrollments.txt");
Scanner infile = new Scanner (enrollments);
while (infile.hasNextInt()){
classSizes[count] = infile.nextInt();
count++}//end while
return count; //number of items in an array
} // end CreateArray
/**************************************************************************/
/**
*
* @throws java.lang.Exception
*/
public static int AddValues (int[] enrollments) throws Exception{
{
int number = 0;
int countOf36s = 0;
while (infile.hasNextInt()) {
number = infile.next();
classSizes[count] = number;
if(number>=36) {
countOf36s++;
}
count++;
}
return countOf36s;
}// end AddValues
}//end main
Try this code to count the numbers that are greater than or equal to 36 while you are reading the file. Change the code in your CreateArray method, or put the logic below wherever you want to.
I tried executing this program and it works as expected. See the code below:
import java.util.*;
import java.io.*;
public class Test { //Name this to your actual class name
public static void main(String[] args) throws Exception {
int[] enrollments = new int [100]; //assuming not more than 100 numbers in the text file
int count; //count of all the numbers in text file
int FullClass; //count of numbers whose value is >=36
double ClassPercentage;
count = CreateArray(enrollments);
System.out.println (count);
FullClass = AddValues (enrollments);
System.out.println (FullClass);
ClassPercentage= FullClass/count;
System.out.print(ClassPercentage +"% of classes are full");
}
//Method to read all the numbers from the text file and store them in the array
public static int CreateArray(int[] classSizes) throws Exception {
int count = 0;
File enrollments = new File("enrollments.txt"); //path should be correct or else you get an exception.
Scanner infile = new Scanner (enrollments);
while (infile.hasNextInt()) {
classSizes[count] = infile.nextInt();
count++;
}
return count; //number of items in an array
}
//Method to read numbers from the array and store the count of numbers >=36
public static int AddValues (int[] enrollments) throws Exception{
int number = 0;
int countOf36s = 0;
for(int i=0; i<enrollments.length; i++) {
number = enrollments[i];
if(number>=36) {
countOf36s++;
}
}
return countOf36s;
}
}
Your code indicates that you might have misunderstood a couple of concepts and stylistic things. As you say in your comments you are new at this and would like some guidance as well as the answer to the question - here it is:
Style
Method names and variable names are by convention written starting with a lower case letter and then in camel case. This is in contrast to classes that are named starting with an upper case letter and camel case. Sticking to these conventions make code easier to read and maintain. A full list of conventions is published - this comment particularly refers to naming conventions.
Similarly, by convention, closing braces are put on a separate line when they close loops or if-else blocks.
throws Exception is very general - it's usual to limit as much as possible what Exceptions your code actually throws - in your case throws FileNotFoundException should be sufficient as this is what Scanner or File can throw at runtime. This specificity can be useful to any code that uses any of your code in the future.
Substance
You are creating the array up front with 100 members. You then call CreateArray which reads from a file while that file has more integers in it. Your code does not know how many that is - let's call it N. If N <= 100 (there are 100 integers or less), that's fine and your array will be populated from 0 to N-1. This approach is prone to confusion, though - the length of your array will be 100 no matter how many values it has read from the file - so you have to keep track of the count returned by CreateArray.
If N > 100 you have trouble - the file reading code will keep going, trying to add numbers to the array beyond its maximum index and you will get a runtime error (index out of bounds)
A better approach might be to have CreateArray return an ArrayList, which can have dynamic length and you can check how many there are using ArrayList.size()
Your original version of AddValues called CreateArray a second time, even though you pass in the array which already contains the values read from file. This is inefficient as it does all the file I/O again. Not a problem with this small example, but you should avoid duplication in general.
The main problem. As per prudhvi you are checking the number of integers in the file against 36, not each value. You can rectify this as suggested in that answer.
You do ClassPercentage= FullClass/count; Although ClassPercentage is a double, somewhat counter intuitively - because both the variables on the Right Hand Side (RHS) are int, you will have an int returned from the division which will always round down to zero. To make this work properly - you have to change (cast) one of the variables on the RHS to double before division e.g. ClassPercentage= ((double)FullClass)/count;.
If you do keep using arrays rather than ArrayList, be careful what happens when you pass them into methods. You are passing by reference, which means that if you change an element of an array in your method, it remains changed when you return from that method.
In your new version you do
...
classSizes[count] = number;
if(number>=36) {
...
You almost certainly mean
...
number = classSizes[count];
if(number>=36) {
...
which is to say that in programming the order of an assignment matters, so a = b is not equivalent to b = a
Code
A cleaned up version of your code - observing all the above (I hope):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
public class ClassCounter
{
public static void main(String[] args) throws FileNotFoundException
{
int count;
int fullClass;
double classPercentage;
ArrayList<Integer> enrollments = createArray();
count = enrollments.size();
System.out.println(count);
fullClass = addValues(enrollments);
System.out.println(fullClass);
classPercentage = ((double) fullClass) / count; // cast to double to avoid integer division
System.out.print(classPercentage + "% of classes are full");
}
/**
* scans file "enrollments.txt", which must contain a list of integers, and
* returns an ArrayList populated with those integers.
*
* @throws FileNotFoundException
*/
public static ArrayList<Integer> createArray() throws FileNotFoundException
{
ArrayList<Integer> listToReturn = new ArrayList<Integer>();
File enrollments = new File("enrollments.txt");
Scanner infile = new Scanner(enrollments);
while (infile.hasNextInt())
{
listToReturn.add(infile.nextInt());
}
return listToReturn;
}
/**
* returns the number of cases where enrollments >= 36 from the list of
* all enrollments
*
* @param enrollments - the list of enrollments in each class
* @throws FileNotFoundException
*/
public static int addValues(ArrayList<Integer> enrollments)
{
int number = 0;
int countOf36s = 0;
int i = 0;
while (i < enrollments.size())
{
number = enrollments.get(i);
if (number >= 36)
{
countOf36s++;
}
i++;
}
return countOf36s;
}
}

