I have a massive 25GB CSV file. I know there are ~500 million records in the file.
I want to do some basic analysis with the data. Nothing too fancy.
I don't want to use Hadoop/Pig, not yet at least.
I have written a Java program to do my analysis concurrently. Here is what I am doing.
class MainClass {
    public static void main(String[] args) throws Exception {
        long start = 1;
        long increment = 10000000;
        OpenFileAndDoStuff[] a = new OpenFileAndDoStuff[50];
        for (int i = 0; i < 50; i++) {
            a[i] = new OpenFileAndDoStuff("path/to/50GB/file.csv", start, start + increment - 1);
            a[i].start();
            start += increment;
        }
        for (OpenFileAndDoStuff obj : a) {
            obj.join();
        }
        // do aggregation
    }
}
class OpenFileAndDoStuff extends Thread {
    volatile HashMap<Integer, Integer> stuff = new HashMap<>();
    BufferedReader _br;
    long _end;

    OpenFileAndDoStuff(String filename, long startline, long endline) throws IOException, FileNotFoundException {
        _br = new BufferedReader(new FileReader(filename));
        long counter = 0;
        // move the BufferedReader to the specified startline
        while (counter++ < startline)
            _br.readLine();
        this._end = endline;
    }

    void doStuff() {
        // read from the BufferedReader until end of file or until the specified endline is reached, and do stuff
    }

    public void run() {
        doStuff();
    }

    public HashMap<Integer, Integer> getStuff() {
        return stuff;
    }
}
I thought that by doing this I could open 50 BufferedReaders, all reading 10 million line chunks in parallel, and once all of them are done doing their stuff, I'd aggregate their results.
But the problem I face is that even though I ask 50 threads to start, only two of them actually run and read from the file at any given time.
Is there a way I can make all 50 of them open the file and read from it at the same time? Why am I limited to only two readers at a time?
The file is on a Windows 8 machine, and Java is running on the same machine.
Any ideas?
Here is a similar post: Concurrent reading of a File (java preffered)
The most important question here is what is the bottleneck in your case?
If the bottleneck is your disk IO, then there isn't much you can do in software. Parallelizing the computation will only make things worse, because reading the file from different parts simultaneously will degrade disk performance.
If the bottleneck is processing power, and you have multiple CPU cores, then you can take advantage of starting multiple threads to work on different parts of the file. You can safely create several InputStreams or Readers to read different parts of the file in parallel (as long as you don't go over your operating system's limit for the number of open files). You could separate the work into tasks and run them in parallel.
See the referred post for an example that reads a single file in parallel with FileInputStream, which should be significantly faster than using BufferedReader according to these benchmarks: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly#FileReaderandBufferedReader
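The referred post uses FileInputStream; as a rough illustration (not the code from that post), a byte-range split with an ExecutorService might look like the sketch below. The task count, buffer size and file path are placeholders, in.skip() would need a loop in production code, and real code would also have to align chunk boundaries with line breaks.

import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRead {
    public static void main(String[] args) throws Exception {
        final String path = "path/to/50GB/file.csv";        // placeholder path
        final long fileSize = new File(path).length();
        final int tasks = 8;                                 // tune to the number of cores
        final long chunk = fileSize / tasks;

        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        List<Future<Long>> results = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            final long start = i * chunk;
            final long length = (i == tasks - 1) ? fileSize - start : chunk;
            results.add(pool.submit(() -> {
                long bytesSeen = 0;
                // each task opens its own stream and reads only its slice
                try (FileInputStream in = new FileInputStream(path)) {
                    in.skip(start);                          // may need to be looped in real code
                    byte[] buf = new byte[64 * 1024];
                    while (bytesSeen < length) {
                        int toRead = (int) Math.min(buf.length, length - bytesSeen);
                        int n = in.read(buf, 0, toRead);
                        if (n < 0) break;
                        bytesSeen += n;
                        // ... parse the bytes here ...
                    }
                }
                return bytesSeen;
            }));
        }
        long total = 0;
        for (Future<Long> f : results) total += f.get();     // aggregate the per-task results
        pool.shutdown();
        System.out.println("bytes read: " + total);
    }
}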
One issue I see is that when a thread is asked to read, for example, lines 80000000 through 90000000, it still reads the first 80000000 lines (and ignores them).
Maybe try java.io.RandomAccessFile.
In order to do this, you need all of the lines to be the same number of bytes. If you cannot adjust the structure of your file, then this is not an option. But if you can, it should allow for greater concurrency.
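If the records really are fixed-width, a sketch of the RandomAccessFile approach could look like this; the record length, file name and slice boundaries are placeholders, not values from the question.

import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedWidthReader {
    static final int RECORD_LENGTH = 64;                 // assumed bytes per line, including the line terminator

    public static void main(String[] args) throws IOException {
        long firstRecord = 80_000_000L;                  // start of this thread's slice
        long recordCount = 10_000_000L;                  // how many records this thread should handle
        try (RandomAccessFile raf = new RandomAccessFile("file.csv", "r")) {
            raf.seek(firstRecord * RECORD_LENGTH);       // jump straight to the slice, no skipping of lines
            byte[] record = new byte[RECORD_LENGTH];
            for (long i = 0; i < recordCount && raf.read(record) == RECORD_LENGTH; i++) {
                // ... parse the fixed-width record here ...
            }
        }
    }
}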
This is my current thread; I use it to stress test the CPU. I need to output the "Hcount" to a .txt file every hour. Currently it will print it, but only from one thread, and when another hour passes it deletes what is written in the .txt file and rewrites the new "Hcount".
I'm running 3 threads.
import java.util.Random;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MyThread extends Thread {
    public void run() {
        String B; // will hold the string value of Cpointer
        String A; // will hold the string value of Hcount
        Path fileName = Path.of("D:/EEoutput/Out.txt");
        Random rand = new Random();
        long Hcount = 0; // counts the number of iterations in an hour
        long t = System.currentTimeMillis();
        long end = t + 3800000 * 5; // when the stress test should stop
        double a1 = 0; // random holder 1
        double a2 = 0; // random holder 2
        double answer = 0; // answer placeholder
        Long hour = System.currentTimeMillis() + 3600000; // when to output the number of iterations to the file
        int Cpointer = 1; // tells how many hours have passed
        while (System.currentTimeMillis() < end) {
            a1 = rand.nextDouble();
            a2 = rand.nextDouble();
            answer = a1 * 23 / (a2 + a1) + a2;
            Hcount++;
            if (System.currentTimeMillis() >= hour) // checks if an hour has passed and the count should be written out
            {
                B = String.valueOf(Cpointer);
                A = String.valueOf(Hcount);
                try {
                    Files.writeString(fileName, A);
                } catch (IOException e) {
                    e.printStackTrace();
                }
                hour = System.currentTimeMillis() + 3600000; // sets the next output time to an hour from now
                Cpointer++; // another hour has passed; used to tell how many iterations there are in a certain hour
                Hcount = 0;
            }
        }
    }
}
Writing into a file from multiple threads is a bad idea. I suggest you create a queue (even if it's just an in-memory queue) and have all your threads write the info they want to put in the file into this queue. In other words, your queue will have multiple producers. Then have a single consumer on the queue that reads from it and writes to your file. This way only one thread writes to the file.
You have two separate issues here.
Files.writeString replaces content by default. You want Files.writeString(fileName, A, StandardOpenOption.APPEND).
Writing to the same file from simultaneous threads isn't going to work (think about it: the OS cannot promise that your writes will be atomic). So even if you fix the first issue, it will seem to work but every so often fail on you: a race condition.
The usual strategy to work around that last part is to use locks of some kind. If a single JVM is the only one doing those file writes, you can use whatever Java offers: synchronized, for example, or a ReadWriteLock from the j.u.concurrent package.
But this does mean your CPU stresser threads will be waiting on the lock. You may instead want to start a separate thread, and have a single BlockingQueue (for example a LinkedBlockingQueue). Your CPU stress testers send log messages to the queue, and your log writer thread will just be doing a 5-liner loop: endlessly fetch an item from the queue (this blocks until there is something), write it to the file, flush the stream, and loop.
This solves a bunch of problems, given that now only one thread writes.
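A minimal sketch of that single-writer setup, assuming a LinkedBlockingQueue and the Out.txt path from the question; the class name and log() method are made up, and the stress-test threads would call log(...) instead of touching the file themselves.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LogWriter {
    private static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>();

    public static void log(String message) {
        QUEUE.add(message);                              // called by the CPU stress threads
    }

    public static void main(String[] args) {
        Path out = Path.of("D:/EEoutput/Out.txt");
        Thread writer = new Thread(() -> {
            try (BufferedWriter w = Files.newBufferedWriter(out,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                while (true) {
                    String line = QUEUE.take();          // blocks until a message arrives
                    w.write(line);
                    w.newLine();
                    w.flush();
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        writer.setDaemon(true);                          // let the JVM exit when the stress threads finish
        writer.start();
        // start the MyThread stress testers here and have them call LogWriter.log(...)
    }
}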
If it's multiple JVMs, that's trickier: then lock with a .lock file. You can use Files.createFile() to create logfile.lock; this will fail if the file is already there. Then wait some time (you can't ask the OS to tell you when the file is deleted, so you have to wait half a second or so and check again, forever, until the file is gone) until it succeeds, then write, then delete the lock file.
A major downside to .lock files is: if your process hard-crashes, the lock file sticks around, which you don't want. One solution is to write your own PID (process ID) to it, so that anybody doing a check can at least see whether the process it belongs to is dead. Except this is tricky; modern OSes don't necessarily let you check for a process's existence, and it's all very OS-dependent (no Java libraries automate this, as far as I know). This all gets quite complicated, so let's keep it simple:
If you want to write to the same file simultaneously from different JVMs / processes on the same system, you can do that 'safely', but it is rather complicated.
I have a big text file that contains source-target nodes and a threshold. I store all the distinct nodes in a HashSet, then filter the edges based on the user threshold and store the filtered nodes in a separate HashSet. I want to find a way to do the processing as fast as possible.
public class Simulator {
    static HashSet<Integer> Alledgecount = new HashSet<>();
    static HashSet<Integer> FilteredEdges = new HashSet<>();

    static void process(BufferedReader reader, double userThres) throws IOException {
        String line = null;
        int l = 0;
        BufferedWriter writer = new BufferedWriter(new FileWriter("C:/users/mario/desktop/edgeList.txt"));
        while ((line = reader.readLine()) != null && l < 50_000_000) {
            String[] intArr = line.split("\\s+");
            checkDuplicate(Integer.parseInt(intArr[1]), Integer.parseInt(intArr[2]), Alledgecount);
            double threshold = Double.parseDouble(intArr[3]);
            if (threshold > userThres) {
                writeToFile(intArr[1], intArr[2], writer);
                checkDuplicate(Integer.parseInt(intArr[1]), Integer.parseInt(intArr[2]), FilteredEdges);
            }
            l++;
        }
        writer.close();
    }

    static void writeToFile(String param1, String param2, Writer writer) throws IOException {
        writer.write(param1 + "," + param2);
        writer.write("\r\n");
    }
}
The graph class does BFS and writes the nodes to a separate file. I have timed the processing with some functionality excluded; the timings are below.
Timings with 50 million lines read in process():
without calling BFS(), checkDuplicates(), writeAllEdgesToFile() -> 54s
without calling BFS(), writeAllEdgesToFile() -> 50s
without calling writeAllEdgesToFile() -> 1 min
Timings with 300 million lines read in process():
without calling writeAllEdges() -> 5 min
Reading a file doesn't depend only on CPU cores.
IO operations on a file are limited by the physical constraints of classic disks, which, unlike CPU cores, cannot parallelize operations.
What you could do is have one thread for IO operations and other(s) for data processing, but that only makes sense if the data processing takes long enough to justify creating a thread for it, since threads have a cost in terms of CPU scheduling. A rough sketch of that split follows below.
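A sketch of that split, with made-up names, queue size and worker count: one reader thread feeds lines into a BlockingQueue and several parsing threads drain it. The empty-string poison pill is a simplification that assumes the file has no blank lines, and the shared data structures the workers update would need to be thread-safe.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReaderWorkerSplit {
    private static final String POISON = "";             // end-of-input marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> lines = new ArrayBlockingQueue<>(10_000);
        int workers = 4;

        Thread reader = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(new FileReader("edges.txt"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    lines.put(line);                      // blocks if the workers fall behind
                }
                for (int i = 0; i < workers; i++) lines.put(POISON);
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        reader.start();

        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    String line;
                    while (!(line = lines.take()).equals(POISON)) {
                        // ... parse the line, do the threshold and duplicate checks here ...
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }
        reader.join();
        for (Thread t : pool) t.join();
    }
}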
Getting a multi-threaded Java program to run correctly can be very tricky. It needs some deep understanding of things like synchronization issues etc. Without the knowledge/experience necessary, you'll have a hard time searching for bugs that occur sometimes but aren't reliably reproducible.
So, before trying multi-threading, find out if there are easier ways to achieve acceptable performance:
Find the part of your program that takes the time!
First question: is it I/O or CPU? Have a look at Task Manager. Does your single-threaded program occupy one core (e.g. CPU close to 25% on a 4-core machine)? If it's far below that, then I/O must be the limiting factor, and changing your program probably won't help much - buy a faster HD. (In some situations, the software style of doing I/O might influence the hardware performance, but that's rare.)
If it's CPU, use a profiler, e.g. the JVisualVM contained in the JDK, to find the method that takes most of the runtime and think about alternatives. One candidate might be line.split("\\s+"), which uses a regular expression. Regular expressions are slow, especially if the expression isn't compiled into a Pattern beforehand, but that's nothing more than a guess, and the profiler will most probably point you to some very different place.
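For illustration, the precompiled-pattern variant looks like the sketch below; whether it actually helps here is exactly what the profiler should confirm first. The sample line is made up.

import java.util.regex.Pattern;

public class SplitSketch {
    // compiled once, instead of letting line.split("\\s+") recompile the regex for every line
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String[] parse(String line) {
        return WHITESPACE.split(line);
    }

    public static void main(String[] args) {
        String[] fields = parse("1 42 137 0.85");
        System.out.println(fields.length + " fields, threshold = " + fields[3]);
    }
}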
Well, this might be a silly problem.
I just want a faster implementation of following problem
I want to take three integer input in a single line eg:
10 34 54
One way is to make a BufferedReader and then use readLine()
which will read the whole line as a string
then we can use StringTokenizer to separate three integer. (Slow implemetation)
Another way is to use 'Scanner' and take input by nextInt() method. (Slower than previous method)
I want a fast implementation to take such type of inputs since I have to read more than 2,000,000 lines and these implementations are very slow.
My implementation:
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String str;
StringTokenizer st;
int t1, t2;
long z;
for (int i = 0; i < n; i++) {
    str = br.readLine();
    st = new StringTokenizer(str);
    t1 = Integer.parseInt(st.nextElement().toString());
    t2 = Integer.parseInt(st.nextElement().toString());
    z = Long.parseLong(st.nextElement().toString());
}
This one is looped n times. (n is the number of entries.)
Since I know each line will contain only three integers, there is no need to check hasMoreElements().
I just want a faster implementation of the following problem.
The chances are that you DON'T NEED a faster implementation. Seriously. Not even with a 2 million line input file.
The chances are that:
more time is spent processing the file than reading it, and
most of the "read time" is spent doing things at the operating system level, or simply waiting for the next disk block to be read.
My advice is to not bother optimizing this unless the application as a whole takes too long to run. And when you find that this is the case, profile the application, and use the profile stats to tell you where it could be worthwhile spending effort on optimization.
(My gut feeling is that there is not much to be gained by optimizing this part of your application. But don't rely on that. Profile it!)
Here's a basic example that will be pretty fast:
public static void main(String[] args) throws IOException {
BufferedReader reader = new BufferedReader(new FileReader("myfile.txt"));
String line;
while ((line = reader.readLine()) != null) {
for (String s : line.split(" ")) {
final int i = Integer.parseInt(s);
// do something with i...
}
}
reader.close();
}
However your task is fundamentally going to take time.
If you are doing this on a website and reaching a timeout, you should consider doing it in a background thread, and send a response to the user saying that the data is being processed. You'll probably need to add a way for the user to check on the progress.
Here is what I mean when I say "specialized scanner". Depending upon parser's (or split's) efficiency, this might be a bit faster (it probably is not):
BufferedReader br = new BufferedReader(...);
for (i = 0; i < n; i++)
{
    String str = br.readLine();
    long[] resultLongs = {-1, -1, -1};
    int startPos = 0;
    int nextLongIndex = 0;
    for (int p = 0; p < str.length(); p++)
    {
        if (str.charAt(p) == ' ')
        {
            String longAsStr = str.substring(startPos, p);   // end index is exclusive, so use p, not p-1
            resultLongs[nextLongIndex++] = Long.parseLong(longAsStr);
            startPos = p + 1;
        }
    }
    // the last number has no trailing blank, so parse the rest of the line
    resultLongs[nextLongIndex] = Long.parseLong(str.substring(startPos));
    // t1, t2 and z are in resultLongs[0] through resultLongs[2]
}
Hths.
And of course this fails miserably if the input file contains garbage, i.e. anything else but longs separated by blanks.
And in addition, to minimize the "roundtrips" to the OS, it is a good idea to supply the buffered reader with a nonstandard (bigger-than-standard) buffer.
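For example (the 1 MB size below is an arbitrary choice; the default buffer is 8 KB):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BigBufferExample {
    public static void main(String[] args) throws IOException {
        // the second constructor argument is the buffer size in chars
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"), 1 << 20)) {
            System.out.println(br.readLine());
        }
    }
}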
The other hint I gave in the comment, refined: if you have to read such a huge text file more than once, i.e. more than once after it has been updated, you could read all the longs into a data structure (maybe a list of elements that each hold three longs) and stream that into a "cache" file. Next time, compare the text file's timestamp to the "cache" file's. If the text file is older, read the cache file. Since stream I/O does not serialize the longs into their string representation, you will see much, much better reading times.
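A minimal sketch of that caching idea, with made-up file names: on the first run the text file is parsed and the longs are dumped into a binary cache; on later runs, if the cache is newer than the text file, the binary form is read directly with no string parsing.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;

public class LongCache {
    public static void main(String[] args) throws IOException {
        File text = new File("input.txt");
        File cache = new File("input.cache");

        if (cache.exists() && cache.lastModified() >= text.lastModified()) {
            // cache is up to date: read the longs directly
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(cache)))) {
                while (in.available() > 0) {
                    long t1 = in.readLong(), t2 = in.readLong(), z = in.readLong();
                    // ... use the three values ...
                }
            }
        } else {
            // (re)build the cache from the text file
            try (BufferedReader br = new BufferedReader(new FileReader(text));
                 DataOutputStream out = new DataOutputStream(
                         new BufferedOutputStream(new FileOutputStream(cache)))) {
                String line;
                while ((line = br.readLine()) != null) {
                    for (String s : line.trim().split("\\s+")) {
                        out.writeLong(Long.parseLong(s));
                    }
                }
            }
        }
    }
}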
EDIT: Missed the startPos reassignment.
EDIT2: Added the cache idea explanation.
I want three threads reading a single file. For example, if the size of the file is 900 KB, I want the first thread to read from 1 KB to 300 KB, and the other threads to do the same (the second thread reads from 301 KB to 600 KB, and the third thread reads from 601 KB to 900 KB). Does this approach make reading faster? The output has to be shown on the console; it doesn't matter to me if the output gets mixed. The main question is how to read faster compared to a single thread. Please give me a suggestion or code if you have some.
I'm not a Java expert, but I believe that if your goal is performance, you should not bother reading a single megabyte-sized file in several threads. Most of the time is probably spent doing the actual IO operation, that is reading from the disk (recall that disk operations are millions of times slower than memory operations). Of course, on some occasions it could be faster (e.g. on a Linux system the file data could have been cached, if it has been read or written some time before).
But when reading (rather small, i.e. megabyte sized) files, most of the time is spent in the system, and your coding won't impact that.
And reading a megabyte file should go fast on today's machines. You might use some dirty system tricks to improve it (e.g. the Linux readahead system call), if absolutely necessary.
Actually, your question surprises me. Reading one megabyte is quick today!
Regards.
Basile Starynkevitch
public static void main(String[] args) throws IOException {
    String filePath = args[0];
    // Create runnable objects
    // Load the file
    BufferedReader br = new BufferedReader(new FileReader(filePath));
    // share this object among as many threads as you want
    MyFileReader mf1 = new MyFileReader(br);
    MyFileReader mf2 = new MyFileReader(br);
    MyFileReader mf3 = new MyFileReader(br);
    new Thread(mf1).start();
    new Thread(mf2).start();
    new Thread(mf3).start();
    // code to detect when the threads end
    // close br here
}

public class MyFileReader implements Runnable {
    private BufferedReader br = null;

    public MyFileReader(BufferedReader br) {
        this.br = br;
    }

    public void run() {
        String line = null;
        try {
            while ((line = br.readLine()) != null) {
                // do your thing here, e.g.
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Searching for a string in a file and writing the lines with the matched string to another
file takes 15-20 minutes for a single zip file of 70MB (compressed size).
Is there any way to minimize it?
My source code:
Getting the zip file entries:
zipFile = new ZipFile(source_file_name);
entries = zipFile.entries();
while (entries.hasMoreElements()) {
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if (entry.isDirectory()) {
        continue;
    }
    searchString(Thread.currentThread(), entry.getName(),
            new BufferedInputStream(zipFile.getInputStream(entry)),
            Out_File, search_string, stats);
}
zipFile.close();
Searching String
public void searchString(Thread CThread, String Source_File, BufferedInputStream in,
        File outfile, String search, String stats) throws IOException {
    int count = 0;
    int countw = 0;
    int countl = 0;
    String s;
    String[] str;
    BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
    System.out.println(CThread.currentThread());
    while ((s = br2.readLine()) != null) {
        str = s.split(search);
        count = str.length - 1;
        countw += count; // word count
        if (s.contains(search)) {
            countl++; // line count
            WriteFile(CThread, s, outfile.toString(), search);
        }
    }
    br2.close();
    in.close();
}
--------------------------------------------------------------------------------
public void WriteFile(Thread CThread, String line, String out, String search) throws IOException {
    BufferedWriter bufferedWriter = null;
    System.out.println("write thread " + CThread.currentThread());
    bufferedWriter = new BufferedWriter(new FileWriter(out, true));
    bufferedWriter.write(line);
    bufferedWriter.newLine();
    bufferedWriter.flush();
}
Please help me. It's really taking 40 minutes for 10 files using threads, and 15-20 minutes for a single file that is 70MB compressed. Are there any ways to minimize the time?
You are reopening the file output handle for every single line you write.
This is likely to have a massive performance impact, far outweighing other performance issues. Instead I would recommend creating the BufferedWriter once (e.g. upon the first match) and then keeping it open, writing each matching line and then closing the Writer upon completion.
Also, remove the call to flush(); there is no need to flush each line as the call to Writer.close() will automatically flush any unwritten data to disk.
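A sketch of what that looks like, simplified from the question's searchString (the counters and parameters are trimmed down): the BufferedWriter is created once before the read loop and closed by try-with-resources, which also flushes it.

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;

public class SingleWriterSearch {
    public static int searchString(BufferedInputStream in, File outFile, String search)
            throws IOException {
        int matchingLines = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in));
             BufferedWriter writer = new BufferedWriter(new FileWriter(outFile, true))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(search)) {
                    matchingLines++;
                    writer.write(line);          // same writer for every match
                    writer.newLine();            // no per-line flush needed
                }
            }
        }
        return matchingLines;
    }
}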
Finally, as a side note your variable and method naming style does not follow the Java camel case convention; you might want to consider changing it.
I'm not sure if the cost you are seeing is from disk operations or from string manipulations. I'll assume for now that the problem is the strings, you can check that by writing a test driver that runs your code with the same line over and over.
I can tell you that split() is going to be very expensive in your case because you are producing strings you don't need and then recycling them, creating much overhead. You may want to increase the amount of space available to your JVM with -Xmx.
If you merely separate words by the presence of whitespace, then you would do much better using a regular expression matcher that you create before the loop and apply to each string. The number of matches when applied to a given string will be your word count, and that does not create an array of strings (which is very wasteful and which you don't use). You will see in the JavaDocs that split() does work via regular expressions; that is true, but split does the extra step of creating separate strings, and that's where your waste might be.
You can also use a regular expression to search for the match instead of contains(), though that may not be significantly faster.
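For illustration, a matcher-based count might look like this; the search term and sample line are made up. Pattern.quote() makes the search term literal, so it behaves like contains() rather than like a regex.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchCounter {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(Pattern.quote("error"));   // compile once, outside the read loop
        String line = "error at start, another error at end";

        Matcher m = p.matcher(line);
        int count = 0;
        while (m.find()) count++;                              // occurrences of the search term, no String[] created
        boolean matched = count > 0;                           // replaces s.contains(search)

        System.out.println(count + " occurrences, matched = " + matched);
    }
}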
You could make things parallel by using multiple threads. However, if split() is the cause of your grief, your problem is the overhead and running out of heap space, so you won't necessarily benefit from it.
More generally, if you need to do this a lot, you may want to write a script in a language more "friendly" to string manipulation. A 10-line script in Python can do this much faster.
wow, what are you doing in this method
WriteFile(CThread,s, outfile.toString(), search);
every time you get a line containing your text, you are creating a new BufferedWriter(new FileWriter(out, true));
Just create one BufferedWriter in your searchString method and use that to write the lines. There is no need to open it again and again. It will drastically improve the performance.
One problem here might be that you stop reading while you write. I would probably use one thread for reading and another thread for writing the file. As an extra optimization, the thread writing the results could buffer them in memory and write them to the file as a batch, say every ten entries or so.
In the writing thread you should queue the incoming entries before handling them.
Of course, you should first figure out where the time is spent: is it the IO or something else?
There are too many potential bottlenecks in this code for anyone to be sure what the critical ones are. Therefore you should profile the application to determine what is causing it to be slow.
Armed with that information, decide whether the problem is in reading the ZIP file, doing the searching, or writing the matches to the output file.
(Repeatedly opening and closing the output file is a bad idea, but if you only get a tiny number of search hits it won't make much difference to the overall performance.)