Efficient read and write from big text file in Java

I have a big text file that contains source and target nodes plus a threshold. I store all the distinct nodes in a HashSet, then filter the edges based on a user-supplied threshold and store the filtered nodes in a separate HashSet. I want to find a way to do this processing as fast as possible.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.HashSet;

public class Simulator {
    static HashSet<Integer> Alledgecount = new HashSet<>();
    static HashSet<Integer> FilteredEdges = new HashSet<>();

    static void process(BufferedReader reader, double userThres) throws IOException {
        String line = null;
        int l = 0;
        BufferedWriter writer = new BufferedWriter(new FileWriter("C:/users/mario/desktop/edgeList.txt"));
        while ((line = reader.readLine()) != null && l < 50_000_000) {
            String[] intArr = line.split("\\s+");
            checkDuplicate(Integer.parseInt(intArr[1]), Integer.parseInt(intArr[2]), Alledgecount);
            double threshold = Double.parseDouble(intArr[3]);
            if (threshold > userThres) {
                writeToFile(intArr[1], intArr[2], writer);
                checkDuplicate(Integer.parseInt(intArr[1]), Integer.parseInt(intArr[2]), FilteredEdges);
            }
            l++;
        }
        writer.close();
    }

    static void writeToFile(String param1, String param2, Writer writer) throws IOException {
        writer.write(param1 + "," + param2);
        writer.write("\r\n");
    }

    // Not shown in the original post; presumably it just records both node ids in the given set.
    static void checkDuplicate(int node1, int node2, HashSet<Integer> nodes) {
        nodes.add(node1);
        nodes.add(node2);
    }
}
The Graph class does a BFS and writes the nodes to a separate file. I have timed the processing with some functionalities excluded; the timings are below.
Timings with 50 million lines read in process():
without calling BFS(), checkDuplicate(), writeAllEdgesToFile() -> 54 s
without calling BFS(), writeAllEdgesToFile() -> 50 s
without calling writeAllEdgesToFile() -> 1 min
Timings with 300 million lines read in process():
without calling writeAllEdgesToFile() -> 5 min

Reading a file doesn't depend only on CPU cores.
IO operations on a file are limited by the physical constraints of classic disks, which, unlike CPU cores, cannot run operations in parallel.
What you could do is have one thread for IO operations and one or more others for data processing, but this only makes sense if the data processing is long enough to justify creating a thread for the task, since threads have a cost in terms of CPU scheduling.
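A minimal sketch of that split, assuming a simple producer/consumer setup with a BlockingQueue; the class name, file name, queue size and poison-pill value are illustrative, not taken from the original code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedReader {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        String poisonPill = "__EOF__";   // marks the end of the input

        Thread ioThread = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(new FileReader("edges.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.put(line);          // blocks while the queue is full
                }
                queue.put(poisonPill);        // signal that no more lines will arrive
            } catch (IOException | InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread worker = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(poisonPill)) {
                    // parse the line, apply the threshold filter, update the sets ...
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        ioThread.start();
        worker.start();
        ioThread.join();
        worker.join();
    }
}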

Getting a multi-threaded Java program to run correctly can be very tricky. It requires a deep understanding of things like synchronization issues. Without the necessary knowledge and experience, you'll have a hard time hunting down bugs that occur sometimes but aren't reliably reproducible.
So, before trying multi-threading, find out if there are easier ways to achieve acceptable performance:
Find the part of your program that takes the time!
First question: is it I/O or CPU? Have a look at Task Manager. Does your single-threaded program occupy one core (e.g. CPU close to 25% on a 4-core machine)? If it's far below that, then I/O must be the limiting factor, and changing your program probably won't help much - buy a faster HD. (In some situations, the software style of doing I/O might influence the hardware performance, but that's rare.)
If it's CPU, use a profiler, e.g. JVisualVM, which ships with the JDK, to find the method that takes most of the runtime and think about alternatives. One candidate might be the line.split("\\s+") call, which uses a regular expression. Regular expressions are slow, especially if the expression isn't compiled into a Pattern beforehand - but that's nothing more than a guess, and the profiler will most probably point you to some very different place.
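For illustration, hoisting the pattern out of the read loop would look roughly like this (reusing the question's line and intArr names):

import java.util.regex.Pattern;

// Compile the whitespace pattern once, e.g. as a static field, instead of on every split() call.
private static final Pattern WHITESPACE = Pattern.compile("\\s+");

// inside the read loop:
String[] intArr = WHITESPACE.split(line);   // same result as line.split("\\s+"), without recompiling the regex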

Related

How to detect memory-pressure in a java program?

I have a batch process, written in java, that analyzes extremely long sequences of tokens (maybe billions or even trillions of them!) and observes bi-gram patterns (aka, word-pairs).
In this code, bi-grams are represented as Pairs of Strings, using the ImmutablePair class from Apache commons. I won't know in advance the cardinality of the tokens. They might be very repetitive, or each token might be totally unique.
The more data I can fit into memory, the better the analysis will be!
But I definitely can't process the whole job at once. So I need to load as much data as possible into a buffer, perform a partial analysis, flush my partial results to a file (or to an API, or whatever), then clear my caches and start over.
One way I'm optimizing memory usage is by using Guava interners to de-duplicate my String instances.
Right now, my code looks essentially like this:
int BUFFER_SIZE = 100_000_000;
Map<Pair<String, String>, LongAdder> bigramCounts = new HashMap<>(BUFFER_SIZE);
Interner<String> interner = Interners.newStrongInterner();

String prevToken = null;
Iterator<String> tokens = getTokensFromSomewhere();
while (tokens.hasNext()) {
    String token = interner.intern(tokens.next());
    if (prevToken != null) {
        Pair<String, String> bigram = new ImmutablePair<>(prevToken, token);
        LongAdder bigramCount = bigramCounts.computeIfAbsent(
                bigram,
                (c) -> new LongAdder()
        );
        bigramCount.increment();

        // If our buffer is full, we need to flush!
        boolean tooMuchMemoryPressure = bigramCounts.size() > BUFFER_SIZE;
        if (tooMuchMemoryPressure) {
            // Analyze the data, and write the partial results somewhere
            doSomeFancyAnalysis(bigramCounts);
            // Clear the buffer and start over
            bigramCounts.clear();
        }
    }
    prevToken = token;
}
The trouble with this code is that this is a very crude way of determining whether there is tooMuchMemoryPressure.
I want to run this job on many different kinds of hardware, with varying amounts of memory. No matter the instance, I want this code to automatically adjust to maximize the memory consumption.
Rather than using some hard-coded constant like BUFFER_SIZE (derived through experimentation, heuristics, guesswork), I actually just want to ask the JVM whether the memory is almost full. But that's a very complicated question, considering the complexities of mark/sweep algorithms and all the different generational collectors.
What would be a good general-purpose approach for accomplishing something like this, assuming this batch-job might run on a variety of different machines, with different amounts of available memory? I don't need this to be extremely precise... I'm just looking for a rough signal to know that I need to flush the buffer soon, based on the state of the actual heap.
The simplest way to get a first glimpse of what is going on with the process's heap is Runtime.freeMemory() together with .maxMemory() and .totalMemory(). However, freeMemory() does not factor in garbage, so it is an under-estimate at best and may be completely misleading just before the GC kicks in.
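For reference, the naive Runtime-based estimate looks roughly like this; the 10% headroom threshold is an arbitrary illustration, not a recommendation:

Runtime rt = Runtime.getRuntime();
long used = rt.totalMemory() - rt.freeMemory();        // ignores garbage that the next GC would reclaim
long headroom = rt.maxMemory() - used;                  // rough upper bound on what is still available
boolean tooMuchMemoryPressure = headroom < rt.maxMemory() / 10;   // arbitrary: flush below ~10% headroom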
Assuming that for your application "memory pressure" basically means "(soon) not enough", the interesting value is free memory right after a GC.
This is available by using GarbageCollectorMXBean
which provides GcInfo with memory usage after the GC.
The bean can be watched right after each GC, since it is also a NotificationEmitter, although this is not advertised in the Javadoc. Some minimal code, patterned after a longer example, is:
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;
import com.sun.management.GcInfo;

void registerCallback() {
    List<GarbageCollectorMXBean> gcbeans = ManagementFactory.getGarbageCollectorMXBeans();
    for (GarbageCollectorMXBean gcbean : gcbeans) {
        System.out.println(gcbean.getName());
        // Every GarbageCollectorMXBean is also a NotificationEmitter.
        NotificationEmitter emitter = (NotificationEmitter) gcbean;
        emitter.addNotificationListener(this::handle, null, null);
    }
}

private void handle(Notification notification, Object handback) {
    if (!notification.getType()
            .equals(GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION)) {
        return;
    }
    GarbageCollectionNotificationInfo info =
            GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
    GcInfo gcInfo = info.getGcInfo();
    gcInfo.getMemoryUsageAfterGc().forEach((name, memUsage) -> {
        System.err.println(name + " -> " + memUsage);
    });
}
There will be several memUsage entries, and they will differ depending on the GC in use. But from the values provided (used, committed and max) we can derive an upper limit on the free memory, which should give the "rough signal" the OP is asking for.
doSomeFancyAnalysis() will certainly also need its share of fresh memory, so with a very rough estimate of how much that will be per bigram to analyze, this could be the limit to watch for.
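Building on the listener above, a rough pressure check could look like the sketch below; the 80% threshold and the volatile flag are assumptions, not part of the original code:

// Assumed field, polled by the batch loop between iterations:
private volatile boolean memoryPressureDetected = false;

private void checkPressure(GcInfo gcInfo) {
    gcInfo.getMemoryUsageAfterGc().forEach((name, memUsage) -> {
        long max = memUsage.getMax();                  // -1 if this pool has no defined maximum
        if (max > 0 && (double) memUsage.getUsed() / max > 0.8) {
            memoryPressureDetected = true;             // arbitrary 80% threshold: "flush soon"
        }
    });
}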

How many filereaders can concurrently read from the same file?

I have a massive 25GB CSV file. I know that there are ~500 Million records in the file.
I want to do some basic analysis with the data. Nothing too fancy.
I don't want to use Hadoop/Pig, at least not yet.
I have written a java program to do my analysis concurrently. Here is what I am doing.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

class MainClass {
    public static void main(String[] args) throws Exception {
        long start = 1;
        long increment = 10000000;
        OpenFileAndDoStuff[] a = new OpenFileAndDoStuff[50];
        for (int i = 0; i < 50; i++) {
            a[i] = new OpenFileAndDoStuff("path/to/50GB/file.csv", start, start + increment - 1);
            a[i].start();
            start += increment;
        }
        for (OpenFileAndDoStuff obj : a) {
            obj.join();
        }
        // do aggregation
    }
}

class OpenFileAndDoStuff extends Thread {
    volatile HashMap<Integer, Integer> stuff = new HashMap<>();
    BufferedReader _br;
    long _end;

    OpenFileAndDoStuff(String filename, long startline, long endline) throws IOException, FileNotFoundException {
        _br = new BufferedReader(new FileReader(filename));
        long counter = 0;
        // move the BufferedReader past the lines before the specified start line
        while (counter++ < startline)
            _br.readLine();
        this._end = endline;
    }

    void doStuff() {
        // read from the BufferedReader until end of file or until the specified end line is reached, and do stuff
    }

    public void run() {
        doStuff();
    }

    public HashMap<Integer, Integer> getStuff() {
        return stuff;
    }
}
I thought that by doing this I could open 50 BufferedReaders, each reading a 10-million-line chunk in parallel, and once all of them are done doing their stuff, I'd aggregate the results.
But the problem I face is that even though I ask 50 threads to start, only two start at a time and only two can read from the file at a time.
Is there a way I can make all 50 of them open the file and read from it at the same time? Why am I limited to only two readers at a time?
The file is on a Windows 8 machine, and Java runs on the same machine.
Any ideas ?
Here is a similar post: Concurrent reading of a File (java preffered)
The most important question here is what is the bottleneck in your case?
If the bottleneck is your disk IO, then there isn't much you can do on the software side. Parallelizing the computation will only make things worse, because reading the file from different parts simultaneously will degrade disk performance.
If the bottleneck is processing power and you have multiple CPU cores, then you can take advantage of starting multiple threads to work on different parts of the file. You can safely create several InputStreams or Readers to read different parts of the file in parallel (as long as you don't go over your operating system's limit for the number of open files). You could separate the work into tasks and run them in parallel.
See the referred post for an example that reads a single file in parallel with FileInputStream, which should be significantly faster than using BufferedReader according to these benchmarks: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly#FileReaderandBufferedReader
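A minimal sketch of that byte-range approach, assuming each worker gets its own channel and a chunk defined by byte offsets (lines that straddle a chunk boundary still need extra handling):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class ChunkReader implements Runnable {
    private final String path;
    private final long offset;
    private final long length;

    ChunkReader(String path, long offset, long length) {
        this.path = path;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public void run() {
        // Each thread opens its own channel, so there is no shared read position.
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(1 << 20);     // 1 MB read buffer
            long remaining = length;
            long pos = offset;
            while (remaining > 0) {
                buf.clear();
                buf.limit((int) Math.min(buf.capacity(), remaining));
                int n = ch.read(buf, pos);                     // positional read at this chunk's offset
                if (n < 0) break;
                pos += n;
                remaining -= n;
                // ... parse the bytes in buf (records crossing chunk boundaries need extra care)
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}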
One issue I see is that when a Thread is being asked to read, for example, lines 80000000 through 90000000, you are still reading in the first 80000000 lines (and ignoring them).
Maybe try java.io.RandomAccessFile.
In order to do this, you need all of the lines to be the same number of bytes. If you cannot adjust the structure of your file, then this is not an option; but if you can, it should allow for greater concurrency.
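A sketch of that idea, assuming every record is exactly RECORD_LENGTH bytes (the constant and method name are made up for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;

static final int RECORD_LENGTH = 100;   // assumed fixed record size in bytes, including the line terminator

static void readRecords(String path, long firstRecord, long recordCount) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
        raf.seek(firstRecord * RECORD_LENGTH);        // jump directly to this chunk's first record
        byte[] record = new byte[RECORD_LENGTH];
        for (long i = 0; i < recordCount; i++) {
            raf.readFully(record);
            // ... parse the record bytes
        }
    }
}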

Java NIO MappedByteBuffer OutOfMemoryException

I am really in trouble: I want to read HUGE files of several GB using FileChannels and MappedByteBuffers - all the documentation I found implies it's rather simple to map a file using the FileChannel.map() method.
Of course there is a limit at 2 GB, as all the Buffer methods use int for position, limit and capacity - but what about the system-imposed limits below that?
In reality, I get lots of problems with OutOfMemoryExceptions, and no documentation at all that really defines the limits!
So - how can I map a file that fits into the int-limit safely into one or several MappedByteBuffers without just getting exceptions?
Can I ask the system which portion of a file I can safely map before I try FileChannel.map()? How?
Why is there so little documentation about this feature??
I can offer some working code. Whether this solves your problem or not is difficult to say. This hunts through a file for a pattern recognised by the Hunter.
See the excellent article Java tip: How to read files quickly for the original research (not mine).
// 4k buffer size.
static final int SIZE = 4 * 1024;
static byte[] buffer = new byte[SIZE];

// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter p, FileInputStream f) throws FileNotFoundException, IOException {
    // Use a mapped and buffered stream for best speed.
    // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
    FileChannel ch = f.getChannel();
    long red = 0L;
    do {
        long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        int nGet;
        while (mb.hasRemaining() && p.ok()) {
            nGet = Math.min(mb.remaining(), SIZE);
            mb.get(buffer, 0, nGet);
            for (int i = 0; i < nGet && p.ok(); i++) {
                p.check(buffer[i]);
            }
        }
        red += read;
    } while (red < ch.size() && p.ok());
    // Finish off.
    p.close();
    ch.close();
    f.close();
}
What I use is a List<ByteBuffer>, where each ByteBuffer maps a block of the file of 16 MB to 1 GB. I use powers of 2 to simplify the logic. I have used this to map files up to 8 TB.
A key limitation of memory-mapped files is that you are limited by your virtual memory. If you have a 32-bit JVM you won't be able to map very much.
I wouldn't keep creating new memory mappings for a file, because these are never cleaned up. You can create lots of them, but on some systems there appears to be a limit of about 32K mappings, no matter how small they are.
The main reason I find memory-mapped files useful is that they don't need to be flushed (if you can assume the OS won't die). This lets you write data in a low-latency way, without worrying about losing too much data if the application dies, or about losing too much performance by having to write() or flush().
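A rough sketch of that chunked-mapping approach; the 1 GB chunk size is just one of the powers of two mentioned above, and the method name is illustrative:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

static List<MappedByteBuffer> mapInChunks(String path, long chunkSize) throws IOException {
    List<MappedByteBuffer> chunks = new ArrayList<>();
    try (RandomAccessFile raf = new RandomAccessFile(path, "r");
         FileChannel ch = raf.getChannel()) {
        long size = ch.size();
        for (long pos = 0; pos < size; pos += chunkSize) {
            long len = Math.min(chunkSize, size - pos);   // last chunk may be shorter
            // A mapping, once established, stays valid after the channel is closed.
            chunks.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
        }
    }
    return chunks;
}

// usage: List<MappedByteBuffer> parts = mapInChunks("huge.dat", 1L << 30);   // 1 GB chunks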
You don't use the FileChannel API to write the entire file at once. Instead, you send the file in parts. See example code in Martin Thompson's post comparing performance of Java IO techniques: Java Sequential IO Performance
In addition, there is not much documentation because you are making a platform-dependent call. From the map() JavaDoc:
Many of the details of memory-mapped files are inherently dependent
upon the underlying operating system and are therefore unspecified.
The bigger the file, the less you want it all in memory at once. Devise a way to process the file a buffer at a time, a line at a time, etc.
MappedByteBuffers are especially problematic, as there is no defined release of the mapped memory, so using more than one at a time is essentially bound to fail.

Why is Java HashMap slowing down?

I'm trying to build a map from the contents of a file, and my code is as below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
new HashMap<Integer, List<Integer>>(2000000);
try {
fr = new FileReader(pathname);
BufferedReader br = new BufferedReader(fr);
String line;
int i = 1;
while ((line = br.readLine()) != null) {
System.out.println("line number: " + i);
i++;
String[] strs = line.split("\t");
int key = Integer.parseInt(strs[0]);
int value = Integer.parseInt(strs[1]);
List<Integer> list = snsMap.get(key);
//if the follower is not in the map
if(snsMap.get(key) == null)
list = new LinkedList<Integer>();
list.add(value);
snsMap.put(key, list);
System.out.println("map size: " + snsMap.size());
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first but slows down considerably by the time the printed information looks like this:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I tried to find the reason with the two System.out.println() calls, to judge the performance of the BufferedReader and the HashMap instead of using a Java profiler.
Sometimes it takes a while to get the map size after the line number has been printed, and sometimes it takes a while to get the line number after the map size. My question is: which makes my program slow, the BufferedReader on a big file or the HashMap for a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But, if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing and then it breaks with OutOfMemoryError.
You will have to check your program with some profiling tools to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained on memory and doing excessive GC), so the guess would be that reading the file is the slower part here.
Before you profiled, you will not know what is slow and what isn't.
Most likely, the System.out will show up as being the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
into the if statement. You only need to put this when you created a new list. Otherwise, the put will just replace the current value with itself.
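A sketch of that change, keeping the question's LinkedList; the computeIfAbsent variant is an equivalent alternative on Java 8+:

List<Integer> list = snsMap.get(key);
if (list == null) {
    list = new LinkedList<Integer>();
    snsMap.put(key, list);     // only put when a new list was created
}
list.add(value);

// or, equivalently, on Java 8+:
snsMap.computeIfAbsent(key, k -> new LinkedList<Integer>()).add(value);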
The cost associated with Integer objects in Java (and in particular the use of Integer in the Java Collections API) is largely a memory (and thus garbage collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU Trove, depending on how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU Trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1m keys + 30m values ~ 1 GB. No overhead included yet. With GNU trove TIntObjectHash<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.
The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.
A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.
If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
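A hedged sketch of what such a rewrite could look like with Trove (package names as in the trove4j 3.x releases; adapt to your version):

import gnu.trove.list.array.TIntArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

TIntObjectHashMap<TIntArrayList> snsMap = new TIntObjectHashMap<>(2000000);

// inside the read loop, after parsing key and value as ints:
TIntArrayList followers = snsMap.get(key);
if (followers == null) {
    followers = new TIntArrayList();
    snsMap.put(key, followers);
}
followers.add(value);      // stored as a primitive int, no Integer boxing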
The best way is to run your program with a profiler (for example, JProfiler) and see which parts are slow. Debug output can also slow your program down, for example.
HashMap is not slow; it is in fact among the fastest Map implementations. Hashtable is synchronized (thread safe) and can be slower because of that.
Important note: close the BufferedReader and the file after you have read the data... this might help.
e.g. br.close()
file.close()
Please check your system processes in Task Manager; there may be too many processes running in the background.
Sometimes Eclipse is really resource-heavy, so try running your program from the console to check it.

Search a string in a file and write the matched lines to another file in Java

Searching for a string in a file and writing the lines with the matched string to another file takes 15-20 minutes for a single zip file of 70 MB (compressed). Are there any ways to minimise that?
My source code:
Getting the zip file entries:
zipFile = new ZipFile(source_file_name);
entries = zipFile.entries();
while (entries.hasMoreElements())
{
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if (entry.isDirectory())
    {
        continue;
    }
    searchString(Thread.currentThread(), entry.getName(),
            new BufferedInputStream(zipFile.getInputStream(entry)), Out_File, search_string, stats);
}
zipFile.close();
Searching for the string:
public void searchString(Thread CThread, String Source_File, BufferedInputStream in, File outfile, String search, String stats) throws IOException
{
    int count = 0;
    int countw = 0;
    int countl = 0;
    String s;
    String[] str;
    BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
    System.out.println(CThread.currentThread());
    while ((s = br2.readLine()) != null)
    {
        str = s.split(search);
        count = str.length - 1;
        countw += count; // word count
        if (s.contains(search))
        {
            countl++; // line count
            WriteFile(CThread, s, outfile.toString(), search);
        }
    }
    br2.close();
    in.close();
}
--------------------------------------------------------------------------------
public void WriteFile(Thread CThread, String line, String out, String search) throws IOException
{
    BufferedWriter bufferedWriter = null;
    System.out.println("writre thread" + CThread.currentThread());
    bufferedWriter = new BufferedWriter(new FileWriter(out, true));
    bufferedWriter.write(line);
    bufferedWriter.newLine();
    bufferedWriter.flush();
}
Please help me. It's really taking 40 minutes for 10 files using threads, and 15-20 minutes for a single 70 MB compressed file. Are there any ways to minimise the time?
You are reopening the file output handle for every single line you write.
This is likely to have a massive performance impact, far outweighing other performance issues. Instead I would recommend creating the BufferedWriter once (e.g. upon the first match) and then keeping it open, writing each matching line and then closing the Writer upon completion.
Also, remove the call to flush(); there is no need to flush each line as the call to Writer.close() will automatically flush any unwritten data to disk.
Finally, as a side note your variable and method naming style does not follow the Java camel case convention; you might want to consider changing it.
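A minimal sketch of that restructuring, reusing the question's br2, search and outfile and letting try-with-resources close (and flush) the writer at the end:

// Open the writer once per output file, not once per matched line.
try (BufferedWriter writer = new BufferedWriter(new FileWriter(outfile, true))) {
    String s;
    while ((s = br2.readLine()) != null) {
        if (s.contains(search)) {
            writer.write(s);
            writer.newLine();      // no per-line flush needed; close() flushes any remaining data
        }
    }
}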
I'm not sure if the cost you are seeing is from disk operations or from string manipulations. I'll assume for now that the problem is the strings, you can check that by writing a test driver that runs your code with the same line over and over.
I can tell you that split() is going to be very expensive in your case because you are producing strings you don't need and then recycling them, creating much overhead. You may want to increase the amount of space available to your JVM with -Xmx.
If you merely separate words by the presence of whitespace, then you would do much better by using a regular expression matcher that you create before the loop and apply to each string. The number of matches when applied to a given string will be your word count, and that does not create an array of strings (which is very wasteful and which you don't use). You will see in the JavaDocs that split() does work via regular expressions; that is true, but split() does the extra step of creating separate strings, and that's where your waste might be.
You can also use a regular expression to search for the match instead of contains, though that may not be significantly faster.
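A sketch of the Matcher-based counting, reusing the question's s, search, countw and countl; the pattern is compiled once before the read loop and no String[] is created per line:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile once, before the read loop. Pattern.quote() treats the search term as a literal.
Pattern searchPattern = Pattern.compile(Pattern.quote(search));

// Inside the loop, per line:
Matcher m = searchPattern.matcher(s);
int occurrences = 0;
while (m.find()) {
    occurrences++;             // occurrence count without allocating a String[]
}
countw += occurrences;
if (occurrences > 0) {
    countl++;                  // line count; replaces the separate contains() check
    // write the line to the output file ...
}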
You could make things parallel by using multiple threads. However, if split() is the cause of your grief, your problem is the overhead and running out of heap space, so you won't necessarily benefit from it.
More generally, if you need to do this a lot, you may want to write a script in a language more "friendly" to string manipulation. A 10-line script in Python can do this much faster.
Wow, what are you doing in this method?
WriteFile(CThread, s, outfile.toString(), search);
Every time you get a line containing your text, you are creating a new BufferedWriter(new FileWriter(out, true)).
Just create a BufferedWriter once in your searchString method and use that to write the lines. There is no need to open the file again and again. It will drastically improve the performance.
One problem here might be that you stop reading while you write. I would probably use one thread for reading and another thread for writing the file. As an extra optimization, the thread writing the results could buffer them in memory and write them to the file in batches, say every ten entries or so.
In the writing thread you should queue the incoming entries before handling them.
Of course, you should first figure out where that time is spent: is it the IO or something else?
There are too many potential bottlenecks in this code for anyone to be sure what the critical ones are. Therefore you should profile the application to determine what it causing it to be slow.
Armed with that information, decide whether the problem is in reading the ZIP file, doing the searching, or writing the matches to the output file.
(Repeatedly opening and closing the output file is a bad idea, but if you only get a tiny number of search hits it won't make much difference to the overall performance.)
