I would like to know the time complexity of the following code snippet,
FileReader fr = new FileReader("myfile.txt");
BufferedReader br = new BufferedReader(fr);
for (long i = 0; i < n-1; i++) {
    br.readLine();
}
System.out.println("Line content:" + br.readLine());
br.close();
fr.close();
Edit: To clarify, n is a constant number, e.g. 100000.
The complexity is O(n) but that doesn't tell you much because you don't know how much time each readLine() needs.
Calculating the complexity doesn't make much sense when individual operations have a very variable runtime behavior.
In this case, the loop itself is very cheap and will not contribute much to the runtime of the whole program. The loading from disk, on the other hand, will contribute very much to the runtime, but it's hard to say how much without statistical information about the average number of lines per file and the average length of a line.
This is a very simple case, but here's how to find the time complexity. The same method can be applied for more complex algorithms.
For the following portion of code (and regardless of the complexity of readline())
for (long i = 0; i < n-1; i++) {
    br.readLine();
}
i = 0 will be executed once, i < n-1 will be evaluated n times (it is true n-1 times and false once), i++ will be executed n-1 times, and br.readLine(); will be executed n-1 times.
This gives us 1 + n + (n-1) + (n-1) = 3n - 1 basic steps (for n = 100000, that is 299999). This is proportional to n, so the complexity is O(n).
I'm not sure what you mean by "time complexity", but it would appear that its performance is linear (AKA O(n)) in the size of the file it reads from.
The readLine() function has to scan every character of the input up to the next newline. This should be O(N), where N is the number of bytes in the first n lines (the ones you read). Using a buffered reader does not reduce the algorithmic complexity; it just reduces the number of actual IO calls needed to read a given number of bytes (a good thing, since IO calls are expensive). In this case, the only way that would change things is if the buffer's read size were much larger than the total number of bytes you were going to read.
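For instance, a larger buffer only changes the number of underlying IO calls, not the amount of scanning; a minimal sketch (the 64 KB size is an arbitrary illustration, and myfile.txt is taken from the question):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BufferSizeDemo {
    public static void main(String[] args) throws IOException {
        // A 64 KB buffer means fewer underlying read() calls than the default,
        // but readLine() still looks at every byte up to each newline,
        // so the work stays O(N) in the number of bytes read.
        try (BufferedReader br = new BufferedReader(new FileReader("myfile.txt"), 64 * 1024)) {
            String line;
            while ((line = br.readLine()) != null) {
                // process the line here
            }
        }
    }
}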
The time complexity of reading an entire file should be O(N) where N is the size of the file.
However, proving this would be difficult, given the amount of software that is involved. You have got the Java code in the main method, the Reader stack (including the Charset decoder) and the JVM. Then you have the code in the OS. Then you have to take into account file buffering in kernel memory, file system organizations, disk seek times, etcetera.
(It is not meaningful to consider just the time taken by the application code. We can safely predict that that component of the total time will be dominated by the other components.)
And, as Aaron says, the complexity measure is not going to be a reliable predictor of the actual file read time.
Related
I have a big text file that contains source and target nodes and a threshold. I store all the distinct nodes in a HashSet, then filter the edges based on the user threshold and store the filtered nodes in a separate HashSet. I want to find a way to do the processing as fast as possible.
public class Simulator {

    static HashSet<Integer> Alledgecount = new HashSet<>();
    static HashSet<Integer> FilteredEdges = new HashSet<>();

    static void process(BufferedReader reader, double userThres) throws IOException {
        String line = null;
        int l = 0;
        BufferedWriter writer = new BufferedWriter(new FileWriter("C:/users/mario/desktop/edgeList.txt"));
        // read at most 50 million lines
        while ((line = reader.readLine()) != null && l < 50_000_000) {
            String[] intArr = line.split("\\s+");
            checkDuplicate(Integer.parseInt(intArr[1]), Integer.parseInt(intArr[2]), Alledgecount);
            double threshold = Double.parseDouble(intArr[3]);
            if (threshold > userThres) {
                writeToFile(intArr[1], intArr[2], writer);
                checkDuplicate(Integer.parseInt(intArr[1]), Integer.parseInt(intArr[2]), FilteredEdges);
            }
            l++;
        }
        writer.close();
    }

    static void writeToFile(String param1, String param2, Writer writer) throws IOException {
        writer.write(param1 + "," + param2);
        writer.write("\r\n");
    }
The graph class does a BFS and writes the nodes to a separate file. I have done the processing with some functionality excluded, and the timings are below.
Timings with 50 million lines read in process():
without calling BFS(), checkDuplicates(), writeAllEdgesToFile() -> 54 s
without calling BFS(), writeAllEdgesToFile() -> 50 s
without calling writeAllEdgesToFile() -> 1 min
Timings with 300 million lines read in process():
without calling writeAllEdges() -> 5 min
Reading a file doesn't depend only on CPU cores.
IO operations on a file are limited by the physical constraints of classic disks, which, unlike CPU cores, cannot run operations in parallel.
What you could do is have one thread for IO operations and one or more threads for data processing, but that makes sense only if the data processing is long enough to justify creating a thread for it, since threads have a cost in terms of CPU scheduling.
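If you do go that route, here is a minimal sketch of the split (one IO thread, one worker thread, connected by a bounded queue; the file name, queue capacity and end-of-input marker are placeholder choices, not part of the question's code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReaderWorkerSketch {
    private static final String END_OF_INPUT = "\u0000EOF";  // sentinel marking the end of the file

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // IO thread: only reads lines and hands them off.
        Thread reader = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(new FileReader("edges.txt"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    queue.put(line);
                }
                queue.put(END_OF_INPUT);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        // Worker thread: parses and filters, so the IO thread never waits on processing.
        Thread worker = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(END_OF_INPUT)) {
                    // parse and filter the line here (placeholder for the real processing)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        worker.start();
        reader.join();
        worker.join();
    }
}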
Getting a multi-threaded Java program to run correctly can be very tricky. It needs some deep understanding of things like synchronization issues etc. Without the knowledge/experience necessary, you'll have a hard time searching for bugs that occur sometimes but aren't reliably reproducible.
So, before trying multi-threading, find out if there are easier ways to achieve acceptable performance:
Find the part of your program that takes the time!
First question: is it I/O or CPU? Have a look at Task Manager. Does your single-threaded program occupy one core (e.g. CPU close to 25% on a 4-core machine)? If it's far below that, then I/O must be the limiting factor, and changing your program probably won't help much - buy a faster HD. (In some situations, the software style of doing I/O might influence the hardware performance, but that's rare.)
If it's CPU, use a profiler, e.g. JVisualVM, which comes with the JDK, to find the method that takes most of the runtime, and think about alternatives. One candidate might be line.split("\\s+"), which uses a regular expression. Regular expressions are slow, especially if the expression isn't compiled to a Pattern beforehand - but that's nothing more than a guess, and the profiler will most probably point you to some very different place.
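If the splitting does turn out to be hot, compiling the pattern once is a cheap thing to try; a sketch under that assumption (the class and helper names are illustrative, not from the question):

import java.util.regex.Pattern;

public class SplitSketch {
    // Compiled once and reused for every line; line.split("\\s+") re-compiles
    // the pattern on each call because "\\s+" doesn't hit String.split's fast path.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String[] tokens(String line) {
        return WHITESPACE.split(line);
    }

    public static void main(String[] args) {
        String[] parts = tokens("1 2 0.75");
        System.out.println(parts.length + " tokens");  // prints "3 tokens"
    }
}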
I have huge files (4.5 GB each) and need to count the number of lines in each file that start with a given token. There can be up to 200k occurrences of the token per file.
What would be the fastest way to achieve such a huge file traversal and String detection? Is there a more efficient approach than the following implementation using a Scanner and String.startsWith()?
public static int countOccurences(File inputFile, String token) throws FileNotFoundException {
    int counter = 0;
    try (Scanner scanner = new Scanner(inputFile)) {
        while (scanner.hasNextLine()) {
            if (scanner.nextLine().startsWith(token)) {
                counter++;
            }
        }
    }
    return counter;
}
Note:
So far it looks like the Scanner is the bottleneck (i.e. if I add more complex processing than token detection and apply it on all lines, the overall execution time is more or less the same.)
I'm using SSDs so there is no room for improvement on the hardware side
Thanks in advance for your help.
A few pointers (the assumption is that the lines are relatively short and the data is really ASCII or similar):
read a huge buffer of bytes at a time (say 1/4 GB), then chop off the incomplete line at the end and prepend it to the next read
search for bytes, do not waste time converting to chars
indicate "beginning of line" by starting your search pattern with '\n', and treat the first line specially
use a high-speed search that reduces search time at the expense of pre-processing (google for "fast substring search")
if actual line numbers (rather than the lines themselves) are needed, count the lines in a separate stage
We can reduce the problem to searching for \n<token> in a byte stream. In that case, one quick way is to read a chunk of data sequentially from disk (the size is determined empirically, but a good starting point is 1024 pages) and hand that data to a different thread for processing.
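A rough single-threaded sketch of the byte-level approach described in the two answers above (the chunk size, the ASCII assumption and the requirement that the token is non-empty and newline-free are my additions):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class TokenLineCounter {

    // Counts lines that begin with `token`, scanning raw bytes instead of decoding chars.
    // Assumes a non-empty, ASCII-compatible token that contains no newline.
    static long countLinesStartingWith(String path, String token) throws IOException {
        byte[] needle = token.getBytes(StandardCharsets.US_ASCII);
        byte[] buf = new byte[1 << 20];   // 1 MB chunks; the size is an arbitrary choice
        long count = 0;
        int matched = 0;                  // token bytes matched on the current line; -1 = line ruled out

        try (FileInputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    if (b == '\n') {
                        matched = 0;               // the next byte starts a new line
                    } else if (matched >= 0) {
                        if (b == needle[matched]) {
                            if (++matched == needle.length) {
                                count++;           // the line starts with the token
                                matched = -1;      // ignore the rest of this line
                            }
                        } else {
                            matched = -1;          // prefix mismatch: this line cannot match
                        }
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLinesStartingWith(args[0], args[1]));
    }
}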
I have a very large number (its length is 200000 digits). When I want to print this number
System.out.println(myBigInt);
a lot of time is spent in the implicit call to the toString() method of the BigInteger (about 5 sec).
Do you know any way I can print a BigInteger to the console faster? Maybe I first need to convert it to a byte array, but I don't know what to do next.
I need to print a very large BigInteger to the console faster than
System.out.println(myBigInt);
By default, System.out is only line-buffered and does a lot of work related to Unicode handling. Because of its small buffer size, System.out.println() is not well suited to handling many repetitive outputs in a batch mode: each line is flushed right away. If your output is mainly ASCII-based, then removing the Unicode-related activities improves the overall execution time.
You can use the following to increase the speed:
BufferedWriter out = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(java.io.FileDescriptor.out), "ASCII"), 512);
out.write(String.valueOf(myBigInt));
out.flush();   // nothing reaches the console until the buffer is flushed
I am trying to build a map from the contents of a file, and my code is as below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
new HashMap<Integer, List<Integer>>(2000000);
try {
fr = new FileReader(pathname);
BufferedReader br = new BufferedReader(fr);
String line;
int i = 1;
while ((line = br.readLine()) != null) {
System.out.println("line number: " + i);
i++;
String[] strs = line.split("\t");
int key = Integer.parseInt(strs[0]);
int value = Integer.parseInt(strs[1]);
List<Integer> list = snsMap.get(key);
//if the follower is not in the map
if(snsMap.get(key) == null)
list = new LinkedList<Integer>();
list.add(value);
snsMap.put(key, list);
System.out.println("map size: " + snsMap.size());
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first but gets much slower when the printed information looks like this:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I tried to find the reason with the two System.out.println() calls, to judge the performance of the BufferedReader and the HashMap instead of using a Java profiler.
Sometimes it takes a while to print the map size after printing the line number, and sometimes it takes a while to print the line number after printing the map size. My question is: what makes my program slow? The BufferedReader for a big file, or the HashMap for a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing, and then it breaks with an OutOfMemoryError.
You will have to check your program with some profiling tools to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained in memory and doing excessive GC), so the guess would be that reading the file is the slower part here.
Until you have profiled, you will not know what is slow and what isn't.
Most likely, the System.out calls will show up as the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
into the if statement. You only need to put this when you created a new list. Otherwise, the put will just replace the current value with itself.
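In isolation, the change looks like this (the helper method and class name are made up for illustration; the commented line shows the equivalent Java 8 computeIfAbsent form):

import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class PutOnlyWhenNew {
    static void addEdge(Map<Integer, List<Integer>> snsMap, int key, int value) {
        List<Integer> list = snsMap.get(key);
        if (list == null) {                  // the follower is not in the map yet
            list = new LinkedList<Integer>();
            snsMap.put(key, list);           // put only when the list was just created
        }
        list.add(value);                     // an existing list is already in the map

        // Java 8+ alternative doing the same thing in one call:
        // snsMap.computeIfAbsent(key, k -> new LinkedList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> snsMap = new HashMap<>();
        addEdge(snsMap, 1, 2);
        addEdge(snsMap, 1, 3);
        System.out.println(snsMap);   // {1=[2, 3]}
    }
}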
The Java cost associated with Integer objects (and in particular with the use of Integer in the Java collections API) is largely a memory (and thus garbage collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU Trove, depending on how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU Trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1m keys + 30m values ~ 1 GB. No overhead included yet. With GNU trove TIntObjectHash<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.
The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.
A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.
If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
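For illustration, a sketch of what that rewrite might look like, assuming GNU Trove 3.x is on the classpath (TIntObjectHashMap and TIntArrayList are Trove classes; the loop body mirrors the question's, everything else is made up):

import gnu.trove.list.array.TIntArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

public class TroveSketch {
    public static void main(String[] args) {
        // Keys and values are stored as primitive ints, avoiding one Integer
        // object (and its GC cost) per key and per value.
        TIntObjectHashMap<TIntArrayList> snsMap = new TIntObjectHashMap<>(2_000_000);

        int key = 1, value = 2;
        TIntArrayList list = snsMap.get(key);
        if (list == null) {
            list = new TIntArrayList();
            snsMap.put(key, list);
        }
        list.add(value);

        System.out.println(snsMap.get(1).get(0));   // prints 2
    }
}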
The best way is to run your program with a profiler (for example, JProfiler) and see which parts are slow. Also, debug output can slow your program down, for example.
HashMap is not slow; in fact it is the fastest of the standard Map implementations. Hashtable is the thread-safe (synchronized) one among them and can therefore be slow sometimes.
Important note: close the BufferedReader and the FileReader after you read the data... this might help.
e.g. br.close();
fr.close();
Please check your system processes in the Task Manager; there may be too many processes running in the background.
Sometimes Eclipse is really resource-heavy, so try running the program from the console to check.
I had a challenge to print out multiples of 7 (non-negative) to the 50th multiple in the simplest way humanly possible using for loops.
I came up with this (Ignoring the data types)
for(int i = 0; i <= 350; i += 7)
{System.out.println(i);}
The other guy came up with this
for(int i=0;i <=50; i++)
{
System.out.println(7*i);
}
However, I feel the two code snippets could be further optimized. If they actually can, please tell me. And what are the advantages/disadvantages of one over the other?
If you really want to optimize it, do this:
System.out.print("0\n7\n14\n21\n28\n35\n42\n49\n56\n63\n70\n77\n84\n91\n98\n105\n112\n119\n126\n133\n140\n147\n154\n161\n168\n175\n182\n189\n196\n203\n210\n217\n224\n231\n238\n245\n252\n259\n266\n273\n280\n287\n294\n301\n308\n315\n322\n329\n336\n343\n350");
and it's O(1) :)
The first one technically performs fewer operations (no multiplication).
The second one is slightly more readable (50 multiples of 7 vs. multiples of 7 up to 350).
Probably can't be optimized any further.
Unless you're willing to optimize away multiple println calls by doing:
StringBuilder s = new StringBuilder();
for(int i = 0; i <= 350; i += 7) s.append(i).append(", ");
System.out.println(s.toString());
(IIRC printlns are relatively expensive.)
This is getting to the point where you gain a tiny bit of optimization at the expense of simplicity.
In theory, your code is faster since it needs one less multiplication instruction per loop iteration.
However, the multiple calls to System.out.println (and the integer-to-string conversion) will dwarf the runtime the multiplication takes. To optimize, aggregate the Strings with a StringBuilder and output the whole result (or output the result when memory becomes a problem).
However, in real-world code, this is extremely unlikely to be the bottleneck. Profile, then optimize.
The second function is the best you would get:
O(n)