I've written a program for class which takes in data from a URL, parses it for key phrases and then writes to a text file the phrase, line number, and column number.
Currently I am doing this as a single operation where the URL is fed to a BufferedReader for reading, to a Scanner for Parsing and then into a loop where each line is combed through and a series of conditional statements are used to check for the presence of said key phrases. When a match is found I write to a file.
The file read is about 60K lines of text and it takes about 4000ms on average to run this full operation from start to finish. Would it be more efficient to break apart the tasks and first read through the file into a Data Structure and then output the results to the file instead of doing both at the same time?
Also, how big an impact would pulling the data from the URL have vs. reading it locally? I have the option to do both, but I figure this would depend on my broadband speed.
EDIT: Somewhat of a nice test case. Over the week we've changed our ISP and upgraded our broadband speed from 6Mb/sec to 30Mb/sec. This has brought my average read/parse/write time down to 1500ms. Interesting to see how much a change in connection speed affects performance.
This depends on the way you implement parallelism in your data crunching part.
At the moment you sequentially read everything - then crunch the data - then write it. So even if you broke it into 3 threads each one depends on the result of the previous.
So unless you start processing the data before it is fully received, this would not make a difference but only add overhead.
You would have to model a producer/consumer like flow where e.g. lines are read individually and then put on a work queue for processing. Same for processed lines which are then put on a queue to be written to a file.
This would allow parallel read / process / write actions to take place.
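As a rough illustration of that producer/consumer flow, here is a minimal sketch using two bounded queues. The class name PhrasePipeline, the output format "phrase,line,column", and the queue size of 1024 are assumptions made for the example, not details from the question. Error and interrupt handling is deliberately simplified.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PhrasePipeline {
    /** A line of input together with its 1-based line number. */
    static final class Line {
        final int number;
        final String text;
        Line(int number, String text) { this.number = number; this.text = text; }
    }

    // "Poison pill" markers, compared by identity, that signal the end of each queue.
    private static final Line END_OF_LINES = new Line(-1, "");
    private static final String END_OF_RESULTS = new String("<end>");

    /** Reads lines from source, finds every occurrence of a non-empty phrase, writes results to sink. */
    static void run(Reader source, Writer sink, String phrase)
            throws IOException, InterruptedException {
        BlockingQueue<Line> lines = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> results = new ArrayBlockingQueue<>(1024);

        // Stage 1: read lines as they arrive and queue them for matching.
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(source)) {
                String text;
                int number = 0;
                while ((text = in.readLine()) != null) {
                    lines.put(new Line(++number, text));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                putQuietly(lines, END_OF_LINES);
            }
        });

        // Stage 2: search each line and queue one result per match (1-based column).
        Thread matcher = new Thread(() -> {
            try {
                for (Line line = lines.take(); line != END_OF_LINES; line = lines.take()) {
                    for (int col = line.text.indexOf(phrase); col >= 0;
                             col = line.text.indexOf(phrase, col + 1)) {
                        results.put(phrase + "," + line.number + "," + (col + 1));
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                putQuietly(results, END_OF_RESULTS);
            }
        });

        reader.start();
        matcher.start();

        // Stage 3: the calling thread writes results while the other stages keep working.
        for (String r = results.take(); r != END_OF_RESULTS; r = results.take()) {
            sink.write(r);
            sink.write('\n');
        }
        sink.flush();
        reader.join();
        matcher.join();
    }

    private static <T> void putQuietly(BlockingQueue<T> queue, T item) {
        try {
            queue.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a single matcher thread and FIFO queues the results come out in input order. Whether this beats the sequential version depends on how much time is spent waiting on the network compared to matching, so it is worth measuring both.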
Btw - you are probably limited mostly by the speed of reading the file from a URL, since all other steps happen locally and are orders of magnitude faster.
Related
I want to benchmark two libraries to evaluate them, using nearly identical code.
But the issue is that Java needs some time to warm up.
Do you have any idea how to properly set up a benchmark, especially for "read/write" operations, with Java?
It's hard for me to grasp how Java "warms up" or "caches" input data read via streams.
My Usecase:
I read a template file. And fill it.
What I did:
Measuring the time each library takes to read a template and fill a document.
The issue I stumbled upon
In the first iteration the first library is significantly faster than the second one.
But when I do like 1000+ iterations, they come out really close.
I read the same template file multiple times, which could be an issue.
Do you have any suggestions to create a realistic benchmark?
Because there is no use case where 1000 documents are generated at once,
only one per "workflow". But I need to take into account that the JVM optimizes at runtime, not only at initialisation, and that it "is warmed up".
I also need to know for how long it stays warmed up: since the use case never includes multiple files at once, a "1000 iterations" scenario is not realistic.
One is sometimes faced with the task of parsing data stored in files on the local system. A significant dilemma is whether to load and parse all of the file data at the beginning of the program run, or access the file throughout the run and read data on-demand (assuming the file is sorted, so a lookup can use binary search and run in logarithmic time).
When it comes to small data sets, the first approach seems favorable, but with larger ones the threat of clogging up the heap increases.
What are some general guidelines one can use in such scenarios?
That's the standard tradeoff in programming - memory vs performance, Space–time tradeoff etc. There is no "right" answer to that question. It depends on the memory you have, speed you need, size of files, how often you query them etc.
In your specific case and since it seems like a one time job (if you are able to read it in the beginning) then it probably won't matter that much ;)
That depends entirely on what your program needs to do. The general advice is to keep only as much data in memory as is necessary. For example, consider a simple program that reads each record from a file of transactions, and then reports the total number of transactions and the total dollar amount:
count = 0
dollars = 0
while not end of file
    read record
    parse record
    increment count
    add transaction amount to dollars
end
output count and dollars
Here, you clearly need to have only one transaction record in memory at a time. So you read a record, process it, and discard it. It makes no sense to load all of the records into a list or other data structure, and then iterate over the list to get the count and total dollar amount.
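In Java, that read-process-discard loop might look like the sketch below. The record layout ("id,amount", one record per line) and the class name TransactionTotals are assumptions made for illustration, since the answer does not specify a file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.math.BigDecimal;

class TransactionTotals {
    final long count;
    final BigDecimal dollars;

    TransactionTotals(long count, BigDecimal dollars) {
        this.count = count;
        this.dollars = dollars;
    }

    /** Streams records one at a time; only the current line is held in memory. */
    static TransactionTotals summarize(Reader source) throws IOException {
        long count = 0;
        BigDecimal dollars = BigDecimal.ZERO;
        try (BufferedReader in = new BufferedReader(source)) {
            String record;
            while ((record = in.readLine()) != null) {
                if (record.isEmpty()) {
                    continue; // skip blank lines
                }
                // Assumed layout: everything after the first comma is the amount.
                String amount = record.substring(record.indexOf(',') + 1).trim();
                dollars = dollars.add(new BigDecimal(amount));
                count++;
            }
        }
        return new TransactionTotals(count, dollars);
    }
}
```

Memory use stays flat no matter how large the file is, because each record becomes garbage as soon as the next one is read.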
In some cases you do need multiple records, perhaps all of them, in memory. In those cases, all you do is re-structure the program a little bit. You keep the reading loop, but have it add records to a list. Then afterwards you can process the list:
list = []
while not end of file
    read record
    parse record
    add record to list
end
process list
output results
It makes no sense to load the entire file into a list and then scan the list sequentially to obtain the count and dollar amount. That uses memory to no gain, makes the program more complex and slower, and will fail with large data sets. The "memory vs performance" tradeoff doesn't always apply. Often, as in this case, using more memory makes the program slower.
I generally find it a good practice to structure my solutions so that I keep as little data in memory as is practical. If the solution is simpler with sorted data, for example, I'll make sure that the input is sorted before I run the program.
That's the general advice. Without specific examples from you, it's hard to say what approach would be preferred.
Here is the description of the problem:
I have a large number of small log files in a directory, assuming:
all files follow the naming convention: yyyy-mm-dd.log, for example: 2013-01-01.log, 2013-01-02.log .
there are roughly 1,000,000 small files.
the combined size for all the files is several terabytes.
Now I have to prepend a line number to each line in each file, and the line number is cumulative across all files in the folder (files are ordered by timestamp). For example:
in 2013-01-01.log, line number from 1~2500
in 2013-01-02.log, line number from 2501~7802
...
in 2016-03-26.log, line number from 1590321~3280165
All the files are overwritten to include the line number.
The constraints are:
the storage device is an SSD and can handle multiple IO requests simultaneously.
the CPU is powerful enough.
the total memory you can use is 100MB.
try to maximize the performance of the application.
implement and test in Java.
After thinking and searching, here is the best solution I've thought of. The code is a little long, so I just give a brief description of each step:
count the number of lines of each file concurrently and save the mapping to a ConcurrentSkipListMap, the key is the file name, the value is the number of lines of the file, and the key is ordered.
compute the start line number of each file by traversing the ConcurrentSkipListMap, for example, if the start line number and line count of 2013-01-01.log are 1 and 1500 respectively, then the start line number of 2013-01-02.log is 1501.
prepend line number to each line of each file: read line by line of each file using BufferedReader, prepend line number and then write to a corresponding tmp file using BufferedWriter. Create a thread pool and process concurrently.
rename back all the tmp files to the original name concurrently using the thread pool.
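The second step above is a running sum over the ordered map. A minimal sketch of it follows. The class and method names are made up for the example, and it accepts any SortedMap (ConcurrentSkipListMap is one).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;

class StartLines {
    /**
     * Given line counts keyed by file name (iterated in key order),
     * returns the first cumulative line number for each file.
     */
    static Map<String, Long> compute(SortedMap<String, Long> lineCounts) {
        Map<String, Long> starts = new LinkedHashMap<>();
        long next = 1; // numbering starts at 1
        for (Map.Entry<String, Long> entry : lineCounts.entrySet()) {
            starts.put(entry.getKey(), next);
            next += entry.getValue();
        }
        return starts;
    }
}
```

Because the yyyy-mm-dd.log names sort lexicographically in date order, the map's natural key order is already the order the files need to be numbered in.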
I've tested the program on my MBP, step 1 and step 3 are bottlenecks as expected.
Do you have a better solution, or some optimization of my solution? Thanks in advance!
Not sure if this question fits the SO model of Q&A, but I'll try to give some hints towards an answer.
Fact 1) Given 1M files and 100MB limit, there is nearly no way to keep information for all files in memory at the same time. Except potentially by doing a lot of bit fiddling like in the old days when we programmed in C.
Fact 2) I don't see a way to get around reading all files once to count the line numbers and then rewrite them all, which means to read them all again.
A) Is this a homework question? There may be a way to produce the file names from a folder lazily, one by one, in Java 7 or 8, but I am not aware of it. If there is, use it. If not, you might need to generate the file names instead of listing them. This would require that you can insert a start and an end date as input. Not sure if this is possible.
B) Given there is a lazy Iterator<File>, whether from the jdk to list files or self implemented to generate file names, get N of them to partition the work to N threads.
C) Now each thread takes care of its slice of files, reads them and keeps only the total number of lines of its slice.
D) From the totals for each slice compute the starting number for each slice.
E) Distribute iterators over N threads again to do the line numbering. Rename a tmp file immediately after it was written, don't wait for everything to finish as to not having to iterate over all files again.
At each point in time, the information kept in memory is rather small: one file name per thread, a line count over the whole slice, the current line of a file being read. 100MB is more than enough for this, if N is not outrageously large.
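Step E for a single file could be sketched as below. The ".tmp" suffix, the "number, space, line" output format, and UTF-8 encoding are assumptions for the example. Each worker thread would call this for the files in its slice, passing the returned value on as the first number of the next file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class LineNumberer {
    /**
     * Prepends cumulative line numbers to every line of the file and
     * returns the next unused line number.
     */
    static long numberFile(Path file, long firstNumber) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        long next = firstNumber;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(Long.toString(next++));
                out.write(' ');
                out.write(line);
                out.newLine();
            }
        }
        // Replace the original right away instead of renaming in a later pass.
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        return next;
    }
}
```

Only the current line and two stream buffers per thread are in memory at any time, which fits comfortably in the 100MB budget for a moderate number of threads.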
EDIT: Some say that Files.find() is lazily populated, yet I could not easily find the code behind it (some DirectoryStream in Java 8) to see if the laziness pertains only to reading the full contents of one folder at a time, or whether indeed one file name is read at a time. Or whether this even depends on the file system used.
We have an autosys job running in our production environment on a daily basis. It calls a shell script which in turn calls a Java servlet. This servlet reads these files and inserts the data into two different tables and then does some processing. The Java version is 1.6, the application server is WAS7 and the database is Oracle 11g.
We get several issues with this process like it takes time, goes out of memory etc etc. Below are the details of the way we have coded this process. Please let me know if it can be improved.
When we read the file using BufferedReader, do we really get a lot of strings created in memory, as returned by the readLine() method of BufferedReader? These files contain 4-5 lakh (400,000-500,000) lines. All the records are separated by a newline character. Is there a better way to read files in Java to achieve efficiency? I couldn't find any, given that all the record lines in the file are of variable length.
When we insert the data then we are doing a batch process with statement/prepared statement. We are making one batch containing all the records of the file. Does it really matter to break the batch size to have better performance?
If the tables have no indexes defined nor any other constraints and all the columns are VARCHAR type, then which operation will be faster:- inserting a new row or updating an existing row based upon some matching condition?
Reading the File
Using BufferedReader is fine. The key thing here is to read a bunch of lines, then process them. After that, read another bunch of lines, and so on. An important implication is that when you process the second bunch of lines, you no longer reference the previous bunch. This way, you ensure you don't retain memory unnecessarily. If, however, you retain references to all the lines, you are likely to run into memory issues.
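A minimal sketch of that read-a-bunch, process, forget loop is shown below. The class name ChunkedReader and the callback shape are assumptions for the example. It also uses try-with-resources and java.util.function.Consumer, so it needs Java 8 or later. On the Java 1.6 runtime mentioned in the question, the same loop would need an explicit finally block and a small hand-written callback interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ChunkedReader {
    /**
     * Passes the input to handler in lists of at most chunkSize lines
     * and returns the total number of lines read.
     */
    static long forEachChunk(Reader source, int chunkSize, Consumer<List<String>> handler)
            throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long total = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    handler.accept(chunk);
                    total += chunk.size();
                    // Start a fresh list so the processed lines can be garbage collected.
                    chunk = new ArrayList<>(chunkSize);
                }
            }
            if (!chunk.isEmpty()) { // final partial chunk
                handler.accept(chunk);
                total += chunk.size();
            }
        }
        return total;
    }
}
```

The handler is a natural place to run one database batch per chunk, for example by calling addBatch() for each line and executeBatch() once at the end of the chunk, so that neither the JVM nor the database ever holds the whole file at once.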
If you do need to reference all the lines, you can either increase your heap size or, if many of the lines are duplicates, use the technique of intern() or something similar to save memory.
Modifying the Table
It is always better to limit the size of a batch to a reasonable count. The larger the batch, the more strain you put on the database's resources, and probably on your JVM's as well.
Insert or Update
If you have indexes defined, I would say updating performs better. However, if you don't have indexes, insert should be better. (You have access to the environment, perhaps you can do a test and share the result with us?)
Lastly, you can also consider using multiple threads to work on the part of 'Modifying the table' so as to improve overall performance and efficiency.
I have a file of size 2GB which has student records in it. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What's the fastest and most efficient way of doing this using the Java IO API and threads without having memory issues? The max heap size for the JVM is set to 512MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: Read the file line by line, parse the line, check your filter criterion, if matched, output a result line, then go to the next line, until the file is done. This is very memory efficient, as you only have the current line (or a buffer a little larger) loaded at the same time. Your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
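The steps above can be sketched as follows, assuming the records are lines of text. The class name RecordFilter and the predicate parameter are illustrative. Character streams (BufferedReader/BufferedWriter) are used here rather than BufferedInputStream because the example treats each line as one record.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.function.Predicate;

class RecordFilter {
    /**
     * Copies every record accepted by wanted from source to sink,
     * preserving the original order, and returns how many were kept.
     */
    static long filter(Reader source, Writer sink, Predicate<String> wanted)
            throws IOException {
        long kept = 0;
        try (BufferedReader in = new BufferedReader(source);
             BufferedWriter out = new BufferedWriter(sink)) {
            String record;
            while ((record = in.readLine()) != null) {
                if (wanted.test(record)) {
                    out.write(record);
                    out.newLine();
                    kept++;
                }
            }
        }
        return kept;
    }
}
```

For real files the reader and writer could come from Files.newBufferedReader and Files.newBufferedWriter. Because only one record is held at a time, the 512MB heap limit is not a concern for this approach.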
On one of my test systems with extremely modest hardware, BufferedInputStream around a FileInputStream out of the box read about 500 MB in 25 seconds, i.e. probably under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state of the art hardware it's quite possible the time would be halved.
Whether you need to go to a lot of effort to reduce the 2-3 minutes or just go for a wee while you're waiting for it to run is a decision that you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to this that don't automatically mean database).
2GB for a file is huge, you SHOULD go for a db.
If you really want to use Java I/O API, then try out this: Handling large data files efficiently with Java and this: Tuning Java I/O Performance
I think you should use memory-mapped files. This lets you map a bigger file into a smaller amount of memory. It acts like virtual memory, and as far as performance is concerned, mapped files can be faster than stream reads and writes.
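For reference, a minimal sketch of reading a file through memory mapping is shown below. It simply counts newline bytes. The class name and the window size parameter are assumptions for the example. A single mapping in Java cannot exceed Integer.MAX_VALUE bytes, so a file of 2GB or more has to be mapped in windows. Whether this is actually faster than a buffered stream depends on the workload and platform, so it is worth measuring.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedLineCounter {
    /**
     * Counts '\n' bytes by mapping the file one window at a time.
     * windowSize must be between 1 and Integer.MAX_VALUE.
     */
    static long countNewlines(Path file, long windowSize) throws IOException {
        if (windowSize <= 0 || windowSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("windowSize out of range");
        }
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += windowSize) {
                long length = Math.min(windowSize, size - pos);
                // The mapped pages live outside the Java heap and are paged in by the OS.
                MappedByteBuffer window =
                        channel.map(FileChannel.MapMode.READ_ONLY, pos, length);
                while (window.hasRemaining()) {
                    if (window.get() == '\n') {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

Records that straddle a window boundary need extra handling when parsing real data, which is part of why the plain line-by-line approach is usually the simpler starting point.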