I have a requirement where I have to retrieve the contents of Excel rows, do some operations, and write the responses back into the same Excel rows using a Java class.
So I decided to store the responses in memory and write them all at once.
Is that advisable, or do I have to write to the file for every response?
Please advise me on the best approach.
Please note:
The Excel file will have more than 1000 rows across three individual worksheets.
I would keep it simple: keep everything in memory and write out the complete file when finished, unless the dataset is going to be very large. But as Excel itself keeps everything in memory, you should have no problem (at least given today's computers with several GB of RAM).
Memory is inexpensive, programmers are not. :-)
Since file I/O operations are a bit expensive, it's advisable to go for a single write as you've done already, assuming each row is independent of the others. But I'd write a fixed number of rows at a time, say 100 or 150, instead of writing everything at once, because a failure on a single row could throw an exception and affect the rows already processed.
It depends on the requirements. Do the changes have to be reflected in the Excel file as soon as they are made? If yes, then you'll have no choice but to write the file to disk after each change. If there's no problem on updating the file only after all changes are applied (or a "save" operation is invoked), then storing the spreadsheet data on memory is a better idea.
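If you take the keep-it-in-memory approach, a minimal sketch with Apache POI might look like the following; the file names, the sheet layout (input in column 0, response in column 1), and the doOperation() helper are assumptions for illustration only.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.*;

public class ExcelRoundTrip {
    public static void main(String[] args) throws Exception {
        // Load the whole workbook into memory (POI keeps it there anyway)
        Workbook wb = WorkbookFactory.create(new FileInputStream("input.xlsx")); // hypothetical file name
        for (int s = 0; s < wb.getNumberOfSheets(); s++) {
            Sheet sheet = wb.getSheetAt(s);
            for (Row row : sheet) {
                Cell input = row.getCell(0);
                if (input == null) continue;
                String response = doOperation(input.toString()); // your own processing
                // Write the response into the next column of the same row, still in memory
                row.createCell(1).setCellValue(response);
            }
        }
        // Single write of the complete workbook when finished
        try (FileOutputStream out = new FileOutputStream("output.xlsx")) {
            wb.write(out);
        }
        wb.close();
    }

    private static String doOperation(String value) {
        return value.toUpperCase(); // placeholder for the real operation
    }
}
```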
We have an autosys job running in our production environment on a daily basis. It calls a shell script, which in turn calls a Java servlet. This servlet reads these files, inserts the data into two different tables, and then does some processing. The Java version is 1.6, the application server is WAS 7, and the database is Oracle 11g.
We get several issues with this process: it takes a long time, it runs out of memory, etc. Below are the details of the way we have coded this process. Please let me know if it can be improved.
When we read the file using BufferedReader, do we really get a lot of strings created in memory, one for each call to readLine()? These files contain 4-5 lakh (400,000-500,000) lines, with the records separated by newline characters. Is there a better way to read files in Java to achieve efficiency? I couldn't find one, given that the record lines in the file are of variable length.
When we insert the data, we do a batch insert with Statement/PreparedStatement, making one batch containing all the records of the file. Would breaking it into smaller batches really give better performance?
If the tables have no indexes defined nor any other constraints, and all the columns are VARCHAR type, then which operation will be faster: inserting a new row, or updating an existing row based on some matching condition?
Reading the File
It is fine using BufferedReader. The key thing here is to read a bunch of lines, then process them. After that, read another bunch of lines, and so on. An important implication is that when you process the second bunch of lines, you should no longer reference the previous bunch. This way, you ensure you don't retain memory unnecessarily. If, however, you retain references to all the lines, you are likely to run into memory issues.
If you do need to reference all the lines, you can either increase your heap size or, if many of the lines are duplicates, use the technique of intern() or something similar to save memory.
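A minimal sketch of that chunked reading pattern might look like this; the chunk size and the processChunk() helper are assumptions to be tuned for your heap and your parsing logic.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChunkedReader {
    private static final int CHUNK_SIZE = 10_000; // assumed; tune for your heap

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("records.txt"))) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE) {
                    processChunk(chunk);
                    chunk.clear(); // drop references so the strings can be garbage collected
                }
            }
            if (!chunk.isEmpty()) {
                processChunk(chunk); // last partial chunk
            }
        }
    }

    private static void processChunk(List<String> lines) {
        // placeholder: parse the records and hand them to the batch insert
    }
}
```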
Modifying the Table
It is always better to limit the size of a batch to a reasonable count. The larger the batch, the more resource pressure you put on the database end and probably on your JVM side as well.
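A minimal JDBC sketch of capped batches, assuming a hypothetical records table and a batch size of 1,000, could look like this:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsert {
    private static final int BATCH_SIZE = 1_000; // assumed; tune for your environment

    // Each record is assumed to be already parsed into its column values elsewhere
    static void insertAll(Connection conn, List<String[]> records) throws SQLException {
        String sql = "INSERT INTO records (id, value1, value2) VALUES (?, ?, ?)"; // hypothetical table
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int count = 0;
            for (String[] rec : records) {
                ps.setString(1, rec[0]);
                ps.setString(2, rec[1]);
                ps.setString(3, rec[2]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // flush a capped batch instead of one huge one
                }
            }
            ps.executeBatch(); // flush the remainder
        }
    }
}
```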
Insert or Update
If you have indexes defined, I would say updating performs better. However, if you don't have indexes, insert should be better. (You have access to the environment, perhaps you can do a test and share the result with us?)
Lastly, you can also consider using multiple threads to work on the part of 'Modifying the table' so as to improve overall performance and efficiency.
I have a large data set in the following format:
In total, there are 3687 object files, each of which contains 2,000,000 records. Each file is 42 MB in size.
Each record contains the following:
An id (Integer value)
Value1 (Integer)
Value2 (Integer)
Value3 (Integer)
The content of each file is not sorted or ordered in any way; the records appear in the order they were observed during a data collection process.
Ideally, I want to build an index for this data (indexed by the id), which would mean the following:
1. Divide the set of ids into manageable chunks.
2. Scan the files to get the data related to the current working set of ids.
3. Build the index.
4. Go over the next chunk and repeat 1, 2, 3.
To me this sounds fine, but loading 152 GB back and forth is time-consuming, and I wonder about the best possible approach, or even whether Java is actually the right language to use for such a process.
I have 256 GB of RAM and 32 cores on my machine.
Update:
Let me modify this, putting aside I/O, and assuming the file is in-memory in a byte array.
What would be the fastest possible way to decode a 42 MB object file that has 2,000,000 records, where each record contains 4 serialized Integers?
You've made a very poor choice of file format. I would convert the lot from serialized Integers to binary ints written with DataOutputStream.writeInt(), and read them with DataInputStream.readInt(). With buffered streams underneath in both cases. You will save masses of disk space, which will therefore save you I/O time as well, and you also save all the serialization overhead time. And change your collection software to use this format in future. The conversion will take a while, but it only happens once.
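A minimal sketch of that one-off conversion, assuming each record in the old format was written as four individual Integer objects with ObjectOutputStream (the file names and record layout are assumptions), might look like this:

```java
import java.io.*;

public class ConvertToRawInts {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                 new BufferedInputStream(new FileInputStream("records.obj")));   // old serialized format (assumed)
             DataOutputStream out = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream("records.bin")))) {
            for (int rec = 0; rec < 2_000_000; rec++) {          // 2,000,000 records per file
                for (int field = 0; field < 4; field++) {        // id, value1, value2, value3
                    Integer value = (Integer) in.readObject();   // assumes Integers were written one by one
                    out.writeInt(value);                         // 4 bytes each in the new format
                }
            }
        }
    }
}
```

Reading the new format back is then a straight loop of DataInputStream.readInt() calls over a buffered stream, with no serialization overhead.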
Or else use a database as suggested, again with native ints rather than serialized objects.
So, what I would do is just load up each file and store the id into some sort of sorted structure - std::map perhaps [or Java's equivalent, but given that it's probably about 10-20 lines of code to read in the filename and then read the contents of the file into a map, close the file and ask for the next file, I'd probably just write the C++ to do that].
I don't really see what else you can/should do, unless you actually want to load it into a dbms - which I don't think is at all unreasonable of a suggestion.
Hmm... it seems the better way of doing it is to use some kind of DBMS. Load all your data into a database, and you can leverage its indexing, storage, and querying facilities. Of course, this depends on what your requirements are, and whether or not a DBMS solution suits them.
Given that your available memory is greater than your dataset and you want very high performance, have you considered Redis? It's well suited to operations on simple data structures, and its performance is very fast.
Just be a bit careful about letting Java do default serialization when storing values. I've previously run into issues with my primitives getting autoboxed prior to serialization.
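A minimal sketch using the Jedis client (an assumption; any Redis client would do), storing each record as a hash of plain strings so no Java serialization or autoboxing is involved; the key scheme and field names are hypothetical:

```java
import redis.clients.jedis.Jedis;

public class RedisLoader {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {   // assumed local Redis instance
            // One record: id plus three integer values, stored as hash fields
            int id = 42, value1 = 1, value2 = 2, value3 = 3;
            String key = "record:" + id;                     // hypothetical key scheme
            jedis.hset(key, "value1", Integer.toString(value1));
            jedis.hset(key, "value2", Integer.toString(value2));
            jedis.hset(key, "value3", Integer.toString(value3));

            // Lookup by id is then a single hash read
            String v1 = jedis.hget(key, "value1");
            System.out.println("value1 for id " + id + " = " + v1);
        }
    }
}
```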
If I have a CSV file, is it faster to keep the file as plain text or to convert it to some other format? (For searching.)
In terms of searching a CSV file, what is the fastest method of retrieving a particular row (by key)? I'm not referring to sorting the file; what I mean is looking up an arbitrary key in the file.
Some updates:
the file will be read-only
the file can be read and kept in memory
There are several things to consider for this:
What kind of data do you store? Does it actually make sense to convert it to a binary format? Will the binary format take up less space (the time it takes to read the file depends on its size)?
Do you have multiple queries for the same file, while the system is running, or do you have to load the file each time someone does a query?
Do you need to efficiently transfer the file between different systems?
All these factors are very important for a decision. The common case is that you only need to load the file once and then do many queries. In that case it hardly matters what format you store the data in, because it will be stored in memory afterwards anyway. Spend more time thinking about good data structures to handle the queries.
Another common case is that you cannot keep the main application running, and hence you cannot keep the file in memory. In that case, get rid of the file and use a database. Any database you can use will most likely be faster than anything you could come up with. However, it is not as easy to transfer a database between systems.
Most likely though, the file format will not be a real issue to consider. I've read quite a few very long CSV files and most often the time it took to read the file was negligible compared to what I needed to do with the data afterwards.
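If the file fits in memory and you query it many times, a minimal sketch is to parse the CSV once into a HashMap keyed by the lookup column; the file name, the key column index, and the naive comma split are assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvIndex {
    public static void main(String[] args) throws IOException {
        Map<String, String[]> byKey = new HashMap<>();

        // Load the CSV once; column 0 is assumed to be the key
        try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");       // naive split; no quoted fields assumed
                byKey.put(fields[0], fields);
            }
        }

        // Every subsequent lookup is a constant-time hit on the in-memory map
        String[] row = byKey.get("some-key");            // hypothetical key
        System.out.println(row == null ? "not found" : String.join(",", row));
    }
}
```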
If you have too much data and this is a production-level system, then use Apache Lucene.
If it's a small dataset or this is about learning, then read up on suffix trees and tries.
"Convert" it (i.e. import it) into a database table (or preferably normalised tables) with indexes on the searchable columns and a primary key on the column that has the highest cardinality - no need to re-invent the wheel... you'll save yourself a lot of issues - transaction management, concurrency.... really - if it will be in production, the chance that you will want to keep it in csv format is slim to zero.
If the file is too large to keep in memory, then just keep the keys in memory. Some number of rows can also be kept in memory, with the least-recently-accessed rows paged out as additional rows are needed. Use seeks (directed by the keys) into the file to find the row in the file itself, then load that row into memory in case other entries on that row might be needed.
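A minimal Java sketch of that key-to-offset idea using RandomAccessFile, assuming one record per line with the key in the first comma-separated column (file name and key are hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class OffsetLookup {
    public static void main(String[] args) throws IOException {
        Map<String, Long> offsets = new HashMap<>();

        try (RandomAccessFile file = new RandomAccessFile("data.csv", "r")) {
            // First pass: keep only keys and their byte offsets in memory
            long pos = file.getFilePointer();
            String line;
            while ((line = file.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));  // column 0 assumed to be the key
                offsets.put(key, pos);
                pos = file.getFilePointer();
            }

            // Later: seek straight to the row for a given key
            Long offset = offsets.get("some-key");                  // hypothetical key
            if (offset != null) {
                file.seek(offset);
                System.out.println(file.readLine());
            }
        }
    }
}
```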
I have a file where I store some data; this data should be used by every mapper for some calculations.
I know how to read the data from the file, and this can be done inside the mapper function. However, this data is the same for every mapper, so I would like to store it somewhere (a variable) before the mapping process begins and then use the contents in the mappers.
If I do this in the map function and have, for example, a file with 10 lines as input, then the map function will be called 10 times, correct? So if I read the file contents in the map function, I will read it 10 times, which is unnecessary.
Thanks in advance.
Because many of your Mappers run inside different JVMs (and possibly on different machines), you cannot read the data into your application once prior to submitting it to Hadoop. However, you can use the Distributed Cache to "Distribute application-specific large, read-only files efficiently."
As per that link: "Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves."
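A minimal sketch with the newer mapreduce API, where the shared file is registered on the Job once and read once per mapper task in setup(); the HDFS path, the "#lookup" alias, and the record handling are assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SharedDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Driver side, done once per job (hypothetical path):
    //   job.addCacheFile(new URI("/shared/lookup.txt#lookup"));

    private final List<String> sharedLines = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // setup() runs once per mapper task, not once per input line,
        // so the shared file is read only once per task here.
        // The "#lookup" fragment above is assumed to create a local symlink named "lookup".
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sharedLines.add(line);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use sharedLines together with the current input line
        context.write(new Text(value.toString()), new Text("shared lines = " + sharedLines.size()));
    }
}
```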
If I understand right, you want to call only 1 function to read all the lines in a file. Assuming yes, here is my view on it.
The mapper allows you to read one line at a time for safety's sake, so that you can control how many lines of input to read, and this takes a certain amount of memory. For example, what if the file is large, say 1 GB in size? Are you willing to read all of its contents? That would take up a considerable amount of memory and have an impact on performance.
This is the safety aspect that I mentioned earlier.
My conclusion is that there is no Mapper function that reads all the contents of a file.
Do you agree?
I have a file of size 2 GB which has student records in it. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What is the most efficient and fastest way of doing this using the Java I/O API and threads without running into memory issues? The max heap size for the JVM is set to 512 MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: Read the file line by line, parse the line, check your filter criterion, if matched, output a result line, then go to the next line, until the file is done. This is very memory efficient, as you only have the current line (or a buffer a little larger) loaded at the same time. Your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to do the following (a minimal sketch follows the list):
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
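Here is a sketch of those steps, assuming the records are one-per-line text with a comma-delimited attribute to filter on; the file names, the record layout, and the matches() criterion are assumptions:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class StudentFilter {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("students.txt"));      // 2 GB input (assumed name)
             BufferedWriter out = new BufferedWriter(new FileWriter("filtered.txt"))) {   // result file
            String line;
            while ((line = in.readLine()) != null) {              // only one line in memory at a time
                String[] fields = line.split(",");                // assumed record layout
                if (matches(fields)) {                            // your filter criterion
                    out.write(line);                              // output order follows input order
                    out.newLine();
                }
            }
        }
    }

    private static boolean matches(String[] fields) {
        // placeholder: check whatever attributes you need, e.g. the third column
        return fields.length > 2 && "2014".equals(fields[2]);
    }
}
```

Because only the current line is held in memory, this stays well within a 512 MB heap and preserves the original record order.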
On one of my test systems with extremely modest hardware, BufferedInputStream around a FileInputStream out of the box read about 500 MB in 25 seconds, i.e. probably under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state of the art hardware it's quite possible the time would be halved.
Whether you need to go to a lot of effort to reduce the 2/3 minutes or just go for a wee while you're waiting for it to run is a decision that you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to this that don't automatically mean database).
2GB for a file is huge, you SHOULD go for a db.
If you really want to use Java I/O API, then try out this: Handling large data files efficiently with Java and this: Tuning Java I/O Performance
I think you should use memory-mapped files. This lets you map the bigger file into a smaller amount of memory. It acts like virtual memory, and as far as performance is concerned, memory-mapped files are faster than stream read/write.
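A minimal sketch of memory-mapping the file with NIO; the file name and the line-counting placeholder are assumptions, and a single map() call is limited to about 2 GB, so a larger file would need to be mapped in windows:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedScan {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("students.txt"), StandardOpenOption.READ)) {
            long size = channel.size();
            // Map at most Integer.MAX_VALUE bytes; the OS pages the file in on demand
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(size, Integer.MAX_VALUE));

            // Placeholder scan: count newline bytes, i.e. records
            long lines = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    lines++;
                }
            }
            System.out.println("lines: " + lines);
        }
    }
}
```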