If I have a CSV file, is it faster to keep the file as plain text or to convert it to some other format? (for searching)
In terms of searching a CSV file, what is the fastest method of retrieving a particular row (by key)? I'm not referring to sorting the file, sorry; what I mean is looking up an arbitrary key in the file.
Some updates:
the file will be read-only
the file can be read and kept in memory
There are several things to consider for this:
What kind of data do you store? Does it actually make sense to convert it to a binary format? Will a binary format take up less space (the time it takes to read the file depends on its size)?
Do you have multiple queries for the same file while the system is running, or do you have to load the file each time someone does a query?
Do you need to efficiently transfer the file between different systems?
All these factors are very important for a decision. The common case is that you only need to load the file once and then do many queries. In that case it hardly matters what format you store the data in, because it will be stored in memory afterwards anyway. Spend more time thinking about good data structures to handle the queries.
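For that read-once, query-many case, here is a minimal sketch of loading the CSV into a HashMap keyed on the lookup column. The file name, the assumption that the key is the first column, and the naive comma split are all illustrative, not from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class CsvLookup {
    public static void main(String[] args) throws IOException {
        // Load the whole CSV once, keyed by the first column (illustrative file name).
        Map<String, String[]> byKey = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");   // naive split; no quoted commas handled
                byKey.put(fields[0], fields);
            }
        }
        // Every subsequent lookup is a single hash probe, regardless of the on-disk format.
        String[] row = byKey.get("some-key");
        System.out.println(row == null ? "not found" : String.join(",", row));
    }
}
```

Once the map is built, the file format only affects the one-time load, not the per-query cost.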
Another common case is that you cannot keep the main application running, and hence you cannot keep the file in memory. In that case, get rid of the file and use a database. Any database you can use will most likely be faster than anything you could come up with. However, it is not easy to transfer a database between systems.
Most likely though, the file format will not be a real issue to consider. I've read quite a few very long CSV files and most often the time it took to read the file was negligible compared to what I needed to do with the data afterwards.
If you have a lot of data and it is a production-level system, then use Apache Lucene.
If it's a small dataset, or this is about learning, then read up on suffix trees and tries.
"Convert" it (i.e. import it) into a database table (or preferably normalised tables) with indexes on the searchable columns and a primary key on the column that has the highest cardinality - no need to re-invent the wheel... you'll save yourself a lot of issues - transaction management, concurrency.... really - if it will be in production, the chance that you will want to keep it in csv format is slim to zero.
If the file is too large to keep in memory, then just keep the keys in memory. Some number of rows can also be kept in memory, with the least-recently-accessed rows paged out as additional rows are needed. Use seeks (directed by the keys) to find the row in the file itself, then load that row into memory in case other entries in that row are needed.
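A rough sketch of that key-to-offset idea, assuming the key is the first CSV column, the file is read-only, and every line contains at least one comma (class name and layout are made up for illustration):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class OffsetIndexedCsv implements AutoCloseable {
    private final RandomAccessFile file;
    private final Map<String, Long> offsetByKey = new HashMap<>();

    public OffsetIndexedCsv(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long offset = file.getFilePointer();
        String line;
        // One pass over the file: remember where each row starts, keyed by column 0.
        while ((line = file.readLine()) != null) {
            String key = line.substring(0, line.indexOf(','));  // assumes a comma on every line
            offsetByKey.put(key, offset);
            offset = file.getFilePointer();
        }
    }

    /** Seek straight to the row for the given key, or return null if it is absent. */
    public String row(String key) throws IOException {
        Long offset = offsetByKey.get(key);
        if (offset == null) return null;
        file.seek(offset);
        return file.readLine();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```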
Related
I need to check if the password that a user entered is contained in a 10k-line .txt file that is stored locally on my computer. I've been asked to do this for a college project and they've been very emphatic about achieving this in an efficient manner, not taking too long to find the match.
The thing is that, reading the file line by line with a BufferedReader, the match is found almost instantly.
I've tested it in two computers, one with an ssd and the other one with an hdd and I cannot tell the difference.
Am I missing something? Is there another, more efficient way to do it? For example, I could load the file or chunks of the file into memory, but is it worth it?
10k passwords isn't all that much and should easily fit in RAM. You can read the file into memory when your application starts and then only access the in-memory structure. The in-memory structure could even be organized for more efficient lookup (e.g. using a HashMap or HashSet), or sorted in memory for a one-time cost of O(n log n) to enable binary-searching the list (10k items can be searched in at most 14 steps). Or you could use even fancier data structures such as a Bloom filter.
Just keep in mind: when you write "it is almost instant", then it probably already is efficient enough. (Again, 10k passwords isn't all that much, probably the file is only ~100kB in size)
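A minimal sketch of the in-memory HashSet approach described above (class name and file path are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PasswordCheck {
    private final Set<String> passwords = new HashSet<>();

    public PasswordCheck(String path) throws IOException {
        // Read the 10k-line file once at startup; ~100 kB easily fits in memory.
        passwords.addAll(Files.readAllLines(Paths.get(path)));
    }

    /** O(1) average-case lookup against the in-memory set. */
    public boolean contains(String candidate) {
        return passwords.contains(candidate);
    }
}
```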
My application needs to use data in a text file which is up to 5 GB in size. I cannot load all of this data into RAM as it is far too large.
The data is stored like a table: 5 million records (rows) and 40 columns, each containing text that will be converted in memory to strings, ints, or doubles.
I've tried caching only 10-100 MB of data in memory and reloading from the file when I need data outside that range, but it is way too slow! When I run calculations, because I can randomly jump to any row within the table, it constantly needs to open the file, read, and close it.
I need something fast; I was thinking of using some sort of DB. I know calculations with large data like this may take a while, which is fine. If I do use a DB, it needs to be set up on launch of the desktop application and not require some sort of server component to be installed beforehand.
Any tips? Thanks
I think you need to clarify some things:
Is this a desktop application (I assume yes)? What is the memory limit for it?
Do you use your file in read-only mode?
What kind of calculations are you trying to do? (How often are random rows accessed, how often are consecutive rows read, do you need to modify data?)
Currently I see two ways for further investigation:
Use SQLite. This is a small, single-file DB, oriented mainly at desktop applications and single-user use. It doesn't require any server; all you need is the appropriate JDBC library (a small sketch follows below).
Create some kind of index, using, for example, a binary tree. The first time you read your file, index the start position of each row within the file. In conjunction with a permanently open random-access file, this will let you seek to and read the desired row quickly. For a binary tree, your index may be approximately 120 MB (it's RowsCount * 2 * IndexValueSize for a binary tree).
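As a rough illustration of the SQLite option, assuming the xerial sqlite-jdbc driver is on the classpath (table and column names are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SqliteExample {
    public static void main(String[] args) throws SQLException {
        // Single-file database living next to the application; no server needed.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:records.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, c1 TEXT, c2 REAL)");
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT OR REPLACE INTO records VALUES (?, ?, ?)")) {
                insert.setInt(1, 42);
                insert.setString(2, "example");
                insert.setDouble(3, 3.14);
                insert.executeUpdate();
            }
            // Random access by key becomes an indexed lookup instead of a file scan.
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT c1, c2 FROM records WHERE id = ?")) {
                query.setInt(1, 42);
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("c1") + " / " + rs.getDouble("c2"));
                    }
                }
            }
        }
    }
}
```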
You can use an embedded database, you can find a comparison here: Java Embedded Databases Comparison.
Or, depending on your use case you may even try to use Lucene which is a full text search engine.
I have a large data set in the following format:
In total, there are 3687 object files, each of which contains 2,000,000 records. Each file is 42 MB in size.
Each record contains the following:
An id (Integer value)
Value1 (Integer)
Value2 (Integer)
Value3 (Integer)
The content of each file is not sorted or ordered in any way as they are observed during a data collection process.
Ideally, I want to build an index for this data (indexed by the id), which would mean the following:
Divide the set of ids into manageable chunks.
Scan the files to get data related to the current working set of ids.
Build the index.
Move on to the next chunk and repeat steps 1-3.
To me this sounds fine, but loading 152 GB back and forth is time-consuming, and I wonder about the best possible approach, or even whether Java is actually the right language to use for such a process.
I have 256 GB of RAM and 32 cores on my machine.
Update:
Let me modify this: putting aside I/O, assume the file is in memory as a byte array.
What would be the fastest possible way to decode a 42 MB object file that has 2,000,000 records, where each record contains 4 serialized Integers?
You've made a very poor choice of file format. I would convert the lot from serialized Integers to binary ints, written with DataOutputStream.writeInt() and read with DataInputStream.readInt(), with buffered streams underneath in both cases. You will save masses of disk space, which will therefore save you I/O time as well, and you also save all the serialization overhead. And change your collection software to use this format in future. The conversion will take a while, but it only happens once.
Or else use a database as suggested, again with native ints rather than serialized objects.
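A minimal sketch of that one-off conversion, under the assumption that each value in the original files was written individually with ObjectOutputStream.writeObject() (file names are placeholders):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

public class ConvertToBinary {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // One-off conversion: serialized Integers in, raw 4-byte ints out.
        try (ObjectInputStream in = new ObjectInputStream(
                     new BufferedInputStream(new FileInputStream("records.ser")));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream("records.bin")))) {
            while (true) {
                try {
                    out.writeInt((Integer) in.readObject());  // id, value1, value2, value3, ...
                } catch (EOFException end) {
                    break;  // clean end of the serialized stream
                }
            }
        }

        // Reading the converted file back is just a sequence of readInt() calls.
        try (DataInputStream in = new DataInputStream(
                     new BufferedInputStream(new FileInputStream("records.bin")))) {
            int id = in.readInt();
            int value1 = in.readInt();
            int value2 = in.readInt();
            int value3 = in.readInt();
            System.out.println(id + ": " + value1 + ", " + value2 + ", " + value3);
        }
    }
}
```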
So, what I would do is just load up each file and store the id into some sort of sorted structure - std::map perhaps [or Java's equivalent, but given that it's probably about 10-20 lines of code to read in the filename and then read the contents of the file into a map, close the file and ask for the next file, I'd probably just write the C++ to do that].
I don't really see what else you can or should do, unless you actually want to load it into a DBMS, which I don't think is an unreasonable suggestion at all.
Hmm, it seems the better way of doing it is to use some kind of DBMS. Load all your data into a database and you can leverage its indexing, storage and querying facilities. Of course this depends on what your requirement is, and whether or not a DBMS solution suits it.
Given that your available memory is greater than your dataset and you want very high performance, have you considered Redis? It's well suited to operations on simple data structures and its performance is very fast.
Just be a bit careful about letting Java do default serialization when storing values. I've previously run into issues with my primitives getting autoboxed prior to serialization.
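As an illustration of that point, here is a small sketch using the Jedis client (which the answer doesn't name, so treat the library, key names and values as assumptions): the integers are stored as plain string fields in a Redis hash instead of letting default Java serialization box them.

```java
import redis.clients.jedis.Jedis;

public class RedisRecords {
    public static void main(String[] args) {
        // Assumes a local Redis instance; Jedis is one common Java client.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Store the record as a hash of plain string fields rather than
            // serializing autoboxed Integer objects.
            String key = "record:42";
            jedis.hset(key, "value1", Integer.toString(7));
            jedis.hset(key, "value2", Integer.toString(13));
            jedis.hset(key, "value3", Integer.toString(99));

            // Lookup by id is a single hash read; parse back to ints on the way out.
            int value1 = Integer.parseInt(jedis.hget(key, "value1"));
            System.out.println("value1 = " + value1);
        }
    }
}
```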
I have a Swing application which works on a CSV file. It reads the full file line by line, computes some required statistics and shows the output.
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data. The problem is that the JVM takes 4 times as much memory as the file size (while processing an 86 MB file, the heap uses 377 MB of space; memory utilization checked using jVisualVM).
Note:
I have used LineNumberReader for reading the file (because of a specific requirement; I can change it if that helps with memory usage).
Every line is read with readLine(), and then .split(",") is called on that String to get the individual fields of the record.
Each record is stored in a Vector for displaying in the JTable, whereas other statistics are stored in a HashMap and a TreeMap, and summary data in a JavaBean class. One graph is also plotted using JFreeChart.
Please suggest how to reduce memory utilization, as I need to process a 2 GB file.
Try giving OpenCSV a shot. It only stores the last read line when you use the readNext() method (a short sketch follows the feature list). For large files this is perfect.
From their website, the following are the features they support:
Arbitrary numbers of values per line
Ignoring commas in quoted elements
Handling quoted entries with embedded carriage returns (i.e. entries that span multiple lines)
Configurable separator and quote characters (or use sensible defaults)
Read all the entries at once, or use an Iterator style model
Creating csv files from String[] (i.e. automatic escaping of embedded quote chars)
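A short sketch of the streaming readNext() usage mentioned above, assuming OpenCSV 4+ with the com.opencsv package (older releases use au.com.bytecode.opencsv; the file name is illustrative):

```java
import com.opencsv.CSVReader;
import java.io.FileReader;

public class OpenCsvExample {
    public static void main(String[] args) throws Exception {
        // Streams one row at a time instead of holding the whole file in memory.
        try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                // Update running statistics here; only the current row is retained.
                System.out.println(fields.length + " fields, first = " + fields[0]);
            }
        }
    }
}
```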
Use best practices to improve your program:
Use multithreading to get better CPU utilization.
Set minimum and maximum heap sizes to make better use of RAM.
Use proper data structures and design.
Every Java object has a memory overhead, so if your Strings are really short, that could explain why you get 4 times the size of your file. You also have to account for the size of the Vector and its internals. I don't think that a Map would improve memory usage, since Java Strings already try to point to the same address in memory whenever possible.
I think you should revise your design. Given your requirements
The upper part of the output screen shows each record from the file, in order, in a JTable, whereas the lower part shows statistics computed from that data
you don't need to store the whole file in memory. You need to read it entirely to compute your statistics, and this can certainly be done using a very small amount of memory. Regarding the JTable part, this can be accomplished in a number of ways without requiring 2 GB of heap space for your program! I think there must be something wrong when someone wants to keep a whole CSV in memory! Have a look at Apache Commons IO's LineIterator.
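For example, a LineIterator-based pass keeps only the current line in memory (the file name and comma delimiter are assumptions; this uses Apache Commons IO):

```java
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
import java.io.File;
import java.io.IOException;

public class StreamingRead {
    public static void main(String[] args) throws IOException {
        long rows = 0;
        // Only the current line is ever held in memory.
        LineIterator it = FileUtils.lineIterator(new File("data.csv"), "UTF-8");
        try {
            while (it.hasNext()) {
                String[] fields = it.nextLine().split(",");  // feed fields into your statistics here
                rows++;
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
        System.out.println("Processed " + rows + " rows");
    }
}
```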
Increase the JVM heap size (-Xms and -Xmx). If you have the memory, this is the best solution. If you cannot do that, you will need to find a compromise that will be a combination of data model and presentation (GUI) changes, usually resulting in increased code complexity and potential for bugs.
Try modifying your statistics algorithms to do their work as the data is being read, rather than requiring it all to exist in memory (see the accumulator sketch after these suggestions). You may find algorithms that approximate the statistics to be sufficient.
If your data contains many duplicate String values, use a HashSet to create a cache. Beware: caches are notorious for being memory leaks (e.g. not clearing them before loading different files).
Reduce the amount of data being displayed on the graph. It is common for a graph with a lot of data to have many points displayed at or near the same pixel. Consider truncating the data by merging multiple values at or near the same position on the x-axis. If your data set contains 2,000,000 points, for example, most of them will coincide with other nearby points, so your underlying data model does not need to store everything.
Beware of information overload. Will your JTable be meaningful to the user if it contains 2GB worth of data? Perhaps you should paginate the table, and read only 1000 entries from file at a time for display.
I'm hesitant to suggest this, but during the loading process, you could convert the CSV data into a file database (such as cdb). You could accumulate statistics and store some data for the graph during the conversion, and use the database to quickly read a page of data at a time for the JTable as suggested above.
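As a sketch of the compute-as-you-read suggestion above, an accumulator like this can be fed one value per row and keeps nothing else; it uses Welford's algorithm for the running mean and variance (the class is illustrative, not from the answer):

```java
/** Running statistics accumulator: feed values as rows are read, store nothing else. */
public class RunningStats {
    private long count;
    private double mean;
    private double m2;  // sum of squared deviations from the mean (Welford's algorithm)
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    public void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
        if (x < min) min = x;
        if (x > max) max = x;
    }

    public long count()      { return count; }
    public double mean()     { return mean; }
    public double variance() { return count > 1 ? m2 / (count - 1) : 0.0; }
    public double min()      { return min; }
    public double max()      { return max; }
}
```

Feed it from whatever reader you use, e.g. stats.add(Double.parseDouble(fields[2])) per row, where the column index is just an example.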
I have a requirement where I have to retrieve the contents of Excel rows, do some operations, and write the responses back into the same Excel rows using a Java class.
So I decided to store the response in memory and write it once.
Is that advisable, or should I write to the file for every response?
Please advise me on the best approach.
Please note:
The Excel file will have more than 1000 rows across three individual worksheets.
I would keep it simple: keep everything in memory and write out the complete file when finished, unless the dataset is going to be very large. But as Excel itself keeps everything in memory, you should have no problem (at least given today's computers with several GB of RAM).
Memory is inexpensive, programmers are not. :)
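As a sketch of that keep-it-in-memory, write-once approach, assuming Apache POI (the question doesn't name a library, so treat POI, the file name, and the column layout as assumptions):

```java
import org.apache.poi.ss.usermodel.*;
import java.io.*;

public class ExcelUpdate {
    public static void main(String[] args) throws Exception {
        // Load the whole workbook into memory once.
        Workbook wb;
        try (InputStream in = new FileInputStream("responses.xlsx")) {  // hypothetical file name
            wb = WorkbookFactory.create(in);
        }
        for (int s = 0; s < wb.getNumberOfSheets(); s++) {
            Sheet sheet = wb.getSheetAt(s);
            for (Row row : sheet) {
                Cell input = row.getCell(0);          // assumes the input sits in column 0
                if (input == null) continue;
                String response = process(input.toString());
                Cell out = row.getCell(1, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                out.setCellValue(response);           // response written back next to the input
            }
        }
        // Single write at the end, once all rows have been processed.
        try (OutputStream out = new FileOutputStream("responses.xlsx")) {
            wb.write(out);
        }
        wb.close();
    }

    private static String process(String value) {
        return value.toUpperCase();  // placeholder for the real operation
    }
}
```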
Since file I/O operations are a bit expensive, it's advisable to go for a single write as you've done already, assuming each and every row is independent of the others. But I'd go with a fixed number of rows at a time, say 100-150, instead of writing all at once, because any operation failure on a single row might cause an exception, affecting the rows processed already.
It depends on the requirements. Do the changes have to be reflected in the Excel file as soon as they are made? If yes, then you'll have no choice but to write the file to disk after each change. If there's no problem with updating the file only after all changes are applied (or a "save" operation is invoked), then storing the spreadsheet data in memory is the better idea.