I have to parse a huge CSV file which contains 3 values per line.
First I get the CSV from the assets folder and read it line by line.
while ((line = reader.readLine()) != null) {
    String[] string = line.split(",");
    ...
}
Now I want to read the values efficiently, so I can use them later.
I need the first two values as a pair. Each pair belongs to the last value in the same line.
for example :
1,2,3
4,5,6
...
results:
pair = [1,2];
pairValue = 3;
...
But I need all the values of the .csv so I can work with them later, both as pairs and as single values (for calculation). So which data structure is best for working with this data?
Maybe an ArrayList or a HashMap like:
Map<String, String> map = new HashMap<String, String>();
//add items
map.put(pair,value);
//get items
String valueOfKeys = map.get(pair);
I hope one of you understands me and can help.
The first question is: will all your data fit in memory? Assuming that it does, you then need to decide how you want to be able to look up the data:
do you need to access the values based on the row number, e.g. give me the pair and pairValue for row 1985?
do you need to access a pair based on the pairValue, e.g. give me the pair for pairValue = 3?
do you need to iterate through all the data from start to finish?
In the first case an ArrayList would be quickest, but be aware that this would involve allocating a large contiguous chunk of memory.
In the second case a HashMap would work, as you've already suggested (a sketch follows below).
In the third case, a LinkedList would work and would mean that you wouldn't have to allocate a contiguous chunk of memory, but accessing the nth element would be slower.
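For the second case, a minimal sketch, assuming the two leading values can simply be joined into one comma-separated string key (the file name is a placeholder):
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class CsvPairLookup {
    public static void main(String[] args) throws Exception {
        // Key: the first two values joined as "a,b"; value: the third column.
        Map<String, String> pairToValue = new HashMap<>();

        try (BufferedReader reader = new BufferedReader(new FileReader("assets.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                if (parts.length < 3) {
                    continue; // skip malformed lines
                }
                pairToValue.put(parts[0] + "," + parts[1], parts[2]);
            }
        }

        // Look up the value for the pair (1,2) from the example above.
        String value = pairToValue.get("1,2"); // "3"
        System.out.println(value);
    }
}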
If the file is too large to fit into memory then you're going to have to write the data to a database table and query it from there.
I am working on code in Java where I go through a database table and construct a list of the unique items from that table, with a key which is a string value (the item name, which could be up to 1024 characters long) and a number (long). The table can contain 100K or more unique items. The code is something like this:
Map<String, Long> getUniqueItems() {
    Map<String, Long> map = new HashMap<>();
    Iterator<RecordItem> recordItr = readDatabaseRecord();
    while (recordItr.hasNext()) {
        RecordItem item = recordItr.next();
        map.put(item.getName(), item.getValue());
    }
    return map;
}
However, if the number of unique records is huge then of course I will run out of memory in the application. How can I avoid running out of memory while constructing this? At the end, once I get the map, I iterate through its entries and do something with each item.
Thanks,
Arsalan
As #Marco13's comment says, 100k elements is not too many to store in main memory. But if recordsItr is backed by an array, the runtime will require a consecutive chunk of memory to store the elements, and in that case you can run into an out-of-memory problem if your main memory is cramped. Instead of using an array, you can use another data structure that does not require consecutive memory; there are stacks, queues, linked lists, ... I think they are enough for 100k elements (if more, I am not sure). Also, don't forget to optimize your code to reduce memory usage.
I'm trying to figure out the best approach to parsing a CSV file in Java. Each line will have a variable amount of information. For example, the first line can have up to 5 string words (separated by commas) while the next few lines can have maybe 3 or 6 or whatever.
My problem isn't reading the strings from the file. Just to be clear. My problem is what data structure would be best to hold each line and also each word in that line?
At first I thought about using a 2D array, but the problem with that is that array sizes must be static (the 2nd index size would hold how many words there are in each line, which can be different from line to line).
Here's the first few lines of the CSV file:
0,MONEY
1,SELLING
2,DESIGNING
3,MAKING
DIRECTOR,3DENT95VGY,EBAD,SAGHAR,MALE,05/31/2011,null,0,10000,07/24/2011
3KEET95TGY,05/31/2011,04/17/2012,120050
3LERT9RVGY,04/17/2012,03/05/2013,132500
3MEFT95VGY,03/05/2013,null,145205
DIRECTOR,XKQ84P6CDW,AGHA,ZAIN,FEMALE,06/06/2011,null,1,1000,01/25/2012
XK4P6CDW,06/06/2011,09/28/2012,105000
XKQ8P6CW,09/28/2012,null,130900
DIRECTOR,YGUSBQK377,AYOUB,GRAMPS,FEMALE,10/02/2001,12/17/2007,2,12000,01/15/2002
You could use a Map<Integer, List<String>>. The keys would be the line numbers in the CSV file, and each List would hold the words in that line.
An additional point: you will probably end up using List#get(int) method quite often. Do not use a linked list if this is the case. This is because get(int) for linked list is O(n). I think an ArrayList is your best option here.
Edit (based on AlexWien's observation):
In this particular case, since the keys are line numbers, thus yielding a contiguous set of integers, an even better data structure could be ArrayList<ArrayList<String>>. This will lead to faster key retrievals.
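A rough sketch of that ArrayList<ArrayList<String>> variant (the file name is a placeholder):
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;

public class CsvToNestedLists {
    public static void main(String[] args) throws Exception {
        // Outer index = line number, inner list = the words on that line.
        ArrayList<ArrayList<String>> lines = new ArrayList<>();

        try (BufferedReader reader = new BufferedReader(new FileReader("input.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(new ArrayList<>(Arrays.asList(line.split(","))));
            }
        }

        // Retrieval by line number is just an index lookup, e.g. lines.get(0).
        System.out.println("Parsed " + lines.size() + " lines");
    }
}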
Use an ArrayList. It is essentially an array with a dynamic size.
The best way is to use a CSV parser, like http://opencsv.sourceforge.net/. This parser uses List of String[] to hold data.
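A minimal sketch of reading a file with opencsv; note that the exact package name and checked exceptions differ between opencsv versions, so treat the details below as an approximation:
import com.opencsv.CSVReader; // older versions use the au.com.bytecode.opencsv package
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class OpenCsvExample {
    public static void main(String[] args) throws Exception {
        List<String[]> rows = new ArrayList<>();
        try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                rows.add(row); // each String[] holds the fields of one line
            }
        }
        System.out.println("Read " + rows.size() + " rows");
    }
}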
Use a List<String>, which can expand dynamically in size.
If you want to have 2 dimensions, use a List<List<String>>.
Here's an example:
List<List<String>> data = new ArrayList<List<String>>();
List<String> temp = Arrays.asList(someString.split(","));
data.add(temp);
Put this in some kind of loop and fill your data structure like that.
I have a collection of around 1500 documents. I parse through each document and extract tokens. These tokens are stored in a HashMap (as keys) and the total number of times they occur in the collection (i.e. the frequency) is stored as the value.
I have to extend this to build an inverted index. That is, the term (key) | number of documents it occurs in --> DocNo | frequency in that document. For example:
Term        DocFreq    DocNum    TermFreq
data        3          1         12
                       23        31
                       100       17
customer    2          22        43
                       19        2
Currently, I have the following in Java:
Map<String, Integer> wordFrequency = new HashMap<>();
for (each document)
{
    extract line
    for (each line)
    {
        extract word
        for (each word)
        {
            perform some operations
            get the value for the word from the map and increment it by one
        }
    }
}
I have to build on this code. I can't really think of a good way to implement an inverted index.
So far, I have thought of making the value a 2D array, so the term would be the key and the value (i.e. the 2D array) would store the docId and termFreq.
Please let me know if my logic is correct.
I would do it by using a Map<String, TermFrequencies>. This map would maintain a TermFrequencies object for each term found. The TermFrequencies object would have the following methods:
void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);
It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.
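A possible sketch of such a TermFrequencies class, under the assumption that it keeps a per-document count internally as described (the field names and the constructor taking the term are illustrative):
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class TermFrequencies {
    private final String term;
    // documentId -> number of occurrences of the term in that document
    private final Map<String, Integer> occurrencesPerDocument = new HashMap<>();
    private int totalOccurrences = 0;

    public TermFrequencies(String term) {
        this.term = term;
    }

    public String getTerm() {
        return term;
    }

    public void addOccurrence(String documentId) {
        occurrencesPerDocument.merge(documentId, 1, Integer::sum);
        totalOccurrences++;
    }

    public int getTotalNumberOfOccurrences() {
        return totalOccurrences;
    }

    public Set<String> getDocumentIds() {
        return occurrencesPerDocument.keySet();
    }

    public int getNumberOfOccurrencesInDocument(String documentId) {
        return occurrencesPerDocument.getOrDefault(documentId, 0);
    }
}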
The algorithm would be extremely simple:
for (each document) {
    extract line
    for (each line) {
        extract word
        for (each word) {
            TermFrequencies termFrequencies = map.get(word);
            if (termFrequencies == null) {
                termFrequencies = new TermFrequencies(word);
                map.put(word, termFrequencies);
            }
            termFrequencies.addOccurrence(document);
        }
    }
}
The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences in the internal map.
I think it is best to have two structures: a Map<docnum, Map<term,termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size in the values of the second map. This solution involves no custom classes and allows a quick retrieval of everything needed.
The first map contains all the information and the second one is a derivative that allows quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to do it in one pass.
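A rough sketch of the two-map idea, filled in one pass (class and method names are illustrative, not from the question):
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TwoMapIndex {
    // docnum -> (term -> frequency of the term in that document)
    private final Map<Integer, Map<String, Integer>> termFreqByDoc = new HashMap<>();
    // term -> set of documents the term occurs in
    private final Map<String, Set<Integer>> docsByTerm = new HashMap<>();

    public void addOccurrence(int docNum, String term) {
        termFreqByDoc
                .computeIfAbsent(docNum, d -> new HashMap<>())
                .merge(term, 1, Integer::sum);
        docsByTerm
                .computeIfAbsent(term, t -> new HashSet<>())
                .add(docNum);
    }

    public int getDocFreq(String term) {
        // the "set.size in the values of the second map" mentioned above
        return docsByTerm.getOrDefault(term, Collections.emptySet()).size();
    }

    public int getTermFreq(int docNum, String term) {
        return termFreqByDoc.getOrDefault(docNum, Collections.emptyMap())
                .getOrDefault(term, 0);
    }
}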
I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model Terms, Documents and their relationships using objects. In a first run, create the term index and document objects and iterate over all terms in the documents while populating the term index. Afterwards, you have a representation in memory that you can easily transform into the desired output.
Do not start by thinking about 2d-arrays in an object oriented language. Unless you want to solve a mathematical problem or optimize something it's not the right approach most of the time.
I don't know if this is still a hot question, but I would recommend doing it like this:
You run over all your documents and give them an id in increasing order. For each document you run over all the words.
Now you have a HashMap that maps Strings (your words) to an array of DocTermObjects. A DocTermObject contains a docId and a termFrequency.
Now, for each word in a document, you look it up in your HashMap; if there is no array of DocTermObjects yet, you create it, otherwise you look at its very LAST element only (this is important for the runtime, think about it). If this element has the docId that you are currently processing, you increase the termFrequency. Otherwise, or if the array is empty, you add a new DocTermObject with the current docId and set the termFrequency to 1.
Later you can use this data structure to compute scores, for example. You could also save the scores in the DocTermObjects, of course.
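A rough sketch of that approach; the DocTermObject layout and method names are assumptions, but the "check only the last element" rule is as described above:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DocTermObject {
    int docId;
    int termFrequency;

    DocTermObject(int docId, int termFrequency) {
        this.docId = docId;
        this.termFrequency = termFrequency;
    }
}

public class InvertedIndex {
    private final Map<String, List<DocTermObject>> index = new HashMap<>();

    // Assumes documents are processed in increasing docId order.
    public void addWord(String word, int docId) {
        List<DocTermObject> postings = index.computeIfAbsent(word, w -> new ArrayList<>());
        if (!postings.isEmpty() && postings.get(postings.size() - 1).docId == docId) {
            // Same document as the last posting: only bump the frequency.
            postings.get(postings.size() - 1).termFrequency++;
        } else {
            // First occurrence of this word in the current document.
            postings.add(new DocTermObject(docId, 1));
        }
    }
}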
Hope it helped :)
In a Java program, I have a requirement to temporarily store many records; one record consists of a key, an object, as well as an integer value. The total processing will cover millions of records, but I plan to delete each record after processing on it has been completed... After that another record is inserted, processing is done on it, then it is deleted... and so on...
What is the best way of storing such values with the help of memory mapped IO?
I can see samples for mapping byte buffers, but how do I store multiple records, and then retrieve them... Do I have to store the position numbers as I add records to the file storage, and use these to retrieve the data? Then I will have to create another array to store position numbers... Is there any recommended way of storing/retrieving data using memory-mapped IO in Java?
Maybe this is what you are looking for:
http://code.google.com/p/vanilla-java/wiki/HugeCollections
Another way is to use a WeakHashMap:
http://docs.oracle.com/javase/6/docs/api/java/util/WeakHashMap.html
You can use a HashMap: if you can create a combined key based on the record's key/int pair, with the object as the value, then you can store it in a HashMap and delete values from the HashMap in a constant-time operation.
Example:
// Construct a new empty HashMap with the default initial capacity
Map<String, Integer> hashMap = new HashMap<String, Integer>();
// Each key would be a combination of the record's key/int pair
hashMap.put(key1, 1);
hashMap.put(key2, 2);
hashMap.put(key3, 3);
// Values can be removed from the HashMap in constant time using remove
hashMap.remove(key1);
I have a 2GB text file with 5 columns delimited by tabs.
A row is called a duplicate only if 4 out of the 5 columns match.
Right now, I am de-duping by first loading each column into a separate List, then iterating through the lists, deleting duplicate rows as they are encountered, and aggregating.
The problem: it is taking more than 20 hours to process one file.
I have 25 such files to process.
Can anyone please share their experience of how they would go about such de-duping?
This de-duping will be throw-away code, so I am looking for a quick/dirty solution to get the job done as soon as possible.
Here is my pseudo code (roughly)
iterate over the rows
    i = current_row_no
    iterate over rows i+1 to last_row
        if (col1 matches   // find duplicate
            && col2 matches
            && col3 matches
            && col4 matches)
        {
            col5List.set(i, get col5);   // aggregate
        }
Duplicate example:
A and B are duplicates, with A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1), and the output would be A=(1,1,1,1,1+2) and C=(2,1,1,1,1) [notice that B has been kicked out].
A HashMap will be your best bet. In a single, constant time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here's some example code:
public void aggregate() throws Exception
{
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));
    // Notice the parameter for initial capacity. Use something that is large enough to prevent rehashings.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);
    String line;
    while ((line = bigFile.readLine()) != null)
    {
        int lastTab = line.lastIndexOf('\t');
        String firstFourColumns = line.substring(0, lastTab);
        // See if the map already contains an entry for the first 4 columns
        HashSet<String> set = map.get(firstFourColumns);
        // If set is null, then the map hasn't seen these columns before
        if (set == null)
        {
            // Make a new Set (for aggregation), and add it to the map
            set = new HashSet<String>();
            map.put(firstFourColumns, set);
        }
        // At this point we either found set or created it ourselves
        String lastColumn = line.substring(lastTab + 1);
        set.add(lastColumn);
    }
    bigFile.close();
    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
    {
        String firstFourColumns = entry.getKey();
        System.out.print(firstFourColumns + "=");
        HashSet<String> aggregatedLastColumns = entry.getValue();
        for (String column : aggregatedLastColumns)
        {
            System.out.print(column + ",");
        }
        System.out.println("");
    }
}
A few points:
The initialCapacity parameter for the HashMap is important. If the number of entries gets bigger than the capacity, then the structure is re-hashed, which is very slow. The default initial capacity is 16, which will cause many rehashes for you. Pick a value that you know is greater than the number of unique sets of the first four columns.
If ordered output in the aggregation is important, you can switch the HashSet for a TreeSet.
This implementation will use a lot of memory. If your text file is 2GB, then you'll probably need a lot of RAM in the jvm. You can add the jvm arg -Xmx4096m to increase the maximum heap size to 4GB. If you don't have at least 4GB this probably won't work for you.
This is also a parallelizable problem, so if you're desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]
I would sort the whole list on the first four columns, and then traverse through the list knowing that all the duplicates are together. This would give you O(NlogN) for the sort and O(N) for the traverse, rather than O(N^2) for your nested loops.
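A rough sketch of that approach, assuming the rows have already been loaded as String[] arrays of five tab-separated columns and that the fifth column is aggregated by concatenation as in the question's example:
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortThenMerge {
    // rows: each String[] holds the 5 columns of one line
    public static List<String[]> dedupe(List<String[]> rows) {
        // Sort on the first four columns so duplicates become adjacent.
        rows.sort(Comparator
                .comparing((String[] r) -> r[0])
                .thenComparing(r -> r[1])
                .thenComparing(r -> r[2])
                .thenComparing(r -> r[3]));

        List<String[]> result = new ArrayList<>();
        String[] current = null;
        for (String[] row : rows) {
            if (current != null
                    && current[0].equals(row[0]) && current[1].equals(row[1])
                    && current[2].equals(row[2]) && current[3].equals(row[3])) {
                current[4] = current[4] + "+" + row[4]; // aggregate the fifth column
            } else {
                current = row.clone();
                result.add(current);
            }
        }
        return result;
    }
}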
I would use a HashSet of the records. This can lead to O(n) time instead of O(n^2). You can create a class which has each of the fields, with one instance per row.
You need to have a decent amount of memory, but 16 to 32 GB is pretty cheap these days.
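A minimal sketch of such a record class, where equals() and hashCode() deliberately cover only the four compared columns (class and field names are illustrative):
import java.util.Objects;

public class Row {
    final String col1, col2, col3, col4, col5;

    Row(String col1, String col2, String col3, String col4, String col5) {
        this.col1 = col1; this.col2 = col2; this.col3 = col3;
        this.col4 = col4; this.col5 = col5;
    }

    // Two rows are "duplicates" when the first four columns match,
    // so only those take part in equals() and hashCode().
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Row)) return false;
        Row other = (Row) o;
        return col1.equals(other.col1) && col2.equals(other.col2)
                && col3.equals(other.col3) && col4.equals(other.col4);
    }

    @Override
    public int hashCode() {
        return Objects.hash(col1, col2, col3, col4);
    }
}
Adding each parsed line to a HashSet<Row> then flags duplicates in roughly constant time per row; the aggregation of the fifth column still has to be done when a duplicate is detected.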
I would do something similar to Eric's solution, but instead of storing the actual strings in the HashMap, I'd just store line numbers. So for a particular four-column hash, you'd store a list of line numbers which hash to that value. And then on a second pass through the data, you can remove the duplicates at those line numbers/add the +x as needed.
This way, your memory requirements will be a LOT smaller.
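A sketch of the first pass only, keying on the hash of the first four columns (the second pass over the stored line numbers is omitted; the method name is illustrative):
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LineNumberIndex {
    public static Map<Integer, List<Integer>> buildIndex(String path) throws Exception {
        // hash of the first four columns -> line numbers that produced that hash
        Map<Integer, List<Integer>> index = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                // Assumes every line has 5 tab-separated columns.
                String firstFourColumns = line.substring(0, line.lastIndexOf('\t'));
                int hash = firstFourColumns.hashCode();
                index.computeIfAbsent(hash, h -> new ArrayList<>()).add(lineNumber);
                lineNumber++;
            }
        }
        return index;
    }
}
Because different column combinations can share a hash code, the second pass still has to compare the actual columns at those line numbers before merging.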
The solutions already posted are nice if you have enough (free) RAM. As Java tends to "still work" even if it is heavily swapping, make sure you don't have too much swap activity if you presume RAM could have been the limiting factor.
An easy "throwaway" solution in case you really have too little RAM is partitioning the file into multiple files first, depending on data in the first four columns (for example, if the third column values are more or less uniformly distributed, partition by the last two digits of that column). Just go over the file once, and write the records as you read them into 100 different files, depending on the partition value. This will need minimal amount of RAM, and then you can process the remaining files (that are only about 20MB each, if the partitioning values were well distributed) with a lot less required memory, and concatenate the results again.
Just to be clear: If you have enough RAM (don't forget that the OS wants to have some for disk cache and background activity too), this solution will be slower (maybe even by a factor of 2, since twice the amount of data needs to be read and written), but in case you are swapping to death, it might be a lot faster :-)
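A rough sketch of the partitioning pass, assuming tab-separated columns and partitioning on the last two characters of the third column as suggested above (file names are placeholders):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class PartitionPass {
    public static void partition(String inputPath) throws Exception {
        // One writer per partition, opened lazily.
        Map<String, PrintWriter> writers = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(inputPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t");
                String col3 = cols[2];
                // Partition key: last two characters of the third column.
                String key = col3.length() >= 2 ? col3.substring(col3.length() - 2) : col3;
                PrintWriter out = writers.computeIfAbsent(key, k -> {
                    try {
                        return new PrintWriter(new FileWriter("partition_" + k + ".txt"));
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                out.println(line);
            }
        } finally {
            writers.values().forEach(PrintWriter::close);
        }
    }
}
Each partition file can then be de-duped independently with one of the in-memory approaches above, and the results concatenated.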