I'm trying to figure out the best approach to parsing a CSV file in Java. Each line holds a variable amount of information: for example, the first line can have up to 5 comma-separated words, while the next few lines can have maybe 3, or 6, or whatever.
To be clear, my problem isn't reading the strings from the file. My problem is which data structure would be best to hold each line, and each word in that line.
At first I thought about using a 2D array, but the problem with that is that array sizes are fixed (the second index would hold the number of words in each line, which can differ from line to line).
Here are the first few lines of the CSV file:
0,MONEY
1,SELLING
2,DESIGNING
3,MAKING
DIRECTOR,3DENT95VGY,EBAD,SAGHAR,MALE,05/31/2011,null,0,10000,07/24/2011
3KEET95TGY,05/31/2011,04/17/2012,120050
3LERT9RVGY,04/17/2012,03/05/2013,132500
3MEFT95VGY,03/05/2013,null,145205
DIRECTOR,XKQ84P6CDW,AGHA,ZAIN,FEMALE,06/06/2011,null,1,1000,01/25/2012
XK4P6CDW,06/06/2011,09/28/2012,105000
XKQ8P6CW,09/28/2012,null,130900
DIRECTOR,YGUSBQK377,AYOUB,GRAMPS,FEMALE,10/02/2001,12/17/2007,2,12000,01/15/2002
You could use a Map<Integer, List<String>>: the keys are the line numbers in the CSV file, and each List holds the words on that line.
An additional point: you will probably end up calling List#get(int) quite often. Do not use a LinkedList if this is the case, because get(int) on a linked list is O(n). I think an ArrayList is your best option here.
Edit (based on AlexWien's observation):
In this particular case, since the keys are line numbers and therefore form a contiguous range of integers, an even better data structure could be ArrayList<ArrayList<String>>. This will lead to faster key retrievals.
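For illustration, here is a minimal sketch of that approach, assuming the file is read line by line with a BufferedReader (the file name is just a placeholder):
// imports: java.io.*, java.util.*
List<List<String>> rows = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // each row may hold a different number of words, which is fine here
        rows.add(new ArrayList<>(Arrays.asList(line.split(","))));
    }
}
String word = rows.get(9).get(1); // second word of the tenth line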
Use an ArrayList. It is essentially an array with a dynamic size.
The best way is to use a CSV parser, like OpenCSV (http://opencsv.sourceforge.net/). This parser holds the data as a List of String[].
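As a rough sketch of what that looks like with OpenCSV's CSVReader (the file name is a placeholder):
CSVReader reader = new CSVReader(new FileReader("data.csv"));
List<String[]> rows = reader.readAll(); // one String[] per line, of any length
reader.close();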
Use a List<String>, which can expand dynamically in size.
If you want to have 2 dimensions, use a List<List<String>>.
Here's an example:
List<List<String>> data = new ArrayList<List<String>>();
String line;
// assuming "reader" is a BufferedReader over your CSV file
while ((line = reader.readLine()) != null) {
    data.add(Arrays.asList(line.split(",")));
}
Loop over the whole file like that and access your data through data.get(row).
Every day I receive a list of 30-40k lines; each line contains a meaningful or meaningless name like fastcar, ultrafastcar, blablablacar, etc.
I also have one big list which consists of all the words in any language (about 50k lines).
I want to compare the first list against the second in order to filter out the entries which include (or start with, or end with) words from the second list. I mean, the word "ultrafastcar" will not be filtered, but "blablacar" will be filtered out.
I have written some Java code, but it takes too long to compare the lists. I used ArrayLists and compared them with the contains() and startsWith() methods. Are ArrayLists the correct choice, and what other algorithm can I use to compare them besides these methods?
You could try implementing a ternary search tree with the second list and then check if the words in the first exist in the tree.
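For illustration, here is a minimal ternary search tree sketch (the class and method names are my own, not from any library); build it from the 50k-word list once, then query it for each name:
class TernarySearchTree {
    private Node root;

    private static class Node {
        char c;
        boolean isWord;
        Node left, mid, right;
        Node(char c) { this.c = c; }
    }

    // assumes word is non-empty
    public void insert(String word) {
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) node = new Node(c);
        if (c < node.c)                 node.left  = insert(node.left, word, i);
        else if (c > node.c)            node.right = insert(node.right, word, i);
        else if (i < word.length() - 1) node.mid   = insert(node.mid, word, i + 1);
        else                            node.isWord = true;
        return node;
    }

    public boolean contains(String word) {
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = word.charAt(i);
            if (c < node.c)                 node = node.left;
            else if (c > node.c)            node = node.right;
            else if (i < word.length() - 1) { i++; node = node.mid; }
            else                            return node.isWord;
        }
        return false;
    }
}
The tree also lets you walk a word's prefix character by character, which helps with the starts-with checks.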
Problem:
Essentially, my goal is to build an ArrayList of IndexEntry objects from a text file. An IndexEntry has the following fields: String word, representing this unique word in the text file, and ArrayList numsList, a list containing the line numbers of the text file on which word occurs.
The ArrayList I build must keep the IndexEntries sorted so that their word fields are in alphabetical order. However, I want to do this as fast as possible. Currently, I visit each word as it appears in the text file and use binary search to determine whether an IndexEntry for that word already exists, so that I can add the current line number to its numsList. If no IndexEntry exists, I create a new one in the appropriate spot to maintain alphabetical order.
Example:
_
One
Two
One
Three
_
Would yield an ArrayList of IndexEntries whose output as a String (in the order of word, numsList) is:
One [1, 5], Three [7], Two [3]
Keep in mind that I am working with much larger text files, with many occurrences of the same word.
Question:
Is binary search the fastest way to approach this problem? I am still a novice at programming in Java, and am curious about search algorithms that might perform better in this scenario, and about the relative time complexity of a hash table compared with my current solution.
You could try a TreeMap or a ConcurrentSkipListMap which will keep your index sorted.
However, if you only need a sorted list at the end of your indexing, good old HashMap<String, List> is the way to go (an ArrayList as the value is probably a safe bet as well).
When you are done, take the map's entries and sort them once by key.
Should be good enough for a couple hundred megabytes of text files.
If you are on Java 8, use the neat computeIfAbsent and computeIfPresent methods.
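A minimal sketch of that approach, assuming for simplicity one word per line of input (tokenize however your real file requires; the file name is a placeholder):
// imports: java.io.*, java.util.*
Map<String, List<Integer>> index = new HashMap<>();
int lineNumber = 0;
try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
    String word;
    while ((word = reader.readLine()) != null) {
        lineNumber++;
        // creates the list on first sight of a word, then appends the line number
        index.computeIfAbsent(word, k -> new ArrayList<>()).add(lineNumber);
    }
}
// sort once by word at the end
List<Map.Entry<String, List<Integer>>> entries = new ArrayList<>(index.entrySet());
entries.sort(Map.Entry.comparingByKey());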
I have to parse a huge CSV which contains 3 values per line.
First I get the CSV from the assets folder and read it line by line:
while ((line = reader.readLine()) != null) {
String[] string = line.split(",");
...
}
Now I want to store the values efficiently so I can use them later.
I need the first two values as a pair. Each pair maps to the last value on the same line.
For example:
1,2,3
4,5,6
...
results:
pair = [1,2];
pairValue = 3;
...
But I need all the values of the CSV to work with later, both as pairs and as single values (for calculation), so which method is best for working with this data?
Maybe an ArrayList, or a HashMap like:
Map<String, String> map = new HashMap<String, String>();
// add items
map.put(pair, value);
// get items
String valueOfKeys = map.get(pair); // no cast needed with a typed map
I hope one of you understands me and can help.
The first question is: will all your data fit in memory? Assuming that it does, you need to decide how you want to be able to look up the data:
do you need to access the values based on the row number, e.g. give me the pair and pairValue for row 1985?
do you need to access a pair based on the pairValue, e.g. give me the pair for pairValue = 3?
do you need to iterate through all the data from start to finish?
In the first case an ArrayList would be quickest, but be aware that this involves allocating a large contiguous chunk of memory.
In the second case a HashMap would work, as you've already suggested; see the sketch after this answer.
In the third case, a LinkedList would work and would mean that you don't have to allocate a contiguous chunk of memory, but accessing the nth element would be slower.
If the file is too large to fit into memory then you're going to have to write the data to a database table and query it from there.
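For the HashMap case, here is a minimal sketch with a hypothetical Pair key class (any key type used in a HashMap must override equals() and hashCode()):
final class Pair {
    final String first, second;
    Pair(String first, String second) { this.first = first; this.second = second; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Pair)) return false;
        Pair p = (Pair) o;
        return first.equals(p.first) && second.equals(p.second);
    }
    @Override public int hashCode() { return 31 * first.hashCode() + second.hashCode(); }
}

Map<Pair, String> pairValues = new HashMap<>();
String[] fields = line.split(",");                         // e.g. "1,2,3"
pairValues.put(new Pair(fields[0], fields[1]), fields[2]); // (1,2) -> 3
String pairValue = pairValues.get(new Pair("1", "2"));     // "3"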
I have a text file containing ~30,000 words in alphabetical order each on a separate line.
I also have a Set<String> set containing ~10 words.
I want to check if any of the words in my set are in the word list (text file).
So far my method has been to:
Open the word list text file
Read a line/word
Check if set contains that word
Repeat to the end of the word list file
This seems badly optimised. For example, if I'm checking a word in my set that begins with the letter b, I see no point in checking words in the text file beginning with a, or with c, d, etc.
My proposed solution would be to separate the text file into 26 files, one file for words which start with each letter of the alphabet. Is there a more efficient solution than this?
Note: I know 30,000 words isn't that large a word list but I have to do this operation many times on a mobile device so performance is key.
You can take your approach further by building a hash set from the entire word-list file. String comparisons are expensive, so it's better to create a HashSet of Integer. Read the word list (assuming the words will not grow from 30,000 to something like 3 million) once in its entirety and save each word's hash code in an Integer HashSet. When adding into the Integer HashSet use:
wordListHashSet.add(mycurrentword.hashCode());
You have mentioned that you have a set of 10 words that must be checked against the word list. Again, instead of a String set, create an Integer HashSet of their hash codes.
Create an iterator over this Integer HashSet:
Iterator<Integer> it = myTenWordsHashSet.iterator();
Iterate over this in a loop and check for the following condition:
wordListHashSet.contains(it.next());
If this is true, then you have the word in the wordlist.
Using Integer hash sets is a good idea when performance is what you are looking for. Java computes the hash of each string, and membership tests against the set then run in roughly O(1) per call, compared with O(log n) for a binary search over the word list.
Hope that helps!
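Put together, a minimal sketch of that approach (the file name and variable names are placeholders; note that comparing hash codes alone can give false positives when two different strings collide, so verify against the actual strings if exactness matters):
Set<Integer> wordListHashSet = new HashSet<>();
try (BufferedReader reader = new BufferedReader(new FileReader("wordlist.txt"))) {
    String word;
    while ((word = reader.readLine()) != null) {
        wordListHashSet.add(word.hashCode());
    }
}
for (String candidate : myTenWords) {
    if (wordListHashSet.contains(candidate.hashCode())) {
        // probable match; double-check the string itself if collisions matter
    }
}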
It's probably not worth the hassle for 30,000 words, but let's just say you have a lot more, like say 300,000,000 words, and still only 10 words to look for.
In that case, you could do a binary search in the large file for each of the search words, using a RandomAccessFile.
Obviously, each search step would require you first to find the beginning of the word (or of the next word, depending on the implementation), which makes it a lot more difficult, and handling all the corner cases exceeds the amount of code one could provide here. But still, it could be done, and it would surely be faster than reading through all 300,000,000 words once.
You might consider iterating through your 10-word set (maybe parse it from the file into an array), and for each entry, using a binary search algorithm to see if it's contained in the larger list. Binary search only takes O(log N), so in this case log(30,000) steps, which is significantly faster than 30,000 steps.
Since you'll repeat this step once for every word in your set, it should take about 10 * log(30k) steps in total.
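Since the word-list file is already in alphabetical order, a minimal sketch could look like this (the file name is a placeholder):
// imports: java.nio.file.*, java.util.*
List<String> wordList = Files.readAllLines(Paths.get("wordlist.txt")); // already sorted
for (String word : mySet) {
    // binarySearch returns a non-negative index when the word is found
    if (Collections.binarySearch(wordList, word) >= 0) {
        System.out.println(word + " is in the word list");
    }
}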
You can make some improvements depending on your needs.
If, for example, the file remains unchanged but your 10-word Set changes regularly, then you can load the file into another Set (a HashSet). Now you just need to search for a match in this new Set. This way each lookup will always be O(1).
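A quick sketch of that (again, the file name is a placeholder):
Set<String> wordList = new HashSet<>(Files.readAllLines(Paths.get("wordlist.txt")));
boolean found = wordList.contains("example"); // O(1) per lookup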
I'm currently in the process of creating an OBJ importer for an OpenGL ES Android game. I'm relatively new to Java, so I'm not exactly clear on a few things.
I have an array which will hold the vertices of the model (along with a few other arrays as well):
float vertices[];
The problem is that I don't know how many vertices there are in the model before I read the file using the InputStream given to me.
Would I be able to fill it in as I need to like this?:
vertices[95] = 5.004f; //vertices was defined like the example above
or do I have to initialize it beforehand?
If the latter is the case, then what would be a good way to find out the number of vertices in the file? Once I read it using inputstreamreader.read(), it goes on to the next line until it has read the whole file. The only thing I can think of would be to read the whole file, count the number of vertices, then read it AGAIN to fill in the newly initialized array.
Is there a way to dynamically allocate the data as is needed?
You can use an ArrayList which will give you the dynamic size that you need.
List<Float> vertices = new ArrayList<Float>();
You can add a value like this:
vertices.add(5.0F);
and the list will grow to suit your needs.
Some things to note: the ArrayList will hold objects, not primitive types, so it stores the float values you provide as Float objects. However, it is easy to get the original float value back (unboxing happens automatically).
If you absolutely need an array, then after you read in the entire list of values you can easily get an array from the List, as sketched below.
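A minimal sketch of that conversion (there is no standard-library one-liner for List<Float> to a primitive float[]):
float[] array = new float[vertices.size()];
for (int i = 0; i < array.length; i++) {
    array[i] = vertices.get(i); // auto-unboxing Float -> float
}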
You can start reading about Java Collections here.
In Java, arrays have to be initialised with a fixed size beforehand. In your case you have the following options:
1) Use an ArrayList (or some other implementation of the List interface), as suggested by others. Such lists can grow dynamically, so this will help.
2) If you have control over the file format, add the number of vertices to the beginning of the file, so you can pre-initialise your array with the correct size.
3) If you don't have control over it, try guessing the number of vertices based on the file size (a float is 4 bytes, so maybe divide File.length() by 4, for example). If the guessed number is too small, you can dynamically create a bigger array (say, 120% of the previous array's size), copy all data from the previous array into the new one, and carry on. This may be costly, but if your guess of the array size is accurate it will not be a problem. A sketch of this follows below.
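A rough sketch of option 3 (the initial guess and the 20% growth factor are arbitrary choices):
float[] vertices = new float[guessedSize]; // e.g. guessedSize = (int) (file.length() / 4)
int count = 0;
// ... inside the read loop, before storing each value:
if (count == vertices.length) {
    // grow by roughly 20% and copy the old contents over
    vertices = Arrays.copyOf(vertices, vertices.length + vertices.length / 5 + 1);
}
vertices[count++] = value;
// ... when done, trim to the exact size:
vertices = Arrays.copyOf(vertices, count);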
We might be able to give you more ideas if you give us more information on the file format and/or how this array of vertices is going to be used (e.g. stored for a long time, or thrown away quickly).
No, you can't fill in an uninitialized array.
If you need a dynamic structure that allows storing data plus indexes (which seem to be important in your case), I would go for a Map (the key of the Map would be your index):
Map<Integer, Float> vertices = new HashMap<Integer, Float>();
vertices.put(95, 5.004f);