I am currently building an app which will be used to time a race.
All the times are saved in a .txt file in this format.
STARTOFEVENT,20/11/2011 11:04:58
0,20/11/2011 11:05:14
1,20/11/2011 11:05:17,00:00:02
2,20/11/2011 11:05:19,00:00:04
3,20/11/2011 11:05:20,00:00:05
4,20/11/2011 11:05:21,00:00:06
5,20/11/2011 11:05:22,00:00:07
What I need help with is displaying the position number (column 1) and the finish time (column 3) in a TextView / EditText as the results come in.
I have tried a bit of code for parsing CSV files, but with no luck.
Example of split(...)...
String csvRecord = "1,20/11/2011 11:05:17,00:00:02";
String[] csvFields = csvRecord.split(",");
Each part of the csvRecord string separated by a comma is allocated to an element of the csvFields array. The number of array elements depends on the number of CSV fields; split(...) dynamically creates the array with the correct length.
From the above, csvFields[0] will be 1, csvFields[1] will be 20/11/2011 11:05:17, and csvFields[2] will be 00:00:02.
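Putting that together, a rough sketch of what the display code could look like (the file path and the resultsView id are placeholders, not from the original app):
private void showResults() {
    // placeholder id -- use whatever TextView you have in your layout
    TextView resultsView = (TextView) findViewById(R.id.resultsView);
    StringBuilder sb = new StringBuilder();
    try {
        // placeholder path -- wherever the race file is being written
        BufferedReader reader = new BufferedReader(new FileReader("/sdcard/race_times.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] csvFields = line.split(",");
            // the STARTOFEVENT line and the start record only have two fields
            if (csvFields.length == 3) {
                sb.append(csvFields[0]).append("   ").append(csvFields[2]).append('\n');
            }
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    resultsView.setText(sb.toString());
}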
Thank you for your help in advance.
I have written code to gather data on some institutions. The main website provides an id, and then I have to use that id to find a second id that I need from a different source. To save time when institutions are repeated (I can have the same institution in different years, for example, so even though some of the data will be different, the second id won't), I am storing the pairs of ids in two arrays at the same index, so if I have the indexOf one, I have the other without having to go back to the source for that institution every time.
This works great so far; the problem, however, is when I try to store these pairs. Each string (id) is alphanumeric, can contain spaces, and never exceeds 20 characters.
I save the list after some iterations, to preserve the work before continuing the next day, so that I can start with the dictionary already populated. The data is written to a text file in the following way, one pair per line:
id1;id2
The problem is that this txt file gets huge. With 16,000 lines it is already above 300 MB, and when it got above 60,000 lines it was already above 22 GB.
There is obviously a problem, as I have had files with more text taking far less space. I write the file in the following way:
List<String> aux = new ArrayList<>();
for (int i = 0; i < nipcs.size(); i++) {
    aux.add(id1.get(i) + ";" + id2.get(i));
}
Files.write(file.toPath() , aux , StandardCharsets.UTF_8);
I want to know if these file sizes are normal, or if there is a better way to do this. Basically I do not care whether the txt file is "human-readable", as long as I can load its content back into the id1 and id2 arrays the next day.
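For reference, loading the pairs back the next day could look roughly like this (a minimal sketch; the file name is a placeholder):
List<String> id1 = new ArrayList<>();
List<String> id2 = new ArrayList<>();
// each line is "id1;id2", so split on the separator and rebuild both lists
for (String line : Files.readAllLines(Paths.get("pairs.txt"), StandardCharsets.UTF_8)) {
    String[] pair = line.split(";");
    id1.add(pair[0]);
    id2.add(pair[1]);
}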
Thank you again!
Edit:
Here is a sample of the txt file, six random lines copied and pasted. I expect around 120,000 lines at most once all the institutions are registered.
ESA58357021;645090
507755383;346686
510216099;632378
207781818;840321
513268006;840323
502106344;54991
I think I found out what was happening. I paused Dropbox, and the symptom seems to have disappeared. I believe that writing the file while Dropbox was syncing it was probably messing with the structure of the file.
If this goes south again I will post it here.
I have this collection of folders:
60G ./big_folder_6
52G ./big_folder_8
61G ./big_folder_7
60G ./big_folder_4
58G ./big_folder_5
63G ./big_folder_2
54G ./big_folder_9
61G ./big_folder_3
39G ./big_folder_10
74G ./big_folder_1
Each folder contains 100 txt files, with one sentence per line. For example, the file ./big_folder_6/001.txt:
sentence ..
sentence ..
...
Each file in the folder is between 4 and 6 GB (as you can see from the totals reported above), with roughly 40-60 million sentences. A single file fits in memory.
I need to deduplicate the sentences and count the globally unique ones, so as to obtain a new collection of files where each line holds a count and a unique sentence:
count ...unique sentence...
The collection is huge.
My first implementation (in Java) was a "merge sort" approach: order the lines into a new collection of 500 files (dispatching each line to the right file using its first N characters), then sort and aggregate the duplicates within each of those files.
I know this is a word-count map-reduce problem, but I would prefer to avoid it. The question is: am I using the right approach to solve this kind of problem, or should I consider another tool/approach besides MapReduce?
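To be concrete, the dispatching step of that first implementation looks roughly like this (a simplified sketch: here the bucket is picked by hashing the first N characters, and the file names are placeholders):
final int BUCKETS = 500;
final int N = 10; // prefix length used to pick the bucket
BufferedWriter[] buckets = new BufferedWriter[BUCKETS];
for (int i = 0; i < BUCKETS; i++) {
    buckets[i] = new BufferedWriter(new FileWriter("bucket_" + i + ".txt"));
}
for (File input : new File("./big_folder_6").listFiles()) { // repeated for every folder
    BufferedReader reader = new BufferedReader(new FileReader(input));
    String line;
    while ((line = reader.readLine()) != null) {
        String prefix = line.length() > N ? line.substring(0, N) : line;
        int b = Math.abs(prefix.hashCode() % BUCKETS); // duplicates always land in the same bucket
        buckets[b].write(line);
        buckets[b].newLine();
    }
    reader.close();
}
for (BufferedWriter w : buckets) {
    w.close();
}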
Do you mean deleting the duplicated lines of each file, or across all files?
In any case, you can't read the whole file at once; you need to read it line by line or a memory exception will be thrown. Use BufferedReader (example here), and use a map storing each string with the count of the repeated line as its value: when you read a line, put it in the map, incrementing the value if it already exists.
After reading the file, write all the lines and their counts to a new file and release the memory.
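A rough sketch of that idea for a single file (the file names are placeholders):
Map<String, Integer> counts = new HashMap<String, Integer>();
BufferedReader reader = new BufferedReader(new FileReader("001.txt"));
String line;
while ((line = reader.readLine()) != null) {
    Integer c = counts.get(line);
    counts.put(line, c == null ? 1 : c + 1); // increment the count for this sentence
}
reader.close();

BufferedWriter writer = new BufferedWriter(new FileWriter("001_counted.txt"));
for (Map.Entry<String, Integer> e : counts.entrySet()) {
    writer.write(e.getValue() + "\t" + e.getKey()); // "count <TAB> sentence"
    writer.newLine();
}
writer.close();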
UPDATE 1
The problem is that you have a lot of gigabytes, so you can't keep every line in memory because that can throw a memory exception, but at the same time you need them in memory to quickly check whether they are duplicated. What comes to my mind is, instead of using the string itself as the key, to put a hash of the string in the map (using String.hashCode()); when a line is seen for the first time, write it to the new file, but flush only every 100 lines or more to reduce the time spent writing to disk. After you have processed all the files, written the unique lines to that file, and have only integers in the map (hash code of the string as the key and the count as the value), you start reading the file containing only the unique lines and create a new file writing each line together with its count value.
Suppose I have a CSV file with 1 million columns, hundreds of gigabytes.
My objective is to read every row of the 1st, 20th and 50th columns of this CSV file to memory as quickly as possible.
How do I achieve this? Something like the following will work, but it is inefficient in terms of speed and memory, since I would need to parse every single one of the 1 million columns -- I am looking for a better solution that doesn't require this.
BufferedReader stream = ...; // reader
String line;
while ((line = stream.readLine()) != null) {
    String[] keep = line.split(",");
    // keep only the 0th, 19th and 49th elements
}
You could use the Linux command cut to extract those columns into a separate file and then process that file instead.
cut -d, -f1,20,50 giant.csv > just3columns.csv
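If you would rather stay in Java, a rough sketch that scans each line only up to the last wanted column instead of splitting all million fields (it assumes the fields never contain quoted commas):
private static final int[] WANTED = {0, 19, 49}; // 0-based column indices, in ascending order

static String[] extract(String line) {
    String[] out = new String[WANTED.length];
    int field = 0;  // index of the field that starts at 'start'
    int start = 0;  // offset where the current field begins
    int next = 0;   // which wanted column we are looking for
    for (int i = 0; i <= line.length() && next < WANTED.length; i++) {
        if (i == line.length() || line.charAt(i) == ',') {
            if (field == WANTED[next]) {
                out[next++] = line.substring(start, i);
            }
            field++;
            start = i + 1;
        }
    }
    return out; // scanning stops right after the 50th column
}
You would call extract(line) from the readLine() loop in place of split(",").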
I have a file Hier.csv which looks like this (thousands of lines):
value;nettingNodeData;ADM59505_10851487;CVAEngine;ADM;;USD;0.4;35661;BDR;NA;ICE;;RDC;MAS35661_10851487;CVAEngine;MA;10851487;RDC
I have another one, Prices.csv, which looks like this:
value;nettingNodePrices;ADM68834_22035364;CVAEngine;CVA with FTD;EUR;1468.91334249291905;DVA with FTD;EUR;5365.59742483701497
I have to make sure that both files have the same number of lines and the same ids (the third value of each line). It is a known fact that the set of ids from Hier.csv is larger and contains the set of ids from Prices.csv, i.e. some ids that are in Hier.csv are not in Prices.csv.
Also, there are no duplicates in either file.
So far, I have tried the following, but it's taking ages and not working (I can do it faster with my own two hands and Excel, but that's not what I want).
Here is my program in pseudo code; as I don't have access to my code right now, I will edit this question as soon as I can:
for each line of Hier.csv
    for each line of Prices.csv
        if prices.line doesn't contain the 3rd value of hier.line
            store that value in a list
        end
    end
end
Process p;
for each value in the list
    // remove the line containing that value from Hier.csv
    String[] command1 = {"sed", "'/^.*" + value + ".*$/d'", "Hier.csv", ">", "tmp.csv"};
    Process p = Runtime.getRuntime().exec(command1)
end
String[] command2 = {"mv", "tmp.csv", "Hier.csv"};
Process p = Runtime.getRuntime().exec(command2)
Is there a better way than that double loop?
Why doesn't the last part (exec(command)) work?
And lastly, which is more efficient for reading csv files: BufferedReader or Scanner?
You can use a merge or a hashtable.
Merge:
sort both files, then merge them together.
Hashtable:
load the smaller file (the ids) into a hashtable, then loop through the bigger file and test each id for existence against the hashtable.
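A rough sketch of the hashtable variant (the output file name is a placeholder):
// 1) load the ids (3rd field) of the smaller file into a HashSet
Set<String> priceIds = new HashSet<String>();
BufferedReader prices = new BufferedReader(new FileReader("Prices.csv"));
String line;
while ((line = prices.readLine()) != null) {
    priceIds.add(line.split(";")[2]);
}
prices.close();

// 2) keep only the Hier.csv lines whose id exists in the set
BufferedReader hier = new BufferedReader(new FileReader("Hier.csv"));
BufferedWriter out = new BufferedWriter(new FileWriter("Hier_filtered.csv"));
while ((line = hier.readLine()) != null) {
    if (priceIds.contains(line.split(";")[2])) {
        out.write(line);
        out.newLine();
    }
}
hier.close();
out.close();
This replaces both the double loop and the sed calls with a single pass over each file.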
I have an app that will create 3 arrays: 2 with double values and one with strings that can contain anything - alphanumerics, commas, points, anything the user might want to type or type by accident. The double arrays are easy. The string one I find to be tricky.
It can contain stuff like cake red,blue 1kg paper-clip, you get the idea.
I will need to store those arrays somehow (I guess a file is the easiest way), read them, and get them back into the app whenever the user wants to.
Also, it would be good if they weren't human-readable, so that they can only be read through my app.
What's the best way to do this? My issue is how to read them back into arrays. It's easy to write to a file, but then to get them back into the same array I put them in... How can I separate the array elements so that it won't split one element in two because it has a space or any other character in it?
Can I make 3 rows of text, each element separated by a tab \t or something, so that when I read it back each element is split on that tab? Would that cause any issues when reading?
I guess I want to know how I can split the elements of the array so that they can never be read back wrong.
Thanks and have a nice day!
If you don't want the file to be human-readable, you could use java.io.RandomAccessFile.
You would probably want to specify a maximum string size if you did this.
To save a string:
String str = "hello";
RandomAccessFile file = new RandomAccessFile(new File("filename"), "rw");
final int MAX_STRING_BYTES = 100; // max number of bytes one string slot uses in the file
file.writeUTF(str); // writes a 2-byte length prefix followed by the string's bytes
file.seek(file.getFilePointer() + MAX_STRING_BYTES - (str.getBytes().length + 2)); // pad the slot out to MAX_STRING_BYTES
// then write another...
To read a string:
// instantiate the RandomAccessFile again, as above
final int STRING_POSITION = 100; // or wherever you saved the string
file.seek(STRING_POSITION);
String str = file.readUTF();
You would probably want to use the beginning of the file to store the size of each array. Then just store all the values one by one in the file; no need for separators.
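For example, a minimal sketch of that layout, sticking with RandomAccessFile (the file path and method names are placeholders):
static void save(String path, double[] a, double[] b, String[] s) throws IOException {
    RandomAccessFile file = new RandomAccessFile(path, "rw");
    file.setLength(0); // drop any previous content
    file.writeInt(a.length);
    for (double d : a) file.writeDouble(d);
    file.writeInt(b.length);
    for (double d : b) file.writeDouble(d);
    file.writeInt(s.length);
    for (String str : s) file.writeUTF(str); // handles spaces, commas, anything the user typed
    file.close();
}

static String[] loadStrings(String path) throws IOException {
    RandomAccessFile file = new RandomAccessFile(path, "r");
    int lenA = file.readInt();
    file.skipBytes(lenA * 8); // skip the first double array (8 bytes each)
    int lenB = file.readInt();
    file.skipBytes(lenB * 8); // skip the second double array
    String[] s = new String[file.readInt()];
    for (int i = 0; i < s.length; i++) {
        s[i] = file.readUTF();
    }
    file.close();
    return s;
}
Because writeUTF stores its own length, no separator characters are needed and elements can contain spaces, tabs or commas without ever being split incorrectly.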