Removing file lines containing data which are not present in another file - java

I have a file Hier.csv which looks like this (thousands of lines):
value;nettingNodeData;ADM59505_10851487;CVAEngine;ADM;;USD;0.4;35661;BDR;NA;ICE;;RDC;MAS35661_10851487;CVAEngine;MA;10851487;RDC
I have another one, Prices.csv, which looks like this :
value;nettingNodePrices;ADM68834_22035364;CVAEngine;CVA with FTD;EUR;1468.91334249291905;DVA with FTD;EUR;5365.59742483701497
I have to make sure that both files have the same number of lines and the same ids (the third value of each line). It is a known fact that the set of ids in Hier.csv is larger and contains the set of ids in Prices.csv, i.e. some ids that are in Hier.csv are not in Prices.csv.
Also, there are no duplicates in either file.
So far, I have tried the following, but it's taking ages, and not working (I can do it faster with my little hands and Excel, but that's not what I want).
Here is my program in pseudo code (I don't have access to my code right now; I will edit this question as soon as I can):
for each line of Hier.csv
    for each line of Prices.csv
        if prices.line doesn't contain the 3rd value of hier.line
            store that value in a list
        end
    end
end
Process p;
for each value in the list
    // remove the line containing that value from Hier.csv
    String[] command1 = {"sed", "'/^.*" + value + ".*$/d'", "Hier.csv", ">", "tmp.csv"};
    Process p = Runtime.getRuntime().exec(command1)
end
String[] command2 = {"mv", "tmp.csv" "Hier.csv"};
Process p = Runtime.getRuntime().exec(command2)
Is there a better way than that double loop?
Why doesn't the last part (exec(command)) work?
And lastly, which is more efficient for reading CSV files: BufferedReader or Scanner?

You can use a merge or a hash table.
Merge:
sort both files and merge them together
Hashtable:
load the smaller file's ids into a hash table, then loop through the bigger file and test each id for existence against the hash table
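For the hash table route, a minimal sketch along these lines (it assumes semicolon-separated lines with the id in the third field, as in the samples above; the output file name is illustrative):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class FilterHier {
    public static void main(String[] args) throws IOException {
        // Collect the ids (third field) present in the smaller file.
        Set<String> priceIds = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get("Prices.csv"), StandardCharsets.UTF_8)) {
            priceIds.add(line.split(";")[2]);
        }

        // Keep only the Hier.csv lines whose id is also in Prices.csv.
        List<String> kept = Files.readAllLines(Paths.get("Hier.csv"), StandardCharsets.UTF_8)
                .stream()
                .filter(line -> priceIds.contains(line.split(";")[2]))
                .collect(Collectors.toList());

        Files.write(Paths.get("Hier_filtered.csv"), kept, StandardCharsets.UTF_8);
    }
}

This avoids the double loop entirely: one pass over each file, with id lookups in constant time.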


List to txt file getting insanely large, I mean huge

Thank you for your help in advance.
I have written a program to gather data on some institutions. The main website provides an id, and then I have to use that id to find a second id that I need from a different source. To save time when institutions are repeated (I can have the same institution in different years, for example, so even though some of the data will be different, the second id won't), I store the pairs of ids in two arrays at the same index, so if I have the indexOf one, I have the other without having to go back to the source for that institution every time.
This works great so far; the problem is when I try to store these pairs. Each string (id) is alphanumeric, can contain spaces, and never exceeds 20 characters.
I try to save the list after some iterations so I can stop and continue the next day, starting with a dictionary already built. The data is put into a text file in the following way, one pair per line:
id1;id2
The problem is that this txt file gets huge. With 16 000 lines it is already above 300 MB, and when it got above 60 000 lines it was already above 22 GB in size.
There is obviously a problem, as I have had files with more text that take far less space. I write the file in the following way:
List<String> aux = new ArrayList<>();
for (int i = 0; i < nipcs.size(); i++) {
    aux.add(id1.get(i) + ";" + id2.get(i));
}
Files.write(file.toPath(), aux, StandardCharsets.UTF_8);
I want to know if these file sizes are normal, or if there is a better method. Basically I do not care whether the txt file is "human-readable"; I just need to be able to load its content back into the id1 and id2 arrays the next day.
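For the read-back part, a minimal sketch of loading the pairs into two parallel lists the next day (the file name pairs.txt is illustrative; the format is the one-pair-per-line layout above):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class LoadPairs {
    public static void main(String[] args) throws IOException {
        List<String> id1 = new ArrayList<>();
        List<String> id2 = new ArrayList<>();
        // Each line is "id1;id2"; the ids may contain spaces but never ';'.
        for (String line : Files.readAllLines(Paths.get("pairs.txt"), StandardCharsets.UTF_8)) {
            String[] parts = line.split(";", 2);
            id1.add(parts[0]);
            id2.add(parts[1]);
        }
        System.out.println(id1.size() + " pairs loaded");
    }
}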
Thank you again!
Edit:
Here is a sample of the txt file, 6 random lines, copy-pasted. I expect to have around 120 000 lines at most once all the institutions are registered.
ESA58357021;645090
507755383;346686
510216099;632378
207781818;840321
513268006;840323
502106344;54991
I think I found out what was happening. I paused Dropbox, and the symptom seems to have disappeared. I believe that writing the file while Dropbox was syncing was probably messing with the structure of the file.
If this goes south again I will post it here.

Go back 'n' lines in file using Stream.lines

I need to build an application which scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value, e.g. "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the purpose of using a Stream, and some files are huge, so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or, perhaps a better question: is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest that, given the files are huge, the second option here is probably the most realistic. The third will probably give you better performance, but will require much more development effort.
As an alternative if you can guarantee that both n and m are below a certain value, and that value is a reasonable size - you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards".
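A rough sketch of that buffering idea, assuming an upper bound m on how far back you ever need to look (the file name and marker string are just placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.stream.Stream;

public class BackwardsBuffer {
    public static void main(String[] args) throws IOException {
        final int m = 10;                        // assumed upper bound on the look-back distance
        Deque<String> lastLines = new ArrayDeque<>(m);

        try (Stream<String> lines = Files.lines(Paths.get("session.log"))) {
            lines.forEach(line -> {
                if (line.contains("=ID: 39487")) {
                    // lastLines holds up to m lines preceding the match, oldest first.
                    System.out.println("context: " + lastLines);
                }
                if (lastLines.size() == m) {
                    lastLines.removeFirst();     // drop the oldest buffered line
                }
                lastLines.addLast(line);
            });
        }
    }
}

Memory stays bounded by m lines, no matter how large the file is.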
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
            .sliding(n, n, ArrayList::new)
            .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
            ./* then do your work */
}
It does not matter how big your file is, as long as n is a small number, not millions.

Count duplicates in huge text files collection

I have this collection of folders:
60G ./big_folder_6
52G ./big_folder_8
61G ./big_folder_7
60G ./big_folder_4
58G ./big_folder_5
63G ./big_folder_2
54G ./big_folder_9
61G ./big_folder_3
39G ./big_folder_10
74G ./big_folder_1
Each folder contains 100 txt files, with one sentence per line. For example, the file ./big_folder_6/001.txt:
sentence ..
sentence ..
...
Each file in the folder is between 4 and 6 GB (as you can see from the totals reported above), with roughly 40-60 million sentences. A single file fits in memory.
I need to deduplicate the sentences globally and count them, so as to obtain a new collection of files where each line carries a count:
count ...unique sentence...
The collection is huge.
My first implementation (in Java) was a "merge sort" approach: dispatch the lines into a new collection of 500 files (choosing the target file from the first N characters of each line), then sort and aggregate duplicates within each file.
I know it is a wordcount map-reduce problem, but I would prefer to avoid it. The question is: am I using the right approach to solve this kind of problem, or should I consider other tools/approaches besides MapReduce?
Do you mean deleting the duplicated lines of each file, or across all files?
In any case, you can't read the whole file at once; you need to read line by line or a memory exception will be thrown. Use a BufferedReader (example here) and a map that stores each string with the count of its repetitions as the value: when you read a line, put it in the map, incrementing the value if it already exists.
After reading the file, write all the lines and their counts to a new file and release the memory.
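A minimal sketch of that per-file counting step (file names are illustrative; this only deduplicates within a single file, as described above):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class CountLines {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();

        // Count identical lines within one file, reading line by line.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("big_folder_6/001.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                counts.merge(line, 1, Integer::sum);
            }
        }

        // Write "count sentence" pairs to a new file, then let the map be garbage collected.
        try (BufferedWriter writer = Files.newBufferedWriter(
                Paths.get("counted_001.txt"), StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                writer.write(e.getValue() + " " + e.getKey());
                writer.newLine();
            }
        }
    }
}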
UPDATE 1
The problem is that you have a lot of gigabytes, so you can't keep every line in memory (it could throw a memory exception), but at the same time you have to keep them in memory to quickly check whether they are duplicated. What comes to my mind is, instead of keeping the full string as the key, keep a hash of the string (using String.hashCode()); when a line is seen for the first time, write it to the new file, flushing only every 100 lines or so to reduce the time spent writing to disk. After you have processed all the files, written the unique lines to that file, and kept only integers in the map (the string's hash code as the key and the count as the value), you read the file containing only unique lines back and create a new file writing each line together with its count.

Java - Trying to create an arraylist of strings but the arraylist gets full(?)

I might just be doing something stupid here, but I'm trying to write a program that will take all the text from an XML file, put it in an ArrayList as strings, then find certain recurring strings and count them. It basically works, but for some reason it won't go through the entire XML file. It's a pretty large file with over 15000 lines (ideally I'd like it to be able to handle any number of lines though). I did a test to output everything it was putting in the ArrayList to a .txt file, and eventually the last line simply says "no", while there are still many more lines of text to go through.
This is the code I'm using to make the arraylist (lines is the amount of lines in the file):
// make array of strings
for (int i = 0; i < lines; i++) {
    strList.add(fin2.next());
}
fin2.close();
Then I'm searching for the desired strings with:
// find strings
for (String string : strList) {
    if (string.matches(identifier)) {
        count++;
    }
}
System.out.println(count);
fout.println(count);
It basically works (the PrintWriter and Scanners work, the line count works, etc.), except the ArrayList won't take all the text from the .xml file, so of course the count at the end is inaccurate. Is an ArrayList not the best solution for this problem?
This is a BAD practice. Each time you put a string into an ArrayList and keep it there, memory usage grows. The bigger the file, the more memory is used, up to the point where you're wondering why your application is using 75% of your memory.
You don't need to store the lines into an ArrayList in order to see if they match. You can simply just read the line and compare it to whatever text you're comparing it to.
Here would be your code modified:
String nextString = "";
while (fin2.hasNext()) {
nextString = fin2.next();
if (nextString.matches(identifier) || nextString.matches(identifier2)) {
count++;
}
}
fin2.close();
System.out.pritnln(count);
This eliminates looping through everything twice, saves you a ton of memory, and gives you accurate results. Also, I'm not sure whether you mean to read the entire line or you have some sort of token. If you want to read the entire line, change hasNext to hasNextLine and next to nextLine.
Edit: Modified the code to show what it would look like looking for multiple strings.
Have you tried to use a map, like HashMap? Since your goal is to count the occurrences of words from an XML file, a HashMap will make your life easier.
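A rough sketch of that idea with a Scanner feeding a HashMap (the file name is illustrative):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCount {
    public static void main(String[] args) throws FileNotFoundException {
        Map<String, Integer> counts = new HashMap<>();
        Scanner in = new Scanner(new File("data.xml"));    // illustrative file name
        while (in.hasNext()) {
            String token = in.next();
            counts.merge(token, 1, Integer::sum);          // one entry per distinct token
        }
        in.close();
        // counts.get("someTag") now gives the number of occurrences of that token.
        System.out.println(counts);
    }
}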
The problem is not with your ArrayList but with your for loop. What's happening is that you're using the number of lines in your file as your sentinel value, but rather than incrementing i by 1 every line, you are doing it every word. Therefore, not all the words are added to your ArrayList because your loop terminates earlier than expected. Hope this helps!
EDIT: I don't know what object you are using right now to collect the contents of this xml file, but I would suggest using Scanner instead (passing the File as a parameter in the constructor) and replacing the current for loop with a while loop that uses while (nameOfScanner.hasNextLine())

Writing/reading array to file

I have an app that will create 3 arrays: 2 with double values and one with strings that can contain anything: alphanumerics, commas, periods, anything the user might want to type, or type by accident. The double arrays are easy. The string one I find tricky.
It can contain stuff like "cake red,blue 1kg paper-clip", you get the idea.
I will need to store those arrays somehow (I guess a file is the easiest way), read them, and get them back into the app whenever the user wants to.
Also, it would be good if they weren't human-readable, so they can only be read through my app.
What's the best way to do this? My issue is how I can read them back into arrays. It's easy to write to a file, but then I have to get them back into the same array I put them in... How can I separate the array elements so that one element never gets split in two because it contains a space or any other separator?
Can I, say, make 3 rows of text, with each element separated by a tab \t or something, and when I read it back split each element on that tab? Could this create any issues when reading?
I guess I want to know how I can separate the elements of the array so that they can never be read back wrong.
Thanks and have a nice day!
If you don't want the file to be human-readable, you could use java.io.RandomAccessFile.
You would probably want to specify a maximum string size if you did this.
To save a string:
String str = "hello";
RandomAccessFile file = new RandomAccessFile(new File("filename"));
final int MAX_STRING_BYTES = 100; // max number of bytes the string could use in the file
file.writeUTF(str);
file.skipBytes(MAX_STRING_BYTES - str.getBytes().length);
// then write another..
To read a string:
// instantiate again
final int STRING_POSITION = 100; // or whichever slot offset you saved it at
file.seek(STRING_POSITION);
String str = file.readUTF();
You would probably want to use the beginning of the file to store the size of each array. Then just store all the values one by one in the file; no need for separators.
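A minimal sketch of that layout using DataOutputStream/DataInputStream instead of RandomAccessFile (names are illustrative, and the resulting file is not human-readable): each array's length is written first, then its elements; writeUTF frames each string, so spaces and commas inside the strings are preserved without any separators.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class ArrayStore {
    // Write: array length first, then each element.
    static void save(double[] doubles, String[] strings, String path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            out.writeInt(doubles.length);
            for (double d : doubles) out.writeDouble(d);
            out.writeInt(strings.length);
            for (String s : strings) out.writeUTF(s); // spaces, commas, etc. are preserved
        }
    }

    // Read: the stored lengths say exactly how many elements to read back.
    static void load(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            double[] doubles = new double[in.readInt()];
            for (int i = 0; i < doubles.length; i++) doubles[i] = in.readDouble();
            String[] strings = new String[in.readInt()];
            for (int i = 0; i < strings.length; i++) strings[i] = in.readUTF();
        }
    }
}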
