Count duplicates in huge text files collection - java

I have this collection of folders:
60G ./big_folder_6
52G ./big_folder_8
61G ./big_folder_7
60G ./big_folder_4
58G ./big_folder_5
63G ./big_folder_2
54G ./big_folder_9
61G ./big_folder_3
39G ./big_folder_10
74G ./big_folder_1
Each folder contains 100 txt files, with one sentence per line. For example, the file ./big_folder_6/001.txt:
sentence ..
sentence ..
...
Each file in a folder is between 4 and 6 GB (as you can see from the totals reported above), with roughly 40-60 million sentences. A single file fits in memory.
I need to deduplicate the sentences globally and count them, so as to obtain a new collection of files where each line carries its count:
count ...unique sentence...
The collection is huge.
My first implementation (in Java) was a "merge sort" approach: order the lines into a new collection of 500 files (dispatching each line to the right file using its first N characters), then sort and aggregate the duplicates within each single file.
I know this is a word-count map-reduce problem, but I would prefer to avoid it. The question is: am I using the right approach to solve this kind of problem, or should I consider another tool/approach besides MapReduce?
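For reference, a minimal sketch of the dispatch step in that approach, assuming the bucket is chosen by hashing the first N characters of each line (the bucket count, prefix length, and file names here are illustrative, not the original code):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Dispatches every line of every input file into one of BUCKETS bucket files,
// so that all copies of the same sentence land in the same (memory-sized) bucket.
// Note: this keeps 500 writers open at once, so OS file-descriptor limits apply.
public class Dispatcher {
    private static final int BUCKETS = 500;
    private static final int PREFIX_LEN = 8;   // illustrative value of N

    public static void dispatch(Iterable<Path> inputFiles, Path outDir) throws IOException {
        BufferedWriter[] writers = new BufferedWriter[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            writers[i] = Files.newBufferedWriter(outDir.resolve("bucket_" + i + ".txt"), StandardCharsets.UTF_8);
        }
        try {
            for (Path file : inputFiles) {
                try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String prefix = line.length() > PREFIX_LEN ? line.substring(0, PREFIX_LEN) : line;
                        int bucket = Math.floorMod(prefix.hashCode(), BUCKETS);
                        writers[bucket].write(line);
                        writers[bucket].newLine();
                    }
                }
            }
        } finally {
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
    }
}

Each bucket can then be sorted or counted independently, since identical sentences always share a bucket.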

Do you mean deleting duplicated lines within each file, or among all files?
In any case, you can't read the whole file into memory; you need to read it line by line or an OutOfMemoryError will be thrown. Use a BufferedReader (example here), and use a map that stores the line with the count of repetitions as the value; when you read a line, put it in the map, incrementing the value if it already exists.
After reading the file, write all the lines and their counts to a new file and release the memory.
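A minimal sketch of what this answer describes for a single, memory-sized file, using a BufferedReader and a map from line to count (the tab-separated output format is an assumption):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Counts duplicate lines of a single file and writes "count<TAB>line" to the output.
public class FileLineCounter {
    public static void countLines(Path input, Path output) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                counts.merge(line, 1, Integer::sum);   // increment the count for this line
            }
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                writer.write(e.getValue() + "\t" + e.getKey());
                writer.newLine();
            }
        }
    }
}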
UPDATE 1
The problem is that you have a lot of gigabytes, so you can't keep every line in memory because that can throw an OutOfMemoryError, but at the same time you have to keep track of them to quickly check whether a line is a duplicate. What comes to my mind is: instead of having the string as the key, put a hash of the string (using String.hashCode()), and when a line is seen for the first time, write it to the new file, flushing only every 100 lines or more to reduce the time spent writing to disk. After you have processed all the files, written the unique lines to that file, and you have only integers in the map (the hash code of the string as the key and the count as the value), you read the file containing only unique lines back and create a new file writing each line with its count value.
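As a sketch of this hash-keyed idea (note that distinct sentences can share a hashCode(), so collisions would merge their counts; this is a memory-saving approximation rather than exact de-duplication):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// First pass: only int hash codes are kept in memory, and each line is written
// to uniqueOut the first time its hash is seen. A second pass over uniqueOut can
// then attach the counts from the returned map.
public class HashKeyedCounter {
    public static Map<Integer, Integer> firstPass(Iterable<Path> inputs, Path uniqueOut) throws IOException {
        Map<Integer, Integer> counts = new HashMap<>();
        try (BufferedWriter writer = Files.newBufferedWriter(uniqueOut, StandardCharsets.UTF_8)) {
            for (Path input : inputs) {
                try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        int key = line.hashCode();
                        int newCount = counts.merge(key, 1, Integer::sum);
                        if (newCount == 1) {          // first time this hash is seen
                            writer.write(line);
                            writer.newLine();
                        }
                    }
                }
            }
        }
        return counts;
    }
}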

List to txt file getting insanely large, I mean huge

Thank you in advance for your help.
I have written some code to gather data on some institutions. The main website provides an id, and I then have to use that id to find a second id that I need from a different source. To save time when institutions are repeated (I can have the same institution in different years, for example, so even though some data will be different, the second id won't), I am storing the pairs of ids in two arrays, under the same index, so if I have the "indexOf" one, I have the other without having to go back to the source for that institution every time.
This works great so far; however, the problem is when I try to store these pairs. Each string (id) is alphanumeric, may contain spaces, and never exceeds 20 characters.
I try to save the list after some iterations, to save the job before continuing the next day, so I can start with the dictionary already built. The data is put into a text file in the following way, one pair per line:
id1;id2
The problem is that this txt file gets huge. I mean, with 16,000 lines it's already above 300 MB, and when it got above 60,000 lines it was already above 22 GB in size.
There is obviously a problem as I have had files with more text taking far less space. I write the file in the following way:
List<String> aux = new ArrayList<>();
for (int i = 0; i < nipcs.size(); i++) {
    aux.add(id1.get(i) + ";" + id2.get(i));
}
Files.write(file.toPath(), aux, StandardCharsets.UTF_8);
I want to know whether these file sizes are normal, or whether there is a better method. Basically I do not care if the txt file is "human-readable"; I just want to be able to load its content back into the id1 and id2 arrays the next day.
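For that read-back step, a minimal sketch, assuming the same parallel id1/id2 lists as in the snippet above:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical loader for the "id1;id2" file back into the two parallel lists.
public class PairLoader {
    public static void load(Path file, List<String> id1, List<String> id2) throws IOException {
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            int sep = line.indexOf(';');
            if (sep < 0) {
                continue;                     // skip malformed lines
            }
            id1.add(line.substring(0, sep));  // first id
            id2.add(line.substring(sep + 1)); // second id
        }
    }
}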
Thank you again!
Edit:
Here is a sample of the txt file as written, 6 random lines, copy-pasted. I expect to have around 120,000 lines at most once all the institutions are registered.
ESA58357021;645090
507755383;346686
510216099;632378
207781818;840321
513268006;840323
502106344;54991
I think I found out what was happening. I paused Dropbox, and the symptom seems to have disappeared. I believe that writing the file while Dropbox was syncing was probably messing with the structure of the file.
If this goes south again I will post it here.

Go back 'n' lines in file using Stream.lines

I need to build an application which scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value, e.g. "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the reason for using a Stream, and some files are huge, so that would cause memory problems.
Is something like this possible using the Stream API (Files)? Or, perhaps a better question: is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter whether you're using streams, iterators, or a plain BufferedReader).
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory).
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
Given that the files are huge, I'd suggest the second option here is probably the most realistic. The third will probably give you better performance, but will require much more development effort.
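A sketch of that second, two-pass option (the marker string and the meaning of n are taken from the question; the class and method names are hypothetical):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

// Pass 1 records the line numbers we will need (n lines above each marker match);
// pass 2 re-reads the file and collects only those lines.
public class TwoPassLookup {
    public static Set<String> linesAbove(Path file, String marker, int n) throws IOException {
        Set<Long> wanted = new HashSet<>();
        long lineNo = 0;
        try (Stream<String> pass1 = Files.lines(file, StandardCharsets.UTF_8)) {
            for (String line : (Iterable<String>) pass1::iterator) {
                if (line.contains(marker) && lineNo >= n) {
                    wanted.add(lineNo - n);   // remember the line n above this match
                }
                lineNo++;
            }
        }
        Set<String> result = new HashSet<>();
        lineNo = 0;
        try (Stream<String> pass2 = Files.lines(file, StandardCharsets.UTF_8)) {
            for (String line : (Iterable<String>) pass2::iterator) {
                if (wanted.contains(lineNo)) {
                    result.add(line);
                }
                lineNo++;
            }
        }
        return result;
    }
}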
As an alternative, if you can guarantee that both n and m are below a certain value, and that value is reasonably small, you could also just keep that many lines in a buffer as you're processing the file, and read back through the buffer when you need lines "backwards".
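And a sketch of this bounded-buffer alternative, assuming an upper bound maxBack on how far back you ever need to look (names are hypothetical):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

// Single pass: keep the last maxBack lines in a deque so that, when the marker
// line appears, the line n positions above it is still available.
public class BoundedBackBuffer {
    public static String findLineAbove(Path file, String marker, int n, int maxBack) throws IOException {
        Deque<String> previous = new ArrayDeque<>(maxBack);
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(marker)) {
                    // element n back from the end of the buffer (oldest available if n is too large)
                    return previous.stream().skip(Math.max(0, previous.size() - n)).findFirst().orElse(null);
                }
                if (previous.size() == maxBack) {
                    previous.removeFirst();   // drop the oldest line to bound memory
                }
                previous.addLast(line);
            }
        }
        return null;   // marker not found
    }
}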
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
            .sliding(n, n, ArrayList::new)
            .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"));
    // then do your work (continue the chain from filter(...) instead of the ';')
}
It doesn't matter how big your file is, as long as n is a small number, not millions.

Taking cartesian product of two huge files using hadoop

We have two files, file1 and file2, each with a huge number of lines, let's say a billion lines each. The goal here is to take the cartesian product of the files.
So if file1 has m lines and file2 has n lines the cartesian product output would have m*n lines.
I could think of the solution below for this problem:
Write mapper1 which reads each line as <line_number, line_content> from file1 and outputs <const, file1_marker+line_number+line_content>.
Write mapper2 which reads each line as <line_number, line_content> from file2 and outputs <const, file2_marker+line_number+line_content>.
Write mapper3 which reads from both the output files and outputs <const, list_of_file1_and_file2_line_contents>.
Write a reducer which gets the mapper3 output and, while traversing the values, combines every file1_marker+line_number value with every file2_marker+line_number value it sees (it could be made faster by hashing, etc.) and outputs <file1_line_content, file2_line_contents>.
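For illustration, a hedged sketch of what mapper1 could look like with Hadoop's standard TextInputFormat (the byte offset stands in for the line number, and the "F1|" marker and class name are assumptions); mapper2 would be symmetric with an "F2|" marker:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every line of file1 under a single constant key, tagged with a file
// marker and its byte offset, mirroring step 1 above.
public class File1Mapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text CONST_KEY = new Text("const");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(CONST_KEY, new Text("F1|" + offset.get() + "|" + line.toString()));
    }
}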
But it appears that the amount of memory will be an issue here. So I am looking for a more memory efficient solution if possible.
Please suggest.

Removing file lines containing data which are not present in another file

I have a file Hier.csv which looks like this (thousands of lines):
value;nettingNodeData;ADM59505_10851487;CVAEngine;ADM;;USD;0.4;35661;BDR;NA;ICE;;RDC;MAS35661_10851487;CVAEngine;MA;10851487;RDC
I have another one, Prices.csv, which looks like this :
value;nettingNodePrices;ADM68834_22035364;CVAEngine;CVA with FTD;EUR;1468.91334249291905;DVA with FTD;EUR;5365.59742483701497
I have to make sure that both files have the same number of lines and the same ids (the third value of each line), and it is a known fact that the set of ids from Hier.csv is larger and contains the set of ids from Prices.csv, i.e. some ids that are in Hier.csv are not in Prices.csv.
Also, there are no duplicates in either file.
So far, I have tried the following, but it's taking ages, and not working (I can do it faster with my little hands and Excel, but that's not what I want).
Here is my program in pseudo-code; as I don't have access to my code right now, I will edit this question as soon as I can:
for each line of Hier.csv
    for each line of Prices.csv
        if prices.line doesn't contain the 3rd value of hier.line
            store that value in a list
        end
    end
end
Process p;
for each value in the list
    // remove the line containing that value from Hier.csv
    String[] command1 = {"sed", "'/^.*" + value + ".*$/d'", "Hier.csv", ">", "tmp.csv"};
    Process p = Runtime.getRuntime().exec(command1)
end
String[] command2 = {"mv", "tmp.csv" "Hier.csv"};
Process p = Runtime.getRuntime().exec(command2)
Is there a better way than that double loop?
Why doesn't the last part (exec(command)) work?
And lastly, which is more efficient for reading csv files: BufferedReader or Scanner?
You can use a merge or a hashtable.
Merge:
sort both files and merge them together.
Hashtable:
load the smaller file's ids into a hashtable, loop through the bigger file, and test each line's id for existence against the hashtable.
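A single-pass sketch of the hashtable option (the output file name Hier_filtered.csv and the use of a HashSet are assumptions):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Load the ids of the smaller file (Prices.csv) into a set, then keep only the
// Hier.csv lines whose 3rd field is present in that set.
public class HierFilter {
    public static void main(String[] args) throws IOException {
        Set<String> priceIds = new HashSet<>();
        try (BufferedReader prices = Files.newBufferedReader(Paths.get("Prices.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = prices.readLine()) != null) {
                priceIds.add(line.split(";")[2]);   // third value of each line
            }
        }
        try (BufferedReader hier = Files.newBufferedReader(Paths.get("Hier.csv"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("Hier_filtered.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = hier.readLine()) != null) {
                String[] fields = line.split(";");
                if (fields.length > 2 && priceIds.contains(fields[2])) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}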

Java compiler error: lookup table exceeds 65535 limit

I'm running into this compiler error due to my extremely large lookup table based on this definition:
//92 X 182 array
private static final double[][] lookUpTable = new double[][]
{
{ numbers....}
};
As I understand it, dividing it up is a solution, but it would be extremely difficult to split this array up accurately. I also believe I could move it out to a file, but I don't know if I could format it in a way that would help me, plus I don't want file reads every second.
Are there any other suggestions to help me get around this?
Convert your table to a file, embed the file as a resource, read it once in a static initialization block, and store it in a lookUpTable array. It will not be distinguishable from an array initialized through an aggregate, except there would be no 65535 limit. Storing in a static variable will help you avoid "reads every second".
As far as the format is concerned, you can put each row of the matrix in a separate line of the resource file. Reading and maintaining this file would be simple, because there would be no other mark-up around your numbers.
Here is a link to an answer explaining how to read a file from a resource.
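A sketch of that static initialization block, assuming a classpath resource named lookUpTable.txt with one row per line and whitespace-separated values (the resource name and class name are assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Loads the 92 x 182 table from a resource once, at class initialization.
public final class LookUpTables {
    static final double[][] lookUpTable = new double[92][182];

    static {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                LookUpTables.class.getResourceAsStream("/lookUpTable.txt"), StandardCharsets.UTF_8))) {
            for (int row = 0; row < lookUpTable.length; row++) {
                String[] values = reader.readLine().trim().split("\\s+");   // one row per line
                for (int col = 0; col < lookUpTable[row].length; col++) {
                    lookUpTable[row][col] = Double.parseDouble(values[col]);
                }
            }
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private LookUpTables() {
    }
}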
Read the file once, on demand.
As you have a table/matrix, I suggest having one line per row. Read each line, split it into numbers, and parse them individually.
Alternatively, you could keep the rows as comma-separated values in strings (thus reducing the number of objects for Java to handle) and, on program start, split each row to build up your table of doubles.
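A sketch of that string-rows variant, with placeholder values rather than the real table:

// Rows kept as String constants, parsed once at class load; each string must
// stay under the 64 KB constant limit, but the 65535 method-size limit on the
// aggregate initializer no longer applies.
public final class InlineTable {
    private static final String[] ROWS = {
        "0.0,0.1,0.2",   // row 0 (placeholder values)
        "1.0,1.1,1.2"    // row 1 (placeholder values)
    };

    static final double[][] lookUpTable = new double[ROWS.length][];

    static {
        for (int r = 0; r < ROWS.length; r++) {
            String[] parts = ROWS[r].split(",");
            lookUpTable[r] = new double[parts.length];
            for (int c = 0; c < parts.length; c++) {
                lookUpTable[r][c] = Double.parseDouble(parts[c]);
            }
        }
    }

    private InlineTable() {
    }
}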
