I need to write an application that compares some very big CSV files, each one having 40,000 records. I have written an application that works properly, but it spends a lot of time doing the comparison, because the two files could be out of order or have different records; because of that I must iterate (40,000^2)*2 times.
Here is my code:
if (nomFich.equals("CAR"))
{
while ((linea = br3.readLine()) != null)
{
array =linea.split(",");
spliteado = array[0]+array[1]+array[2]+array[8];
FileReader fh3 = new FileReader(cadena + lista2[0]);
BufferedReader bh3 = new BufferedReader(fh3);
find=0;
while (((linea2 = bh3.readLine()) != null))
{
array2 =linea2.split(",");
spliteado2 = array2[0]+array2[1]+array2[2]+array2[8];
if (spliteado.equals(spliteado2))
{
find =1;
}
}
if (find==0)
{
bw3.write("+++++++++++++++++++++++++++++++++++++++++++");
bw3.newLine();
bw3.write("Se han incorporado los siguientes CGI en la nueva lista");
bw3.newLine();
bw3.write(linea);
bw3.newLine();
aparece=1;
}
bh3.close();
}
I think that using a Set in Java is a good option, like the following post suggests:
Comparing two csv files in Java
But before I try it this way, I would like to know if there are any better options.
Thanks in advance.
As far as I can interpret your code, you need to find out which lines in the first CSV file do not have an equal line in the second CSV file. Correct?
If so, you only need to put all lines of the second CSV file into a HashSet. Like so (Java 7 code):
Set<String> linesToCompare = new HashSet<>();
try (BufferedReader reader = new BufferedReader(new FileReader(cadena + lista2[0]))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] splitted = line.split(",");
        linesToCompare.add(splitted[0] + splitted[1] + splitted[2] + splitted[8]);
    }
}
Afterwards you can simply iterate over the lines in the first CSV file and compare:
try (BufferedReader reader = new BufferedReader(new FileReader(...))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] splitted = line.split(",");
        String joined = splitted[0] + splitted[1] + splitted[2] + splitted[8];
        if (!linesToCompare.contains(joined)) {
            // handle missing line here
        }
    }
}
Does that fit your needs?
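One caveat with building the key by plain concatenation: distinct rows can collide ("ab" + "c" produces the same key as "a" + "bc"). If that can occur in your data, put a separator between the fields that cannot itself appear in them; a minimal, Java 7-compatible tweak (the delimiter character is an assumption about your data):
// '\u0001' is assumed never to occur inside the CSV fields; it keeps
// ("ab", "c") and ("a", "bc") from producing the same comparison key
String key = splitted[0] + '\u0001' + splitted[1] + '\u0001'
           + splitted[2] + '\u0001' + splitted[8];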
HashMap<String, String> file1Map = new HashMap<String, String>();
String line;
String[] array;
String key;
while ((line = file1.readLine()) != null) {
    array = line.split(",");
    key = array[0] + array[1] + array[2] + array[8];
    file1Map.put(key, key);
}
while ((line = file2.readLine()) != null) {
    array = line.split(",");
    key = array[0] + array[1] + array[2] + array[8];
    if (file1Map.containsKey(key)) {
        // the line from file2 also exists in file1
    }
    else {
        // the line from file2 has no match in file1
    }
}
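Since the value stored in the map is never read, a HashSet<String> would do the same job; a minimal variation of the loops above:
// Only membership matters here, so a Set replaces the Map
Set<String> file1Keys = new HashSet<>();
// while reading file1:
file1Keys.add(key);
// while reading file2:
if (!file1Keys.contains(key)) {
    // the line from file2 has no match in file1
}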
Assuming this all won't fit in memory, I would first convert the files to their stripped-down versions (el0, el1, el2, el8, orig-file-line-nr-for-reference-afterwards) and then sort those files. After that you can stream through both files simultaneously and compare the records as you go... Taking the sorting out of the equation, you only need to compare them 'about once'.
But I'm guessing you could do the same using some List/Array object that allows for sorting and storing in memory; 40k records really doesn't sound like all that much to me, unless the elements are very big, of course. And it's going to be orders of magnitude faster.
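For completeness, a rough sketch of that in-memory variant, assuming both files fit comfortably; the file names are placeholders, and the merge step walks both sorted key lists exactly once:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedCompare {
    // Read the four key fields of every line into a list
    static List<String> readKeys(String path) throws IOException {
        List<String> keys = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] a = line.split(",");
                keys.add(a[0] + "," + a[1] + "," + a[2] + "," + a[8]);
            }
        }
        return keys;
    }

    public static void main(String[] args) throws IOException {
        List<String> left = readKeys("file1.csv");   // placeholder names
        List<String> right = readKeys("file2.csv");
        Collections.sort(left);
        Collections.sort(right);
        // Walk both sorted lists once, like the merge step of merge sort
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).compareTo(right.get(j));
            if (cmp == 0) { i++; j++; }  // present in both files
            else if (cmp < 0) { System.out.println("only in file1: " + left.get(i)); i++; }
            else { System.out.println("only in file2: " + right.get(j)); j++; }
        }
        while (i < left.size()) System.out.println("only in file1: " + left.get(i++));
        while (j < right.size()) System.out.println("only in file2: " + right.get(j++));
    }
}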
I'm trying to end up with a results.txt minus any matching items, having successfully compared some string inputs against another .txt file. I've been staring at this code for way too long and I can't figure out why it isn't working. I'm new to coding, so I would appreciate being steered in the right direction! Maybe I need a different approach? Apologies in advance for any loud tutting noises you may make. Using Java 8.
//Sending a String[] into 'searchFile', contains around 8 small strings.
//Example of input: String[]{"name1","name2","name 3", "name 4.zip"}
^ This is my exclusions list.
public static void searchFile(String[] arr, String separator)
{
    StringBuilder b = new StringBuilder();
    for (int i = 0; i < arr.length; i++)
    {
        if (i != 0) b.append(separator);
        b.append(arr[i]);
        String findME = arr[i];
        searchInfo(MyApp.getOptionsDir() + File.separator + "file-to-search.txt", findME);
    }
}
^ This works fine. I'm then sending the results to 'searchInfo' and trying to match and remove any duplicate (complete, not partial) strings. This is where I am currently failing. The code runs but doesn't produce my desired output. It often matches partial strings rather than complete ones. I also think the 'results.txt' file is being overwritten each time... but I'm not sure, to be honest!
file-to-search.txt contains: "name2","name.zip","name 3.zip","name 4.zip" (text file is just a single line)
public static String searchInfo(String fileName, String findME)
{
    StringBuffer sb = new StringBuffer();
    try {
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        String line = null;
        while ((line = br.readLine()) != null)
        {
            if (line.startsWith("\"" + findME + "\""))
            {
                sb.append(line);
                // tried various replace options with no joy
                line = line.replaceFirst(findME + "?,", "");
                // then goes off with results to create a txt file
                FileHandling.createFile("results.txt", line);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return sb.toString();
}
What I'm trying to end up with is a results file MINUS any matching complete strings (not partial strings):
e.g. results.txt to end up with: "name.zip","name 3.zip"
OK, with the information I have, what you can do is this:
List<String> result = new ArrayList<>();
String content = FileUtils.readFileToString(file, "UTF-8");
for (String s : content.split(", ")) {
    if (!s.equals(findME)) { // assuming both have string quotes added already
        result.add(s);
    }
}
FileUtils.write(newFile, String.join(", ", result), "UTF-8");
This uses Apache Commons IO's FileUtils for ease. You may add or remove the space after the comma as per your need.
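Note that the sample file-to-search.txt in the question has no space after the commas, so for that exact input the separator would be a bare comma, and each token in the file carries literal quotes; a variation of the same idea (still assuming Apache Commons IO):
// Split on a bare comma to match: "name2","name.zip","name 3.zip","name 4.zip"
String content = FileUtils.readFileToString(file, "UTF-8");
List<String> result = new ArrayList<>();
for (String s : content.split(",")) {
    if (!s.equals("\"" + findME + "\"")) { // compare against the quoted token
        result.add(s);
    }
}
FileUtils.write(newFile, String.join(",", result), "UTF-8");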
BufferedReader br2 = new BufferedReader(
        new InputStreamReader(new FileInputStream(id_zastavky), "windows-1250"));
for (int i = 0; i < id_linky_list.size(); i++)
{
    while ((sCurrentLine2 = br2.readLine()) != null)
    {
        String pom = id_linky_list.get(i);
        String[] result = sCurrentLine2.split("\\|");
        if ((result[1].toString()).equals(pom.toString()))
        {
            System.out.println(result[1].toString() + " " + pom.toString() + " " + result[3]);
        }
    }
}
br2.close();
Hey guys. Can anyone give me advice on why my FOR loop only uses the first item in my id_linky_list and then quits? I think that the problem is on this line:
while ((sCurrentLine2 = br2.readLine()) != null)
I have over 5,000 items in my list and I need to check whether each of them exists in my txt file. When I run my app, the for loop only processes the first item. How should I modify my code to make it work properly? Thank you for any help.
During the first iteration of the for loop, the whole file is read, so br2.readLine() will always return null on subsequent iterations.
Instead, if the file is small enough, you could build a map and use it to check the content:
File file = new File("filename");
List<String> lines = Files.readAllLines(file.toPath(), Charset.defaultCharset());
Map<String, List<String>> map = lines.stream().collect(Collectors.groupingBy(line -> line.split("\\|")[1]));
List<String> id_linky_list = null; // your existing list goes here
for (int i = 0; i < id_linky_list.size(); i++) {
    if (map.get(id_linky_list.get(i)) != null) {
        // sysout
    }
}
Update
Map<String, List<String>> text = Files.lines(file.toPath(), Charset.forName("windows-1250")).collect(Collectors.groupingBy(line -> line.split("\\|")[1]));
Can anyone give me advice on why my FOR loop only uses the first item in my id_linky_list and then quits?
Simply because you read your entire file in the loop while ((sCurrentLine2 = br2.readLine()) != null) on the first call, which is when i = 0; subsequent iterations do nothing, as the file content has already been consumed, so br2.readLine() will return null.
How should I modify my code to make it work properly?
You need to invert the for and while loops, like this:
while ((sCurrentLine2 = br2.readLine()) != null)
{
    for (int i = 0; i < id_linky_list.size(); i++)
    {
To get better performance, consider using a Set instead of a List to store your ids, and simply check whether a given id exists with contains(Object) instead of iterating over your List, as in the sketch below.
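A minimal sketch of that combination, reusing the variable names from the question (untested, and assuming the split always yields at least four fields, hence the length check):
// One pass over the file; Set.contains is O(1) versus a linear List scan
Set<String> wanted = new HashSet<>(id_linky_list);
String sCurrentLine2;
while ((sCurrentLine2 = br2.readLine()) != null) {
    String[] result = sCurrentLine2.split("\\|");
    if (result.length > 3 && wanted.contains(result[1])) {
        System.out.println(result[1] + " " + result[3]);
    }
}
br2.close();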
I have created a method that reads specific lines from a file based on their line number. It works fine for most files, but when I try to read a file that contains a large number of really long lines, it takes ages, particularly as it gets further down in the file. I've also done some debugging, and it appears to use a lot of memory as well, but I'm not sure if this is something that can be improved. I know there are some other questions that focus on how to read certain lines from a file, but this question is focused primarily on the performance aspect.
public static final synchronized List<String> readLines(final File file, final Integer start, final Integer end) throws IOException {
    BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
    List<String> lines = new ArrayList<>();
    try {
        String line = bufferedReader.readLine();
        Integer currentLine = 1;
        while (line != null) {
            if ((currentLine >= start) && (currentLine <= end)) {
                lines.add(line + "\n");
            }
            currentLine++;
            if (currentLine > end) {
                return lines;
            }
            line = bufferedReader.readLine();
        }
    } finally {
        bufferedReader.close();
    }
    return lines;
}
How can I optimize this method to be faster than light?
I realised that what I was doing before was inherently slow and used too much memory.
By adding all lines to memory and then processing them in a List, it was not only taking twice as long but was also creating String variables for no reason.
I am now using a Java 8 Stream and processing at the point of reading, which is the fastest method I've used so far:
Path path = Paths.get(file.getAbsolutePath());
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
    for (String line : (Iterable<String>) stream::iterator) {
        // do stuff
    }
}
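If only a start..end range is needed, as in the original readLines method, the same stream can jump straight to it. A sketch, assuming 1-based, inclusive line numbers; note that skip still has to read the earlier lines in order to count them, but it no longer keeps them in memory:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public static List<String> readLines(Path path, int start, int end) throws IOException {
    try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
        return stream.skip(start - 1)            // discard lines before 'start'
                     .limit(end - start + 1)     // stop once 'end' is reached
                     .collect(Collectors.toList());
    }
}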
I am a little stuck with a Java exercise I am currently working on. I have a text file in this format:
Quio Kla,2221,3.6
Wow Pow,3332,9.3
Zou Tou,5556,9.7
Flo Po,8766,8.1
Andy Candy,3339,6.8
I now want to calculate the average of the whole third column, but I believe I have to extract the data first and store it in an array. I was able to read all the data with a buffered reader and print out the entire file in the console, but that did not get me closer to getting it into an array. Any suggestions on how I can read a specific column of a text file into an array with a buffered reader would be highly appreciated.
Thank you very much in advance.
You can split your text file by using this portion of code:
BufferedReader in = null;
try {
    in = new BufferedReader(new FileReader("textfile.txt"));
    String read = null;
    while ((read = in.readLine()) != null) {
        String[] splited = read.split(",");
        for (String part : splited) {
            System.out.println(part);
        }
    }
} catch (IOException e) {
    System.out.println("There was a problem: " + e);
    e.printStackTrace();
} finally {
    try {
        in.close();
    } catch (Exception e) {
    }
}
And then you'll have all your columns in the splited array.
It's definitely not the best solution, but it should be sufficient for you:
BufferedReader input = new BufferedReader(new FileReader("/file"));
int numOfColumn = 3; // the third column holds the decimal values
String line = "";
ArrayList<Double> lines = new ArrayList<>();
while ((line = input.readLine()) != null) {
    lines.add(Double.valueOf(line.split(",")[numOfColumn - 1]));
}
double sum = 0;
for (double j : lines) {
    sum += j;
}
double avg = sum / lines.size();
I'm going to assume each data set is separated by newline characters in your text file.
ArrayList<Double> thirdColumn = new ArrayList<>();
BufferedReader in = null;
String line = null;
// initialize your reader here
while ((line = in.readLine()) != null) {
    String[] split = line.split(",");
    if (split.length > 2)
        thirdColumn.add(Double.parseDouble(split[2]));
}
By the end of the while loop, you should have the thirdColumn ArrayList ready and populated with the required data.
The assumption is made that your data set has the following standard format.
String,Integer,Double
So naturally a split by a comma should give a String array of length 3, where the String at index 2 contains your third-column data.
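To actually get the average the question asks for, the list can be reduced with a DoubleStream; a short follow-up using the thirdColumn list built above:
// average() yields an OptionalDouble, which is empty for an empty list
double average = thirdColumn.stream()
        .mapToDouble(Double::doubleValue)
        .average()
        .orElse(0.0);
System.out.println("Average of third column: " + average);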
My app needs to read from several files in the assets folder. But my files use the delimiters $$$ and ||. The structure of the file is like this:
Construction$$$
All the work involved in assembling resources and
putting together the materials required to form a new or changed
facility.||
Construction Contractor$$$
A corporation or individual
who has entered into a contract with the organization to perform
construction work.||
The sentences ending with $$$ are to be stored in one array list, and the sentences ending with || are to be stored in a separate array list.
How can I do this? Any sample or example code would be appreciated. Note that these files are very long.
BufferedReader br = null;
try {
    br = new BufferedReader(new InputStreamReader(getAssets().open("c.txt"))); // throwing a FileNotFoundException?
    String word;
    while ((word = br.readLine()) != null)
        A_Words_array.add(word); // break txt file into different words, add to wordList
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    try {
        br.close(); // stop reading
    }
    catch (IOException ex) {
        ex.printStackTrace();
    }
}
String[] words = new String[A_Words_array.size()];
A_Words_array.toArray(words); // make array of wordList
for (int i = 0; i < words.length; i++)
    Log.i("Read this: ", words[i]);
Above is the code I found. Now, how do I split my sentences based upon the ending delimiters?
Assuming that each sentence is on one line and finishes with either $$$ or ||, you can store the lines in different arrays depending on their endings:
List<String> list1 = new ArrayList<>();
List<String> list2 = new ArrayList<>();
String line;
while ((line = br.readLine()) != null) {
    if (line.endsWith("$$$")) {
        list1.add(line);
    } else {
        list2.add(line);
    }
}
String[] dollarlines = list1.toArray(new String[list1.size()]);
String[] verticalLines = list2.toArray(new String[list2.size()]);
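One thing to watch: in the file shown in the question, a record can span several lines, with $$$ or || appearing only at the end of its last line. If that is the case for your real files, one option (a sketch, untested against your data) is to accumulate lines until a delimiter turns up:
List<String> terms = new ArrayList<>();        // entries ending with $$$
List<String> definitions = new ArrayList<>();  // entries ending with ||
StringBuilder current = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
    current.append(line.trim()).append(' ');
    String sentence = current.toString().trim();
    if (sentence.endsWith("$$$")) {
        terms.add(sentence.substring(0, sentence.length() - 3).trim());
        current.setLength(0);                  // reset for the next entry
    } else if (sentence.endsWith("||")) {
        definitions.add(sentence.substring(0, sentence.length() - 2).trim());
        current.setLength(0);
    }
}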