I need to analyze differences between two large data files which should each have identical structures. Each file is a couple of gigabytes in size, with perhaps 30 million lines of text data. The files are so large that I hesitate to load each into its own array; it might be easier to just iterate through the lines in order. Each line has the structure:
topicIdx, recordIdx, other fields...
topicIdx and recordIdx are sequential, starting at zero and incrementing +1 with each iteration, so it is easy to find them in the files. (No searching around required; just increment forward in order).
I need to do something like:
for each line in fileA
    store line in String itemsA
    get topicIdx and recordIdx
    find line in fileB with same topicIdx and recordIdx
    if exists
        store this line in string itemsB
        for each item in itemsA
            compare value with same index in itemsB
            if these two items are not virtually equal
                //do something
            else
                //do something else
I wrote the following code with FileReader and BufferedReader, but the APIs for these do not seem to provide the functionality that I need. Can anyone show me how to fix the code below so that it accomplishes what I desire?
void checkData() {
    FileReader fileReaderA;
    FileReader fileReaderB;
    int topicIdx = 0;
    int recordIdx = 0;
    try {
        fileReaderA = new FileReader("B:\\mypath\\fileA.txt");
        fileReaderB = new FileReader("B:\\mypath\\fileB.txt");
        BufferedReader readerA = new BufferedReader(fileReaderA);
        BufferedReader readerB = new BufferedReader(fileReaderB);
        String lineA = null;
        while ((lineA = readerA.readLine()) != null) {
            if (!lineA.isEmpty()) {
                List<String> itemsA = Arrays.asList(lineA.split("\\s*,\\s*"));
                topicIdx = Integer.parseInt(itemsA.get(0));
                recordIdx = Integer.parseInt(itemsA.get(1));
                String lineB = null;
                //lineB = readerB.readLine(); //i know this syntax is wrong
                setB = rows from FileReaderB where itemsB.get(0).equals(itemsA.get(0)); // pseudocode
                for each lineB in setB { // pseudocode
                    List<String> itemsB = Arrays.asList(lineB.split("\\s*,\\s*"));
                    for (int j = 0; j < itemsA.size(); j++) {
                        double myDblA = Double.parseDouble(itemsA.get(j));
                        double myDblB = Double.parseDouble(itemsB.get(j));
                        if (Math.abs(myDblA - myDblB) > 0.0001) {
                            //do something
                        }
                    }
                }
            }
        }
        readerA.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
You need both files sorted by your search keys (topicIdx and recordIdx), so you can do a merge-style pass like this:
open file 1
open file 2
read lineA from file 1
read lineB from file 2
while (there is lineA and lineB)
    if (key lineB < key lineA)
        read lineB from file 2
        continue loop
    if (key lineB > key lineA)
        read lineA from file 1
        continue
    // at this point, you have lineA and lineB with matching keys
    process your data
    read lineB from file 2
Note that you'll only ever have two records in memory.
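A minimal Java sketch of that merge, assuming both files are sorted by topicIdx then recordIdx and that recordIdx stays below one million (the class name, the key packing, and the 0.0001 tolerance are my own placeholders, not from the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class MergeCompare {
    // Compare two sorted streams of "topicIdx,recordIdx,value..." lines.
    // Returns the keys ("topicIdx,recordIdx") whose numeric fields differ
    // by more than eps. Only two lines are ever held in memory.
    static List<String> compare(BufferedReader a, BufferedReader b, double eps) throws IOException {
        List<String> mismatches = new ArrayList<>();
        String lineA = a.readLine(), lineB = b.readLine();
        while (lineA != null && lineB != null) {
            String[] fa = lineA.split("\\s*,\\s*");
            String[] fb = lineB.split("\\s*,\\s*");
            long keyA = key(fa), keyB = key(fb);
            if (keyB < keyA) { lineB = b.readLine(); continue; } // B is behind: advance B
            if (keyB > keyA) { lineA = a.readLine(); continue; } // A is behind: advance A
            // matching keys: compare the remaining fields with a tolerance
            for (int i = 2; i < Math.min(fa.length, fb.length); i++) {
                if (Math.abs(Double.parseDouble(fa[i]) - Double.parseDouble(fb[i])) > eps) {
                    mismatches.add(fa[0] + "," + fa[1]);
                    break;
                }
            }
            lineA = a.readLine();
            lineB = b.readLine();
        }
        return mismatches;
    }

    // Pack topicIdx and recordIdx into one sortable key
    // (assumes recordIdx < 1,000,000).
    static long key(String[] fields) {
        return Long.parseLong(fields[0]) * 1_000_000L + Long.parseLong(fields[1]);
    }

    public static void main(String[] args) throws IOException {
        // StringReaders stand in for the two large files
        BufferedReader a = new BufferedReader(new StringReader("0,0,1.0\n0,1,2.0\n0,2,3.0\n"));
        BufferedReader b = new BufferedReader(new StringReader("0,0,1.0\n0,2,3.5\n"));
        System.out.println(compare(a, b, 0.0001)); // [0,2]
    }
}
```

Key "0,1" exists only in file A, so the merge simply skips past it; only "0,2", whose values differ by more than the tolerance, is reported.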
If you really need this in Java, why not use java-diff-utils? It implements a well-known diff algorithm.
Consider https://code.google.com/p/java-diff-utils/. Let someone else do the heavy lifting.
Related
I have a big CSV (12 GB), so I can't read it into memory, and I need only the first 100 rows and want to save the file back (truncate it). Does Java have such an API?
The other answers create a new file from the original file. As I understand it, you want to truncate the original file instead. You can do that quite easily using RandomAccessFile:
try (RandomAccessFile file = new RandomAccessFile(FILE, "rw")) {
for (int i = 0; i < N && file.readLine() != null; i++)
; // just keep reading
file.setLength(file.getFilePointer());
}
The caveat is that this will truncate after N lines, which is not necessarily the same thing as N rows, because CSV files can have rows that span multiple lines. For example, here is one CSV record that has a name, address, and phone number, and spans multiple lines:
Joe Bloggs, "1 Acacia Avenue,
Naboo Town,
Naboo", 01-234 56789
If you are sure all your rows only span one line, then the above code will work. But if there is any possibility that your CSV rows may span multiple lines, then you should first parse the file with a suitable CSV reader to find out how many lines you need to retain before you truncate the file. OpenCSV makes this quite easy:
final long numLines;
try (CSVReader csvReader = new CSVReader(new FileReader(FILE))) {
csvReader.skip(N); // Skips N rows, not lines
numLines = csvReader.getLinesRead(); // Gives number of lines, not rows
}
try (RandomAccessFile file = new RandomAccessFile(FILE, "rw")) {
for (int i = 0; i < numLines && file.readLine() != null; i++)
; // just keep reading
file.setLength(file.getFilePointer());
}
You should stream the file: read it line by line.
For example:
CSVReader reader = new CSVReader(new FileReader("myfile.csv"));
String[] nextLine;
// readNext() reads the next line from the buffer and converts it to a string array
while ((nextLine = reader.readNext()) != null) {
    System.out.println(Arrays.toString(nextLine)); // print the array's contents, not its reference
}
If you need just a hundred lines, reading that small portion of the file into memory is quick and cheap. In Kotlin, you could use the standard library file APIs to achieve this quite easily:
val firstHundredLines = File("test.csv").useLines { lines ->
lines.take(100).joinToString(separator = System.lineSeparator())
}
File("test.csv").writeText(firstHundredLines)
Possible solution
File file = new File(fileName);
// collect first N lines
String newContent = null;
try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
newContent = reader.lines().limit(N).collect(Collectors.joining(System.lineSeparator()));
}
// replace original file with collected content
Files.write(file.toPath(), newContent.getBytes(), StandardOpenOption.TRUNCATE_EXISTING);
I really do not want to create a duplicate question, but none of the answers on SO were applicable to my problem.
The answer in this question:
How to read a file from a certain offset in Java?
uses RandomAccessFile, but the implementations I found need all the file lines to have the same length.
How can I get List lines = readLinesFromLine(file);?
I tried
private static String readRandomAccessFile(String filepath, int lineStart, int lineEnd, int charsPerLine, String delimiter) {
    File file = new File(filepath);
    List<String> dialogLineRead = new ArrayList<>();
    String data = "";
    int bytesPerLine = charsPerLine + 2; // +2 for the CRLF line separator
    try {
        RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
        for (int i = lineStart; i < lineEnd; i++) {
            randomAccessFile.seek((long) bytesPerLine * i);
            data = randomAccessFile.readLine();
            dialogLineRead.add(data);
        }
        randomAccessFile.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    StringBuilder returnData = new StringBuilder();
    for (String line : dialogLineRead) {
        returnData.append(line).append(delimiter);
    }
    return returnData.toString();
}
But like I said, charsPerLine has to be the same for each line.
I tried counting the characters of each line in the file and storing the counts in a list, but with a 2 GB log file that takes too much RAM.
Any ideas?
For a standard text file where you don't know the line lengths in advance, there's really no way around reading the whole thing line by line, like in this answer, for example.
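That single sequential pass doesn't have to keep the lines, though: a common compromise is to record only the byte offset where each line starts, after which any line range is one seek away regardless of line length. A sketch under that assumption (class and method names are mine; note that RandomAccessFile.readLine decodes bytes one-per-char, so this assumes ASCII-ish content):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    // One pass over the file records the byte offset of each line start.
    // One long per line (~240 MB for 30M lines) is far smaller than
    // holding the line contents themselves.
    static List<Long> buildIndex(RandomAccessFile raf) throws IOException {
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);
        while (raf.readLine() != null) {
            offsets.add(raf.getFilePointer());
        }
        offsets.remove(offsets.size() - 1); // last entry is EOF, not a line start
        return offsets;
    }

    // Read lines [start, end) with a single seek.
    static List<String> readLines(RandomAccessFile raf, List<Long> index, int start, int end) throws IOException {
        List<String> lines = new ArrayList<>();
        raf.seek(index.get(start));
        for (int i = start; i < end; i++) {
            lines.add(raf.readLine());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "short\na much longer line\nx\nlast\n".getBytes(StandardCharsets.UTF_8));
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            List<Long> index = buildIndex(raf);
            System.out.println(readLines(raf, index, 1, 3)); // [a much longer line, x]
        }
        Files.delete(tmp);
    }
}
```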
I currently have some code that loads some WAPT load testing data from a CSV file into an ArrayList.
int count = 0;
String file = "C:\\Temp\\loadtest.csv";
List<String[]> content = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line = "";
while (((line = br.readLine()) != null) && (count <18)) {
content.add(line.split(";"));
count++;
}
} catch (IOException e) {
//Some error logging
}
So now it gets complicated. The CSV file looks like this; the separator is a ";".
In this case I want to ignore the first line; it's just minutes. I want the second row, but I need to ignore "AddToCartDemo" and "Users", so the first ten entries (in this case all ten 5's) get loaded into the first ten columns of the database. Likewise, in the third row "Pages/sec" is ignored and the data after it is loaded into the next ten columns of the database, and so on.
;;0:01:00;0:02:00;0:03:00;0:04:00;0:05:00;0:06:00;0:07:00;0:08:00;0:09:00;0:10:00;
AddToCartDemo;Users;5;5;5;5;5;5;5;5;5;5;
;Pages/sec;0.25;0.1;0.22;0.65;0.03;0.4;0.43;0.17;0.22;0.4;
;Hits/sec;0.25;0.1;0.27;0.85;0.03;0.5;0.53;0.22;0.27;0.5;
;HTTP errors;0;0;0;0;0;0;0;0;0;0;
;KB received;1015;4595;422;1600;2.46;4374;1527;1491;2551;2954;
;KB sent;12.9;3.66;13.8;39.9;5.22;21.7;23.8;13.2;12.2;23.1;
;Receiving kbps;135;613;56.3;213;0.33;583;204;199;340;394;
;Sending kbps;1.73;0.49;1.84;5.32;0.7;2.89;3.17;1.76;1.63;3.07;
Anyone have any ideas on how to accomplish this? As usual, a search brings up nothing even close to this. Thanks much in advance!
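One way to sketch the parsing described above: split each row on ";", drop the two label columns, and keep the ten numeric values (the class name and the tiny inline sample are mine; split(";") conveniently discards the empty field that the trailing separator would otherwise produce):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WaptParser {
    // Skips the first (timestamps) line, then for every following row drops
    // the two label columns and keeps the numeric values.
    static List<double[]> parse(BufferedReader br) throws IOException {
        List<double[]> rows = new ArrayList<>();
        String line = br.readLine(); // discard the minutes header
        while ((line = br.readLine()) != null) {
            String[] fields = line.split(";"); // trailing empty fields are dropped
            double[] values = new double[fields.length - 2];
            for (int i = 2; i < fields.length; i++) {
                values[i - 2] = Double.parseDouble(fields[i]);
            }
            rows.add(values);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // shortened two-column sample in the same shape as the question's file
        String csv = ";;0:01:00;0:02:00;\n"
                   + "AddToCartDemo;Users;5;5;\n"
                   + ";Pages/sec;0.25;0.1;\n";
        List<double[]> rows = parse(new BufferedReader(new StringReader(csv)));
        System.out.println(Arrays.toString(rows.get(0))); // [5.0, 5.0]
        System.out.println(Arrays.toString(rows.get(1))); // [0.25, 0.1]
    }
}
```

Each double[] then maps straight onto the ten database columns for that metric; inserting them is left to whatever database layer you use.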
My problem is to read non-primes from a txt file and write their prime factors back to the same file.
I don't really know how BufferedReader works. My understanding is that I read the file data into a buffer (8 KB) and then write the prime factors out to a file (by creating a new one).
class PS_Task2
{
    public static void main(String[] args)
    {
        String line = null;
        int x;
        try
        {
            FileReader file2 = new FileReader("nonprimes.txt");
            BufferedReader buff2 = new BufferedReader(file2);
            File file1 = new File("nonprimes.txt");
            file1.createNewFile();
            PrintWriter d = new PrintWriter(file1);
            while ((line = buff2.readLine()) != null)
            {
                x = Integer.parseInt(line);
                d.printf("%d--> ", x);
                while (x % 2 == 0)
                {
                    d.print("2*");
                    x = x / 2;
                }
                for (int i = 3; i <= Math.sqrt(x); i = i + 2)
                {
                    while (x % i == 0)
                    {
                        d.printf("%d*", i);
                        x = x / i;
                    }
                }
                if (x > 2)
                {
                    d.printf("%d ", x);
                }
                d.flush(); // FLUSH THE STREAM TO THE FILE
                d.println("\n");
            }
            d.close(); // CLOSE THE FILE
        }
        catch (IOException | NumberFormatException e)
        {
            e.printStackTrace();
        }
    }
}
Feel free to give a detailed explanation. :D Thanks ~anirudh
Reading and writing to a file in Java doesn't edit the file in place: opening it for writing clears the old content and creates a new one. You can use many approaches; for example, read your data, modify it, keep it in memory in a StringBuilder or a collection or whatever, and then re-write it.
Well, I created fileOne.txt containing the following data:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
and I want to multiply all those numbers by 10, then re-write them again:
public static void main(String [] args) throws Exception{ // just for the example
// locate the file
File fileOne = new File("fileOne.txt");
FileReader inputStream = new FileReader(fileOne);
BufferedReader reader = new BufferedReader(inputStream);
// create a LinkedList to hold the data read
List<Integer> numbers = new LinkedList<Integer>();
// prepare variables to refer to the temporary objects
String line = null;
int number = 0;
// start reading
do{
// read each line
line = reader.readLine();
// check if the read data is not null, so not to use null values
if(line != null){
number = Integer.parseInt(line);
numbers.add(number*10);
}
}while(line != null);
// free resources
reader.close();
// check the new numbers before writing to file
System.out.println("NEW NUMBERS IN MEMORY : "+numbers);
// assign a printer
PrintWriter writer = new PrintWriter(fileOne);
// write down data
for(int newNumber : numbers){
writer.println(newNumber);
}
// free resources
writer.flush();
writer.close();
}
This approach is not very good when dealing with massive data, though.
As per your problem statement, you need to take input from a file, do some processing, and write the processed data back to the same file. For this, please note the points below:
You cannot create a second file with the same name in the same directory, so you must either create the new file at some other location, or write the content into a different file and later rename it after deleting the original.
While your file is open for reading, modifying that same file is not a good idea. You could use the approach below:
Read the content of the file and store it in a data structure like an array or ArrayList.
Close the file.
Process the data stored in the data structure.
Open the file in write mode (overwrite mode rather than append mode).
Write the processed data back into the file.
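The steps above can be sketched with java.nio, using a multiply-by-ten step as a stand-in for the real processing (the class and method names are mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class RewriteInPlace {
    // 1) read everything (Files.readAllLines also closes the file),
    // 2) process the in-memory copy,
    // 3) write back, truncating the old content by default.
    static void process(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        List<String> out = lines.stream()
                .map(s -> String.valueOf(Integer.parseInt(s) * 10))
                .collect(Collectors.toList());
        Files.write(file, out); // CREATE + TRUNCATE_EXISTING + WRITE
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("numbers", ".txt");
        Files.write(tmp, List.of("1", "2", "3"));
        process(tmp);
        System.out.println(Files.readAllLines(tmp)); // [10, 20, 30]
        Files.delete(tmp);
    }
}
```

Because the read finishes (and the file is closed) before the write starts, this avoids the read-while-writing conflict described above; it does, of course, hold the whole file in memory.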
I have a text file with data. The file has information from all months. Imagine that the information for January occupies 50 lines. Then February starts and occupies 40 more lines. Then I have March, and so on... Is it possible to read only part of the file? Can I say "read from line X to line Y"? Or is there a better way to accomplish this? I only want to print the data corresponding to one month, not the whole file. Here is my code:
public static void readFile()
{
try
{
DataInputStream inputStream =
new DataInputStream(new FileInputStream("SpreadsheetDatabase2013.txt"));
while(inputStream.available() != 0)
{
System.out.println("AVAILABLE: " + inputStream.available());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readInt());
for (int i = 0; i < 40; i++)
{
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readDouble());
System.out.println(inputStream.readUTF());
System.out.println(inputStream.readBoolean());
System.out.println();
}
}// end while
inputStream.close();
}// end try
catch (Exception e)
{
System.out.println("An error has occurred.");
}//end catch
}//end method
Thank you for your time.
My approach to this would be to read the entire contents of the text file, store it in an ArrayList, and read only the lines for the requested month.
Example:
Use this function to read the all the lines from the file.
/**
 * Read from a file specified by the filePath.
 *
 * @param filePath
 *            The path of the file.
 * @return List of lines in the file.
 * @throws IOException
 */
public static ArrayList<String> readFromFile(String filePath)
throws IOException {
ArrayList<String> temp = new ArrayList<String>();
File file = new File(filePath);
if (file.exists()) {
BufferedReader brin;
brin = new BufferedReader(new FileReader(filePath));
String line = brin.readLine();
while (line != null) {
if (!line.equals(""))
temp.add(line);
line = brin.readLine();
}
brin.close();
}
return temp;
}
Then read only the ones you need from ArrayList temp.
Example:
if you want to read February's data, assuming it is 50 lines long and starts at the 40th line:
for(int i=40;i<90;i++)
{
System.out.println(temp.get(i));
}
Note: This is only just one way of doing this. I am not certain if there is any other way!
I would use the Scanner class.
Scanner scanner = new Scanner(new File(filename));
(Note that new Scanner(filename) with a plain String would scan the string itself, not the file.) Use scanner.nextLine() to get each line of the file. If you only want lines x to y, you can use a for loop to skip the lines you don't need before reading the ones you do. Be careful to check hasNextLine() so you don't hit a NoSuchElementException.
Or you can go through the scanner and for each line, add the String contents of the line to an ArrayList. Good luck.
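The skip-then-read pattern might look like this (the class and method names are mine; the hasNextLine() checks guard against running past the end of the input):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerRange {
    // Skip the first `from` lines, then collect lines [from, to).
    static List<String> linesBetween(Scanner scanner, int from, int to) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < from && scanner.hasNextLine(); i++) {
            scanner.nextLine(); // discard lines before the range
        }
        for (int i = from; i < to && scanner.hasNextLine(); i++) {
            result.add(scanner.nextLine());
        }
        return result;
    }

    public static void main(String[] args) {
        // a String source stands in for new Scanner(new File(...))
        Scanner s = new Scanner("jan-1\njan-2\nfeb-1\nfeb-2\nmar-1\n");
        System.out.println(linesBetween(s, 2, 4)); // [feb-1, feb-2]
    }
}
```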
Based on how you said your data was organized, I would suggest doing something like this
ArrayList<String> temp = new ArrayList<String>();
int read = 0;
File file = new File(filePath);
if (file.exists()) {
    BufferedReader brin = new BufferedReader(new FileReader(filePath));
    String line = brin.readLine();
    while (line != null) {
        if (!line.equals("")) {
            if (line.equals("March"))
                read = 1;
            else if (line.equals("April"))
                break;
            else if (read == 1)
                temp.add(line);
        }
        line = brin.readLine();
    }
    brin.close();
}
Just tried it myself, that'll take in all the data between March and April. You can adjust them as necessary or make them variables. Thanks to ngoa for the foundation code. Credit where credit is due
If you have Java 7, you can use Files.readAllLines(Path path, Charset cs), e.g.
Path path = // Path to "SpreadsheetDatabase2013.txt"
Charset charset = // "UTF-8" or whatever charset is used
List<String> allLines = Files.readAllLines(path, charset);
List<String> relevantLines = allLines.subList(x, y);
Where x (inclusive) and y (exclusive) indicates the line numbers that are of interest, see List.subList(int fromIndex, int toIndex).
One benefit of this solution, as stated in the JavaDoc of readAllLines():
This method ensures that the file is closed when all bytes have been read or an I/O error, or other runtime exception, is thrown.