So I have large (around 4 GB each) txt files in pairs, and I need to create a 3rd file which consists of the 2 files shuffled together (interleaved). The following describes it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), and this is repeated until I hit the end of file 1 (both input files will have the same length - this is by definition). Here is the code I'm using now, but this doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with a memory mapped file help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
try {
BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
String forwardLine = null;
System.out.println("Begin merging Fastq files");
int readsMerge = 0;
while ((forwardLine = inputReaderForward.readLine()) != null) {
//append the forward file
outputWriter.println(forwardLine);
outputWriter.println(inputReaderForward.readLine());
outputWriter.println(inputReaderForward.readLine());
outputWriter.println(inputReaderForward.readLine());
//append the reverse file
outputWriter.println(inputReaderReverse.readLine());
outputWriter.println(inputReaderReverse.readLine());
outputWriter.println(inputReaderReverse.readLine());
outputWriter.println(inputReaderReverse.readLine());
readsMerge++;
if(readsMerge % 10000 == 0) {
System.out.println("[" + now() + "] Merged 10000");
readsMerge = 0;
}
}
inputReaderForward.close();
inputReaderReverse.close();
outputWriter.close();
} catch (IOException ex) {
Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
}
}
Maybe you also want to try to use a BufferedWriter to cut down your file IO operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
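For example, a minimal sketch of that suggestion: wrap the FileWriter in a BufferedWriter so each println() goes to an in-memory buffer instead of straight to disk (PrintWriter over a plain Writer adds no buffering of its own; the 64 KB size is just an assumption to illustrate the idea).
PrintWriter outputWriter = new PrintWriter(
        new BufferedWriter(new FileWriter(outputFile, true), 1 << 16)); // 64 KB write buffer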
A simple answer is to use a bigger buffer, which helps reduce the total number of I/O calls being made.
Usually, memory-mapped I/O with FileChannel (see Java NIO) would be used for handling large data files. In this case, however, it is not a good fit, as you need to inspect the file content in order to determine the boundary of every 4 lines.
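For instance, the readers and writer from the question could be given explicit, larger buffers. This is only a sketch; the 1 MB figure is an assumption and worth benchmarking on your own data.
int bufSize = 1 << 20; // 1 MB per stream, adjust to taste
BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile), bufSize);
BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile), bufSize);
PrintWriter outputWriter = new PrintWriter(
        new BufferedWriter(new FileWriter(outputFile, true), bufSize));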
If performance were the main requirement, I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is manage the memory myself. I would create two large buffers, say 128 MB or more each, and fill them with data from the two text files. Then you need a 3rd buffer that is twice as big as the previous two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line, you store the current position in that buffer and repeat the same process with the 2nd input buffer. You continue alternating between the two input buffers, replenishing them whenever you consume all the data in them. Each time you refill the input buffers you can also write out the destination buffer and empty it.
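Here is a minimal Java sketch of that alternating idea, with the caveat that it lets two BufferedReaders with large internal buffers do the replenishing instead of managing the input buffers by hand. The 128 MB figure comes from the paragraph above (scale it down if your heap is limited, since BufferedReader buffers chars and so uses roughly twice that in memory); the class name, method name and file arguments are made up.
import java.io.*;

public class AlternatingCopy {

    // Copy characters from 'in' to 'out' until 'lines' newline characters have been
    // written or the input is exhausted; returns false once the input runs out.
    static boolean copyLines(Reader in, Writer out, int lines) throws IOException {
        int seen = 0, c;
        while (seen < lines && (c = in.read()) != -1) {
            out.write(c);
            if (c == '\n') {
                seen++;
            }
        }
        return seen == lines;
    }

    public static void main(String[] args) throws IOException {
        int bufSize = 128 * 1024 * 1024; // large input buffers, as suggested above
        try (Reader fwd = new BufferedReader(new FileReader(args[0]), bufSize);
             Reader rev = new BufferedReader(new FileReader(args[1]), bufSize);
             Writer out = new BufferedWriter(new FileWriter(args[2]), 2 * bufSize)) {
            // four lines from the forward file, then four from the reverse file, until file 1 ends
            while (copyLines(fwd, out, 4)) {
                copyLines(rev, out, 4);
            }
        }
    }
}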
Buffer your read and write operations. Buffer needs to be large enough to minimize the read/write operations and still be memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; // optimize the size of the buffer to your needs
    int num;
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num); // write only the bytes actually read
    }
}
EDIT:
I just realized that you need to shuffle the lines, so this code will not work for you as is, but the concept still remains the same.
Related
I have a column "Content" (BLOB data) in a database (IBM DB2), and the data of one record looks like this (https://drive.google.com/file/d/12d1g5jtomJS-ingCn_n0GKMsM4RkdYzB/view?usp=sharing).
I have opened it in an editor and I think it contains more than one image (https://i.stack.imgur.com/2biLN.png, https://i.stack.imgur.com/ZwBOs.png).
I can export an image from a byte array (using C#) to my disk, but with multiple images, I don't know how to do it.
Please help me! Thanks!
Edit 1:
I have tried to export it as a single image with this code:
private void readBLOB(DB2Connection conn, DB2Transaction trans)
{
try
{
string SavePath = @"D:\MyBLOB";
long CurrentIndex = 0;
//the number of bytes to store in the array
int BufferSize = 413454;
//The Number of bytes returned from GetBytes() method
long BytesReturned;
//A byte array to hold the buffer
byte[] Blob = new byte[BufferSize];
DB2Command cmd = conn.CreateCommand();
cmd.CommandText = "SELECT ATTR0102500126 " +
" FROM JCR.ICMUT01278001 " +
" WHERE COMPKEY = 'N21E26B04900FC6B1F00000'";
cmd.Transaction = trans;
DB2DataReader reader;
reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
if (reader.Read())
{
FileStream fs = new FileStream(SavePath + "\\" + "quang canh.jpg", FileMode.OpenOrCreate, FileAccess.Write);
BinaryWriter writer = new BinaryWriter(fs);
//reset the index to the beginning of the file
CurrentIndex = 0;
BytesReturned = reader.GetBytes(
0, //the BlobsTable column index
CurrentIndex, // the current index of the field from which to begin the read operation
Blob, // Array name to write the buffer to
0, // the start index of the array
BufferSize // the maximum length to copy into the buffer
);
while (BytesReturned == BufferSize)
{
writer.Write(Blob);
writer.Flush();
CurrentIndex += BufferSize;
BytesReturned = reader.GetBytes(0, CurrentIndex, Blob, 0, BufferSize);
}
writer.Write(Blob, 0, (int)BytesReturned);
writer.Flush(); writer.Close();
fs.Close();
}
reader.Close();
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
But I cannot view the image; it shows a format error => https://i.stack.imgur.com/PNS9Q.png
You are currently assuming all BLOBs in that DB are JPEG images. But that is clearly not the case.
Option 1: This is faulty data
Programs that save to databases can fail.
Databases themselves might fail, especially if transactions are turned off. Transactions are most likely turned off for BLOBs.
The physical disk the data was stored on might have degraded. And again, you will not get a lot of redundancy and error correction with BLOBs (plus making use of the error correction requires going through the proper DBMS in the first place).
Option 2: This is not a jpg
I know an article about Unicode that says "[...] problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends."
This applies doubly, triply and quadruply to images:
this could be any number of formats that use interlacing.
this could be a professional graphics program's image/project file, like TIFF. Those can totally contain multiple images - up to one per layer you are working with.
this could even be an .SVG file (XML text that contains drawing orders) that was run through ZIP compression, much like a Word document is.
this could even be a PDF, where the images are usually appended at the back (allowing you to read the text with a partial file, similar to interleaving).
I am trying to create a simple program to edit MP3 tags. But the problem is I am stuck at the very first step, as I cannot even read the file bytes into a byte array. Technically I can, but they are wrong.
For example, if I copy the bytes into a txt file and open it, most of the text is gibberish, including the tags (a problem for later), but the first letters are ID3, which is correct.
But if I print the byte array that results from the MP3 in the console, the first values are
1001001
1000100
110011
11
0
0
.....
Which are all invalid characters. But add a zero before the first row, a zero before the second, and TWO zeroes before the third, and it now says ID3.
What would cause zeroes to get lost like that? It's the same for every MP3 file. Thank you in advance for any help.
The piece of code is a very simple copy
try {
FileInputStream BF1 = new FileInputStream("test.mp3");
FileOutputStream fout = new FileOutputStream("byteresults.txt");
byte[] tempbyte = new byte[1024];
BF1.read(tempbyte);
BF1.close();
int i;
for (i = 0; i < 900; i++) {
    System.out.print(Integer.toBinaryString(tempbyte[i]) + '\n');
}
} catch(FileNotFoundException fnf)
{
System.out.println("Specified file not found :" + fnf);
}
catch(IOException ioe)
{
System.out.println("Error while copying file :" + ioe);
}
When printing in binary format, most systems drop the leading zeroes, since they do not contribute to the final value. You expect eight digits because these are bytes, but binary printing itself doesn't work like that: it doesn't care about 8 slots or 4 slots. Consider that 3 is written as 11 in binary, so why bother printing it as 00000011? Why not 0011? A human reader will ignore those leading zeroes anyway.
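If you do want fixed-width binary, you can pad it yourself; masking with 0xFF first also keeps negative byte values from being sign-extended to 32 bits. A small sketch, reusing the tempbyte array from the question:
String bits = Integer.toBinaryString(tempbyte[i] & 0xFF);
System.out.println(String.format("%8s", bits).replace(' ', '0')); // e.g. 01001001 for 'I'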
Maybe you could try hex format instead, as it is easier to read and more compact. Something like:
for ( i=0; i<900; i++)
{
//System.out.print(Integer.toBinaryString(tempbyte[i])+'\n');
System.out.printf("0x%02X ", tempbyte[i] & 0xFF); // mask with 0xFF so negative bytes are not sign-extended
}
This way you can even check your output against some hex editor software opening the exact same file. It is easier to check that you have the right bytes at the right place with the right values.
I am coding a little java based tool to process mysqldump files, which can become quite large (up to a gigabyte for now). I am using this code to read and process the file:
BufferedReader reader = getReader();
BufferedWriter writer = getWriter();
char[] charBuffer = new char[CHAR_BUFFER_SIZE];
int readCharCout;
StringBuffer buffer = new StringBuffer();
while( ( readCharCout = reader.read( charBuffer ) ) > 0 )
{
buffer.append( charBuffer, 0, readCharCout );
//processing goes here
}
What is a good size for the charBuffer? At the moment it is set to 1000, but my code will run with an arbitrary size, so what is best practice or can this size be calculated depending on the file size?
Thanks in advance,
greetings philipp
It should always be a power of 2. The optimal value depends on the OS and the disk format. In code I've seen, 4096 is often used, but larger buffers generally reduce the number of I/O calls further (with diminishing returns).
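For example, a power-of-two size could be declared like this (64 K chars is an assumed starting point; it is worth benchmarking against 4096, 1 MB and so on for your own data and disk):
private static final int CHAR_BUFFER_SIZE = 1 << 16; // 65,536 chars
char[] charBuffer = new char[CHAR_BUFFER_SIZE];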
Also, there are better ways to load a file into memory.
I have written a code to split a .gz file into user specified parts using byte[] array. But the for loop is not reading/writing the last part of the parent file which is less than the array size. Can you please help me in fixing this?
package com.bitsighttech.collection.packaging;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.log4j.Logger;
public class FileSplitterBytewise
{
private static Logger logger = Logger.getLogger(FileSplitterBytewise.class);
private static final long KB = 1024;
private static final long MB = KB * KB;
private FileInputStream fis;
private FileOutputStream fos;
private DataInputStream dis;
private DataOutputStream dos;
public boolean split(File inputFile, String splitSize)
{
int expectedNoOfFiles =0;
try
{
double parentFileSizeInB = inputFile.length();
Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");
Matcher m = p.matcher(splitSize);
m.matches();
String FileSizeString = m.group(1);
String unit = m.group(2);
double FileSizeInMB = 0;
try {
if (unit.toLowerCase().equals("kb"))
FileSizeInMB = Double.parseDouble(FileSizeString) / KB;
else if (unit.toLowerCase().equals("mb"))
FileSizeInMB = Double.parseDouble(FileSizeString);
else if (unit.toLowerCase().equals("gb"))
FileSizeInMB = Double.parseDouble(FileSizeString) * KB;
} catch (NumberFormatException e) {
logger.error("invalid number [" + FileSizeInMB + "] for expected file size");
}
double fileSize = FileSizeInMB * MB;
int fileSizeInByte = (int) Math.ceil(fileSize);
double noOFFiles = parentFileSizeInB/fileSizeInByte;
expectedNoOfFiles = (int) Math.ceil(noOFFiles);
int splinterCount = 1;
fis = new FileInputStream(inputFile);
dis = new DataInputStream(new BufferedInputStream(fis));
fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
dos = new DataOutputStream(new BufferedOutputStream(fos));
byte[] data = new byte[(int) fileSizeInByte];
while ( splinterCount <= expectedNoOfFiles ) {
int i;
for(i = 0; i<data.length-1; i++)
{
data[i] = dis.readByte();
}
dos.write(data);
splinterCount ++;
}
}
catch(Exception e)
{
logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);
return false;
}
logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");
return true;
}
public static void main(String args[])
{
String FilePath1 = "F:\\az.gz";
File file= new File(FilePath1);
FileSplitterBytewise fileSplitter = new FileSplitterBytewise();
String splitlen = "1 MB";
fileSplitter.split(file, splitlen);
}
}
I'd suggest making more methods. You've got a complicated string-handling section of code in split(); it would be best to make a method that takes the human-friendly string as input and returns the number you're looking for. (It would also make it far easier for you to test this section of the routine; there's no way you can test it now.)
Once it is split off and you're writing test cases, you'll probably find that the error message you generate if the string doesn't contain kb, mb, or gb is extremely confusing -- it blames the number 0 for the mistake rather than pointing out the string does not have the expected units.
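A sketch of such a helper, which turns the human-friendly size string into a byte count and fails loudly when the units are missing or unknown. The method name and the exact set of accepted units are assumptions, not the original code; it reuses the java.util.regex imports already present in the question.
static long parseSize(String splitSize) {
    Pattern p = Pattern.compile("(\\d+)\\s*([KkMmGg][Bb])");
    Matcher m = p.matcher(splitSize.trim());
    if (!m.matches()) {
        throw new IllegalArgumentException("Expected something like \"500 KB\" or \"1 MB\", got: " + splitSize);
    }
    long value = Long.parseLong(m.group(1));
    switch (m.group(2).toLowerCase()) {
        case "kb": return value * 1024L;
        case "mb": return value * 1024L * 1024L;
        case "gb": return value * 1024L * 1024L * 1024L;
        default:   throw new IllegalArgumentException("Unknown unit in: " + splitSize);
    }
}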
Using an int to store the file size means your program will never handle files larger than two gigabytes. You should stick with long or double. (double feels wrong for something that is actually confined to integer values but I can't quickly think why it would fail.)
byte[] data = new byte[(int) fileSizeInByte];
Allocating several gigabytes like this is going to destroy your performance -- that's a potentially huge memory allocation (and one that might be considered under control of an adversary; depending upon your security model, this might or might not be a big deal). Don't try to work with the entire file in one piece.
You appear to be reading and writing the files one byte at a time. That's a guarantee of very slow performance. Doing some performance testing for another question earlier today, I found that my machine could read (from a hot cache) 2000 times faster using 131 KB blocks than two-byte blocks. One-byte blocks would be even worse. A cold cache would be significantly worse at such small sizes.
fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
You only appear to ever open one file output stream. Your post probably should have said "only the first works", because it looks like you've not yet tried it on a file that creates three or more pieces.
catch(Exception e)
At this point, you've got the ability to discover errors in your program; you choose to ignore them completely. Sure, you log an error message, but you cannot actually debug your program with the data you log. You should log at a minimum the exception type, message, and maybe even full stack-trace. This combination of data is immensely useful when trying to solve problems, especially in a few months when you've forgotten the details of how it works.
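For example, with the log4j Logger the class already has, passing the exception as the second argument preserves its type, message and stack trace:
catch (Exception e)
{
    logger.error("Unable to split the file " + inputFile.getName() + " into " + expectedNoOfFiles + " parts", e);
    return false;
}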
Can you please help me in fixing this?
Here is what I would do:
drop the DataInput/OutputStreams, you don't need them.
use in.read(data) to read a whole block instead of one byte at a time. Reading one byte at a time is so much slower!
or, if you do read into the whole data array, note that your loop currently reads one byte less than its length.
stop when you reach the end of the file; it might not be a whole multiple of the block size.
only write as much as you have read: if your blocks are 1 MB and there is 100 KB left, you should only read/write 100 KB at the end.
close your files when you have finished, especially as you are using buffered streams.
your split() writes everything to the same file (so it's not actually splitting). You need to create, write to and close output files in a loop.
don't use fields where you could/should be using local variables.
I would use the length as a long in bytes.
the pattern silently ignores incorrect input: you never check whether m.matches() succeeded, so sizes such as "1 G" or "1 k" will not be handled the way you expect.
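Putting those points together, here is a sketch of a split method that uses plain streams, block reads, one output file per part, and only writes as many bytes as were actually read. The 64 KB buffer, the method signature and the output naming are assumptions, and it assumes java.io.* is imported.
static void split(File inputFile, long partSizeBytes, File outputDir) throws IOException {
    byte[] buffer = new byte[64 * 1024];
    try (InputStream in = new BufferedInputStream(new FileInputStream(inputFile))) {
        int part = 1;
        int read = in.read(buffer);
        while (read != -1) {
            File partFile = new File(outputDir, inputFile.getName() + "_part_" + part++);
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream(partFile))) {
                long remaining = partSizeBytes;
                while (read != -1 && remaining > 0) {
                    int toWrite = (int) Math.min(read, remaining);
                    out.write(buffer, 0, toWrite);            // write only what was read
                    remaining -= toWrite;
                    if (toWrite < read) {
                        // keep the unwritten tail of the buffer for the next part
                        System.arraycopy(buffer, toWrite, buffer, 0, read - toWrite);
                        read -= toWrite;
                    } else {
                        read = in.read(buffer);
                    }
                }
            } // the buffered output stream is closed (and flushed) here
        }
    }
}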
My basic Java problem is this: I need to read in a file by chunks, then reverse the order of the chunks, then write that out to a new file. My first (naive) attempt followed this approach:
read a chunk from the file.
reverse the bytes of the chunk
push the bytes one at a time to the front of a results list
repeat for all chunks
write result list to new file.
So this is basically a very stupid and slow way to solve the problem, but generates the correct output that I am looking for. To try to improve the situation, I change to this algorithm:
read a chunk from the file
push that chunk onto the front of a list of arrays
repeat for all chunks
foreach chunk, write to new file
And to my mind, that produces the same output. Except it doesn't, and I am quite confused. The first chunk in the result file matches with both methods, but the rest of the file is completely different.
Here is the meat of the Java code I am using:
FileInputStream in;
FileOutputStream out, out2;
Byte[] t = new Byte[0];
LinkedList<Byte> reversed_data = new LinkedList<Byte>();
byte[] data = new byte[bufferSize];
LinkedList<byte[]> revd2 = new LinkedList<byte[]>();
try {
in = new FileInputStream(infile);
out = new FileOutputStream(outfile1);
out2 = new FileOutputStream(outfile2);
} catch (FileNotFoundException e) {
e.printStackTrace();
return;
}
while(in.read(data) != -1)
{
revd2.addFirst(data);
byte[] revd = reverse(data);
for (byte b : revd)
{
reversed_data.addFirst(b);
}
}
for (Byte b : reversed_data)
{
out.write(b);
}
for (byte[] b : revd2)
{
out2.write(b);
}
At http://pastie.org/3113665 you can see a complete example program (along with my debugging attempts). For simplicity I am using a bufferSize that evenly divides the size of the file, so all chunks will be the same size, but this won't hold in the real world. My question is, WHY don't these two methods generate the same output? It's driving me crazy because I can't grok it.
You're constantly overwriting the data you've read previously.
while(in.read(data) != -1)
{
revd2.addFirst(data);
// ignore byte-wise stuff
}
You're adding the same object repeatedly to the list revd2, so each list node will finally contain a reference to data filled with the result of the last read. I suggest replacing that with revd2.addFirst(data.clone()).
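In context, the read loop with that one-line fix (everything else unchanged from the question):
while (in.read(data) != -1)
{
    revd2.addFirst(data.clone()); // copy the buffer so each list entry keeps its own bytes
    byte[] revd = reverse(data);
    for (byte b : revd)
    {
        reversed_data.addFirst(b);
    }
}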
My guess is you want to change
revd2.addFirst(data);
byte[] revd = reverse(data);
for the following so the reversed copy is added to the start of the list.
byte[] revd = reverse(data);
revd2.addFirst(revd);