I believed that the new nio package would outperform the old io package when it comes to the time required to read the contents of a file. However, based on my results, the io package seems to outperform the nio package. Here's my test:
import java.io.*;
import java.lang.reflect.Array;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
public class FileTestingOne {
public static void main(String[] args) {
long startTime = System.nanoTime();
File file = new File("hey2.txt");
try {
byte[] a = direct(file);
String s = new String(a);
}
catch (IOException err) {
err.printStackTrace();
}
long endTime = System.nanoTime();
long totalTime = (endTime - startTime);
System.out.println(totalTime);
}
public static ByteBuffer readFile_NIO(File file) throws IOException {
RandomAccessFile rFile = new RandomAccessFile(file.getName(), "rw");
FileChannel inChannel = rFile.getChannel();
ByteBuffer _buffer = ByteBuffer.allocate(1024);
int bytesRead = inChannel.read(_buffer);
while (bytesRead != -1) {
_buffer.flip();
while (_buffer.hasRemaining()) {
byte b = _buffer.get();
}
_buffer.clear();
bytesRead = inChannel.read(_buffer);
}
inChannel.close();
rFile.close();
return _buffer;
}
public static byte[] direct(File file) throws IOException {
byte[] buffer = Files.readAllBytes(file.toPath());
return buffer;
}
public static byte[] readFile_IO(File file) throws IOException {
byte[] _buffer = new byte[(int) file.length()];
InputStream in = null;
try {
in = new FileInputStream(file);
if ( in.read(_buffer) == -1 ) {
throw new IOException(
"EOF reached while reading file. File is probably empty");
}
}
finally {
try {
if (in != null)
in.close();
}
catch (IOException err) {
// TODO Logging
err.printStackTrace();
}
}
return _buffer;
}
}
// Small file
//7566395 -> readFile_NIO
//10790558 -> direct
//707775 -> readFile_IO
// Large file
//9228099 -> readFile_NIO
//737674 -> readFile_IO
//10903324 -> direct
// Very large file
//13700005 -> readFile_NIO
//2837188 -> readFile_IO
//11020507 -> direct
Results are:
Small file:
nio implementation: 7,566,395ns
io implementation: 707,775ns
direct implementation: 10,790,558ns
Large file:
nio implementation: 9,228,099ns
io implementation: 737,674ns
direct implementation: 10,903,324ns
Very large file:
nio implementation: 13,700,005ns
io implementation: 2,837,188ns
direct implementation: 11,020,507ns
I wanted to ask this question because (I believe) the nio package is non-blocking, so it should be faster, right?
Thank you,
Edit:
Changed ms to ns
Memory mapped files (or MappedByteBuffer) are a part of Java NIO and could help improve performance.
Non-blocking in Java NIO means that a thread does not have to wait for the next data to become readable. It does not necessarily improve the performance of a full operation (like reading and processing a file) at all.
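For illustration, here is a minimal sketch of reading a file through a MappedByteBuffer. It reuses the hey2.txt test file from the question; the class name is illustrative, and this is only meant to show the API, not to serve as a tuned benchmark:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("hey2.txt"), StandardOpenOption.READ)) {
            // Map the whole file into memory; the OS pages it in on demand.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] contents = new byte[buffer.remaining()];
            buffer.get(contents);
            System.out.println(new String(contents).length());
        }
    }
}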
Related
I am receiving Avro records from Kafka. I want to convert these records into Parquet files. I am following this blog post: http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
The code so far looks roughly like this:
void openWriter(final String fileName, SinkRecord record, final AvroData avroData) throws IOException {
    final Schema avroSchema = avroData.fromConnectSchema(record.valueSchema());
    CompressionCodecName compressionCodecName = CompressionCodecName.SNAPPY;
    int blockSize = 256 * 1024 * 1024;
    int pageSize = 64 * 1024;
    Path path = new Path(fileName);
    writer = new AvroParquetWriter<>(path, avroSchema, compressionCodecName, blockSize, pageSize);
}
Now, this will do the Avro to Parquet conversion, but it will write the Parquet file to the disk. I was wondering if there was an easier way to just keep the file in memory so that I don't have to manage temp files on the disk. Thank you
Please check my blog post, https://yanbin.blog/convert-apache-avro-to-parquet-format-in-java/ (translate it into English if necessary).
package yanbin.blog;
import org.apache.parquet.io.DelegatingPositionOutputStream;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
public class InMemoryOutputFile implements OutputFile {
private final ByteArrayOutputStream baos = new ByteArrayOutputStream();
@Override
public PositionOutputStream create(long blockSizeHint) throws IOException { // Mode.CREATE calls this method
return new InMemoryPositionOutputStream(baos);
}
@Override
public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
return null;
}
@Override
public boolean supportsBlockSize() {
return false;
}
@Override
public long defaultBlockSize() {
return 0;
}
public byte[] toArray() {
return baos.toByteArray();
}
private static class InMemoryPositionOutputStream extends DelegatingPositionOutputStream {
public InMemoryPositionOutputStream(OutputStream outputStream) {
super(outputStream);
}
@Override
public long getPos() throws IOException {
return ((ByteArrayOutputStream) this.getStream()).size();
}
}
}
public static <T extends SpecificRecordBase> void writeToParquet(List<T> avroObjects) throws IOException {
Schema avroSchema = avroObjects.get(0).getSchema();
GenericData genericData = GenericData.get();
genericData.addLogicalTypeConversion(new TimeConversions.DateConversion());
InMemoryOutputFile outputFile = new InMemoryOutputFile();
try (ParquetWriter<Object> writer = AvroParquetWriter.builder(outputFile)
.withDataModel(genericData)
.withSchema(avroSchema)
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withWriteMode(ParquetFileWriter.Mode.CREATE)
.build()) {
avroObjects.forEach(r -> {
try {
writer.write(r);
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
});
} catch (IOException e) {
e.printStackTrace();
}
// dump memory data to file for testing
Files.write(Paths.get("./users-memory.parquet"), outputFile.toArray());
}
Test data from memory
$ parquet-tools cat --json users-memory.parquet
$ parquet-tools schema users-memory.parquet
"but it will write the Parquet file to the disk"
"if there was an easier way to just keep the file in memory"
From your question I understand that you don't want partial files written to disk. If you want only the complete file to be written to disk in Parquet format, with the temporary data kept in memory, you can use a combination of a memory-mapped file and the Parquet format.
Write your data to a memory-mapped file; once you are done writing, convert the bytes to Parquet format and store them on disk.
Have a look at MappedByteBuffer.
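For illustration, a minimal sketch of writing intermediate bytes through a memory-mapped file. The file name, region size, and payload here are assumptions, not part of the original answer, and the Parquet conversion step is left out:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedWrite {
    public static void main(String[] args) throws IOException {
        byte[] data = "intermediate bytes to buffer before the Parquet conversion".getBytes();
        try (FileChannel channel = FileChannel.open(Paths.get("temp.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map a region large enough for the data and write into it directly.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            buffer.put(data);
            buffer.force(); // flush the mapped region to the underlying file
        }
    }
}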
I have a 10GB PDF file that I would like to break up into 10 files, each 1GB in size. I need to do this operation in parallel, which means spinning up 10 threads that each start from a different position, read up to 1GB of data, and write it to a file. Basically, the final result should be 10 files that each contain a portion of the original 10GB file.
I looked at FileChannel, but the position is shared, so once I modify the position in one thread, it impacts the other threads. I also looked at AsynchronousFileChannel in Java 7, but I'm not sure if that's the way to go. I appreciate any suggestions on this issue.
I wrote this simple program that reads a small text file to test the FileChannel idea, doesn't seem to work for what I'm trying to achieve.
package org.cas.filesplit;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
public class ConcurrentRead implements Runnable {
private int myPosition = 0;
public int getPosition() {
return myPosition;
}
public void setPosition(int position) {
this.myPosition = position;
}
static final String filePath = "C:\\Users\\temp.txt";
@Override
public void run() {
try {
readFile();
} catch (IOException e) {
e.printStackTrace();
}
}
private void readFile() throws IOException {
Path path = Paths.get(filePath);
FileChannel fileChannel = FileChannel.open(path);
fileChannel.position(myPosition);
ByteBuffer buffer = ByteBuffer.allocate(8);
int noOfBytesRead = fileChannel.read(buffer);
while (noOfBytesRead != -1) {
buffer.flip();
System.out.println("Thread - " + Thread.currentThread().getId());
while (buffer.hasRemaining()) {
System.out.print((char) buffer.get());
}
System.out.println(" ");
buffer.clear();
noOfBytesRead = fileChannel.read(buffer);
}
fileChannel.close();
}
}
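One detail that may help here: FileChannel also has a read(ByteBuffer dst, long position) overload that does not touch the channel's shared position, so each thread can read its own region independently (or simply open its own channel). Below is a minimal sketch under those assumptions; the class name, buffer size, and offset/length handling are illustrative, the file path reuses the one from the question, and the step that writes each region to its own output file is left out:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class RegionReader implements Runnable {
    private final long offset;
    private final long length;

    public RegionReader(long offset, long length) {
        this.offset = offset;
        this.length = length;
    }

    @Override
    public void run() {
        // Each thread opens its own channel, so nothing is shared between threads.
        try (FileChannel in = FileChannel.open(Paths.get("C:\\Users\\temp.txt"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(8192);
            long position = offset;
            long remaining = length;
            while (remaining > 0) {
                buffer.clear();
                if (remaining < buffer.capacity()) {
                    buffer.limit((int) remaining);
                }
                // Positional read: does not modify the channel's position.
                int n = in.read(buffer, position);
                if (n == -1) {
                    break;
                }
                position += n;
                remaining -= n;
                buffer.flip();
                // ... write the buffer to this thread's output file here ...
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}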
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.nio.file.attribute.FileAttribute;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.HashSet;
import java.util.Set;
public class RAFRead {
public static void main(String[] args) {
create();
read();
}
public static void create() {
// Create the set of options for appending to the file.
Set<OpenOption> options = new HashSet<OpenOption>();
options.add(StandardOpenOption.APPEND);
options.add(StandardOpenOption.CREATE);
// Create the custom permissions attribute.
Set<PosixFilePermission> perms = PosixFilePermissions
.fromString("rw-r-----");
FileAttribute<Set<PosixFilePermission>> attr = PosixFilePermissions
.asFileAttribute(perms);
Path file = Paths.get("./outfile.log");
ByteBuffer buffer = ByteBuffer.allocate(4);
try {
SeekableByteChannel sbc = Files.newByteChannel(file, options, attr);
for (int i = 9; i >= 0; --i) {
sbc = sbc.position(i * 4);
buffer.clear();
buffer.put(new Integer(i).byteValue());
buffer.flip();
sbc.write(buffer);
}
} catch (IOException e) {
System.out.println(e.getMessage());
}
}
public static void read() {
// Create the set of options for appending to the file.
Set<OpenOption> options = new HashSet<OpenOption>();
options.add(StandardOpenOption.READ);
Path file = Paths.get("./outfile.log");
ByteBuffer buffer = ByteBuffer.allocate(4);
try {
SeekableByteChannel sbc = Files.newByteChannel(file, options);
int nread;
do {
nread = sbc.read(buffer);
if(nread!= -1) {
buffer.flip();
System.out.println(buffer.getInt());
}
} while(nread != -1 && buffer.hasRemaining());
} catch (IOException e) {
System.out.println(e.getMessage());
}
}
}
I first create the file.
I am trying to put 9, then 8, then 7, and so on into the file.
But I am writing to the file in reverse order using random access, so the file's contents should end up as numbers in ascending order.
I am just writing in reverse order to try out random-access writing.
After that I try to read the file and print the data (numbers).
It prints only 0. I was expecting it to print 1-9.
I couldn't figure out the reason. Any help is appreciated.
I followed this link from Oracle site: https://docs.oracle.com/javase/tutorial/essential/io/file.html
The file has a size after I run this program, so it seems the program is writing.
Since the data is written as raw bytes, I can't see it with vi or cat.
You need to flip() the buffer before calling write() or get() (and friends), and compact() it afterwards.
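For illustration, a minimal rewrite of the read() method following that advice, reusing the imports from the question's code. It assumes each value was written as a full 4-byte int (for example with putInt()), which the create() method above does not currently do:

public static void read() {
    ByteBuffer buffer = ByteBuffer.allocate(4);
    try (SeekableByteChannel sbc = Files.newByteChannel(Paths.get("./outfile.log"), StandardOpenOption.READ)) {
        while (sbc.read(buffer) != -1) {
            buffer.flip(); // switch from filling to draining
            while (buffer.remaining() >= 4) {
                System.out.println(buffer.getInt());
            }
            buffer.compact(); // keep any partial int for the next read
        }
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}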
Why must I use DigestInputStream and not FileInputStream to get a digest of a file?
I have written a program that reads ints from a FileInputStream, converts them to bytes, and passes them to the update method of a MessageDigest object. But I suspect that it doesn't work properly, because it calculates the digest of a very large file instantly. Why doesn't it work?
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
public class DigestDemo {
public static byte[] getSha1(String file) {
FileInputStream fis = null;
MessageDigest md = null;
try {
fis = new FileInputStream(file);
} catch(FileNotFoundException exc) {
System.out.println(exc);
}
try {
md = MessageDigest.getInstance("SHA-1");
} catch (NoSuchAlgorithmException exc) {
System.out.println(exc);
}
byte b = 0;
do {
try {
b = (byte) fis.read();
} catch (IOException e) {
System.out.println(e);
}
if (b != -1)
md.update(b);
} while(b != -1);
return md.digest();
}
public static void writeBytes(byte[] a) {
for (byte b : a) {
System.out.printf("%x", b);
}
}
public static void main(String[] args) {
String file = "C:\\Users\\Mike\\Desktop\\test.txt";
byte[] digest = getSha1(file);
writeBytes(digest);
}
}
You need to change the type of b to int, and only call MessageDigest.digest() once you've reached the end of the file, but reading one byte at a time is horrifically inefficient. Try reading into a byte array and updating the digest from that.
There's too much try-catching in this code. Reduce it to one try and two catches, outside the loop.
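For illustration, a minimal sketch of the buffered approach suggested above; the class name and buffer size are just placeholders, and the file path reuses the one from the question:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Demo {
    public static byte[] sha1(String file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n); // hash whole chunks, not single bytes
            }
        }
        return md.digest(); // finish once, after end of file
    }

    public static void main(String[] args) throws Exception {
        for (byte b : sha1("C:\\Users\\Mike\\Desktop\\test.txt")) {
            System.out.printf("%02x", b);
        }
    }
}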
I want to generate a .torrent file in Java, but I don't want a big API that does things like scraping trackers, seeding, etc. This is just for a client that generates metadata. What lightweight solutions exist? I am only generating a .torrent of a single .zip file.
Thanks!
I have put together this self-contained piece of Java code to prepare a .torrent file with a single file.
The .torrent file is created by calling createTorrent() passing the name of the .torrent file, the name of the shared file and the tracker URL.
createTorrent() uses hashPieces() to hash the file pieces using Java's MessageDigest class. Then createTorrent() prepares a meta info dictionary containing the torrent meta-data. This dictionary is then serialized in the proper bencode format using the encode*() methods and saved in a .torrent file.
See the BitTorrent spec for details.
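For a quick sense of what the encode*() methods below produce: bencode writes strings as <length>:<bytes>, integers as i<value>e, and dictionaries as d...e with keys sorted, so a small (hypothetical) dictionary like {"length": 1024, "name": "x.zip"} serializes to d6:lengthi1024e4:name5:x.zipe.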
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
public class Torrent {
private static void encodeObject(Object o, OutputStream out) throws IOException {
if (o instanceof String)
encodeString((String) o, out);
else if (o instanceof Map)
encodeMap((Map) o, out);
else if (o instanceof byte[])
encodeBytes((byte[]) o, out);
else if (o instanceof Number)
encodeLong(((Number) o).longValue(), out);
else
throw new Error("Unencodable type");
}
private static void encodeLong(long value, OutputStream out) throws IOException {
out.write('i');
out.write(Long.toString(value).getBytes("US-ASCII"));
out.write('e');
}
private static void encodeBytes(byte[] bytes, OutputStream out) throws IOException {
out.write(Integer.toString(bytes.length).getBytes("US-ASCII"));
out.write(':');
out.write(bytes);
}
private static void encodeString(String str, OutputStream out) throws IOException {
encodeBytes(str.getBytes("UTF-8"), out);
}
private static void encodeMap(Map<String, Object> map, OutputStream out) throws IOException {
// Sort the map. A generic encoder should sort by key bytes
SortedMap<String, Object> sortedMap = new TreeMap<String, Object>(map);
out.write('d');
for (Map.Entry<String, Object> e : sortedMap.entrySet()) {
encodeString(e.getKey(), out);
encodeObject(e.getValue(), out);
}
out.write('e');
}
private static byte[] hashPieces(File file, int pieceLength) throws IOException {
MessageDigest sha1;
try {
sha1 = MessageDigest.getInstance("SHA");
} catch (NoSuchAlgorithmException e) {
throw new Error("SHA1 not supported");
}
InputStream in = new FileInputStream(file);
ByteArrayOutputStream pieces = new ByteArrayOutputStream();
byte[] bytes = new byte[pieceLength];
int pieceByteCount = 0, readCount = in.read(bytes, 0, pieceLength);
while (readCount != -1) {
pieceByteCount += readCount;
sha1.update(bytes, 0, readCount);
if (pieceByteCount == pieceLength) {
pieceByteCount = 0;
pieces.write(sha1.digest());
}
readCount = in.read(bytes, 0, pieceLength - pieceByteCount);
}
in.close();
if (pieceByteCount > 0)
pieces.write(sha1.digest());
return pieces.toByteArray();
}
public static void createTorrent(File file, File sharedFile, String announceURL) throws IOException {
final int pieceLength = 512 * 1024;
Map<String, Object> info = new HashMap<>();
info.put("name", sharedFile.getName());
info.put("length", sharedFile.length());
info.put("piece length", pieceLength);
info.put("pieces", hashPieces(sharedFile, pieceLength));
Map<String, Object> metainfo = new HashMap<String, Object>();
metainfo.put("announce", announceURL);
metainfo.put("info", info);
OutputStream out = new FileOutputStream(file);
encodeMap(metainfo, out);
out.close();
}
public static void main(String[] args) throws Exception {
createTorrent(new File("C:/x.torrent"), new File("C:/file"), "http://example.com/announce");
}
}
Code edits: made this a bit more compact, fixed method visibility, used character literals where appropriate, and used instanceof Number. More recently, changed the file reading to block I/O, because I'm trying to use this for real and byte-at-a-time I/O is just slow.
I'd start with the Java Bittorrent API. The jar is about 70 KB, but you can probably strip it down by removing the classes that aren't necessary for creating torrents. The SDK has a sample, ExampleCreateTorrent.java, illustrating how to do exactly what you need.
You may also look how it's implemented in the open source Java clients such as Azureus.