Q: Converting Avro to Parquet in Memory - java

I am receiving Avro records from Kafka. I want to convert these records into Parquet files. I am following this blog post: http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
The code so far looks roughly like this:
// Inputs available to the sink task: the target file name, a Kafka Connect SinkRecord, and an AvroData converter
final String fileName;
final SinkRecord record;
final AvroData avroData;
final Schema avroSchema = avroData.fromConnectSchema(record.valueSchema());
CompressionCodecName compressionCodecName = CompressionCodecName.SNAPPY;
int blockSize = 256 * 1024 * 1024;
int pageSize = 64 * 1024;
Path path = new Path(fileName);
writer = new AvroParquetWriter<>(path, avroSchema, compressionCodecName, blockSize, pageSize);
Now, this will do the Avro to Parquet conversion, but it will write the Parquet file to the disk. I was wondering if there was an easier way to just keep the file in memory so that I don't have to manage temp files on the disk. Thank you

Please check my blog post, https://yanbin.blog/convert-apache-avro-to-parquet-format-in-java/ (translate into English if necessary).
package yanbin.blog;
import org.apache.parquet.io.DelegatingPositionOutputStream;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
public class InMemoryOutputFile implements OutputFile {
private final ByteArrayOutputStream baos = new ByteArrayOutputStream();
@Override
public PositionOutputStream create(long blockSizeHint) throws IOException { // Mode.CREATE calls this method
return new InMemoryPositionOutputStream(baos);
}
@Override
public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException { // Mode.OVERWRITE calls this method
return create(blockSizeHint);
}
@Override
public boolean supportsBlockSize() {
return false;
}
@Override
public long defaultBlockSize() {
return 0;
}
public byte[] toArray() {
return baos.toByteArray();
}
private static class InMemoryPositionOutputStream extends DelegatingPositionOutputStream {
public InMemoryPositionOutputStream(OutputStream outputStream) {
super(outputStream);
}
@Override
public long getPos() throws IOException {
return ((ByteArrayOutputStream) this.getStream()).size();
}
}
}
public static <T extends SpecificRecordBase> void writeToParquet(List<T> avroObjects) throws IOException {
Schema avroSchema = avroObjects.get(0).getSchema();
GenericData genericData = GenericData.get();
genericData.addLogicalTypeConversion(new TimeConversions.DateConversion());
InMemoryOutputFile outputFile = new InMemoryOutputFile();
try (ParquetWriter<Object> writer = AvroParquetWriter.builder(outputFile)
.withDataModel(genericData)
.withSchema(avroSchema)
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withWriteMode(ParquetFileWriter.Mode.CREATE)
.build()) {
avroObjects.forEach(r -> {
try {
writer.write(r);
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
});
} catch (IOException e) {
e.printStackTrace();
}
// dump memory data to file for testing
Files.write(Paths.get("./users-memory.parquet"), outputFile.toArray());
}
Test data from memory
$ parquet-tools cat --json users-memory.parquet
$ parquet-tools schema users-memory.parquet

"but it will write the Parquet file to the disk"
"if there was an easier way to just keep the file in memory"
From your quoted lines I understand that you don't want to write partial files to disk. If you want only the complete file to be written to disk in Parquet format, with the temporary data kept in memory, you can use a combination of a memory-mapped file and the Parquet format.
Write your data to a memory-mapped file; once you are done with the writes, convert the bytes to Parquet format and store them on disk.
Have a look at MappedByteBuffer.
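A minimal sketch of that staging step, assuming a scratch file name and a placeholder payload (both are illustrative, not from the original answer):
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
public class MappedStagingSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the bytes you are staging before the final Parquet write.
        byte[] staged = "example staged data".getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile raf = new RandomAccessFile("staging.dat", "rw");
             FileChannel channel = raf.getChannel()) {
            // Map a region of the file into memory; writes land in the page cache
            // and the OS decides when to flush them to the backing file.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, staged.length);
            buffer.put(staged);
            buffer.force(); // explicitly flush once the data is complete
        }
    }
}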

Related

Java IO outperforms Java NIO when it comes to file reading

I believed that the new nio package would outperform the old io package when it comes to the time required to read the contents of a file. However, based on my results, io package seems to outperform nio package. Here's my test:
import java.io.*;
import java.lang.reflect.Array;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
public class FileTestingOne {
public static void main(String[] args) {
long startTime = System.nanoTime();
File file = new File("hey2.txt");
try {
byte[] a = direct(file);
String s = new String(a);
}
catch (IOException err) {
err.printStackTrace();
}
long endTime = System.nanoTime();
long totalTime = (endTime - startTime);
System.out.println(totalTime);
}
public static ByteBuffer readFile_NIO(File file) throws IOException {
RandomAccessFile rFile = new RandomAccessFile(file.getName(), "rw");
FileChannel inChannel = rFile.getChannel();
ByteBuffer _buffer = ByteBuffer.allocate(1024);
int bytesRead = inChannel.read(_buffer);
while (bytesRead != -1) {
_buffer.flip();
while (_buffer.hasRemaining()) {
byte b = _buffer.get();
}
_buffer.clear();
bytesRead = inChannel.read(_buffer);
}
inChannel.close();
rFile.close();
return _buffer;
}
public static byte[] direct(File file) throws IOException {
byte[] buffer = Files.readAllBytes(file.toPath());
return buffer;
}
public static byte[] readFile_IO(File file) throws IOException {
byte[] _buffer = new byte[(int) file.length()];
InputStream in = null;
try {
in = new FileInputStream(file);
if ( in.read(_buffer) == -1 ) {
throw new IOException(
"EOF reached while reading file. File is probably empty");
}
}
finally {
try {
if (in != null)
in.close();
}
catch (IOException err) {
// TODO Logging
err.printStackTrace();
}
}
return _buffer;
}
}
// Small file
//7566395 -> readFile_NIO
//10790558 -> direct
//707775 -> readFile_IO
// Large file
//9228099 -> readFile_NIO
//737674 -> readFile_IO
//10903324 -> direct
// Very large file
//13700005 -> readFile_NIO
//2837188 -> readFile_IO
//11020507 -> direct
Results are:
Small file:
nio implementation: 7,566,395ns
io implementation: 707,775ns
direct implementation: 10,790,558ns
Large file:
nio implementation: 9,228,099ns
io implementation: 737,674ns
direct implementation: 10,903,324ns
Very large file:
nio implementation: 13,700,005ns
io implementation: 2,837,188ns
direct implementation: 11,020,507ns
I wanted to ask this question because (I believe) the nio package is non-blocking, so it should be faster, right?
Thank you,
Edit:
Changed ms to ns
Memory mapped files (or MappedByteBuffer) are a part of Java NIO and could help improve performance.
The non-blocking in Java NIO means that a thread does not have to wait for the next data to read. It does not necessarily affect performance of a full operation (like reading and processing a file) at all.
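For comparison, a memory-mapped read of a file could look like the sketch below; the method name is made up for the example, and this is an API illustration, not a benchmark:
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
public class MappedReadSketch {
    public static byte[] readFile_Mapped(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file and copy it into a byte[] with a single bulk get.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] data = new byte[buffer.remaining()];
            buffer.get(data);
            return data;
        }
    }
}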

Compressing data and uploading it to S3 without keeping the full content in memory

I want to compress dynamically created data with a GZIP stream and upload it to S3, and I expect around 1 GB per compressed file.
Since the files are big and I'm going to handle multiple files in parallel, I can't hold the entire data in memory, and I want to stream the data to S3 as soon as possible.
Moreover, I can't know the exact size of the compressed data. I read the question "Can I stream a file upload to S3 without a content-length header?", but I can't figure out how to combine that with GZIPing.
I think I could do this if I could create a GZIPOutputStream, send data to it part by part, and simultaneously read chunks of the compressed data (hopefully of 5 MB) and upload them to S3 using Amazon S3 multipart upload.
Is what I'm trying to do possible? Or is my only option to compress the data to local storage (my hard disk) and then upload the compressed file?
I don't take no for an answer, so this is how I did it:
package roee.gavriel;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;
public class S3UploadStream extends OutputStream {
private final static Integer PART_SIZE = 5 * 1024 * 1024;
private final AmazonS3 s3client;
private final String bucket;
private final String key;
// The upload id given to the multipart upload by AWS.
private final String uploadId;
// A tag list. AWS gives one for each part and expects them all when the upload is finished.
private final List<PartETag> partETags = new LinkedList<>();
// A buffer to collect the data before sending it to AWS.
private byte[] partData = new byte[PART_SIZE];
// The index of the next free byte on the buffer.
private int partDataIndex = 0;
// Total number of parts that were uploaded.
private int totalPartCountIndex = 0;
private volatile Boolean closed = false;
// Internal thread pool which will handle the actual part uploading.
private final ThreadPoolExecutor executor;
public S3UploadStream(AmazonS3 s3client, String bucket, String key, int uploadThreadsCount) {
this.s3client = s3client;
this.bucket = bucket;
this.key = key;
InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(bucket, key);
InitiateMultipartUploadResult initResponse = s3client.initiateMultipartUpload(initRequest);
this.uploadId = initResponse.getUploadId();
this.executor = new ThreadPoolExecutor(uploadThreadsCount, uploadThreadsCount, 60, TimeUnit.SECONDS,
new LinkedBlockingQueue<Runnable>(100));
}
@Override
public synchronized void write(int b) throws IOException {
if (closed) {
throw new IOException("Trying to write to a closed S3UploadStream");
}
partData[partDataIndex++] = (byte)b;
uploadPart(false);
}
@Override
public synchronized void close() {
if (closed) {
return;
}
closed = true;
// Flush the current data in the buffer
uploadPart(true);
executor.shutdown();
try {
executor.awaitTermination(2, TimeUnit.MINUTES);
} catch (InterruptedException e) {
//Nothing to do here...
}
// Complete the multipart upload
CompleteMultipartUploadRequest compRequest =
new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags);
s3client.completeMultipartUpload(compRequest);
}
private synchronized void uploadPart(Boolean force) {
if (!force && partDataIndex < PART_SIZE) {
// the API requires that only the last part can be smaller than 5Mb
return;
}
// Actually start the upload
createUploadPartTask(partData, partDataIndex);
// We are going to upload the current part, so start buffering data to new part
partData = new byte[PART_SIZE];
partDataIndex = 0;
}
private synchronized void createUploadPartTask(byte[] partData, int partDataIndex) {
// Create an Input stream of the data
InputStream stream = new ByteArrayInputStream(partData, 0, partDataIndex);
// Build the upload request
UploadPartRequest uploadRequest = new UploadPartRequest()
.withBucketName(bucket)
.withKey(key)
.withUploadId(uploadId)
.withPartNumber(++totalPartCountIndex)
.withInputStream(stream)
.withPartSize(partDataIndex);
// Upload part and add response to our tag list.
// Make the actual upload in a different thread
executor.execute(() -> {
PartETag partETag = s3client.uploadPart(uploadRequest).getPartETag();
synchronized (partETags) {
partETags.add(partETag);
}
});
}
}
And here is a small snippet of code that uses it to write many GUIDs to a gzipped file on S3:
int writeThreads = 3;
int genThreads = 10;
int guidPerThread = 200_000;
try (S3UploadStream uploadStream = new S3UploadStream(s3client, "<YourBucket>", "<YourKey>.gz", writeThreads)) {
try (GZIPOutputStream stream = new GZIPOutputStream(uploadStream)) {
Semaphore s = new Semaphore(0);
for (int t = 0; t < genThreads; ++t) {
new Thread(() -> {
for (int i = 0; i < guidPerThread; ++i) {
try {
stream.write(java.util.UUID.randomUUID().toString().getBytes());
stream.write('\n');
} catch (IOException e) {
}
}
s.release();
}).start();
}
s.acquire(genThreads);
}
}

Convert pdf to byte[] and vice versa with pdfbox

I've read the documentation and the examples, but I'm having a hard time putting it all together. I'm just trying to take a test PDF file, convert it to a byte array, then convert the byte array back into a PDF file and write that PDF file to disk.
It probably doesn't help much, but this is what I've got so far:
package javaapplication1;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
public class JavaApplication1 {
private COSStream stream;
public static void main(String[] args) {
try {
PDDocument in = PDDocument.load("C:\\Users\\Me\\Desktop\\JavaApplication1\\in\\Test.pdf");
byte[] pdfbytes = toByteArray(in);
PDDocument out;
} catch (Exception e) {
System.out.println(e);
}
}
private static byte[] toByteArray(PDDocument pdDoc) throws IOException, COSVisitorException {
ByteArrayOutputStream out = new ByteArrayOutputStream();
try {
pdDoc.save(out);
pdDoc.close();
} catch (Exception ex) {
System.out.println(ex);
}
return out.toByteArray();
}
public void PDStream(PDDocument document) {
stream = new COSStream(document.getDocument().getScratchFile());
}
}
You can use Apache commons, which is essential in any java project IMO.
Then you can use FileUtils's readFileToByteArray(File file) and writeByteArrayToFile(File file, byte[] data).
(here is commons-io, which is where FileUtils is: http://commons.apache.org/proper/commons-io/download_io.cgi )
For example, I just tried this here and it worked beautifully.
try {
File file = new File("/example/path/contract.pdf");
byte[] array = FileUtils.readFileToByteArray(file);
FileUtils.writeByteArrayToFile(new File("/example/path/contract2.pdf"), array);
} catch (IOException e) {
e.printStackTrace();
}
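If you'd rather do the round trip with PDFBox itself instead of commons-io, here is a minimal sketch; it assumes PDFBox 2.x and uses placeholder file paths:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
public class PdfRoundTrip {
    public static void main(String[] args) throws Exception {
        // PDF -> byte[]
        byte[] pdfBytes;
        try (PDDocument doc = PDDocument.load(new File("in/Test.pdf"))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            doc.save(out);
            pdfBytes = out.toByteArray();
        }
        // byte[] -> PDF written back to disk
        try (PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdfBytes))) {
            doc.save(new File("out/Test-copy.pdf"));
        }
    }
}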

I/O operations on file

I have a requirement: there will be a number of .txt files in one location, e.g. c:\onelocation. I want to write their content to another location in txt format. This part is pretty easy and straightforward. But there is a speed breaker here.
There is a time interval of 120 seconds. Read the files from the above location and write them to another txt file for 120 seconds, saving that file with a timestamp as its name.
After 120 seconds, create one more file with the new timestamp, but continue reading the source files from where the cursor left off in the previous pass.
Can you please suggest any ideas? If code is provided, that would also be appreciated.
Thanks, Damu.
How about this? A writer that automatically changes where it is writing to every 120 seconds.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
public class TimeBoxedWriter extends Writer {
private static DateFormat FORMAT = new SimpleDateFormat("yyyyDDDHHmm");
/** Milliseconds to each time box */
private static final int TIME_BOX = 120000;
/** For testing only */
public static void main(String[] args) throws IOException {
Writer w = new TimeBoxedWriter(new File("."), "test");
// write one line per second for 500 seconds.
for(int i = 0;i < 500;i++) {
w.write("testing " + i + "\n");
try {
Thread.sleep(1000);
} catch (InterruptedException ie) {}
}
w.close();
}
/** Output folder */
private File dir_;
/** Timestamp for current file */
private long stamp_ = 0;
/** Stem for output files */
private String stem_;
/** Current output writer */
private Writer writer_ = null;
/**
* Create new output writer
*
* @param dir
* the output folder
* @param stem
* the stem used to generate file names
*/
public TimeBoxedWriter(File dir, String stem) {
dir_ = dir;
stem_ = stem;
}
@Override
public void close() throws IOException {
synchronized (lock) {
if (writer_ != null) {
writer_.close();
writer_ = null;
}
}
}
@Override
public void flush() throws IOException {
synchronized (lock) {
if (writer_ != null) writer_.flush();
}
}
private void rollover() throws IOException {
synchronized (lock) {
long now = System.currentTimeMillis();
if ((stamp_ + TIME_BOX) < now) {
if (writer_ != null) {
writer_.flush();
writer_.close();
}
stamp_ = TIME_BOX * (System.currentTimeMillis() / TIME_BOX);
String time = FORMAT.format(new Date(stamp_));
writer_ = new FileWriter(new File(dir_, stem_ + "." + time
+ ".txt"));
}
}
}
@Override
public void write(char[] cbuf, int off, int len) throws IOException {
synchronized (lock) {
rollover();
writer_.write(cbuf, off, len);
}
}
}
Use RandomAccessFile in Java to move the cursor within the file.
Before you start copying, check the file modification/creation time (creation in the case of new files); only copy the file if that time is less than 2 minutes old, otherwise skip it.
Keep a counter of the number of bytes/lines read for each file, move the cursor to that position, and read from there.
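A minimal sketch of that resume idea, assuming you track the last offset per file yourself (the method and its parameters are illustrative, not from the original answer):
import java.io.IOException;
import java.io.RandomAccessFile;
public class ResumeReader {
    /** Reads everything appended after 'offset' into 'target' and returns the new offset. */
    public static long copyFrom(String path, long offset, StringBuilder target) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);                // move the cursor to where the previous pass stopped
            String line;
            while ((line = raf.readLine()) != null) {
                target.append(line).append('\n');
            }
            return raf.getFilePointer();     // remember this offset for the next 120-second window
        }
    }
}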
You can duplicate the file rather than using reading and writing operations.
sample code:
FileChannel ic = new FileInputStream("<source file location>").getChannel();
FileChannel oc = new FileOutputStream("<destination location>").getChannel();
ic.transferTo(0, ic.size(), oc);
ic.close();
oc.close();
HTH
File I/O is simple in Java; here is an example I found on the web of copying one file to another.
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
public class Copy {
public static void main(String[] args) throws IOException {
File inputFile = new File("farrago.txt");
File outputFile = new File("outagain.txt");
FileReader in = new FileReader(inputFile);
FileWriter out = new FileWriter(outputFile);
int c;
while ((c = in.read()) != -1)
out.write(c);
in.close();
out.close();
}
}

How can I generate a .torrent in Java?

I want to generate a .torrent file in Java, but I don't want a big API that does anything like scraping trackers, seeding, etc. This is just for a client that generates meta data. What lightweight solutions exist? I am only generating a .torrent of a single .zip file.
Thanks!
I have put together this self-contained piece of Java code to prepare a .torrent file with a single file.
The .torrent file is created by calling createTorrent() passing the name of the .torrent file, the name of the shared file and the tracker URL.
createTorrent() uses hashPieces() to hash the file pieces using Java's MessageDigest class. Then createTorrent() prepares a meta info dictionary containing the torrent meta-data. This dictionary is then serialized in the proper bencode format using the encode*() methods and saved in a .torrent file.
See the BitTorrent spec for details.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
public class Torrent {
private static void encodeObject(Object o, OutputStream out) throws IOException {
if (o instanceof String)
encodeString((String) o, out);
else if (o instanceof Map)
encodeMap((Map) o, out);
else if (o instanceof byte[])
encodeBytes((byte[]) o, out);
else if (o instanceof Number)
encodeLong(((Number) o).longValue(), out);
else
throw new Error("Unencodable type");
}
private static void encodeLong(long value, OutputStream out) throws IOException {
out.write('i');
out.write(Long.toString(value).getBytes("US-ASCII"));
out.write('e');
}
private static void encodeBytes(byte[] bytes, OutputStream out) throws IOException {
out.write(Integer.toString(bytes.length).getBytes("US-ASCII"));
out.write(':');
out.write(bytes);
}
private static void encodeString(String str, OutputStream out) throws IOException {
encodeBytes(str.getBytes("UTF-8"), out);
}
private static void encodeMap(Map<String, Object> map, OutputStream out) throws IOException {
// Sort the map. A generic encoder should sort by key bytes
SortedMap<String, Object> sortedMap = new TreeMap<String, Object>(map);
out.write('d');
for (Map.Entry<String, Object> e : sortedMap.entrySet()) {
encodeString(e.getKey(), out);
encodeObject(e.getValue(), out);
}
out.write('e');
}
private static byte[] hashPieces(File file, int pieceLength) throws IOException {
MessageDigest sha1;
try {
sha1 = MessageDigest.getInstance("SHA");
} catch (NoSuchAlgorithmException e) {
throw new Error("SHA1 not supported");
}
InputStream in = new FileInputStream(file);
ByteArrayOutputStream pieces = new ByteArrayOutputStream();
byte[] bytes = new byte[pieceLength];
int pieceByteCount = 0, readCount = in.read(bytes, 0, pieceLength);
while (readCount != -1) {
pieceByteCount += readCount;
sha1.update(bytes, 0, readCount);
if (pieceByteCount == pieceLength) {
pieceByteCount = 0;
pieces.write(sha1.digest());
}
readCount = in.read(bytes, 0, pieceLength - pieceByteCount);
}
in.close();
if (pieceByteCount > 0)
pieces.write(sha1.digest());
return pieces.toByteArray();
}
public static void createTorrent(File file, File sharedFile, String announceURL) throws IOException {
final int pieceLength = 512 * 1024;
Map<String, Object> info = new HashMap<>();
info.put("name", sharedFile.getName());
info.put("length", sharedFile.length());
info.put("piece length", pieceLength);
info.put("pieces", hashPieces(sharedFile, pieceLength));
Map<String, Object> metainfo = new HashMap<String, Object>();
metainfo.put("announce", announceURL);
metainfo.put("info", info);
OutputStream out = new FileOutputStream(file);
encodeMap(metainfo, out);
out.close();
}
public static void main(String[] args) throws Exception {
createTorrent(new File("C:/x.torrent"), new File("C:/file"), "http://example.com/announce");
}
}
Code edits: made this a bit more compact, fixed method visibility, used character literals where appropriate, and used instanceof Number. More recently, I changed it to read the file using block I/O, because I'm trying to use it for real and byte I/O is just slow.
I'd start with Java Bittorrent API. The jar is about 70Kb but you can probably strip it down by removing the classes not necessary for creating torrents. The SDK has a sample ExampleCreateTorrent.java illustrating how to do exactly what you need.
You may also look how it's implemented in the open source Java clients such as Azureus.
