Read whole text files from a compression in Spark

Read whole text files from a compression in Spark - java

I have the following problem: suppose that I have a directory containing compressed directories which contain multiple files, stored on HDFS. I want to create an RDD consisting some objects of type T, i.e.:
context = new JavaSparkContext(conf);
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);
JavaRDD<T> processingFiles = filesRDD.map(fileNameContent -> {
// The name of the file
String fileName = fileNameContent._1();
// The content of the file
String content = fileNameContent._2();
// Class T has a constructor of taking the filename and the content of each
// processed file (as two strings)
T t = new T(content, fileName);
return t;
});
Now when inputDataPath is a directory containing files this works perfectly fine, i.e. when it's something like:
String inputDataPath = "hdfs://some_path/*/*/"; // because it contains subfolders
But, when there's a tgz containing multiple files, the file content (fileNameContent._2()) gets me some useless binary string (quite expected). I found a similar question on SO, but it's not the same case, because there the solution is when each compression consists of one file only, and in my case there are many other files which I want to read individually as whole files. I also found a question about wholeTextFiles, but this doesn't work in my case.
Any ideas how to do this?
EDIT:
I tried with the reader from here (trying to test the reader from here, like in the function testTarballWithFolders()), but whenever I call
TarballReader tarballReader = new TarballReader(fileName);
and I get NullPointerException:
java.lang.NullPointerException
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:83)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:77)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
at utils.TarballReader.<init>(TarballReader.java:61)
at main.SparkMain.lambda$0(SparkMain.java:105)
at main.SparkMain$$Lambda$18/1667100242.call(Unknown Source)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1015)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The line 105 in MainSpark is the one I showed upper in my edit of the post, and line 61 from TarballReader is
GZIPInputStream gzip = new GZIPInputStream(in);
which gives a null value for the input stream in in the upper line:
InputStream in = this.getClass().getResourceAsStream(tarball);
Am I on the right path here? If so, how do I continue? Why do I get this null value and how can I fix it?

One possible solution is to read data with binaryFiles and extract content manually.
Scala:
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try
import java.nio.charset._
def extractFiles(ps: PortableDataStream, n: Int = 1024) = Try {
val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
Stream.continually(Option(tar.getNextTarEntry))
// Read until next exntry is null
.takeWhile(_.isDefined)
// flatten
.flatMap(x => x)
// Drop directories
.filter(!_.isDirectory)
.map(e => {
Stream.continually {
// Read n bytes
val buffer = Array.fill[Byte](n)(-1)
val i = tar.read(buffer, 0, n)
(i, buffer.take(i))}
// Take as long as we've read something
.takeWhile(_._1 > 0)
.map(_._2)
.flatten
.toArray})
.toArray
}
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]) =
new String(bytes, StandardCharsets.UTF_8)
sc.binaryFiles("somePath").flatMapValues(x =>
extractFiles(x).toOption).mapValues(_.map(decode()))
libraryDependencies += "org.apache.commons" % "commons-compress" % "1.11"
Full usage example with Java: https://bitbucket.org/zero323/spark-multifile-targz-extract/src
Python:
import tarfile
from io import BytesIO
def extractFiles(bytes):
tar = tarfile.open(fileobj=BytesIO(bytes), mode="r:gz")
return [tar.extractfile(x).read() for x in tar if x.isfile()]
(sc.binaryFiles("somePath")
.mapValues(extractFiles)
.mapValues(lambda xs: [x.decode("utf-8") for x in xs]))

A slight improvement on the accepted answer is to change
Option(tar.getNextTarEntry)
to
Try(tar.getNextTarEntry).toOption.filter( _ != null)
to contend with malformed / truncated .tar.gzs in a robust way.
BTW, is there anything special about the size of the buffer array? Would it faster on average if it were closer to the average file size, maybe 500k in my case? Or is the slowdown I am seeing more likely the overhead of Stream relative to a while loop that was more Java-ish, I guess.

Related

MimeBodyPart.getInputStream only returns the first 8192 bytes of an email attachment

I am using MimeBodyPart.getInputStream to retrieve a file attached to an incoming email. Anytime the attached file is larger than 8192 bytes (8 KiB), the rest of the data is lost. fileInput.readAllBytes().length seems to always be min(fileSize, 8192). My relevant code looks like this
val part = multiPart.getBodyPart(i).asInstanceOf[MimeBodyPart]
if (Part.ATTACHMENT.equalsIgnoreCase(part.getDisposition)) {
val filePath = fileManager.generateRandomUniqueFilename
val fileName = part.getFileName
val fileSize = part.getSize
val fileContentType = part.getContentType
val fileInput = part.getInputStream
doSomething(filePath, fileInput.readAllBytes(), fileName, fileSize, fileContentType)
}
Note that the variable fileSize contains the right value (e. g. 63209 for a roughly 64kb file). I've tried this with two different mail servers yielding the same result. In the documentation I cannot find anything about a 8KiB limit. What is happening here?
Note: When I use part.getRawInputStream I receive the full data!

FileOutputStream in dart flutter

what's the equivalent fileoutputstream of java in dart?
Java code
file = new FileOutputStream(logFile, true);
byte[] input = "String".getBytes();
file.write(input);
java file output
String
Ive tried this in dart
Dart code
var file = File(logFile!.path).openWrite();
List input = "String".codeUnits;
file.write(input);
[String]
and every time I open the file again to append "String2" and "String3" to it, the output will be
[String][String2][String3]
as oppose to java's output
StringString3String3
to sum it up, is there a way to fix/workaround this?
why each array bytes written in dart will be a new array instead of append into an existing one?

You can achieve that by using File.writeAsString() and using FileMode.append.
Picking up your example, this would be:
var file = File(logFile!.path);
file.writeAsString("String", mode: FileMode.append);

did you try writeAsString() ?
import 'dart:io';
void main() async {
final filename = 'file.txt';
var file = await File(filename).writeAsString('some content');
// Do something with the file.
}

How to read and write Parquet files efficiently?

I am working on a utility which reads multiple parquet files at a time and writing them into one single output file.
the implementation is very straightforward. This utility reads parquet files from the directory, reads Group from all the file and put them into a list .Then uses ParquetWrite to write all these Groups into a single file. After reading 600mb it throws Out of memory error for Java heap space. It also takes 15-20 minutes to read and write 500mb of data.
Is there a way to make this operation more efficient?
Read method looks like this:
ParquetFileReader reader = new ParquetFileReader(conf, path, ParquetMetadataConverter.NO_FILTER);
ParquetMetadata readFooter = reader.getFooter();
MessageType schema = readFooter.getFileMetaData().getSchema();
ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);
reader.close();
PageReadStore pages = null;
try {
while (null != (pages = r.readNextRowGroup())) {
long rows = pages.getRowCount();
System.out.println("Number of rows: " + pages.getRowCount());
MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
for (int i = 0; i < rows; i++) {
Group g = (Group) recordReader.read();
//printGroup(g);
groups.add(g);
}
}
} finally {
System.out.println("close the reader");
r.close();
}
Write method is like this:
for(Path file : files){
groups.addAll(readData(file));
}
System.out.println("Number of groups from the parquet files "+groups.size());
Configuration configuration = new Configuration();
Map<String, String> meta = new HashMap<String, String>();
meta.put("startkey", "1");
meta.put("endkey", "2");
GroupWriteSupport.setSchema(schema, configuration);
ParquetWriter<Group> writer = new ParquetWriter<Group>(
new Path(outputFile),
new GroupWriteSupport(),
CompressionCodecName.SNAPPY,
2147483647,
268435456,
134217728,
true,
false,
ParquetProperties.WriterVersion.PARQUET_2_0,
configuration);
System.out.println("Number of groups to write:"+groups.size());
for(Group g : groups) {
writer.write(g);
}
writer.close();

I use these functions to merge parquet files, but it is in Scala. Anyway, it may give you good starting point.
import java.util
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}
import org.apache.parquet.schema.MessageType
import scala.collection.JavaConverters._
object ParquetFileMerger {
def mergeFiles(inputFiles: Seq[Path], outputFile: Path): Unit = {
val conf = new Configuration()
val mergedMeta = ParquetFileWriter.mergeMetadataFiles(inputFiles.asJava, conf).getFileMetaData
val writer = new ParquetFileWriter(conf, mergedMeta.getSchema, outputFile, ParquetFileWriter.Mode.OVERWRITE)
writer.start()
inputFiles.foreach(input => writer.appendFile(HadoopInputFile.fromPath(input, conf)))
writer.end(mergedMeta.getKeyValueMetaData)
}
def mergeBlocks(inputFiles: Seq[Path], outputFile: Path): Unit = {
val conf = new Configuration()
val parquetFileReaders = inputFiles.map(getParquetFileReader)
val mergedSchema: MessageType =
parquetFileReaders.
map(_.getFooter.getFileMetaData.getSchema).
reduce((a, b) => a.union(b))
val writer = new ParquetFileWriter(HadoopOutputFile.fromPath(outputFile, conf), mergedSchema, ParquetFileWriter.Mode.OVERWRITE, 64*1024*1024, 8388608)
writer.start()
parquetFileReaders.foreach(_.appendTo(writer))
writer.end(new util.HashMap[String, String]())
}
def getParquetFileReader(file: Path): ParquetFileReader = {
ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))
}
}

I faced with the very same problem. On not very big files (up to 100 megabytes), the writing time could be up to 20 minutes.
Try to use kite-sdk api. I know it looks as if abandoned but some things in it are done very efficiently. Also if you like Spring you can try spring-data-hadoop (which is some kind of a wrapper over kite-sdk-api). In my case the use of this libraries reduced the writing time to 2 minutes.
For example you can write in Parquet (using spring-data-hadoop but writing using kite-sdk-api looks quite similiar) in this manner:
final DatasetRepositoryFactory repositoryFactory = new DatasetRepositoryFactory();
repositoryFactory.setBasePath(basePath);
repositoryFactory.setConf(configuration);
repositoryFactory.setNamespace("my-parquet-file");
DatasetDefinition datasetDefinition = new DatasetDefinition(targetClass, true, Formats.PARQUET.getName());
try (DataStoreWriter<T> writer = new ParquetDatasetStoreWriter<>(clazz, datasetRepositoryFactory, datasetDefinition)) {
for (T record : records) {
writer.write(record);
}
writer.flush();
}
Of course you need to add some dependencies to your project (in my case this is spring-data-hadoop):
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-hadoop</artifactId>
<version>${spring.hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-hadoop-boot</artifactId>
<version>${spring.hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-hadoop-config</artifactId>
<version>${spring.hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-hadoop-store</artifactId>
<version>${spring.hadoop.version}</version>
</dependency>
If you absolutely want to do it using only native hadoop api, in any case it will be useful to take a look at the source code of these libraries in order to implement efficiently writing in parquet files.

What you are trying to achieve is already possible using the merge command of parquet-tools. However, it is not recommended for merging small files, since it doesn't actually merge the row groups, only places them one after the another (exactly how you describe it in your question). The resulting file will probably have bad performance characteristics.
If you would like to implement it yourself nevertheless, you can either increase the heap size, or modify the code so that it does not read all of the files into memory before writing the new file, but instead reads them one by one (or even better, rowgroup by rowgroup), and immediately writes them to the new file. This way you will only need to keep in memory a single file or row group.

I have implemented something solution using Spark with pyspark script, below is sample code for same, here loading multiple parquet files from directory location, also if parquet files schema is little different in files we are merging that as well.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("App_name") \
.getOrCreate()
dataset_DF = spark.read.option("mergeSchema", "true").load("/dir/parquet_files/")
dataset_DF.write.parquet("file_name.parquet")
Hope it will be short solution.

In Java, how can I create an equivalent of an Apache Avro container file without being forced to use a File as a medium?

This is somewhat of a shot in the dark in case anyone savvy with the Java implementation of Apache Avro is reading this.
My high-level objective is to have some way to transmit some series of avro data over the network (let's just say HTTP for example, but the particular protocol is not that important for this purpose). In my context I have a HttpServletResponse I need to write this data to somehow.
I initially attempted to write the data as what amounted to a virtual version of an avro container file (suppose that "response" is of type HttpServletResponse):
response.setContentType("application/octet-stream");
response.setHeader("Content-transfer-encoding", "binary");
ServletOutputStream outStream = response.getOutputStream();
BufferedOutputStream bos = new BufferedOutputStream(outStream);
Schema someSchema = Schema.parse(".....some valid avro schema....");
GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("somefield", someData);
...
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
fileWriter.create(someSchema, bos);
fileWriter.append(someRecord);
fileWriter.close();
bos.flush();
This was all fine and dandy, except that it turns out Avro doesn't really provide a way to read a container file apart from an actual file: the DataFileReader only has two constructors:
public DataFileReader(File file, DatumReader<D> reader);
and
public DataFileReader(SeekableInput sin, DatumReader<D> reader);
where SeekableInput is some avro-specific customized form whose creation also ends up reading from a file. Now given that, unless there is some way to somehow coerce an InputStream into a File (http://stackoverflow.com/questions/578305/create-a-java-file-object-or-equivalent-using-a-byte-array-in-memory-without-a suggests that there is not, and I have tried looking around the Java documentation as well), this approach won't work if the reader on the other end of the OutputStream receives that avro container file (I'm not sure why they allowed one to output avro binary container files to an arbitrary OutputStream without providing a way to read them from the corresponding InputStream on the other end, but that's beside the point). It seems that the implementation of the container file reader requires the "seekable" functionality that a concrete File provides.
Okay, so it doesn't look like that approach will do what I want. How about creating a JSON response that mimics the avro container file?
public static Schema WRAPPER_SCHEMA = Schema.parse(
"{\"type\": \"record\", " +
"\"name\": \"AvroContainer\", " +
"\"doc\": \"a JSON avro container file\", " +
"\"namespace\": \"org.bar.foo\", " +
"\"fields\": [" +
"{\"name\": \"schema\", \"type\": \"string\", \"doc\": \"schema representing the included data\"}, " +
"{\"name\": \"data\", \"type\": \"bytes\", \"doc\": \"packet of data represented by the schema\"}]}"
);
I'm not sure if this is the best way to approach this given the above constraints, but it looks like this might do the trick. I'll put the schema (of "Schema someSchema" from above, for instance) as a String inside the "schema" field, and then put in the avro-binary-serialized form of a record fitting that schema (ie. "GenericRecord someRecord") inside the "data" field.
I actually wanted to know about a specific detail of that which is described below, but I thought it would be worthwhile to give a bigger context as well, so that if there is a better high-level approach I could be taking (this approach works but just doesn't feel optimal) please do let me know.
My question is, assuming I go with this JSON-based approach, how do I write the avro binary representation of my Record into the "data" field of the AvroContainer schema? For example, I got up to here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
Encoder e = new BinaryEncoder(baos);
datumWriter.write(resultsRecord, e);
e.flush();
GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("schema", someSchema.toString());
someRecord.put("data", ByteBuffer.wrap(baos.toByteArray()));
datumWriter = new GenericDatumWriter<GenericRecord>(WRAPPER_SCHEMA);
JsonGenerator jsonGenerator = new JsonFactory().createJsonGenerator(baos, JsonEncoding.UTF8);
e = new JsonEncoder(WRAPPER_SCHEMA, jsonGenerator);
datumWriter.write(someRecord, e);
e.flush();
PrintWriter printWriter = response.getWriter(); // recall that response is the HttpServletResponse
response.setContentType("text/plain");
response.setCharacterEncoding("UTF-8");
printWriter.print(baos.toString("UTF-8"));
I initially tried omitting the ByteBuffer.wrap clause, but then then the line
datumWriter.write(someRecord, e);
threw an exception that I couldn't cast a byte array into ByteBuffer. Fair enough, it looks like when the Encoder class (of which JsonEncoder is a subclass) is called to write an avro Bytes object, it requires a ByteBuffer to be given as an argument. Thus, I tried encapsulating the byte[] with java.nio.ByteBuffer.wrap, but when the data was printed out, it was printed as a straight series of bytes, without being passed through the avro hexadecimal representation:
"data": {"bytes": ".....some gibberish other than the expected format...}
That doesn't seem right. According to the avro documentation, the example bytes object they give says that I need to put in a json object, an example of which looks like "\u00FF", and what I have put in there is clearly not of that format. What I now want to know is the following:
What is an example of an avro bytes format? Does it look something like "\uDEADBEEFDEADBEEF..."?
How do I coerce my binary avro data (as output by the BinaryEncoder into a byte[] array) into a format that I can stick into the GenericRecord object and have it print correctly in JSON? For example, I want an Object DATA for which I can call on some GenericRecord "someRecord.put("data", DATA);" with my avro serialized data inside?
How would I then read that data back into a byte array on the other (consumer) end, when it is given the text JSON representation and wants to recreate the GenericRecord as represented by the AvroContainer-format JSON?
(reiterating the question from before) Is there a better way I could be doing all this?

As Knut said, if you want to use something other than a file, you can either:
use SeekableByteArrayInput, as Knut said, for anything you can shoe-horn into a byte array
Implement SeekablInput in your own way - for example if you were getting it out of some weird database structure.
Or just use a file. Why not?
Those are your answers.

The way I solved this was to ship the schemas separately from the data. I set up a connection handshake that transmits the schemas down from the server, then I send encoded data back and forth. You have to create an outside wrapper object like this:
{'name':'Wrapper','type':'record','fields':[
{'name':'schemaName','type':'string'},
{'name':'records','type':{'type':'array','items':'bytes'}}
]}
Where you first encode your array of records, one by one, into an array of encoded byte arrays. Everything in one array should have the same schema. Then you encode the wrapper object with the above schema -- set "schemaName" to be the name of the schema you used to encode the array.
On the server, you will decode the wrapper object first. Once you decode the wrapper object, you know the schemaName, and you have an array of objects you know how to decode -- use as you will!
Note that you can get away without using the wrapper object if you use a protocol like WebSockets and an engine like Socket.IO (for Node.js) Socket.io gives you a channel-based communication layer between browser and server. In that case, just use a specific schema for each channel, encode each message before you send it. You still have to share the schemas when the connection initiates -- but if you are using WebSockets this is easy to implement. And when you are done you have an arbitrary number of strongly-typed, bidirectional streams between client and server.

Under Java and Scala, we tried using inception via code generated using the Scala nitro codegen. Inception is how the Javascript mtth/avsc library solved this problem. However, we ran into several serialization problems using the Java library where there were erroneous bytes being injected into the byte stream, consistently - and we could not figure out where those bytes were coming from.
Of course that meant building our own implementation of Varint with ZigZag encoding. Meh.
Here it is:
package com.terradatum.query
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.security.MessageDigest
import java.util.UUID
import akka.actor.ActorSystem
import akka.stream.stage._
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import com.nitro.scalaAvro.runtime.GeneratedMessage
import com.terradatum.diagnostics.AkkaLogging
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.elasticsearch.search.SearchHit
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
/*
* The original implementation of this helper relied exclusively on using the Header Avro record and inception to create
* the header. That didn't work for us because somehow erroneous bytes were injected into the output.
*
* Specifically:
* 1. 0x08 prepended to the magic
* 2. 0x0020 between the header and the sync marker
*
* Rather than continue to spend a large number of hours trying to troubleshoot why the Avro library was producing such
* erroneous output, we build the Avro Container File using a combination of our own code and Avro library code.
*
* This means that Terradatum code is responsible for the Avro Container File header (including magic, file metadata and
* sync marker) and building the blocks. We only use the Avro library code to build the binary encoding of the Avro
* records.
*
* #see https://avro.apache.org/docs/1.8.1/spec.html#Object+Container+Files
*/
object AvroContainerFileHelpers {
val magic: ByteBuffer = {
val magicBytes = "Obj".getBytes ++ Array[Byte](1.toByte)
val mg = ByteBuffer.allocate(magicBytes.length).put(magicBytes)
mg.position(0)
mg
}
def makeSyncMarker(): Array[Byte] = {
val digester = MessageDigest.getInstance("MD5")
digester.update(s"${UUID.randomUUID}#${System.currentTimeMillis()}".getBytes)
val marker = ByteBuffer.allocate(16).put(digester.digest()).compact()
marker.position(0)
marker.array()
}
/*
* Note that other implementations of avro container files, such as the javascript library
* mtth/avsc uses "inception" to encode the header, that is, a datum following a header
* schema should produce valid headers. We originally had attempted to do the same but for
* an unknown reason two bytes wore being inserted into our header, one at the very beginning
* of the header before the MAGIC marker, and one right before the syncmarker of the header.
* We were unable to determine why this wasn't working, and so this solution was used instead
* where the record/map is encoded per the avro spec manually without the use of "inception."
*/
def header(schema: Schema, syncMarker: Array[Byte]): Array[Byte] = {
def avroMap(map: Map[String, ByteBuffer]): Array[Byte] = {
val mapBytes = map.flatMap {
case (k, vBuff) =>
val v = vBuff.array()
val byteStr = k.getBytes()
Varint.encodeLong(byteStr.length) ++ byteStr ++ Varint.encodeLong(v.length) ++ v
}
Varint.encodeLong(map.size.toLong) ++ mapBytes ++ Varint.encodeLong(0)
}
val schemaBytes = schema.toString.getBytes
val schemaBuffer = ByteBuffer.allocate(schemaBytes.length).put(schemaBytes)
schemaBuffer.position(0)
val metadata = Map("avro.schema" -> schemaBuffer)
magic.array() ++ avroMap(metadata) ++ syncMarker
}
def block(binaryRecords: Seq[Array[Byte]], syncMarker: Array[Byte]): Array[Byte] = {
val countBytes = Varint.encodeLong(binaryRecords.length.toLong)
val sizeBytes = Varint.encodeLong(binaryRecords.foldLeft(0)(_+_.length).toLong)
val buff: ArrayBuffer[Byte] = new scala.collection.mutable.ArrayBuffer[Byte]()
buff.append(countBytes:_*)
buff.append(sizeBytes:_*)
binaryRecords.foreach { rec =>
buff.append(rec:_*)
}
buff.append(syncMarker:_*)
buff.toArray
}
def encodeBlock[T](schema: Schema, records: Seq[GenericRecord], syncMarker: Array[Byte]): Array[Byte] = {
//block(records.map(encodeRecord(schema, _)), syncMarker)
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val binaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
records.foreach(record => writer.write(record, binaryEncoder))
binaryEncoder.flush()
val flattenedRecords = out.toByteArray
out.close()
val buff: ArrayBuffer[Byte] = new scala.collection.mutable.ArrayBuffer[Byte]()
val countBytes = Varint.encodeLong(records.length.toLong)
val sizeBytes = Varint.encodeLong(flattenedRecords.length.toLong)
buff.append(countBytes:_*)
buff.append(sizeBytes:_*)
buff.append(flattenedRecords:_*)
buff.append(syncMarker:_*)
buff.toArray
}
def encodeRecord[R <: GeneratedMessage with com.nitro.scalaAvro.runtime.Message[R]: ClassTag](
entity: R
): Array[Byte] =
encodeRecord(entity.companion.schema, entity.toMutable)
def encodeRecord(schema: Schema, record: GenericRecord): Array[Byte] = {
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val binaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
writer.write(record, binaryEncoder)
binaryEncoder.flush()
val bytes = out.toByteArray
out.close()
bytes
}
}
/**
* Encoding of integers with variable-length encoding.
*
* The avro specification uses a variable length encoding for integers and longs.
* If the most significant bit in a integer or long byte is 0 then it knows that no
* more bytes are needed, if the most significant bit is 1 then it knows that at least one
* more byte is needed. In signed ints and longs the most significant bit is traditionally
* used to represent the sign of the integer or long, but for us it's used to encode whether
* more bytes are needed. To get around this limitation we zig-zag through whole numbers such that
* negatives are odd numbers and positives are even numbers:
*
* i.e. -1, -2, -3 would be encoded as 1, 3, 5, and so on
* while 1, 2, 3 would be encoded as 2, 4, 6, and so on.
*
* More information is available in the avro specification here:
* #see http://lucene.apache.org/core/3_5_0/fileformats.html#VInt
* https://developers.google.com/protocol-buffers/docs/encoding?csw=1#types
*/
object Varint {
import scala.collection.mutable
def encodeLong(longVal: Long): Array[Byte] = {
val buff = new ArrayBuffer[Byte]()
Varint.zigZagSignedLong(longVal, buff)
buff.toArray[Byte]
}
def encodeInt(intVal: Int): Array[Byte] = {
val buff = new ArrayBuffer[Byte]()
Varint.zigZagSignedInt(intVal, buff)
buff.toArray[Byte]
}
def zigZagSignedLong[T <: mutable.Buffer[Byte]](x: Long, dest: T): Unit = {
// sign to even/odd mapping: http://code.google.com/apis/protocolbuffers/docs/encoding.html#types
writeUnsignedLong((x << 1) ^ (x >> 63), dest)
}
def writeUnsignedLong[T <: mutable.Buffer[Byte]](v: Long, dest: T): Unit = {
var x = v
while ((x & 0xFFFFFFFFFFFFFF80L) != 0L) {
dest += ((x & 0x7F) | 0x80).toByte
x >>>= 7
}
dest += (x & 0x7F).toByte
}
def zigZagSignedInt[T <: mutable.Buffer[Byte]](x: Int, dest: T): Unit = {
writeUnsignedInt((x << 1) ^ (x >> 31), dest)
}
def writeUnsignedInt[T <: mutable.Buffer[Byte]](v: Int, dest: T): Unit = {
var x = v
while ((x & 0xFFFFF80) != 0L) {
dest += ((x & 0x7F) | 0x80).toByte
x >>>= 7
}
dest += (x & 0x7F).toByte
}
}

Compare file extension to file header

I'm starting to design an application, that will, in part, run through a directory of files and compare their extensions to their file headers.
Does anyone have any advice as to the best way to approach this? I know I could simply have a lookup table that will contain the file's header signature. e.g., JPEG: \xFF\xD8\xFF\xE0
I was hoping there might be a simper way.
Thanks in advance for your help.

I'm afraid it'll have to be more complicated than that. Not every file type has a header at all, and some (such as RAR) have their characteristic data structures at the end rather than at the beginning.
You may want to take a look at the Unix file command, which does the same job:
http://linux.die.net/man/1/file
http://linux.die.net/man/5/magic

If you don't need to do dirty work on these values (and you don't have linux) you could simply use an external program, like TrID, that is able to do this thing for you.
Maybe you can just work on its output without caring to doing it by yourself.. in anycase if you have just around 20 kinds of files that you will have to manage having a simple lookup table (eg. HashMap<String,byte[]>) is not that bad. Of cours this will work only if desidered file format has a magic number, otherwise you are on your own (or with an external program).

Because of the problem with the missing significant header for some file types (thanks #Michael) I would create a map of extension to a kind of type checker with a simple API like
public interface TypeCheck throws IOException {
public boolean isValid(InputStream data);
}
Now you can code something like
File toBeTested = ...;
Map<String,TypeCheck> typeCheckByExtension = ...;
TypeCheck check = typeCheckByExtension.get(getExtension(toBeTested.getName()));
if (check != null) {
InputStream in = new FileInputStream(toBeTested);
if (check.isValid(in)) {
// process valid file
} else {
// process invalid file
}
in.close();
} else {
// process unknown file
}
The Header check for JPEG for example may look like
public class JpegTypeCheck implements TypeCheck {
private static final byte[] HEADER = new byte[] {0xFF, 0xD8, 0xFF, 0xE0};
public boolean isValid(InputStream data) throws IOException {
byte[] header = new byte[4];
return data.read(header) == 4 && Arrays.equals(header, HEADER);
}
}
For other types with no significant header you can implement completly other type checks.

You can extract the mime type for each file and compare this to a map of mimetype/extension (Map<String, List<String>>, the first String is the mime type, the second is a list of valid extensions).
Resources :
Get the Mime Type from a File
JMimeMagic
On the same topic :
Java - HowTo extract MimeType from a byte[]
Getting A File's Mime Type In Java

You can know the file type of file reading the header using apache tika. Following code need apache tika jar.
InputStream is = MainApp.class.getResourceAsStream("/NetFx20SP1_x64.txt");
BufferedInputStream bis = new BufferedInputStream(is);
AutoDetectParser parser = new AutoDetectParser();
Detector detector = parser.getDetector();
Metadata md = new Metadata();
md.add(Metadata.RESOURCE_NAME_KEY,MainApp.class.getResource("/NetFx20SP1_x64.txt").getPath());
MediaType mediaType = detector.detect(bis, md);
System.out.println("MIMe Type of File : " + mediaType.toString());

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Read whole text files from a compression in Spark - java

Related

MimeBodyPart.getInputStream only returns the first 8192 bytes of an email attachment

FileOutputStream in dart flutter

How to read and write Parquet files efficiently?

In Java, how can I create an equivalent of an Apache Avro container file without being forced to use a File as a medium?

Compare file extension to file header

Categories

Resources