Spock mocking inputStream causes infinite loop - java

I have this code:
gridFSFile.inputStream?.bytes
When I try to test it this way:
given:
def inputStream = Mock(InputStream)
def gridFSDBFile = Mock(GridFSDBFile)
List<Byte> byteList = "test data".bytes
...
then:
1 * gridFSDBFile.getInputStream() >> inputStream
1 * inputStream.getBytes() >> byteList
0 * _
The problem is that inputStream.read(_) is called an infinite number of times. When I remove the 0 * _ interaction, the test hangs until the JVM dies in garbage collection.
Please advise how I can properly mock the InputStream without falling into an infinite loop, i.e. so that the line above can be tested with two (or so) interactions.

The following test works:
import spock.lang.Specification
class Spec extends Specification {
def 'it works'() {
given:
def is = GroovyMock(InputStream)
def file = Mock(GridFile)
byte[] bytes = 'test data'.bytes
when:
new FileHolder(file: file).read()
then:
1 * file.getInputStream() >> is
1 * is.getBytes() >> bytes
}
class FileHolder {
GridFile file;
def read() {
file.getInputStream().getBytes()
}
}
class GridFile {
InputStream getInputStream() {
null
}
}
}
Not 100% sure about it, but it seems that you need to use GroovyMock here, since getBytes is a method added dynamically by Groovy. Have a look here.
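For context, Groovy's inputStream.bytes boils down to a read loop roughly like the plain-Java sketch below (simplified; the real logic lives in Groovy's IOGroovyMethods, and the class and method names here are illustrative). It shows why a plain Mock(InputStream) spins: the stubbed read() returns its default value 0 and never -1, so the copy loop never terminates.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class GetBytesSketch {
    // Roughly what the Groovy getBytes/bytes extension does under the hood.
    static byte[] getBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) { // a mocked read() that defaults to 0 never hits -1
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}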

Related

How to implement fault tolerant file upload with akka remote and steam

I'm an Akka beginner. (I am using Java)
I'm making a file transfer system using Akka.
Currently, I have completed sending a file from Actor1 (local) to Actor2 (remote).
Now I'm thinking about how to handle problems that occur while transferring files.
That led me to the following question:
Suppose I lose my network connection while transferring a file, so the transfer fails at 90 percent complete,
and I recover my network connection a few minutes later.
Is it possible to transfer just the rest of the file data (the remaining 10%)?
If that's possible, please give me some advice.
Here is my simple code.
Thanks :)
Actor1 (Local)
private Behavior<Event> onTick() {
....
String fileName = "test.zip";
Source<ByteString, CompletionStage<IOResult>> logs = FileIO.fromPath(Paths.get(fileName));
logs.runForeach(f -> originalSize += f.size(), mat).thenRun(() -> System.out.println("originalSize : " + originalSize));
SourceRef<ByteString> logsRef = logs.runWith(StreamRefs.sourceRef(), mat);
getContext().ask(
Receiver.FileTransfered.class,
selectedReceiver,
timeout,
responseRef -> new Receiver.TransferFile(logsRef, responseRef, fileName),
(response, failure) -> {
if (response != null) {
return new TransferCompleted(fileName, response.transferedSize);
} else {
return new JobFailed("Processing timed out", fileName);
}
}
);
}
Actor2 (Remote)
public static Behavior<Command> create() {
return Behaviors.setup(context -> {
...
Materializer mat = Materializer.createMaterializer(context);
return Behaviors.receive(Command.class)
.onMessage(TransferFile.class, command -> {
command.sourceRef.getSource().runWith(FileIO.toPath(Paths.get("test.zip")), mat);
command.replyTo.tell(new FileTransfered("filename", 1024));
return Behaviors.same();
}).build();
});
}
You need to think about the following for a proper implementation of file transfer with fault tolerance:
How to identify that a transfer has to be resumed for a given file.
How to find the point from which to resume the transfer.
The following implementation makes very simple assumptions about 1 and 2:
The file name is unique and thus can be used for such identification. Strictly speaking, this is not true; for example, you can transfer files with the same name from different folders, or from different nodes, etc. You will have to adjust this based on your use case.
It is assumed that the last/all writes on the receiver side wrote all bytes correctly, so the total number of written bytes indicates the point from which to resume the transfer. If this cannot be guaranteed, you need to logically split the original file into chunks and transfer hashes of each chunk, its size and its position to the receiver, which has to validate the chunks on its side and find the correct pointer for resuming the transfer.
(That's a bit more than 1 and 2 :) ) This implementation ignores detection of the transfer problem itself and focuses on 1 and 2 instead.
The code:
object Sender {
sealed trait Command
case class Upload(file: String) extends Command
case class StartWithIndex(file: String, index: Long) extends Sender.Command
def behavior(receiver: ActorRef[Receiver.Command]): Behavior[Sender.Command] = Behaviors.setup[Sender.Command] { ctx =>
implicit val materializer: Materializer = SystemMaterializer(ctx.system).materializer
Behaviors.receiveMessage {
case Upload(file) =>
receiver.tell(Receiver.InitUpload(file, ctx.self.narrow[StartWithIndex]))
ctx.log.info(s"Initiating upload of $file")
Behaviors.same
case StartWithIndex(file, startWith) =>
val source = FileIO.fromPath(Paths.get(file), chunkSize = 8192, startPosition = startWith)
val ref = source.runWith(StreamRefs.sourceRef())
ctx.log.info(s"Starting upload of $file")
receiver.tell(Receiver.Upload(file, ref))
Behaviors.same
}
}
}
object Receiver {
sealed trait Command
case class InitUpload(file: String, replyTo: ActorRef[Sender.StartWithIndex]) extends Command
case class Upload(file: String, fileSource: SourceRef[ByteString]) extends Command
val behavior: Behavior[Receiver.Command] = Behaviors.setup[Receiver.Command] { ctx =>
implicit val materializer: Materializer = SystemMaterializer(ctx.system).materializer
Behaviors.receiveMessage {
case InitUpload(path, replyTo) =>
val file = fileAtDestination(path)
val index = if (file.exists()) file.length else 0
ctx.log.info(s"Got init command for $file at pointer $index")
replyTo.tell(Sender.StartWithIndex(path, index.toLong))
Behaviors.same
case Upload(path, fileSource) =>
val file = fileAtDestination(path)
val sink = if (file.exists()) {
FileIO.toPath(file.toPath, Set(StandardOpenOption.APPEND, StandardOpenOption.WRITE))
} else {
FileIO.toPath(file.toPath, Set(StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE))
}
ctx.log.info(s"Saving file into ${file.toPath}")
fileSource.runWith(sink)
Behaviors.same
}
}
}
Some auxiliary methods
val destination: File = Files.createTempDirectory("destination").toFile
def fileAtDestination(file: String) = {
val name = new File(file).getName
new File(destination, name)
}
def writeRandomToFile(file: File, size: Int): Unit = {
val out = new FileOutputStream(file, true)
(0 until size).foreach { _ =>
out.write(Random.nextPrintableChar())
}
out.close()
}
And finally some test code
// sender and receiver bootstrapping is omitted
//Create some dummy file to upload
val file: Path = Files.createTempFile("test", "test")
writeRandomToFile(file.toFile, 1000)
//Initiate a new upload
sender.tell(Sender.Upload(file.toAbsolutePath.toString))
// Sleep to allow file upload to finish
Thread.sleep(1000)
//Write more data to the file to emulate a failure
writeRandomToFile(file.toFile, 1000)
//Initiate a new upload that will "recover" from the previous upload
sender.tell(Sender.Upload(file.toAbsolutePath.toString))
Finally, the whole process can be defined as: the sender asks to upload a file, the receiver replies with the number of bytes it already has for that file, the sender streams the file starting from that offset, and the receiver appends the incoming bytes to its existing partial file.
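Since the question uses Java, here is a minimal sketch of the same resume idea with the Akka Streams Java DSL, assuming the FileIO.fromPath overload with a start position used in the Scala code above is also available in the Java DSL. The actor plumbing is omitted, and the names (ResumeSketch, resumableSource, appendingSink, destinationFile) are illustrative only.
import akka.stream.IOResult;
import akka.stream.javadsl.FileIO;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import akka.util.ByteString;
import java.io.File;
import java.nio.file.OpenOption;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.CompletionStage;

class ResumeSketch {
    // Sender side: start reading the file at the offset reported by the receiver.
    static Source<ByteString, CompletionStage<IOResult>> resumableSource(String fileName, long startPosition) {
        return FileIO.fromPath(Paths.get(fileName), 8192, startPosition);
    }

    // Receiver side: the current length of the partially written file is the resume offset.
    static long resumeOffset(File destinationFile) {
        return destinationFile.exists() ? destinationFile.length() : 0L;
    }

    // Receiver side: append to an existing partial file, or create it on the first attempt.
    static Sink<ByteString, CompletionStage<IOResult>> appendingSink(File destinationFile) {
        Set<OpenOption> options = destinationFile.exists()
                ? Set.of(StandardOpenOption.WRITE, StandardOpenOption.APPEND)
                : Set.of(StandardOpenOption.WRITE, StandardOpenOption.CREATE_NEW);
        return FileIO.toPath(destinationFile.toPath(), options);
    }
}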

Tee the InputStream from a launched process in Java/Kotlin

I'm launching a process using ProcessBuilder like so:
val pb = ProcessBuilder("/path/to/process")
pb.redirectErrorStream(true)
val proc = pb.start()
I'd like to do 2 things with the stdout of the process:
Continually monitor its most recent line of output
Log all lines to a file
As far as I can tell, in order to do both of these things I'll need to "split" the InputStream I get from proc.inputStream so that every line is mirrored to 2 other InputStreams: one that can be used to log to a file, and another to parse and monitor the status of the process.
One option would be to have a thread which reads from the InputStream and fires an event to "subscribers" with each line read, and I think this should work fine, but I was hoping to come up with a more generic "Tee" type functionality that would expose InputStreams to be consumed by whatever wanted to. Basically something like this:
val pb = ProcessBuilder("/path/to/process")
pb.redirectErrorStream(true)
val proc = pb.start()
val originalInputStream = proc.inputStream
val tee = Tee(originalInputStream)
// Every line read from originalInputStream would be
// mirrored to all branches (not necessarily every line
// from the beginning of the originalInputStream, but
// since the start of the lifetime of the created branch)
val branchOne: InputStream = tee.addBranch()
val branchTwo: InputStream = tee.addBranch()
I took a shot at a Tee class, but I'm not sure what to do in the addBranch method:
class Tee(inputStream: InputStream) {
val reader = BufferedReader(InputStreamReader(inputStream))
val branches = mutableListOf<OutputStream>()
fun readLine() {
val line = reader.readLine()
branches.forEach {
it.write(line.toByteArray())
}
}
fun addBranch(): InputStream {
// What to do here? Need to create an OutputStream
// which readLine can write to, but return an InputStream
// which will be updated with each future write to that
// OutputStream
}
}
EDIT: The implementation of Tee I ended up with was as follows:
/**
* Reads from the given [InputStream] and mirrors the read
* data to all of the created 'branches' off of it.
* All branches will 'receive' all data from the original
* [InputStream] starting at the point of
* the branch's creation.
* NOTE: This class will not read from the given [InputStream]
* automatically; its [read] must be invoked
* to read the data from the original stream and write it to
* the branches
*/
class Tee(inputStream: InputStream) {
val reader = BufferedReader(InputStreamReader(inputStream))
var branches = CopyOnWriteArrayList<OutputStream>()
fun read() {
val c = reader.read()
if (c == -1) return // don't forward end-of-stream as a 0xFF byte
branches.forEach {
// Write the character as-is (newlines included) so that readLine
// on the branched InputStreams works
it.write(c)
}
}
fun addBranch(): InputStream {
val outputStream = PipedOutputStream()
branches.add(outputStream)
return PipedInputStream(outputStream)
}
}
Take a look at org.apache.commons.io.input.TeeInputStream from Apache Commons; then you don't need to bother writing your own.
val pb = ProcessBuilder("/path/to/process")
pb.redirectErrorStream(true)
val proc = pb.start()
val original = proc.inputStream
val out = PipedOutputStream()
val `in` = PipedInputStream()
out.connect(`in`)
val tee = TeeInputStream(original, out)
Then just read from tee instead of original, and any bytes read will also be written to out. By using the piped streams, the data written to out is made available to be read via in, so you can now have two threads reading from in and tee independently: one thread writing to logs, and one thread monitoring lines.
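To make that concrete, here is a hedged Java sketch of the wiring (the log file name and the thread bodies are placeholders; the same shape works in Kotlin):
import org.apache.commons.io.input.TeeInputStream;
import java.io.*;

class TeeWiringSketch {
    static void wire(Process proc) throws IOException {
        PipedInputStream logBranch = new PipedInputStream();
        PipedOutputStream toLogBranch = new PipedOutputStream(logBranch);
        InputStream tee = new TeeInputStream(proc.getInputStream(), toLogBranch, true);

        // Thread 1: reads 'tee' line by line to track the most recent output line;
        // every byte it reads is mirrored into 'toLogBranch'.
        new Thread(() -> {
            try (BufferedReader monitor = new BufferedReader(new InputStreamReader(tee))) {
                String line;
                while ((line = monitor.readLine()) != null) {
                    // update "latest line" state here (illustrative)
                }
            } catch (IOException ignored) { }
        }).start();

        // Thread 2: reads the mirrored bytes from 'logBranch' and appends them to a file.
        new Thread(() -> {
            try (BufferedReader logLines = new BufferedReader(new InputStreamReader(logBranch));
                 PrintWriter logFile = new PrintWriter(new FileWriter("process.log", true))) {
                String line;
                while ((line = logLines.readLine()) != null) {
                    logFile.println(line);
                }
            } catch (IOException ignored) { }
        }).start();
    }
}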
Looks like a simple decorator will be enough for you:
class Tee(private vararg val branches: OutputStream) : OutputStream() {
override fun write(b: Int) {
for (branch in branches) {
branch.write(b)
}
}
override fun write(b: ByteArray) {
for (branch in branches) {
branch.write(b)
}
}
override fun write(b: ByteArray, off: Int, len: Int) {
for (branch in branches) {
branch.write(b, off, len)
}
}
override fun flush() {
for (branch in branches) {
branch.flush()
}
}
override fun close() {
for (branch in branches) {
branch.close()
}
}
}
And then you can just copy your input stream to a Tee which, underneath, can do anything: write to a file, parse the input, and so on.
If I understand correctly, you need to parse the data line by line, so you can add another OutputStream implementation which actually parses the input data and does whatever you need with it.
Also, please take a look at this answer. It is possibly what you need if you don't want to deal with multiple output streams.
You can also combine both techniques to gain even more power, for example writing to several output streams and parsing the data at the same time.

Groovy serialization without class definition

I am trying to create a serializable interface implementation in Groovy dynamically, which could be sent over the wire, deserialized, and executed with args. I have created an anonymous interface implementation using a map, but it fails on serialization.
gcloader = new GroovyClassLoader()
script = "class X { public def x = [call: {y -> y+1}] as MyCallable }"
gclass = gcloader.parseClass(script)
x = gclass.newInstance().x
// serializing x fails
I am not sure if a Groovy closure is compiled to a random class name, which would make it impossible to deserialize even if it gets serialized. Is there a way to do this?
Here's a piece of code that might be helpful:
import groovy.lang.GroovyClassLoader
def gcLoader = new GroovyClassLoader()
def script = """
class X implements Serializable {
public def x = [
call: { y -> y + 1 }
]
}"""
def cls = gcLoader.parseClass(script)
def inst = cls.newInstance().x
def baos = new ByteArrayOutputStream()
def oos = new ObjectOutputStream(baos)
oos.writeObject(inst)
def serialized = baos.toByteArray()
def bais = new ByteArrayInputStream(serialized)
def ois = new CustomObjectInputStream(bais, gcLoader)
inst = ois.readObject()
assert 2 == inst.call(1)
public class CustomObjectInputStream extends ObjectInputStream {
private ClassLoader classLoader
public CustomObjectInputStream(InputStream ins, ClassLoader classLoader) throws IOException {
super(ins)
this.classLoader = classLoader
}
protected Class<?> resolveClass(ObjectStreamClass desc) throws ClassNotFoundException {
return Class.forName(desc.getName(), false, classLoader)
}
}
Basically, you need an instance of ObjectInputStream with a custom ClassLoader.
According to my own limited research, I came to the conclusion that on the JVM there is no standard/popular library which can pickle code the way Python does, which was the requirement I was primarily after. There are some ways to do it through URL classloaders etc., but they come with some inherent complexities. I ended up simply sending the code string and recompiling it whenever required on each machine.
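For what it's worth, a minimal Java-side sketch of that "ship the source, recompile remotely" approach; the class and method names are made up for illustration.
import groovy.lang.GroovyClassLoader;

class RecompileSketch {
    static Object instantiate(String groovySource) throws Exception {
        GroovyClassLoader loader = new GroovyClassLoader();
        Class<?> cls = loader.parseClass(groovySource);      // compile the shipped source string
        return cls.getDeclaredConstructor().newInstance();   // fresh instance on the receiving JVM
    }
}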

In Java, how can I create an equivalent of an Apache Avro container file without being forced to use a File as a medium?

This is somewhat of a shot in the dark in case anyone savvy with the Java implementation of Apache Avro is reading this.
My high-level objective is to have some way to transmit some series of avro data over the network (let's just say HTTP for example, but the particular protocol is not that important for this purpose). In my context I have a HttpServletResponse I need to write this data to somehow.
I initially attempted to write the data as what amounted to a virtual version of an avro container file (suppose that "response" is of type HttpServletResponse):
response.setContentType("application/octet-stream");
response.setHeader("Content-transfer-encoding", "binary");
ServletOutputStream outStream = response.getOutputStream();
BufferedOutputStream bos = new BufferedOutputStream(outStream);
Schema someSchema = Schema.parse(".....some valid avro schema....");
GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("somefield", someData);
...
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
fileWriter.create(someSchema, bos);
fileWriter.append(someRecord);
fileWriter.close();
bos.flush();
This was all fine and dandy, except that it turns out Avro doesn't really provide a way to read a container file apart from an actual file: the DataFileReader only has two constructors:
public DataFileReader(File file, DatumReader<D> reader);
and
public DataFileReader(SeekableInput sin, DatumReader<D> reader);
where SeekableInput is some avro-specific customized form whose creation also ends up reading from a file. Now given that, unless there is some way to somehow coerce an InputStream into a File (http://stackoverflow.com/questions/578305/create-a-java-file-object-or-equivalent-using-a-byte-array-in-memory-without-a suggests that there is not, and I have tried looking around the Java documentation as well), this approach won't work if the reader on the other end of the OutputStream receives that avro container file (I'm not sure why they allowed one to output avro binary container files to an arbitrary OutputStream without providing a way to read them from the corresponding InputStream on the other end, but that's beside the point). It seems that the implementation of the container file reader requires the "seekable" functionality that a concrete File provides.
Okay, so it doesn't look like that approach will do what I want. How about creating a JSON response that mimics the avro container file?
public static Schema WRAPPER_SCHEMA = Schema.parse(
"{\"type\": \"record\", " +
"\"name\": \"AvroContainer\", " +
"\"doc\": \"a JSON avro container file\", " +
"\"namespace\": \"org.bar.foo\", " +
"\"fields\": [" +
"{\"name\": \"schema\", \"type\": \"string\", \"doc\": \"schema representing the included data\"}, " +
"{\"name\": \"data\", \"type\": \"bytes\", \"doc\": \"packet of data represented by the schema\"}]}"
);
I'm not sure if this is the best way to approach this given the above constraints, but it looks like this might do the trick. I'll put the schema (of "Schema someSchema" from above, for instance) as a String inside the "schema" field, and then put in the avro-binary-serialized form of a record fitting that schema (ie. "GenericRecord someRecord") inside the "data" field.
I actually wanted to know about a specific detail of that which is described below, but I thought it would be worthwhile to give a bigger context as well, so that if there is a better high-level approach I could be taking (this approach works but just doesn't feel optimal) please do let me know.
My question is, assuming I go with this JSON-based approach, how do I write the avro binary representation of my Record into the "data" field of the AvroContainer schema? For example, I got up to here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
Encoder e = new BinaryEncoder(baos);
datumWriter.write(resultsRecord, e);
e.flush();
GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("schema", someSchema.toString());
someRecord.put("data", ByteBuffer.wrap(baos.toByteArray()));
datumWriter = new GenericDatumWriter<GenericRecord>(WRAPPER_SCHEMA);
JsonGenerator jsonGenerator = new JsonFactory().createJsonGenerator(baos, JsonEncoding.UTF8);
e = new JsonEncoder(WRAPPER_SCHEMA, jsonGenerator);
datumWriter.write(someRecord, e);
e.flush();
PrintWriter printWriter = response.getWriter(); // recall that response is the HttpServletResponse
response.setContentType("text/plain");
response.setCharacterEncoding("UTF-8");
printWriter.print(baos.toString("UTF-8"));
I initially tried omitting the ByteBuffer.wrap clause, but then the line
datumWriter.write(someRecord, e);
threw an exception that I couldn't cast a byte array into ByteBuffer. Fair enough, it looks like when the Encoder class (of which JsonEncoder is a subclass) is called to write an avro Bytes object, it requires a ByteBuffer to be given as an argument. Thus, I tried encapsulating the byte[] with java.nio.ByteBuffer.wrap, but when the data was printed out, it was printed as a straight series of bytes, without being passed through the avro hexadecimal representation:
"data": {"bytes": ".....some gibberish other than the expected format...}
That doesn't seem right. According to the avro documentation, the example bytes object they give says that I need to put in a json object, an example of which looks like "\u00FF", and what I have put in there is clearly not of that format. What I now want to know is the following:
What is an example of an avro bytes format? Does it look something like "\uDEADBEEFDEADBEEF..."?
How do I coerce my binary avro data (as output by the BinaryEncoder into a byte[] array) into a format that I can stick into the GenericRecord object and have it print correctly in JSON? For example, I want an Object DATA for which I can call on some GenericRecord "someRecord.put("data", DATA);" with my avro serialized data inside?
How would I then read that data back into a byte array on the other (consumer) end, when it is given the text JSON representation and wants to recreate the GenericRecord as represented by the AvroContainer-format JSON?
(reiterating the question from before) Is there a better way I could be doing all this?
As Knut said, if you want to use something other than a file, you can either:
use SeekableByteArrayInput for anything you can shoehorn into a byte array (see the sketch below), or
implement SeekableInput in your own way, for example if you were getting it out of some weird database structure.
Or just use a file. Why not?
Those are your answers.
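For the first option, a minimal hedged sketch of reading a container that has been buffered into a byte array, using the standard org.apache.avro.file and org.apache.avro.generic classes (the class and method names of the sketch itself are illustrative):
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import java.io.IOException;

class ContainerFromBytes {
    static void readAll(byte[] containerBytes) throws IOException {
        // The writer's schema travels inside the container, so no schema needs to be supplied here.
        SeekableByteArrayInput input = new SeekableByteArrayInput(containerBytes);
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(input, datumReader)) {
            for (GenericRecord record : reader) {
                // process each record (illustrative)
            }
        }
    }
}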
The way I solved this was to ship the schemas separately from the data. I set up a connection handshake that transmits the schemas down from the server, then I send encoded data back and forth. You have to create an outside wrapper object like this:
{'name':'Wrapper','type':'record','fields':[
{'name':'schemaName','type':'string'},
{'name':'records','type':{'type':'array','items':'bytes'}}
]}
Where you first encode your array of records, one by one, into an array of encoded byte arrays. Everything in one array should have the same schema. Then you encode the wrapper object with the above schema -- set "schemaName" to be the name of the schema you used to encode the array.
On the server, you will decode the wrapper object first. Once you decode the wrapper object, you know the schemaName, and you have an array of objects you know how to decode -- use as you will!
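As an illustration of the sending side of that approach, here is a hedged Java sketch of encoding such a wrapper. It uses the EncoderFactory API from more recent Avro versions rather than the question's direct BinaryEncoder construction, and the class, method, and argument names are made up for the example; wrapperSchema is the Wrapper record shown above.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class WrapperEncoder {
    static byte[] encode(Schema wrapperSchema, Schema recordSchema, String schemaName,
                         List<GenericRecord> records) throws IOException {
        // 1. Encode each record to its own byte array with the record schema.
        GenericDatumWriter<GenericRecord> recordWriter = new GenericDatumWriter<>(recordSchema);
        List<ByteBuffer> encoded = new ArrayList<>();
        for (GenericRecord r : records) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(bytes, null);
            recordWriter.write(r, enc);
            enc.flush();
            encoded.add(ByteBuffer.wrap(bytes.toByteArray()));
        }
        // 2. Put the encoded array plus the schema name into the wrapper record.
        GenericRecord wrapper = new GenericData.Record(wrapperSchema);
        wrapper.put("schemaName", schemaName);
        wrapper.put("records", encoded);
        // 3. Encode the wrapper itself and return the bytes to send.
        GenericDatumWriter<GenericRecord> wrapperWriter = new GenericDatumWriter<>(wrapperSchema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        wrapperWriter.write(wrapper, enc);
        enc.flush();
        return out.toByteArray();
    }
}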
Note that you can get away without using the wrapper object if you use a protocol like WebSockets and an engine like Socket.IO (for Node.js). Socket.IO gives you a channel-based communication layer between browser and server. In that case, just use a specific schema for each channel and encode each message before you send it. You still have to share the schemas when the connection initiates, but if you are using WebSockets this is easy to implement. And when you are done you have an arbitrary number of strongly-typed, bidirectional streams between client and server.
Under Java and Scala, we tried using inception via code generated using the Scala nitro codegen. Inception is how the Javascript mtth/avsc library solved this problem. However, we ran into several serialization problems using the Java library where there were erroneous bytes being injected into the byte stream, consistently - and we could not figure out where those bytes were coming from.
Of course that meant building our own implementation of Varint with ZigZag encoding. Meh.
Here it is:
package com.terradatum.query
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.security.MessageDigest
import java.util.UUID
import akka.actor.ActorSystem
import akka.stream.stage._
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import com.nitro.scalaAvro.runtime.GeneratedMessage
import com.terradatum.diagnostics.AkkaLogging
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.elasticsearch.search.SearchHit
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
/*
* The original implementation of this helper relied exclusively on using the Header Avro record and inception to create
* the header. That didn't work for us because somehow erroneous bytes were injected into the output.
*
* Specifically:
* 1. 0x08 prepended to the magic
* 2. 0x0020 between the header and the sync marker
*
* Rather than continue to spend a large number of hours trying to troubleshoot why the Avro library was producing such
* erroneous output, we build the Avro Container File using a combination of our own code and Avro library code.
*
* This means that Terradatum code is responsible for the Avro Container File header (including magic, file metadata and
* sync marker) and building the blocks. We only use the Avro library code to build the binary encoding of the Avro
* records.
*
* @see https://avro.apache.org/docs/1.8.1/spec.html#Object+Container+Files
*/
object AvroContainerFileHelpers {
val magic: ByteBuffer = {
val magicBytes = "Obj".getBytes ++ Array[Byte](1.toByte)
val mg = ByteBuffer.allocate(magicBytes.length).put(magicBytes)
mg.position(0)
mg
}
def makeSyncMarker(): Array[Byte] = {
val digester = MessageDigest.getInstance("MD5")
digester.update(s"${UUID.randomUUID}#${System.currentTimeMillis()}".getBytes)
val marker = ByteBuffer.allocate(16).put(digester.digest()).compact()
marker.position(0)
marker.array()
}
/*
* Note that other implementations of avro container files, such as the javascript library
* mtth/avsc uses "inception" to encode the header, that is, a datum following a header
* schema should produce valid headers. We originally had attempted to do the same but for
* an unknown reason two bytes were being inserted into our header, one at the very beginning
* of the header before the MAGIC marker, and one right before the syncmarker of the header.
* We were unable to determine why this wasn't working, and so this solution was used instead
* where the record/map is encoded per the avro spec manually without the use of "inception."
*/
def header(schema: Schema, syncMarker: Array[Byte]): Array[Byte] = {
def avroMap(map: Map[String, ByteBuffer]): Array[Byte] = {
val mapBytes = map.flatMap {
case (k, vBuff) =>
val v = vBuff.array()
val byteStr = k.getBytes()
Varint.encodeLong(byteStr.length) ++ byteStr ++ Varint.encodeLong(v.length) ++ v
}
Varint.encodeLong(map.size.toLong) ++ mapBytes ++ Varint.encodeLong(0)
}
val schemaBytes = schema.toString.getBytes
val schemaBuffer = ByteBuffer.allocate(schemaBytes.length).put(schemaBytes)
schemaBuffer.position(0)
val metadata = Map("avro.schema" -> schemaBuffer)
magic.array() ++ avroMap(metadata) ++ syncMarker
}
def block(binaryRecords: Seq[Array[Byte]], syncMarker: Array[Byte]): Array[Byte] = {
val countBytes = Varint.encodeLong(binaryRecords.length.toLong)
val sizeBytes = Varint.encodeLong(binaryRecords.foldLeft(0)(_+_.length).toLong)
val buff: ArrayBuffer[Byte] = new scala.collection.mutable.ArrayBuffer[Byte]()
buff.append(countBytes:_*)
buff.append(sizeBytes:_*)
binaryRecords.foreach { rec =>
buff.append(rec:_*)
}
buff.append(syncMarker:_*)
buff.toArray
}
def encodeBlock[T](schema: Schema, records: Seq[GenericRecord], syncMarker: Array[Byte]): Array[Byte] = {
//block(records.map(encodeRecord(schema, _)), syncMarker)
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val binaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
records.foreach(record => writer.write(record, binaryEncoder))
binaryEncoder.flush()
val flattenedRecords = out.toByteArray
out.close()
val buff: ArrayBuffer[Byte] = new scala.collection.mutable.ArrayBuffer[Byte]()
val countBytes = Varint.encodeLong(records.length.toLong)
val sizeBytes = Varint.encodeLong(flattenedRecords.length.toLong)
buff.append(countBytes:_*)
buff.append(sizeBytes:_*)
buff.append(flattenedRecords:_*)
buff.append(syncMarker:_*)
buff.toArray
}
def encodeRecord[R <: GeneratedMessage with com.nitro.scalaAvro.runtime.Message[R]: ClassTag](
entity: R
): Array[Byte] =
encodeRecord(entity.companion.schema, entity.toMutable)
def encodeRecord(schema: Schema, record: GenericRecord): Array[Byte] = {
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val binaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
writer.write(record, binaryEncoder)
binaryEncoder.flush()
val bytes = out.toByteArray
out.close()
bytes
}
}
/**
* Encoding of integers with variable-length encoding.
*
* The avro specification uses a variable length encoding for integers and longs.
* If the most significant bit in a integer or long byte is 0 then it knows that no
* more bytes are needed, if the most significant bit is 1 then it knows that at least one
* more byte is needed. In signed ints and longs the most significant bit is traditionally
* used to represent the sign of the integer or long, but for us it's used to encode whether
* more bytes are needed. To get around this limitation we zig-zag through whole numbers such that
* negatives are odd numbers and positives are even numbers:
*
* i.e. -1, -2, -3 would be encoded as 1, 3, 5, and so on
* while 1, 2, 3 would be encoded as 2, 4, 6, and so on.
*
* More information is available in the avro specification here:
* @see http://lucene.apache.org/core/3_5_0/fileformats.html#VInt
* https://developers.google.com/protocol-buffers/docs/encoding?csw=1#types
*/
object Varint {
import scala.collection.mutable
def encodeLong(longVal: Long): Array[Byte] = {
val buff = new ArrayBuffer[Byte]()
Varint.zigZagSignedLong(longVal, buff)
buff.toArray[Byte]
}
def encodeInt(intVal: Int): Array[Byte] = {
val buff = new ArrayBuffer[Byte]()
Varint.zigZagSignedInt(intVal, buff)
buff.toArray[Byte]
}
def zigZagSignedLong[T <: mutable.Buffer[Byte]](x: Long, dest: T): Unit = {
// sign to even/odd mapping: http://code.google.com/apis/protocolbuffers/docs/encoding.html#types
writeUnsignedLong((x << 1) ^ (x >> 63), dest)
}
def writeUnsignedLong[T <: mutable.Buffer[Byte]](v: Long, dest: T): Unit = {
var x = v
while ((x & 0xFFFFFFFFFFFFFF80L) != 0L) {
dest += ((x & 0x7F) | 0x80).toByte
x >>>= 7
}
dest += (x & 0x7F).toByte
}
def zigZagSignedInt[T <: mutable.Buffer[Byte]](x: Int, dest: T): Unit = {
writeUnsignedInt((x << 1) ^ (x >> 31), dest)
}
def writeUnsignedInt[T <: mutable.Buffer[Byte]](v: Int, dest: T): Unit = {
var x = v
while ((x & 0xFFFFFF80) != 0) {
dest += ((x & 0x7F) | 0x80).toByte
x >>>= 7
}
dest += (x & 0x7F).toByte
}
}

How can I read a file to an InputStream then write it into an OutputStream in Scala?

I'm trying to use basic Java code in Scala to read from a file and write to an OutputStream, but when I use the usual while (... != -1) idiom, Scala gives me the warning "comparing types of Unit and Int with != will always yield true".
The code is as follows:
val file = this.cache.get(imageFileEntry).getValue().asInstanceOf[File]
response.setContentType( "image/%s".format( imageDescription.getFormat() ) )
val input = new BufferedInputStream( new FileInputStream( file ) )
val output = response.getOutputStream()
var read : Int = -1
while ( ( read = input.read ) != -1 ) {
output.write( read )
}
input.close()
output.flush()
How am I supposed to write from an input stream to an output stream in Scala?
I'm mostly interested in a Scala-like solution.
You could do this:
Iterator
.continually (input.read)
.takeWhile (-1 !=)
.foreach (output.write)
If this is slow:
Iterator
.continually (input.read)
.takeWhile (-1 !=)
.foreach (output.write)
you can expand it:
val bytes = new Array[Byte](1024) //1024 bytes - Buffer size
Iterator
.continually (input.read(bytes))
.takeWhile (-1 !=)
.foreach (read=>output.write(bytes,0,read))
output.close()
Assignment statements always return Unit in Scala, so read = input.read returns Unit, which never equals -1. You can do it like this:
while ({read = input.read; read != -1}) {
output.write(read)
}
def stream(inputStream: InputStream, outputStream: OutputStream) =
{
val buffer = new Array[Byte](16384)
def doStream(total: Int = 0): Int = {
val n = inputStream.read(buffer)
if (n == -1)
total
else {
outputStream.write(buffer, 0, n)
doStream(total + n)
}
}
doStream()
}
We can copy an InputStream to an OutputStream in a generic and type-safe manner using typeclasses. A typeclass is a concept; it's one approach to polymorphism, in this case ad-hoc polymorphism: the polymorphic behavior is supplied by typeclass instances that are selected based on type parameters. In our case, those parameters will be generic types on Scala traits.
Let's make Reader[I] and Writer[O] traits, where I and O are input and output stream types, respectively.
trait Reader[I] {
def read(input: I, buffer: Array[Byte]): Int
}
trait Writer[O] {
def write(output: O, buffer: Array[Byte], startAt: Int, nBytesToWrite: Int): Unit
}
We can now make a generic copy method that can operate on things that subscribe to these interfaces.
object CopyStreams {
type Bytes = Int
def apply[I, O](input: I, output: O, chunkSize: Bytes = 1024)(implicit r: Reader[I], w: Writer[O]): Unit = {
val buffer = Array.ofDim[Byte](chunkSize)
var count = -1
while ({count = r.read(input, buffer); count > 0})
w.write(output, buffer, 0, count)
}
}
Note the implicit r and w parameters here. Essentially, we're saying that CopyStreams[I,O].apply will work iff there are Reader[I] and Writer[O] values in scope. This lets us call CopyStreams(input, output) seamlessly.
Importantly, however, note that this implementation is generic. It operates on types that are independent of actual stream implementations.
In my particular use case, I needed to copy S3 objects to local files. So I made the following implicit values.
object Reader {
implicit val s3ObjectISReader = new Reader[S3ObjectInputStream] {
@inline override def read(input: S3ObjectInputStream, buffer: Array[Byte]): Int =
input.read(buffer)
}
}
object Writer {
implicit val fileOSWriter = new Writer[FileOutputStream] {
@inline override def write(output: FileOutputStream,
buffer: Array[Byte],
startAt: Int,
nBytesToWrite: Int): Unit =
output.write(buffer, startAt, nBytesToWrite)
}
}
So now I can do the following:
val input: S3ObjectInputStream = ...
val output = new FileOutputStream(new File(...))
import Reader._
import Writer._
CopyStreams(input, output)
// close and such...
And if we ever need to copy different stream types, we only need to write a new Reader or Writer implicit value. We can use the CopyStreams code without changing it!
