I want to create a custom NiFi processor which can read ESRI ASCII grid files and return a CSV-like representation with some metadata per file and geo-referenced user data in WKT format.
Unfortunately, the parsed result is not written back as an updated flow file.
https://github.com/geoHeil/geomesa-nifi/blob/rasterAsciiGridToWKT/geomesa-nifi-processors/src/main/scala/org/geomesa/nifi/geo/AsciiGrid2WKT.scala#L71-L107 is my attempt at making this happen in NiFi.
Unfortunately, only the original files are returned; the converted output is not persisted.
When trying to adapt it to manually serialize some CSV strings like:
val lineSep = System.getProperty("line.separator")
val csvResult = result.map(p => p.productIterator.map {
  case Some(value) => value
  case None => ""
  case rest => rest
}.mkString(";")).mkString(lineSep)

var output = session.write(flowFile, new OutputStreamCallback() {
  @throws[IOException]
  def process(outputStream: OutputStream): Unit = {
    IOUtils.write(csvResult, outputStream, "UTF-8")
  }
})
still no flow files are written. Either the issue from above persists or I get "stream not closed" exceptions for the outputStream.
It must be some tiny detail that is missing, but I can't seem to find it.
Each session method that changes a flow file, such as session.write(), returns a new version of that flow file, and you have to transfer this new version.
If you change your flow file in the converterIngester() function, you have to return this new version to the calling function so it can be transferred to the relationship.
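A minimal sketch of what that looks like, assuming the relationship is called SuccessRelationship (the name is an assumption; use whatever relationship your processor declares):

// Keep the FlowFile returned by session.write() and transfer that reference,
// not the original flowFile. SuccessRelationship is an assumed relationship name.
val updated: FlowFile = session.write(flowFile, new OutputStreamCallback() {
  @throws[IOException]
  def process(outputStream: OutputStream): Unit = {
    IOUtils.write(csvResult, outputStream, "UTF-8")
  }
})
session.transfer(updated, SuccessRelationship)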
Related
I am reading parts of a large file via a Java FileInputStream and would like to stream its content back to the client (in the form of an akka HttpResponse). I am wondering if this is possible, and how I would do this?
From my research, EntityStreamingSupport can be used, but it only supports JSON or CSV data. I will be streaming raw data from the file, which will not be in the form of JSON or CSV.
Assuming you use akka-http and Scala, you may use getFromFile to stream the entire binary file from a path into the HttpResponse like this:
path("download") {
get {
entity(as[FileHandle]) { fileHandle: FileHandle =>
println(s"Server received download request for: ${fileHandle.fileName}")
getFromFile(new File(fileHandle.absolutePath), MediaTypes.`application/octet-stream`)
}
}
}
Taken from this file upload/download roundtrip akka-http example:
https://github.com/pbernet/akka_streams_tutorial/blob/f246bc061a8f5a1ed9f79cce3f4c52c3c9e1b57a/src/main/scala/akkahttp/HttpFileEcho.scala#L52
Streaming the entire file eliminates the need for "manual chunking", so the example above runs with a limited heap size.
However, if needed, manual chunking could be done like this:
val fileInputStream = new FileInputStream(fileHandle.absolutePath)
val chunked: Source[ByteString, Future[IOResult]] = akka.stream.scaladsl.StreamConverters
  .fromInputStream(() => fileInputStream, chunkSize = 10 * 1024)
chunked.map(each => println(each)).runWith(Sink.ignore)
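To actually serve those chunks instead of printing them, the Source can be handed to the response entity. A minimal sketch, reusing fileHandle from the route above (assumed to still be in scope):

// Stream the chunked Source as the response body; application/octet-stream matches
// the media type used in getFromFile above.
val chunked: Source[ByteString, Future[IOResult]] = akka.stream.scaladsl.StreamConverters
  .fromInputStream(() => new FileInputStream(fileHandle.absolutePath), chunkSize = 10 * 1024)
complete(HttpResponse(entity = HttpEntity(ContentTypes.`application/octet-stream`, chunked)))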
I am designing a Spark job in order to:
Parse a binary file that comes inside a .tar.gz file
Create a Dataframe with POJOs extracted from the byte array
Store them in parquet
For the parsing of the binary file, I am using some legacy Java code that reads fixed-length fields from the byte array. This code works when I execute it as part of a regular JVM process on my laptop.
However, when I upload the same file to HDFS and try to read it from Spark, the fixed-length reading of the fields fails as I never get the fields that the Java code expects.
Standalone code used successfully:
// This is a local path in my laptop
val is = new GZIPInputStream(new FileInputStream(basepath + fileName))
val reader = new E4GTraceFileReader(is,fileName)
// Here I invoke the legacy Java code
// The result here is correct
val result = reader.readTraces()
Spark Job:
val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
val hdfsFiles = spark.sparkContext.parallelize(hdfs.listStatus(new Path("SOME_PATH")).map(_.getPath))

// Create an InputStream from each file in the folder
val inputStreamsRDD = hdfsFiles.map(x => {
  val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
  (hdfs.open(x).getWrappedStream, x)
})

// Read the InputStream into a byte[]
val tracesRDD = inputStreamsRDD.flatMap(x => readTraceRecord(x._1, x._2)).map(flattenPOJO)
private def readTraceRecord(is: InputStream, fileName: Path): List[E4GEventPacket] = {
  println(s"Starting to read ${fileName.getName}")
  val reader = new E4GTraceFileReader(is, fileName.getName)
  reader.readTraces().asScala.toList
}
I have tried both the FSDataInputStream returned by hdfs.open as well as hdfs.open(x).getWrappedStream, but I don't get the expected result.
I don't know whether I should paste the legacy Java code here, as it is a bit lengthy; however, it clearly fails to get the expected fields.
Do you think that the problem here is the serialization done in Spark from the driver program to the executors, which causes the data to be somehow corrupted?
I have tried both YARN and local[1], but I get the same results.
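One difference between the two snippets worth noting: the standalone version wraps the stream in a GZIPInputStream, while the Spark version hands the raw HDFS stream to the reader. Assuming the files in HDFS are the same gzip-compressed files (an assumption here, not something stated above), aligning the two would look roughly like this:

// Sketch: decompress on the executors the same way the standalone code does,
// assuming the files stored in HDFS are still gzip-compressed.
val inputStreamsRDD = hdfsFiles.map { x =>
  val fs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
  (new GZIPInputStream(fs.open(x)), x)
}
val tracesRDD = inputStreamsRDD.flatMap(x => readTraceRecord(x._1, x._2)).map(flattenPOJO)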
I am new to Spark and DataFrames. I came across the piece of code below, provided by the Databricks library, to read a CSV from a specified path in the file system.
sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load("../Downlos/2017.csv")
Is there any API in the Databricks CSV package that parses a byte array from an HTTP request instead of reading from the file system?
The use case here is to read a multipart (CSV) file uploaded via a Spring REST handler using Spark DataFrames. I'm looking for a DataFrame API that can take a file/byte array as input instead of reading from the file system.
From the file read, I need to select only those fields in each row that match a given condition (e.g. any column value that is not equal to the string "play" in each parsed row) and save only those fields back to the database.
Can anyone suggest whether the use case above is feasible in Spark using RDDs/DataFrames? Any suggestions on this would be of much help.
You cannot create an RDD from it directly; you have to convert the content to a String first, and then you can create the RDD.
Check this: URL contents to a String or file
val html = scala.io.Source.fromURL("https://spark.apache.org/").mkString
val list = html.split("\n").filter(_ != "")
val rdds = sc.parallelize(list)
val count = rdds.filter(_.contains("Spark")).count()
Scala fromURL API
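If a DataFrame is needed rather than an RDD, one option is to turn the in-memory CSV content into a Dataset[String] and hand it to the CSV reader. A minimal sketch, assuming Spark 2.2+ (where DataFrameReader.csv accepts a Dataset[String]) and a SparkSession named spark; csvBytes and the column names are made-up placeholders for the uploaded multipart body:

// Build a DataFrame from CSV bytes received over HTTP, without touching the file system.
import spark.implicits._
val csvBytes: Array[Byte] = "id,action\n1,play\n2,stop".getBytes("UTF-8") // stands in for the uploaded body
val csvLines = new String(csvBytes, "UTF-8").split("\n").filter(_.nonEmpty).toSeq
val csvDataset = spark.createDataset(csvLines)                 // Dataset[String]
val df = spark.read.option("header", "true").csv(csvDataset)   // parse the in-memory CSV
df.filter($"action" =!= "play").show()                         // keep rows whose (assumed) column is not "play"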
I'm using spring-test with spring-boot in a small Scala application. (Apologies for the long intro/short question!)
So far, everything has worked out fine until I decided to modify one of the endpoints to support streaming. To do this, I added the HttpServletResponse object to my request handler and copy the source data using Apache Commons' IOUtils.copy.
@RequestMapping(value = Array("/hello"), method = Array(RequestMethod.GET))
def retrieveFileForVersion(response: HttpServletResponse) = {
  val is = getAnInputStream
  val os = response.getOutputStream
  try {
    IOUtils.copy(is, os)
  } finally {
    IOUtils.closeQuietly(is)
    os.flush()
    IOUtils.closeQuietly(os)
  }
}
This seems to work rather well. I can retrieve binary data from the endpoint and verify that its MD5 checksum matches the source data's MD5 checksum.
However, I noticed this is no longer the case when using the REST controller in spring-test's MockMvc. In fact, when the request is performed through MockMvc, the response is actually four bytes bigger than usual. Thus, some simple assertions fail:
@Test
def testHello() = {
  // ... snip ... read the binary file into a byte array
  val bytes = IOUtils.toByteArray(...)
  val result = mockMvc.perform(get("/hello")).andExpect(status.isOk).andReturn
  val responseLength = result.getResponse.getContentAsByteArray.length
  // TODO - find out why this test fails!
  assert(responseLength == bytes.length, s"Response size: $responseLength, file size: ${bytes.length}")
  assert(Arrays.equals(result.getResponse.getContentAsByteArray, bytes))
}
Using the debugger, I was able to determine that MockMvc is appending to the response OutputStream even though it has already been closed via IOUtils.closeQuietly. In fact, it is appending the return value of the request handler, which is the number of bytes written to the OutputStream (the Int returned by IOUtils.copy, since that is the last expression in the method).
Why is MockMvc appending to the OutputStream after it's already closed? Is this a bug, or am I using the library incorrectly?
The return value from a controller method can be interpreted in different ways depending on the return type, the method annotations, and in some cases the input arguments.
This is exhaustively listed on the @RequestMapping annotation and in the reference documentation. For your streaming case, the combination of HttpServletResponse as an input argument (you could also take an OutputStream, by the way) and void as the return type indicates to Spring MVC that you have handled the response yourself.
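Applied to the handler in the question, that means giving the method an explicit Unit return type (which compiles to void), so that the Int returned by IOUtils.copy is no longer treated as a response value. A minimal sketch:

// Declare Unit explicitly so Spring MVC sees a void handler and considers the
// response fully handled, instead of appending the Int returned by IOUtils.copy.
@RequestMapping(value = Array("/hello"), method = Array(RequestMethod.GET))
def retrieveFileForVersion(response: HttpServletResponse): Unit = {
  val is = getAnInputStream
  val os = response.getOutputStream
  try {
    IOUtils.copy(is, os)
  } finally {
    IOUtils.closeQuietly(is)
    os.flush()
    IOUtils.closeQuietly(os)
  }
}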
I am trying to write the name of a file into Accumulo. I am using accumulo-core-1.43.
For some reason, certain files seem to be written into Accumulo with trailing \x00 characters at the end of the name. The upload is coming through a Java servlet (using the jquery file upload plugin). In the servlet, I check the name of the file with a System.out.println and it looks normal, and I even tried unescaping the string with
org.apache.commons.lang.StringEscapeUtils.unescapeJava(...);
The actual write to Accumulo looks like this:
Mutation mut = new Mutation(new Text(checkSum));
Value val = new Value(new Text(filename).getBytes());
long timestamp = System.currentTimeMillis();
mut.put(new Text(colFam), new Text(EMPTY_BYTES), timestamp, val);
but nothing unusual showed up there (perhaps \x00 isn't escaped?). If I then do a scan on my table in Accumulo, there will be one or more \x00 characters in the file name.
The problem this seems to cause: I return that string within XML when I retrieve a list of files (where the extra characters show up) and pass that back to the browser, and the XSL that is supposed to render the information in the XML no longer works when these extra characters are present (not sure why that is the case either).
In Chrome, in the response for these calls, I see three red dots after the file name, and when I hover over them, \u0 pops up (which I think is a different representation of 0/null?).
Anyway, I'm just trying to figure out why this happens, or at the very least, how I can filter out the \x00 characters before returning the file name in Java. Any ideas?
You are likely using the Hadoop Text class incorrectly -- this is not an error in Accumulo. Specifically, the mistake is in this line from your example above:
Value val = new Value(new Text(filename).getBytes());
You must adhere to the length reported by the Text class: Text.getBytes() returns the backing array, which can be longer than the valid data. See the Text javadoc for more information. If you're using Hadoop 2.2.0, you can use the provided copyBytes method on Text. If you're on an older version of Hadoop where this method doesn't exist yet, you can use something like the ByteBuffer class or the System.arraycopy method to get a copy of the byte[] with the proper limits enforced.
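A minimal Scala sketch of both options, reusing filename from the snippet above (the other variable names are just placeholders):

import org.apache.hadoop.io.Text
import org.apache.accumulo.core.data.Value

val text = new Text(filename)

// Hadoop 2.2.0+: copyBytes() returns a byte[] trimmed to getLength(), so no
// trailing 0x00 padding from the backing array ends up in the Value.
val trimmedValue = new Value(text.copyBytes())

// Older Hadoop versions: copy only the valid range of the backing array.
val trimmedBytes = java.util.Arrays.copyOf(text.getBytes, text.getLength)
val trimmedValueOld = new Value(trimmedBytes)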