I am designing a Spark job in order to:
Parse a binary file that comes inside a .tar.gz file
Create a Dataframe with POJOs extracted from the byte array
Store them in parquet
For the parsing of the binary file I am using some legacy Java code that reads fixed-length fields from the byte array. This code works when I execute it as part of a regular JVM process on my laptop.
However, when I upload the same file to HDFS and try to read it from Spark, the fixed-length reading of the fields fails: I never get the fields that the Java code expects.
Standalone code used successfully:
// This is a local path in my laptop
val is = new GZIPInputStream(new FileInputStream(basepath + fileName))
val reader = new E4GTraceFileReader(is,fileName)
// Here I invoke the legacy Java code
// The result here is correct
val result = reader.readTraces()
Spark Job:
val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
val hdfsFiles = spark.sparkContext.parallelize(hdfs.listStatus(new Path("SOME_PATH")).map(_.getPath))
// Create Input Stream from each file in the folder
val inputStreamsRDD = hdfsFiles.map(x => {
  val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
  (hdfs.open(x).getWrappedStream, x)
})
// Read the InputStream into a byte[]
val tracesRDD = inputStreamsRDD.flatMap(x => readTraceRecord(x._1,x._2)).map(flattenPOJO)
private def readTraceRecord(is: InputStream, fileName: Path): List[E4GEventPacket] = {
  println(s"Starting to read ${fileName.getName}")
  val reader = new E4GTraceFileReader(is, fileName.getName)
  reader.readTraces().asScala.toList
}
I have tried both the FSDataInputStream returned by hdfs.open and hdfs.open(x).getWrappedStream, but I don't get the expected result.
I don't know whether I should paste the legacy Java code here, as it is a bit lengthy, but it clearly fails to get the expected fields.
Do you think the problem here is the serialization done in Spark from the driver program to the executors, which somehow corrupts the data?
I have tried both YARN and local[1] but I get the same results.
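One detail worth double-checking before suspecting serialization: the standalone snippet wraps the file in a GZIPInputStream, while the Spark job hands the raw (still compressed) HDFS stream to E4GTraceFileReader. A minimal sketch of decompressing on the executors, mirroring the standalone code (which only gunzips, leaving any tar handling to the legacy reader just as in the working case):
import java.net.URI
import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Decompress on the executor, exactly as the standalone code does,
// before handing the stream to the legacy reader.
val tracesRDD = hdfsFiles.flatMap { p =>
  val fs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
  val is = new GZIPInputStream(fs.open(p))
  try readTraceRecord(is, p)
  finally is.close()
}.map(flattenPOJO)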
Related
I am reading parts of a large file via a Java FileInputStream and would like to stream its content back to the client (in the form of an akka HttpResponse). I am wondering if this is possible, and how I would do it.
From my research, EntityStreamingSupport can be used but only supports json or csv data. I will be streaming raw data from the file, which will not be in the form of json or csv.
Assuming you use akka-http and Scala, you can use getFromFile to stream the entire binary file from a path into the HttpResponse like this:
path("download") {
get {
entity(as[FileHandle]) { fileHandle: FileHandle =>
println(s"Server received download request for: ${fileHandle.fileName}")
getFromFile(new File(fileHandle.absolutePath), MediaTypes.`application/octet-stream`)
}
}
}
Taken from this file upload/download roundtrip akka-http example:
https://github.com/pbernet/akka_streams_tutorial/blob/f246bc061a8f5a1ed9f79cce3f4c52c3c9e1b57a/src/main/scala/akkahttp/HttpFileEcho.scala#L52
Streaming the entire file eliminates the need for "manual chunking", so the example above will run with a limited heap size.
However, if needed, manual chunking could be done like this:
val fileInputStream = new FileInputStream(fileHandle.absolutePath)
val chunked: Source[ByteString, Future[IOResult]] = akka.stream.scaladsl.StreamConverters
.fromInputStream(() => fileInputStream, chunkSize = 10 * 1024)
chunked.map(each => println(each)).runWith(Sink.ignore)
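If the point of the manual chunking is to actually send the bytes back to the client rather than drain them into Sink.ignore, the Source can be wired straight into the response entity. A sketch, with a placeholder file path:
import java.nio.file.Paths
import akka.http.scaladsl.model.{ContentTypes, HttpEntity}
import akka.http.scaladsl.server.Directives._
import akka.stream.scaladsl.FileIO

// Placeholder path; in the example above it would come from fileHandle.absolutePath.
val chunkedRoute =
  path("downloadChunked") {
    get {
      val chunks = FileIO.fromPath(Paths.get("some/local/file.bin"), chunkSize = 10 * 1024)
      // The chunked Source backs the response entity, so the file is streamed
      // to the client without being loaded into memory.
      complete(HttpEntity(ContentTypes.`application/octet-stream`, chunks))
    }
  }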
I am trying to unzip a file that comes in the "response" of an HTTP request.
My problem is that after receiving the response I cannot unzip it, nor turn it into a blob to parse it afterward.
The zip always contains an XML file, and the idea, once the file is unzipped, is to transform the XML into JSON.
Here is the code I tried:
val client = HttpClient.newBuilder().build();
val request = HttpRequest.newBuilder()
.uri(URI.create("https://donnees.roulez-eco.fr/opendata/instantane"))
.build();
val response = client.send(request, HttpResponse.BodyHandlers.ofString());
The response.body() is then just unreadable, and I did not find a proper way to turn it into a blob.
The other code I used for unzipping directly is this one:
val url = URL("https://donnees.roulez-eco.fr/opendata/instantane")
val con = url.openConnection() as HttpURLConnection
con.setRequestProperty("Accept-Encoding", "gzip")
println("Length : " + con.contentLength)
var reader: Reader? = null
reader = InputStreamReader(GZIPInputStream(con.inputStream))
while (true) {
    val ch: Int = reader.read()
    if (ch == -1) {
        break
    }
    print(ch.toChar())
}
But in this case, the content is not accepted as gzip.
Any idea?
It looks like you're confusing zip (an archive format that supports compression) with gzip (a simple compressed format).
Downloading https://donnees.roulez-eco.fr/opendata/instantane (e.g. with curl) and checking the result shows that it's a zip archive (containing a single file, PrixCarburants_instantane.xml).
But you're trying to decode it as a gzip stream (with GZIPInputStream), which it's not — hence your issue.
Reading a zip file is slightly more involved than reading a gzip file, because it can hold multiple compressed files. But ZipInputStream makes it fairly easy: you can read the first zip entry (which has metadata including its uncompressed size), and then go on to read the actual data in that entry.
A further complication is that this particular compressed file seems to use ISO 8859-1 encoding, not the usual UTF-8. So you need to take that into account when converting the byte stream into text.
Here's some example code:
val zipStream = ZipInputStream(con.inputStream)
val entry = zipStream.nextEntry
val reader = InputStreamReader(zipStream, Charset.forName("ISO-8859-1"))
for (i in 1..entry.size)
    print(reader.read().toChar())
Obviously, reading and printing the entire 11MB file one character at a time is not very efficient! And if there's any possibility that the zip archive could have multiple entries, you'd have to read through them all, stopping when you get to the one with the right name. But I hope this is a good illustration.
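For reference, here is a sketch of that entry iteration in Scala (the language used elsewhere on this page), reading in buffered blocks instead of one character at a time; the entry name is the one observed above:
import java.io.InputStreamReader
import java.net.URL
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream

val con = new URL("https://donnees.roulez-eco.fr/opendata/instantane").openConnection()
val zipStream = new ZipInputStream(con.getInputStream)
val wanted = "PrixCarburants_instantane.xml"

// Walk every entry and only decode the one with the expected name.
var entry = zipStream.getNextEntry
while (entry != null) {
  if (entry.getName == wanted) {
    // Read the current entry in 64 KB blocks using the ISO 8859-1 charset.
    val reader = new InputStreamReader(zipStream, StandardCharsets.ISO_8859_1)
    val buffer = new Array[Char](64 * 1024)
    val sb = new StringBuilder
    var n = reader.read(buffer)
    while (n != -1) {
      sb.appendAll(buffer, 0, n)
      n = reader.read(buffer)
    }
    println(s"Read ${sb.length} characters from $wanted")
  }
  entry = zipStream.getNextEntry
}
zipStream.close()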
We can write the content of a Dataset into a JSON file:
Dataset<...> dataset = ...
dataset.write().json("myFile");
Assuming the dataset is small enough, is there a way to write the content directly into a String, a Stream or any kind of OutputStream?
It is possible to write the dataset into a temporary folder and then read the data again:
Path tempDir = Files.createTempDirectory("tempfiles");
String tempFile = tempDir.toString() + "/json";
dataset.coalesce(1).write().json(tempFile);
Path jsonFile = Files.find(Paths.get(tempFile), 1, (path, basicFileAttributes) -> {
return Files.isRegularFile(path) && path.toString().endsWith("json");
}).findFirst().get();
BufferedReader reader = Files.newBufferedReader(jsonFile);
reader.lines().forEach(System.out::println);
But is there a better way to achieve the same result without using the indirection of an intermediate file?
You can transform your Dataset[A] into a Dataset[String] just by mapping over your data.
Your function would convert A to its JSON representation (as a String, for instance).
You can use Jackson to achieve this, since it is included in Spark's dependencies.
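For instance, here is a minimal sketch of that mapping, assuming a hypothetical Person case class as the element type (note that Spark's built-in dataset.toJSON does essentially the same thing and also returns a Dataset[String]):
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical element type, used only for illustration.
case class Person(name: String, age: Int)

object DatasetToJsonString extends App {
  val spark = SparkSession.builder().master("local[1]").appName("ds-to-json").getOrCreate()
  import spark.implicits._

  val ds: Dataset[Person] = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

  // Map each element to its JSON representation with Jackson. The mapper is
  // created inside the lambda so nothing non-serializable is captured.
  val jsonDs: Dataset[String] = ds.map { p =>
    val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
    mapper.writeValueAsString(p)
  }

  // Assuming the dataset is small, collect it into a single String on the driver.
  val json: String = jsonDs.collect().mkString("\n")
  println(json)

  spark.stop()
}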
I am new to Spark and DataFrames. I came across the below piece of code, provided by the Databricks library, to read a CSV from a specified path in the file system.
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.load("../Downlos/2017.csv")
Is there any API in the Databricks CSV library that parses a byte array from an HTTP request instead of reading from the file system?
The use case here is to read a multipart (CSV) file, uploaded via a Spring REST handler, using Spark DataFrames. I am looking for a DataFrame API that can take a file or byte array as input instead of reading from the file system.
From the parsed file, I need to select only those columns in each row that match a given condition (e.g. any column value that is not equal to the string "play") and save only those fields back to the database.
Can anyone suggest whether the above use case is feasible in Spark using RDDs/DataFrames? Any suggestions would be much appreciated.
You cannot load it directly; you have to convert the content to a String first, and then you can create an RDD from it.
Check this: URL contents to a String or file
val html = scala.io.Source.fromURL("https://spark.apache.org/").mkString
val list = html.split("\n").filter(_ != "")
val rdds = sc.parallelize(list)
val count = rdds.filter(_.contains("Spark")).count()
Scala fromURL API
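Building on that, here is a sketch of going from downloaded CSV text straight to a DataFrame without touching the file system; the URL and the column name are placeholders, and since Spark 2.2 spark.read.csv also accepts a Dataset[String] directly:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object CsvFromHttp extends App {
  val spark = SparkSession.builder().master("local[1]").appName("csv-from-http").getOrCreate()
  import spark.implicits._

  // Placeholder URL; an uploaded byte array turned into a String works the same way.
  val csvText: String = scala.io.Source.fromURL("http://example.com/2017.csv").mkString

  // Turn the text into a Dataset[String] and let the CSV reader parse it.
  val lines: Dataset[String] = csvText.split("\n").filter(_.nonEmpty).toSeq.toDS()
  val df: DataFrame = spark.read.option("header", "true").csv(lines)

  // Example condition with a hypothetical column name: keep rows where "action" is not "play".
  val filtered = df.filter(col("action") =!= "play")
  filtered.show()

  spark.stop()
}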
I want to create a custom NiFi processor which can read ESRI ASCII grid files and return a CSV-like representation with some metadata per file and geo-referenced user data in WKT format.
Unfortunately, the parsed result is not written back as an updated flow file.
https://github.com/geoHeil/geomesa-nifi/blob/rasterAsciiGridToWKT/geomesa-nifi-processors/src/main/scala/org/geomesa/nifi/geo/AsciiGrid2WKT.scala#L71-L107 is my try at making this happen in NiFi.
Unfortunately, only the original files are returned. The converted output is not persisted.
When trying to adapt it to manually serialize some CSV strings like:
val lineSep = System.getProperty("line.separator")
val csvResult = result.map(p => p.productIterator.map {
  case Some(value) => value
  case None => ""
  case rest => rest
}.mkString(";")).mkString(lineSep)

var output = session.write(flowFile, new OutputStreamCallback() {
  @throws[IOException]
  def process(outputStream: OutputStream): Unit = {
    IOUtils.write(csvResult, outputStream, "UTF-8")
  }
})
Still no flow files are written. Either the issue from above persists, or I get "Stream not closed" exceptions for the outputStream.
It must be some tiny piece that is missing, but I can't seem to find it.
Each session method that changes a flow file, like session.write(), returns a new version of the flow file, and you have to transfer this new version.
If you change your file in the converterIngester() function, you have to return this new version to the calling function so that it can be transferred to the relationship.
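A minimal sketch of that, with `success` standing in for whatever success relationship the processor actually declares:
import java.io.OutputStream
import java.nio.charset.StandardCharsets
import org.apache.commons.io.IOUtils
import org.apache.nifi.flowfile.FlowFile
import org.apache.nifi.processor.io.OutputStreamCallback
import org.apache.nifi.processor.{ProcessSession, Relationship}

// Keep the FlowFile returned by session.write() and transfer that new version,
// instead of leaving it dangling in a local variable.
def writeAndTransfer(session: ProcessSession, flowFile: FlowFile, csvResult: String, success: Relationship): Unit = {
  val updated: FlowFile = session.write(flowFile, new OutputStreamCallback {
    override def process(outputStream: OutputStream): Unit =
      IOUtils.write(csvResult, outputStream, StandardCharsets.UTF_8)
  })
  session.transfer(updated, success)
}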