We can write the contents of a dataset into a JSON file:
DataSet<...> dataset = ...
dataset.write().json("myFile");
Assuming the dataset is small enough, is there a way to write the content directly into a String, a Stream or any kind of OutputStream?
It is possible to write the dataset into a temporary folder and then read the data again:
Path tempDir = Files.createTempDirectory("tempfiles");
String tempFile = tempDir.toString() + "/json";
dataset.coalesce(1).write().json(tempFile);
Path jsonFile = Files.find(Paths.get(tempFile), 1, (path, basicFileAttributes) -> {
    return Files.isRegularFile(path) && path.toString().endsWith("json");
}).findFirst().get();
BufferedReader reader = Files.newBufferedReader(jsonFile);
reader.lines().forEach(System.out::println);
But is there a better way to achieve the same result without using the indirection of an intermediate file?
You can transform your Dataset[A] into a Dataset[String] simply by mapping your data.
Your mapping function would convert each A to its JSON representation (as a String, for instance).
You can use Jackson to achieve this, since it is already included among Spark's dependencies.
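For example, a minimal sketch of that mapping in Scala, assuming spark.implicits._ is in scope and the dataset is small enough to collect to the driver (Spark's built-in Dataset.toJSON performs the same per-row conversion):
import org.apache.spark.sql.Dataset
import com.fasterxml.jackson.databind.ObjectMapper

// Map every A to its JSON representation; the mapper is created per partition
// so it never has to be serialized from the driver.
val jsonLines: Dataset[String] = dataset.mapPartitions { rows =>
  val mapper = new ObjectMapper() // register jackson-module-scala's DefaultScalaModule if A is a Scala case class
  rows.map(row => mapper.writeValueAsString(row))
}

// Collecting is only safe because the dataset is assumed to fit in driver memory.
val json: String = jsonLines.collect().mkString("\n")

// Built-in shortcut producing the same kind of Dataset[String]:
val json2: String = dataset.toJSON.collect().mkString("\n")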
I am reading parts of a large file via a Java FileInputStream and would like to stream its content back to the client (in the form of an akka HttpResponse). I am wondering if this is possible, and how I would do this?
From my research, EntityStreamingSupport can be used, but it only supports JSON or CSV data. I will be streaming raw data from the file, which will not be in the form of JSON or CSV.
Assuming you use akka-http and Scala, you may use getFromFile to stream the entire binary file from a path into the HttpResponse like this:
path("download") {
get {
entity(as[FileHandle]) { fileHandle: FileHandle =>
println(s"Server received download request for: ${fileHandle.fileName}")
getFromFile(new File(fileHandle.absolutePath), MediaTypes.`application/octet-stream`)
}
}
}
Taken from this file upload/download roundtrip akka-http example:
https://github.com/pbernet/akka_streams_tutorial/blob/f246bc061a8f5a1ed9f79cce3f4c52c3c9e1b57a/src/main/scala/akkahttp/HttpFileEcho.scala#L52
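The FileHandle entity unmarshalled above is defined in that linked example; a rough sketch of its shape (the exact field names are an assumption here, and spray-json support is needed for entity(as[FileHandle])):
import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
import spray.json.DefaultJsonProtocol._
import spray.json.RootJsonFormat

// Assumed shape of the request entity; adjust to the fields used in the linked example.
case class FileHandle(fileName: String, absolutePath: String, length: Long = 0L)
implicit val fileHandleFormat: RootJsonFormat[FileHandle] = jsonFormat3(FileHandle)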
Streaming the entire file eliminates the need for "manual chunking", so the example above runs with a limited heap size.
However, if needed, manual chunking could be done like this:
val fileInputStream = new FileInputStream(fileHandle.absolutePath)
val chunked: Source[ByteString, Future[IOResult]] = akka.stream.scaladsl.StreamConverters
.fromInputStream(() => fileInputStream, chunkSize = 10 * 1024)
chunked.map(each => println(each)).runWith(Sink.ignore)
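To return those chunks to the client instead of printing them, the same Source can back the response entity directly; a minimal sketch (the octet-stream content type mirrors the getFromFile call above):
import akka.http.scaladsl.model.{ContentTypes, HttpEntity, HttpResponse}

// akka-http consumes the Source with backpressure, so the file is never fully buffered in memory.
complete(HttpResponse(entity = HttpEntity(ContentTypes.`application/octet-stream`, chunked)))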
I'm passing an array of arrays to a Java method and I need to add that data to a new file (which will be loaded into an S3 bucket).
How do I do this? I haven't been able to find an example of this.
Also, I'm sure "object" is not the correct data type for this attribute; Array doesn't seem to be the correct one either.
Java method -
public void uploadStreamToS3Bucket(String[][] locations) {
    try {
        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withRegion(String.valueOf(awsRegion))
                .build();
        String fileName = connectionRequestRepository.findStream() + ".json";
        String bucketName = "downloadable-cases";
        File locationData = new File(?????); // Convert locations attribute to a file and load it to putObject
        s3Client.putObject(new PutObjectRequest(bucketName, fileName, locationData));
    } catch (AmazonServiceException ex) {
        System.out.println("Error: " + ex.getMessage());
    }
}
You're trying to use PutObjectRequest(String, String, File),
but you don't have a file. So you can either:
Write your object to a file and then pass that file,
or
Use the PutObjectRequest(String, String, InputStream, ObjectMetadata) version instead.
The latter is better, since it saves you the intermediate step.
As for how to write an object to a stream, check this: How can I convert an Object to Inputstream.
Bear in mind that to read it back you have to use the same format.
It is also worth thinking about the format in which you want to save your information: it might need to be read by another program, or by a human directly from the bucket, and some formats/serializers are easier to read (JSON, for instance) while others are more efficient (a serializer that takes less space).
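A minimal sketch of the stream-based upload, reusing the s3Client, bucketName, fileName and locations from the method above and assuming Jackson for the JSON serialization (shown in Scala to match the other snippets; the SDK calls are the same from Java):
import java.io.ByteArrayInputStream
import com.amazonaws.services.s3.model.{ObjectMetadata, PutObjectRequest}
import com.fasterxml.jackson.databind.ObjectMapper

// Serialize the String[][] to JSON bytes in memory; no intermediate file is needed.
val bytes: Array[Byte] = new ObjectMapper().writeValueAsBytes(locations)

val metadata = new ObjectMetadata()
metadata.setContentLength(bytes.length.toLong) // lets the SDK avoid buffering the stream itself
metadata.setContentType("application/json")

s3Client.putObject(new PutObjectRequest(bucketName, fileName, new ByteArrayInputStream(bytes), metadata))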
As for the type, an array of arrays uses the [][] syntax. For instance, an array of arrays of Strings would be:
String[][] arrayOfStringArrays;
I hope this helps.
I am designing a Spark job in order to:
Parse a binary file that comes inside a .tar.gz file
Create a Dataframe with POJOs extracted from the byte array
Store them in parquet
For the parsing of the binary file, I am using some legacy Java code that reads fixed-length fields from the byte array. This code works when I execute the code as part of a regular JVM process in my laptop.
However, when I upload the same file to HDFS and try to read it from Spark, the fixed-length reading of the fields fails as I never get the fields that the Java code expects.
Standalone code used successfully:
// This is a local path in my laptop
val is = new GZIPInputStream(new FileInputStream(basepath + fileName))
val reader = new E4GTraceFileReader(is,fileName)
// Here I invoke the legacy Java code
// The result here is correct
val result = reader.readTraces()
Spark Job:
val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
val hdfsFiles = spark.sparkContext.parallelize(hdfs.listStatus(new Path("SOME_PATH")).map(_.getPath))
// Create Input Stream from each file in the folder
val inputStreamsRDD = hdfsFiles.map(x => {
  val hdfs = FileSystem.get(new URI("hdfs://HDFS_IP_PORT/"), new Configuration())
  (hdfs.open(x).getWrappedStream, x)
})
// Read the InputStream into a byte[]
val tracesRDD = inputStreamsRDD.flatMap(x => readTraceRecord(x._1,x._2)).map(flattenPOJO)
private def readTraceRecord(is: InputStream, fileName: Path): List[E4GEventPacket] = {
  println(s"Starting to read ${fileName.getName}")
  val reader = new E4GTraceFileReader(is, fileName.getName)
  reader.readTraces().asScala.toList
}
I have tried both the FSDataInputStream returned by hdfs.open and hdfs.open(x).getWrappedStream, but I don't get the expected result.
I don't know if I should paste the legacy Java code here as it is a bit lengthy, but it clearly fails to get the expected fields.
Do you think that the problem here is the serialization done in Spark from the driver program to the executors, which causes the data to be somehow corrupted?
I have tried using both YARN as well as local[1] but I get the same results.
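Not the original approach, but for comparison, a hedged sketch using Spark's built-in binaryFiles, which hands each executor a PortableDataStream that it opens locally (wrapped in GZIPInputStream as in the standalone snippet); the path is a placeholder:
import java.util.zip.GZIPInputStream
import scala.collection.JavaConverters._

// Spark lists and opens the files itself, so nothing stream-related
// crosses the driver/executor boundary.
val tracesRDD = spark.sparkContext
  .binaryFiles("hdfs://HDFS_IP_PORT/SOME_PATH")
  .flatMap { case (path, pds) =>
    val is = new GZIPInputStream(pds.open()) // same wrapping as in the local code
    try new E4GTraceFileReader(is, path.split("/").last).readTraces().asScala.toList
    finally is.close()
  }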
I am new to Spark and Dataframes. I came across the below piece of code provided by the databricks library to read CSV from a specified path in the file system.
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.load("../Downlos/2017.csv")
Is there any API in the Databricks CSV package that parses a byte array from an HTTP request instead of reading from a file system?
The use case here is to read a multipart (CSV) file uploaded via a Spring REST handler into a Spark DataFrame. I'm looking for a DataFrame API that can take a file/byte array as input instead of reading from the file system.
From the parsed file, I need to select only those columns in each row that match a given condition (e.g. any column value that is not equal to the string "play") and save only those fields back to the database.
Can anyone suggest whether the above use case is feasible in Spark using RDDs/DataFrames? Any suggestions would be of much help.
You cannot create an RDD from the raw content directly; you have to convert it to a String first, then you can create the RDD.
Check this: URL contents to a String or file
val html = scala.io.Source.fromURL("https://spark.apache.org/").mkString
val list = html.split("\n").filter(_ != "")
val rdds = sc.parallelize(list)
val count = rdds.filter(_.contains("Spark")).count()
Scala fromURL API
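Applied to the original use case, the same idea works for bytes received over HTTP; a sketch assuming Spark 2.2+ (whose CSV reader accepts a Dataset[String]), a SparkSession named spark, and a hypothetical csvBytes array from the Spring handler:
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.Dataset
import spark.implicits._

// csvBytes: Array[Byte] received from the multipart upload (assumed name)
val lines: Dataset[String] = new String(csvBytes, StandardCharsets.UTF_8)
  .split("\n").toSeq.toDS()

// Since Spark 2.2 the CSV reader can parse an in-memory Dataset[String] directly.
val df = spark.read.option("header", "true").csv(lines)

// Keep only rows where a given column is not equal to "play" (column name is a placeholder).
val filtered = df.filter($"someColumn" =!= "play")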
I have a functionality that writes a dataset to a location in HDFS. I noticed that when the dataset is the result of a union or join, more than one part file is created in HDFS. The contents of the location are as follows:
_SUCCESS
part-00000-7908d1be-5409-4ac4-a218-29b9b1f99449
part-00001-7908d1be-5409-4ac4-a218-29b9b1f99449
If "dataset" is the dataset to be stored in hdfs at location "locale" with dataframeWriter csv options (header , sep, qualifier ) are stored in "options" (header being true always), as follows:-
DataFrameWriter dataFrameWriter = dataset.write();
if (options != null && !options.isEmpty()) {
    dataFrameWriter = dataFrameWriter.options(options);
}
dataFrameWriter.mode(saveMode).csv(locale);
This is then merged into a single part file using FileUtil's copyMerge:
if (fileUtil.copyMerge(fileSystem, sourcePath, fileSystem, destinationPath, true, hdfsConfig, null)) {
    fileStatus = fileSystem.getFileStatus(destinationPath);
}
copyMerge() takes all the contents of the source location, merges them, and writes them to the provided destination path.
The problem I am facing is that the header is written in each part file, so after copyMerge the header is repeated more than once:
Header1,Header2,Header3
x,y,z
a,b,c
Header1,Header2,Header3
r,t,y
h,y,d
I tried solving this by setting the "header" option to false and writing the header as a string into the same location using an OutputStream, then calling copyMerge as described above.
String separator = StringUtils.isNotEmpty(options.get(SEPARATOR)) ? options.get(SEPARATOR) : ",";
String header = String.join(separator, dataset.columns());
header = header + '\n';
InputStream stream = new ByteArrayInputStream(header.getBytes(StandardCharsets.UTF_8));
writeFile(location + "/header", stream); // writeFile is another function that writes the stream into the location
The problem with this approach is that I have to manipulate the header for every CSV option such as "escapeQuotes", "quoteAll", etc., just as I am already doing for the separator.
Is there any way to handle this header issue without having to manipulate the header separately for each option?