Scala vs Java Streaming: Scala prints nothing, Java works - java

I'm doing a comparison between Scala and Java Reactive Streams implementations, using akka-stream and RxJava respectively. My use case is a simplistic grep: given a directory, a file filter and a search text, I look in that directory for all matching files that contain the text. I then stream the (filename -> matching lines) pairs.
This works fine for Java, but for Scala nothing is printed. There's no exception, but no output either.
The data for the test is downloaded from the internet, but as you can see, the code can easily be tested with any local directory as well.
Scala:
object Transformer {
  implicit val system = ActorSystem("transformer")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext: ExecutionContext = {
    implicitly
  }

  import collection.JavaConverters._

  def run(path: String, text: String, fileFilter: String) = {
    Source.fromIterator { () =>
      Files.newDirectoryStream(Paths.get(path), fileFilter).iterator().asScala
    }.map(p => {
      val lines = io.Source.fromFile(p.toFile).getLines().filter(_.contains(text)).map(_.trim).to[ImmutableList]
      (p, lines)
    })
      .runWith(Sink.foreach(e => println(s"${e._1} -> ${e._2}")))
  }
}
Java:
public class Transformer {
    public static void run(String path, String text, String fileFilter) {
        Observable.from(files(path, fileFilter)).flatMap(p -> {
            try {
                return Observable.from((Iterable<Map.Entry<String, List<String>>>) Files.lines(p)
                        .filter(line -> line.contains(text))
                        .map(String::trim)
                        .collect(collectingAndThen(groupingBy(pp -> p.toAbsolutePath().toString()), Map::entrySet)));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }).toBlocking().forEach(e -> System.out.printf("%s -> %s.%n", e.getKey(), e.getValue()));
    }

    private static Iterable<Path> files(String path, String fileFilter) {
        try {
            return Files.newDirectoryStream(Paths.get(path), fileFilter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
Unit test using ScalaTest:
class TransformerSpec extends FlatSpec with Matchers {
  "Transformer" should "extract temperature" in {
    Transformer.run(NoaaClient.currentConditionsPath(), "temp_f", "*.xml")
  }

  "Java Transformer" should "extract temperature" in {
    JavaTransformer.run(JavaNoaaClient.currentConditionsPath(false), "temp_f", "*.xml")
  }
}

Dang, I forgot that runWith returns a Future, which means the flow ran asynchronously and the test returned before anything was printed. @MrWiggles' comment gave me a hint. The following Scala code produces an equivalent result to the Java version.
Note: The code in my question didn't close the DirectoryStream which, for directories with a large number of files, caused a java.io.IOException: Too many open files in system. The code below closes the resource properly.
def run(path: String, text: String, fileFilter: String) = {
  val files = Files.newDirectoryStream(Paths.get(path), fileFilter)
  val future = Source(files.asScala.toList).map(p => {
    val lines = io.Source.fromFile(p.toFile).getLines().filter(_.contains(text)).map(_.trim).to[ImmutableList]
    (p, lines)
  })
    .filter(!_._2.isEmpty)
    .runWith(Sink.foreach(e => println(s"${e._1} -> ${e._2}")))

  Await.result(future, 10.seconds)
  files.close()
  true // for testing
}

Related

How to combine a WebFlux WebClient DataBuffer download with more actions

I am trying to download a file (or multiple files) based on the result of a previous web request. After downloading the file I need to send the previous Mono result (dossier and obj) and the file to another system. So far I have been working with flatMaps and Monos. But when reading large files, I cannot use the Mono during the file download, as the buffer is too small.
Simplified, the code looks something like this:
var filePath = Paths.get("test.pdf");
this.dmsService.search()
    .flatMap(result -> {
        var dossier = result.getObjects().get(0).getProperties();
        var objectId = dossier.getReferencedObjectId();
        return Mono.zip(this.dmsService.getById(objectId), Mono.just(dossier));
    })
    .flatMap(tuple -> {
        var obj = tuple.getT1();
        var dossier = tuple.getT2();
        var media = this.dmsService.getDocument(objectId);
        var writeMono = DataBufferUtils.write(media, filePath);
        return Mono.zip(Mono.just(obj), Mono.just(dossier), writeMono);
    })
    .flatMap(tuple -> {
        var obj = tuple.getT1();
        var dossier = tuple.getT2();
        var objectId = dossier.getReferencedObjectId();
        var zip = zipService.createZip(objectId, obj, dossier);
        return zipService.uploadZip(Flux.just(zip));
    })
    .flatMap(newWorkItemId -> {
        return updateMetadata(newWorkItemId);
    })
    .subscribe(() -> {
        finishItem();
    });
dmsService.search(), this.dmsService.getById(objectId), and zipService.uploadZip() all return a Mono of a specific type.
dmsService.getDocument(objectId) returns a Flux due to the support for large files. With a DataBuffer Mono it worked for small files if I simply used Files.copy:
...
    var contentMono = this.dmsService.getDocument(objectId);
    return contentMono;
})
.flatMap(content -> {
    Files.copy(content.asInputStream(), Path.of("test.pdf"));
    ...
}
I have tried different approaches but always ran into problems.
Based on https://www.amitph.com/spring-webclient-large-file-download/#Downloading_a_Large_File_with_WebClient
DataBufferUtils.write(dataBuffer, destination).share().block();
When I try this, nothing after .block() is ever executed. No download is made.
Without the .share() I get an exception saying that I may not use block():
java.lang.IllegalStateException: block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-5
Since DataBufferUtils.write returns a Mono, my next assumption was that instead of calling block I could Mono.zip() it together with my other values, but this never returns either.
var media = this.dmsService.getDocument(objectId);
var writeMono = DataBufferUtils.write(media, filePath);
return Mono.zip(Mono.just(obj), Mono.just(dossier), writeMono);
Any inputs on how to achieve this are greatly appreciated.
I finally figured out that if I write to a WritableByteChannel, which returns a Flux<DataBuffer> instead of a Mono<Void>, I can map over the returned buffers to release them via DataBufferUtils, which seems to do the trick. I found the inspiration for this solution here: DataBuffer doesn't write to file
var media = this.dmsService.getDocument(objectId);
var file = Files.createTempFile(objectId, ".tmp");
WritableByteChannel filechannel = Files.newByteChannel(file, StandardOpenOption.WRITE);
var writeMono = DataBufferUtils.write(media, filechannel)
        .map(DataBufferUtils::release)
        .then(Mono.just(file));
return Mono.zip(Mono.just(obj), Mono.just(dossier), writeMono);
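For what it's worth, the earlier zip attempt never returned because DataBufferUtils.write(media, filePath) yields a Mono<Void>: it completes empty, and Mono.zip completes empty as soon as any of its sources does, so nothing downstream ever runs. The .then(Mono.just(file)) above turns completion into a value; below is a minimal sketch of the same idea with the Path-based write overload, untested against the service code here, so treat it as an assumption:
var media = this.dmsService.getDocument(objectId);
// The Path-based overload writes and releases the buffers itself, but completes empty (Mono<Void>),
// so thenReturn is used to hand Mono.zip a value once the download has finished.
var writeMono = DataBufferUtils.write(media, filePath)
        .thenReturn(filePath);
return Mono.zip(Mono.just(obj), Mono.just(dossier), writeMono);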

How to implement fault tolerant file upload with akka remote and stream

I'm an Akka beginner. (I am using Java)
I'm making a file transfer system using Akka.
Currently, I have completed sending a file from Actor1 (Local) to Actor2 (Remote).
Now I'm thinking about how to handle a failure during the transfer. My question is as follows.
If I lose my network connection while transferring a file, the transfer fails (at, say, 90 percent complete).
The network connection recovers a few minutes later.
Is it possible to transfer only the rest of the file data (the remaining 10%)?
If that's possible, please give me some advice.
Here is my simple code.
Thanks :)
Actor1 (Local)
private Behavior<Event> onTick() {
    ....
    String fileName = "test.zip";
    Source<ByteString, CompletionStage<IOResult>> logs = FileIO.fromPath(Paths.get(fileName));
    logs.runForeach(f -> originalSize += f.size(), mat).thenRun(() -> System.out.println("originalSize : " + originalSize));
    SourceRef<ByteString> logsRef = logs.runWith(StreamRefs.sourceRef(), mat);
    getContext().ask(
            Receiver.FileTransfered.class,
            selectedReceiver,
            timeout,
            responseRef -> new Receiver.TransferFile(logsRef, responseRef, fileName),
            (response, failure) -> {
                if (response != null) {
                    return new TransferCompleted(fileName, response.transferedSize);
                } else {
                    return new JobFailed("Processing timed out", fileName);
                }
            }
    );
}
Actor2 (Remote)
public static Behavior<Command> create() {
    return Behaviors.setup(context -> {
        ...
        Materializer mat = Materializer.createMaterializer(context);
        return Behaviors.receive(Command.class)
                .onMessage(TransferFile.class, command -> {
                    command.sourceRef.getSource().runWith(FileIO.toPath(Paths.get("test.zip")), mat);
                    command.replyTo.tell(new FileTransfered("filename", 1024));
                    return Behaviors.same();
                }).build();
    });
}
You need to think about the following for a proper implementation of file transfer with fault tolerance:
1. How to identify that a transfer has to be resumed for a given file.
2. How to find the point from which to resume the transfer.
The following implementation makes very simple assumptions about 1 and 2:
The file name is unique and thus can be used for such identification. Strictly speaking, this is not true; for example, you can transfer files with the same name from different folders, or from different nodes, etc. You will have to adjust this based on your use case.
It is assumed that the last/all writes on the receiver side wrote all bytes correctly, and that the total number of bytes already written indicates the point from which to resume the transfer. If this cannot be guaranteed, you need to logically split the original file into chunks and transfer the hash of each chunk, along with its size and position, to the receiver, which then has to validate the chunks on its side and find the correct pointer for resuming the transfer (a rough sketch of this idea follows below).
(That's a bit more than 2 :) ) This implementation ignores identification of transfer problems and focuses on 1 and 2 instead.
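For that chunk-validation idea, here is a rough sketch in plain Java (JDK 9+ for readNBytes; not part of the implementation below, and the names are illustrative): the receiver hashes what it already has, chunk by chunk, against the digests supplied by the sender, and resumes at the first mismatch or short read.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

class ChunkValidation {
    // Returns the byte offset from which the sender should resume the transfer.
    static long resumeOffset(Path received, List<byte[]> senderChunkHashes, int chunkSize) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[chunkSize];
        long offset = 0;
        try (InputStream in = Files.newInputStream(received)) {
            for (byte[] expected : senderChunkHashes) {
                int read = in.readNBytes(buf, 0, chunkSize);
                if (read < chunkSize) {
                    return offset; // incomplete chunk on disk: resume from its start
                }
                byte[] actual = md.digest(Arrays.copyOf(buf, read));
                if (!MessageDigest.isEqual(actual, expected)) {
                    return offset; // corrupted chunk: resume from its start
                }
                offset += read;
            }
        }
        return offset; // everything on disk validated; resume after it
    }
}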
The code:
object Sender {
  sealed trait Command
  case class Upload(file: String) extends Command
  case class StartWithIndex(file: String, index: Long) extends Sender.Command

  def behavior(receiver: ActorRef[Receiver.Command]): Behavior[Sender.Command] = Behaviors.setup[Sender.Command] { ctx =>
    implicit val materializer: Materializer = SystemMaterializer(ctx.system).materializer

    Behaviors.receiveMessage {
      case Upload(file) =>
        receiver.tell(Receiver.InitUpload(file, ctx.self.narrow[StartWithIndex]))
        ctx.log.info(s"Initiating upload of $file")
        Behaviors.same

      case StartWithIndex(file, startWith) =>
        val source = FileIO.fromPath(Paths.get(file), chunkSize = 8192, startPosition = startWith)
        val ref = source.runWith(StreamRefs.sourceRef())
        ctx.log.info(s"Starting upload of $file")
        receiver.tell(Receiver.Upload(file, ref))
        Behaviors.same
    }
  }
}
object Receiver {
  sealed trait Command
  case class InitUpload(file: String, replyTo: ActorRef[Sender.StartWithIndex]) extends Command
  case class Upload(file: String, fileSource: SourceRef[ByteString]) extends Command

  val behavior: Behavior[Receiver.Command] = Behaviors.setup[Receiver.Command] { ctx =>
    implicit val materializer: Materializer = SystemMaterializer(ctx.system).materializer

    Behaviors.receiveMessage {
      case InitUpload(path, replyTo) =>
        val file = fileAtDestination(path)
        val index = if (file.exists()) file.length else 0
        ctx.log.info(s"Got init command for $file at pointer $index")
        replyTo.tell(Sender.StartWithIndex(path, index.toLong))
        Behaviors.same

      case Upload(path, fileSource) =>
        val file = fileAtDestination(path)
        val sink = if (file.exists()) {
          FileIO.toPath(file.toPath, Set(StandardOpenOption.APPEND, StandardOpenOption.WRITE))
        } else {
          FileIO.toPath(file.toPath, Set(StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE))
        }
        ctx.log.info(s"Saving file into ${file.toPath}")
        fileSource.runWith(sink)
        Behaviors.same
    }
  }
}
Some auxiliary methods
val destination: File = Files.createTempDirectory("destination").toFile

def fileAtDestination(file: String) = {
  val name = new File(file).getName
  new File(destination, name)
}

def writeRandomToFile(file: File, size: Int): Unit = {
  val out = new FileOutputStream(file, true)
  (0 until size).foreach { _ =>
    out.write(Random.nextPrintableChar())
  }
  out.close()
}
And finally some test code
// sender and receiver bootstrapping is omitted
//Create some dummy file to upload
val file: Path = Files.createTempFile("test", "test")
writeRandomToFile(file.toFile, 1000)
//Initiate a new upload
sender.tell(Sender.Upload(file.toAbsolutePath.toString))
// Sleep to allow file upload to finish
Thread.sleep(1000)
//Write more data to the file to emulate a failure
writeRandomToFile(file.toFile, 1000)
//Initiate a new upload that will "recover" from the previous upload
sender.tell(Sender.Upload(file.toAbsolutePath.toString))
Finally, the whole process can be summarized as: the sender initiates an upload, the receiver replies with the offset it already has on disk (the length of the existing file), and the sender streams the file from that offset, which the receiver appends to the existing file.
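Since the question mentions using Java, here is a hedged Java sketch of just the two resume-relevant pieces of the Scala code above, assuming Akka Streams' javadsl FileIO.fromPath overload that takes a start position; the names and the chunk size are illustrative:
import java.io.File;
import java.nio.file.Path;
import java.util.concurrent.CompletionStage;
import akka.stream.IOResult;
import akka.stream.javadsl.FileIO;
import akka.stream.javadsl.Source;
import akka.util.ByteString;

class ResumeSupport {
    // Receiver side: the resume offset is simply how many bytes are already on disk.
    static long resumeOffset(File alreadyReceived) {
        return alreadyReceived.exists() ? alreadyReceived.length() : 0L;
    }

    // Sender side: build a source that starts reading at that offset, so only the missing
    // tail of the file is streamed over the wire (8192 is an arbitrary chunk size).
    static Source<ByteString, CompletionStage<IOResult>> resumedSource(Path file, long startPosition) {
        return FileIO.fromPath(file, 8192, startPosition);
    }
}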

StackOverflowError while using distinct in Apache Spark

I use Spark 2.0.1.
I am trying to find distinct values in a JavaRDD as below
JavaRDD<String> distinct_installedApp_Ids = filteredInstalledApp_Ids.distinct();
I see that this line is throwing the below exception
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:226)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
..........
The same stacktrace is repeated again and again.
The input filteredInstalledApp_Ids is large, with millions of records. Could the issue be the number of records, or is there a more efficient way to find distinct values in a JavaRDD? Any help would be much appreciated. Thanks in advance. Cheers.
Edit 1:
Adding the filter method
JavaRDD<String> filteredInstalledApp_Ids = installedApp_Ids
        .filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String v1) throws Exception {
                return v1 != null;
            }
        }).cache();
Edit 2:
Added the method used to generate installedApp_Ids
public JavaRDD<String> getIdsWithInstalledApps(String inputPath, JavaSparkContext sc,
        JavaRDD<String> installedApp_Ids) {
    JavaRDD<String> appIdsRDD = sc.textFile(inputPath);
    try {
        JavaRDD<String> appIdsRDD1 = appIdsRDD.map(new Function<String, String>() {
            @Override
            public String call(String t) throws Exception {
                String delimiter = "\t";
                String[] id_Type = t.split(delimiter);
                StringBuilder temp = new StringBuilder(id_Type[1]);
                if ((temp.indexOf("\"")) != -1) {
                    String escaped = temp.toString().replace("\\", "");
                    escaped = escaped.replace("\"{", "{");
                    escaped = escaped.replace("}\"", "}");
                    temp = new StringBuilder(escaped);
                }
                // To remove empty character in the beginning of a string
                JSONObject wholeventObj = new JSONObject(temp.toString());
                JSONObject eventJsonObj = wholeventObj.getJSONObject("eventData");
                int appType = eventJsonObj.getInt("appType");
                if (appType == 1) {
                    try {
                        return (String.valueOf(appType));
                    } catch (JSONException e) {
                        return null;
                    }
                }
                return null;
            }
        }).cache();
        if (installedApp_Ids != null)
            return sc.union(installedApp_Ids, appIdsRDD1);
        else
            return appIdsRDD1;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
I assume the main dataset is in inputPath. It appears to be a delimited text file (the code splits each line on tabs) with JSON-encoded values.
I think you could make your code a bit simpler by using a combination of Spark SQL's DataFrames and the from_json function. I'm using Scala and leave converting the code to Java as a home exercise :)
The lines where you load the inputPath text file and the line parsing itself can be as simple as the following:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
val dataset = spark.read.csv(inputPath)
You can display the content using show operator.
dataset.show(truncate = false)
You should see the JSON-encoded lines.
It appears that the JSON lines contain eventData and appType fields.
val jsons = dataset.withColumn("asJson", from_json(...))
See functions object for reference.
With JSON lines, you can select the fields of your interest:
val apptypes = jsons.select("eventData.appType")
And then union it with installedApp_Ids.
I'm sure the code gets easier to read (and hopefully to write too). The migration will give you extra optimizations that you may or may not be able to write yourself using the assembler-like RDD API.
And the best part is that filtering out nulls is as simple as using the na operator, which gives you DataFrameNaFunctions like drop. I'm sure you'll like them.
This does not necessarily answer your initial question, but the java.lang.StackOverflowError might go away just by doing the code migration, and the code gets easier to maintain, too.
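For reference, here is a rough Java sketch of the same idea. It is hedged: it assumes the JSON sits in the second tab-separated column, guesses the schema from the original map function, and relies on from_json, which is available from Spark 2.1 onwards:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

class InstalledApps {
    static Dataset<Row> distinctAppTypes(SparkSession spark, String inputPath) {
        // _c0, _c1, ... are Spark's default column names for header-less delimited input.
        Dataset<Row> dataset = spark.read().option("sep", "\t").csv(inputPath);

        // Assumed schema: an eventData struct containing an integer appType field.
        StructType schema = new StructType()
                .add("eventData", new StructType().add("appType", DataTypes.IntegerType));

        return dataset
                .withColumn("asJson", from_json(col("_c1"), schema))
                .select(col("asJson.eventData.appType"))
                .na().drop()      // drop rows where the JSON could not be parsed
                .distinct();
    }
}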

How to read a file using groovy and store its contents as variables?

I'm looking for a Groovy-specific way to read a file and store its contents as different variables. Example of my properties file:
#Local credentials:
postgresql.url = xxxx.xxxx.xxxx
postgresql.username = xxxxxxx
postgresql.password = xxxxxxx
console.url = xxxxx.xxxx.xxx
At the moment I'm using this Java code to read the file and use the variables:
Properties prop = new Properties();
InputStream input = null;
try {
    input = new FileInputStream("config.properties");
    prop.load(input);
    this.postgresqlUser = prop.getProperty("postgresql.username");
    this.postgresqlPass = prop.getProperty("postgresql.password");
    this.postgresqlUrl = prop.getProperty("postgresql.url");
    this.consoleUrl = prop.getProperty("console.url");
} catch (IOException ex) {
    ex.printStackTrace();
} finally {
    if (input != null) {
        try {
            input.close();
        } catch (IOException e) {
        }
    }
}
My colleague recommended using a Groovy way to deal with this and mentioned streams, but I can't seem to find much information on how to store the data in separate variables. What I know so far is that def text = new FileInputStream("config.properties").getText("UTF-8") can read the whole file and store it in one variable, but not in separate ones. Any help would be appreciated.
If you're willing to make your property file keys and class properties abide by a naming convention, then you can apply the property file values quite easily. Here's an example:
def config = '''
#Local credentials:
postgresql.url = xxxx.xxxx.xxxx
postgresql.username = xxxxxxx
postgresql.password = xxxxxxx
console.url = xxxxx.xxxx.xxx
'''

def props = new Properties().with {
    load(new StringBufferInputStream(config))
    delegate
}

class Foo {
    def postgresqlUsername
    def postgresqlPassword
    def postgresqlUrl
    def consoleUrl

    Foo(Properties props) {
        props.each { key, value ->
            def propertyName = key.replaceAll(/\../) { it[1].toUpperCase() }
            setProperty(propertyName, value)
        }
    }
}

def a = new Foo(props)

assert a.postgresqlUsername == 'xxxxxxx'
assert a.postgresqlPassword == 'xxxxxxx'
assert a.postgresqlUrl == 'xxxx.xxxx.xxxx'
assert a.consoleUrl == 'xxxxx.xxxx.xxx'
In this example, the property keys are converted by dropping the '.' and capitalizing the following letter, so postgresql.url becomes postgresqlUrl. Then it's just a matter of iterating through the keys and calling setProperty() to apply each value.
Take a look at the ConfigSlurper:
http://mrhaki.blogspot.de/2009/10/groovy-goodness-using-configslurper.html

Java Globbing Pattern to Match Directory and File

I'm using a recursive function to traverse files under a root directory. I only want to extract *.txt files, but I don't want to exclude directories. Right now my code looks like this:
val stream = Files.newDirectoryStream(head, "*.txt")
But by doing this, it will not match any directories, and the iterator() that gets returned is empty. I'm using a Mac, so the noise file that I don't want to include is .DS_STORE. How can I make newDirectoryStream return directories as well as files that match *.txt? Is there a way?
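For reference, newDirectoryStream also has an overload that takes a DirectoryStream.Filter, which can accept directories as well as *.txt files in a single pass. A minimal sketch in Java (the path is illustrative):
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TxtAndDirs {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("/path/to/root"); // illustrative path
        // Accept an entry if it is a directory (so it can be traversed later) or a *.txt file.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(root,
                entry -> Files.isDirectory(entry) || entry.getFileName().toString().endsWith(".txt"))) {
            for (Path entry : stream) {
                System.out.println(entry);
            }
        }
    }
}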
You really should use FileVisitor; it makes the code as simple as this:
import java.nio.file.attribute.BasicFileAttributes
import java.nio.file._
import scala.collection.mutable.ArrayBuffer

val files = ArrayBuffer.empty[Path]
val root = Paths.get("/path/to/your/directory")

Files.walkFileTree(root, new SimpleFileVisitor[Path] {
  override def visitFile(file: Path, attrs: BasicFileAttributes) = {
    if (file.getFileName.toString.endsWith(".txt")) {
      files += file
    }
    FileVisitResult.CONTINUE
  }
})

files.foreach(println)
Not sure if NIO is a requirement. If not, this is fairly simple and seems to do the job. And it has no mutable collections :)
import java.io.File

def collectFiles(dir: File) = {
  def collectFilesHelper(dir: File, soFar: List[String]): List[String] = {
    dir.listFiles.foldLeft(soFar) { (acc: List[String], f: File) =>
      if (f.isDirectory)
        collectFilesHelper(f, acc)
      else if (f.getName().endsWith(".txt"))
        f.getCanonicalPath() :: acc
      else acc
    }
  }
  collectFilesHelper(dir, List[String]())
}
Well, I didn't actually use FileVisitor, though it would be nice to. I used recursion and keep two lists: one is the raw file list used to walk into directories; the other stores the actual *.txt files:
@tailrec
def recursiveTraverse(filePaths: ListBuffer[Path], resultFiles: ListBuffer[Path]): ListBuffer[Path] = {
  if (filePaths.isEmpty) resultFiles
  else {
    val head = filePaths.head
    val tail = filePaths.tail
    if (Files.isDirectory(head)) {
      val stream: Try[DirectoryStream[Path]] = Try(Files.newDirectoryStream(head))
      stream match {
        case Success(st) =>
          val iterator = st.iterator()
          while (iterator.hasNext) {
            tail += iterator.next()
          }
        case Failure(ex) => println(s"The file path is incorrect: ${ex.getMessage}")
      }
      stream.map(ds => ds.close())
      recursiveTraverse(tail, resultFiles)
    } else {
      if (head.toString.contains(".txt")) {
        recursiveTraverse(tail, resultFiles += head)
      } else {
        recursiveTraverse(tail, resultFiles)
      }
    }
  }
}
However, this is not the best solution, just the easiest for my specific problem. Please show a much shorter FileVisitor version if you want to :)
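Not a FileVisitor, but for a shorter traversal in plain Java NIO, Files.walk does the whole walk lazily; a hedged sketch (the path is illustrative):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CollectTxtFiles {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("/path/to/root"); // illustrative path
        // Files.walk streams every entry under root; close it via try-with-resources.
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> txtFiles = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .collect(Collectors.toList());
            txtFiles.forEach(System.out::println);
        }
    }
}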
