I'm trying to process CQEngine's ResultSet using Scala's foreach, but iterating over it is very slow.
The following is a snippet of what I'm trying to do:
import collection.JavaConversions._
val query = existsIn(myOtherCollection, REFERENCE, REFERENCE)
val resultSet = myIndexCollection.retrieve(query)
resultSet.foreach(r =>{
//do something here
})
Somehow the .foreach method is very slow. I tried to debug by adding a SimonMonitor and replacing the .foreach with a while(resultSet.hasNext) loop; surprisingly, every call to the hasNext method takes about 1-2 seconds. That's very slow.
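For reference, this is roughly the hand-rolled loop I mean (a debugging sketch only; the iterator comes straight from the CQEngine ResultSet and the timing code is just to see where the time goes):
val it = resultSet.iterator()
var start = System.currentTimeMillis()
while (it.hasNext) { // each hasNext call here takes about 1-2 seconds
  val r = it.next()
  // do something here
  println("hasNext + next took " + (System.currentTimeMillis() - start) + " ms")
  start = System.currentTimeMillis()
}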
I tried to create the same version using Java, and the Java version is super fast.
Please help
I am not able to reproduce your problem with the below test code. Can you try it on your system and let me know how it runs?
(Uncomment the garages.addIndex(HashIndex.onAttribute(Garage.BRANDS_SERVICED)) line in the listing below to make BOTH the Scala and Java iterators run blazingly fast...)
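In other words, add a HashIndex on the attribute that the existsIn query probes (a sketch of the calls; both appear commented out in the listing below):

garages.addIndex(HashIndex.onAttribute(Garage.BRANDS_SERVICED))
// optionally also index the car attribute used in the join:
cars.addIndex(HashIndex.onAttribute(Car.NAME))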
The output first (per-result times, measured with System.currentTimeMillis and printed in whole seconds):
Done adding data
Done adding index
============== Scala ==============
Car{carId=4, name='BMW M3', description='2013 model', features=[radio, convertible]}
Time : 3 seconds
Car{carId=1, name='Ford Focus', description='great condition, low mileage', features=[spare tyre, sunroof]}
Time : 1 seconds
Car{carId=2, name='Ford Taurus', description='dirty and unreliable, flat tyre', features=[spare tyre, radio]}
Time : 2 seconds
============== Java ==============
Car{carId=4, name='BMW M3', description='2013 model', features=[radio, convertible]}
Time : 3 seconds
Car{carId=1, name='Ford Focus', description='great condition, low mileage', features=[spare tyre, sunroof]}
Time : 1 seconds
Car{carId=2, name='Ford Taurus', description='dirty and unreliable, flat tyre', features=[spare tyre, radio]}
Time : 2 seconds
Code below:
import collection.JavaConversions._
import com.googlecode.cqengine.query.QueryFactory._
import com.googlecode.cqengine.CQEngine;
import com.googlecode.cqengine.index.hash._;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.query.Query;
import java.util.Arrays.asList;
object CQTest {
def main(args: Array[String]) {
val cars: IndexedCollection[Car] = CQEngine.newInstance();
cars.add(new Car(1, "Ford Focus", "great condition, low mileage", asList("spare tyre", "sunroof")));
cars.add(new Car(2, "Ford Taurus", "dirty and unreliable, flat tyre", asList("spare tyre", "radio")));
cars.add(new Car(3, "Honda Civic", "has a flat tyre and high mileage", asList("radio")));
cars.add(new Car(4, "BMW M3", "2013 model", asList("radio", "convertible")));
// add cruft to try and slow down CQE
for (i <- 1 to 10000) {
cars.add(new Car(i, "BMW2014_" + i, "2014 model", asList("radio", "convertible")))
}
// Create an indexed collection of garages...
val garages: IndexedCollection[Garage] = CQEngine.newInstance();
garages.add(new Garage(1, "Joe's garage", "London", asList("Ford Focus", "Honda Civic")));
garages.add(new Garage(2, "Jane's garage", "Dublin", asList("BMW M3")));
garages.add(new Garage(3, "John's garage", "Dublin", asList("Ford Focus", "Ford Taurus")));
garages.add(new Garage(4, "Jill's garage", "Dublin", asList("Ford Focus")));
// add cruft to try and slow down CQE
for (i <- 1 to 10000) {
garages.add(new Garage(i, "Jill's garage", "Dublin", asList("DONT_MATCH_CARS_BMW2014_" + i)))
}
println("Done adding data")
// cars.addIndex(HashIndex.onAttribute(Car.NAME));
// garages.addIndex(HashIndex.onAttribute(Garage.BRANDS_SERVICED));
println("Done adding index")
val query = existsIn(garages, Car.NAME, Garage.BRANDS_SERVICED, equal(Garage.LOCATION, "Dublin"))
val resultSet = cars.retrieve(query)
var previous = System.currentTimeMillis()
println("============== Scala ============== ")
// Scala version
resultSet.foreach(r => {
println(r);
val t = (System.currentTimeMillis() - previous)
System.out.println("Time : " + t / 1000 + " seconds")
previous = System.currentTimeMillis()
})
println("============== Java ============== ")
previous = System.currentTimeMillis()
// Java version
val i: java.util.Iterator[Car] = resultSet.iterator()
while (i.hasNext) {
val r = i.next()
println(r);
val t = (System.currentTimeMillis() - previous)
System.out.println("Time : " + t / 1000 + " seconds")
previous = System.currentTimeMillis()
}
}
}
Related
I have a use case where I am writing to a Kafka topic in batches using a Spark job (no streaming). Initially I push, say, 10 records to the Kafka topic and run the Spark job, which does some processing and finally writes to another Kafka topic.
The next time, when I push another 5 records and run the Spark job, my requirement is to process only these 5 records, not to start from the beginning offset. I need to maintain the committed offset so that the Spark job starts processing from the next offset position.
Here is the code on the Kafka side to fetch the offsets:
private static List<TopicPartition> getPartitions(KafkaConsumer consumer, String topic) {
List<PartitionInfo> partitionInfoList = consumer.partitionsFor(topic);
return partitionInfoList.stream().map(x -> new TopicPartition(topic, x.partition())).collect(Collectors.toList());
}
public static void getOffSet(KafkaConsumer consumer) {
List<TopicPartition> topicPartitions = getPartitions(consumer, topic);
consumer.assign(topicPartitions);
consumer.seekToBeginning(topicPartitions);
topicPartitions.forEach(x -> {
System.out.println("Partition-> " + x + " startingOffSet-> " + consumer.position(x));
});
consumer.assign(topicPartitions);
consumer.seekToEnd(topicPartitions);
topicPartitions.forEach(x -> {
System.out.println("Partition-> " + x + " endingOffSet-> " + consumer.position(x));
});
topicPartitions.forEach(x -> {
consumer.poll(1000) ;
OffsetAndMetadata offsetAndMetadata = consumer.committed(x);
long position = consumer.position(x);
System.out.printf("Committed: %s, current position %s%n", offsetAndMetadata == null ? null : offsetAndMetadata
.offset(), position);
});
}
The code below is for Spark to load the messages from the topic, and it is not working as expected:
Dataset<Row> kafkaDataset = session.read().format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", topic)
.option("group.id", "test-consumer-group")
.option("startingOffsets","{\"Topic1\":{\"0\":2}}")
.option("endingOffsets", "{\"Topic1\":{\"0\":3}}")
.option("enable.auto.commit","true")
.load();
After the above code executes, I try to get the offsets again by calling
getOffSet(consumer)
but it always reads from offset 0, and the committed offset fetched initially keeps increasing. I am new to Kafka and still figuring out how to handle such a scenario. Please help here.
Initially I had 10 records in my topic; I published another 2 records, and here is the output.
Output after the getOffSet method executes:
Partition-> Topic00-0 startingOffSet-> 0
Partition-> Topic00-0 endingOffSet-> 12
Committed: 12, current position 12
Output after the Spark code executes for loading messages:
Partition-> Topic00-0 startingOffSet-> 0
Partition-> Topic00-0 endingOffSet-> 12
Committed: 12, current position 12
I see no difference. Please take a look and suggest a resolution for this scenario.
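A minimal sketch of how the flow described above could be wired up (my own illustration, not a verified fix; Topic1, partition 0, session, and consumer reuse the names from the question, and the job itself would still have to commit the new end offset, e.g. via consumer.commitSync, because the Spark Kafka batch source does not commit offsets):

import org.apache.kafka.common.TopicPartition

// look up the last committed offset for partition 0 of Topic1
val tp = new TopicPartition("Topic1", 0)
val committed = consumer.committed(tp) // null if nothing has been committed yet
val startOffset = if (committed == null) 0L else committed.offset()

// start reading from the committed offset and read up to the latest offset
val startingOffsetsJson = s"""{"Topic1":{"0":$startOffset}}"""
val kafkaDataset = session.read.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "Topic1")
  .option("startingOffsets", startingOffsetsJson)
  .option("endingOffsets", "latest")
  .load()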
I have an issue where I read a byte stream from a big file (~100MB), and after some integers I get the value 0 (but only with sbt run). When I hit the play button in IntelliJ I get the value I expect, which is > 0.
My guess was that the environment is somehow different. But I could not spot the difference.
// DemoApp.scala
import java.nio.{ByteBuffer, ByteOrder}
object DemoApp extends App {
val inputStream = getClass.getResourceAsStream("/HandRanks.dat")
val handRanks = new Array[Byte](inputStream.available)
inputStream.read(handRanks)
inputStream.close()
def evalCard(value: Int) = {
val offset = value * 4
println("value: " + value)
println("offset: " + offset)
ByteBuffer.wrap(handRanks, offset, handRanks.length - offset).order(ByteOrder.LITTLE_ENDIAN).getInt
}
val cards: List[Int] = List(51, 45, 14, 2, 12, 28, 46)
def eval(cards: List[Int]): Unit = {
var p = 53
cards.foreach(card => {
println("p = " + evalCard(p))
p = evalCard(p + card)
})
println("result p: " + p);
}
eval(cards)
}
The HandRanks.dat file can be found here (I put it inside a directory called resources):
https://github.com/Robert-Nickel/scala-texas-holdem/blob/master/src/main/resources/HandRanks.dat
build.sbt is:
name := "LoadInts"
version := "0.1"
scalaVersion := "2.13.4"
On my Windows machine I use sbt 1.4.6 with Oracle Java 11.
You will see that the evalCard call works four times, but on the fifth call the return value is 0. It should be higher than 0, which it is when using IntelliJ's play button.
You are not reading the whole content. This line
val handRanks = new Array[Byte](inputStream.available)
allocates only as many bytes as the stream reports as currently available (roughly its internal buffer), and then you read just that amount with
inputStream.read(handRanks)
(a single read call is also not guaranteed to fill the array). Depending on the defaults you will process a different amount of data, but it will never be the full 100MB. To get everything you would either have to read the data into some structure in a loop (a bad idea for a file this size) or process it in chunks (with iterators, streams, etc.):
import scala.annotation.tailrec
import scala.util.Using

// Using will close the resource whether an error happens or not
Using(getClass.getResourceAsStream("/HandRanks.dat")) { inputStream =>
  def readChunk(): Option[Array[Byte]] = {
    // can be done better, but that's not the point here
    val buffer = new Array[Byte](inputStream.available)
    val bytesRead = inputStream.read(buffer)
    if (bytesRead >= 0) Some(buffer.take(bytesRead))
    else None
  }

  @tailrec def process(): Unit = {
    readChunk() match {
      case Some(chunk) =>
        // do something with the chunk
        process()
      case None =>
        // nothing to do - EOF reached
    }
  }

  process()
}
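Since the question mentions Java 11, a shorter alternative (my sketch, not part of the fix above) that keeps the original all-in-memory approach is InputStream.readAllBytes, which keeps reading until EOF:

import scala.util.Using

val handRanks: Array[Byte] =
  Using.resource(getClass.getResourceAsStream("/HandRanks.dat")) { in =>
    in.readAllBytes() // available since Java 9; reads the entire stream
  }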
I'm running a Spark application (Spark 1.6.3 cluster), which does some calculations on 2 small data sets, and writes the result into an S3 Parquet file.
Here is my code:
public void doWork(JavaSparkContext sc, Date writeStartDate, Date writeEndDate, String[] extraArgs) throws Exception {
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
S3Client s3Client = new S3Client(ConfigTestingUtils.getBasicAWSCredentials());
boolean clearOutputBeforeSaving = false;
if (extraArgs != null && extraArgs.length > 0) {
if (extraArgs[0].equals("clearOutput")) {
clearOutputBeforeSaving = true;
} else {
logger.warn("Unknown param " + extraArgs[0]);
}
}
Date currRunDate = new Date(writeStartDate.getTime());
while (currRunDate.getTime() < writeEndDate.getTime()) {
try {
SparkReader<FirstData> sparkReader = new SparkReader<>(sc);
JavaRDD<FirstData> data1 = sparkReader.readDataPoints(
inputDir,
currRunDate,
getMinOfEndDateAndNextDay(currRunDate, writeEndDate));
// Normalize to 1 hours & 0.25 degrees
JavaRDD<FirstData> distinctData1 = data1.distinct();
// Floor all (distinct) values to 6 hour windows
JavaRDD<FirstData> basicData1BySixHours = distinctData1.map(d1 -> new FirstData(
d1.getId(),
TimeUtils.floorTimePerSixHourWindow(d1.getTimeStamp()),
d1.getLatitude(),
d1.getLongitude()));
// Convert Data1 to Dataframes
DataFrame data1DF = sqlContext.createDataFrame(basicData1BySixHours, FirstData.class);
data1DF.registerTempTable("data1");
// Read Data2 DataFrame
String currDateString = TimeUtils.getSimpleDailyStringFromDate(currRunDate);
String inputS3Path = basedirInput + "/dt=" + currDateString;
DataFrame data2DF = sqlContext.read().parquet(inputS3Path);
data2DF.registerTempTable("data2");
// Join data1 and data2
DataFrame mergedDataDF = sqlContext.sql("SELECT D1.Id,D2.beaufort,COUNT(1) AS hours " +
"FROM data1 as D1,data2 as D2 " +
"WHERE D1.latitude=D2.latitude AND D1.longitude=D2.longitude AND D1.timeStamp=D2.dataTimestamp " +
"GROUP BY D1.Id,D1.timeStamp,D1.longitude,D1.latitude,D2.beaufort");
// Create histogram per ID
JavaPairRDD<String, Iterable<Row>> mergedDataRows = mergedDataDF.toJavaRDD().groupBy(md -> md.getAs("Id"));
JavaRDD<MergedHistogram> mergedHistogram = mergedDataRows.map(new MergedHistogramCreator());
logger.info("Number of data1 results: " + data1DF.select("lId").distinct().count());
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
logger.info("Number of results with beaufort histograms: " + mergedDataDF.select("Id").distinct().count());
// Save to parquet
String outputS3Path = basedirOutput + "/dt=" + TimeUtils.getSimpleDailyStringFromDate(currRunDate);
if (clearOutputBeforeSaving) {
writeWithCleanup(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext, s3Client);
} else {
write(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext);
}
} finally {
TimeUtils.progressToNextDay(currRunDate);
}
}
}
public void write(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass, SQLContext sqlContext) {
// Apply a schema to an RDD of JavaBeans and save it as Parquet.
DataFrame fullDataDF = sqlContext.createDataFrame(outputRDD, outputClass);
fullDataDF.write().parquet(outputS3Path);
}
public void writeWithCleanup(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass,
SQLContext sqlContext, S3Client s3Client) {
String fileKey = S3Utils.getS3Key(outputS3Path);
String bucket = S3Utils.getS3Bucket(outputS3Path);
logger.info("Deleting existing dir: " + outputS3Path);
s3Client.deleteAll(bucket, fileKey);
write(outputS3Path, outputRDD, outputClass, sqlContext);
}
public Date getMinOfEndDateAndNextDay(Date startTime, Date proposedEndTime) {
long endOfDay = startTime.getTime() - startTime.getTime() % MILLIS_PER_DAY + MILLIS_PER_DAY ;
if (endOfDay < proposedEndTime.getTime()) {
return new Date(endOfDay);
}
return proposedEndTime;
}
The size of data1 is around 150,000 records and data2 is around 500,000.
What my code basically does is some data manipulation: it merges the 2 data sets, does a bit more manipulation, prints some statistics, and saves to Parquet.
The Spark cluster has 25GB of memory per server, and the code runs fine.
Each iteration takes about 2-3 minutes.
The problem starts when I run it on a large set of dates.
After a while, I get an OutOfMemoryError:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
at org.json4s.JsonDSL$JsonListAssoc.$tilde(JsonDSL.scala:98)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:139)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:72)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:164)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:38)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:87)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:71)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:70)
Last time it ran, it crashed after 233 iterations.
The line it crashed on was this:
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
Can anyone please tell me what could be the reason for these eventual crashes?
I'm not sure that everyone will find this solution viable, but upgrading the Spark cluster to 2.2.0 seems to have resolved the issue.
I have run my application for several days now and have had no crashes yet.
This error occurs when GC takes up over 98% of the process's total execution time. You can monitor the GC time in your Spark Web UI by going to the Stages tab at http://master:4040.
Try increasing the driver/executor memory (whichever is generating this error) using spark.{driver/executor}.memory via --conf when submitting the Spark application.
Another thing to try is to change the garbage collector that the JVM is using. Read this article for that: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html. It clearly explains why the GC overhead error occurs and which garbage collector is best for your application.
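Since the stack trace above comes from the driver-side event logging listener, the driver memory is the setting most likely to matter here. For example (illustrative values; the class and jar are placeholders for your application):

spark-submit \
  --conf spark.driver.memory=8g \
  --conf spark.executor.memory=8g \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  --class com.example.MyApp \
  my-app.jar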
I am new to Scala and Java altogether and am trying to run a sample producer. All it does is take some raw products and referrers stored in CSV files and use rnd to generate some random log lines. The following is my code:
object LogProducer extends App {
//WebLog config
val wlc = Settings.WebLogGen
val Products = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/products.csv")).getLines().toArray
val Referrers = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/referrers.csv")).getLines().toArray
val Visitors = (0 to wlc.visitors).map("Visitors-" + _)
val Pages = (0 to wlc.pages).map("Pages-" + _)
val rnd = new Random()
val filePath = wlc.filePath
val fw = new FileWriter(filePath, true)
//adding randomness to time increments for demo
val incrementTimeEvery = rnd.nextInt(wlc.records - 1) + 1
var timestamp = System.currentTimeMillis()
var adjustedTimestamp = timestamp
for (iteration <- 1 to wlc.records) {
adjustedTimestamp = adjustedTimestamp + ((System.currentTimeMillis() - timestamp) * wlc.timeMultiplier)
timestamp = System.currentTimeMillis()
val action = iteration % (rnd.nextInt(200) + 1) match {
case 0 => "purchase"
case 1 => "add_to_cart"
case _ => "page_view"
}
val referrer = Referrers(rnd.nextInt(Referrers.length - 1))
val prevPage = referrer match {
case "Internal" => Pages(rnd.nextInt(Pages.length - 1))
case _ => ""
}
val visitor = Visitors(rnd.nextInt(Visitors.length - 1))
val page = Pages(rnd.nextInt(Pages.length - 1))
val product = Products(rnd.nextInt(Products.length - 1))
val line = s"$adjustedTimestamp\t$referrer\t$action\t$prevPage\t$visitor\t$page\t$product\n"
fw.write(line)
if (iteration % incrementTimeEvery == 0) {
//os.flush()
println(s"Sent $iteration messages!")
val sleeping = rnd.nextInt(incrementTimeEvery * 60)
println(s"Sleeping for $sleeping ms")
}
}
}
It is pretty straightforward: it basically generates some values and appends them to a line.
However I am getting a big exception stack trace which I am not able to understand:
"C:\Program Files\Java\jdk1.8.0_92\bin\java...
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:70)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:59)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:50)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:310)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1417)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:302)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1417)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:289)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1417)
at clickstream.LogProducer$.delayedEndpoint$clickstream$LogProducer$1(logProducer.scala:16)
at clickstream.LogProducer$delayedInit$body.apply(logProducer.scala:12)
at scala.Function0.apply$mcV$sp(Function0.scala:34)
at scala.Function0.apply$mcV$sp$(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.App.main(App.scala:76)
at scala.App.main$(App.scala:74)
at clickstream.LogProducer$.main(logProducer.scala:12)
at clickstream.LogProducer.main(logProducer.scala)
Process finished with exit code 1
Can someone please help me identify what the exception means? Thanks all.
So it wasn't hard; it was my amateurish knowledge. It was a simple I/O issue where IntelliJ wasn't able to read the values from my CSV file. When I imported it into the resources root directory, it gave me a warning message about a wrong encoding.
The error was at this point:
val Products = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/products.csv")).getLines().toArray
Thanks for the efforts though.
It was an encoding issue. For Scala a quick fix would be to replace:
val Products=scala.io.Source.fromInputStream(getClass.getResourceAsStream("/products.csv")).getLines().toArray
val Referrers = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/referrers.csv")).getLines().toArray
with this:
val Products=scala.io.Source.fromInputStream(getClass.getResourceAsStream("/products.csv"))("UTF-8").getLines().toArray
val Referrers = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/referrers.csv"))("UTF-8").getLines().toArray
For Java and more details, please check out this link: http://biercoff.com/malformedinputexception-input-length-1-exception-solution-for-scala-and-java/
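If the file genuinely contains bytes that are not valid UTF-8 (for example, it was saved as Windows-1252), another option in the same spirit (my sketch, not from the answer above) is to supply an explicit Codec with a lenient error action so that decoding does not abort:

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

// replace undecodable bytes instead of throwing MalformedInputException
implicit val codec: Codec = Codec("UTF-8")
  .onMalformedInput(CodingErrorAction.REPLACE)
  .onUnmappableCharacter(CodingErrorAction.REPLACE)

val Products = Source.fromInputStream(getClass.getResourceAsStream("/products.csv")).getLines().toArray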
What is the Python analog of Time4J's code example:
// duration in seconds normalized to hours, minutes and seconds
Duration<?> dur = Duration.of(337540, ClockUnit.SECONDS).with(Duration.STD_CLOCK_PERIOD);
// custom duration format => hh:mm:ss
String s1 = Duration.Formatter.ofPattern("hh:mm:ss").format(dur);
System.out.println(s1); // output: 93:45:40
// localized duration format for french
String s2 = PrettyTime.of(Locale.FRANCE).print(dur, TextWidth.WIDE);
System.out.println(s2); // output: 93 heures, 45 minutes et 40 secondes
It is easy to get 93:45:40:
#!/usr/bin/env python3
from datetime import timedelta
dur = timedelta(seconds=337540)
print(dur) # -> 3 days, 21:45:40
fields = {}
fields['hours'], seconds = divmod(dur // timedelta(seconds=1), 3600)
fields['minutes'], fields['seconds'] = divmod(seconds, 60)
print("%(hours)02d:%(minutes)02d:%(seconds)02d" % fields) # -> 93:45:40
but how do I emulate the PrettyTime.of(Locale.FRANCE).print(dur, TextWidth.WIDE) Java code in Python (without hardcoding the units)?
The babel module allows getting close to the desired output:
from babel.dates import format_timedelta # $ pip install babel
print(", ".join(format_timedelta(timedelta(**{unit: fields[unit]}),
granularity=unit.rstrip('s'),
threshold=fields[unit] + 1,
locale='fr')
for unit in "hours minutes seconds".split()))
# -> 93 heures, 45 minutes, 40 secondes
It handles locale and plural forms automatically; e.g., for dur = timedelta(seconds=1) it produces:
0 heure, 0 minute, 1 seconde
Perhaps a better solution would be to translate the format string manually using standard tools such as gettext.
If you're using Kotlin: I just came across a similar problem with localized formatting of the Kotlin Duration type, and because I couldn't find a good solution, I wrote one myself. It is based on APIs available starting with Android 9 (for localized units), with a fallback to English units on lower Android versions so it can be used by apps targeting lower API levels.
Here's how it looks on the usage side (see the Kotlin Duration type to understand the first line):
val duration = 5.days.plus(3.hours).plus(2.minutes).plus(214.milliseconds)
DurationFormat().format(duration) // "5day 3hour 2min"
DurationFormat(Locale.GERMANY).format(duration) // "5T 3Std. 2Min."
DurationFormat(Locale.forLanguageTag("ar")).format(duration) // "٥يوم ٣ساعة ٢د"
DurationFormat().format(duration, smallestUnit = DurationFormat.Unit.HOUR) // "5day 3hour"
DurationFormat().format(15.minutes) // "15min"
DurationFormat().format(0.hours) // "0sec"
As you can see, you can pass a custom locale to the DurationFormat type; by default it uses Locale.getDefault(). Languages whose digit symbols differ from the Latin ones are also supported (via NumberFormat). You can also specify a custom smallestUnit; by default it is set to SECOND, so milliseconds will not be shown. Note that any unit with a value of 0 is skipped, and if the entire duration is 0, the smallest unit is shown with the value 0.
This is the full DurationFormat type; feel free to copy it (also available as a GitHub gist, incl. unit tests):
import android.icu.text.MeasureFormat
import android.icu.text.NumberFormat
import android.icu.util.MeasureUnit
import android.os.Build
import java.util.Locale
import kotlin.time.Duration
import kotlin.time.ExperimentalTime
import kotlin.time.days
import kotlin.time.hours
import kotlin.time.milliseconds
import kotlin.time.minutes
import kotlin.time.seconds
@ExperimentalTime
data class DurationFormat(val locale: Locale = Locale.getDefault()) {
enum class Unit {
DAY, HOUR, MINUTE, SECOND, MILLISECOND
}
fun format(duration: kotlin.time.Duration, smallestUnit: Unit = Unit.SECOND): String {
var formattedStringComponents = mutableListOf<String>()
var remainder = duration
for (unit in Unit.values()) {
val component = calculateComponent(unit, remainder)
remainder = when (unit) {
Unit.DAY -> remainder - component.days
Unit.HOUR -> remainder - component.hours
Unit.MINUTE -> remainder - component.minutes
Unit.SECOND -> remainder - component.seconds
Unit.MILLISECOND -> remainder - component.milliseconds
}
val unitDisplayName = unitDisplayName(unit)
if (component > 0) {
val formattedComponent = NumberFormat.getInstance(locale).format(component)
formattedStringComponents.add("$formattedComponent$unitDisplayName")
}
if (unit == smallestUnit) {
val formattedZero = NumberFormat.getInstance(locale).format(0)
if (formattedStringComponents.isEmpty()) formattedStringComponents.add("$formattedZero$unitDisplayName")
break
}
}
return formattedStringComponents.joinToString(" ")
}
private fun calculateComponent(unit: Unit, remainder: Duration) = when (unit) {
Unit.DAY -> remainder.inDays.toLong()
Unit.HOUR -> remainder.inHours.toLong()
Unit.MINUTE -> remainder.inMinutes.toLong()
Unit.SECOND -> remainder.inSeconds.toLong()
Unit.MILLISECOND -> remainder.inMilliseconds.toLong()
}
private fun unitDisplayName(unit: Unit) = if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P) {
val measureFormat = MeasureFormat.getInstance(locale, MeasureFormat.FormatWidth.NARROW)
when (unit) {
DurationFormat.Unit.DAY -> measureFormat.getUnitDisplayName(MeasureUnit.DAY)
DurationFormat.Unit.HOUR -> measureFormat.getUnitDisplayName(MeasureUnit.HOUR)
DurationFormat.Unit.MINUTE -> measureFormat.getUnitDisplayName(MeasureUnit.MINUTE)
DurationFormat.Unit.SECOND -> measureFormat.getUnitDisplayName(MeasureUnit.SECOND)
DurationFormat.Unit.MILLISECOND -> measureFormat.getUnitDisplayName(MeasureUnit.MILLISECOND)
}
} else {
when (unit) {
Unit.DAY -> "day"
Unit.HOUR -> "hour"
Unit.MINUTE -> "min"
Unit.SECOND -> "sec"
Unit.MILLISECOND -> "msec"
}
}
}
The humanize package may help. It has a French localization, or you can add your own. It supports Python 2.7 and 3.3.
Using the pendulum module:
>>> import pendulum
>>> it = pendulum.interval(seconds=337540)
>>> it.in_words(locale='fr_FR')
'3 jours 21 heures 45 minutes 40 secondes'