During an interview, the tech lead said my Scala code was just like Java code, only written with the Scala API, and he wanted me to improve on that.
I am a 3-year Java developer and I began Scala coding by following the MOOC on Coursera.
Can anyone tell me what the problem is and how I can improve, please?
I got the job because of my Java knowledge, but the job is based on Scala, and the coding style is one thing to fix during the trial period.
object Extraction {

  // IntelliJ uses .idea/modules as the current working directory
  val FilePathPre = "../../src/main/resources/"
  val UserIdFile = "lookup_user.csv"
  val ProductIdFile = "lookup_product.csv"
  val RatingFile = "agg_ratings.csv"

  def readFile(file: String): Iterator[((String, String), String, String)] = {
    val Splitter = ","
    Source.fromInputStream(this.getClass.getResourceAsStream(file)).getLines()
      .map(_.split(Splitter))
      .filter(_.size >= 4) // in case the line is not valid
      .map(x => ((x(0), x(1)), x(2), x(3))) // (userId, itemId), rating, time
  }

  def filePrinter(fileName: String, lines: mutable.Map[String, Int]) = {
    val file = new File(fileName)
    val bw = new BufferedWriter(new FileWriter(file))
    lines.toArray.sortWith((a, b) => a._2 < b._2)
      .map(x => x._1 + "," + x._2.toString + "\n")
      .foreach(bw.write)
    bw.close()
  }

  def aggFilePrinter(fileName: String, lines: mutable.Map[(Int, Int), Float]) = {
    val file = new File(fileName)
    val bw = new BufferedWriter(new FileWriter(file))
    lines.foreach(x => {
      val line = x._1._1.toString + "," + x._1._2.toString + "," + (math.round(x._2 * 100.0) / 100.0).toFloat + "\n"
      bw.write(line)
    })
    bw.close()
  }

  /**
   * A multiplicative penalty of 0.95 is applied to the rating
   * for each day of difference from the maximal timestamp of input.csv.
   *
   * @param nowTime maximal timestamp in input.csv
   * @param pastTime current rating time
   * @param rating original rating
   * @return final rating multiplied by 0.95 for every day of interval from the maximal timestamp
   */
  def finalRating(nowTime: String, pastTime: String, rating: String): Float = {
    val now =
      LocalDateTime.ofInstant(Instant.ofEpochMilli(nowTime.toLong), ZoneId.systemDefault())
    val past =
      LocalDateTime.ofInstant(Instant.ofEpochMilli(pastTime.toLong), ZoneId.systemDefault())
    val diff = ChronoUnit.DAYS.between(past, now)
    (math.pow(0.95, diff) * rating.toFloat).toFloat
  }

  /**
   * @param file file to extract
   */
  def fileDispatcher(file: String) = {

    /**
     * get idIndice, or increment idIndice and put it into the id map
     * @param id id as String
     * @param idIndice id as Int
     * @param idMap userIdMap or productIdMap
     * @return (indice for id, max idIndice)
     */
    def getIndice(id: String, idIndice: Int, idMap: mutable.Map[String, Int]): (Int, Int) = {
      idMap.get(id) match {
        case Some(i) => (i, idIndice)
        case None => {
          val indice = idIndice + 1
          idMap += (id -> indice)
          (indice, indice)
        }
      }
    }

    // 1. scan the file to find the max time
    val maxTime = readFile(file).reduce((a, b) => if (a._3 > b._3) a else b)._3
    // 2. apply the rating condition, calculate the rating, and keep only valid rating lines
    val validLines = readFile(file).map(x => (x._1, finalRating(maxTime.toString, x._3, x._2))).filter(_._2 > 0.01)
    // 3. loop over the lines, sum ratings by (userId, productId), and map String ids to Int ids
    val userIdMap = mutable.Map[String, Int]() // (userId, userIdAsInt)
    val productIdMap = mutable.Map[String, Int]() // (productId, productIdAsInt)
    val userProductRatingMap = mutable.Map[(Int, Int), Float]() // ((userIdAsInt, productIdAsInt), ratingSum)
    var userIdIndice = -1
    var productIdIndice = -1
    validLines.foreach(x => {
      val userIdString = x._1._1
      val userId = getIndice(userIdString, userIdIndice, userIdMap)
      userIdIndice = userId._2
      val productIdString = x._1._2
      val productId = getIndice(productIdString, productIdIndice, productIdMap)
      productIdIndice = productId._2
      val key = (userId._1, productId._1)
      userProductRatingMap.get(key) match {
        case Some(i) => userProductRatingMap += (key -> (i + x._2))
        case None => userProductRatingMap += (key -> x._2)
      }
    })
    filePrinter(FilePathPre + UserIdFile, userIdMap)
    filePrinter(FilePathPre + ProductIdFile, productIdMap)
    aggFilePrinter(FilePathPre + RatingFile, userProductRatingMap)
  }
}
Apart from the Java-ish code you also have code style issues; I suggest reading https://docs.scala-lang.org/style/ for a start (it is not the ultimate guide, but it is fine to begin with). Avoid using ._1 on tuples; use match { case (a, b, c) => ... } instead.
The main issue is that you use mutable structures. In Scala every structure is immutable by default, and it should stay that way unless you have a strong reason to make it mutable. This comes from functional programming, which tries to avoid mutability and side effects; you can google the topic for more.
So remove mutable. from your code and replace foreach with e.g. foldLeft to get a newly created immutable Map on each iteration instead of modifying an existing one, as sketched below.
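To make that concrete, here is a minimal sketch of your aggregation step rewritten with foldLeft over immutable maps. The Indexed helper is a hypothetical stand-in for your getIndice plus the two var counters, and it assumes the lines are already parsed into ((userId, productId), rating) pairs as in your validLines:

// a sketch under the assumptions above, not a drop-in replacement
case class Indexed(ids: Map[String, Int] = Map.empty, next: Int = 0) {
  // return the index for id, adding it when unseen - no mutation, a new Indexed is returned
  def add(id: String): (Int, Indexed) = ids.get(id) match {
    case Some(i) => (i, this)
    case None    => (next, Indexed(ids + (id -> next), next + 1))
  }
}

def aggregate(lines: Iterator[((String, String), Float)]) =
  lines.foldLeft((Indexed(), Indexed(), Map.empty[(Int, Int), Float])) {
    case ((users, products, sums), ((userId, productId), rating)) =>
      val (u, users2)    = users.add(userId)
      val (p, products2) = products.add(productId)
      val key = (u, p)
      (users2, products2, sums + (key -> (sums.getOrElse(key, 0f) + rating)))
  }

Note how destructuring in the case clause also removes every ._1/._2 access from the original loop.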
Leveraging the best of SnakeYAML & Jackson in Scala, I am using the following method to parse YAML files. This method supports the use of anchors in YAML.
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import java.io.{File, FileInputStream}
import org.yaml.snakeyaml.{DumperOptions, LoaderOptions, Yaml}

/**
 * YAML Parser using SnakeYAML & Jackson Implementation
 *
 * @param yamlFilePath Path to the YAML file that has to be parsed
 * @return JsonNode of the YAML file
 */
def parseYaml(yamlFilePath: String): JsonNode = {
  // Parsing the YAML file with SnakeYAML - since the Jackson parser does not support anchors and references
  val ios = new FileInputStream(new File(yamlFilePath))
  val loaderOptions = new LoaderOptions
  loaderOptions.setAllowDuplicateKeys(false)
  val yaml = new Yaml(loaderOptions)
  val mapper = new ObjectMapper().registerModules(DefaultScalaModule)
  val yamlObj = yaml.loadAs(ios, classOf[Any])
  // Converting the YAML to Jackson JSON - since it has more flexibility for traversing through nodes
  val jsonString = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(yamlObj)
  val jsonObj = mapper.readTree(jsonString)
  println(jsonString)
  jsonObj
}
However, this currently does not support the interpolation of environment variables within the YAML file. For example, suppose we get the following environment variables when we run:
>>> println(System.getenv())
{PATH=/usr/bin:/bin:/usr/sbin:/sbin, XPC_FLAGS=0x0, SHELL=/bin/bash}
The question is how to achieve environment variable interpolation in the YAML file. Let's say we have the following YAML file:
path_value: ${PATH}
xpc: ${XPC_FLAGS}
shell_path: ${SHELL}
Then after parsing the YAML should be:
{
"path_value": "/usr/bin:/bin:/usr/sbin:/sbin",
"xpc": "0x0",
"shell_path": "/bin/bash"
}
Thanks in advance for your time & effort!
Thanks to the comments & guidance from the community, here is my solution for the parser with a custom constructor and representer:
import java.io.{File, FileInputStream}
import scala.util.matching.Regex
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.yaml.snakeyaml.{DumperOptions, LoaderOptions, Yaml}
import org.yaml.snakeyaml.constructor.{AbstractConstruct, Constructor}
import org.yaml.snakeyaml.error.MissingEnvironmentVariableException
import org.yaml.snakeyaml.nodes.Node
import org.yaml.snakeyaml.nodes.ScalarNode
import org.yaml.snakeyaml.nodes.Tag
import org.yaml.snakeyaml.representer.Representer

/**
 * Copyright (c) 2008, http://www.snakeyaml.org
 * Class dedicated to SnakeYAML support for environment variables
 */

/**
 * Construct scalars of the format ${VARIABLE}, replacing the template with the value from the environment.
 *
 * @see Variable substitution
 */
class EnvScalarConstructor() extends Constructor {

  val ENV_TAG = new Tag("!ENV")
  this.yamlConstructors.put(ENV_TAG, new ConstructEnv)

  val ENV_regex: Regex = "\\$\\{\\s*((?<name>\\w+)((?<separator>:?(-|\\?))(?<value>\\w+)?)?)\\s*\\}".r

  private class ConstructEnv extends AbstractConstruct {
    override def construct(node: Node) = {
      val matchValue = constructScalar(node.asInstanceOf[ScalarNode])
      // Replace every ${...} template in the scalar with its resolved value,
      // reading the named groups of each match instead of raw group indices
      ENV_regex.replaceAllIn(matchValue, m => {
        val name = m.group("name")
        val separator = m.group("separator")
        val value = m.group("value")
        Regex.quoteReplacement(
          apply(name, separator, if (value != null) value else "", getEnv(name))
        )
      })
    }
  }

  /**
   * Implement the logic for missing and unset variables
   *
   * @param name variable name in the template
   * @param separator separator in the template, can be :-, -, :?, ?
   * @param value default value or the error in the template
   * @param environment the value from the environment for the provided variable
   * @return the value to apply in the template
   */
  def apply(name: String, separator: String, value: String, environment: String): String = {
    if (environment != null && !environment.isEmpty) return environment
    // variable is either unset or empty
    if (separator != null) { // there is a default value or an error
      if (separator == "?") {
        if (environment == null) throw new MissingEnvironmentVariableException("Missing mandatory variable " + name + ": " + value)
      }
      if (separator == ":?") {
        if (environment == null) throw new MissingEnvironmentVariableException("Missing mandatory variable " + name + ": " + value)
        if (environment.isEmpty) throw new MissingEnvironmentVariableException("Empty mandatory variable " + name + ": " + value)
      }
      if (separator.startsWith(":")) {
        if (environment == null || environment.isEmpty) return value
      } else if (environment == null) return value
    }
    ""
  }

  /**
   * Get the value of the environment variable
   *
   * @param key the name of the variable
   * @return the value from the environment, from the system properties, or a marker string if not set
   */
  def getEnv(key: String) = sys.env.getOrElse(key, System.getProperty(key, s"UNKNOWN_ENV_VAR-$key"))
}
The above constructor can be used in the YAML parser as follows:
/**
 * Function that will be used to load the YAML file
 * @param yamlFilePath String with the YAML path to read
 * @return FasterXML JsonNode
 */
def parseYaml(yamlFilePath: String): JsonNode = {
  val ios = new FileInputStream(new File(yamlFilePath))
  // Parsing the YAML file with SnakeYAML - since the Jackson parser does not support anchors and references
  val loaderOptions = new LoaderOptions
  loaderOptions.setAllowDuplicateKeys(false)
  val envConstructor = new EnvScalarConstructor
  val yaml = new Yaml(
    envConstructor,
    new Representer,
    new DumperOptions,
    loaderOptions
  )
  // Route plain ${VAR} scalars to the !ENV constructor; without this implicit
  // resolver only scalars explicitly tagged !ENV would be substituted
  yaml.addImplicitResolver(envConstructor.ENV_TAG, envConstructor.ENV_regex.pattern, "$")
  val mapper = new ObjectMapper().registerModules(DefaultScalaModule)
  val yamlObj = yaml.loadAs(ios, classOf[Any])
  // Converting the YAML to Jackson JSON - since it has more flexibility for traversing through nodes
  val jsonString = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(yamlObj)
  val jsonObj = mapper.readTree(jsonString)
  println(jsonString)
  jsonObj
}
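For completeness, a quick usage sketch under the assumptions above ("config.yaml" and the key name are made up); with the implicit resolver in place, the ${...} templates from the question come back substituted:

// hypothetical file containing e.g.:  path_value: ${PATH}
val node: JsonNode = parseYaml("config.yaml")
println(node.get("path_value")) // e.g. "/usr/bin:/bin:/usr/sbin:/sbin"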
I have read the documentation but cannot find the default value of spark.sql.columnNameOfCorruptRecord, even with a Google search.
The second question: how does PERMISSIVE mode work when spark.sql.columnNameOfCorruptRecord is empty or null?
According to the code (19/01/2021) it's _corrupt_record:
val COLUMN_NAME_OF_CORRUPT_RECORD = buildConf("spark.sql.columnNameOfCorruptRecord")
  .doc("The name of internal column for storing raw/un-parsed JSON and CSV records that fail " +
    "to parse.")
  .version("1.2.0")
  .stringConf
  .createWithDefault("_corrupt_record")
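If you want to double-check the value at runtime, it can be read back from the conf; a small sketch, assuming a live SparkSession:

// returns the built-in default when the conf was never set explicitly
val spark = org.apache.spark.sql.SparkSession.builder().master("local[*]").getOrCreate()
println(spark.conf.get("spark.sql.columnNameOfCorruptRecord")) // _corrupt_record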
Regarding how PERMISSIVE mode works, you can see this in FailSafeParser[T]:
def parse(input: IN): Iterator[InternalRow] = {
  try {
    rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
  } catch {
    case e: BadRecordException => mode match {
      case PermissiveMode =>
        Iterator(toResultRow(e.partialResult(), e.record))
      case DropMalformedMode =>
        Iterator.empty
      case FailFastMode =>
        throw new SparkException("Malformed records are detected in record parsing. " +
          s"Parse Mode: ${FailFastMode.name}. To process malformed records as null " +
          "result, try setting the option 'mode' as 'PERMISSIVE'.", e)
    }
  }
}

private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = {
  if (corruptFieldIndex.isDefined) {
    (row, badRecord) => {
      var i = 0
      while (i < actualSchema.length) {
        val from = actualSchema(i)
        resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull
        i += 1
      }
      resultRow(corruptFieldIndex.get) = badRecord()
      resultRow
    }
  } else {
    (row, _) => row.getOrElse(nullResult)
  }
}
If it isn't specified, it'll fall back to the default value defined in the configuration.
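As a quick illustration of both defaults in action, here is a hedged example (the path and schema are made up) that reads a malformed CSV in PERMISSIVE mode and keeps the raw text of bad lines in _corrupt_record:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// the extra string column picks up raw unparseable lines; its name
// matches the spark.sql.columnNameOfCorruptRecord default
val schema = new StructType()
  .add("id", IntegerType)
  .add("amount", DoubleType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE") // the default mode
  .csv("/tmp/input.csv")        // hypothetical path

df.show(false) // bad rows have null fields and the raw line in _corrupt_record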
I want to close the CSV writer when all the async blocks have executed.
import java.io.{FileReader, FileWriter}
import com.opencsv.{CSVReader, CSVWriter}
import org.jsoup.helper.StringUtil
import scala.async.Async.{async, await}
import scala.concurrent.ExecutionContext.Implicits.global

var rows = 0
reader.forEach(line => {
  async {
    val csv = new CSV(line(0), line(1), line(2), line(3), line(4))
    entries(0) = csv.id
    entries(1) = csv.name
    val di = async(httpRequest(csv.id))
    var di2 = "not available"
    val di2Future = async(httpRequest(csv.name))
    di2 = await(di2Future)
    entries(2) = await(di)
    entries(3) = di2
    writer.writeNext(entries)
    println(s"${csv.id} completed!!!! ")
    rows += 1
  }
})
writer.close()
In the above code the writer is always closed first, so I want to execute all the async blocks and then close the CSV writer.
Below is a skeleton solution.
val allResponses = reader.map(l => {
  val entries = ??? // declare the entries data structure here
  for {
    line <- async(l)
    // line <- Future.successful(l)
    csv <- async {
      val csv = CSV(line(0), line(1), line(2), line(3), line(4))
      entries(0) = csv.id
      entries(1) = csv.name
      csv
    }
    di <- async {
      httpRequest(csv.id)
    }
    di2 <- async {
      httpRequest(csv.name)
    }
    e <- async {
      entries(2) = di
      entries(3) = di2
      entries
    }
  } yield e
})
val t = Future.sequence(allResponses)
t.map(a => {
  // a CSVWriter wraps the FileWriter, since writeNext is an opencsv method
  val writer = new CSVWriter(new FileWriter("file.txt"))
  a.foreach(i => {
    writer.writeNext(i)
  })
  writer.close()
})
Hope this helps.
An async block produces a Future[A], where A in your case is Unit (which is the type of the assignment rows += 1).
In general, you can perform operations when a Future is complete like the following:
import scala.util.{Failure, Success}

def myFuture: Future[Something] = ??? // your async process

myFuture.onComplete {
  case Success(result) =>
    ???
  case Failure(exception) =>
    ???
}
If you want to perform something regardless of the status, you can skip the pattern matching:
myFuture.onComplete(_ => writer.close()) // e.g.
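Putting the two answers together for the original question: collect each Future the loop produces, combine them with Future.sequence, and close the writer only when the combined Future completes. A minimal sketch, assuming the asker's reader, writer and httpRequest are in scope:

import scala.collection.mutable.ListBuffer
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.async.Async.async

val futures = ListBuffer.empty[Future[Unit]]
reader.forEach(line => {
  futures += async {
    // ... build entries and call writer.writeNext(entries) as in the question ...
  }
})

// close the writer only after every async block has finished
Future.sequence(futures.toList).onComplete(_ => writer.close())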
I have saved a vector in the session and I want to use a random value from the vector, but I don't know how to extract the value from the session.
Errors:
'httpRequest-6' failed to execute: Vector(437420, 443940, 443932,
437437, 443981, 443956, 443973, 443915, 437445) named 'termIds' does
not support .random function
And
In the 2nd scenario, the vector is passed into the GET request like this:
http://someurl/api/thr/Vector(435854)/terms/Vector(437420, 443940, 443932, 437437, 443981, 443956, 443973, 443915, 437445)
instead of
http://someurl/api/thr/435854/terms/443973
Here is my script:
class getTerm extends Simulation {

  val repeatCount = Integer.getInteger("repeatCount", 1).toInt
  val userCount = Integer.getInteger("userCount", 1).toInt
  val turl = System.getProperty("turl", "some url")

  val httpProtocol = http
    .baseURL("http://" + turl)

  val headers_10 = Map("Content-Type" -> """application/json""")

  var thrIds = ""
  var termIds = ""

  // Scenario - 1
  val getTerms = scenario("Scn 1")
    .exec(http("list_of_term")
      .get("/api/abc")
      .headers(headers_10)
      .check(jsonPath("$[*].id")
        .findAll.saveAs("thrIds"))
    )
    .exec(http("get_all_terms")
      .get("""/api/thr/${thrIds.random()}/terms""")
      .headers(headers_10)
      .check(jsonPath("$[*].id")
        .findAll.saveAs("termIds"))
    )
    .exec(session => {
      thrIds = session("thrIds").as[Long].toString
      termIds = session("termIds").as[Long].toString
      println("***************************************")
      println("Session ====>>>> " + session)
      println("Ths ID ====>>>> " + thrIds)
      println("Term ID ====>>>> " + termIds)
      println("***************************************")
      session
    })

  // Scenario - 2
  // Want to extract the vectors here and pass their values into the GET call
  val getKnownTerms = scenario("Get Known Term")
    .exec(_.set("thrIds", thrIds))
    .exec(_.set("termIds", termIds))
    .repeat(repeatCount) {
      exec(http("get_k_term")
        .get("""/api/thr/${thrIds}/terms/${termIds.random()}""")
        .headers(headers_10))
    }

  val scn = List(getTerms.inject(atOnceUsers(1)), getKnownTerms.inject(nothingFor(20 seconds), atOnceUsers(userCount)))

  setUp(scn).protocols(httpProtocol)
}
Here is the solution which may help others.
class getTerm extends Simulation {

  val repeatCount = Integer.getInteger("repeatCount", 1).toInt
  val userCount = Integer.getInteger("userCount", 1).toInt
  val turl = System.getProperty("turl", "some url")

  val httpProtocol = http
    .baseURL("http://" + turl)

  val headers_10 = Map("Content-Type" -> """application/json""")

  // Change - 1
  var thrIds: Seq[String] = _
  var termIds: Seq[String] = _

  // Scenario - 1
  val getTerms = scenario("Scn 1")
    .exec(http("list_of_term")
      .get("/api/abc")
      .headers(headers_10)
      .check(jsonPath("$[*].id")
        .findAll
        .transform { v => thrIds = v; v }
        .saveAs("thrIds"))
    )
    .exec(http("get_all_trms")
      .get("""/api/thr/${thrIds.random()}/terms""")
      .headers(headers_10)
      .check(jsonPath("$[*].id")
        .findAll
        .transform { v => termIds = v; v }
        .saveAs("termIds"))
    )

  // Scenario - 2
  val getKnownTerms = scenario("Get Known Term")
    .exec(_.set("thrIds", thrIds))
    .exec(_.set("termIds", termIds))
    .repeat(repeatCount) {
      exec(http("get_k_term")
        .get("""/api/thr/${thrIds.random()}/terms/${termIds.random()}""")
        .headers(headers_10))
    }

  val scn = List(getTerms.inject(atOnceUsers(1)), getKnownTerms.inject(nothingFor(20 seconds), atOnceUsers(userCount)))

  setUp(scn).protocols(httpProtocol)
}
I need to create a HashMap of directory-to-file in Scala while listing all files in the directory. How can I achieve this?
val directoryToFile = awsClient.listFiles(uploadPath).collect {
  case path if !path.endsWith("/") => {
    path match {
      // do some regex matching to get directory & file names
      case regex(dir, date) => {
        // NEED TO CREATE A HASH MAP OF dir -> date. How???
      }
      case _ => None
    }
  }
}
The method listFiles(path: String) returns a Seq[String] of the absolute paths of all files in the path passed as an argument to the function.
Try to write more idiomatic Scala. Something like this:
val directoryToFile = (for {
  path <- awsClient.listFiles(uploadPath)
  if !path.endsWith("/")
  regex(dir, date) <- regex.findFirstIn(path)
} yield dir -> date).sortBy(_._2).toMap
You can filter and then foldLeft:
val l = List("""/opt/file1.txt""", """/opt/file2.txt""")

// the regex groups are Strings, so the accumulator is a Map[String, String]
val finalMap = l
  .filter(!_.endsWith("/"))
  .foldLeft(Map.empty[String, String])((map, s) =>
    s match {
      case regex(dir, date) => map + (dir -> date)
      case _ => map
    }
  )
You can try something like this:
val regex = """(\d)-(\d)""".r
val paths = List("1-2", "3-4", "555")

for {
  // Hint to Scala to produce a specific type
  _ <- Map("" -> "")
  // Not sure why your !path.endsWith("/") is not part of the regex
  path @ regex(a, b) <- paths
  if path.startsWith("1")
} yield (a, b)
//> scala.collection.immutable.Map[String,String] = Map(1 -> 2)
Slightly more complicated if you need max:
val regex = """(\d)-(\d)""".r
val paths = List("1-2", "3-4", "555", "1-3")

for {
  (_, ps) <- (for {
      path @ regex(a, b) <- paths
      if path.startsWith("1")
    } yield (a, b)).groupBy(_._1)
} yield ps.maxBy(_._2)
//> scala.collection.immutable.Map[String,String] = Map(1 -> 3)