Spark: how to write an efficient SQL query to achieve this goal - java

I have a JSON file whose structure is [{"time","currentStop","lat","lon","speed"}]; here is an example:
[
{"time":"2015-06-09 23:59:59","currentStop":"xx","lat":"22.264856","lon":"113.520450","speed":"25.30"},
{"time":"2015-06-09 21:00:49","currentStop":"yy","lat":"22.263","lon":"113.52","speed":"34.5"},
{"time":"2015-06-09 21:55:49","currentStop":"zz","lat":"21.3","lon":"113.521","speed":"13.7"}
]
And I want to get a JSON result with the structure [{"hour","value":["currentStop","lat","lon","speed"]}], where each hour holds the distinct ("currentStop","lat","lon","speed") tuples for that hour. Here is the result for the example (some empty values skipped):
[
{"hour":0,"value":[]},
{"hour":1,"value":[]},
......
{"hour":21,"value":[{"currentStop":"yy","lat":"22.263","lon":"113.52","speed":"34.5"},{"currentStop":"zz","lat":"21.3","lon":"113.521","speed":"13.7"}]}
{"hour":23, "value": [{"currentStop":"xx","lat":22.264856,"lon":113.520450,"speed":25.30}]},
]
Is it possible to achieve this with a Spark SQL query?
I use Spark with the Java API; with a loop I can get what I want, but this approach is really inefficient and costly.
Here is my code:
Dataset<Row> bus_ic = spark.read().json(file);
bus_ic.createOrReplaceTempView("view");
StringBuilder text = new StringBuilder("[");
bus_ic.select(bus_ic.col("currentStop"),
        bus_ic.col("lon").cast("double"), bus_ic.col("speed").cast("double"),
        bus_ic.col("lat").cast("double"), bus_ic.col("LINEID"),
        bus_ic.col("time").cast("timestamp"))
      .createOrReplaceTempView("view");
StringBuilder sqlString = new StringBuilder();
// One query per hour, each collected back to the driver
for (int i = 0; i < 24; i++) {
    sqlString.delete(0, sqlString.length());
    sqlString.append("select currentStop, speed, lat, lon from view where hour(time) = ")
             .append(i)
             .append(" group by currentStop, speed, lat, lon");
    Dataset<Row> t = spark.sql(sqlString.toString());
    text.append("{")
        .append("\"hour\":").append(i)
        .append(",\"value\":")
        .append(t.toJSON().collectAsList().toString())
        .append("}");
    if (i != 23) text.append(",");
}
text.append("]");
There must be a better way to solve this problem. How can I write an efficient SQL query to achieve this goal?

You can write your code in a much more concise way (Scala code):
val bus_comb = bus_ic
  .groupBy(hour(to_timestamp(col("time"))).as("hour"))
  .agg(collect_set(struct(
    col("currentStop"), col("lat"), col("lon"), col("speed")
  )).alias("value"))

bus_comb.toJSON.show(false)
// +--------------------------------------------------------------------------------------------------------------------------------------------------------+
// |value |
// +--------------------------------------------------------------------------------------------------------------------------------------------------------+
// |{"hour":23,"value":[{"currentStop":"xx","lat":"22.264856","lon":"113.520450","speed":"25.30"}]} |
// |{"hour":21,"value":[{"currentStop":"yy","lat":"22.263","lon":"113.52","speed":"34.5"},{"currentStop":"zz","lat":"21.3","lon":"113.521","speed":"13.7"}]}|
// +--------------------------------------------------------------------------------------------------------------------------------------------------------+
But with only 24 groups, there is no opportunity for scaling out here. It might be an interesting exercise, but it is not something you can really apply to a large dataset, where using Spark makes sense.
You can add the missing hours by joining with a range:
spark.range(0, 24).toDF("hour").join(bus_comb, Seq("hour"), "leftouter")
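Since the question uses the Java API, here is roughly the same thing in Java. This is only a sketch: it assumes the bus_ic Dataset and the spark session from the question, Spark 2.2+ (for to_timestamp), and a static import of the SQL functions.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Group by the hour of the timestamp and collect the distinct tuples per hour
Dataset<Row> busComb = bus_ic
        .groupBy(hour(to_timestamp(col("time"))).alias("hour"))
        .agg(collect_set(struct(
                col("currentStop"), col("lat"), col("lon"), col("speed")
        )).alias("value"));

// Fill in the missing hours with a left outer join against 0..23,
// mirroring the range join from the Scala answer
Dataset<Row> hours = spark.range(0, 24).toDF("hour");
Dataset<Row> allHours = hours
        .join(busComb, hours.col("hour").equalTo(busComb.col("hour")), "left_outer")
        .drop(busComb.col("hour"));

allHours.toJSON().show(false);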

Related

How to predict out of sample values using Spark (JAVA API)

I'm quite new to Spark and I need to use the Java API. Our goal is to serve predictions on the fly, where the user provides some of the variables, but of course not the label (the target variable).
But the model seems to need the data to be split into training data and test data for training and validation.
How can I get the prediction and the RMSE for the out-of-sample data that the user will query on the fly?
Dataset<Row>[] splits = df.randomSplit(new double[] {0.99, 0.1});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = df_p;
My out-of-sample data has the following format (where the 0s are data the user cannot provide):
IMO,PORT_ID,DWT,TERMINAL_ID,BERTH_ID,TIMESTAMP,label,OP_ID
0000000,1864,80000.00,5689,6060,2020-08-29 00:00:00.000,1,2
'label' is the result I want to predict.
This is how I used the models:
// Train a GBT model.
GBTRegressor gbt = new GBTRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
.setMaxIter(10);
// Chain the assembler, GBT and discretizer in a Pipeline.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler, gbt, discretizer});
// Train the model. This also fits the other pipeline stages.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);
// Select example rows to display.
predictions.select("prediction", "label", "weekofyear", "dayofmonth", "month", "year", "features").show(150);
// Select (prediction, true label) and compute test error.
RegressionEvaluator evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);

How to add a sort query to my StructuredQueryBuilder before talking to Marklogic

I am trying to add a sort/ordering query.
In my Java code:
StructuredQueryBuilder qb = new StructuredQueryBuilder();
QueryDefinition queryDef = qb.and(qb.value(qb.jsonProperty("status"), "Active"));
SearchHandle resultsHandle = new SearchHandle();
queryManager.setPageLength(PAGE_SIZE_TEN);
int start = PAGE_SIZE_TEN * (pageNumber - 1) + 1;
queryManager.search(queryDef, resultsHandle, start);
The above returns the resultsHandle with the 10 JSON documents found for the page specified by the variable "start", all with status "Active".
My question is how to include a sort query, maybe something along the lines of the following:
QueryDefinition queryDef = qb.and(qb.value(qb.jsonProperty("status"), "Active"),
        qb.sort?(qb.jsonProperty("dateCreated")));
I want it to return the first 10 JSON documents ordered by latest date. It is too late to sort with a Comparator after getting the results, because the result set returns 10 JSON documents in no particular order.
A few samples of the json files will look as such:
1.json
[
{
"id":"1",
"dateCreated":"2017-10-01 12:00:00",
"status":"Active"
"body":"This is a test"
}
]
2.json
[
{
"id":"2",
"dateCreated":"2017-10-02 12:00:00",
"status":"Active"
"body":"This is a test 2"
}
]
I realize there's an enum StructuredQueryBuilder.Ordering; how do I use it?
StructuredQueryBuilder.Ordering is specifically for use with near queries and is unrelated to what you want to do. You need to use query options to define a sort order for your search results. See the sort-order query option:
http://docs.marklogic.com/guide/search-dev/appendixa#id_44212
Options can be pre-defined and installed on MarkLogic and then referenced in your search, or you can define them at runtime and combine them with your structured query in a combined query.
Predefined: http://docs.marklogic.com/guide/java/query-options#chapter
Dynamic: http://docs.marklogic.com/guide/java/searches#id_76144
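For illustration, here is a rough sketch of the dynamic (combined query) route with the Java Client API. It is an assumption-laden sketch, not a drop-in answer: it reuses queryManager, PAGE_SIZE_TEN and start from the question, requires a range index on the dateCreated JSON property (with a type matching the index), and the exact options syntax should be checked against the docs above.

import com.marklogic.client.io.Format;
import com.marklogic.client.io.SearchHandle;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.query.RawCombinedQueryDefinition;
import com.marklogic.client.query.StructuredQueryBuilder;
import com.marklogic.client.query.StructuredQueryDefinition;

StructuredQueryBuilder qb = new StructuredQueryBuilder();
StructuredQueryDefinition structured = qb.value(qb.jsonProperty("status"), "Active");

// Combined query = serialized structured query + options; the sort-order
// option assumes a range index exists on dateCreated
String combined =
    "<search xmlns=\"http://marklogic.com/appservices/search\">" +
        structured.serialize() +
        "<options>" +
            "<sort-order type=\"xs:dateTime\" direction=\"descending\">" +
                "<json-property>dateCreated</json-property>" +
            "</sort-order>" +
        "</options>" +
    "</search>";

RawCombinedQueryDefinition queryDef =
    queryManager.newRawCombinedQueryDefinition(new StringHandle(combined).withFormat(Format.XML));

SearchHandle resultsHandle = new SearchHandle();
queryManager.setPageLength(PAGE_SIZE_TEN);
queryManager.search(queryDef, resultsHandle, start);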

Pivoting DataFrame - Spark SQL

I have a DataFrame containing the following:
TradeId|Source
ABC|"USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602"
I want to pivot this data so it turns into the following:
TradeId|CCY|PV
ABC|USD|333.123
ABC|USD|-789.444
ABC|GBP|1234.567
The number of CCY|PV|Date triplets in the "Source" column is not fixed. I could do it with an ArrayList, but that requires loading the data into the JVM and defeats the whole point of Spark.
Let's say my DataFrame looks as below:
DataFrame tradesSnap = this.loadTradesSnap(reportRequest);
String tempTable = getTempTableName();
tradesSnap.registerTempTable(tempTable);
tradesSnap = tradesSnap.sqlContext().sql("SELECT TradeId, Source FROM " + tempTable);
If you read the Databricks post on pivot, it says "A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns." And this is not what you want, I guess.
I would suggest you use withColumn and the built-in functions to get the final output you desire. You can do the following, considering that dataframe is what you have:
+-------+----------------------------------------------------------------+
|TradeId|Source |
+-------+----------------------------------------------------------------+
|ABC |USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602|
+-------+----------------------------------------------------------------+
You can do the following using explode, split and withColumn to get the desired output:
val explodedDF = dataframe.withColumn("Source", explode(split(col("Source"), "\\|")))
val finalDF = explodedDF
  .withColumn("CCY", split($"Source", ",")(0))
  .withColumn("PV", split($"Source", ",")(1))
  .withColumn("Date", split($"Source", ",")(2))
  .drop("Source")

finalDF.show(false)
The final output is
+-------+---+--------+--------+
|TradeId|CCY|PV |Date |
+-------+---+--------+--------+
|ABC |USD|333.123 |20170605|
|ABC |USD|-789.444|20170605|
|ABC |GBP|1234.567|20150602|
+-------+---+--------+--------+
I hope this solves your issue
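Since the question is written against the Java API, here is a rough Java sketch of the same explode/split approach. It assumes a Spark 2.x Dataset<Row> named tradesSnap rather than the older DataFrame shown in the question.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// One row per CCY,PV,Date triplet, then split the triplet into columns
Dataset<Row> exploded = tradesSnap
        .withColumn("Source", explode(split(col("Source"), "\\|")));

Dataset<Row> finalDF = exploded
        .withColumn("CCY", split(col("Source"), ",").getItem(0))
        .withColumn("PV", split(col("Source"), ",").getItem(1))
        .withColumn("Date", split(col("Source"), ",").getItem(2))
        .drop("Source");

finalDF.show(false);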
Rather than pivoting, what you are trying to achieve looks more like flatMap.
To put it simply, by using flatMap on a Dataset you apply to each row a function (map) that itself produces a sequence of rows; each of those sequences is then concatenated into a single one (flat).
The following program shows the idea:
import org.apache.spark.sql.SparkSession

case class Input(TradeId: String, Source: String)
case class Output(TradeId: String, CCY: String, PV: String, Date: String)

object FlatMapExample {

  // This function will produce more rows of output for each line of input
  def splitSource(in: Input): Seq[Output] =
    in.Source.split("\\|", -1).map { source =>
      println(source)
      val Array(ccy, pv, date) = source.split(",", -1)
      Output(in.TradeId, ccy, pv, date)
    }

  def main(args: Array[String]): Unit = {
    // Initialization and loading
    val spark = SparkSession.builder().master("local").appName("pivoting-example").getOrCreate()
    import spark.implicits._
    val input = spark.read.options(Map("sep" -> "|", "header" -> "true")).csv(args(0)).as[Input]

    // For each line in the input, split the source and then
    // concatenate each "sub-sequence" in a single `Dataset`
    input.flatMap(splitSource).show
  }
}
Given your input, this would be the output:
+-------+---+--------+--------+
|TradeId|CCY| PV| Date|
+-------+---+--------+--------+
| ABC|USD| 333.123|20170605|
| ABC|USD|-789.444|20170605|
| ABC|GBP|1234.567|20150602|
+-------+---+--------+--------+
You can now take the result and save it to a CSV, if you want.
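If you need to stay in Java, here is a hedged sketch of the same flatMap idea. It assumes Spark 2.x and the tradesSnap Dataset<Row> from the question; Tuple4 and Encoders.tuple are used so no extra bean classes are needed.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import scala.Tuple4;

Dataset<Row> result = tradesSnap
        .flatMap((FlatMapFunction<Row, Tuple4<String, String, String, String>>) row -> {
            String tradeId = row.getAs("TradeId");
            List<Tuple4<String, String, String, String>> out = new ArrayList<>();
            // Each |-separated triplet becomes its own output row
            for (String triplet : row.<String>getAs("Source").split("\\|", -1)) {
                String[] parts = triplet.split(",", -1);
                out.add(new Tuple4<>(tradeId, parts[0], parts[1], parts[2]));
            }
            return out.iterator();
        }, Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
        .toDF("TradeId", "CCY", "PV", "Date");

result.show();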

How to query nested mongodb fields efficiently in Java?

I'm not very experienced with running Mongo queries from Java, so I'm no expert at commands. I have a Mongo collection with ~6500 documents, each containing multiple fields (some of which have sub-fields), like below:
"_id" : NumberLong(714847),
"franchiseIds" : [
NumberLong(714848),
NumberLong(714849)
],
"profileSettings" : {
"DISCLAIMER_SETUP" : {
"settingType" : "DISCLAIMER_SETUP",
"attributeMap" : {
...
I want to have an operation which will go through the entire collection from time to time and calculate how many franchiseIds are present, since different documents could have anywhere from 1 to 4 franchiseIds.
From the Mongo shell, I did a very simple script to get this, and it calculated the result immediately:
rs_default:SECONDARY> var totalCount = 0;
rs_default:SECONDARY> db.profiles.find().forEach( function(profile) { totalCount += profile.franchiseIds.length } );
rs_default:SECONDARY> totalCount
However, when I attempted to do the same thing in Java, which is where this would run on the server from time to time, it was much less performant, taking around 15 seconds to complete:
int result = 0;
List<Profile> allProfiles = mongoTemplate.findAll(Profile.class, PROFILE_COLLECTION);
for (Profile profile : allProfiles) {
    result += profile.getFranchiseIds().size();
}
return result;
I realize the above isn't performant in Java as it's having to allocate memory for all of the Profiles being loaded in. In the Mongo shell script, is Mongo simply taking care of this itself?
Any ideas how I can do something similar in Java?
EDIT:
I returned only the franchiseIds field on the response from Mongo, and that helped significantly. Below is the improved code:
final Query query = new Query();
query.fields().include(FRANCHISE_IDS);

int result = 0;
final List<Profile> allProfiles = mongoTemplate.find(query, Profile.class, PROFILE_COLLECTION);
for (Profile profile : allProfiles) {
    result += profile.getFranchiseIds().size();
}
return result;
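If only the total is needed, the counting can also be pushed into MongoDB itself so that a single number comes back to the JVM. A rough Spring Data sketch (assuming the same mongoTemplate and PROFILE_COLLECTION; the intermediate field names "count" and "total" are illustrative):

import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;

import org.bson.Document;
import org.springframework.data.mongodb.core.aggregation.Aggregation;
import org.springframework.data.mongodb.core.aggregation.AggregationResults;
import org.springframework.data.mongodb.core.aggregation.ArrayOperators;

// $project the array size per document, then $group and $sum it server-side
Aggregation agg = newAggregation(
        project().and(ArrayOperators.Size.lengthOfArray("franchiseIds")).as("count"),
        group().sum("count").as("total"));

AggregationResults<Document> res =
        mongoTemplate.aggregate(agg, PROFILE_COLLECTION, Document.class);
Document doc = res.getUniqueMappedResult();
long total = (doc == null) ? 0L : ((Number) doc.get("total")).longValue();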

How can I retrieve DBObjects that contains a substring of a search-word?

I am using MongoDB with the Java driver (http://tinyurl.com/dyjxz8k). In my application I want to be able to return results that contain a substring of the user's search term. The method looks like this:
*searchlabel = the name of a field
*searchTerm = the users searchword
private void dbSearch(String searchlabel, String searchTerm){
    if (searchTerm != null && searchTerm.length() > 0) {
        DBCollection coll = db.getCollection("MediaCollection");
        BasicDBObject query = new BasicDBObject(searchlabel, searchTerm);
        DBCursor cursor = coll.find(query);
        try {
            while (cursor.hasNext()) {
                System.out.println(cursor.next());
                //view.showResult(cursor.next());
            }
        } finally {
            cursor.close();
        }
    }
}
Does anybody have any idea how I can solve this? Thanks in advance =) And a small additional question: how can I handle the DBObjects for presentation in (a JLabel in) the view?
For text searching in Mongo, there are two options:
$regex operator - however, unless you have a simple prefix regexp, queries won't use an index and will result in a full collection scan, which is usually slow (a Java example follows below)
In Mongo 2.4, a new text index has been introduced. A text query will split your query into words, and do an or-search for documents including any of the words. Text indexes also eliminate some stop-words and have simple stemming for some languages (see the docs).
If you are looking for a more advanced full-text search engine, with more powerful tokenising, stemming, autocomplete etc., maybe a better fit would be e.g. ElasticSearch.
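To illustrate the $regex option with the legacy Java driver used in the question, here is a sketch. Pattern.quote treats the user's input literally, and a non-anchored, case-insensitive pattern matches substrings but cannot use an index.

import java.util.regex.Pattern;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;

DBCollection coll = db.getCollection("MediaCollection");

// Case-insensitive substring match on the chosen field;
// a java.util.regex.Pattern value is sent to MongoDB as a regex query
Pattern substring = Pattern.compile(Pattern.quote(searchTerm), Pattern.CASE_INSENSITIVE);
BasicDBObject query = new BasicDBObject(searchlabel, substring);

DBCursor cursor = coll.find(query);
try {
    while (cursor.hasNext()) {
        System.out.println(cursor.next());
    }
} finally {
    cursor.close();
}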
I use this method in the mongo console to search with a regular expression in JavaScript:
// My name to search for
var searchWord = "alex";
// Construct a query with a simple /^alex$/i regex
var query = {};
query.animalName = new RegExp("^"+searchWord+"$","i");
// Perform find operation
var lionsNamedAlex = db.lions.find(query);
