How to map two datasets in Spark Java

Hi, I'm reading data from MongoDB into a Spark application.
My MongoDB database contains two collections.
The first is profile_data, which holds the actual input data with its field names, including some unique fields:
{
"MessageStatus" : 2,
"Origin" : 1,
"_id" : ObjectId("596340fe8b0fa35d2880db1a"),
"accerlation" : 19.4,
"cylinders" : 4,
"displacement" : 119,
"file_id" : ObjectId("59633e48b760e7c8071a6c1c"),
"horsepower" : 82,
"modelyear" : 82,
"modified_date" : ISODate("2017-07-10T08:47:01.641Z"),
"mpg" : 31,
"snet_id" : "new_project",
"unique_id" : "784",
"username" : "chevy s-10",
"weight" : 2720
}
The other collection is predictive_model_details, which holds the ML model details (model name, feature fields, and prediction field), much like metadata:
{
"_id" : ObjectId("56b4351be4b064bb19a90324"),
"algorithm_id" : "55d717a53d9e22022ff2a1e9",
"algorithm_name" : "K- Nearest Neighbours (IBK)",
"client_id" : "562e1d51b760d0e408151b91",
"feature_fields" : [
{
"name" : "Origin",
"type" : "int"
},
{
"name" : "accerlation",
"type" : "Double"
},
{
"name" : "displacement",
"type" : "Int"
},
{
"name" : "horsepower",
"type" : "Int"
},
{
"name" : "modelyear",
"type" : "Int"
}
],
"makeActiveStatus" : "0",
"model_name" : "test1",
"parameter_type" : "system_defined",
"parameters" : [
{
"symbol" : "-K",
"value" : "1"
}
],
"predictor" : {
"name" : "mpg",
"type" : "Int"
},
"result_exists" : true,
"snet_id" : "new_project"
}
So I've created two Datasets in Spark, one for each MongoDB collection. Now I want to map these two Datasets so that, for each record, all the feature fields are grouped together along with the prediction field.
The field common to both Datasets is snet_id.
Could anyone please help?
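One way to picture the mapping, sketched here in plain Java collections rather than Spark so that it runs on its own: index the model metadata by snet_id, then for each profile row keep only the values named in feature_fields plus the predictor. The names FeatureMapper, ModelMeta, toLabeledRow and join are made up for illustration and are not part of Spark or the MongoDB connector. In Spark itself the equivalent would presumably be a join of the two Datasets on snet_id followed by a select of those columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; plain collections stand in for the two Spark Datasets.
class FeatureMapper {

    // Minimal stand-in for one predictive_model_details document.
    static final class ModelMeta {
        final String snetId;
        final List<String> featureFields;
        final String predictor;

        ModelMeta(String snetId, List<String> featureFields, String predictor) {
            this.snetId = snetId;
            this.featureFields = featureFields;
            this.predictor = predictor;
        }
    }

    // Keeps only the feature columns and the predictor column of one profile row.
    static Map<String, Object> toLabeledRow(Map<String, Object> profileRow, ModelMeta model) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String name : model.featureFields) {
            out.put(name, profileRow.get(name));
        }
        out.put(model.predictor, profileRow.get(model.predictor));
        return out;
    }

    // Joins profile rows to model metadata on snet_id and projects each match.
    // Assumption: one model per snet_id; if there are several, the last one wins.
    static List<Map<String, Object>> join(List<Map<String, Object>> profiles, List<ModelMeta> models) {
        Map<String, ModelMeta> bySnetId = new LinkedHashMap<>();
        for (ModelMeta m : models) {
            bySnetId.put(m.snetId, m);
        }
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : profiles) {
            ModelMeta m = bySnetId.get(row.get("snet_id"));
            if (m != null) {
                result.add(toLabeledRow(row, m));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("snet_id", "new_project");
        row.put("Origin", 1);
        row.put("horsepower", 82);
        row.put("mpg", 31);
        row.put("weight", 2720); // not a feature field, so it is dropped

        ModelMeta model = new ModelMeta("new_project", Arrays.asList("Origin", "horsepower"), "mpg");
        System.out.println(join(Arrays.asList(row), Arrays.asList(model)));
    }
}
```

Rows whose snet_id has no model are skipped, which corresponds to an inner join.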

Related

How to query an array in array of dictionaries in MongoDB?

I have a following mongoDB document structure -
db.menus.findOne()
{
"_id" : ObjectId("5cf25412326c3f4f26df039b"),
"restaurantId" : "301728",
"items" : [
{
"itemId" : "CEBM4H41JR",
"name" : "Crun Chicken",
"imageUrl" : "",
"price" : 572,
"attributes" : [
"Tasty",
"Spicy"
]
},
{
"itemId" : "53Q0XS3HPR",
"name" : "Devils Chicken",
"imageUrl" : "",
"price" : 595,
"attributes" : [
"Gravy",
"Salty"
]
}
]
}
I am trying to write a query to get all the menus based on the "attributes" field under "items" in the document.
I have done the following to get the menus if "name" of "items" is given and I am getting a result -
db.menus.find({ 'items' : {$elemMatch : {'name' : {$regex : "Chicken Thali", $options: 'i' }}}}).pretty()
I have tried this for getting the result for attributes but this is not working -
db.menus.find({'items' : {$elemMatch : {'attributes' : {$all : [{$regex : "Tasty", $options: 'i' }]}}}})
How do I get the list? I also want to write this query for a MongoRepository in a Spring Boot application.
Further, based on the restaurantIds obtained, I have to query the restaurants collection to find all matching restaurants, which have the following structure -
{
"_id" : ObjectId("5cf2540e326c3f4f26de93dd"),
"restaurantId" : "301728",
"name" : "Desire Foods",
"imageUrl" : "https://b.zmtcdn.com/data/pictures/8/301728/d690ccb500d746530f56e1d637949da2_featured_v2.jpg",
"latitude" : 28.4900591,
"longitude" : 77.3066401,
"attributes" : [
"Chinese",
" Fast Food",
" Bakery"
],
"opensAt" : "09:30",
"closesAt" : "22:30"
}
Is the whole operation possible in a single query?

I think you can modify your query to use $in instead of $all.
To achieve your intended result, you can try:
db.collection.aggregate([
{
"$match": {
"items": {
"$elemMatch": {
"attributes": {
"$in": [
"Tasty"
]
}
}
}
}
},
{
"$lookup": {
"from": "restaurant",
"localField": "restaurantId",
"foreignField": "restaurantId",
"as": "restaurants"
}
},
{
"$unwind": "$restaurants"
},
{
"$replaceRoot": { "newRoot": "$restaurants" }
}
])
Use $match at appropriate stages as needed to limit the documents pulled into memory.
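To make the flow of that pipeline concrete, here is a rough in-memory imitation in plain Java: first keep the menus where at least one item carries the wanted attribute, then look up the restaurant for each surviving restaurantId. MenuLookup and its methods are hypothetical names, and the case-insensitive comparison reflects what the question asked for; a plain string inside $in matches case-sensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics the $match / $lookup steps on in-memory data.
class MenuLookup {

    // True if any item of a menu carries the attribute (ignoring case),
    // similar in spirit to $elemMatch on "items" with a test on "attributes".
    static boolean hasItemWithAttribute(List<List<String>> itemAttributes, String wanted) {
        for (List<String> attrs : itemAttributes) {
            for (String a : attrs) {
                if (a.equalsIgnoreCase(wanted)) {
                    return true;
                }
            }
        }
        return false;
    }

    // menus: restaurantId -> attribute list of each item
    // restaurants: restaurantId -> restaurant name
    static List<String> restaurantsServing(Map<String, List<List<String>>> menus,
                                           Map<String, String> restaurants,
                                           String wanted) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, List<List<String>>> e : menus.entrySet()) {
            if (!hasItemWithAttribute(e.getValue(), wanted)) {
                continue; // the $match step drops this menu
            }
            String name = restaurants.get(e.getKey()); // the $lookup step
            if (name != null) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> menus = new LinkedHashMap<>();
        menus.put("301728", Arrays.asList(
                Arrays.asList("Tasty", "Spicy"),
                Arrays.asList("Gravy", "Salty")));

        Map<String, String> restaurants = new LinkedHashMap<>();
        restaurants.put("301728", "Desire Foods");

        System.out.println(restaurantsServing(menus, restaurants, "tasty"));
    }
}
```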

Representing Abstract JSON Objects as models in Java

Ok so I am making API requests to retrieve certain things like movies, songs, or to ping the server. However all of these responses are contained within the same response JSON object that has varying fields depending on the response. Below are three examples.
ping
{
"response" : {
"status" : "ok",
"version" : "0.9.1"
}
}
getIndexes
{
"response" : {
"status" : "ok",
"version" : "0.9.1",
"indexes" : {
"index" : [ {
"name" : "A",
"movie" : [ {
"id" : "150",
"name" : "A Movie"
}, {
"id" : "2400",
"name" : "Another Movie"
} ]
}, {
"name" : "S",
"movie" : [ {
"id" : "439",
"name" : "Some Movie"
}, {
"id" : "209",
"name" : "Some Movie Part 2"
} ]
} ]
}
}
}
getRandomSongs
{
"response" : {
"status" : "ok",
"version" : "0.9.1",
"randomSongs" : {
"song": [ {
"id" : "72",
"parent" : "58",
"isDir" : false,
"title" : "Letter From Yokosuka",
"album" : "Metaphorical Music",
"artist" : "Nujabes",
"track" : 7,
"year" : 2003,
"genre" : "Hip-Hop",
"coverArt" : "58",
"size" : 20407325,
"contentType" : "audio/flac",
"suffix" : "flac",
"transcodedContentType" : "audio/mpeg",
"transcodedSuffix" : "mp3",
"duration" : 190,
"bitRate" : 858,
"path" : "Nujabes/Metaphorical Music/07 - Letter From Yokosuka.flac",
"isVideo" : false,
"created" : "2015-06-06T01:18:05.000Z",
"albumId" : "2",
"artistId" : "0",
"type" : "music"
}, {
"id" : "3135",
"parent" : "3109",
"isDir" : false,
"title" : "Forty One Mosquitoes Flying In Formation",
"album" : "Tame Impala",
"artist" : "Tame Impala",
"track" : 4,
"year" : 2008,
"genre" : "Rock",
"coverArt" : "3109",
"size" : 10359844,
"contentType" : "audio/mpeg",
"suffix" : "mp3",
"duration" : 258,
"bitRate" : 320,
"path" : "Tame Impala/Tame Impala/04 - Forty One Mosquitoes Flying In Formation.mp3",
"isVideo" : false,
"created" : "2015-06-29T21:50:16.000Z",
"albumId" : "101",
"artistId" : "30",
"type" : "music"
} ]
}
}
}
So basically my question is, how should I structure my model classes to use for parsing these responses? At the moment I have an abstract response object that only contains fields for the status and version. However, by using this approach I will need a response class that extends this abstract class for every request I make (e.g. AbstractResponse, IndexesResponse, RandomSongsResponse). Also, some models with the same name may have different fields depending on the API request made. I would prefer to avoid making a model class for every possible scenario.
And as an extra note, I am using GSON for JSON serialization/deserialization and Retrofit to communicate with the API.
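One possible shape, offered as a sketch rather than a recommendation: a single Response class that declares every optional payload as a nullable field, so the same class covers ping, getIndexes and getRandomSongs. This relies on the assumption that the JSON library leaves fields absent from the JSON at their default (null for objects), which is how Gson normally behaves. The class names and the kind helper are invented for illustration, and the block deliberately avoids any Gson or Retrofit dependency.

```java
import java.util.List;

// Illustrative model classes; field names mirror the JSON keys shown above.
class ApiModels {

    // Outer wrapper: { "response" : { ... } }
    static class Envelope {
        Response response;
    }

    // One class for all calls; payload fields stay null when the call does not return them.
    static class Response {
        String status;
        String version;
        Indexes indexes;         // only set for getIndexes
        RandomSongs randomSongs; // only set for getRandomSongs

        boolean isOk() {
            return "ok".equals(status);
        }
    }

    static class Indexes {
        List<Index> index;
    }

    static class Index {
        String name;
        List<Movie> movie;
    }

    static class Movie {
        String id;
        String name;
    }

    static class RandomSongs {
        List<Song> song;
    }

    // Only a few of the song fields are listed; more can be added as needed.
    static class Song {
        String id;
        String title;
        String artist;
        Integer track; // boxed so that a missing value is null rather than 0
    }

    // Hypothetical helper: tells which payload a parsed response carries.
    static String kind(Response r) {
        if (r.indexes != null) {
            return "indexes";
        }
        if (r.randomSongs != null) {
            return "randomSongs";
        }
        return "ping";
    }

    public static void main(String[] args) {
        Response ping = new Response();
        ping.status = "ok";
        ping.version = "0.9.1";
        System.out.println(kind(ping) + " ok=" + ping.isOk());
    }
}
```

The trade-off is that callers have to null-check the payload they expect, in exchange for not writing one response subclass per request.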

Customize the PrettyPrint options in Jackson?

I'm aware of the writerWithDefaultPrettyPrinter option in Jackson, but is there any way to customize it? See examples below.
If Jackson doesn't let you change the pretty-print options, is there another popular JSON library that does?
Summary of options to change:
Don't open multiple containers on the same line
Don't close and open containers on the same line
Use 4 spaces as indents instead of 2
(another option, though I wouldn't use it) Open containers on a new line so that they line up vertically with their closing marker
Example of what it outputs now:
[ {
"id" : "12",
"payload" : [ {
"name" : "url",
"value" : [ {
"name" : "url",
"value" : "http://foobar.com"
} ]
}, {
"name" : "tags",
"value" : [ {
"name" : "tags",
"value" : "red"
}, {
"name" : "tags",
"value" : "green"
}, {
"name" : "tags",
"value" : "blue"
}, {
...
Example of what I'd like to get:
[
{
"id" : "12",
"payload" : [
{
"name" : "url",
"value" : [
{
"name" : "url",
"value" : "http://foobar.com"
}
]
},
{
"name" : "tags",
"value" : [
{
"name" : "tags",
"value" : "red"
},
{
"name" : "tags",
"value" : "green"
},
{
"name" : "tags",
"value" : "blue"
},
{
...
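For reference, the target layout (each container opened on its own line, 4-space indents) can be produced by a small stand-alone re-indenter. The sketch below uses only the JDK, assumes its input is already valid JSON, and uses the made-up class name JsonReindent; it is not Jackson's API. Within Jackson itself, as far as I know, DefaultPrettyPrinter accepts a custom indenter through indentArraysWith and indentObjectsWith, which is probably the more natural route.

```java
// Stand-alone sketch: re-indents valid JSON text with 4 spaces per level
// and puts every opening and closing bracket on its own line.
class JsonReindent {

    static String format(String json) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c);
                if (c == '\\' && i + 1 < json.length()) {
                    out.append(json.charAt(++i)); // copy the escaped character verbatim
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            switch (c) {
                case '"':
                    inString = true;
                    out.append(c);
                    break;
                case '{':
                case '[':
                    out.append(c);
                    depth++;
                    newline(out, depth);
                    break;
                case '}':
                case ']':
                    depth--;
                    closeContainer(out, depth);
                    out.append(c);
                    break;
                case ',':
                    out.append(c);
                    newline(out, depth);
                    break;
                case ':':
                    out.append(" : ");
                    break;
                default:
                    if (!Character.isWhitespace(c)) {
                        out.append(c); // numbers, true, false, null
                    }
            }
        }
        return out.toString();
    }

    // Keeps empty containers as "{}" or "[]"; otherwise starts a new line for the closer.
    private static void closeContainer(StringBuilder out, int depth) {
        int end = out.length();
        while (end > 0 && Character.isWhitespace(out.charAt(end - 1))) {
            end--;
        }
        if (end > 0 && (out.charAt(end - 1) == '{' || out.charAt(end - 1) == '[')) {
            out.setLength(end);
        } else {
            newline(out, depth);
        }
    }

    private static void newline(StringBuilder out, int depth) {
        out.append('\n');
        for (int i = 0; i < depth; i++) {
            out.append("    ");
        }
    }

    public static void main(String[] args) {
        System.out.println(format("[{\"id\":\"12\",\"tags\":[\"red\",\"green\"]}]"));
    }
}
```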

Mongo query inside Hashmap with unknown hash key

Platform: MongoDB, Spring, SpringDataMongoDB
I have a collection called "Encounter" with below structure
Encounter:
{ "_id" : "49a0515b-e020-4e0d-aa6c-6f96bb867288",
"_class" : "com.keype.hawk.health.emr.api.transaction.model.Encounter",
"encounterTypeId" : "c4f657f0-015d-4b02-a216-f3beba2c64be",
"visitId" : "8b4c48c6-d969-4926-8b8f-05d2f58491ae",
"status" : "ACTIVE",
"form" :
{
"_id" : "be3cddc5-4cec-4ce5-8592-72f1d7a0f093",
"formCode" : "CBC",
"fields" : {
"dc" : {
"label" : "DC",
"name" : "tc",
},
"tc" : {
"label" : "TC",
"name" : "tc",
},
"notes" : {
"label" : "Notes",
"name" : "notes",
}
},
"notes" : "Blood Test",
"dateCreated" : NumberLong("1376916746564"),
"dateModified" : NumberLong("1376916746564"),
"staffCreated" : 10013,
"staffModified" : 10013
},
}
The element "fields" is represented using a Java Hashmap as:
protected LinkedHashMap<String, Field> fields;
The key to the hashmap is not fixed, but generated at run time.
How do I query to get all documents in the collection where "label" = "TC"?
It's not possible to query like db.encounter.find({'form.fields.dc.label':'TC'}) because the element name 'dc' is NOT known. I want to skip that position and execute the query, something like:
db.encounter.find({'form.fields.*.label':'TC'});
Any ideas?
Also, how do I best use indexes in this scenario?

If fields were an array and your key a part of the sub-document instead:
"fields" : [
{ "key" : "dc",
"label" : "DC",
"name" : "dc"
},
{ "key" : "tc",
"label" : "TC",
"name" : "tc"
}
]
In this case, you could simply query for any sub-element inside the array:
db.coll.find({"form.fields.label":"TC"})
Not sure how you would integrate that with Spring, but perhaps the idea helps? As far as indexes are concerned, you can index into the array, which gives you a multi-key index. Basically, the index will have a separate entry pointing to the document for each array value.
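On the Java side, the restructuring suggested above could look roughly like this: each entry of the LinkedHashMap becomes one element of a list, with the former map key carried as a "key" property. Field, KeyedField, toArray and hasLabel are hypothetical names, and the sketch uses plain collections only; how the list is persisted through Spring Data MongoDB is left open.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of turning a map keyed by run-time names into an array of sub-documents.
class FieldsAsArray {

    // Stand-in for the existing Field value type.
    static final class Field {
        final String label;
        final String name;

        Field(String label, String name) {
            this.label = label;
            this.name = name;
        }
    }

    // Array element: the former map key becomes an ordinary property.
    static final class KeyedField {
        final String key;
        final String label;
        final String name;

        KeyedField(String key, String label, String name) {
            this.key = key;
            this.label = label;
            this.name = name;
        }
    }

    // Converts the map into a list, preserving the map's iteration order.
    static List<KeyedField> toArray(Map<String, Field> fields) {
        List<KeyedField> out = new ArrayList<>();
        for (Map.Entry<String, Field> e : fields.entrySet()) {
            out.add(new KeyedField(e.getKey(), e.getValue().label, e.getValue().name));
        }
        return out;
    }

    // In-memory analogue of matching {"form.fields.label": label} against the array.
    static boolean hasLabel(List<KeyedField> fields, String label) {
        for (KeyedField f : fields) {
            if (label.equals(f.label)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Field> fields = new LinkedHashMap<>();
        fields.put("dc", new Field("DC", "dc"));
        fields.put("tc", new Field("TC", "tc"));
        System.out.println(hasLabel(toArray(fields), "TC"));
    }
}
```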

Is it possible to add list of objects into embedded array document in mongo document?

I have the following Mongo document:
{
"_id" : ObjectId("506e9e54a4e8f51423679428"),
"description" : "ffffffffffffffff",
"menus" : [
{
"_id" : ObjectId("506e9e5aa4e8f51423679429"),
"description" : "ffffffffffffffffffff",
"items" : [
{
"name" : "xcvxc",
"description" : "vxvxcvxc",
"text" : "vxcvxcvx",
"menuKey" : "0",
"onSelect" : "1",
"_id" : ObjectId("506e9f07a4e8f5142367942f")
} ,
{
"name" : "abcd",
"description" : "qqq",
"text" : "qqq",
"menuKey" : "0",
"onSelect" : "3",
"_id" : ObjectId("507e9f07a4e8f5142367942f")
}
]
}
]
}
Now I want to change this to:
{
"_id" : ObjectId("506e9e54a4e8f51423679428"),
"description" : "ffffffffffffffff",
"menus" : [
{
"_id" : ObjectId("506e9e5aa4e8f51423679429"),
"description" : "ffffffffffffffffffff",
"items" : {
{
"name" : "xcvxc",
"description" : "vxvxcvxc",
"text" : "vxcvxcvx",
"menuKey" : "0",
"onSelect" : "1",
"_id" : ObjectId("506e9f07a4e8f5142367942f")
} ,
{
"name" : "abcd",
"description" : "qqq",
"text" : "qqq",
"menuKey" : "0",
"onSelect" : "3",
"_id" : ObjectId("507e9f07a4e8f5142367942f")
}
}
}
]
}
Is this possible in Mongo? With the first schema, an atomic update is not possible because we can't use two "$" positional operators when updating a nested layer. So I thought of changing the schema to the second form. How can I achieve that?
For the first schema I have used "$push" to add items inside menus.
Any help would be great.

Your update is changing 'menu' object, so I would suggest changing the schema so that the menu is the top level document, rather than an array in another document.
Menu could either have a field referencing the top level object (in another collection) that it belongs to, or you can denormalize the fields of the top level object into each menu document.
Without knowing complete requirements of the application, it's difficult to know when the schema is "good enough".
