I need to build a classifier from feature data. I have 15M rows of data like this:
{
"app_entertainment" : 1,
"app_widgets" : 2,
"arcade" : 8,
"books_and_reference" : 2,
"comics" : 0,
"brain" : 20,
"business" : 0,
"cards" : 5,
"casual" : 1,
"communication" : 4,
"education" : 0,
"finance" : 1,
"game_wallpaper" : 0,
"game_widgets" : 0,
"health_fitness" : 0,
"libraries_demo" : 0,
"racing" : 1,
"lifestyle" : 1,
"media_video" : 0,
"medical" : 0,
"music_and_audio" : 7,
"news_magazines" : 2,
"personalization" : 1,
"photography" : 0,
"productivity" : 4,
"shopping" : 1,
"social" : 1,
"sports_apps" : 1,
"sports_games" : 7,
"tools" : 15,
"transportation" : 2,
"travel_and_local" : 8,
"weather" : 3,
"app_wallpaper" : 0,
"entertainment" : 0,
"health_and_fitness" : 0,
"libraries_and_demo" : 0,
"media_and_video" : 0,
"news_and_magazines" : 0,
"sports" : 0
}
For every row like this I also know a true/false label:
the boolean says whether the user with this data clicked on an ad or not.
How can I use Mahout to train a classifier, and how do I classify new rows once it is trained?
Everything I have found online is very abstract, with few examples of how to do it in Java.
There is very little material on Mahout on the internet. I relied on the Mahout source code and the source code from Mahout in Action.
You could refer to the 20newsgroups example source code for classification.
Here is a simple example using a Naive Bayes classifier. The vector is one encoded row of the dataset.
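public List<String> classifyCase(Vector vector) {
    // Keys are negated scores, so the TreeMap iterates from the highest score to the lowest.
    // Caveat: two categories with exactly the same score share a key, so one of them is dropped.
    TreeMap<Double, String> resultMap = new TreeMap<Double, String>();
    // classifyFull returns one score per category, indexed by category id.
    Vector result = classifier.classifyFull(vector);
    for (Vector.Element element : result) {
        int categoryId = element.index();
        double score = element.get();
        resultMap.put(-score, labels.get(categoryId));
    }
    // Labels ordered from most likely to least likely.
    return new ArrayList<String>(resultMap.values());
}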
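Before a row can be passed to a method like classifyCase, its category counts have to become a numeric vector with a fixed feature order. The following is a minimal sketch of that encoding step in plain Java, not Mahout's own API: the class name FeatureEncoder and its methods are hypothetical, and only four of the categories from the question are used to keep it short. The resulting array could then presumably be wrapped in a Mahout vector (for example with DenseVector's double[] constructor), but that step is an assumption and is left out here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns one row of per-category counts into a fixed-order array. */
final class FeatureEncoder {
    private final List<String> featureNames;

    FeatureEncoder(List<String> featureNames) {
        // The order of this list defines the index of each feature in the output array.
        this.featureNames = List.copyOf(featureNames);
    }

    /** Categories missing from the row are encoded as 0. */
    double[] encode(Map<String, Integer> row) {
        double[] values = new double[featureNames.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = row.getOrDefault(featureNames.get(i), 0);
        }
        return values;
    }

    public static void main(String[] args) {
        FeatureEncoder encoder =
                new FeatureEncoder(List.of("arcade", "brain", "tools", "weather"));
        Map<String, Integer> row = new LinkedHashMap<>();
        row.put("arcade", 8);
        row.put("brain", 20);
        row.put("tools", 15);
        // "weather" is absent from this row, so it is encoded as 0.
        System.out.println(Arrays.toString(encoder.encode(row))); // [8.0, 20.0, 15.0, 0.0]
    }
}
```

The same encoder instance (and therefore the same feature order) has to be used for training and for classification, otherwise the indices no longer refer to the same categories.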
The following is the document I'm trying to update:
{
"_id" : "12",
"cm_AccAmt" : 30,
"cmPerDaySts" : [
{
"cm_accAmt" : 30,
"cm_accTxnCount" : 2,
"cm_cpnCount" : 2,
"cm_accDate" : "2018-02-12"
},
{
"cm_accAmt" : 15,
"cm_accTxnCount" : 1,
"cm_cpnCount" : 1,
"cm_accDate" : "2018-02-13"
}
],
"cpnPerDaySts" : {
"cpnFile" : "path",
"perDayAcc" : [
{
"cm_accAmt" : 0,
"cm_accTxnCount" : 0,
"cm_cpnCount" : 0,
"cm_accDate" : "2018-02-12"
},
{
"cm_accAmt" : 0,
"cm_accTxnCount" : 0,
"cm_cpnCount" : 0,
"cm_accDate" : "2018-02-13"
}
]
}
}
I want to update the two lists cmPerDaySts and cpnPerDaySts based on the string date field cm_accDate, if a match is available.
The code I've tried so far to achieve this is:
ArrayList<BasicDBObject> filter = new ArrayList<>();
filter.add(new BasicDBObject("_id", "12").append("cmPerDaySts.cm_accDate", "2018-02-12"));
filter.add(new BasicDBObject("_id", "12").append("cpnPerDaySts.perDayAcc.cm_accDate", "2018-02-12"));
Document document2 = mongoCollection.findOneAndUpdate(new BasicDBObject("$or", filter),
new BasicDBObject("$inc",
new BasicDBObject("cmPerDaySts.$.cm_accAmt", 15).append("cm_AccAmt", 15).append("cmPerDaySts.$.cm_accTxnCount", 1)
.append("cmPerDaySts.$.cm_cpnCount", 1).append("cpnPerDaySts.perDayAcc.cm_accTxnCount", 1)),
new FindOneAndUpdateOptions().upsert(false).returnDocument(ReturnDocument.AFTER));
System.out.println(document2.toJson());
But it fails with the exception below:
Exception in thread "main" com.mongodb.MongoCommandException: Command failed with error 16837: 'The positional operator did not find the match needed from the query. Unexpanded update:
I want to achieve this in a single update query, not several. Can anyone point me in the right direction or suggest an approach to solve this?
I have this json Array:
"LiveSessionDataCollection": [
{
"CreateDate": "2017-12-27T13:29:06.595Z",
"Data": "Khttp://www8.hp.com/us/en/large-format-printers/designjet-printers/products.html&AbSGOX+SGOXpLXpBF8CXpGOA9BFFPconsole.info('DeploymentConfigName%3DRelease_20171227%26Version%3D1')%3B&HoConfig: Release_20171227&AwDz//////8NuaCh63&Win32&SNgYAJBBYWCYKW9a&2&SGOX+SGOXpF/1en-us&AAAAAAAAAAAAQICBCXpGOAAMBBBB8jl",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 0,
"StreamId": 0,
"StreamMessageId": 0,
"ProjectId": 201
},
{
"CreateDate": "2017-12-27T13:29:08.887Z",
"Data": "oDB Information Level : Detailed&9BbRoDB Annual Sales : 55000000&BoDB Audience : Mid-Market Business&AoDB%20Audience%20Segment%20%3A%20Retail%20%26%20Distribution&AoDB B2C : true&AoDB Company Name : Clicktale Inc&AoDB SID : 120325490&AoDB Employee Count : 275&AoDB Employee Range : Mid-Market&AoDB%20Industry%20%3A%20Retail%20%26%20Distribution&AoDB Revenue Range : $50M - $100M&AoDB Sub Industry : Electronics&AoDB Traffic : High&AWB9tY/8bvOBBP_({\"a\":[{\"a\":{\"s\":\"w:auto;l:auto;\"},\"n\":\"div53\"}]})&sP_({\"a\":[{\"a\":{\"s\":\"w:auto;l:auto;\"},\"n\":\"div62\"}]})&FP_({\"r\":[\"script2\"],\"m\":[{\"n\":{\"nt\":1,\"tn\":\"SCRIPT\",\"a\":{\"async\":\"\",\"src\":\"http://admin.brightcove.com/js/api/SmartPlayerAPI.js?_=1514381348598\"},\"i\":\"script55\"},\"t\":false,\"pn\":\"head1\"}]})&8GuP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:0px;l:274.5px;ml:0px;\"},\"n\":\"div442\"}]})&SP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:0px;l:274.5px;ml:0px;\"},\"n\":\"div444\"}]})&D",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 1,
"StreamId": 0,
"StreamMessageId": 1,
"ProjectId": 201
},
{
"CreateDate": "2017-12-27T13:29:08.971Z",
"Data": "P_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div105\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div114\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div123\"}]})&9B+8P_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div167\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div169\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div178\"}]})&JP_({\"a\":[{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div220\"},{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div229\"},{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div238\"}]})&FP_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div282\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div291\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div300\"}]})&HP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:-92px;l:274.5px;ml:0px;\"},\"n\":\"div442\"}]})&HP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:-92px;l:274.5px;ml:0px;\"},\"n\":\"div444\"}]})&B",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 2,
"StreamId": 0,
"StreamMessageId": 2,
"ProjectId": 201
}]
And I am trying to get all DataFlagType values using parallelStream() and forEach().
This is the code I have written, but I am getting an error:
jsonArray = (JSONArray) json.get("LiveSessionDataCollection");
jsonArray.parallelStream().forEach(
x -> ((JSONObject) dataFlagType = (JSONObject) jsonArray.get(1)));
I don't know exactly how to work with parallelStream(). How can I get all DataFlagType values from the JSON array (as int)?
You first need to map the stream to each object's DataFlagType attribute, then collect the results in a list:
jsonArray.parallelStream()
    .map(x -> ((JSONObject) x).get("DataFlagType"))
    .collect(Collectors.toList())
Output:
[264,264,264]
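The snippet above collects the values as plain objects, while the question asks for int values. Here is a small self-contained sketch of the conversion. It uses a List of Maps as a stand-in for the parsed JSON array so that it runs without a JSON library; the class name DataFlagTypes is hypothetical. The cast goes through Number rather than Integer because JSON parsers often hand back integers as Long (an assumption about the parser in use).

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: extracts every DataFlagType value as an Integer. */
final class DataFlagTypes {
    static List<Integer> extract(List<Map<String, Object>> entries) {
        return entries.parallelStream()
                // Number covers both Integer and Long values.
                .map(entry -> ((Number) entry.get("DataFlagType")).intValue())
                // collect() keeps the encounter order even for a parallel stream.
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the three LiveSessionDataCollection entries, trimmed to two fields.
        List<Map<String, Object>> entries = List.of(
                Map.<String, Object>of("DataFlagType", 264L, "MessageNumber", 0L),
                Map.<String, Object>of("DataFlagType", 264L, "MessageNumber", 1L),
                Map.<String, Object>of("DataFlagType", 264L, "MessageNumber", 2L));
        System.out.println(extract(entries)); // [264, 264, 264]
    }
}
```

For three elements a sequential stream() would do the same job; parallelStream() only tends to pay off for much larger collections.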
I am running a 3-node MongoDB cluster (version 3.0, WiredTiger storage engine) with 10 GB RAM.
I have around 2 million documents, each having 25-30 fields, of which 2 are elementary arrays.
I am running an aggregation query that takes around 150-170 milliseconds.
When I generate a load of 100 queries/sec, performance starts degrading and the response time reaches up to 2 seconds.
Query
db.testCollection.aggregate( [
{ $match: { vid: { $in: ["001","002"]} , ss :"N" , spt : { $gte : new Date("2016-06-29")}, spf :{ $lte : new Date("2016-06-27")}}},
{ $match: {$or : [{sc:{$elemMatch :{$eq : "TEST"}}},{sc :{$exists : false}}]}},
{ $match: {$or : [{pt:{$ne : "RATE"}},{rpis :{$exists : true}}]}},
{ $project: { vid: 1, pid: 1, pn: 1, pt: 1, spf: 1, spt: 1, bw: 1, bwe: 1, st: 1, et: 1, ls: 1, dw: 1, at: 1, dt: 1, d1: 1, d2: 1, mldv: 1, aog: 1, nn: 1, mn: 1, rpis: 1, lmp: 1, cid: 1, van: 1, vad: 1, efo: 1, sc: 1, ss: 1, m: 1, pr: 1, obw: 1, osc: 1, m2: 1, crp: 1, sce: 1, dce: 1, cns: 1 }},
{ $group: { _id: null , data: { $push: "$$ROOT" } }
},
{ $project: { _id: 1 , data : 1 } }
]
)
There is a compound index on all the fields, in the same order as used in the query (except "rpis", since a compound index can have only one array field).
Please suggest where I am going wrong.
The last two stages are unnecessary.
The final $group is very heavy because it builds a new array in memory; the results should instead be consumed by the application directly at this stage, without the $group.
It may also be worth removing the preceding $project: pushing the full documents down to the client could turn out to be cheaper, so it is worth a try.
The index is used when $match is the first stage, but there is a real risk that the 2nd and 3rd $match stages work on the result set of the first stage instead of using the indexes you created. If you can, try to merge the $match stages into a single one and see how the query performs.
Simplified version of the query, with the last three stages removed:
db.testCollection.aggregate([{
$match : {
vid : {
$in : ["001", "002"]
},
ss : "N",
spt : {
$gte : new Date("2016-06-29")
},
spf : {
$lte : new Date("2016-06-27")
}
}
}, {
$match : {
$or : [{
sc : {
$elemMatch : {
$eq : "TEST"
}
}
}, {
sc : {
$exists : false
}
}
]
}
}, {
$match : {
$or : [{
pt : {
$ne : "RATE"
}
}, {
rpis : {
$exists : true
}
}
]
}
}])
Another issue could be business rules that affect scaling the system out to a sharded environment. Did you have an estimate of the load before you started working with this document structure?
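The suggestion to merge the three $match stages into one can be sketched from the Java side as follows. This is only an illustration of the shape of the combined filter, built from plain maps so that it runs without the MongoDB driver; the class name SingleMatchFilter and the helper doc are hypothetical, and passing the map on to the driver (for instance by wrapping it in a Document) is an assumption that is not shown. Whether the merged form is actually faster has to be measured.

```java
import java.time.Instant;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: builds the body of a single merged $match stage. */
final class SingleMatchFilter {
    // Small helper so that every nested level has the same static type.
    private static Map<String, Object> doc(String key, Object value) {
        return Map.of(key, value);
    }

    static Map<String, Object> build() {
        Map<String, Object> match = new LinkedHashMap<>();
        // Conditions of the original first $match stage.
        match.put("vid", doc("$in", List.of("001", "002")));
        match.put("ss", "N");
        match.put("spt", doc("$gte", Date.from(Instant.parse("2016-06-29T00:00:00Z"))));
        match.put("spf", doc("$lte", Date.from(Instant.parse("2016-06-27T00:00:00Z"))));
        // The two $or conditions of the 2nd and 3rd stages must both hold,
        // so they are combined under one $and.
        match.put("$and", List.of(
                doc("$or", List.of(
                        doc("sc", doc("$elemMatch", doc("$eq", "TEST"))),
                        doc("sc", doc("$exists", false)))),
                doc("$or", List.of(
                        doc("pt", doc("$ne", "RATE")),
                        doc("rpis", doc("$exists", true))))));
        return match;
    }

    public static void main(String[] args) {
        // Prints the nested structure; the date format depends on the local time zone.
        System.out.println(build());
    }
}
```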
Is it possible to find out the comment levels from a web page like the one below?
https://www.ozbargain.com.au/node/249439#comment-3719026
With jsoup I am able to parse the comments, usernames, etc., but I am having trouble getting the correct comment levels.
Viewing the source of that page, the levels don't match the live posts, unless I am reading it all wrong.
Is there a way to solve this?
I was able to generate the source comment level using:
String url = "https://www.ozbargain.com.au/node/249439";
Document doc = Jsoup.connect(url).get();
Elements level = doc.select("ul.comment");
for(Element column : e.select("ul")){
//comment level
System.out.println(column.attr("class"));
levels.add(column.attr("class"));
}
But it doesn't look right. It only shows 1 level-0 comment, etc.
Thanks
for(Element column : e.select("ul")) {
//comment level
System.out.println(column.attr("class"));
levels.add(column.attr("class"));
}
In the code above, where does e come from?
Anyway, you need to parse the class attribute value in order to find the comment level.
Here is a working code sample:
SAMPLE CODE
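import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CommentLevels {
    public static void main(String[] args) throws IOException {
        String url = "https://www.ozbargain.com.au/node/249439#comment-3719026";
        Document doc = Jsoup.connect(url).get();
        Elements comments = doc.select("div.comment-wrap");
        // Captures the digits that follow "level" in the class name.
        Matcher levelMatcher = Pattern.compile("(?i)^(.*level)(\\d+)(.*)$").matcher("");
        List<String> levels = new ArrayList<>();
        System.out.println("Comments found: " + comments.size());
        for (Element comment : comments) {
            // The level is read from the class name of the comment's grandparent element.
            if (levelMatcher.reset(comment.parent().parent().className()).find()) {
                levels.add(levelMatcher.replaceAll("$2"));
            }
        }
        System.out.println(levels);
    }
}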
OUTPUT [https://www.ozbargain.com.au/node/249439#comment-3719026] (may change depending on the request time)
Comments found: 38
[0, 1, 2, 3, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 2, 3, 3, 1, 2, 3, 3, 0, 1, 2, 3, 2, 3, 3, 2, 0, 0, 0, 1, 2, 3]
OUTPUT [https://www.ozbargain.com.au/node/249604] (may change depending on the request time)
Comments found: 14
[0, 1, 0, 1, 0, 1, 1, 2, 1, 0, 0, 1, 2, 0]
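The class-name parsing step can also be tried in isolation, without fetching the page. The sketch below is a variation on the regular expression used above that reads the number through a capture group instead of replaceAll. The class name CommentLevel is hypothetical, and the input strings such as "comment level2" are assumed examples of what the class attribute might contain, not values copied from the site.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the nesting level from a class attribute value. */
final class CommentLevel {
    // Matches a whole "levelN" token (case-insensitive) and captures N.
    private static final Pattern LEVEL = Pattern.compile("(?i)\\blevel(\\d+)\\b");

    static OptionalInt parse(String className) {
        Matcher matcher = LEVEL.matcher(className);
        return matcher.find()
                ? OptionalInt.of(Integer.parseInt(matcher.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("comment level2")); // OptionalInt[2]
        System.out.println(parse("comment"));        // OptionalInt.empty
    }
}
```

Returning an OptionalInt makes the "no level found" case explicit, which the list-based version above handles by silently skipping the element.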
I've been searching for an answer to this for a week and I can't understand why it still doesn't work...
I have this object in my MongoDB database:
{
produc: [
{
cod_prod: "0001",
description: "Ordenador",
price: 400,
current_stock: 3,
min_stock: 1,
cod_zone: "08850"
},
{
cod_prod: "0002",
description: "Secador",
price: 30,
current_stock: 10,
min_stock: 2,
cod_zone: "08870"
},
{
cod_prod: "0003",
description: "Portatil",
price: 500,
current_stock: 8,
min_stock: 4,
cod_zone: "08860"
},
{
cod_prod: "0004",
description: "Disco Duro",
price: 100,
current_stock: 20,
min_stock: 5,
cod_zone: "08850"
},
{
cod_prod: "0005",
description: "Monitor",
price: 150,
current_stock: 0,
min_stock: 2,
cod_zone: "08850"
}
]
}
I would like to query for the array elements with a specific cod_zone ("08850", for example).
I found the $elemMatch projection, which supposedly should return just the array elements that match the query, but I don't know why I'm getting the whole object.
This is the query I'm using:
db['Collection_Name'].find(
{
produc: {
$elemMatch: {
cod_zone: "08850"
}
}
}
);
And this is the result I expect:
{ produc: [
{
cod_prod: "0001",
description: "Ordenador",
price: 400,
current_stock: 3,
min_stock: 1,
cod_zone: "08850"
},{
cod_prod: "0004",
description: "Disco Duro",
price: 100,
current_stock: 20,
min_stock: 5,
cod_zone: "08850"
},
{
cod_prod: "0005",
description: "Monitor",
price: 150,
current_stock: 0,
min_stock: 2,
cod_zone: "08850"
}]
}
I'm writing a Java program using the MongoDB Java driver, so what I really need is a query for the Java driver, but I think I will be able to work it out if I know the mongo shell query.
Thank you so much!
This is possible through the aggregation framework. The pipeline passes all documents in the collection through the following operations:
$unwind operator - outputs a document for each element of the produc array field by deconstructing it.
$match operator - keeps only the documents that match the cod_zone criteria.
$group operator - groups the input documents by a specified identifier expression and applies the accumulator expression $push to each group.
$project operator - then reshapes each document in the stream.
db.collection.aggregate([
{
"$unwind": "$produc"
},
{
"$match": {
"produc.cod_zone": "08850"
}
},
{
"$group":
{
"_id": null,
"produc": {
"$push": {
"cod_prod": "$produc.cod_prod",
"description": "$produc.description",
"price" : "$produc.price",
"current_stock" : "$produc.current_stock",
"min_stock" : "$produc.min_stock",
"cod_zone" : "$produc.cod_zone"
}
}
}
},
{
"$project": {
"_id": 0,
"produc": 1
}
}
])
will produce:
{
"result" : [
{
"produc" : [
{
"cod_prod" : "0001",
"description" : "Ordenador",
"price" : 400,
"current_stock" : 3,
"min_stock" : 1,
"cod_zone" : "08850"
},
{
"cod_prod" : "0004",
"description" : "Disco Duro",
"price" : 100,
"current_stock" : 20,
"min_stock" : 5,
"cod_zone" : "08850"
},
{
"cod_prod" : "0005",
"description" : "Monitor",
"price" : 150,
"current_stock" : 0,
"min_stock" : 2,
"cod_zone" : "08850"
}
]
}
],
"ok" : 1
}
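To make the effect of the $unwind, $match and $group steps concrete, here is a plain-Java analogue that runs on an in-memory copy of the produc array. It is only an illustration of what the pipeline computes, not a translation to the MongoDB Java driver; the class name ZoneFilter is hypothetical and the sample data is trimmed to two fields per product.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: in-memory analogue of the $unwind/$match/$group pipeline. */
final class ZoneFilter {
    /** Keeps only the array elements whose cod_zone equals the requested zone. */
    static List<Map<String, Object>> byZone(List<Map<String, Object>> produc, String codZone) {
        return produc.stream()                                          // like $unwind: one element at a time
                .filter(product -> codZone.equals(product.get("cod_zone"))) // like $match
                .collect(Collectors.toList());                          // like $group with $push
    }

    public static void main(String[] args) {
        // Trimmed copy of the produc array from the question.
        List<Map<String, Object>> produc = List.of(
                Map.<String, Object>of("cod_prod", "0001", "cod_zone", "08850"),
                Map.<String, Object>of("cod_prod", "0002", "cod_zone", "08870"),
                Map.<String, Object>of("cod_prod", "0003", "cod_zone", "08860"),
                Map.<String, Object>of("cod_prod", "0004", "cod_zone", "08850"),
                Map.<String, Object>of("cod_prod", "0005", "cod_zone", "08850"));
        // Prints 0001, 0004 and 0005, one per line.
        for (Map<String, Object> product : byZone(produc, "08850")) {
            System.out.println(product.get("cod_prod"));
        }
    }
}
```

Filtering on the client like this means the whole array travels over the network first, so the server-side pipeline is usually the better choice for large documents.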