How to train Mahout with Java?

I need to create a classifier by feature. I have 15M rows of data like:
{
"app_entertainment" : 1,
"app_widgets" : 2,
"arcade" : 8,
"books_and_reference" : 2,
"comics" : 0,
"brain" : 20,
"business" : 0,
"cards" : 5,
"casual" : 1,
"communication" : 4,
"education" : 0,
"finance" : 1,
"game_wallpaper" : 0,
"game_widgets" : 0,
"health_fitness" : 0,
"libraries_demo" : 0,
"racing" : 1,
"lifestyle" : 1,
"media_video" : 0,
"medical" : 0,
"music_and_audio" : 7,
"news_magazines" : 2,
"personalization" : 1,
"photography" : 0,
"productivity" : 4,
"shopping" : 1,
"social" : 1,
"sports_apps" : 1,
"sports_games" : 7,
"tools" : 15,
"transportation" : 2,
"travel_and_local" : 8,
"weather" : 3,
"app_wallpaper" : 0,
"entertainment" : 0,
"health_and_fitness" : 0,
"libraries_and_demo" : 0,
"media_and_video" : 0,
"news_and_magazines" : 0,
"sports" : 0
}
For every dataset like this I also know a boolean label: whether the user with this dataset clicked on an ad or not.
How can I use Mahout to train a classifier, and how do I classify new data after training?
Everything I have found on the net is very abstract; there are not many examples of how to do it from Java.

There is very little material on Mahout on the internet. I referred to the Mahout source code and the source code from Mahout in Action.
You could refer to the 20newsgroups example source code for classification.
Here is a simple example using a NaiveBayes classifier; the vector is one dataset (one row of your data):
public List<String> classifyCase(Vector vector) {
    // TreeMap orders by key, so the score is negated to sort labels from best to worst.
    TreeMap<Double, String> resultMap = new TreeMap<Double, String>();
    Vector result = classifier.classifyFull(vector);
    for (Vector.Element element : result) {
        int categoryId = element.index();
        double score = element.get();
        resultMap.put(-score, labels.get(categoryId));
    }
    return new ArrayList<String>(resultMap.values());
}
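Since the question also asks how to train, below is a minimal sketch, not from the original answer, using Mahout's SGD-based OnlineLogisticRegression, which fits a two-class clicked/not-clicked problem and runs in-process without Hadoop. The class name, the fixed feature ordering, and the way rows are fed in are assumptions for illustration; with 15M rows you would stream over the data, call train once per row (possibly for several passes), and then persist the model.

import java.util.List;
import java.util.Map;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Hypothetical helper class, not part of Mahout: it just fixes a feature order
// and wraps training/scoring calls on OnlineLogisticRegression.
public class AdClickTrainer {

    // Fixed order of the category names; index i of the vector holds featureNames.get(i).
    private final List<String> featureNames;
    // 2 classes (clicked / not clicked), one weight per feature, L1 regularization.
    private final OnlineLogisticRegression model;

    public AdClickTrainer(List<String> featureNames) {
        this.featureNames = featureNames;
        this.model = new OnlineLogisticRegression(2, featureNames.size(), new L1());
    }

    // Turn one row ({"arcade": 8, "tools": 15, ...}) into a Mahout vector.
    public Vector toVector(Map<String, Integer> row) {
        Vector v = new DenseVector(featureNames.size());
        for (int i = 0; i < featureNames.size(); i++) {
            Integer count = row.get(featureNames.get(i));
            v.set(i, count == null ? 0.0 : count);
        }
        return v;
    }

    // Call once per training row; clicked is the known label for that row.
    public void train(Map<String, Integer> row, boolean clicked) {
        model.train(clicked ? 1 : 0, toVector(row));
    }

    // Probability that a new row belongs to the "clicked" class.
    public double scoreClick(Map<String, Integer> row) {
        return model.classifyScalar(toVector(row));
    }
}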

Related

How to update different sub-documents based on a string in Mongo

The following is the document I'm trying to update:
{
"_id" : "12",
"cm_AccAmt" : 30,
"cmPerDaySts" : [
{
"cm_accAmt" : 30,
"cm_accTxnCount" : 2,
"cm_cpnCount" : 2,
"cm_accDate" : "2018-02-12"
},
{
"cm_accAmt" : 15,
"cm_accTxnCount" : 1,
"cm_cpnCount" : 1,
"cm_accDate" : "2018-02-13"
}
],
"cpnPerDaySts" : {
"cpnFile" : "path",
"perDayAcc" : [
{
"cm_accAmt" : 0,
"cm_accTxnCount" : 0,
"cm_cpnCount" : 0,
"cm_accDate" : "2018-02-12"
},
{
"cm_accAmt" : 0,
"cm_accTxnCount" : 0,
"cm_cpnCount" : 0,
"cm_accDate" : "2018-02-13"
}
]
}
}
I want to update the two arrays cmPerDaySts and cpnPerDaySts.perDayAcc based on the string date field cm_accDate, if a match is available.
The code I've tried so far to achieve this is:
ArrayList<BasicDBObject> filter = new ArrayList<>();
filter.add(new BasicDBObject("_id", "12").append("cmPerDaySts.cm_accDate", "2018-02-12"));
filter.add(new BasicDBObject("_id", "12").append("cpnPerDaySts.perDayAcc.cm_accDate", "2018-02-12"));
Document document2 = mongoCollection.findOneAndUpdate(new BasicDBObject("$or", filter),
        new BasicDBObject("$inc",
                new BasicDBObject("cmPerDaySts.$.cm_accAmt", 15)
                        .append("cm_AccAmt", 15)
                        .append("cmPerDaySts.$.cm_accTxnCount", 1)
                        .append("cmPerDaySts.$.cm_cpnCount", 1)
                        .append("cpnPerDaySts.perDayAcc.cm_accTxnCount", 1)),
        new FindOneAndUpdateOptions().upsert(false).returnDocument(ReturnDocument.AFTER));
System.out.println(document2.toJson());
But it ends up failing with the exception below:
Exception in thread "main" com.mongodb.MongoCommandException: Command failed with error 16837: 'The positional operator did not find the match needed from the query. Unexpanded update:
I want to achieve this in a single update query, not multiple. Can anyone point me in the right direction or suggest an approach to solve this?
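This thread has no accepted answer; as one possible direction, not from the original thread: the positional $ operator only resolves against the first array matched in the query, so it cannot address both cmPerDaySts and cpnPerDaySts.perDayAcc in one update. On MongoDB 3.6+ the filtered positional operator $[<identifier>] with arrayFilters can target both arrays in a single statement. A hedged sketch with the Java driver follows; the collection and variable names are assumptions.

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.combine;
import static com.mongodb.client.model.Updates.inc;

import java.util.Arrays;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import org.bson.Document;

// $[cm] and $[cpn] are resolved by the arrayFilters below (requires MongoDB 3.6+ and driver 3.6+).
Document updated = mongoCollection.findOneAndUpdate(
        eq("_id", "12"),
        combine(
                inc("cm_AccAmt", 15),
                inc("cmPerDaySts.$[cm].cm_accAmt", 15),
                inc("cmPerDaySts.$[cm].cm_accTxnCount", 1),
                inc("cmPerDaySts.$[cm].cm_cpnCount", 1),
                inc("cpnPerDaySts.perDayAcc.$[cpn].cm_accTxnCount", 1)),
        new FindOneAndUpdateOptions()
                .upsert(false)
                .returnDocument(ReturnDocument.AFTER)
                .arrayFilters(Arrays.asList(
                        eq("cm.cm_accDate", "2018-02-12"),
                        eq("cpn.cm_accDate", "2018-02-12"))));
System.out.println(updated.toJson());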

Extract values using parallelStream in Java

I have this JSON array:
"LiveSessionDataCollection": [
{
"CreateDate": "2017-12-27T13:29:06.595Z",
"Data": "Khttp://www8.hp.com/us/en/large-format-printers/designjet-printers/products.html&AbSGOX+SGOXpLXpBF8CXpGOA9BFFPconsole.info('DeploymentConfigName%3DRelease_20171227%26Version%3D1')%3B&HoConfig: Release_20171227&AwDz//////8NuaCh63&Win32&SNgYAJBBYWCYKW9a&2&SGOX+SGOXpF/1en-us&AAAAAAAAAAAAQICBCXpGOAAMBBBB8jl",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 0,
"StreamId": 0,
"StreamMessageId": 0,
"ProjectId": 201
},
{
"CreateDate": "2017-12-27T13:29:08.887Z",
"Data": "oDB Information Level : Detailed&9BbRoDB Annual Sales : 55000000&BoDB Audience : Mid-Market Business&AoDB%20Audience%20Segment%20%3A%20Retail%20%26%20Distribution&AoDB B2C : true&AoDB Company Name : Clicktale Inc&AoDB SID : 120325490&AoDB Employee Count : 275&AoDB Employee Range : Mid-Market&AoDB%20Industry%20%3A%20Retail%20%26%20Distribution&AoDB Revenue Range : $50M - $100M&AoDB Sub Industry : Electronics&AoDB Traffic : High&AWB9tY/8bvOBBP_({\"a\":[{\"a\":{\"s\":\"w:auto;l:auto;\"},\"n\":\"div53\"}]})&sP_({\"a\":[{\"a\":{\"s\":\"w:auto;l:auto;\"},\"n\":\"div62\"}]})&FP_({\"r\":[\"script2\"],\"m\":[{\"n\":{\"nt\":1,\"tn\":\"SCRIPT\",\"a\":{\"async\":\"\",\"src\":\"http://admin.brightcove.com/js/api/SmartPlayerAPI.js?_=1514381348598\"},\"i\":\"script55\"},\"t\":false,\"pn\":\"head1\"}]})&8GuP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:0px;l:274.5px;ml:0px;\"},\"n\":\"div442\"}]})&SP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:0px;l:274.5px;ml:0px;\"},\"n\":\"div444\"}]})&D",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 1,
"StreamId": 0,
"StreamMessageId": 1,
"ProjectId": 201
},
{
"CreateDate": "2017-12-27T13:29:08.971Z",
"Data": "P_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div105\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div114\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div123\"}]})&9B+8P_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div167\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div169\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div178\"}]})&JP_({\"a\":[{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div220\"},{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div229\"},{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div238\"}]})&FP_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div282\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div291\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div300\"}]})&HP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:-92px;l:274.5px;ml:0px;\"},\"n\":\"div442\"}]})&HP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:-92px;l:274.5px;ml:0px;\"},\"n\":\"div444\"}]})&B",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 2,
"StreamId": 0,
"StreamMessageId": 2,
"ProjectId": 201
}]
And I am trying to get all DataFlagType values using parallelStream() and forEach().
This is the code I have written, but I am getting an error:
jsonArray = (JSONArray) json.get("LiveSessionDataCollection");
jsonArray.parallelStream().forEach(
x -> ((JSONObject) dataFlagType = (JSONObject) jsonArray.get(1)));
I don't know exactly how to work with parallelStream(). How can I get all DataFlagType values from the JSON array (as int)?
You first need to map each object to its DataFlagType attribute and then collect the results into a list:
jsonArray.parallelStream()
        .map(x -> ((JSONObject) x).get("DataFlagType"))
        .collect(Collectors.toList())
Output:
[264,264,264]
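If you need the values as int rather than as raw objects, here is a hedged sketch, assuming the json-simple library (whose JSONArray extends ArrayList and therefore supports parallelStream(), and which parses numbers as Long); the variable names are illustrative.

import java.util.List;
import java.util.stream.Collectors;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;

// JSONArray is a raw ArrayList in json-simple, hence the unchecked cast.
@SuppressWarnings("unchecked")
List<Object> items = (List<Object>) jsonArray;
List<Integer> dataFlagTypes = items.parallelStream()
        .map(x -> ((Number) ((JSONObject) x).get("DataFlagType")).intValue())
        .collect(Collectors.toList());
System.out.println(dataFlagTypes);   // [264, 264, 264]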

Aggregation Performance Degradation on increasing load

I am running a 3-node Mongo cluster (version 3.0, WiredTiger storage engine) with 10GB RAM.
I have around 2 million documents, each with 25-30 fields, of which 2 are elementary arrays.
I am performing an aggregation query which takes around 150-170 milliseconds.
When I generate a load of 100 queries/sec, performance starts degrading and latency reaches up to 2 seconds.
Query
db.testCollection.aggregate( [
{ $match: { vid: { $in: ["001","002"]} , ss :"N" , spt : { $gte : new Date("2016-06-29")}, spf :{ $lte : new Date("2016-06-27")}}},
{ $match: {$or : [{sc:{$elemMatch :{$eq : "TEST"}}},{sc :{$exists : false}}]}},
{ $match: {$or : [{pt:{$ne : "RATE"}},{rpis :{$exists : true}}]}},
{ $project: { vid: 1, pid: 1, pn: 1, pt: 1, spf: 1, spt: 1, bw: 1, bwe: 1, st: 1, et: 1, ls: 1, dw: 1, at: 1, dt: 1, d1: 1, d2: 1, mldv: 1, aog: 1, nn: 1, mn: 1, rpis: 1, lmp: 1, cid: 1, van: 1, vad: 1, efo: 1, sc: 1, ss: 1, m: 1, pr: 1, obw: 1, osc: 1, m2: 1, crp: 1, sce: 1, dce: 1, cns: 1 }},
{ $group: { _id: null , data: { $push: "$$ROOT" } }
},
{ $project: { _id: 1 , data : 1 } }
]
)
There is a compound index on all the fields, in the same order as used in the query (except "rpis", since a compound index can have only one array field).
Please suggest where I am going wrong.
The last two stages are unnecessary.
The final $group is very heavy, as it builds a new array in memory; at this point your application should consume the result directly instead of using $group.
There is also a case for removing the preceding $project: it may be cheaper to push the full document down to the client, so this is worth a try.
When $match is the first stage, the index is used; there is a real risk that the 2nd and 3rd $match stages work against the result set from the first stage instead of using the created indexes. If you can, try to compress the $match stages into a single one and see how the query performs.
Simplified version of query below:
db.testCollection.aggregate([{
$match : {
vid : {
$in : ["001", "002"]
},
ss : "N",
spt : {
$gte : new Date("2016-06-29")
},
spf : {
$lte : new Date("2016-06-27")
}
}
}, {
$match : {
$or : [{
sc : {
$elemMatch : {
$eq : "TEST"
}
}
}, {
sc : {
$exists : false
}
}
]
}
}, {
$match : {
$or : [{
pt : {
$ne : "RATE"
}
}, {
rpis : {
$exists : true
}
}
]
}
}])
Another issue could be business rules that affect scaling the system to a sharded environment: did you have a load estimate before you settled on this document structure?

Parsing comment levels from forum posts

Is it possible to find out the comment levels from this website, like the page below?
https://www.ozbargain.com.au/node/249439#comment-3719026
With jsoup I am able to parse the comments, usernames, etc., but I am having trouble getting the correct comment levels.
Viewing the source of that page, the markup doesn't seem to match the live posts, unless I am reading it all wrong.
Is there a way to solve this?
I tried to extract the comment levels from the page source using:
String url = "https://www.ozbargain.com.au/node/249439";
Document doc = Jsoup.connect(url).get();
Elements level = doc.select("ul.comment");
for (Element column : e.select("ul")) {
    // comment level
    System.out.println(column.attr("class"));
    levels.add(column.attr("class"));
}
But it doesn't look right; it is only showing one level-0 comment, etc.
Thanks
for (Element column : e.select("ul")) {
    // comment level
    System.out.println(column.attr("class"));
    levels.add(column.attr("class"));
}
In the above code, where does e come from?
Anyway, you need to parse the class attribute value in order to find the comment level.
Here is working sample code:
SAMPLE CODE
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) throws IOException {
    String url = "https://www.ozbargain.com.au/node/249439#comment-3719026";
    Document doc = Jsoup.connect(url).get();
    Elements comments = doc.select("div.comment-wrap");
    // The level is encoded in an ancestor's class attribute ("levelN"); capture the digits.
    Matcher levelMatcher = Pattern.compile("(?i)^(.*level)(\\d+)(.*)$").matcher("");
    List<String> levels = new ArrayList<>();
    System.out.println("Comments found: " + comments.size());
    for (Element comment : comments) {
        if (levelMatcher.reset(comment.parent().parent().className()).find()) {
            levels.add(levelMatcher.replaceAll("$2"));
        }
    }
    System.out.println(levels);
}
OUTPUT [https://www.ozbargain.com.au/node/249439#comment-3719026] (may change depending on the request time)
Comments found: 38
[0, 1, 2, 3, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 2, 3, 3, 1, 2, 3, 3, 0, 1, 2, 3, 2, 3, 3, 2, 0, 0, 0, 1, 2, 3]
OUTPUT [https://www.ozbargain.com.au/node/249604] (may change depending on the request time)
Comments found: 14
[0, 1, 0, 1, 0, 1, 1, 2, 1, 0, 0, 1, 2, 0]

How to return just the matched elements from a mongoDB array

I've been looking into this question for a week and I can't understand why it still doesn't work...
I have this object in my MongoDB database:
{
produc: [
{
cod_prod: "0001",
description: "Ordenador",
price: 400,
current_stock: 3,
min_stock: 1,
cod_zone: "08850"
},
{
cod_prod: "0002",
description: "Secador",
price: 30,
current_stock: 10,
min_stock: 2,
cod_zone: "08870"
},
{
cod_prod: "0003",
description: "Portatil",
price: 500,
current_stock: 8,
min_stock: 4,
cod_zone: "08860"
},
{
cod_prod: "0004",
description: "Disco Duro",
price: 100,
current_stock: 20,
min_stock: 5,
cod_zone: "08850"
},
{
cod_prod: "0005",
description: "Monitor",
price: 150,
current_stock: 0,
min_stock: 2,
cod_zone: "08850"
}
]
}
I would like to query for the array elements with a specific cod_zone ("08850"), for example.
I found the $elemMatch projection, which supposedly returns just the array elements that match the query, but I don't know why I'm getting the whole object back.
This is the query I'm using:
db['Collection_Name'].find(
{
produc: {
$elemMatch: {
cod_zone: "08850"
}
}
}
);
And this is the result I expect:
{ produc: [
{
cod_prod: "0001",
denominacion: "Ordenador",
precio: 400,
stock_actual: 3,
stock_minimo: 1,
cod_zona: "08850"
},{
cod_prod: "0004",
denominacion: "Disco Duro",
precio: 100,
stock_actual: 20,
stock_minimo: 5,
cod_zona: "08850"
},
{
cod_prod: "0005",
denominacion: "Monitor",
precio: 150,
stock_actual: 0,
stock_minimo: 2,
cod_zona: "08850"
}]
}
I'm writing a Java program using the MongoDB Java driver, so I ultimately need the query for the Java driver, but I think I can translate it once I know the mongo shell query.
Thank you so much!
This is possible through the aggregation framework. The pipeline passes all documents in the collection through the following operations:
The $unwind operator outputs a document for each element of the produc array field by deconstructing it.
The $match operator filters out everything except the documents that match the cod_zone criterion.
The $group operator groups the input documents by a specified identifier expression and applies the accumulator expression $push to each group.
The $project operator then reshapes each document in the stream:
db.collection.aggregate([
{
"$unwind": "$produc"
},
{
"$match": {
"produc.cod_zone": "08850"
}
},
{
"$group":
{
"_id": null,
"produc": {
"$push": {
"cod_prod": "$produc.cod_prod",
"description": "$produc.description",
"price" : "$produc.price",
"current_stock" : "$produc.current_stock",
"min_stock" : "$produc.min_stock",
"cod_zone" : "$produc.cod_zone"
}
}
}
},
{
"$project": {
"_id": 0,
"produc": 1
}
}
])
will produce:
{
"result" : [
{
"produc" : [
{
"cod_prod" : "0001",
"description" : "Ordenador",
"price" : 400,
"current_stock" : 3,
"min_stock" : 1,
"cod_zone" : "08850"
},
{
"cod_prod" : "0004",
"description" : "Disco Duro",
"price" : 100,
"current_stock" : 20,
"min_stock" : 5,
"cod_zone" : "08850"
},
{
"cod_prod" : "0005",
"description" : "Monitor",
"price" : 150,
"current_stock" : 0,
"min_stock" : 2,
"cod_zone" : "08850"
}
]
}
],
"ok" : 1
}
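Since the question mentions the MongoDB Java driver, here is a hedged sketch of the same pipeline using the driver's aggregation builders. The database handle db and the collection name are assumptions, and the whole "$produc" element is pushed as-is instead of rebuilding each field.

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Projections;

MongoCollection<Document> collection = db.getCollection("Collection_Name");

for (Document doc : collection.aggregate(Arrays.asList(
        Aggregates.unwind("$produc"),                                    // one document per array element
        Aggregates.match(Filters.eq("produc.cod_zone", "08850")),        // keep matching elements only
        Aggregates.group(null, Accumulators.push("produc", "$produc")),  // rebuild the filtered array
        Aggregates.project(Projections.fields(
                Projections.excludeId(),
                Projections.include("produc")))))) {
    System.out.println(doc.toJson());
}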
