Context aware recommendation engine - java

I am looking for a context-aware (location, time, companion) recommendation system.
I found a number of good recommendation systems (Mahout, PredictionIO, easyrec),
but I am not convinced that any of them fits.
On further searching I found CARSKit, which is based on LibRec.
That is exactly the kind of library I am looking for. At the same time, I would prefer to keep working with Mahout.
Mahout does not fully suit me, but it does let me ask for a specific number of recommendations, and its output is easy to understand.
As far as I understand, context awareness is missing in Mahout.
Here is my dataset:
calendar_seq,user_id,date,dayofweek,timehh,timemm,location_name,location_lat,location_long,companion,event_name,is_recommended,is_accepted,show_in_cal
1,1,14/12/15,Monday,13,0,Office,1.1,2.2,Colleagues,lunch,true,true,true
2,1,14/12/15,Monday,18,0,Cinema,3.3,4.4,NA,Movie,false,true,true
3,1,15/12/15,Tuesday,13,0,Office,1.1,2.2,Colleagues,lunch,true,true,true
4,1,15/12/15,Tuesday,18,0,Meeting,3.3,4.4,Colleagues,meeting,false,true,true
5,1,16/12/15,Wednesday,13,0,Office,1.1,2.2,Colleagues,lunch,true,true,true
I will have the above five rows in the database and will supply them as training data.
Now I need a recommendation for user 1 on 16/12/15 in the evening, at 18:00.
It could recommend Cinema or Meeting for 16/12.
When I run the recommender again for 17/12, the previous day's events, including that recommendation, become part of the training data.
So the recommender can again give a recommendation based on location, time, companion, and so on.
Can anyone suggest the best-suited recommendation wrapper on top of Mahout, or a new library that fits my requirement?
I prefer Java-based solutions.

This may be similar to your question.
A quote from this link: "Your input file may have multiple features like age, location etc. R could help you in applying K-Means clustering on multiple features. Apache Mahout implementation overwrite features instead of applying multiple features. And when you apply clustering on these multiple features, clusters would be formed based on all features instead of one. However, I am not sure about the use-case, So I am just discussing technical feasibility here. You may need to apply based on your use-case."
Hope this helps.
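For the dataset in the question, one simple baseline that needs no library is contextual pre-filtering: keep only the user's accepted past events whose context matches the requested slot, then rank event names by frequency. The sketch below is plain Java, not Mahout or CARSKit API. The `Event` class, `recommend`, and `sampleHistory` are hypothetical names, and matching on the hour alone is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ContextFilterSketch {

    /** One calendar row, reduced to the fields used here (hypothetical type). */
    static final class Event {
        final int userId;
        final int hour;
        final String eventName;
        final boolean accepted;

        Event(int userId, int hour, String eventName, boolean accepted) {
            this.userId = userId;
            this.hour = hour;
            this.eventName = eventName;
            this.accepted = accepted;
        }
    }

    /** Returns up to n event names the user accepted at the given hour, most frequent first. */
    static List<String> recommend(List<Event> history, int userId, int hour, int n) {
        // Pre-filter by context (user and hour), then count accepted event names.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Event e : history) {
            if (e.userId == userId && e.hour == hour && e.accepted) {
                counts.merge(e.eventName, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so ties keep their first-seen order.
        ranked.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < ranked.size() && i < n; i++) {
            result.add(ranked.get(i).getKey());
        }
        return result;
    }

    /** The five rows from the question, reduced to user, hour, event name, accepted. */
    static List<Event> sampleHistory() {
        return Arrays.asList(
            new Event(1, 13, "lunch", true),
            new Event(1, 18, "Movie", true),
            new Event(1, 13, "lunch", true),
            new Event(1, 18, "meeting", true),
            new Event(1, 13, "lunch", true));
    }
}
```

Calling `ContextFilterSketch.recommend(ContextFilterSketch.sampleHistory(), 1, 18, 2)` returns the two 18:00 events. Feeding each day's accepted events back into the history gives the retraining loop described in the question. Location and companion could be added as further filter conditions.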

Related

NLP - Determine whether a piece of text is talking about a given topic?

I have a Java application where I'm looking to determine in real time whether a given piece of text is talking about a topic supplied as a query.
Some techniques I've looked into for this are coreference detection with packages like open-nlp and Stanford-NLP coref detection, but these models take extremely long to load and don't seem practical in a production application environment. Is it possible to perform coreference analysis such that given a piece of text and a topic, I can get a boolean answer that the text is discussing the topic?
Other than document classification which requires a trained corpus, are there any other techniques that can help me achieve such a thing?
I suggest having a look at Weka. It is written in Java, so it will fit well with your environment, it has lots of tools, and it comes with a UI as well as an API. If you are looking for an unsupervised approach (that is, one that does not learn from a pre-classified corpus), here is an interesting paper: http://www.newdesign.aclweb.org/anthology/C/C00/C00-1066.pdf
You can also search for "unsupervised text classification/ information retrieval" on Google. You will get lots of approaches. You can choose the one you find easiest.
For each topic (if the topics are predefined) you can create a list of terms. Then, for each sentence, compute the cosine similarity between the sentence and each topic's term list, and show the user the nearest topic.
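The term-list idea above can be sketched in plain Java with term-count vectors and cosine similarity. The class and method names are hypothetical, the tokenizer is deliberately naive (lowercase ASCII letters only), and the threshold passed to `isAbout` is an assumption that would need tuning on real text.

```java
import java.util.HashMap;
import java.util.Map;

final class TopicCosineSketch {

    /** Lowercases the text, splits on non-letters, and counts each term. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity between two term-count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int countA = e.getValue();
            normA += countA * countA;
            Integer countB = b.get(e.getKey());
            if (countB != null) {
                dot += countA * countB;
            }
        }
        for (int countB : b.values()) {
            normB += countB * countB;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Boolean answer: is the text close enough to the topic's term list? */
    static boolean isAbout(String text, String topicTerms, double threshold) {
        return cosine(termCounts(text), termCounts(topicTerms)) >= threshold;
    }
}
```

For example, `isAbout("The match ended with a late goal", "football goal match striker", 0.3)` is true, while a sentence sharing no terms with the list scores 0. Nothing has to be loaded at startup, which addresses the load-time concern in the question, at the cost of missing synonyms and paraphrases.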

Is tagging a form of data mining?

I am implementing a small CRM system, and data mining to predict and find opportunities and trends is essential for such systems. One data mining approach is clustering. This is a very small CRM project that uses Java to provide the interface for retrieving information from the database.
When I insert a customer into the database, there is a text field that allows the customer to be tagged at the point of registration.
Would you regard this tagging technique as clustering? If so, is it a data mining technique?
I am sure there are complex APIs, such as the Java Data Mining API, that support data mining. But for the sake of my project I just want to know whether tagging users with keywords, the way Stack Overflow lets you tag a question when posting it, is a form of data mining, since those tags make it easy to find trends and patterns through searching.
To make it short, yes, tags are additional information that will make data mining easier to conduct later on.
They probably won't be enough though. Tags are linked to entities and, depending on how you compute them, they might not show interesting relations between different entities. With your tagging system, the only relation usable I see is 'has same tag' and it might not be enough.
Clustering your data can be done using community detection techniques on graphs built using your data and relations between entities.
This example is in Python and uses the networkx library but it might give you an idea of what I'm talking about: http://perso.crans.org/aynaud/communities/
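To make the 'has same tag' relation concrete, here is a minimal plain-Java sketch that builds an inverted index from tags to customers and lists the customers connected to a given one through a shared tag. These pairs are the edges of the graph that a community detection algorithm would then work on. All names are hypothetical, and the sketch does not perform community detection itself.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class TagIndexSketch {

    /** Inverted index: tag -> customers carrying that tag. */
    static Map<String, Set<String>> indexByTag(Map<String, List<String>> tagsByCustomer) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : tagsByCustomer.entrySet()) {
            for (String tag : e.getValue()) {
                index.computeIfAbsent(tag, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    /** Customers that share at least one tag with the given customer. */
    static Set<String> relatedTo(String customer, Map<String, List<String>> tagsByCustomer) {
        Map<String, Set<String>> index = indexByTag(tagsByCustomer);
        Set<String> related = new TreeSet<>();
        List<String> ownTags = tagsByCustomer.getOrDefault(customer, Collections.<String>emptyList());
        for (String tag : ownTags) {
            related.addAll(index.get(tag));
        }
        related.remove(customer); // a customer is not related to itself
        return related;
    }
}
```

With only this relation, customers who were never given a common tag stay disconnected even if they behave alike, which is the limitation described above.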
Yes, tagging is definitely one way of grouping your users. However, it’s different from ‘clustering.’ Here’s why: you’re making a conscious decision on how you want to group them, but there may be better or different user groups, based on a range of behaviors, that are not obvious to you.
Clustering methods are unsupervised learning methods that can help you uncover these patterns. These methods are “unsupervised” because you don’t have a specific target variable; rather, you want to find groups/ patterns that are most prominent in the data. You can feed CRM data to clustering algorithms to uncover ‘hidden’ relationships.
Also, if you’re using ‘tagging,’ it’s more of a descriptive analytics problem - you’ve well-defined groups in the data, and you’re identifying their behavior. Clustering would be a predictive analytics problem - algorithms will try to predict groups based on the user behavior they recognize in the data.

How to make career guidance system intelligent

Well, at last I am working on my final-year project, an intelligent web-based career guidance system. The core functionality of my system is:
Recommendation System
The recommendation system will examine the user's preferences through interest tests and the user's academic record, and on the basis of this information it will suggest the best career options, i.e. a course such as BS Computer Science.
The input of the recommendation system is the student's credentials and the interest test. The questions in the interest test depend on the user's academic history and on the answers given so far, so the test does not ask everyone the same questions; it decides in real time what to ask each user, according to rules defined by the system.
Its output is the set of suggested fields, decided on the basis of the interest test.
Problem
When I was defending my scope in front of the committee, they said "this is simple if-else; this system is not intelligent."
My question is: which AI technique or algorithm could be used to make this system intelligent? I have searched a lot, but the papers related to my system are superficial; they emphasize the idea, not the methodology.
I want to do all my work in Java, so a technology-specific answer would be great.
Feel free to migrate my question to another Stack Exchange site if it does not fit SO's Q&A criteria.
Edit
After getting some ideas from the answers, I want to implement a rule-based expert system with an inference engine. Now I want to be clearer on the technology for implementing the rule engine. After searching, I found Drools to be the best option, but is it compatible with web applications? I also found Tohu to be the best dynamic form generator (which my project also needs). Can I use Tohu with Drools to build my web application? Is this type of system easy to implement?
1. If you have a large number of questions, each of them can represent a feature. Assuming you are going to have a LOT of features, finding the series of if-else statements that fulfills the criteria is hard. (Recall that a full tree with n questions has 2^n "leaves", representing the 2^n possible answer combinations, assuming each question is a yes/no question.)
2. Since hard-coding the above is not feasible for a large enough (and probably realistic) n, there is room for heuristic solutions. One of those is machine learning, and specifically classification. You can have a sample of people answer your survey, with an "expert" saying what the best career is for each of them, and let an algorithm find a classifier for the general problem. (If you want to convert it into a series of yes/no questions automatically, this can be done with a decision tree and an algorithm like C4.5 to build the tree.)
3. It could also be important to determine which questions are actually relevant. Is gender relevant? Is height relevant? These questions can also be answered with ML, for example with feature selection or dimensionality reduction algorithms (PCA is an example of the latter).
Regarding the "technology" aspect: there is a nice Java library called Weka, which implements many of the classification algorithms out there.
One question you could ask (and try to answer in your project) is which classification algorithm works best for this problem. Some possibilities are the above-mentioned C4.5, Naive Bayes, logistic regression, neural networks, KNN, or SVM (which usually turned out best for me). You can back your choice of algorithm with a statistical comparison showing which is better; the Wilcoxon test is a common choice for this.
EDIT: more details on point 2:
Here an "expert" can be a human classifier from the field of HR who reads the features and classifies the answers. Obtaining this data (usually called the "training data") is sometimes hard and expensive. If your university has an IE or HR faculty, maybe they will be willing to help.
The idea is: gather a group of people who first answer your survey. Then give the answers to a human classifier (the "expert"), who chooses the best career for each person based on that person's answers. The data, together with the classification given by the expert, is the input of the learning algorithm; its output is a classifier.
A classifier is itself a function that, given the answers to a survey, predicts the "classification" (suggested career) for the person who took the survey.
Note that once you have a classifier, you do not need to keep the training data any more; the classifier alone is enough. However, you still need your list of questions, and the answers to these questions are the features provided to the classifier.
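As a minimal sketch of "training data in, classifier out", here is a small Bernoulli Naive Bayes over yes/no answers in plain Java. It is not Weka's implementation and it is not C4.5; Naive Bayes is swapped in because it fits in a few lines. The class name, the survey encoding as `boolean[]`, and the career labels in the usage note are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NaiveBayesCareerSketch {
    private final int numQuestions;
    private final List<String> careers = new ArrayList<>();
    private final Map<String, Integer> careerCounts = new HashMap<>();
    private final Map<String, int[]> yesCounts = new HashMap<>();
    private int total = 0;

    NaiveBayesCareerSketch(int numQuestions) {
        this.numQuestions = numQuestions;
    }

    /** Adds one expert-labelled survey: yes/no answers plus the career the expert chose. */
    void train(boolean[] answers, String career) {
        if (!careerCounts.containsKey(career)) {
            careers.add(career);
            yesCounts.put(career, new int[numQuestions]);
        }
        careerCounts.merge(career, 1, Integer::sum);
        int[] yes = yesCounts.get(career);
        for (int q = 0; q < numQuestions; q++) {
            if (answers[q]) {
                yes[q]++;
            }
        }
        total++;
    }

    /** Returns the career with the highest posterior log-probability, or null if untrained. */
    String classify(boolean[] answers) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String career : careers) {
            int n = careerCounts.get(career);
            double score = Math.log((double) n / total); // prior
            int[] yes = yesCounts.get(career);
            for (int q = 0; q < numQuestions; q++) {
                double pYes = (yes[q] + 1.0) / (n + 2.0); // Laplace smoothing
                score += Math.log(answers[q] ? pYes : 1.0 - pYes);
            }
            if (score > bestScore) {
                bestScore = score;
                best = career;
            }
        }
        return best;
    }
}
```

After a few calls such as `train(new boolean[] {true, false}, "Computer Science")`, `classify` maps a new answer vector to a career. Only the per-career counts are stored, so the raw surveys can be discarded after training, as noted above.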
All you have to do to satisfy them is create a simple learning system:
1. Change your thesis terminology so it is described as "learning the best career" instead of using the word "intelligent". Learning is a form of artificial intelligence.
2. Create a training regime. Do this by giving the questionnaire to people who already have careers, and also ask questions to find out how satisfied they are with their career. That way your system can train on what makes a good career match and what makes a bad one.
3. Choose a learning system to absorb the data from (2). For example, one source of ideas might be this recent paper: http://journals.cluteonline.com/index.php/RBIS/article/download/4405/4493. Sum-product networks are cutting edge in AI and apply well to expert-system-like problems.
4. Finally, try to give a twist to whatever your technology is to make it specific to your problem.
In my final project, I had some experience with Jena RDF inference engine. Basically, what you do with it is create a sort of knowledge base with rules like "if user chose this answer, he has that quality" and "if user has those qualities, he might be good for that job". Adding answers into the system will let you query his current status and adjust questions accordingly. It's pretty easy to create a proof of concept with it, it's easier to do than a bunch of if-else, and if your professors worship prolog-ish style things, they'll like it.
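The two kinds of rules described above ("answer implies quality", "qualities imply job") can be illustrated with a tiny forward-chaining loop in plain Java. This is not Jena's API or rule syntax; the `Rule` class, the fact strings, and the fixpoint loop are a hypothetical sketch of what an inference engine does with such a knowledge base.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ForwardChainingSketch {

    /** "If all premises are known facts, the conclusion is a fact too." */
    static final class Rule {
        final Set<String> premises;
        final String conclusion;

        Rule(String conclusion, String... premises) {
            this.conclusion = conclusion;
            this.premises = new HashSet<>(Arrays.asList(premises));
        }
    }

    /** Applies the rules repeatedly until no new fact can be derived. */
    static Set<String> infer(Set<String> facts, List<Rule> rules) {
        Set<String> known = new TreeSet<>(facts);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule rule : rules) {
                // Set.add returns true only when the conclusion is new.
                if (known.containsAll(rule.premises) && known.add(rule.conclusion)) {
                    changed = true;
                }
            }
        }
        return known;
    }
}
```

Adding an answer as a new fact and calling `infer` again gives the user's current set of qualities, which can then drive the choice of the next question.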
As @amit suggested, Bayesian analysis can provide you with guidance on the next question to ask. Another pitfall of dynamic tests is artificial thresholds ("if your score is 28, you are in this category; if your score is 27, you are not"), a problem which fuzzy logic can help address. Another benefit of fuzzy logic is that adding a new category is relatively easy, since the domain expert is only asked to contribute qualitative assessments, not quantitative thresholds.
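The threshold problem can be illustrated with a fuzzy membership function: instead of a hard cut between 27 and 28, a score belongs to the category to a degree between 0 and 1. The sketch below uses a linear ramp in plain Java; the class and method names and the bounds 20 and 35 in the usage note are hypothetical.

```java
final class FuzzyThresholdSketch {

    /**
     * Degree (0.0 to 1.0) to which a score belongs to a category.
     * Scores at or below low are fully out, scores at or above high are
     * fully in, and membership rises linearly in between.
     */
    static double rampUp(double score, double low, double high) {
        if (score <= low) {
            return 0.0;
        }
        if (score >= high) {
            return 1.0;
        }
        return (score - low) / (high - low);
    }
}
```

With `low = 20` and `high = 35`, scores of 27 and 28 get memberships of roughly 0.47 and 0.53, so a one-point difference changes the result only slightly instead of flipping the category.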
A program is never more intelligent than the person who wrote it. So, I would first use the collective intelligence that has been built and open sourced already.
Pass your set of known data points as an input to Apache Mahout's PearsonCorrelationSimilarity and use the output to predict which course is the best match. In addition to being open source and scalable, you can also record the outcome and feed it back to the system to improve the accuracy over time. It is very hard to match this level of performance because it is a lot easier to tweak an out of the box algorithm or replace it with your own than it is to deal with a bunch of if else conditions.
I would suggest reading this book. It contains an example of how to use PearsonCorrelationSimilarity.
Mahout also has built-in recommender components such as NearestNeighborClusterSimilarity that can simplify your solution further.
There's good starter code in the book. You can build on it.
Student credentials and interest test questions and answers are the inputs. Career choice is the output that you can correlate with the input. That's a very simplistic approach, but it might be OK to start with. Eventually you will have to apply the classifier techniques that Amit has suggested, and Mahout can help you with that as well.
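To show what a Pearson-based similarity measures, here is the correlation coefficient computed in plain Java, plus a helper that picks the known student whose answer vector correlates best with a new student's. This is not Mahout's API; the class, the method names, and the encoding of answers as `double[]` scores are assumptions for illustration.

```java
final class PearsonSketch {

    /** Pearson correlation of two equal-length vectors; NaN if either is constant. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0.0 || varY == 0.0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Index of the known vector most correlated with the target, or -1 if none is comparable. */
    static int mostSimilar(double[][] known, double[] target) {
        int best = -1;
        double bestR = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < known.length; i++) {
            double r = pearson(known[i], target);
            if (!Double.isNaN(r) && r > bestR) {
                bestR = r;
                best = i;
            }
        }
        return best;
    }
}
```

The career recorded for the student at the returned index would be the suggestion for the new student. A real recommender would combine several neighbours rather than one.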
Drools can be used via the web, but watch out; it can be a bit of beast to configure and is likely serious overkill for your application. It is an 'enterprise' type of solution focused around rule management, rather than rule execution.
Drools is an "IF-THEN" system, and pretty much all rules engines use the Rete algorithm. http://en.wikipedia.org/wiki/Rete_algorithm - so if your original question is about how not to use an IF-THEN system, Drools is not the right choice. Now, there is a Solver and Planner part of Drools that are not IF-THEN algorithms, but this is not the main Drools algorithm.
That said, it seems like a reasonable choice for your application. Just don't expect it to be considered an 'intelligent' system by those who deem themselves as experts. Rules engines are typically used to codify (that is, make software of) the rules and regulations of business, such as 'should you be approved for a mortgage' or 'how much is your car insurance' and so on. 'what job you should do' is a reasonable application of the same.
If you want to add more AI-like intelligence, here are a few ideas:
Use machine learning to get feedback from the user about earlier recommendations. So, if someone likes or hates a suggestion, add that back in as a feature of the person. You are now doing some basic feedback-driven learning (Bayes, neural nets) to try to better match the person to the career.
Consider the questions you ask the person. Do you need to ask all of the questions? If you can alter the flow of questions based on their responses (by estimating what kind of person they are), then you are trying to learn the series of questions that gives the most useful knowledge for a recommendation.
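One common way to decide which question "gives the most useful knowledge" is information gain, the entropy-based measure that decision-tree learners build on. The plain-Java sketch below picks the yes/no question whose answer best separates the career labels in a set of already labelled surveys. The class name, the data layout (`answers[person][question]`, integer labels from 0 to `numLabels - 1`), and the method names are assumptions.

```java
final class NextQuestionSketch {

    /** Shannon entropy in bits of a label distribution given as counts. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** Reduction in label entropy obtained by knowing the answer to one question. */
    static double informationGain(boolean[][] answers, int[] labels, int numLabels, int question) {
        int[] all = new int[numLabels];
        int[] yes = new int[numLabels];
        int[] no = new int[numLabels];
        int yesTotal = 0;
        for (int i = 0; i < answers.length; i++) {
            all[labels[i]]++;
            if (answers[i][question]) {
                yes[labels[i]]++;
                yesTotal++;
            } else {
                no[labels[i]]++;
            }
        }
        double n = answers.length;
        double remainder = (yesTotal / n) * entropy(yes) + ((n - yesTotal) / n) * entropy(no);
        return entropy(all) - remainder;
    }

    /** Index of the question with the highest information gain (first one wins ties). */
    static int bestQuestion(boolean[][] answers, int[] labels, int numLabels) {
        int best = 0;
        for (int q = 1; q < answers[0].length; q++) {
            if (informationGain(answers, labels, numLabels, q)
                    > informationGain(answers, labels, numLabels, best)) {
                best = q;
            }
        }
        return best;
    }
}
```

Asking the best question first, then repeating the calculation on only the surveys that match the user's answer, yields a question flow that adapts to each person.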
If you want specific software, look at Weka http://www.cs.waikato.ac.nz/ml/weka/ - it has many great algorithms for classifying. And it is a Java library, so you can easily use it within a web application.
Good luck.

Binary classification for web pages

We are interested in binary classification of web pages across the web, e.g. e-commerce vs. non-e-commerce.
Currently we are using the Mahout library with the Naive Bayes algorithm. We create the training data from existing classified URLs, and the feature set from the same pages.
What is the best way, in terms of accuracy, to perform this task?
I need help with algorithms, libraries (usable from Java), or any better ideas for this type of classification.
Thanks in advance.
The question is quite general so I can add only general information.
The ways to improve the quality of your classification are (in order of importance):
1. Use lemmatisation and/or stemming so that only base word forms are used.
2. Implement a word filter to remove useless words.
3. Train separate classifiers for different languages.
You may try to use some existing, well-tuned program,...
CRM114 is designed to be a spam filter, but it is generic enough to do what you want. People use it to sort résumés and the like. It has many engines (HMM, SVM, CLUMP, Bayes, etc.). Give it a try.
This one is a very good demonstration of the algorithm regarding NB classifier.
Discarding most common words would lead to better predictions. IDF can be a good tool for filtering out those words. Also see Wikipedia.
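As a sketch of IDF-based filtering, the plain-Java code below computes the inverse document frequency idf(t) = ln(N / df(t)) over a tokenized corpus and drops terms whose IDF falls below a cutoff. The class and method names are hypothetical, the cutoff is an assumption to be tuned, and unseen terms are kept because nothing is known about them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class IdfFilterSketch {

    /** idf(term) = ln(number of documents / number of documents containing the term). */
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term once per document, however often it occurs.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            result.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return result;
    }

    /** Keeps only terms whose IDF is at least minIdf; terms missing from the map are kept. */
    static List<String> keepInformative(List<String> doc, Map<String, Double> idf, double minIdf) {
        List<String> kept = new ArrayList<>();
        for (String term : doc) {
            if (idf.getOrDefault(term, Double.MAX_VALUE) >= minIdf) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

A term that appears in every page, such as "the", gets an IDF of 0 and is removed by any positive cutoff. The filtered token lists can then be fed to the Naive Bayes trainer in place of the raw ones.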

Design for a Debate club assignment application

For my university's debate club, I was asked to create an application to assign debate sessions, and I'm having some difficulty coming up with a good design for it. I will do it in Java. Here's what's needed:
What you need to know about BP debates: There are four teams of 2 debaters each and a judge. The four groups are assigned a specific position: gov1, gov2, op1, op2. There is no significance to the order within a team.
The goal of the application is to get as input the debaters who are present (for example, if there are 20 people, we will hold 2 debates) and assign them to teams and roles with regards to the history of each debater so that:
Each debater should debate with (be on the same team as) as many people as possible.
Each debater should uniformly debate in different positions.
The debate should be fair - debaters have different levels of experience and this should be as even as possible - i.e., there shouldn't be a team of two very experienced debaters and a team of junior debaters.
There should be an option for the user to restrict the assignment in various ways, such as:
Specifying that two people should debate together, in a specific position or not.
Specifying that a single debater should be in a specific position, regardless of the partner.
If anyone can try to give me some pointers for a design for this application, I'll be so thankful!
Also, I've never implemented a GUI before, so I'd appreciate some pointers on that as well, but it's not the major issue right now.
Also, there is the issue of keeping Debater information in file, which I also never implemented in Java, and would like some tips on that as well.
This seems like a textbook constraint problem. GUI notwithstanding, it'd be perfect for a technology like Prolog (ECLiPSe prolog has a couple of different Java integration libraries that ship with it).
But since you want this in Java, why not store the debaters' history in a SQL database and use SQL to express the constraints? You can then wrap those SQL queries as Java methods.
There are two parts (three if you count entering and/or saving the data), the underlying algorithm and the UI.
For the UI, I'm weird. I use this technique (there is a link to my sourceforge project). A Java version would have to be done, which would not be too hard. It's weird because very few people have ever used it, but it saves an order of magnitude coding effort.
For the algorithm, the problem looks small enough that I would approach it with a simple tree search. I would have a scoring algorithm and just report the schedule with the best score.
That's a bird's-eye overview of how I would approach it.
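A minimal sketch of the tree search with a scoring function, in plain Java: it enumerates every way to split the debaters into teams of two and keeps the split with the best score. Only the fairness goal is scored here (the spread between the strongest and weakest team, lower is better). The class name, the integer experience levels, and the assumption of an even number of debaters are all hypothetical; history, positions, and user restrictions would be added as further terms in `score` or as checks in `search`.

```java
import java.util.ArrayList;
import java.util.List;

final class DebatePairingSketch {
    private final int[] experience;
    private int bestScore = Integer.MAX_VALUE;
    private List<int[]> bestTeams = new ArrayList<>();

    /** experience[i] is debater i's experience level; the length is assumed to be even. */
    DebatePairingSketch(int[] experience) {
        this.experience = experience;
    }

    /** Runs the search and returns the best teams as pairs of debater indices. */
    List<int[]> solve() {
        search(new boolean[experience.length], new ArrayList<>());
        return bestTeams;
    }

    int bestScore() {
        return bestScore;
    }

    private void search(boolean[] used, List<int[]> teams) {
        int first = -1;
        for (int i = 0; i < used.length; i++) {
            if (!used[i]) {
                first = i;
                break;
            }
        }
        if (first < 0) { // everyone is assigned: this is a leaf of the search tree
            int s = score(teams);
            if (s < bestScore) {
                bestScore = s;
                bestTeams = new ArrayList<>(teams);
            }
            return;
        }
        used[first] = true;
        for (int j = first + 1; j < used.length; j++) {
            if (used[j]) {
                continue;
            }
            used[j] = true;
            teams.add(new int[] {first, j});
            search(used, teams);
            teams.remove(teams.size() - 1); // backtrack
            used[j] = false;
        }
        used[first] = false;
    }

    /** Spread between the strongest and weakest team; 0 means perfectly even. */
    private int score(List<int[]> teams) {
        if (teams.isEmpty()) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int[] team : teams) {
            int sum = experience[team[0]] + experience[team[1]];
            min = Math.min(min, sum);
            max = Math.max(max, sum);
        }
        return max - min;
    }
}
```

For eight debaters there are only 105 possible pairings, so exhaustive search is fast at the scale of a single debate. For a whole club night it may be worth pruning branches whose partial score already exceeds the best one found.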
