Algorithm to calculate similarity between texts - Java

I am trying to score similarity between posts from social networks, but haven't found any good algorithms for that. Thoughts?
I tried Levenshtein, Jaro-Winkler, and others, but those are better suited to comparing texts without regard to sentiment. In posts we can get one text saying "I really love dogs" and another saying "I really hate dogs", and we need to classify this case as totally different.
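To make the problem concrete, here is a quick Python sketch of Levenshtein distance (the Java version behaves the same): the two sentences differ by only 3 edits out of 18 characters, so edit distance rates them as nearly identical even though the sentiment is opposite.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

s1, s2 = "I really love dogs", "I really hate dogs"
print(levenshtein(s1, s2), "edits out of", len(s1), "characters")  # 3 edits out of 18 characters
```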
Thanks

Ahh... but "I really love dogs" and "I really hate dogs" are totally similar ;) since both discuss one's feelings towards dogs. It seems that you're missing a step in there:
Run your algorithm and get the general topic groups (i.e. "feelings towards dogs").
Run your algorithm again, but this time on each previously "discovered" group and let your algorithm further classify them into subgroups (i.e. "i hate dogs"/"i love dogs").
If your algorithm adjusts itself based on its experience (i.e. there is some learning involved), then make sure you run a separate instance of the algorithm for the first classification, and a new instance for each sub-classification... if you don't, you may end up with a case where you find some groups, and every time you run your algorithm on the same groups the results are nearly identical and/or nothing changes at all.
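As a toy illustration of the two-pass idea (the keyword lists below are made-up stand-ins for whatever your clustering algorithm actually discovers):

```python
# Toy two-pass grouping: first by topic, then by sentiment within each topic.
# The keyword lists are illustrative stand-ins for a real clustering step.
TOPICS = {"dogs": ["dog", "dogs"], "cats": ["cat", "cats"]}
SENTIMENTS = {"positive": ["love", "like"], "negative": ["hate", "dislike"]}

def assign(text, groups):
    words = text.lower().split()
    for label, keywords in groups.items():
        if any(w in keywords for w in words):
            return label
    return "unknown"

def two_pass(posts):
    result = {}
    for post in posts:
        topic = assign(post, TOPICS)          # pass 1: general topic group
        sentiment = assign(post, SENTIMENTS)  # pass 2: subgroup within the topic
        result.setdefault(topic, {}).setdefault(sentiment, []).append(post)
    return result

groups = two_pass(["I really love dogs", "I really hate dogs", "I like cats"])
print(groups["dogs"]["negative"])  # ['I really hate dogs']
```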
Update
Apache Mahout provides a lot of useful algorithms and examples of clustering, classification, genetic programming, decision forests, and recommendation mining. Here are some of the text classification examples from Mahout:
Wikipedia classification
Twenty Newsgroups classification
Creating Vectors from Text
Document Similarity with Mahout
Item Based Recommender
I'm not sure which one would best apply to your problem, but maybe if you look them over you'll figure out which one is the most suitable for your specific application.

My research is about sentiment analysis, and I agree with Pierre: it's a hard problem, and given its subjective nature, no general algorithm exists. One of the approaches I first tried was mapping the sentences into an emotional space and deciding on the sentiment based on the distance from the sentence to the sentiment centroids. You may have a look at it at:
http://dtminredis.housing.salle.url.edu:8080/EmoLib/
The sentences above work well ;)
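As a rough illustration of the centroid idea (the valence lexicon and centroid values below are made up; EmoLib's actual emotional space is multi-dimensional and data-driven):

```python
# Illustrative 1-D "emotional space": each word carries a made-up valence,
# a sentence is the mean valence of its known words, and the nearest
# sentiment centroid wins.
VALENCE = {"love": 0.9, "great": 0.8, "hate": -0.9, "awful": -0.8}
CENTROIDS = {"positive": 0.8, "neutral": 0.0, "negative": -0.8}

def sentence_point(text):
    scores = [VALENCE[w] for w in text.lower().split() if w in VALENCE]
    return sum(scores) / len(scores) if scores else 0.0

def classify(text):
    p = sentence_point(text)
    return min(CENTROIDS, key=lambda c: abs(p - CENTROIDS[c]))

print(classify("I really love dogs"))  # positive
print(classify("I really hate dogs"))  # negative
```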

You might want to have a look at Opinion mining and sentiment analysis to give you an idea of the complexity of the task.
Short answer: there are no "good algorithms" for this, only mediocre ones. This is a very hard problem. Good luck.

Related

Pre trained vectors, nlp, word2vec, word embedding for particular topic?

Is there any pretrained vector set for a particular topic only? For example "java": I want vectors related to Java in a file, meaning if I give the input "inheritance" then cosine similarity should show me "polymorphism" and other related terms only!
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as the corpus, but I'm still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of java source code you can use code2vec which provides pre-trained word-embeddings models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so their vector for 'Java' may be dominated by stories about the Indonesian island of Java as much as by the programming language. And many other vector sets are trained on Wikipedia text, which will be dominated by usages in that particular reference style of writing. But there could be other sets that better emphasize the word senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But, as far as I know, the exact parameters Google used have not been revealed. And good results from this algorithm can require a lot of data and tuning, and still produce some combinations that a person would consider nonsense while missing others that a person would consider natural.)
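For what it's worth, the statistical word-pair approach (option 2 above) can be sketched in plain Python with a scoring rule shaped like gensim's default Phrases score (simplified; the toy corpus and threshold here are illustrative, not tuned values):

```python
from collections import Counter

def find_phrases(sentences, min_count=2, threshold=0.5):
    """Combine word pairs whose co-occurrence score exceeds a threshold.

    The score has the shape of gensim's default Phrases scoring:
    (count(a,b) - min_count) * vocab_size / (count(a) * count(b)).
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases

corpus = [["input", "inheritance", "matters"],
          ["use", "input", "inheritance"],
          ["input", "inheritance", "and", "polymorphism"],
          ["input", "values", "vary"]]
print(find_phrases(corpus))  # {('input', 'inheritance')}
```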

Bias towards negative sentiments from Stanford CoreNLP

I'm experimenting with deriving sentiment from Twitter using Stanford's CoreNLP library, a la https://www.openshift.com/blogs/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java - so see here for the code that I'm implementing.
I am getting results, but I've noticed that there appears to be a bias towards 'negative' results, both in my target dataset and in another dataset I use that has ground truth - the Sanders Analytics Twitter Sentiment Corpus, http://www.sananalytics.com/lab/twitter-sentiment/ - even though the ground truth data do not have this bias.
I'm posting this question on the off chance that someone else has experienced this and/or may know if this is the result of something I've done or some bug in the CoreNLP code.
(edit - sorry it took me so long to respond)
I am posting links to plots showing what I mean. I don't have enough reputation to post the images, and can only include two links in this post, so I'll add the links in the comments.
I'd like to suggest this is simply a domain mismatch. The Stanford RNTN is trained on movie-review snippets and you are testing on Twitter data. Apart from the topic mismatch, tweets also tend to be ungrammatical and to use abbreviated ("creative") language.
If I had to suggest a more concrete reason, I would start with a lexical mismatch. Perhaps negative emotions are expressed in a domain-independent way, e.g. with common adjectives, and positive emotions are more domain-dependent or more subtle.
It's still interesting that you're getting a negative bias. The Pollyanna hypothesis suggests a positive bias, IMHO.
Going beyond your original question, there are several approaches to do sentiment analysis specifically on microblogging data. See e.g. "The Good, The Bad and the OMG!" by Kouloumpis et al.
Michael Haas points out correctly that there is a domain mismatch, which is also specified by Richard Socher in the comments section.
Sentences with a lot of unknown words and imperfect punctuation get flagged as negative.
If you are using Python, VADER is a great tool for Twitter sentiment analysis. It is a rule-based tool with only ~300 lines of code and a custom-made lexicon for Twitter, which has ~8000 entries including slang and emoticons.
It is easy to modify the rules as well as the lexicon, without any need for re-training. It is fully free and open source.
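To give a flavor of the rule-based approach (this is not VADER's actual lexicon or rule set, just a toy sketch in the same spirit):

```python
# Toy rule-based scorer in the spirit of VADER: lexicon lookup plus a
# simple negation rule. The lexicon values here are made up.
LEXICON = {"love": 3.0, "great": 2.5, "hate": -3.0, "awful": -2.5, ":)": 2.0}
NEGATORS = {"not", "never", "no"}

def polarity(text):
    tokens = text.lower().split()
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            value = LEXICON[tok]
            # Flip polarity when the previous token is a negator.
            if i > 0 and tokens[i - 1] in NEGATORS:
                value = -value
            score += value
    return score

print(polarity("I love this :)"))     # 5.0
print(polarity("not great , awful"))  # -5.0
```

Because there is no trained model, tweaking behavior is just a matter of editing the lexicon or the rules, which is exactly why VADER needs no re-training.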

How to make career guidance system intelligent

At last I am working on my final year project, an intelligent web-based career guidance system. The core functionality of my system is a
Recommendation System
Basically, our recommendation system will carefully examine user preferences by administering interest tests and reviewing the user's academic record, and on the basis of this information it will give the user the best career options, i.e. a course like BS Computer Science, etc.
The input of the recommendation system will be the student's credentials and the interest test. In the interest test, the questions will be chosen according to the user's academic history and the answers he gives during the test, so the test will not ask everyone the same questions; it will decide in real time what to ask each user according to rules defined by the system.
Its output will be the set of fields chosen on the basis of the interest test.
Problem
When I was defending my scope in front of the committee, they said "this is simple if-else; this system is not intelligent."
My question is: which AI technique or algorithm could be used to make this system intelligent? I have searched a lot, but the papers related to my system are quite superficial; they emphasize the idea, not the methodology.
I want to do all my work in Java, so it would be great if the answer is technology-specific.
Feel free to migrate my question to another Stack Exchange site if it does not meet SO's Q&A criteria.
Edit
After getting some ideas from the answers, I want to implement an expert system with a rule-based inference engine. Now I want to be clearer on the technology to implement the rule engine. After searching I have found Drools to be the best fit, but is it also compatible with web applications? I also found Tohu to be a good dynamic form generator (which my project also needs). Can I use Tohu with Drools to build my web application? Is this type of system easy to implement or not?
If you have a large number of questions, each of them can represent a feature. Assuming you are going to have a LOT of features, finding the series of if-else statements that fulfills the criteria is hard (recall that a full tree over n yes/no questions has 2^n leaves, representing the 2^n possible combinations of answers).
Since hand-coding the above is not feasible for a large enough (and probably realistic) n, there is a place for heuristic solutions. One of those is Machine Learning, and specifically the classification problem. You can have a sample of people answer your survey, with an "expert" saying what the best career is for them, and let an algorithm find a classifier for the general problem. (If you want to convert it into a series of yes/no questions automatically, that can be done with a decision tree, built by an algorithm like C4.5.)
It can also be important to determine which questions are actually relevant. Is gender relevant? Is height relevant? These questions can themselves be answered using ML, with feature selection or dimensionality reduction algorithms (PCA, for example).
Regarding the "technology" aspect - there is a nice library in java - called Weka which implement many of the classification algorithms out there.
One question you could ask (and try to answer in your project) is which classification algorithm will be best for this problem. Some possibilities are the above-mentioned C4.5, Naive Bayes, linear regression, neural networks, KNN, or SVM (which usually turned out best for me). You can back your choice of algorithm with a statistical study and a statistical proof of which is better; the Wilcoxon test is the standard for this.
EDIT: more details on point 2:
Here an "expert" can be a human classifier from the field of HR who reads the features and classifies the answers. Obtaining this data (usually called the "training data") is sometimes hard and expensive; if your university has an IE or HR faculty, maybe they will be willing to help.
The idea is: Gather a bunch of people who first answer your survey. Then, give it to a human classifier ("expert") which will chose what is the best career for this person, based on his answers. The data with the classification given by the expert is the input of the learning algorithm, its output will be a classifier.
A classifier is itself a function that, given the answers to a survey, predicts the "classification" (suggested career) for the person who took it.
Note that once you have a classifier - you do not need to maintain the training data any more, the classifier alone is enough. However, you should have your list of questions and the answers for these questions will be the features provided to the classifier.
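As a minimal sketch of that workflow (made-up answer vectors and labels; a real system would use one of the algorithms above rather than this toy 1-nearest-neighbour rule):

```python
# Toy training data: yes/no answers (1/0) to survey questions, labelled
# by a hypothetical human expert with a suggested career.
TRAINING = [
    ([1, 1, 0, 0], "Computer Science"),
    ([1, 0, 1, 0], "Computer Science"),
    ([0, 0, 1, 1], "Medicine"),
    ([0, 1, 1, 1], "Medicine"),
]

def hamming(a, b):
    """Number of questions the two answer vectors disagree on."""
    return sum(x != y for x, y in zip(a, b))

def classify(answers):
    """The learned 'classifier': the nearest training example wins.

    Once this function exists, the training data is no longer needed
    at prediction time (here it is kept only because the toy classifier
    is lazy; a trained model would discard it)."""
    _, label = min((hamming(answers, feats), label) for feats, label in TRAINING)
    return label

print(classify([1, 1, 1, 0]))  # Computer Science
```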
All you have to do to satisfy them is create a simple learning system:
Change your thesis terminology so it is described as "learning the best career" instead of using the word "intelligent". Learning is a form of artificial intelligence.
Create a training regime. Do this by giving the questionnaire to people that already have careers and also ask questions to find out how satisfied they are with their career. That way your system can train on what makes a good career match and what makes a bad one.
Choose a learning system to absorb the data from (2). For example, one source of ideas might be this recent paper: http://journals.cluteonline.com/index.php/RBIS/article/download/4405/4493. Sum-product networks are cutting edge in AI and apply well to expert-system-like problems.
Finally, try to give a twist to whatever your technology is to make it specific to your problem.
In my final project, I had some experience with the Jena RDF inference engine. Basically, what you do with it is create a sort of knowledge base with rules like "if the user chose this answer, he has that quality" and "if the user has those qualities, he might be good for that job". Adding answers into the system lets you query the user's current status and adjust the questions accordingly. It's pretty easy to create a proof of concept with it, it's easier than a bunch of if-else, and if your professors worship prolog-ish style things, they'll like it.
As @amit suggested, Bayesian analysis can provide guidance on the next question to ask. Another pitfall of dynamic tests is artificial thresholds ("if your score is 28, you are in this category; if your score is 27, you are not"), a problem which fuzzy logic can help address. Another benefit of fuzzy logic is that adding a new category is relatively easy, since the domain expert is only asked to contribute qualitative assessments, not quantitative thresholds.
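To illustrate how fuzzy membership softens those artificial thresholds (the category boundaries below are made up), scores of 27 and 28 end up with gradually different degrees of membership instead of flipping categories:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 below a, ramps up to 1 between b and c,
    then ramps back down to 0 at d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def memberships(score):
    # Illustrative categories with overlapping, gradual boundaries.
    return {
        "low":    trapezoid(score, -1, 0, 20, 30),
        "medium": trapezoid(score, 20, 30, 40, 50),
        "high":   trapezoid(score, 40, 50, 60, 61),
    }

# "low" fades out as "medium" fades in; no hard 27-vs-28 cliff.
print(memberships(27))
print(memberships(28))
```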
A program is never more intelligent than the person who wrote it. So, I would first use the collective intelligence that has been built and open sourced already.
Pass your set of known data points as an input to Apache Mahout's PearsonCorrelationSimilarity and use the output to predict which course is the best match. In addition to being open source and scalable, you can also record the outcome and feed it back to the system to improve the accuracy over time. It is very hard to match this level of performance because it is a lot easier to tweak an out of the box algorithm or replace it with your own than it is to deal with a bunch of if else conditions.
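The Pearson similarity that PearsonCorrelationSimilarity computes can be sketched in plain Python (toy ratings; Mahout handles sparse data, missing items, and scaling for you):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: two students' answers on the same interest questions.
alice = [5, 3, 4, 4]
bob   = [4, 2, 3, 3]
print(round(pearson(alice, bob), 3))  # 1.0 - perfectly correlated preferences
```

Users whose answer patterns correlate highly can then be recommended the careers that worked for each other, which is the heart of the collaborative-filtering approach described above.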
I would suggest reading this book. It contains an example of how to use PearsonCorrelationSimilarity.
Mahout also has built-in recommender algorithms like NearestNeighborClusterSimilarity that can simplify your solution further.
There's a good starter code in the book. You can build on it.
Student credentials, Interest Test Questions and answers are inputs. Career choice is the output that you can co-relate to the input. Now that's a very simplistic approach but it might be ok to start with. Eventually, you will have to apply the classifier techniques that Amit has suggested and Mahout can help you with that as well.
Drools can be used via the web, but watch out; it can be a bit of a beast to configure and is likely serious overkill for your application. It is an 'enterprise' type of solution focused on rule management rather than rule execution.
Drools is an "IF-THEN" system, and pretty much all rules engines use the Rete algorithm (http://en.wikipedia.org/wiki/Rete_algorithm) - so if your original question is about how not to use an IF-THEN system, Drools is not the right choice. There are Solver and Planner parts of Drools that are not IF-THEN algorithms, but they are not the main Drools algorithm.
That said, it seems like a reasonable choice for your application. Just don't expect it to be considered an 'intelligent' system by those who deem themselves as experts. Rules engines are typically used to codify (that is, make software of) the rules and regulations of business, such as 'should you be approved for a mortgage' or 'how much is your car insurance' and so on. 'what job you should do' is a reasonable application of the same.
If you want to add more AI like intelligence here are a few ideas
Use machine learning to get feedback from the user about earlier recommendations. So, if someone likes or hates a suggestion, add that back in as a feature of the person. You are now doing some basic feedback/reinforcement learning (bayes, neural nets) to try to better classify the person to the career.
Consider the questions you ask the person. Do you need to ask all of the questions? If you can alter the flow of questions based on their responses (by estimating what kind of person they are) then you are trying to learn the series of questions that gives the most useful knowledge for a recommendation.
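That second idea can be sketched as an information-gain computation: ask next whichever question most reduces uncertainty about the career label (toy data below):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, q):
    """Expected entropy reduction from asking yes/no question q."""
    gain = entropy(labels)
    for answer in (0, 1):
        subset = [lab for row, lab in zip(rows, labels) if row[q] == answer]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy data: each row holds the answers to questions 0 and 1; label = career.
rows = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels = ["CS", "CS", "Med", "Med"]
best = max(range(2), key=lambda q: information_gain(rows, labels, q))
print(best)  # question 0 - it splits the careers perfectly
```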
If you want specific software, look at Weka http://www.cs.waikato.ac.nz/ml/weka/ - it has many great algorithms for classifying. And it is a Java library, so you can easily use it within a web application.
Good luck.

ML technique for classification with probability estimates

I want to implement an OCR system. I need my program to not make any mistakes on the letters it does choose to recognize. It doesn't matter if it cannot recognize a lot of them (i.e. high precision even with low recall is okay).
Can someone help me choose a suitable ML algorithm for this? I've been looking around and have found some confusing things. For example, I found contradicting statements about SVMs: the scikit-learn docs mention that we cannot get probability estimates for an SVM, whereas another post says it is possible to do this in WEKA.
Anyway, I am looking for a machine learning algorithm that best suits this purpose. It would be great if you could suggest a library for the algorithm as well. I prefer Python-based solutions, but I am OK with Java as well.
It is possible to get probability estimates from SVMs in scikit-learn by simply setting probability=True when constructing the SVC object. The docs only warn that the probability estimates might not be very good.
The quintessential probabilistic classifier is logistic regression, so you might give that a try. Note that LR is a linear model though, unlike SVMs which can learn complicated non-linear decision boundaries by using kernels.
I've seen people using neural networks with good results, but that was already a few years ago. I asked an expert colleague and he said that nowadays people use things like nearest-neighbor classifiers.
I don't know scikit or WEKA, but any half-decent classification package should have at least k-nearest neighbors implemented. Or you can implement it yourself; it's ridiculously easy. Give it a try: it will probably have lower precision than you want, but you can make a slight modification where, instead of taking a simple majority vote (i.e. the most frequent class among the neighbors wins), you require a larger consensus among the neighbors to assign a class (for example, at least 50% of the neighbors must be of the same class). The larger the consensus you require, the higher your precision will be, at the expense of recall.
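A minimal sketch of that consensus variant (1-D toy features; a real OCR feature vector would be higher-dimensional, but the abstention logic is the same):

```python
from collections import Counter

def knn_with_consensus(train, x, k=3, consensus=1.0):
    """Return the majority label among the k nearest neighbours, or
    None (abstain) when less than `consensus` of them agree."""
    neighbours = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    label, votes = Counter(lab for _, lab in neighbours).most_common(1)[0]
    return label if votes / k >= consensus else None

# Toy data: the feature is a single number, the label a character class.
train = [(1.0, "a"), (1.1, "a"), (1.2, "a"),
         (5.0, "b"), (5.1, "b"), (1.4, "b")]
print(knn_with_consensus(train, 1.05, k=3, consensus=1.0))  # 'a' - unanimous
print(knn_with_consensus(train, 1.3, k=3, consensus=1.0))   # None - abstain
```

Raising `consensus` makes the classifier refuse more borderline cases, which trades recall for the high precision the question asks for.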

Point me in the right direction on NLP datastructures and search algorithm

I've got a school assignment to make a language analyzer that's able to guess the language of an input. The assignment states this has to be done by pre-parsing texts of known languages and building statistics about the letters used, combinations of letters, etc., and then making a guess based on this data.
The data structure we're supposed to use is simple multi-dimensional hashtables, but I'd like to take this opportunity to learn a bit more about implementing structures. What I'd like to know is what to read up on. My knowledge of algorithms is very limited, but I'm keen to learn if someone could point me in the right direction.
Without any real knowledge, just from reading different posts, I'm currently planning to study undirected graphs as a data structure for letter combinations (somehow storing the statistics within the graph as well) and Boyer-Moore for the per-word search algorithm.
Am I totally on the wrong track, would these be impossible to implement in this situation, or is there something superior for this problem?
If you can get your hands on a copy of Cormen et al. "Introduction to Algorithms"
http://www.amazon.com/Introduction-Algorithms-Second-Thomas-Cormen/dp/0262032937
It's a very, very good book for reading up on data structures and algorithms.
Language detection using character trigrams
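A character-trigram detector along those lines can be sketched in a few lines (toy training snippets; real systems use much larger profiles and better-normalized scores):

```python
from collections import Counter

def trigrams(text):
    """Character trigram counts, with spaces padding the word boundaries."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train(samples):
    """Build a trigram frequency profile per language."""
    return {lang: trigrams(text) for lang, text in samples.items()}

def guess(profiles, text):
    """Pick the language whose profile shares the most trigram mass
    with the input."""
    query = trigrams(text)
    def overlap(profile):
        return sum(min(n, profile[g]) for g, n in query.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

profiles = train({
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "dutch":   "de snelle bruine vos springt over de luie hond en de kat",
})
print(guess(profiles, "the dog and the fox"))  # english
```

The hashtable the assignment mandates is exactly the `Counter` here (a hash map from trigram to count), so this is compatible with the required structure while leaving room to experiment with fancier ones.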
