Bias towards negative sentiments from Stanford CoreNLP - java

I'm experimenting with deriving sentiment from Twitter using Stanford's CoreNLP library, a la https://www.openshift.com/blogs/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java - so see here for the code that I'm implementing.
I am getting results, but I've noticed that there appears to be a bias towards 'negative' results, both in my target dataset and another dataset I use with ground truth - the Sanders Analytics Twitter Sentiment Corpus http://www.sananalytics.com/lab/twitter-sentiment/ - even though the ground truth data do not have this bias.
I'm posting this question on the off chance that someone else has experienced this and/or may know if this is the result of something I've done or some bug in the CoreNLP code.
(edit - sorry it took me so long to respond)
I am posting links to plots showing what I mean. I don't have enough reputation to post the images, and can only include two links in this post, so I'll add the links in the comments.

I'd like to suggest this is simply a domain mismatch. The Stanford RNTN is trained on movie review snippets and you are testing on Twitter data. Apart from the topic mismatch, tweets also tend to be ungrammatical and to use abbreviated ("creative") language.
If I had to suggest a more concrete reason, I would start with a lexical mismatch. Perhaps negative emotions are expressed in a domain-independent way, e.g. with common adjectives, and positive emotions are more domain-dependent or more subtle.
It's still interesting that you're getting a negative bias. The Pollyanna hypothesis would suggest a positive bias, IMHO.
Going beyond your original question, there are several approaches to do sentiment analysis specifically on microblogging data. See e.g. "The Good, The Bad and the OMG!" by Kouloumpis et al.

Michael Haas correctly points out that there is a domain mismatch, which Richard Socher also confirms in the comments section.
Sentences with a lot of unknown words and imperfect punctuation get flagged as negative.
If you are using Python, VADER is a great tool for Twitter sentiment analysis. It is a rule-based tool with only ~300 lines of code and a custom-made lexicon for Twitter, which has ~8000 entries including slang terms and emoticons.
It is easy to modify the rules as well as the lexicon, without any need for re-training. It is fully free and open source.
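As a rough illustration of how it is used (assuming the vaderSentiment package is installed via pip; the example tweets are made up):

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    # Made-up example tweets, just for illustration.
    tweets = [
        "OMG this new phone is sooo good :D #love",
        "worst customer service ever, never buying again",
    ]

    for tweet in tweets:
        # polarity_scores returns 'neg', 'neu', 'pos' and a normalized 'compound' score.
        scores = analyzer.polarity_scores(tweet)
        print(tweet, "->", scores["compound"])

A common convention is to treat compound >= 0.05 as positive and <= -0.05 as negative, but you can adjust those cutoffs for your data.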

Related

Pre-trained vectors, NLP, word2vec, word embeddings for a particular topic?

Is there any pretrained vector set for one particular topic only? For example "java": I want vectors related to Java in a file, meaning that if I give the input "inheritance", then cosine similarity should show me "polymorphism" and other related terms only!
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as my corpus, but I am still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of Java source code you can use code2vec, which provides pre-trained word-embedding models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so their vector for 'Java' may be dominated as much by stories about the Indonesian island of Java as by the programming language. And many other vector sets are trained on Wikipedia text, which will be dominated by usages in that particular reference style of writing. But there could be other sets that better emphasize the word senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
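For example, a minimal sketch with gensim's Word2Vec (assuming gensim 4.x; my_corpus.txt is a hypothetical file with one sentence per line drawn from your domain, e.g. Java tutorials and Stack Overflow posts):

    from gensim.models import Word2Vec

    # Hypothetical domain corpus: one pre-tokenizable sentence per line.
    with open("my_corpus.txt", encoding="utf-8") as f:
        sentences = [line.lower().split() for line in f]

    # Parameters are illustrative, not tuned.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

    # Words used in similar contexts in *your* corpus end up close together.
    print(model.wv.most_similar("inheritance", topn=10))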
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But, as far as I know, the exact parameters Google used have not been revealed. And good results from this algorithm can require a lot of data and tuning, and it may still produce some combinations that a person would consider nonsense while missing others that a person would consider natural.)
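A small sketch of approach (2) above, using that Phrases class (same gensim 4.x assumption; the toy sentences and thresholds are only illustrative, and with so little data the merges it finds will be unreliable):

    from gensim.models.phrases import Phrases

    # Toy tokenized corpus; in practice use the same large domain corpus as above.
    sentences = [
        ["input", "inheritance", "is", "discussed", "here"],
        ["polymorphism", "and", "input", "inheritance", "are", "related"],
    ]

    # Merge word pairs that co-occur often enough according to the scoring threshold.
    bigram_model = Phrases(sentences, min_count=1, threshold=1)

    # Frequent pairs are rewritten as single tokens such as 'input_inheritance'.
    print(bigram_model[["input", "inheritance", "is", "discussed", "here"]])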

NLP - Determine whether a piece of text is talking about a given topic?

I have a Java application where I'm looking to determine in real time whether a given piece of text is talking about a topic supplied as a query.
Some techniques I've looked into for this are coreference detection with packages like OpenNLP and Stanford NLP coref, but these models take an extremely long time to load and don't seem practical in a production application environment. Is it possible to perform coreference analysis such that, given a piece of text and a topic, I can get a boolean answer as to whether the text is discussing the topic?
Other than document classification which requires a trained corpus, are there any other techniques that can help me achieve such a thing?
I suggest having a look at Weka. It is written in Java so it will gel well with your environment, it will be faster for your kind of requirement, it has lots of tools, and it comes with a UI as well as an API. If you are looking at an unsupervised approach (that is, one without any learning from a pre-classified corpus), here is an interesting paper: http://www.newdesign.aclweb.org/anthology/C/C00/C00-1066.pdf
You can also search for "unsupervised text classification / information retrieval" on Google; you will get lots of approaches and can choose whichever you find easiest.
For each topic (if they are predefined) you can create a list of terms; then, for each sentence, compute the cosine similarity between the sentence and each topic's term list and show the user the closest topic.
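A bare-bones sketch of that idea in Python (the topic term lists and the sentence are made up; a real system would want stemming and TF-IDF weighting rather than raw counts):

    import math
    from collections import Counter

    # Hypothetical predefined topics, each as a bag of terms.
    topics = {
        "sports":  ["match", "goal", "team", "score", "league"],
        "finance": ["stock", "market", "shares", "price", "trading"],
    }

    def cosine(a: Counter, b: Counter) -> float:
        common = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in common)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def closest_topic(sentence: str) -> str:
        words = Counter(sentence.lower().split())
        scores = {name: cosine(words, Counter(terms)) for name, terms in topics.items()}
        return max(scores, key=scores.get)

    print(closest_topic("The team scored a late goal to win the match"))  # -> sports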

ML technique for classification with probability estimates

I want to implement an OCR system. I need my program not to make any mistakes on the letters it does choose to recognize. It doesn't matter if it cannot recognize a lot of them (i.e. high precision even with low recall is okay).
Can someone help me choose a suitable ML algorithm for this? I've been looking around and found some confusing things. For example, I found contradicting statements about SVM: the scikit-learn docs mention that we cannot get probability estimates for SVM, whereas I found another post that says it is possible to do this in WEKA.
Anyway, I am looking for a machine learning algorithm that best suits this purpose. It would be great if you could suggest a library for the algorithm as well. I prefer Python-based solutions, but I am OK with working in Java as well.
It is possible to get probability estimates from SVMs in scikit-learn by simply setting probability=True when constructing the SVC object. The docs only warn that the probability estimates might not be very good.
The quintessential probabilistic classifier is logistic regression, so you might give that a try. Note that LR is a linear model though, unlike SVMs which can learn complicated non-linear decision boundaries by using kernels.
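A quick sketch of both options in scikit-learn, using the bundled digits dataset as a stand-in for your OCR data (parameters and the 0.95 cutoff are just illustrative):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # SVM with probability estimates enabled (fitted internally via Platt scaling).
    svm = SVC(probability=True).fit(X_train, y_train)
    svm_proba = svm.predict_proba(X_test)

    # Logistic regression is probabilistic by construction.
    lr = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    lr_proba = lr.predict_proba(X_test)

    # Only act on letters where the model is very confident; abstain otherwise.
    threshold = 0.95
    for name, proba in [("SVM", svm_proba), ("LogReg", lr_proba)]:
        keep = proba.max(axis=1) >= threshold
        print(name, "answers", f"{keep.mean():.0%}", "of the test digits")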
I've seen people using neural networks with good results, but that was already a few years ago. I asked an expert colleague and he said that nowadays people use things like nearest-neighbor classifiers.
I don't know scikit or WEKA, but any half-decent classification package should have at least k-nearest neighbors implemented; or you can implement it yourself, it's ridiculously easy. Give that one a try. On its own it will probably have lower precision than you want, but you can make a slight modification: instead of taking a simple majority vote (i.e. the most frequent class among the neighbors wins), require a larger consensus among the neighbors before assigning a class (for example, at least 50% of the neighbors must be of the same class). The larger the consensus you require, the higher your precision will be, at the expense of recall.
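A sketch of that modification with scikit-learn's k-NN: with uniform weights, predict_proba is simply the fraction of neighbors voting for each class, so requiring e.g. 8 of 10 neighbors to agree implements the stricter vote (dataset and numbers are illustrative):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
    pred = knn.predict(X_test)

    # Fraction of the 10 neighbors voting for each class.
    proba = knn.predict_proba(X_test)

    # Require at least 8 of 10 neighbors to agree; otherwise refuse to answer.
    consensus = proba.max(axis=1) >= 0.8
    accepted = consensus.sum()
    correct = (pred[consensus] == y_test[consensus]).sum()
    print(f"Answered {accepted}/{len(y_test)} digits, {correct} of them correct")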

Defining Trending Topics in a specific collection of tweets

I'm building a Java application where I have to determine the Trending Topics from a specific collection of tweets obtained through the Twitter Search. While searching the web, I found that the usual approach considers a topic to be trending when it has a large number of mentions within a specific window of time, i.e. right at that moment; so there must be a decay calculation so that the topics change often. However, I have another doubt:
How does Twitter determine which specific terms in a tweet should be the TT? For example, I've observed that most TTs are hashtags or proper nouns. Does this make any sense? Or do they analyse all words and determine the frequency?
I hope someone can help me! Thanks!
I don't think anyone knows except Twitter; however, it seems hashtags do play a big part, though there are other factors in play. I think mining the whole text would take more time than needed and would result in too many false positives.
Here is an interesting article from Mashable:
http://www.sparkmediasolutions.com/pdfs/SMS_Twitter_Trending.pdf
-Ralph Winters
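To make the "mentions with decay" idea from the question concrete, here is a rough sketch restricted to hashtags, as suggested above (the half-life and the sample tweets are made up):

    import math
    import time
    from collections import defaultdict

    HALF_LIFE_SECONDS = 3600.0  # hypothetical: a mention loses half its weight per hour

    def trending(tweets, now=None):
        """tweets: iterable of (timestamp, text); returns hashtags ranked by decayed count."""
        now = now if now is not None else time.time()
        scores = defaultdict(float)
        for ts, text in tweets:
            weight = math.exp(-math.log(2) * (now - ts) / HALF_LIFE_SECONDS)
            for token in text.split():
                if token.startswith("#"):
                    scores[token.lower()] += weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    now = time.time()
    sample = [
        (now - 60,    "Loving the new release #java"),
        (now - 120,   "Bug fixed at last! #java"),
        (now - 86400, "Throwback to last year's meetup #python"),
    ]
    print(trending(sample, now))  # recent #java mentions outrank the old #python one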
You may be interested in meme tracking, which, as I recall, does interesting things with proper nouns, but basically identifies topics in a stream as they become more and less popular.
See also Eddi, a system for interactive topic-based browsing of social status streams.

algorithm to calculate similarity between texts

I am trying to score similarity between posts from social networks, but I haven't found any good algorithms for that. Any thoughts?
I have tried Levenshtein, Jaro-Winkler, and others, but those are better suited to comparing texts without sentiment. With posts we can get one text saying "I really love dogs" and another saying "I really hate dogs", and we need to classify this case as totally different.
Thanks
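To see the problem concretely, here is a tiny example using Python's standard-library difflib as a stand-in for edit-distance-style similarity; it rates the two posts as nearly identical even though their sentiment is opposite:

    from difflib import SequenceMatcher

    a = "I really love dogs"
    b = "I really hate dogs"

    # Character-level similarity ratio in [0, 1]; about 0.83 here despite opposite sentiment.
    print(SequenceMatcher(None, a, b).ratio())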
Ahh... but "I really love dogs" and "I really hate dogs" are totally similar ;), both discuss one's feelings towards dogs. It seems that you're missing a step in there:
Run your algorithm and get the general topic groups (i.e. "feelings towards dogs").
Run your algorithm again, but this time on each previously "discovered" group and let your algorithm further classify them into subgroups (i.e. "i hate dogs"/"i love dogs").
If your algorithm adjusts itself based on its experience (i.e. there is some learning involved), then make sure you run separate instances of the algorithm for the first classification and a new instance for each sub-classification. If you don't, you may end up with a case where you find some groups, and any time you run your algorithm on the same groups the results are nearly identical and/or nothing has changed at all.
Update
Apache Mahout provides a lot of useful algorithms and examples for clustering, classification, genetic programming, decision forests, and recommendation mining. Here are some of the text classification examples from Mahout:
Wikipedia classification
Twenty Newsgroups classification
Creating Vectors from Text
Document Similarity with Mahout
Item Based Recommender
I'm not sure which one would best apply to your problem, but maybe if you look them over you'll figure out which one is the most suitable for your specific application.
My research is about sentiment analysis, and I agree with Pierre: it's a hard problem, and given its subjective nature, no general algorithm exists. One of the approaches I first tried was mapping sentences into an emotional space and deciding on a sentence's sentiment based on its distance to the sentiment centroids. You may have a look at it at:
http://dtminredis.housing.salle.url.edu:8080/EmoLib/
The sentences above work well ;)
You might want to have a look at Opinion mining and sentiment analysis to give you an idea of the complexity of the task.
Short answer: there are no "good algorithms" for this, only mediocre ones. And this is a very hard problem. Good luck.
