I'm doing a Java application where I have to determine the Trending Topics from a specific collection of tweets obtained through the Twitter Search. While searching the web, I found out that the algorithm defines a topic as trending when it has a large number of mentions in a specific time window, that is, at the exact moment. So there must be a decay calculation so that the topics change often. However, I have another doubt:
How does Twitter determine which specific terms in a tweet should be the TT? For example, I've observed that most TTs are hashtags or proper nouns. Does this make any sense? Or do they analyse all words and determine the frequency?
I hope someone can help me! Thanks!
I don't think anyone knows except Twitter; however, it seems hashtags do play a big part, but there are other factors in play. I think mining the whole text would take more time than needed and would result in too many false positives.
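To make the "mentions in a time window plus decay" idea from the question a bit more concrete, here is a minimal Java sketch that counts hashtag mentions only (as suggested above) and weights each one with an exponential time decay. The class name and the one-hour half-life are made-up choices for illustration, not how Twitter actually does it:

import java.util.HashMap;
import java.util.Map;

public class TrendingHashtags {

    // Hypothetical half-life of a mention: after one hour a mention counts half as much.
    private static final double HALF_LIFE_MS = 60 * 60 * 1000.0;

    private final Map<String, Double> scores = new HashMap<>();

    // Register one mention of a hashtag observed at `timestampMs`,
    // weighted by how old it is relative to `nowMs`.
    public void addMention(String hashtag, long timestampMs, long nowMs) {
        double ageMs = Math.max(0, nowMs - timestampMs);
        double weight = Math.pow(0.5, ageMs / HALF_LIFE_MS); // exponential decay
        scores.merge(hashtag.toLowerCase(), weight, Double::sum);
    }

    // Current decayed score per hashtag; the highest scores are the "trending" candidates.
    public Map<String, Double> getScores() {
        return scores;
    }
}

With a scheme like this, an old burst of mentions fades away on its own, so the ranking changes over time without any explicit reset.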
Here is an interesting article from Mashable:
http://www.sparkmediasolutions.com/pdfs/SMS_Twitter_Trending.pdf
-Ralph Winters
You may be interested in meme tracking, which, as I recall, does interesting things with proper nouns, but basically identifies topics in a stream as they become more and less popular.
See also the Eddi paper on interactive topic-based browsing of social status streams.
I'm using OCR to recognize (German) text in an image. It works well but not perfectly. Sometimes a word gets messed up. Therefore, I want to implement some sort of validation. Of course, I can just use a word list and find words that are similar to the messed-up word, but is there a way to check if the sentence is plausible with these words?
After all, my smartphone can give me good suggestions on how to complete a sentence.
You need to look at Natural Language Processing (NLP) solutions. With them, you can validate the text syntactically (either the whole text, which may work better since some tools take the context into consideration, or phrase by phrase).
I am not an expert in the area, but this article can help you choose a tool to start with.
Also, please note: the keyboard on your phone is developed and maintained by specialized teams at Apple, Google, or whichever company makes the app you use. So please don't underestimate this task: there are dozens of research areas involved, bringing together software engineers and linguistics specialists to achieve proper results.
Edit: well, two days later, I've just come across this link: https://medium.com/quick-code/12-best-natural-language-processing-courses-2019-updated-2a6c28aebd48
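To make the idea of "checking whether the sentence is plausible" more concrete, here is a toy Java sketch that scores a word sequence with a bigram language model built from a reference corpus, with add-one smoothing. This is only the underlying idea in miniature (the class name and corpus are hypothetical); real NLP toolkits use far larger models and better smoothing:

import java.util.HashMap;
import java.util.Map;

public class BigramPlausibility {

    private final Map<String, Integer> unigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();

    // Count unigrams and bigrams from a reference corpus.
    public void train(String corpus) {
        String[] tokens = corpus.toLowerCase().split("\\W+");
        for (int i = 0; i < tokens.length; i++) {
            unigrams.merge(tokens[i], 1, Integer::sum);
            if (i + 1 < tokens.length) {
                bigrams.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }
    }

    // Average log-probability of the sentence under the bigram model (add-one smoothing).
    // A candidate correction that scores higher fits the corpus better.
    public double score(String sentence) {
        String[] tokens = sentence.toLowerCase().split("\\W+");
        double logProb = 0;
        int pairs = 0;
        int vocabSize = unigrams.size() + 1;
        for (int i = 0; i + 1 < tokens.length; i++) {
            int bigramCount = bigrams.getOrDefault(tokens[i] + " " + tokens[i + 1], 0);
            int unigramCount = unigrams.getOrDefault(tokens[i], 0);
            logProb += Math.log((bigramCount + 1.0) / (unigramCount + vocabSize));
            pairs++;
        }
        return pairs == 0 ? Double.NEGATIVE_INFINITY : logProb / pairs;
    }
}

You could score each candidate sentence (one per alternative word from your word list) and keep the highest-scoring one.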
I am creating a sort of forum, and would like to create a function that gives the user suggestions for similar posts before posting a new post, just like on SO.
I am not sure how to create this function in the most efficient way.
What would the parameters be to determine a similar post? This needs to be searched by the title of the post only, but I'm still not sure what the logic behind it would be.
Also, I am working with Cloud Firestore and I am charged per read, so the logic needs to be efficient in the sense of not reading every document in my database to bring up the relevant posts. It should be some smart query.
Would appreciate input and advice on that, thanks!
** I've tried to google this many times, but the search query brings up unrelated results ("how to create similar posts"). So if there's information out there about this, I'd love a link too; I couldn't find it myself.
You can achieve this by comparing the user's input string on each keystroke with the titles stored in the database (or any other data structure), using either a custom string-searching algorithm or the Apache Commons library, which provides efficient algorithms for calculating string similarity.
Levenshtein distance is one of the more popular algorithms; with Levenshtein distance, the lower the score, the more similar the strings are:
StringUtils.getLevenshteinDistance("book", "back") == 2
StringUtils.getLevenshteinDistance("gold", "cold") == 1
StringUtils.getLevenshteinDistance("gold", "coin") == 3
I want to use Lucene to process several million news documents. I'm quite new to Lucene, so I'm trying to learn more and more about how it works.
From several tutorials throughout the web I found the class TopScoreDocCollector to be of high relevance for querying the Lucene index.
You create it like this:
int hitsPerPage = 10000;
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
and it will later collect the results of your query (only the amount you defined in hitsPerPage). I initially thought the results taken in would just be randomly distributed or something (like if you have 100,000 documents that match your query, you just get a random 10,000 of them). I now think I was wrong.
After reading more and more about Lucene I came to the javadoc of the class (please see here). There it says:
A Collector implementation that collects the top-scoring hits,
returning them as a TopDocs
So it now seems to me that Lucene uses some very smart technology to somehow return the top-scored documents for my input query. But how does that Scorer work? What does it take into account? I've extended my research on this topic but have not found an answer I completely understood so far.
Can you explain to me how the Scorer in TopScoreDocCollector scores my news documents, and whether this can be of use for me?
Lucene uses an inverted index to produce an iterator over the list of doc ids that match your query.
It then goes through each one of them and computes a score. By default that score is based on so-called TF-IDF. In a nutshell, it takes into account how many times the terms of your query appear in the document, and also the number of documents that contain the term.
The idea is that if you look for (warehouse work), having the word work many times is not as significant as having the word warehouse many times.
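As a rough illustration of that idea (this mirrors the classic TF-IDF scheme; it is not Lucene's exact formula, and recent Lucene versions default to BM25 instead):

// Rough TF-IDF-style weight of one query term in one document (illustration only).
static double tfIdf(int termFreqInDoc, int docsContainingTerm, int totalDocs) {
    double tf = Math.sqrt(termFreqInDoc);                                        // more occurrences help, with diminishing returns
    double idf = Math.log((double) totalDocs / (docsContainingTerm + 1)) + 1.0;  // rare terms weigh more
    return tf * idf;
}

A document's score for the query is then roughly the sum of these weights over the query terms it contains.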
Then, rather than sorting the whole set of matching documents, Lucene takes into account the fact that you really only need the top K documents or so. Using a heap (or priority queue), one can compute these top K with a complexity of O(N log K) instead of O(N log N).
That's the role of the TopScoreDocCollector.
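Here is a small Java sketch of that top-K idea with a bounded min-heap. It only illustrates the technique; it is not TopScoreDocCollector's actual implementation, and the ScoredDoc class is made up for the example:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopKSketch {

    static class ScoredDoc {
        final int docId;
        final float score;
        ScoredDoc(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Keep only the K highest-scoring documents using a min-heap of size K.
    // Each insertion costs O(log K), so scoring N matches is O(N log K)
    // instead of the O(N log N) a full sort would need.
    static List<ScoredDoc> topK(Iterable<ScoredDoc> matches, int k) {
        PriorityQueue<ScoredDoc> heap =
                new PriorityQueue<>(k, Comparator.comparingDouble((ScoredDoc d) -> d.score));
        for (ScoredDoc d : matches) {
            if (heap.size() < k) {
                heap.offer(d);
            } else if (d.score > heap.peek().score) {
                heap.poll();   // evict the current lowest-scoring entry
                heap.offer(d);
            }
        }
        List<ScoredDoc> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed()); // highest first
        return result;
    }
}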
You can implement your own logic for a scorer (which assigns a score to a document) or a collector (which aggregates results).
This might not be the best answer, since sooner or later there will definitely be someone available to explain the internal behaviour of Lucene, but based on my days as a student there are two sides to "information retrieval": one is taking advantage of existing solutions such as Lucene, the other is the whole theory behind it.
If you are interested in the latter, I recommend taking http://en.wikipedia.org/wiki/Information_retrieval as a starting point to get an overview and dig into the whole topic.
I personally think it is one of the most interesting fields with huge potential, yet I never had the "academic hard skills" to really get in touch with it.
To parameterize the available solutions it is crucial to at least have an overview of the theory. There are, for example, "challenges" in which information has been manually indexed/valued as a reference, so that the quality of a programmatic solution can be compared against it.
Based on such a challenge we managed to achieve slightly higher quality than "Lucene out of the box" after we fed Lucene with 4 different information bases (sorry, it was a few years back and I can barely remember, hence the missing keywords...), which were all produced by Lucene itself but with different parameters.
To come back to your question: I can't answer it directly, but I hope to give you a certain basis for deciding whether you really need/want to know what's behind Lucene, or whether you would rather just use it as a black box (and/or make it a grey box through parameterization).
Sorry if I got you totally wrong.
I'm experimenting with deriving sentiment from Twitter using Stanford's CoreNLP library, a la https://www.openshift.com/blogs/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java - so see here for the code that I'm implementing.
I am getting results, but I've noticed that there appears to be a bias towards 'negative' results, both in my target dataset and another dataset I use with ground truth - the Sanders Analytics Twitter Sentiment Corpus http://www.sananalytics.com/lab/twitter-sentiment/ - even though the ground truth data do not have this bias.
I'm posting this question on the off chance that someone else has experienced this and/or may know if this is the result of something I've done or some bug in the CoreNLP code.
(edit - sorry it took me so long to respond)
I am posting links to plots showing what I mean. I don't have enough reputation to post the images, and can only include two links in this post, so I'll add the links in the comments.
I'd like to suggest this is simply a domain mismatch. The Stanford RNTN is trained on movie review snippets and you are testing on Twitter data. Other than the topic mismatch, tweets also tend to be ungrammatical and use abbreviated ("creative") language.
If I had to suggest a more concrete reason, I would start with a lexical mismatch. Perhaps negative emotions are expressed in a domain-independent way, e.g. with common adjectives, and positive emotions are more domain-dependent or more subtle.
It's still interesting that you're getting a negative bias. The Pollyanna hypothesis suggests a positive bias, IMHO.
Going beyond your original question, there are several approaches to do sentiment analysis specifically on microblogging data. See e.g. "The Good, The Bad and the OMG!" by Kouloumpis et al.
Michael Haas points out correctly that there is a domain mismatch, which is also specified by Richard Socher in the comments section.
Sentences with a lot of unknown words and imperfect punctuation get flagged as negative.
If you are using Python, VADER is a great tool for Twitter sentiment analysis. It is a rule-based tool with only ~300 lines of code and a custom-made lexicon for Twitter, which has ~8000 words including slang and emoticons.
It is easy to modify the rules as well as the lexicon, without any need for re-training. It is fully free and open source.
I have a Java application where I'm looking to determine in real time whether a given piece of text is talking about a topic supplied as a query.
Some techniques I've looked into for this are coreference detection with packages like OpenNLP and Stanford NLP's coref detection, but these models take extremely long to load and don't seem practical in a production environment. Is it possible to perform coreference analysis such that, given a piece of text and a topic, I can get a boolean answer as to whether the text is discussing the topic?
Other than document classification which requires a trained corpus, are there any other techniques that can help me achieve such a thing?
I suggest having a look at Weka. It is written in Java, so it will gel well with your environment, will be faster for your kind of requirement, has lots of tools, and comes with a UI as well as an API. If you are looking at an unsupervised approach (that is, one without any learning from a pre-classified corpus), here is an interesting paper: http://www.newdesign.aclweb.org/anthology/C/C00/C00-1066.pdf
You can also search for "unsupervised text classification/ information retrieval" on Google. You will get lots of approaches. You can choose the one you find easiest.
For each topic (if they are predefined) you can create a list of terms, and for each sentence check the cosine similarity between the sentence and each topic's term list, then show the user the nearest topic.
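A minimal Java sketch of that idea with term-frequency vectors and cosine similarity (the class name and the example topic list are made up; a real version would at least remove stop words and possibly weight terms by TF-IDF):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineTopicMatcher {

    // Build a simple term-frequency vector from lower-cased, tokenised text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors (0 = no overlap, 1 = same direction).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());
        double dot = 0;
        for (String term : shared) {
            dot += a.get(term) * b.get(term);
        }
        double normA = Math.sqrt(a.values().stream().mapToDouble(v -> v * v).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(v -> v * v).sum());
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> topic = termFrequencies("football match goal league player");
        Map<String, Integer> text = termFrequencies("The player scored a late goal in the match");
        System.out.println(cosine(topic, text)); // higher value = text is closer to the topic
    }
}

The text is then assigned the topic with the highest similarity, optionally requiring a minimum threshold before answering "yes, the text discusses this topic".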