Binary classification for web pages


We are interested in doing binary classification of web pages from across the web, e.g. e-commerce vs. non-e-commerce.
Currently, we are using the Mahout library with the Naive Bayes algorithm. We build the training data from existing classified URLs and derive the feature set from the same data.
What is the best possible way, in terms of accuracy, to perform this task?
I need help in terms of algorithms, libraries (usable with Java), or any better ideas that help with this type of classification.
Thanks in advance.

The question is quite general so I can add only general information.
The ways to improve the quality of your classification are (in order of importance):
use lemmatisation and/or stemming to reduce words to their base forms
implement a word filter to remove useless words (stop words)
train separate classifiers for different languages
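
As a rough illustration of the word-filter step, here is a minimal sketch in plain Java. The class name `StopWordFilter` and the tiny stop-word list are assumptions made up for this example; a real list is language-specific and much longer, and stemming or lemmatisation would be a further step, usually done with a dedicated library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Tiny illustrative stop-word list (an assumption); real lists are far longer.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "is", "on", "of", "to", "in");

    // Lowercases the text, splits on anything that is not a letter or digit,
    // and keeps only the tokens that are not stop words.
    static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [add, item, cart, go, checkout]
        System.out.println(filter("Add the item to the cart and go to checkout"));
    }
}
```

The filtered tokens would then be what you feed into feature extraction for the classifier.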

You may try to use some existing, well-tuned program,...
CRM114 is designed to be a spam filter, but it is generic enough to do what you want. People use it to sort resumes and similar documents. It has lots of engines (HMM, SVM, CLUMP, Bayes, etc.). Give it a try.

This one is a very good demonstration of the algorithm regarding NB classifier.
Discarding the most common words would lead to better predictions. IDF (inverse document frequency) can be a good tool for filtering out those words. Also see Wikipedia.
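
To make the IDF idea concrete, here is a small sketch, assuming documents are already tokenized. It uses the plain definition idf(t) = ln(N / df(t)); the class name `IdfFilter` and the threshold value are hypothetical, and you would tune the threshold on your own data.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class IdfFilter {
    // idf(t) = ln(N / df(t)), where df(t) is the number of documents containing t.
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    // Terms with an IDF below the threshold occur in most documents
    // and are candidates for removal before training.
    static Set<String> commonTerms(Map<String, Double> idf, double threshold) {
        Set<String> common = new HashSet<>();
        for (Map.Entry<String, Double> e : idf.entrySet()) {
            if (e.getValue() < threshold) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("buy", "cart", "the"),
                List.of("the", "news", "today"),
                List.of("the", "cart", "price"));
        System.out.println(commonTerms(idf(docs), 0.1)); // "the" is in every document
    }
}
```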

Related

Pre trained vectors, nlp, word2vec, word embedding for particular topic?

Are there any pre-trained vectors for one particular topic only? For example, for "Java", I want a file of vectors related to Java, meaning that if I give "inheritance" as input, cosine similarity shows me "polymorphism" and other related terms only.
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as my corpus, but I am still not getting related words.

Not sure if I understand your question/problem statement, but if you want to work with a corpus of Java source code you can use code2vec, which provides pre-trained word-embedding models. Check it out: https://code2vec.org/

Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so their vector for 'Java' may be dominated by stories of the Java island of Indonesia as much as the programming language. And many other vector-sets are trained on Wikipedia text, which will be dominated by usages in that particular reference-style of writing. But there could be other sets that better emphasize the word-senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But, the exact parameters Google used have not, as far as I know, been revealed. And, good results from this algorithm can require a lot of data and tuning, and still result in some combinations that a person would consider nonsense, and missing others that a person would consider natural.)
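
To give a feel for option (2), here is a toy Java sketch of statistical pair-joining. It is not gensim's Phrases class and not Google's actual setup; it only assumes a score of the form (count(a b) - delta) / (count(a) * count(b)), and the class name `PhraseDetector`, the `delta` discount, and the `threshold` are illustrative choices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhraseDetector {
    // Joins adjacent word pairs whose co-occurrence score exceeds the threshold.
    // score(a, b) = (count(a b) - delta) / (count(a) * count(b))
    // For simplicity the tokens are treated as one stream, ignoring sentence boundaries.
    static List<String> joinPhrases(List<String> tokens, double delta, double threshold) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            unigrams.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size()) {
                bigrams.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()) {
                String a = tokens.get(i);
                String b = tokens.get(i + 1);
                double score = (bigrams.get(a + " " + b) - delta)
                        / ((double) unigrams.get(a) * unigrams.get(b));
                if (score > threshold) {
                    out.add(a + "_" + b); // emit the pair as a single token
                    i += 2;
                    continue;
                }
            }
            out.add(tokens.get(i));
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("machine", "learning", "is", "fun", "and",
                "machine", "learning", "is", "useful", "and", "java", "is", "fun");
        System.out.println(joinPhrases(tokens, 1.0, 0.2));
    }
}
```

On a corpus this small the scores are noisy, which is the same data-and-tuning caveat mentioned above.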

Context aware recommendation engine

I am looking for a context-aware (location, time, companion) recommendation system.
I found a bunch of good recommendation systems (Mahout, PredictionIO, easyrec).
But unfortunately I am not convinced by any of those.
On further googling I found CARSKit, which is based on librec.
That is exactly the kind of library I am looking for. At the same time, I would prefer to keep working with Mahout.
Although Mahout does not fully suit me, it does let us ask for a given number of recommendations, and its output is easy to understand.
As per my understanding, context awareness is missing in Mahout.
I will explain my dataset.
calendar_seq,user_id,date,dayofweek,timehh,timemm,location_name,location_lat,location_long,companion,event_name,is_recommended,is_accepted,show_in_cal
1,1,14/12/15,Monday,13,0,Office,1.1,2.2,Colleagues,lunch,true,true,true
2,1,14/12/15,Monday,18,0,Cinema,3.3,4.4,NA,Movie,false,true,true
3,1,15/12/15,Tuesday,13,0,Office,1.1,2.2,Colleagues,lunch,true,true,true
4,1,15/12/15,Tuesday,18,0,Meeting,3.3,4.4,Colleagues,meeting,false,true,true
5,1,16/12/15,Wednesday,13,0,Office,1.1,2.2,Colleagues,lunch,true,true,true
I will have the above five rows in the DB and will supply them as training data.
Now I need a recommendation for user 1 on 16/12/15 in the evening, at 18:00.
It can recommend Cinema or Meeting for 16/12.
When I run the recommender again for 17/12, the previous day's recommendations will all have become training data.
So the recommender can again give recommendations based on location, time, companion, etc.
Can anyone suggest the best-suited recommendation wrapper on top of Mahout, or a new library that will suit my requirement?
I prefer Java-based solutions for my problem.

This may be similar to your question.
A quote from this link: "Your input file may have multiple features like age, location etc. R could help you in applying K-Means clustering on multiple features. Apache Mahout implementation overwrite features instead of applying multiple features. And when you apply clustering on these multiple features, clusters would be formed based on all features instead of one. However, I am not sure about the use-case, So I am just discussing technical feasibility here. You may need to apply based on your use-case."
Hope this helps.

NLP - Determine whether a piece of text is talking about a given topic?

I have a Java application where I'm looking to determine in real time whether a given piece of text is talking about a topic supplied as a query.
Some techniques I've looked into for this are coreference detection with packages like open-nlp and Stanford-NLP coref detection, but these models take extremely long to load and don't seem practical in a production application environment. Is it possible to perform coreference analysis such that given a piece of text and a topic, I can get a boolean answer that the text is discussing the topic?
Other than document classification which requires a trained corpus, are there any other techniques that can help me achieve such a thing?

I suggest having a look at Weka. It is written in Java, so it will gel well with your environment, will be faster for your kind of requirement, has lots of tools, and comes with a UI as well as an API. If you are looking at an unsupervised approach (that is, one without any learning from a pre-classified corpus), here is an interesting paper: http://www.newdesign.aclweb.org/anthology/C/C00/C00-1066.pdf
You can also search for "unsupervised text classification/ information retrieval" on Google. You will get lots of approaches. You can choose the one you find easiest.

If the topics are predefined, you can create a list of terms for each topic, compute the cosine similarity between each sentence and each topic's term list, and show the user the nearest topic.
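
A minimal sketch of that idea in Java, assuming the sentence is already tokenized and using raw term counts as vectors. The class name `TopicMatcher` and the topic term lists are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopicMatcher {
    // Bag-of-words term counts for a token list.
    static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> c = new HashMap<>();
        for (String t : tokens) {
            c.merge(t, 1, Integer::sum);
        }
        return c;
    }

    static double norm(Map<String, Integer> c) {
        double sum = 0;
        for (int v : c.values()) {
            sum += (double) v * v;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between the term-count vectors of two token lists.
    static double cosine(List<String> a, List<String> b) {
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
        }
        double na = norm(ca);
        double nb = norm(cb);
        return (na == 0 || nb == 0) ? 0.0 : dot / (na * nb);
    }

    // Returns the topic with the highest similarity, or null if nothing overlaps.
    static String nearestTopic(List<String> sentence, Map<String, List<String>> topics) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, List<String>> e : topics.entrySet()) {
            double score = cosine(sentence, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<String>> topics = Map.of(
                "sports", List.of("football", "goal", "match", "team"),
                "finance", List.of("bank", "stock", "market", "price"));
        System.out.println(nearestTopic(List.of("the", "team", "scored", "a", "goal"), topics));
    }
}
```

For a yes/no answer on a single query topic, you could compare the cosine score against a cutoff instead of picking the maximum.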

Metalanguage like BNF or XML-Schema to validate a tree-instance against a tree-model

I'm implementing a new machine learning algorithm in Java that extracts a prototype data structure from a set of structured datasets (tree structures). As I'm developing a generic library for that purpose, I kept my design independent of concrete data representations like XML.
My problem now is that I need a way to define a data model, which is basically a ruleset describing valid trees, against which a set of trees is matched. I thought of using BNF or a similar dialect.
Basically I need a way to iterate through the space of all valid TreeNodes defined by the ModelTree (like a search through the search space for algorithms like A*) so that I can compare my set of concrete trees with the model. I know that I'll have to deal with infinite spaces there, but first things first.
I know it's rather tricky (and my sentences are pretty bumpy), but I would appreciate any clues.
Thanks in advance,
Stefan

I believe that you are talking about a Regular Tree Grammar. This Wikipedia page is an entry point for the topic, and the book that it links to might be helpful.
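
To show what matching a tree against such a grammar can look like, here is a small Java sketch. Every production has the form A -> label(N1, ..., Nk), and a tree is valid if its root can be derived from the start non-terminal. All names (`TreeGrammarCheck`, `Node`, `Production`) and the boolean-list example grammar are assumptions for illustration, not part of any library.

```java
import java.util.List;
import java.util.Map;

class TreeGrammarCheck {
    // A tree node: a label plus ordered children.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = List.of(children);
        }
    }

    // A production "A -> label(N1, ..., Nk)": the node label and the
    // non-terminal that each child must be derivable from.
    static final class Production {
        final String label;
        final List<String> childNonTerminals;

        Production(String label, String... childNonTerminals) {
            this.label = label;
            this.childNonTerminals = List.of(childNonTerminals);
        }
    }

    // True if the tree can be derived from the given non-terminal.
    static boolean derives(Map<String, List<Production>> grammar, String nonTerminal, Node tree) {
        for (Production p : grammar.getOrDefault(nonTerminal, List.of())) {
            if (!p.label.equals(tree.label)
                    || p.childNonTerminals.size() != tree.children.size()) {
                continue; // label or arity mismatch, try the next production
            }
            boolean ok = true;
            for (int i = 0; i < tree.children.size() && ok; i++) {
                ok = derives(grammar, p.childNonTerminals.get(i), tree.children.get(i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // Example grammar: BList -> nil | cons(Bool, BList); Bool -> true | false
    static Map<String, List<Production>> exampleGrammar() {
        return Map.of(
                "BList", List.of(new Production("nil"), new Production("cons", "Bool", "BList")),
                "Bool", List.of(new Production("true"), new Production("false")));
    }

    public static void main(String[] args) {
        Node list = new Node("cons", new Node("true"), new Node("cons", new Node("false"), new Node("nil")));
        System.out.println(derives(exampleGrammar(), "BList", list));
    }
}
```

Because `BList` refers to itself, the grammar describes infinitely many trees while each individual check still terminates, which touches on the infinite-space concern in the question.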

Image Classification Algorithms Using Java

My goal is to implement different image classification methods to show how they function and the advantages and disadvantages of each. The ones I want to try to implement in Java include:
Minimum distance classifier
k-nearest neighbour classifier.
I was wondering what already exists in Java that I could use for this task, so that I can alter the way the algorithms operate.

Although not entirely sure this is what you are looking for (sorry, your question is a bit unclear), if what you want is a library / system to help you with the classification part of the work, then you may want to look at Weka (http://www.cs.waikato.ac.nz/ml/weka/), in my opinion the best Java library for data mining experimentation.
If, instead, you are looking for algorithms that would allow you to analyze images in order to extract features that can, in turn, be used to perform the classification, you may want to start with targeted descriptions of such algorithms in Java, such as those found in the nice on-line book Java Image Processing Cookbook by Rafael Santos; here's a direct link to the section "A Brief Tutorial on Supervised Image Classification".

You can also use RapidMiner with IMMI (IMage MIning) extension:
http://www.burgsys.com/mumi-image-mining-community.php
For image classification you can use for example global feature extraction and then use some classification algorithm (e.g. Artificial Neural Networks).
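
If you would rather write the two classifiers from the question yourself so you can alter them, a bare-bones sketch could look like this. It assumes each image has already been reduced to a numeric feature vector (the two-value vectors and the "dark"/"bright" labels are invented for the example) and uses Euclidean distance; the class name `SimpleClassifiers` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimpleClassifiers {
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Minimum distance classifier: pick the class whose mean vector is closest to x.
    static String minimumDistance(Map<String, List<double[]>> training, double[] x) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, List<double[]>> e : training.entrySet()) {
            double[] mean = new double[x.length];
            for (double[] v : e.getValue()) {
                for (int i = 0; i < x.length; i++) {
                    mean[i] += v[i] / e.getValue().size();
                }
            }
            double d = distance(mean, x);
            if (d < bestDist) {
                bestDist = d;
                best = e.getKey();
            }
        }
        return best;
    }

    // k-nearest neighbour: majority vote among the k closest training samples.
    static String kNearest(Map<String, List<double[]>> training, double[] x, int k) {
        List<Map.Entry<Double, String>> scored = new ArrayList<>();
        for (Map.Entry<String, List<double[]>> e : training.entrySet()) {
            for (double[] v : e.getValue()) {
                scored.add(Map.entry(distance(v, x), e.getKey()));
            }
        }
        scored.sort(Map.Entry.comparingByKey()); // nearest samples first
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            String label = scored.get(i).getValue();
            int v = votes.merge(label, 1, Integer::sum);
            if (best == null || v > votes.get(best)) {
                best = label;
            }
        }
        return best;
    }

    // Invented training data: two features per image, e.g. mean intensities.
    static Map<String, List<double[]>> exampleTraining() {
        return Map.of(
                "dark", List.of(new double[] {0.1, 0.1}, new double[] {0.2, 0.1}),
                "bright", List.of(new double[] {0.9, 0.8}, new double[] {0.8, 0.9}));
    }

    public static void main(String[] args) {
        double[] x = {0.15, 0.2};
        System.out.println(minimumDistance(exampleTraining(), x));
        System.out.println(kNearest(exampleTraining(), x, 3));
    }
}
```

Swapping the distance function or the voting rule is then a one-method change, which makes it easy to compare the behaviour of the two methods.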
