I'm trying to do document classification using the Weka Java API.
Here is the directory structure of my data files.
+- text_example
|
+- class1
| |
| 3 html files
|
+- class2
| |
| 1 html file
|
+- class3
|
3 html files
I created the ARFF file with TextDirectoryLoader. Then I applied the StringToWordVector filter to the created ARFF file, with filter.setOutputWordCounts(true).
Below is a sample of the output once the filter is applied. I need to get a few things clarified.
@attribute </form> numeric
@attribute </h1> numeric
.
.
@attribute earth numeric
@attribute easy numeric
This huge list should be the tokenization of the content of the initial HTML files, right?
Then I have:
@data
{1 2,3 2,4 1,11 1,12 7,..............}
{10 4,34 1,37 5,.......}
{2 1,5 6,6 16,...}
{0 class2,34 11,40 15,.....,4900 3,...
{0 class3,1 2,37 3,40 5....
{0 class3,1 2,31 20,32 17......
{0 class3,32 5,42 1,43 10.........
Why is there no class attribute for the first 3 items? (They should have class1.)
What does the leading 0 mean, as in {0 class2,..} and {0 class3,..}?
It says, for instance, that in the 3rd HTML file in the class3 folder, the word identified by the integer 32 appears 5 times. Just to check: how do I get the word (token) referred to by 32?
How do I reduce the dimensionality of the feature vector? Don't we need to make all the feature vectors the same size? (For example, consider only the, say, 100 most frequent terms from the training set, and later, when testing, count the occurrences of only those 100 terms in the test documents. Otherwise, what happens if a totally new word comes up in the testing phase: will the classifier just ignore it?)
Am I missing something here? I'm new to Weka.
I would also really appreciate it if someone could explain to me how the classifier uses the vector created with the StringToWordVector filter (e.g. building the vocabulary from the training data, dimensionality reduction: do those happen inside the Weka code?).
The huge list of @attribute contains all the tokens derived from your input.
Your @data section is in the sparse format, that is, for each attribute the value is only stated if it is different from zero. For the first three lines, the class attribute is class1; you just can't see it (if it were unknown, you would see a 0 ? at the beginning of the first three lines). Why is that so? Weka internally represents nominal attributes (that includes classes) as doubles and starts counting at zero. So your three classes are internally: class1=0.0, class2=1.0, class3=2.0. As zero values are not stated in the sparse format, you can't see the class in the first three lines. (Also see the section "Sparse ARFF files" on http://www.cs.waikato.ac.nz/ml/weka/arff.html.)
To get the word/token represented by index n, you can either count through the attribute list or, if you have the Instances object, invoke attribute(n).name() on it. Here, n starts counting at 0.
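For example (a minimal sketch; the ARFF file name here is hypothetical):
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: look up the token behind attribute index 32 (the file name is hypothetical).
public class AttributeLookup {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("text_example_vectorized.arff");
        System.out.println(data.attribute(32).name()); // prints the token represented by index 32
    }
}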
To reduce the dimensionality of the feature vector, there are a lot of options. If you only want to keep the 100 most frequent terms, call stringToWordVector.setWordsToKeep(100). Note that this will try to keep 100 words of every class. If you do not want to keep 100 words per class, call stringToWordVector.setDoNotOperateOnPerClassBasis(true). You will get slightly more than 100 if there are several words with the same frequency, so the 100 is just a kind of target value.
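A minimal sketch of that setup (the text_example directory comes from your question; everything else is illustrative):
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Sketch only: load the documents, then vectorize keeping roughly the 100 most frequent terms overall.
public class VectorizeExample {
    public static void main(String[] args) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new java.io.File("text_example"));
        Instances rawData = loader.getDataSet();

        StringToWordVector stringToWordVector = new StringToWordVector();
        stringToWordVector.setOutputWordCounts(true);
        stringToWordVector.setWordsToKeep(100);                   // roughly the 100 most frequent terms
        stringToWordVector.setDoNotOperateOnPerClassBasis(true);  // 100 overall instead of 100 per class
        stringToWordVector.setInputFormat(rawData);
        Instances vectorized = Filter.useFilter(rawData, stringToWordVector);

        System.out.println(vectorized.numAttributes() + " attributes after filtering");
    }
}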
As for new words occurring in the test phase, I think that cannot happen, because you have to hand the StringToWordVector all instances before classifying. I am not 100% sure on that one though, as I am using a two-class setup and I let StringToWordVector transform all my instances before telling the classifier anything about it.
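One way to make that explicit is to wrap the filter in a FilteredClassifier, which builds the StringToWordVector dictionary from the training data only, so words that appear only in test documents simply never become attributes. A sketch (the NaiveBayes classifier is just a placeholder choice):
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Hedged sketch: the classifier choice and the directory are placeholders.
public class TrainAndClassify {
    public static void main(String[] args) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new java.io.File("text_example"));
        Instances train = loader.getDataSet();
        train.setClassIndex(train.numAttributes() - 1); // the class attribute from TextDirectoryLoader is the last one

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new NaiveBayes());
        fc.buildClassifier(train);

        // Classify the first training document just to show the call; a real test set would be loaded the same way.
        double predicted = fc.classifyInstance(train.instance(0));
        System.out.println("predicted class: " + train.classAttribute().value((int) predicted));
    }
}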
I can generally recommend that you experiment with the Weka KnowledgeFlow tool to learn how to use the different classes. If you know how to do things there, you can use that knowledge in your Java code quite easily.
Hope I was able to help you, although the answer is a bit late.
I'm seeking help with the logic, not the technology, to solve this problem. I was writing a program in Java to use categorized data (consisting of temperature and blood pressure mapped to a state of Infected/NotInfected/Unknown) and classify a given set of travelers as
“Infected”, “NotInfected” or “Unknown” accordingly.
Input:
The input comprises a string containing two parts separated by ‘#’. The first part contains
categorized data for any number of individuals, separated by commas. The data for each individual
contains three space-separated values as follows:
Temperature bloodpressure category
The second part contains the space-separated temperature and blood pressure of multiple travelers,
separated by commas.
Output:
Categorization of the travelers, separated by commas.
Sample Input & Output
90 120 Infected,90 150 NotInfected,100 140 Infected,80 130 NotInfected#95 125,95 145,75 160 | Output: Infected,Unknown,NotInfected
80 120 Infected,70 145 Infected,90 100 Infected,80 150 NotInfected,80 80 NotInfected,100 120 NotInfected#120 148,75 148,60 90 | Output: NotInfected,Unknown,Unknown
I went about solving this by splitting the string provided into substrings: one containing the categorized data and the other containing the input data set.
public static void main(String[] args) {
    String s = "90 120 Infected, 90 150 NotInfected, 100 140 NotInfected, 80 130 NotInfected#95 125, 95 145, 75 160";
    String categories = s.split("#")[0];
    String inputs = s.split("#")[1];
    System.out.println(categories + "\n" + inputs);
    for (String input : inputs.split(",")) {
        // iterate through categories and match against input
    }
}
But I realized that I was not able to find any pattern that could help me get the desired output as shown in the sample above. Which kind of temperature/blood-pressure combination leads to the Infected category?
So, your problem is to learn a classifier from your sample (training data) and then be able to classify new cases (described by explanatory variables, temperature and blood pressure) into one of three classes.
There are numerous ways to learn classifiers, but first you should find out if your explanatory variables actually explain the classes (i.e. if there is a pattern). For this purpose, I would suggest a simple check: plot your training data in two dimensions (the explanatory variables) and give each of the three classes a different symbol (e.g. the letters N, I, U). You will see whether all classes are randomly mixed or whether the same symbols tend to aggregate together. Or are you able to draw lines that separate the different classes sufficiently well? You don't need to be able to separate the classes perfectly - some classification errors just belong to life - but you should be able to see some tendency.
If there is a clear class division, then you should just select a classifier to learn. Learning algorithms are widely available, so you don't need to code them yourself. You could try e.g. classification trees (the classical C4.5 learning algorithm). Or, if your training set is sufficiently large, you could use a k-nearest-neighbour classifier that doesn't require any learning phase: you simply classify a new case according to its k nearest neighbours in the training data (calculate Euclidean distances between points in the temperature and blood pressure space, select the k points with the shortest distances from your new query point, and pick the most common class among those neighbours).
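Here is a minimal k-nearest-neighbour sketch along those lines (all class and method names are made up, and no scaling of the two variables is done):
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal k-nearest-neighbour sketch for the traveler data; names are illustrative only.
public class KnnClassifier {

    static class Sample {
        final double temperature, bloodPressure;
        final String label;
        Sample(double temperature, double bloodPressure, String label) {
            this.temperature = temperature;
            this.bloodPressure = bloodPressure;
            this.label = label;
        }
    }

    private final List<Sample> training = new ArrayList<>();
    private final int k;

    KnnClassifier(int k) { this.k = k; }

    void add(double temperature, double bloodPressure, String label) {
        training.add(new Sample(temperature, bloodPressure, label));
    }

    // Classify a new point by majority vote among the k nearest training samples (Euclidean distance).
    String classify(double temperature, double bloodPressure) {
        List<Sample> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble(
                (Sample s) -> Math.hypot(s.temperature - temperature, s.bloodPressure - bloodPressure)));
        Map<String, Integer> votes = new HashMap<>();
        for (Sample s : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(s.label, 1, Integer::sum);
        }
        String best = "Unknown";
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) { best = e.getKey(); bestVotes = e.getValue(); }
        }
        return best;
    }

    public static void main(String[] args) {
        KnnClassifier knn = new KnnClassifier(3);
        knn.add(90, 120, "Infected");
        knn.add(90, 150, "NotInfected");
        knn.add(100, 140, "Infected");
        knn.add(80, 130, "NotInfected");
        System.out.println(knn.classify(95, 125)); // vote among the 3 nearest training points
    }
}
With the first sample's training data and k = 3, the query 95 125 gets two Infected neighbours and one NotInfected neighbour, so it is classified as Infected, matching the expected output. (This simple sketch never outputs Unknown; deciding when a case is too ambiguous to classify is a separate design choice.)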
What is the correct abstract syntax tree for representing algebra? I have tried way too many setups, and constantly been rewriting the syntax tree, and all of my configurations end up forgetting something important (e.g. fractions not being supported). Currently my configurations for equations and expressions seem to be fine. Expressions simply consist of an array of terms, each with a positive/negative sign, and a coefficient. That's where the trouble comes in. What exactly is a term? Wikipedia helps some, and even has an example AST for a couple of terms. However, for practical purposes I'm trying to keep everything closer to the concepts we use when we learn algebra, rather than breaking it down into nothing but variables and operators. It appears that just about anything can be contained in a term: terms can contain fractions (which contain expressions), sub-terms, sub-expressions, and regular variables, each of them having their own exponents.
Currently my configuration is something like this:
Term
|
-----------------------------------------------------------------
| | | | |
Coefficient ArrayList of ArrayList of ArrayList of ArrayList of
| sub-expressions powers of fractions powers of
| sub-expressions* (may contain fractions*
--------------- variables)
| |
integer/decimal fraction
(no variables)
*Expressions/fractions don't have exponents on their own, but may have one outside sometimes (e.g. 2(x+3)^3).
NOTE: For the sake of simplicity the diagram leaves out an ArrayList of variables (and one for roots), and an ArrayList of their respective exponents, all contained by the term.
NOTE 2: In case it's not clear, the diagram doesn't show inheritance. It's showing members of the Term class.
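Spelled out as code, the configuration described above looks roughly like this (a sketch only; every name is made up, and the placeholder types just stand in for the structures mentioned in the diagram):
import java.util.ArrayList;
import java.util.List;

// Sketch of the Term members described in the diagram; all names are hypothetical.
class Expression { /* an array of terms, each with a sign */ }
class Fraction { /* numerator/denominator expressions; may contain variables */ }
class Variable { /* a single variable such as x */ }
class Coefficient { /* an integer/decimal or a variable-free fraction */ }

public class Term {
    Coefficient coefficient;
    List<Expression> subExpressions = new ArrayList<>();
    List<Expression> subExpressionPowers = new ArrayList<>(); // exponent applied outside a sub-expression, e.g. 2(x+3)^3
    List<Fraction> fractions = new ArrayList<>();
    List<Expression> fractionPowers = new ArrayList<>();      // exponent applied outside a fraction
    List<Variable> variables = new ArrayList<>();             // plus roots, omitted as in the note above
    List<Expression> variableExponents = new ArrayList<>();
}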
This seems rather sloppy to me, and might not scale well when things get more complex. Is a term really supposed to be this kind of soup? I have a feeling yet another thing should be included in the term, but I can't think of what it would be. Although I've been struggling with this for some months, I haven't had the discipline to just stop and really work it out, which I should have done before starting.
Am I making a mistake in making nearly everything fit in a term? If so, what should I be doing instead? If not, is it really supposed to be this... ugly/non-intuitive? Part of my feeling that this must be wrong is due to the fact that almost no one thinks of an algebraic term this way.
Example term: 2.3x(2/3)^4(√23)((x+6)/(x-6)) (overly complex, I know, but it contains everything mentioned above).
My real question: what is the correct syntax structure for the term, the heart and soul of algebra?
I am working on an engine that does OCR post-processing, and currently I have a set of organizations in the database, including Chamber of Commerce Numbers.
Also from the OCR output I have a list of possible Chamber of Commerce (COC) numbers.
What would be the best way to search for the most similar one? Currently I am using Levenshtein distance, but the result range is simply too big, and on big databases I really doubt its feasibility. Currently it's implemented in Java, and the database is a MySQL database.
Side note: a Chamber of Commerce number in The Netherlands is defined to be an 8-digit number for every company. An earlier version of this system used another 4 digits (0000, 0001, etc.) to indicate an establishment of an organization; nowadays totally new COC numbers are given out for those.
Example of COCNumbers:
30209227
02045251
04087614
01155720
20081288
020179310000
09053023
09103292
30039925
13041611
01133910
09063023
34182B01
27124701
List of possible COCNumbers determined by post-processing:
102537177
000450093333
465111338098
NL90223l30416l
NLfl0737D447B01
12juni2013
IBANNL32ABNA0242244777
lncassantNL90223l30416l10000
KvK13041611
BtwNLfl0737D447B01
A few extra notes:
The post-processing picks up words and word groups from the invoice, and those word groups are concatenated into one string. (A word group is, as it says, a group of words, usually denoted by a space between them.)
The condition that the post-processing uses for something to count as a COC number is the following: the length should be 8 or more, half of the content should be numbers, and it should be alphanumerical (a small sketch of this check follows these notes).
The amount of possible COCNumbers determined by post-processing is relatively small.
The database itself can grow very big, up to tens of thousands of records.
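A hypothetical reading of that candidate check in Java (the method name is made up, and "half of the content should be numbers" is interpreted as "at least half"):
// Hypothetical sketch of the stated candidate filter: 8+ characters, alphanumeric, at least half digits.
public class CocCandidateCheck {

    static boolean looksLikeCocNumber(String s) {
        if (s.length() < 8 || !s.matches("[A-Za-z0-9]+")) {
            return false;
        }
        long digits = s.chars().filter(Character::isDigit).count();
        return digits * 2 >= s.length();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCocNumber("KvK13041611")); // true
        System.out.println(looksLikeCocNumber("12juni2013"));  // true
        System.out.println(looksLikeCocNumber("factuur"));     // false: too short and no digits
    }
}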
How would I proceed to find the best match in general? (In this case (13041611, KvK13041611) is the best (and moreover correct) match)
Doing this matching exclusively in MySQL is probably a bad idea for a simple reason: there's no way to use a regular expression to modify a string natively.
You're going to need to use some sort of scoring algorithm to get this right, in my experience (which comes from ISBNs and other book-identifying data).
This is procedural -- you probably need to do it in Java (or some other procedural programming language); a rough sketch in Java follows the scoring steps below.
1. Is the candidate string found in the table exactly? If yes, score 1.0.
2. Is the candidate string "kvk" (case-insensitive) prepended to a number that's found in the table exactly? If so, score 1.0.
3. Is the candidate string the correct length, and does it match after changing lower-case L into 1 and upper-case O into 0? If so, score 0.9.
4. Is the candidate string the correct length after trimming all alphabetic characters from either the beginning or the end, and does it match? If so, score 0.8.
5. Do both steps 3 and 4, and if you get a match, score 0.7.
6. Trim alphabetic characters from both the beginning and the end, and if you get a match, score 0.6.
7. Do steps 3 and 6, and if you get a match, score 0.55.
The highest scoring match wins.
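A rough sketch of those steps in Java (it collapses steps 4/6 and 5/7 into trimming from both ends, and assumes the "correct length" is the 8 digits mentioned in the question):
import java.util.Set;

// Rough sketch of the scoring steps above; thresholds and the 8-digit length come from the question and the steps.
public class CocScorer {

    private final Set<String> knownCocNumbers; // COC numbers from the database

    public CocScorer(Set<String> knownCocNumbers) {
        this.knownCocNumbers = knownCocNumbers;
    }

    public double score(String candidate) {
        if (knownCocNumbers.contains(candidate)) return 1.0;                       // step 1
        if (candidate.toLowerCase().startsWith("kvk")
                && knownCocNumbers.contains(candidate.substring(3))) return 1.0;   // step 2: "KvK13041611" -> "13041611"
        String ocrFixed = candidate.replace('l', '1').replace('O', '0');           // step 3
        if (ocrFixed.length() == 8 && knownCocNumbers.contains(ocrFixed)) return 0.9;
        String trimmed = candidate.replaceAll("^[A-Za-z]+", "")                    // steps 4/6: strip alphabetic prefix/suffix
                                  .replaceAll("[A-Za-z]+$", "");
        if (trimmed.length() == 8 && knownCocNumbers.contains(trimmed)) return 0.8;
        String both = trimmed.replace('l', '1').replace('O', '0');                 // steps 5/7: trim and fix OCR confusions
        if (both.length() == 8 && knownCocNumbers.contains(both)) return 0.7;
        return 0.0;
    }
}
With the example data, score("KvK13041611") returns 1.0 via step 2 because 13041611 is in the database.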
Take a visual look at the ones that don't match after this set of steps and see if you can discern another pattern of OCR junk or concatenated junk. Perhaps your OCR is seeing "g" where the input is "8", or other possible issues.
You may be able to try using Levenshtein's distance to process these remaining items if you match substrings of equal length. They may also be few enough in number that you can correct your data manually and proceed.
Another possibility: you may be able to use Amazon Mechanical Turk to purchase crowdsourced labor to resolve some difficult cases.
I have written a kernel density estimator in Java that takes input in the form of ESRI shapefiles and outputs a GeoTIFF image of the estimated surface. To test this module I need an example shapefile, and for whatever reason I have been told to retrieve one from the sample data included in R. The problem is that none of the sample data is a shapefile...
So I'm trying to use the shapefiles package's function convert.to.shapefile(4) to convert the bei dataset included in the spatstat package in R to a shapefile. Unfortunately this is proving harder than I thought. Does anyone have any experience with doing this? If you'd be so kind as to lend me a hand here, I'd greatly appreciate it.
Thanks,
Ryan
References:
spatstat,
shapefiles
There are converter functions for Spatial objects in the spatstat and maptools packages that can be used for this. A shapefile consists of at least points (or lines or polygons) and attributes for each object.
library(spatstat)
library(sp)
library(maptools)
data(bei)
Coerce bei to a Spatial object, here just points without attributes since there are no "marks" on the ppp object.
spPoints <- as(bei, "SpatialPoints")
A shapefile requires at least one column of attribute data, so create a dummy.
dummyData <- data.frame(dummy = rep(0, npoints(bei)))
Using the SpatialPoints object and the dummy data, generate a SpatialPointsDataFrame.
spDF <- SpatialPointsDataFrame(spPoints, dummyData)
At this point you should definitely consider what coordinate system bei uses and whether you can represent it with a WKT CRS (well-known text coordinate reference system). You can assign that to the Spatial object as another argument to SpatialPointsDataFrame, or after creation with proj4string(spDF) <- CRS("+proj=etc...") (but this is an entire problem all on its own that we could write pages on).
Load the rgdal package (this is the most general option as it supports many formats and uses the GDAL library, but it may not be available because of system dependencies).
library(rgdal)
(Use writePointsShape in the maptools package if rgdal is not available.)
The syntax is: the object, then the "data source name" (here the current directory; this can be a full path to a .shp or a folder), then the layer (for shapefiles, the file name without the extension), and then the name of the output driver.
writeOGR(obj = spDF, dsn = ".", layer = "bei", driver = "ESRI Shapefile")
Note that the write would fail if "bei.shp" already existed, so it would have to be deleted first with unlink("bei.shp").
List any files that start with "bei":
list.files(pattern = "^bei")
[1] "bei.dbf" "bei.shp" "bei.shx"
Note that there is no general "as.Spatial" converter for ppp objects, since decisions must be made as to whether this is a point pattern with marks and so on - it might be interesting to try writing one that reports on whether dummy data was required, and so on.
See the following vignettes for further information and details on the differences between these data representations:
library(sp); vignette("sp")
library(spatstat); vignette("spatstat")
A general solution is:
convert the "ppp" or "owin" classed objects to appropriate classed objects from the sp package
use the writeOGR() function from package rgdal to write the Shapefile out
For example, if we consider the hamster data set from spatstat:
require(spatstat)
require(maptools)
require(sp)
require(rgdal)
data(hamster)
First convert this object to a SpatialPointsDataFrame object:
ham.sp <- as.SpatialPointsDataFrame.ppp(hamster)
This gives us a sp object to work from:
> str(ham.sp, max = 2)
Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
..@ data :'data.frame': 303 obs. of 1 variable:
..@ coords.nrs : num(0)
..@ coords : num [1:303, 1:2] 6 10.8 25.8 26.8 32.5 ...
.. ..- attr(*, "dimnames")=List of 2
..@ bbox : num [1:2, 1:2] 0 0 250 250
.. ..- attr(*, "dimnames")=List of 2
..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
This object has a single variable in the @data slot:
> head(ham.sp@data)
marks
1 dividing
2 dividing
3 dividing
4 dividing
5 dividing
6 dividing
So, say we now want to write out this variable as an ESRI Shapefile; we use writeOGR():
writeOGR(ham.sp, "hamster", "marks", driver = "ESRI Shapefile")
This will create several marks.xxx files in the directory hamster, created in the current working directory. That set of files is the shapefile.
One of the reasons why I didn't do the above with the bei data set is that it doesn't contain any data, and thus we can't coerce it to a SpatialPointsDataFrame object. There are data we could use, in bei.extra (loaded at the same time as bei), but these extra data are on a regular grid. So we'd have to:
convert bei.extra to a SpatialGridDataFrame object (say bei.spg)
convert bei to a SpatialPoints object (say bei.sp)
overlay() the bei.sp points on to the bei.spg grid, yielding values from the grid for each of the points in bei
that should give us a SpatialPointsDataFrame that can be written out using writeOGR() as above
As you see, that is a bit more involved just to give you a shapefile. Will the hamster data example I show suffice? If not, I can hunt out my copy of Bivand et al. tomorrow and run through the steps for bei.
Suppose you need to perform some kind of comparison between 2 files. You only need to do it when it makes sense; in other words, you wouldn't want to compare a JSON file with a properties file, or a .txt file with a .jar file.
Additionally, suppose that you have a mechanism in place to sort all of these things out, and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
The task is to pick the closest name to compare later. Unfortunately, an identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt. Fantastic. But it also says that it is only about 2 times closer, whereas in reality, looking at the 2 files, we would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest I apply to not only figure out likelihood from characters being in the same place, but also test whether the determined weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0.
I hope I am being clear.
Please share your experience and thoughts on this.
(Whatever approach you suggest should not depend on the number of characters the filename consists of.)
Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.
It sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalizing spaces (e.g. replacing all spaces and underscores with the empty string).
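A small sketch of that idea (the normalization rules are only the ones mentioned above; lower scores mean closer, as noted):
// Sketch: normalize both names, then compare with the classic dynamic-programming Levenshtein distance.
public class FileNameDistance {

    // Normalization as suggested above: same case, spaces and underscores removed.
    static String normalize(String name) {
        return name.toLowerCase().replaceAll("[\\s_]", "");
    }

    // Levenshtein distance: minimum number of single-character inserts, deletes and replacements.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String target = normalize("_myFile.txt");
        for (String candidate : new String[] {"_m_y_f_i_l_e.txt", "somethingReallyClever.txt"}) {
            System.out.println(candidate + " -> " + levenshtein(target, normalize(candidate)));
        }
    }
}
With this normalization, _m_y_f_i_l_e.txt reduces to the same string as _myFile.txt (distance 0), while somethingReallyClever.txt stays far away, which matches the intuition in the question.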