So I have been set the task of making a program that takes in a wide set of text files and calculates unusual communication patterns in that set. Basically I think we need to work out extremes of communication frequency, along with clustering together groups of people who communicate very often.
As well as that, I'd like to create a 'social network style' graph that shows who communicates with whom, and graphs of the form 'number of emails over time'.
Additionally, any ideas for other types of graphs to include would be great, though my main predicament is which algorithms to use for this.
Thanks.
Eventually, I decided on using K Means Clustering, alongside Barycenter.
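For anyone curious what that looks like in practice, here is a minimal, self-contained k-means sketch in Java. The toy feature vectors (meant to stand for something like per-person email counts per month) and the choice of k are my own placeholder assumptions, not part of the original task:

```java
import java.util.*;

/** Minimal k-means sketch: cluster people by per-period email counts (toy data). */
public class KMeansSketch {
    public static void main(String[] args) {
        // Each row is a hypothetical feature vector, e.g. emails sent per month for one person.
        double[][] points = {
            {1, 2, 1}, {2, 1, 2}, {40, 38, 45}, {42, 41, 39}, {5, 4, 6}
        };
        int k = 2, iterations = 20;
        Random rnd = new Random(42);

        // Initialise centroids by picking k distinct points at random.
        double[][] centroids = new double[k][];
        List<double[]> shuffled = new ArrayList<>(Arrays.asList(points));
        Collections.shuffle(shuffled, rnd);
        for (int c = 0; c < k; c++) centroids[c] = shuffled.get(c).clone();

        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each point goes to its nearest centroid.
            for (int i = 0; i < points.length; i++) assignment[i] = nearest(points[i], centroids);
            // Update step: each centroid becomes the mean of its assigned points.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
                }
            }
        }
        System.out.println("Cluster assignments: " + Arrays.toString(assignment));
    }

    private static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) dist += (p[d] - centroids[c][d]) * (p[d] - centroids[c][d]);
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

The real work is in building those feature vectors from the text files; the clustering step itself stays this small.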
Related
I have always used Python for clustering, but recently I came across a situation in which I need implementations of both CluStream and DenStream (stream clustering algorithms), which are available in R and Java (there are some community implementations in Python, but I already tried them and they do not work).
The thing is that I have to compare many clustering algorithms written in Python, and as a preliminary stage I was using the well-known scikit-learn data sets (to show how the algorithms handle non-globular clusters; of course, later I will use time series data).
Now, I want to know the proper way to run those R/Java algorithms and compute a metric coded in Python (DBCV) on the R/Java clustering results.
--> So, summing up, I need to compare many algorithms (coded in Python and R/Java) using the same data sets (which I figured could be persisted into CSV files) and compute the same validity metric (in Python).
Any help would be appreciated. Thanks in advance!
EDIT: the solution I came across is the following:
Generate the toy data sets with sklearn and persist them into CSV files
Run the different clustering algorithms on those data sets and also persist the clustering results into CSV files (it does not matter which programming language is used; see the CSV sketch after this list)
Develop another app which:
takes the clustering solutions stored in the CSV files
computes the metric and shows the results
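For step 2, a rough sketch of what persisting clustering results from the Java side might look like; the file name and the point_id/cluster column layout are my own assumptions, not a fixed format:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

/** Sketch: write per-point cluster labels to a CSV file that the Python metric app can read. */
public class ClusteringResultWriter {
    // labels[i] is the cluster id assigned to point i; the "point_id,cluster" layout is an assumption.
    public static void write(String path, int[] labels) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(path))) {
            out.println("point_id,cluster");
            for (int i = 0; i < labels.length; i++) {
                out.println(i + "," + labels[i]);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        write("clustream_results.csv", new int[]{0, 0, 1, 2, 1});
    }
}
```

As long as every algorithm (Python, R, or Java) writes the same layout, the Python metric app only has to parse one format.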
PLEASE let me know if you find a better solution!
Notes:
This R package is the one I want to try: streamMOA
I do not know anything about R, but I have worked with Java before (which implementation I choose depends on which integrates better with Python)
MOA is Java software. There is no good reason to use it via R unless you are already in the R ecosystem (which you aren't).
You can write the data to CSV and load it in whatever tool you like
These data sets are not streams. They lack all the difficulties and challenges of streams - a simple subsample will be enough to identify the clustering structure. Conclusions drawn from this data are useless. Use real data streams, not synthetic data with no sequential order to it.
Suppose I have to develop a desktop application using Java SE. I have finished writing the user requirements document. In this document, I described the functionality of my future application. I analysed the needs of the user and established what the ideal application has to do.
Now, I have to design the architecture of the application and produce a detailed design of the app. This is what I don't know how to do.
I have an idea, which is as follows: elaborate a use case diagram, then for every use case make a sequence diagram, and finally produce a class diagram from which I can generate the code.
Is this correct? And what about using a database management system: at which level do I add the use of the DBMS? From the first UML diagram?
Any help is welcome.
Well, you know the functions you will implement and you know the requirements. While at this point you should know or be able to infer some database requirements, you don't have the whole picture yet. If you want to do iterative software development, you can start with whatever you feel you can make the most progress on, then go back to your other tasks and work in increments. Because you are doing an iterative process, you will be erasing bits here and there, polishing your work as you go.
To work sequentially, you'd finish all the analysis documentation before doing design, and that before touching code. Initial databases can be generated from Java classes (beans), so that's when that comes in.
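As an illustration of what "generating the initial database from Java classes (beans)" can look like, here is a hedged sketch assuming you use JPA/Hibernate with schema auto-generation (e.g. hibernate.hbm2ddl.auto=update); the entity and its fields are made up:

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

// Hypothetical entity: with schema auto-generation enabled, the JPA provider
// can create the corresponding table from this class at startup.
@Entity
public class Customer {
    @Id
    @GeneratedValue
    private Long id;

    private String name;
    private String email;

    // Getters/setters omitted for brevity.
}
```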
Under your chosen methodology, the wiki link you provided lists what is expected to be done, in order. For the high-level design part, which you say you have problems with, you'll want the appropriate UML diagrams; use component diagrams for modules/software architecture.
Because it is high-level design, keep it high level; don't delve into details. E.g. for a video game: Graphics, Audio, Network, etc., and how they will interact (interfaces). Don't define anything smaller: no classes, no methods; main packages/libraries can be done. For the hardware architecture you may use a deployment diagram, with each node representing the hardware of the boxes that will run your code. You aren't prepared for deployment yet, but you can revise your initial proposal in the next iteration if you need to.
Database design comes at the end, but the wiki specifically tells you to only define tables; don't define columns yet. You will define those in the low-level design phase.
I am planning to write a Java application to identify certain types of trees using Google Maps. I have knowledge of the Google Maps API; my problem is how to recognize trees from the map imagery. I found some libraries like JJIL, but I don't know whether they are useful or not. Can anyone give some input on this project?
The general idea would probably be something like this:
Collect a large number of positive and negative samples (i.e. images from Google Maps that you want to recognize and ones that you don't want to recognize). It's hard to say how many you'll really need; depending on how similar they are and how many features you need, 100 might be enough or 10000 might be too few.
Find a set of texture features. Nobody can tell you what features are optimal without seeing the sample images from step 1 first.
Train a machine learning algorithm (e.g. SVM, neural network). Split the sample set into a training set and a test set to judge the discrimination quality.
Step 1 is just work, and I doubt anyone on Stack Overflow will do it for you. If you have the samples and post them, we might be able to help with steps 2/3, though.
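For step 3, a hedged sketch using the LibSVM Java API; the toy feature values are placeholders, since in practice they would come from whatever texture features you extract in step 2:

```java
import libsvm.*;

/** Sketch: train an SVM on pre-extracted feature vectors and classify a new one. */
public class TreeClassifierSketch {
    public static void main(String[] args) {
        // Toy training data: each row is a feature vector, labels are 1 (tree) / -1 (not tree).
        double[][] features = { {0.9, 0.2}, {0.8, 0.3}, {0.1, 0.7}, {0.2, 0.9} };
        double[] labels = { 1, 1, -1, -1 };

        svm_problem prob = new svm_problem();
        prob.l = features.length;
        prob.y = labels;
        prob.x = new svm_node[features.length][];
        for (int i = 0; i < features.length; i++) {
            prob.x[i] = toNodes(features[i]);
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.gamma = 0.5;
        param.C = 1.0;
        param.cache_size = 100;
        param.eps = 1e-3;

        svm_model model = svm.svm_train(prob, param);
        double prediction = svm.svm_predict(model, toNodes(new double[] {0.85, 0.25}));
        System.out.println("Predicted label: " + prediction);
    }

    private static svm_node[] toNodes(double[] values) {
        svm_node[] nodes = new svm_node[values.length];
        for (int i = 0; i < values.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;   // LibSVM feature indices are 1-based
            nodes[i].value = values[i];
        }
        return nodes;
    }
}
```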
I'm planning to develop a program in Java which will provide a diagnosis. The data set is divided into two parts: one for training and the other for testing. My program should learn to classify from the training data (which contains answers to 30 questions, each in its own column, with each record on a new line; the last column is the diagnosis, 0 or 1; in the testing part of the data the diagnosis column is empty; the data set contains about 1000 records) and then make predictions on the testing part of the data.
I've never done anything similar, so I'd appreciate any advice or information about solutions to similar problems.
I was thinking about the Java Machine Learning Library or the Java Data Mining Package, but I'm not sure if that's the right direction, and I'm still not sure how to tackle this challenge.
Please advise.
All the best!
I strongly recommend you use Weka for your task.
It's a collection of machine learning algorithms with a user-friendly front-end which facilitates a lot of different kinds of feature and model selection strategies.
You can do a lot of really complicated stuff using it without really having to do any coding or math.
The makers have also published a pretty good textbook that explains the practical aspects of data mining.
Once you get the hang of it, you could use its API to integrate any of its classifiers into your own Java programs.
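If you do go the API route, a minimal sketch of training and applying a classifier through Weka's Java API could look like the following; the ARFF file names are placeholders, and your 30 questions plus the 0/1 diagnosis column would first need to be converted to ARFF or CSV:

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: train a decision tree on the training set and classify the test instances. */
public class WekaSketch {
    public static void main(String[] args) throws Exception {
        // Load training data; the last attribute is assumed to be the 0/1 diagnosis.
        Instances train = new DataSource("training.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        Classifier tree = new J48();     // C4.5-style decision tree
        tree.buildClassifier(train);

        // Load test data (same attributes; the class value may be unknown).
        Instances test = new DataSource("testing.arff").getDataSet();
        test.setClassIndex(test.numAttributes() - 1);
        for (int i = 0; i < test.numInstances(); i++) {
            double label = tree.classifyInstance(test.instance(i));
            System.out.println("Instance " + i + " -> " + test.classAttribute().value((int) label));
        }
    }
}
```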
As Gann Bierner said, this is a classification problem. The best classification algorithm for your needs that I know of is Ross Quinlan's decision tree algorithm. It's conceptually very easy to understand.
For off-the-shelf implementations of classification algorithms, the best bet is Weka: http://www.cs.waikato.ac.nz/ml/weka/. I have studied Weka but not used it, as I discovered it a little too late.
I used a much simpler implementation called JaDTi. It works pretty well for smaller data sets such as yours; I have used it quite a bit, so I can say so with confidence. JaDTi can be found at:
http://www.run.montefiore.ulg.ac.be/~francois/software/jaDTi/
Having said all that, your challenge will be building a usable interface over the web, and for that the data set alone will be of limited use. The data set basically works on the premise that you already have the training set, you feed in the new test data in one step, and you get the answer(s) immediately.
But my application, and probably yours also, was a step-by-step user discovery process, with features to go back and forth between the decision tree nodes.
To build such an application, I created a PMML document from my training set and built a Java engine that traverses each node of the tree, asking the user for an input (text/radio/list) and using the values as inputs to the next node's predicate.
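This is not that engine, but a simplified sketch of the traversal idea, with a hand-built node class standing in for the parsed PMML TreeModel and made-up questions:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

/** Simplified stand-in for walking a decision tree node by node with user input. */
public class TreeWalkSketch {
    static class Node {
        String question;                       // null for leaf nodes
        String result;                         // non-null only for leaves
        Map<String, Node> children = new HashMap<>();
    }

    public static void main(String[] args) {
        // Hypothetical two-question tree; a real engine would build this from the PMML TreeModel.
        Node leafYes = new Node(); leafYes.result = "diagnosis: 1";
        Node leafNo = new Node(); leafNo.result = "diagnosis: 0";
        Node second = new Node(); second.question = "Symptom B present? (yes/no)";
        second.children.put("yes", leafYes);
        second.children.put("no", leafNo);
        Node root = new Node(); root.question = "Symptom A present? (yes/no)";
        root.children.put("yes", second);
        root.children.put("no", leafNo);

        Scanner in = new Scanner(System.in);
        Node current = root;
        while (current.result == null) {
            System.out.println(current.question);
            String answer = in.nextLine().trim().toLowerCase();
            Node next = current.children.get(answer);
            if (next == null) {
                System.out.println("Please answer yes or no.");
            } else {
                current = next;
            }
        }
        System.out.println(current.result);
    }
}
```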
The PMML standard can be found here: http://www.dmg.org/ Here you need the TreeModel only. NetBeans XML Plugin is a good schema-aware editor for PMML authoring. Altova XML can do a better job, but costs $$.
It is also possible to use an RDBMS to store your dataset and create the PMML automagically! I have not tried that.
Good luck with your project, please feel free to let me know if you need further inputs.
There are various algorithms that fall into the category of "machine learning", and which is right for your situation depends on the type of data you're dealing with.
If your data essentially consists of mappings of a set of questions to a set of diagnoses, each of which can be yes/no, then I think methods that could potentially work include neural networks and methods for automatically building a decision tree from the training data.
I'd have a look at some of the standard texts such as Russell & Norvig ("Artificial Intelligence: A Modern Approach") and other introductions to AI/machine learning and see if you can easily adapt the algorithms they mention to your particular data. See also O'Reilly's "Programming Collective Intelligence" for sample Python code for one or two algorithms that might be adaptable to your case.
If you can read Spanish, the Mexican publishing house Alfaomega have also published various good AI-related introductions in recent years.
This is a classification problem, not really data mining. The general approach is to extract features from each data instance and let the classification algorithm learn a model from the features and the outcome (which for you is 0 or 1). Presumably each of your 30 questions would be its own feature.
There are many classification techniques you can use. Support vector machines are popular, as is maximum entropy. I haven't used the Java Machine Learning Library, but at a glance I don't see either of these. The OpenNLP project has a maximum entropy implementation, and LibSVM has a support vector machine implementation. You'll almost certainly have to convert your data to something the library can understand.
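For example, LibSVM's text format is one `label index:value ...` line per record, so your 30-answer CSV rows would need a small converter roughly like this (the file names and comma-separated layout are assumptions about your data):

```java
import java.io.*;

/** Sketch: convert "a1,a2,...,a30,diagnosis" CSV lines into LibSVM's "label 1:a1 2:a2 ..." format. */
public class CsvToLibSvm {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("training.csv"));
             PrintWriter out = new PrintWriter(new FileWriter("training.libsvm"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                StringBuilder sb = new StringBuilder(cols[cols.length - 1]); // label (0 or 1) first
                for (int i = 0; i < cols.length - 1; i++) {
                    sb.append(' ').append(i + 1).append(':').append(cols[i]); // 1-based feature index
                }
                out.println(sb);
            }
        }
    }
}
```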
Good luck!
Update: I agree with the other commenter that Russell and Norvig is a great AI book which discusses some of this. Bishop's "Pattern Recognition and Machine Learning" discusses classification issues in depth if you're interested in the down-and-dirty details.
Your task is a classic one for neural networks, which are intended first of all to solve exactly this kind of classification task. A neural network has a rather simple implementation in any language, and it is the "mainstream" of machine learning, closer to AI than most other approaches.
You just implement (or get an existing implementation of) a standard neural network, for example a multilayer network trained by error backpropagation, and feed it the training examples in a loop. After some time of such training you will get it working on real examples.
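As a concrete illustration, here is a minimal, self-contained sketch of such a network: one hidden layer, sigmoid activations, online backpropagation. The toy XOR data and the layer sizes are placeholders; your case would use 30 inputs (the question answers) and one 0/1 output:

```java
import java.util.Random;

/** Tiny multilayer perceptron trained by error backpropagation (sketch, not production code). */
public class BackpropSketch {
    static final int IN = 2, HIDDEN = 4, OUT = 1;         // placeholder sizes; your case would use IN = 30
    static double[][] w1 = new double[HIDDEN][IN + 1];    // +1 slot for the bias weight
    static double[][] w2 = new double[OUT][HIDDEN + 1];

    public static void main(String[] args) {
        // Toy XOR data standing in for question-answer vectors and 0/1 diagnoses.
        double[][] inputs = { {0, 0}, {0, 1}, {1, 0}, {1, 1} };
        double[] targets = { 0, 1, 1, 0 };

        Random rnd = new Random(1);
        init(w1, rnd); init(w2, rnd);

        double rate = 0.5;
        for (int epoch = 0; epoch < 20000; epoch++) {
            for (int s = 0; s < inputs.length; s++) {
                double[] h = layer(inputs[s], w1);
                double[] o = layer(h, w2);
                // Output deltas (squared-error loss, sigmoid derivative o*(1-o)).
                double[] dOut = new double[OUT];
                for (int k = 0; k < OUT; k++) dOut[k] = (targets[s] - o[k]) * o[k] * (1 - o[k]);
                // Hidden deltas backpropagated through w2.
                double[] dHid = new double[HIDDEN];
                for (int j = 0; j < HIDDEN; j++) {
                    double sum = 0;
                    for (int k = 0; k < OUT; k++) sum += dOut[k] * w2[k][j];
                    dHid[j] = sum * h[j] * (1 - h[j]);
                }
                // Weight updates.
                update(w2, h, dOut, rate);
                update(w1, inputs[s], dHid, rate);
            }
        }
        for (double[] x : inputs)
            System.out.println(java.util.Arrays.toString(x) + " -> " + layer(layer(x, w1), w2)[0]);
    }

    static void init(double[][] w, Random rnd) {
        for (double[] row : w) for (int i = 0; i < row.length; i++) row[i] = rnd.nextDouble() - 0.5;
    }

    /** Forward pass through one fully connected sigmoid layer. */
    static double[] layer(double[] in, double[][] w) {
        double[] out = new double[w.length];
        for (int n = 0; n < w.length; n++) {
            double sum = w[n][in.length];                  // bias weight in the last slot
            for (int i = 0; i < in.length; i++) sum += w[n][i] * in[i];
            out[n] = 1.0 / (1.0 + Math.exp(-sum));
        }
        return out;
    }

    static void update(double[][] w, double[] in, double[] delta, double rate) {
        for (int n = 0; n < w.length; n++) {
            for (int i = 0; i < in.length; i++) w[n][i] += rate * delta[n] * in[i];
            w[n][in.length] += rate * delta[n];            // bias update
        }
    }
}
```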
You can read more about neural networks starting from here:
http://en.wikipedia.org/wiki/Neural_network
http://en.wikipedia.org/wiki/Artificial_neural_network
You can also find links to many ready-made implementations here:
http://en.wikipedia.org/wiki/Neural_network_software
I've got a doozy of a problem here. I'm aiming to build a framework to allow for the integration of different traffic simulation models. This integration is based upon the sharing of link connectivities, link costs, and vehicles between simulations.
To make a distributed simulation, I plan to have a 'coordinator' (star topology). All participating simulations simply register with it, and talk only to the coordinator. The coordinator then coordinates the execution of various tasks between each simulation.
A quick example of a distribution problem is when one simulation is 'in charge' of certain objects, like a road, and another is 'in charge' of other roads. However, these roads are interconnected (and hence we need synchronisation between these simulations, and need to be able to exchange data / invoke methods remotely).
I've had a look at RMI and am thinking it may be suited to this task (to abstract away having to create an over-the-wire signalling discipline).
Is this sane? The issue here is that simulation participants need to centralize some of their data storage in the 'coordinator' to ensure explicit synchronisation between simulations. Furthermore, some simulations may require components or methods from other simulations. (Hence the idea of using RMI.)
My basic approach is to have the 'coordinator' run a giant RMI registry, and have every simulation simply look up everything in the registry, ensuring that the correct objects are used at each step.
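A bare-bones sketch of that coordinator/registry idea; the RoadSegment interface and the binding names are made up for illustration:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

/** Hypothetical remote interface a participating simulation exposes to the coordinator. */
interface RoadSegment extends Remote {
    double getCost() throws RemoteException;
    void addVehicle(String vehicleId) throws RemoteException;
}

/** Sketch: the coordinator hosts the registry; each simulation binds its objects into it. */
public class CoordinatorSketch {
    public static void main(String[] args) throws Exception {
        // Run on the coordinator: create the central registry on the default RMI port.
        Registry registry = LocateRegistry.createRegistry(1099);

        // A simulation exports one of its objects and registers it under a well-known name.
        RoadSegment segment = new RoadSegment() {
            public double getCost() { return 12.5; }
            public void addVehicle(String vehicleId) { System.out.println("added " + vehicleId); }
        };
        RoadSegment stub = (RoadSegment) UnicastRemoteObject.exportObject(segment, 0);
        registry.rebind("sim-A/road-42", stub);

        // Another simulation looks the object up through the coordinator and calls it remotely.
        RoadSegment remote = (RoadSegment) LocateRegistry.getRegistry("localhost").lookup("sim-A/road-42");
        System.out.println("Remote cost: " + remote.getCost());
    }
}
```

In a real deployment the coordinator and the simulations would run in separate JVMs; this just shows the registry, export, bind and lookup calls in one place.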
Anyone have any tips for heading down this path?
You may want to check out Hazelcast also. Hazelcast is an open source transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock and executor service. It is super easy to work with; just add hazelcast.jar into your classpath and start coding. Almost no configuration is required.
If you are interested in executing your Runnable and Callable tasks in a distributed fashion, then please check out the Distributed Executor Service documentation at http://code.google.com/docreader/#p=hazelcast
Hazelcast is released under Apache license and enterprise grade support is also available.
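Roughly, using Hazelcast for the shared state and for distributed tasks could look like the sketch below. The names are illustrative, and the exact package/class layout has changed between Hazelcast versions, so check the documentation for the version you pick:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.IMap;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

/** Sketch: shared road state in a distributed map plus a task submitted to the cluster. */
public class HazelcastSketch {
    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Shared, partitioned map visible to every simulation node.
        IMap<String, Double> linkCosts = hz.getMap("link-costs");
        linkCosts.put("road-42", 12.5);

        // A distributed task executed somewhere in the cluster.
        IExecutorService executor = hz.getExecutorService("sim-tasks");
        Future<Double> result = executor.submit(new CostTask("road-42"));
        System.out.println("Cost seen by remote task: " + result.get());

        hz.shutdown();
    }

    /** Tasks must be serializable so they can be shipped to other members. */
    static class CostTask implements Callable<Double>, Serializable {
        private final String roadId;
        CostTask(String roadId) { this.roadId = roadId; }
        public Double call() {
            // Hypothetically, the member-local Hazelcast instance would be consulted here;
            // for the sketch we just return a placeholder value.
            return 12.5;
        }
    }
}
```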
Is this sane? IMHO no. And I'll tell you why. But first I'll add the disclaimer that this is a complicated topic so any answer has to be viewed as barely scratching the surface.
First, instead of repeating myself, I'll point you to a summary of Java grid/cluster technologies that I wrote a while ago. It's a mostly complete list.
The star topology is "natural" for a "naive" (I don't mean that in a bad way) implementation because point-to-point is simple and centralizing key controller logic is also simple. It is, however, not fault-tolerant. It introduces scalability problems and a single bottleneck, and it introduces communication inefficiencies (namely, the points communicate via a two-step process through the center).
What you really want for this is probably a cluster (rather than a data/compute grid) solution and I'd suggest you look at Terracotta. Ideally you'd look at Oracle Coherence but it's no doubt expensive (compared to free). It is a fantastic product though.
These two products can be used a number of ways, but the core of both is to treat a cache like a distributed map. You put things in, you take things out, and you fire off code that alters the cache. Coherence (with which I'm more familiar) scales fantastically well in this regard. These are more "server"-based products, though, intended for a true cluster.
If you're looking at a more distributed model then perhaps you should be looking at more of an SOA based approach.
Have a look at http://www.terracotta.org/
It's a distributed Java VM, so it has the advantage that a clustered application looks no different from a standard Java application.
I have used it in applications and the speed is very impressive so far.
Paul
Have you considered using a message queue approach? You could use JMS to communicate/coordinate tasks and results among a set of servers/nodes. You could even use Amazon's SQS (Simple Queue Service: aws.amazon.com/sqs) and have your servers running on EC2 to allow you to scale up and down as required.
Just my 2 cents.
Take a look at JINI, it might be of some use to you.
Well, Jini, or more specifically JavaSpaces, is a good place to start for a simple approach to the problem. JavaSpaces lets you implement a master-worker model, where your master (the coordinator in your case) writes tasks to the JavaSpace, and the workers query for and process those tasks, writing the results back for the master. Since your problem is not embarrassingly parallel, and your workers need to synchronize and exchange data, this will add some complexity to your solution.
Using JavaSpaces will add a whole lot more abstraction to your implementation than using plain RMI (which is used by the Jini framework internally as the default "wire protocol").
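A rough sketch of the master/worker interaction, assuming you have already obtained a JavaSpace reference via Jini lookup (that plumbing is omitted) and with a made-up Task entry:

```java
import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

/** Entries in a JavaSpace are plain classes with public fields. This one is illustrative only. */
class Task implements Entry {
    public String roadId;
    public Integer timeStep;

    public Task() {}                       // entries need a public no-arg constructor
    public Task(String roadId, Integer timeStep) {
        this.roadId = roadId;
        this.timeStep = timeStep;
    }
}

/** Sketch of the master writing tasks and a worker taking them. */
public class SpaceSketch {
    // The JavaSpace would be discovered via Jini lookup; that plumbing is omitted here.
    static void master(JavaSpace space) throws Exception {
        space.write(new Task("road-42", 1), null, Lease.FOREVER);
    }

    static void worker(JavaSpace space) throws Exception {
        Task template = new Task();        // null fields act as wildcards when matching
        Task task = (Task) space.take(template, null, Long.MAX_VALUE);
        System.out.println("Processing " + task.roadId + " at step " + task.timeStep);
    }
}
```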
Have a look at this article from Sun for an intro.
And Jan Newmarch's Jini Tutorial is a pretty good place to start learning Jini.
Just as an addition to the other answers, which as far as I have seen all focus on grid and cloud computing: you should note that simulation models have one unique characteristic, namely simulation time.
When running distributed simulation models in parallel and keeping them synchronized, I see two options:
Each simulation model has its own simulation clock and event list, and these are synchronized over the network.
Alternatively there could be a single simulation clock and event list which will "tick the time" for all distributed (sub) models.
The first option has been extensively researched for the High Level Architecture (HLA); see for example http://en.wikipedia.org/wiki/IEEE_1516 as a starter.
However, the second option seems simpler and lower-overhead to me.
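Very roughly, the second option could be sketched as a single coordinator-owned clock and event list that ticks time for all registered sub-models; the SubModel interface here is made up (in your setup it might be an RMI remote interface):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of option 2: one central clock/event list driving all registered sub-models. */
public class CentralClockSketch {
    /** Hypothetical interface each distributed sub-model would implement (e.g. behind RMI). */
    interface SubModel {
        void advanceTo(double simulationTime);
    }

    static class Event implements Comparable<Event> {
        final double time;
        final Runnable action;
        Event(double time, Runnable action) { this.time = time; this.action = action; }
        public int compareTo(Event other) { return Double.compare(time, other.time); }
    }

    private final PriorityQueue<Event> eventList = new PriorityQueue<>();
    private final List<SubModel> models = new ArrayList<>();
    private double now = 0.0;

    void register(SubModel model) { models.add(model); }
    void schedule(double time, Runnable action) { eventList.add(new Event(time, action)); }

    /** Process events in time order; no sub-model ever gets ahead of the central clock. */
    void run(double endTime) {
        while (!eventList.isEmpty() && eventList.peek().time <= endTime) {
            Event next = eventList.poll();
            now = next.time;
            for (SubModel m : models) m.advanceTo(now);   // tick every sub-model to the new time
            next.action.run();
        }
    }

    public static void main(String[] args) {
        CentralClockSketch clock = new CentralClockSketch();
        clock.register(t -> System.out.println("  sub-model advanced to t=" + t));
        clock.schedule(1.0, () -> System.out.println("event: vehicle enters road-42"));
        clock.schedule(2.5, () -> System.out.println("event: vehicle leaves road-42"));
        clock.run(10.0);
    }
}
```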
GridGain is a good alternative. They have a map/reduce implementation with "direct API support for split and aggregation" and a "distributed task session". You can browse their examples and see if any of them fit your needs.