I am planning on using LibSVM to predict user authenticity in web applications.
(1) Collect data on particular user behavior (e.g. login time, IP address, country, etc.)
(2) Use the collected data to train an SVM
(3) Use real-time data to compare against the model and generate an output on the level of authenticity
Can someone tell me how I can do such a thing with LibSVM? Can Weka be helpful in these types of problems?
The three steps you mention are an outline of the solution. In some more detail:
Make sure you get plenty of labeled data, i.e. behavior logs annotated with authentic/non-authentic. (Without labeled data, you get into the pretty advanced field of semisupervised learning, or must consider other solutions.)
Design a number of features based on the data that you think predict authenticity well. Try the method and refine it until it works well enough by some statistical standard. Use ten-fold cross-validation to make sure you're not overfitting.
LibSVM can output a probability estimate along with its answer; see section 8 of its manual.
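To make that concrete, here is a minimal sketch of training and querying LibSVM through its Java API. It assumes you have already turned each session into a fixed-length numeric feature vector; the feature extraction itself and the class/method names below are placeholders of my own choosing.

    import libsvm.*;

    public class AuthenticityModel {

        // Train a C-SVC with an RBF kernel on pre-extracted numeric features.
        // features[i] is one labeled session; labels[i] is 1.0 (authentic) or 0.0 (not).
        static svm_model train(double[][] features, double[] labels) {
            svm_problem prob = new svm_problem();
            prob.l = features.length;
            prob.y = labels;
            prob.x = new svm_node[features.length][];
            for (int i = 0; i < features.length; i++) {
                prob.x[i] = toNodes(features[i]);
            }

            svm_parameter param = new svm_parameter();
            param.svm_type = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;
            param.C = 1;
            param.gamma = 1.0 / features[0].length;
            param.cache_size = 100;
            param.eps = 1e-3;
            param.probability = 1;   // enable probability estimates (see section 8 of the manual)
            return svm.svm_train(prob, param);
        }

        // Probability that a new session belongs to class 1.0 (authentic).
        static double authenticityScore(svm_model model, double[] features) {
            double[] probs = new double[2];
            svm.svm_predict_probability(model, toNodes(features), probs);
            int[] classLabels = new int[2];
            svm.svm_get_labels(model, classLabels);  // probs[] is ordered by these labels
            return classLabels[0] == 1 ? probs[0] : probs[1];
        }

        private static svm_node[] toNodes(double[] values) {
            svm_node[] nodes = new svm_node[values.length];
            for (int i = 0; i < values.length; i++) {
                nodes[i] = new svm_node();
                nodes[i].index = i + 1;   // LibSVM feature indices are 1-based
                nodes[i].value = values[i];
            }
            return nodes;
        }
    }

In practice you would run the cross-validation mentioned above (svm.svm_cross_validation) over a grid of C and gamma values before settling on the parameters.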
I want to check a user's balance of a couple of ERC20-compliant tokens using web3j.
Is there a generic way of doing that (generic for every ERC20 contract) or should I get ABI for each of the contracts and generate java classes from it?
I have never used web3j, but I have used web3js quite a bit. I will link you to relevant information.
Here is an interface that has already been created in the tests of the web3j library, so that is probably the best place to start.
Extra notes (which might well be basic for you)
Checking the balance is something that you don't want to generate a transaction for (since it doesn't change the state of the blockchain) and so you should use a 'call', as explained here.
Also, it may be useful to understand how Ethereum creates the ABI in the first place. Every transaction or call can carry data with it, and the network uses this data to determine which function is being called and its parameters. The function being called is identified by the first 4 bytes of the Keccak hash of the function's name/parameters (some info), which is one reason why it is so important that this hash is collision-free (imagine 2 different functions hashing to the same selector). But the take-home is that all ERC20 tokens (if they follow the standard) have common ABIs for those functions.
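I haven't run web3j myself, but going by its ABI utilities, a generic balanceOf call (done as a call, not a transaction, as discussed above) should look roughly like this; the node URL and the addresses are placeholders:

    import java.math.BigInteger;
    import java.util.Arrays;
    import java.util.List;

    import org.web3j.abi.FunctionEncoder;
    import org.web3j.abi.FunctionReturnDecoder;
    import org.web3j.abi.TypeReference;
    import org.web3j.abi.datatypes.Address;
    import org.web3j.abi.datatypes.Function;
    import org.web3j.abi.datatypes.Type;
    import org.web3j.abi.datatypes.generated.Uint256;
    import org.web3j.protocol.Web3j;
    import org.web3j.protocol.core.DefaultBlockParameterName;
    import org.web3j.protocol.core.methods.request.Transaction;
    import org.web3j.protocol.core.methods.response.EthCall;
    import org.web3j.protocol.http.HttpService;

    public class Erc20Balance {

        // Generic balanceOf(address) call; the same code works for any compliant ERC20
        // contract because they all share the balanceOf signature (and thus the selector).
        static BigInteger balanceOf(Web3j web3, String tokenContract, String owner) throws Exception {
            Function function = new Function(
                    "balanceOf",
                    Arrays.<Type>asList(new Address(owner)),
                    Arrays.<TypeReference<?>>asList(new TypeReference<Uint256>() {}));

            // A call, not a transaction: nothing is written to the blockchain.
            EthCall response = web3.ethCall(
                    Transaction.createEthCallTransaction(owner, tokenContract, FunctionEncoder.encode(function)),
                    DefaultBlockParameterName.LATEST).send();

            List<Type> decoded = FunctionReturnDecoder.decode(response.getValue(), function.getOutputParameters());
            return ((Uint256) decoded.get(0)).getValue();
        }

        public static void main(String[] args) throws Exception {
            // Node URL, token contract and owner addresses below are placeholders.
            Web3j web3 = Web3j.build(new HttpService("http://localhost:8545"));
            System.out.println(balanceOf(web3, "0x<token contract address>", "0x<owner address>"));
        }
    }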
PS. For next time, I think this question is better suited for the Ethereum Stack Exchange.
I am planning a task to read all the bank-related SMS messages from the user's Android inbox and extract the account number and balance from them. I am guessing this could be done in two ways:
Using a regex to extract the data from the SMS body, as described in the answer linked here. This certainly has the advantage of giving a generic representation of any bank balance message
Storing a template message for every bank in the database and comparing it with the received SMS to extract the data
I would like to know which approach is more efficient, or whether there is any other way to do it.
The two approaches have different qualities:
Option 1 might lead to many different, complex regular expressions. Just glancing at the answer you linked made my head spin. Meaning: maintaining such a list of regular expressions will not be an easy undertaking from the developer perspective.
Whereas for option 2, you of course have to keep track of your collection of "templates", but once your infrastructure is in place, the only work required of you is adding new templates, or adapting existing ones.
So, from a "development effort" side I would tend towards option 2, because such "templates" are easier for you to manage. You don't even need much understanding of the Java language in order to deal with such templates. They are just text containing some defined keywords here and there.
One could even think about telling your users how to define templates themselves! They know what the SMS from their bank looks like, so you could think about some "import" mechanism where your app pulls in the SMS text, and then the user tells the app (once) where the relevant parts can be found in there!
Regarding runtime efficiency: I wouldn't rely on people making guesses here. Instead, run experiments with real-world data and see whether matching SMS text against a larger set of complex regular expressions is cheaper or more expensive than matching it against much simpler "templates".
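To illustrate what such a "template" could look like under the hood, here is a rough sketch; the {FIELD} placeholder syntax and the sample message wording are made up for the example:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SmsTemplate {

        private final Pattern pattern;
        private final List<String> fieldNames = new ArrayList<>();

        // A template is plain text with {FIELD} placeholders, e.g.
        // "A/c {ACCOUNT} credited. Avl Bal: Rs.{BALANCE}" (the wording is made up).
        public SmsTemplate(String template) {
            Matcher placeholders = Pattern.compile("\\{(\\w+)\\}").matcher(template);
            StringBuilder regex = new StringBuilder();
            int last = 0;
            while (placeholders.find()) {
                regex.append(Pattern.quote(template.substring(last, placeholders.start())));
                regex.append("(\\S+)");                 // capture whatever stands in for the placeholder
                fieldNames.add(placeholders.group(1));
                last = placeholders.end();
            }
            regex.append(Pattern.quote(template.substring(last)));
            pattern = Pattern.compile(regex.toString());
        }

        // Returns the extracted fields, or null if the SMS does not match this template.
        public Map<String, String> extract(String smsBody) {
            Matcher m = pattern.matcher(smsBody);
            if (!m.find()) {
                return null;
            }
            Map<String, String> result = new HashMap<>();
            for (int i = 0; i < fieldNames.size(); i++) {
                result.put(fieldNames.get(i), m.group(i + 1));
            }
            return result;
        }
    }

The point is that the templates stay readable text, while the regex machinery is generated once, behind the scenes.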
Storing a template for each bank costs more memory (if you load them all at startup for efficiency) and file-system storage, and, as you stated, it has the downside of requiring you to know each bank's template in advance and to set the application up for it properly.
Using a regex costs neither file-system storage nor extra memory; however, it could produce false positives for messages that look like bank messages but are not. On the other hand, it has the advantage of not needing to know all the banks out there in order to work properly.
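Just to make the comparison concrete, here is a tiny sketch of the regex approach; the patterns and the sample message are invented and certainly too naive for production use:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class BalanceRegex {

        // Deliberately simplified patterns; real bank messages vary a lot, so a
        // production version would need several such patterns (see the linked answer).
        private static final Pattern ACCOUNT =
                Pattern.compile("(?i)a/c\\s*(?:no\\.?\\s*)?([Xx*]*\\d{3,})");
        private static final Pattern BALANCE =
                Pattern.compile("(?i)(?:avl\\.?\\s*)?bal(?:ance)?[^0-9]*([0-9][0-9,]*(?:\\.\\d{1,2})?)");

        public static void main(String[] args) {
            String sms = "A/c XX1234 credited. Avl Bal: 10,250.75";  // made-up example message
            Matcher acc = ACCOUNT.matcher(sms);
            Matcher bal = BALANCE.matcher(sms);
            System.out.println("account: " + (acc.find() ? acc.group(1) : "?"));
            System.out.println("balance: " + (bal.find() ? bal.group(1) : "?"));
        }
    }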
Let me describe the problem. A lot of suppliers send us data files in various formats (with various headers). We do not have any control over the data format (what columns the suppliers send us). This data then needs to be converted to our standard transactions (this standard is constant and defined by us).
The challenge here is that we do not have any control over what columns suppliers send us in their files. The destination standard is constant. Now I have been asked to develop a framework through which the end users can define their own data transformation rules through a UI (say, field A in the destination transaction is equal to columnX + columnY, or the first 3 characters of columnZ from the input file). There will be many such data transformation rules.
The goal is that the users should be able to add all these supplier files (and convert all their data to my company's data) from a front-end UI with minimum code change. Please suggest some frameworks for this (preferably Java-based).
I've worked in a similar field before. I'm not sure I would trust customers/suppliers to use such a tool correctly and design 100% bulletproof transformations. Mapping columns is one thing, but how about formatting problems in dates, monetary values and the like? You'd probably need to manually check their creations anyway or you'll end up with some really nasty data consistency issues. Errors caused by faulty data transformation are little beasts hiding in the dark and jumping at you when you need them the least.
If all you need is a relatively simple, graphical way to design data conversions, check out something like Talend Open Studio (just google it). It calls itself an ETL tool, but we used it for all kinds of stuff.
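If you do end up building the rule engine yourselves rather than (or on top of) an ETL tool, the core abstraction can stay small. A minimal sketch, reusing the columnX + columnY and "first 3 characters of columnZ" examples from the question; everything else (names, sample values) is made up:

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TransformationDemo {

        // A rule turns one supplier row (header -> value) into one destination field.
        interface Rule {
            String apply(Map<String, String> row);
        }

        public static void main(String[] args) {
            // Destination field -> rule; in a real framework these rules would be
            // built from what the user configures in the UI, not hard-coded.
            Map<String, Rule> mapping = new LinkedHashMap<>();
            mapping.put("fieldA", row -> row.get("columnX") + row.get("columnY"));
            mapping.put("fieldB", row -> row.get("columnZ").substring(0, 3));

            Map<String, String> supplierRow = new HashMap<>();
            supplierRow.put("columnX", "AB");
            supplierRow.put("columnY", "123");
            supplierRow.put("columnZ", "GERMANY");

            Map<String, String> standardTransaction = new LinkedHashMap<>();
            mapping.forEach((field, rule) -> standardTransaction.put(field, rule.apply(supplierRow)));

            System.out.println(standardTransaction);   // {fieldA=AB123, fieldB=GER}
        }
    }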
I am implementing a small CRM system, and the concept of data mining to predict and find opportunities and trends is essential for such systems. One data mining approach is clustering. This is a very small CRM project, using Java to provide the interface for information retrieval from the database.
My question is this: when I insert a customer into the database, I have a text field which allows customers to be tagged on their way into the database, i.e. at the registration point.
Would you regard this tagging technique as clustering? If so, is it a data mining technique?
I am sure there are complex APIs, such as the Java Data Mining API, which allow data mining. But for the sake of my project I just wanted to know whether tagging users with keywords, the way Stack Overflow allows tagging questions with keywords when posting, is a form of data mining, since through those tagged words one can find trends and patterns easily through searching.
To make it short, yes, tags are additional information that will make data mining easier to conduct later on.
They probably won't be enough, though. Tags are linked to entities and, depending on how you compute them, they might not show interesting relations between different entities. With your tagging system, the only usable relation I see is 'has the same tag', and it might not be enough.
Clustering your data can be done using community detection techniques on graphs built using your data and relations between entities.
This example is in Python and uses the networkx library but it might give you an idea of what I'm talking about: http://perso.crans.org/aynaud/communities/
Yes, tagging is definitely one way of grouping your users. However, it's different from 'clustering'. Here's why: you're making a conscious decision on how you want to group them, but there may be better or different user groups, based on a range of behaviors, that may not be obvious to you.
Clustering methods are unsupervised learning methods that can help you uncover these patterns. These methods are “unsupervised” because you don’t have a specific target variable; rather, you want to find groups/ patterns that are most prominent in the data. You can feed CRM data to clustering algorithms to uncover ‘hidden’ relationships.
Also, if you're using 'tagging', it's more of a descriptive analytics problem: you have well-defined groups in the data, and you're identifying their behavior. Clustering would be a predictive analytics problem: the algorithms will try to predict groups based on the user behavior they recognize in the data.
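Since the project is in Java, a minimal sketch of one such unsupervised method using Weka's SimpleKMeans may help; it assumes you have exported the CRM data to an ARFF file (customers.arff and the cluster count below are placeholders):

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CustomerClustering {

        public static void main(String[] args) throws Exception {
            // Load customer records exported from the CRM database.
            Instances data = new DataSource("customers.arff").getDataSet();

            SimpleKMeans kMeans = new SimpleKMeans();
            kMeans.setNumClusters(5);      // number of groups to look for
            kMeans.buildClusterer(data);

            // Assign each customer to the cluster it falls into.
            for (int i = 0; i < data.numInstances(); i++) {
                int cluster = kMeans.clusterInstance(data.instance(i));
                System.out.println("customer " + i + " -> cluster " + cluster);
            }
        }
    }

You would then inspect each cluster to see which 'hidden' customer segments it corresponds to; tags can simply be extra attributes in the exported data.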
I am using the ISO 19794-2 fingerprint data format. All the data are in the ISO 19794-2 format, and I have more than a hundred thousand fingerprints. I wish to perform an efficient search to identify a match. Is it possible to construct a binary-tree-like structure to perform an efficient (fastest) search for a match? Or can you suggest a better way to find the match, and an open source Java API for fingerprint matching? Thanks.
Do you have a background in fingerprint matching? It is not a simple problem and you'll need a bit of theory to tackle such a problem. Have a look at this introduction to fingerprint matching by Bologna University's BioLab (a leading research lab in this field).
Let's now get to your question, that is, how to make the search more efficient.
Fingerprints can be classified into 5 main classes, according to the type of macro-singularity that they exhibit.
There are three types of macro-singularities:
whorl (a sort of circle)
loop (a U inversion)
delta (a sort of three-way crossing)
According to the position of those macro-singularities, you can classify the fingerprint in those classes:
arch
tented arch
right loop
left loop
whorl
Once you have narrowed the search to the correct class, you can perform your matches. From your question it looks like you have to do an identification task, so I'm afraid that you'll have to do all the comparisons, or else add some layers of pre-processing (like the classification I wrote about) to further narrow the search field.
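To make the "narrow by class first" idea concrete, here is a rough sketch of the bookkeeping side in Java; classifying a print into one of the five classes, and the actual minutiae matching, are deliberately left out, because they are exactly the hard parts mentioned below:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.EnumMap;
    import java.util.List;
    import java.util.Map;

    public class ClassifiedGallery {

        // The five classes listed above.
        enum FingerprintClass { ARCH, TENTED_ARCH, RIGHT_LOOP, LEFT_LOOP, WHORL }

        // A stored record: just an id plus its class here; the actual template data
        // and the minutiae matcher depend on whichever SDK you end up using.
        static class Record {
            final String id;
            final FingerprintClass fpClass;
            Record(String id, FingerprintClass fpClass) { this.id = id; this.fpClass = fpClass; }
        }

        private final Map<FingerprintClass, List<Record>> byClass = new EnumMap<>(FingerprintClass.class);

        void add(Record r) {
            byClass.computeIfAbsent(r.fpClass, c -> new ArrayList<>()).add(r);
        }

        // Only records in the probe's class need to go through the expensive one-to-one comparison.
        List<Record> candidatesFor(FingerprintClass probeClass) {
            return byClass.getOrDefault(probeClass, Collections.emptyList());
        }
    }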
You can find lots of information about fingerprint matching in the book Handbook of Fingerprint Recognition, by Maltoni, Maio, Jain and Prabhakar - leading researchers in this field.
In order to read ISO 19794-2 format, you could use some utilities developed by NIST called BiomDI, Software Tools supporting Standard Biometric Data Interchange Formats. You could try to interface it with open source matching algorithms like the one found in this biometrics SDK. It would however need a lot of work, including the conversion from one format to another and the fine-tuning of algorithms.
My opinion (as a Ph.D. student working in biometrics) is that in this field you can easily write code that does the 60% of what you need in no time, but the remaining 40% will be:
hard to write (20%); and
really hard to write without money and time (20%).
Hope that helps!
Edit: added info about NIST BiomDI
Edit 2: since people sometimes email me asking for a copy of the standard, I unfortunately don't have one to share. All I have is a link to the ISO page that sells the standard.
The ISO format specifies useful mechanisms for matching and decision parameters. Decide on the mechanism you wish to employ to identify a match, and on the relevant decision parameters. When you have determined these, examine them to see which are capable of being put into an order, with a fairly high spread of individual values, as you want to avoid multiple collisions on the data.
When you have identified a small number of data items (preferably one) that have this property, calculate that property for each fingerprint, preferably as it is added to the database, though a bulk load can be done initially. The search for a match is then done on the calculated characteristic, and can use a binary tree, a red-black tree, or a variety of other search structures. I cannot recommend a particular search strategy without knowing what form and degree of differentiation of values you have in your database.
Such a search strategy should, however, be capable of delivering a (small) range of possible matches, which can then be tested individually against your match mechanism and parameters before deciding on a specific match.
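As a sketch of that idea in Java (assuming you have already designed a numeric characteristic with a good spread of values, which is the part only you can decide): a TreeMap, which is a red-black tree internally, gives you the range search over the calculated characteristic.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class CharacteristicIndex {

        // TreeMap is a red-black tree, so range queries on the computed
        // characteristic cost O(log n) plus the size of the returned range.
        private final TreeMap<Long, List<String>> index = new TreeMap<>();

        // 'characteristic' is whatever orderable value you derive from the ISO 19794-2
        // record (designing it is the hard part); 'recordId' points back to the stored print.
        public void add(long characteristic, String recordId) {
            index.computeIfAbsent(characteristic, k -> new ArrayList<>()).add(recordId);
        }

        // Return the small set of candidates whose characteristic lies within
        // 'tolerance' of the probe; these are then verified with the full matcher.
        public List<String> candidates(long probeCharacteristic, long tolerance) {
            List<String> result = new ArrayList<>();
            index.subMap(probeCharacteristic - tolerance, true,
                         probeCharacteristic + tolerance, true)
                 .values()
                 .forEach(result::addAll);
            return result;
        }
    }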