What practical (and lightweight) techniques are there for semantic/data matching? - java

I have an application that lets users publish unstructured keywords. Simultaneously, other users can publish items that must be matched to one or more specified keywords. There is no restriction on the keywords either set of users may use, so simply hoping for an exact collision is likely to yield very few matches, when in reality users may have used different keywords for the same thing, or keywords that are close enough (e.g. 'bicycles' and 'cycling', or 'meat' and 'food').
I need this to work on mobile devices (Android), so I'm happy to sacrifice matching accuracy for efficiency and a small footprint. I know about S-Match, but it relies on a backing dictionary of 15MB, so it isn't ideal.
What other ideas/approaches/frameworks might help with this?

Your example of 'bicycles' and 'cycling' could be addressed by a take on the Levenshtein edit-distance algorithm, since the two words are lexically related. But your example of 'meat' and 'food' would indeed require a sizable backing dictionary, unless of course the concept set or target audience is limited to, say, foodies.
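For illustration, a hedged sketch of such an edit-distance check. Note that plain Levenshtein alone won't bring 'bicycles' and 'cycling' under a sane threshold; stemming both words first (e.g. with a Porter stemmer) so they share the stem 'cycl' is the kind of "take" that would be needed. The matching threshold below is an arbitrary assumption to tune:

    import java.util.Locale;

    // Minimal two-row Levenshtein distance; small enough for a mobile footprint.
    public final class EditDistance {

        public static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Hypothetical "close enough" test: distance at most a third of the
        // longer word's length. The threshold is a guess; tune it on real data.
        public static boolean roughlyMatches(String a, String b) {
            String x = a.toLowerCase(Locale.ROOT);
            String y = b.toLowerCase(Locale.ROOT);
            return levenshtein(x, y) <= Math.max(x.length(), y.length()) / 3;
        }
    }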
Have you considered hosting the dictionary as a web service and accessing the data as needed? The drawback of course is that your app would only work while in network coverage.

ArrayList or Multiple LinkedHashMap

I have an ArrayList of a custom object A. I need to retrieve two variables from A based on certain conditions. Should I simply use a for loop to retrieve the data from the list each time, or create two LinkedHashMaps and store the required variables in them as key/value pairs for faster access later? Which is more efficient? Does creating two additional map objects justify the gain during search?
The list will contain about 100-150 objects, and so would the two maps.
It will be used by concurrent users on a daily basis.
Asking about "efficiency" is like asking about "beauty". What is "efficiency"? I argue that efficiency is what gets the code out soonest without bugs or other misbehavior. What's most efficient in terms of software costs is what saves programmer time, both for initial development and maintenance. In the time it took you to find "answers" on SO, you could have had a correct implementation coded, and still have had time to test your alternatives rigorously under controlled conditions to see which made any difference in the program's operation.
If you save 10 ms of program run time at the cost of horridly complex, over-engineered code that is rife with bugs and stupidly difficult to refactor or fix, is that "efficient"?
Furthermore, as phrased, the question is useless on SO. You provided no definition of "efficient" from your context. You provided no information on how the structures in question fit into your project architecture, or the extent of their use, or the size of the problem, or anything else relevant to any definition of "efficiency".
Even if you had, we'd have no more ability to answer such a question than if you asked a roomful of lawyers, "Should I sue so-and-so for what they did?" It all depends. You need advice, if you need advice at all, that is very specific to your situation and the exact circumstances of your development environment and process, your runtime environment, your team, the project goals, budget, and other relevant data.
If you are interested in runtime "efficiency", do the following. Precisely define what exactly you mean by "efficient", including an answer to "how 'efficient' is 'efficient' enough?", and including criteria to measure such "efficiency". Once you have such a precise and (dis)provable definition, then set up a rigorous test protocol to compare the alternatives in your context, and actually measure "efficiency".
When defining "efficiency", make sure that what you define matters. It makes no difference to be "efficient" in an area that has very low project cost or impact, and ignore an area that has huge cost or impact.
Don't expect any meaningful answer for your situation here on SO.
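For what it's worth, a minimal sketch of such a measurement on the JVM. A naive timing loop like this only gives a rough signal; a harness like JMH is the proper tool for anything rigorous. The lookupUnderTest method is a hypothetical stand-in for whichever code path you are comparing:

    public class NaiveBenchmark {
        // Hypothetical stand-in for the code path being compared.
        static int lookupUnderTest() { return 42; }

        public static void main(String[] args) {
            // Warm-up pass so the JIT has compiled the hot path first.
            for (int i = 0; i < 100_000; i++) lookupUnderTest();

            long start = System.nanoTime();
            long sink = 0;                       // prevents dead-code elimination
            for (int i = 0; i < 1_000_000; i++) sink += lookupUnderTest();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println("elapsed: " + elapsedMs + " ms (sink=" + sink + ")");
        }
    }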
Use a LinkedHashMap, because it is made for key/value pairs (which matches your requirement), and because the data volume will grow in a production environment.
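For scale, a hedged sketch of the two alternatives being weighed; the class A and its fields are hypothetical stand-ins for the question's custom object:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical custom object from the question.
    class A {
        final String key;
        final int value;
        A(String key, int value) { this.key = key; this.value = value; }
    }

    public class LookupComparison {
        public static void main(String[] args) {
            List<A> list = new ArrayList<>();
            list.add(new A("alpha", 1));
            list.add(new A("beta", 2));

            // Alternative 1: linear scan each time -- O(n), trivial for ~150 items.
            A found = null;
            for (A a : list) {
                if (a.key.equals("beta")) { found = a; break; }
            }

            // Alternative 2: build a map once, then O(1) lookups thereafter.
            Map<String, A> byKey = new HashMap<>();
            for (A a : list) byKey.put(a.key, a);
            A fast = byKey.get("beta");

            System.out.println(found.value + " " + fast.value);
        }
    }

At 100-150 objects, either approach will be far faster than anything users can notice; the maps only pay off if lookups vastly outnumber rebuilds of the list.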

NLG - Create text descriptions with simplenlg

I'm trying to generate product descriptions with the help of NLG. For example, if I specify the properties of a product (say a mobile phone) such as its OS, RAM, processor, display, battery etc., it should output a readable description of the phone.
I see there are some paid services (Quill, Wordsmith etc.) which do the same.
Then I came across the open-source Java API for NLG, simplenlg. I see how to create sentences by specifying the sentence phrases and the features (such as tense, interrogation etc.), but I don't see an option to create a description from texts.
Does anyone know how to create a text description from words with simplenlg?
Are there any other tools/frameworks/APIs available to accomplish this task (not limited to Java)?
SimpleNLG is primarily a surface realizer. It requires well-formatted input but can then perform tasks such as changing the tense of a sentence. An explanation of the types of task a realizer can perform can be found at the link above.
Generating sentences like those you describe would require additional components to handle the document planning and microplanning. The exact boundaries between these components are blurred, but broadly speaking you define what you want to say in a document plan, then have the microplanner perform tasks such as referring expression generation (choosing whether to say 'it' rather than 'the mobile phone') and aggregation, which is the merging of sentences. SimpleNLG has some support for aggregation.
It is also worth noting that this three-stage process is not the only way to perform NLG, it is just a common one.
There is no magic solution I am aware of to take some information from a random domain and generate readable and meaningful text. In your mobile phone example it would be trivial to chain descriptions together and form something like:
The iPhone 7 has iOS11, 2GB RAM, a 1960 mA·h Li-ion battery and a $649 retail cost for the 32GB model.
But this would just be simple string concatenation or interpolation from your data. It does not account for nuance like the question of whether it would be better to say:
The iPhone 7 runs iOS11, has 2GB of RAM and is powered by a 1960 mA·h Li-ion battery. It costs $649 retail for the 32GB model.
In this second example I have adjusted verbs (and therefore noun phrases), used the referring expression of 'it' and split our long sentence in two (with some further changes because of the split). Making these changes requires knowledge (and therefore computational rules) of the words and their usage within the domain. It becomes non-trivial very quickly.
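To make the realization step concrete, here is a hedged sketch against the SimpleNLG v4 API (package and class names as I recall them; the coordination shown is SimpleNLG's built-in aggregation support, not a full microplanner):

    import simplenlg.framework.CoordinatedPhraseElement;
    import simplenlg.framework.NLGFactory;
    import simplenlg.lexicon.Lexicon;
    import simplenlg.phrasespec.SPhraseSpec;
    import simplenlg.realiser.english.Realiser;

    public class PhoneDescription {
        public static void main(String[] args) {
            Lexicon lexicon = Lexicon.getDefaultLexicon();
            NLGFactory factory = new NLGFactory(lexicon);
            Realiser realiser = new Realiser(lexicon);

            // Two simple clauses built from the product data.
            SPhraseSpec runs = factory.createClause("the iPhone 7", "run", "iOS11");
            SPhraseSpec ram  = factory.createClause("it", "have", "2GB of RAM");

            // Aggregate them into one coordinated sentence. The realizer handles
            // agreement and punctuation, producing something like:
            // "The iPhone 7 runs iOS11 and it has 2GB of RAM."
            CoordinatedPhraseElement coord = factory.createCoordinatedPhrase(runs, ram);
            System.out.println(realiser.realiseSentence(coord));
        }
    }

Notice that even here, choosing "it" as the subject of the second clause was a decision made outside SimpleNLG; that is exactly the microplanning work described above.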
If your requirements are as simple as five or six pieces of information about a phone, you could probably do it pretty well without NLG software: just create some kind of template and make sure all of your data makes sense when inserted. As soon as you go beyond mobile phones, however, to describe say cars, you would need to do all this work again for the new domain.
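For instance, such a template is plain string interpolation (field values borrowed from the earlier example):

    public class PhoneTemplate {
        public static void main(String[] args) {
            // Domain-specific template; every slot must read naturally once filled.
            String description = String.format(
                "The %s runs %s, has %s of RAM and is powered by a %s battery. "
                + "It costs %s retail for the %s model.",
                "iPhone 7", "iOS11", "2GB", "1960 mA·h Li-ion", "$649", "32GB");
            System.out.println(description);
        }
    }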
It would be worthwhile to look at the blog of Ehud Reiter (the initial author of SimpleNLG). There are also papers such as Albert Gatt's 'Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation'. The latter is a bit dense if you are only dabbling in a little programming, but it does give an account of what NLG is, what it can do, and what its current limitations are.

Store and search sets (with many possible values) in a database (from Java)

The problem is how to store (and search) a set of items a user likes and dislikes. Although each user may have 2-100 items in their set, the possible values for the items number in the tens of thousands (and the list is expanding).
Associated with each item is a value say from 10 (like) to 0 (neutral) to -10 (dislike).
So given a user with a particular set, how to find users with similar sets (say a percentage overlap on the intersection)? Ideally the set of matches could be reduced via a filter that includes only items with like/dislike values within a certain percentage.
I don't see how to use a key/value or column store for this, and walking a relational table of items for each user would seem to consume too many resources. Making the sets into documents would seem to lose clarity.
The web app is in Java. I've searched ORMS, NoSQL, ElasticSearch and related tools and databases. Any suggestions?
OK, it seems the actual storage isn't the problem; rather, you want to build a suggestion system based on the likes/dislikes.
You can store the data however you want: most SQL RDBMSs will be good enough as a data store, and you can of course use anything else. The point is that no plain SQL solution (that I know of) will give you good results with this. What you are looking for is a recommender system, and the best-known option for distributed systems, with many algorithms already implemented, is Apache Mahout.
From what I've learned about it so far, it can do what you need essentially out of the box. I know it is based on Hadoop and YARN, but I'm not sure whether you can import data from arbitrary sources or need to have it in HDFS.
The other option would be to implement a machine-learning algorithm on your own, which would run only on one machine, but you simply won't get the results you want with a plain query in any SQL system.
The reason you need machine-learning algorithms, and a query with a few percentage thresholds won't be enough in most cases, is the diversity of users you are facing. Suppose user B liked/disliked everything he has in common with user A in exactly the same way, but the overlap is only 15%. On the other hand, user C is pretty similar to A (not at 100%, but the directions mostly agree) and has rated over 90% of the things A rated. In this scenario C is much closer to A than B is, even though B has 100% agreement. There are many other scenarios where simple percentages won't be enough, and that's why many companies with suggestion systems (Amazon, Netflix, Spotify, ...) use Apache Mahout and similar systems.
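That said, before reaching for Mahout, a baseline is easy to hand-roll for comparison. Below is a hedged sketch (not Mahout code) of cosine similarity over sparse item-to-rating maps, which weights both the direction of ratings and the overlap; the items and ratings are made up:

    import java.util.HashMap;
    import java.util.Map;

    // Cosine similarity over sparse item->rating maps, ratings in [-10, 10].
    public class UserSimilarity {

        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                Integer other = b.get(e.getKey());
                if (other != null) dot += e.getValue() * (double) other;
                normA += e.getValue() * (double) e.getValue();
            }
            for (int v : b.values()) normB += v * (double) v;
            return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            Map<String, Integer> userA = new HashMap<>();
            userA.put("cycling", 9);
            userA.put("meat", -4);

            Map<String, Integer> userC = new HashMap<>();
            userC.put("cycling", 7);
            userC.put("meat", -2);
            userC.put("opera", 10);

            System.out.printf("similarity = %.3f%n", cosine(userA, userC));
        }
    }

Note this baseline still suffers exactly the coverage-versus-agreement trap described above, which is why the dedicated libraries exist.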

NSL KDD Features from Raw Live Packets?

I want to extract raw data using pcap and WinPcap. Since I will be testing it against a neural network trained with the NSL-KDD dataset, I want to know how to get those 41 attributes from raw data. Or, if that is not possible, is it possible to obtain features like src_bytes, dst_host_same_srv_rate, diff_srv_rate, count, dst_host_serror_rate and wrong_fragment from raw live packets captured with pcap?
If someone would like to experiment with KDD '99 features despite the bad reputation of the dataset, I created a tool named kdd99extractor to extract a subset of the KDD features from live traffic or a .pcap file.
This tool was created as part of a university project. I haven't found detailed documentation of the KDD '99 features, so the resulting values may be a bit different from the original KDD. Some sources used are mentioned in the README. The implementation is also not complete; for example, the content features dealing with payload are not implemented.
It is available in my github repository.
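If you would rather compute such features yourself, here is a hedged sketch (independent of kdd99extractor) of two KDD-style features, assuming a capture library such as pcap4j has already parsed packets into simple records; the Packet record is hypothetical:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Hedged sketch of two KDD-style features computed from parsed packets.
    public class KddStyleFeatures {
        record Packet(long timestampMillis, String srcHost, String dstHost, int payloadBytes) {}

        private final Deque<Packet> window = new ArrayDeque<>();
        private final Map<String, Long> srcBytes = new HashMap<>();

        // "count"-style feature: packets to the same destination host seen
        // within the last two seconds (KDD uses a two-second window).
        int countSameDstHost(Packet current) {
            window.addLast(current);
            while (current.timestampMillis()
                    - window.peekFirst().timestampMillis() > 2000) {
                window.removeFirst();
            }
            int count = 0;
            for (Packet p : window) {
                if (p.dstHost().equals(current.dstHost())) count++;
            }
            return count;
        }

        // "src_bytes"-style feature: running byte total per source host.
        long addSrcBytes(Packet current) {
            return srcBytes.merge(current.srcHost(), (long) current.payloadBytes(), Long::sum);
        }
    }

The content features (and anything payload-based) are a different matter, as noted above, and the exact KDD definitions are poorly documented.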
The 1999 KDD Cup Data is flawed and should not be used anymore
Even this "cleaned up" version (NSL KDD) is not realistic.
Furthermore, many of the "cleanups" they did are not sensible. Real data has duplicates, and the frequencies of such records are important. By removing duplicates, you bias your data towards the rarer observations. You must not do this blindly "just because", or even worse, just to reduce the data set size.
The biggest issue however remains:
KDD99 is not realistic in any way
It wasn't realistic even in 1999, but the internet has changed a lot since back then.
It's not reasonable to use this data set for machine learning. The attacks in it are best detected by simple packet-inspection firewall rules. The attacks are well understood, and appropriate detectors - highly efficient, with 100% detection rate and 0% false positives - are available in many cases on modern routers. Such detectors are so omnipresent that these attacks have virtually ceased to exist since 1998 or so.
If you want real attacks, look for SQL injections and similar. But these won't show up usefully in pcap files, and the way the KDDCup'99 features were extracted from raw traffic is largely undocumented anyway.
Stop using this data set.
Seriously, it's useless data. Labeled, large, often used, but useless.
It seems that I am late to reply, but as other people have already answered, the KDD99 data set is outdated.
I don't know about the usefulness of the NSL-KDD dataset. However, there are a couple of things:
When getting information from network traffic, the best you can do is to extract statistical information (content-based information is usually encrypted). What you can do is create your own data set describing the behaviors you want to consider "normal", then train the neural network to detect deviations from that "normal" behavior.
Be careful: even the definition of "normal" behavior changes from network to network and from time to time.
You can have a look at this work, in which I was involved; besides taking the statistical features of the original KDD, it takes additional features from a real network environment.
The software is available on request and is free for academic purposes. Here are two links to publications:
http://link.springer.com/chapter/10.1007/978-94-007-6818-5_30
http://www.iaeng.org/publication/WCECS2012/WCECS2012_pp30-35.pdf
Thanks!

Sort a list with SQL or as a collection?

I have some entries with dates in my database. Which is better?
Fetch them with an SQL statement that also applies ORDER BY.
Get the list with SQL, and order it within the application with Collections.sort or similar?
Thanks
This is a very broad question that is very difficult to answer, and it depends a lot on what you mean by "best".
From a performance perspective, you will simply have to measure to determine what part of your system is the bottleneck. Databases are usually very efficient, but it could still be relevant to off-load that work to the client.
From a separation of concern perspective, it depends on how the sorting matters in the application and how the application is layered.
Ask yourself: "where does the knowledge that the data is sorted belong?" and "what would happen if I were to change from relational database storage to something different?"
To some extent, it depends on how many values are in the complete collection. If it is, say, 20-30 values then you can sort anywhere — even a relatively poor sorting algorithm can do that quickly (avoid Stooge Sort though; that's terrible) — as that is the sort of size of data chunk which you might expect to actually fetch in one service response.
But once you get into larger datasets you need to plan much more carefully. In particular, you want to avoid moving data around if you don't have to. If the data is currently only present in the database, you really don't want to fetch it all into the client just to sort it (a relatively expensive operation) and then throw virtually all of it away. It's far better to actually keep the data sorted in the database to start with, so that picking it up in order is trivial; in relational database terms, keeping the data sorted is functionally identical to maintaining an index on the data. Indeed, you can have multiple indices on the data, which can make even rather complex queries quick. (NoSQL DBs are more varied; some even don't support the concept of keeping data sorted.) The downside of maintaining indices is that they take up more space and they take time to maintain, particularly when the data is being created in the first place.
So… to return to your question, you probably want to try to not sort the data in the application: for most data, an appropriate index can be much more efficient as it lets your code not even look at unwanted data. But if you have to fetch it all into your application for some other reason and you can't bring it in pre-sorted, there's no reason to avoid sorting it yourself: Java's sorting algorithms are efficient and stable. But you should measure whether fetching it from the DB in the new order is faster. (The question is whether the DB overheads exceed the super-linear costs of re-sorting; lots of problems are in the domain where “maybe; hard to tell” is the answer.)
The other thing to balance is whether it is simpler for your code to not do sorting itself and instead always delegate that to the DB. Keeping your code simpler (and more bug-free) is a good goal to have…
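As a concrete sketch of the two alternatives (assuming a hypothetical entries table with a created date column, and plain JDBC):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class SortExample {
        record Entry(long id, LocalDate created) {}

        // Alternative 1: let the database order the rows (can use an index).
        static List<Entry> fetchSorted(Connection conn) throws SQLException {
            List<Entry> result = new ArrayList<>();
            try (PreparedStatement ps =
                     conn.prepareStatement("SELECT id, created FROM entries ORDER BY created");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(new Entry(rs.getLong("id"), rs.getDate("created").toLocalDate()));
                }
            }
            return result;
        }

        // Alternative 2: fetch unordered, sort in the application.
        static void sortInApp(List<Entry> entries) {
            entries.sort(Comparator.comparing(Entry::created)); // stable, O(n log n)
        }
    }

Either method is a one-liner at the point of use; the interesting differences are the ones discussed above, not the code.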
Database management systems (DBMS) are optimized for these tasks, so I think you should stick with them. Especially if you are accessing the database from a script written in PHP (or another scripting language), it might be slower to perform the task in the script. You might also hit the memory limit allowed to PHP if you sort the array in a script.
I don't mean to start a debate about the performance of different programming languages, just to point out that it is very good practice to rely on the DBMS whenever you can.
This is a very interesting question to me, and I want to present the other side of the accepted answer, which BTW is a very good answer with which I don't necessarily *dis*agree. I just want to present the other side.
When I started my career, I was working on mainframe DB2, and the old-timers who taught me were VERY INSISTENT that sorting be done OUTSIDE of the DB. Their rationale was that sorting is work that CAN be offloaded, leaving the DB free to service other requests.
Of course, it's far more nuanced than this. In general, I'd say the factors you're weighing are:
A) How busy, or how central to your system, is your database? If your DB is very busy, you have a lot of OLTP processing on clients or app servers, and those client or application servers have lots of excess capacity, why not sort on the app server or client? Even if it's less efficient, it spreads the work through the system and gets you more throughput from a whole-systems perspective.
B) How big is the sort? It would be silly to, say, blow your call stack or Java heap because you sorted a gazillion MB of data.
C) Will sorting in your app or app server cause pauses, latency, etc? In other words, if your particular programming language has REALLY bad sorting libraries, and you don't want to write your own, maybe letting the DB take 0.5 seconds is better than making your application take 5.0 seconds.
So, as with all things, "it depends" ;-). But, I think these are the things upon which it depends.
