NLG - Create text descriptions with simplenlg - java

I'm trying to generate product descriptions with the help of NLG. For example, if I specify the properties of a product (say a mobile phone) such as its OS, RAM, processor, display, battery etc., it should output a readable description of the mobile phone.
I see there are some paid services (Quill, Wordsmith etc.) which do the same.
Then I came across the open source Java API for NLG, simplenlg. I see how to create sentences by specifying the sentence phrases and the features (such as tense, interrogation etc.), but I don't see an option to create a description from texts.
Does anyone know how to create a text description from words with simplenlg?
Are there any other tools/frameworks/APIs available to accomplish this task (not limited to Java)?

SimpleNLG is primarily a surface realizer. It requires well-formatted input but can then perform tasks such as changing the tense of a sentence. An explanation of the types of task which a realizer can perform can be found at the above link.
Generating sentences like those you describe would require additional components to handle document planning and microplanning. The exact boundaries between these components are blurred, but broadly speaking you define what you want to say in a document plan, then have the microplanner perform tasks such as referring expression generation (choosing whether to say 'it' rather than 'the mobile phone') and aggregation, which is the merging of sentences. SimpleNLG has some support for aggregation.
It is also worth noting that this three-stage process is not the only way to perform NLG; it is just a common one.
There is no magic solution I am aware of to take some information from a random domain and generate readable and meaningful text. In your mobile phone example it would be trivial to chain descriptions together and form something like:
The iPhone 7 has iOS11, 2GB RAM, a 1960 mA·h Li-ion battery and a $649 retail cost for the 32GB model.
But this would just be simple string concatenation or interpolation from your data. It does not account for nuance like the question of whether it would be better to say:
The iPhone 7 runs iOS11, has 2GB of RAM and is powered by a 1960 mA·h Li-ion battery. It costs $649 retail for the 32GB model.
In this second example I have adjusted verbs (and therefore noun phrases), used the referring expression of 'it' and split our long sentence in two (with some further changes because of the split). Making these changes requires knowledge (and therefore computational rules) of the words and their usage within the domain. It becomes non-trivial very quickly.
If your requirements are as simple as 5 or 6 pieces of information about a phone, you could probably do it pretty well without NLG software: just create some kind of template and make sure all of your data makes sense when inserted. As soon as you go beyond mobile phones, however, to describing say cars, you would need to do all this work again for the new domain.
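The template approach can be sketched in a few lines. This is only an illustration in Python using the standard library; the template wording and the field names (name, os, ram and so on) are made up for the example and are not part of SimpleNLG or any other NLG tool.

```python
from string import Template

# Hypothetical template and field names, chosen only for this illustration.
PHONE_TEMPLATE = Template(
    "The $name runs $os, has $ram of RAM and is powered by a $battery battery. "
    "It costs $price retail for the $storage model."
)

def describe_phone(phone):
    """Fill the template; raises KeyError if a required field is missing."""
    return PHONE_TEMPLATE.substitute(phone)

iphone7 = {
    "name": "iPhone 7",
    "os": "iOS11",
    "ram": "2GB",
    "battery": "1960 mA·h Li-ion",
    "price": "$649",
    "storage": "32GB",
}

print(describe_phone(iphone7))
```

All of the linguistic decisions (verb choice, the use of 'it', the sentence split) are frozen into the template text, which is why a new domain such as cars needs a new template.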
It would be worthwhile to look at the blog of Ehud Reiter (the initial author of SimpleNLG). There are also papers such as Albert Gatt's "Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation". The latter is a bit dense if you are only dabbling in a little programming, but it does give an account of what NLG is, what it can do and what its current limitations are.

Related

Pre trained vectors, nlp, word2vec, word embedding for particular topic?

Is there any pretrained vector set for a particular topic only? For example "java": I want the vectors related to Java in a file. I mean that if I give input inheritance, then cosine similarity should show me polymorphism and other related stuff only!
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as the corpus, but I am still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of java source code you can use code2vec which provides pre-trained word-embeddings models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so their vector for 'Java' may be dominated by stories of the Indonesian island of Java as much as the programming language. And many other vector-sets are trained on Wikipedia text, which will be dominated by usages in that particular reference-style of writing. But there could be other sets that better emphasize the word-senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But the exact parameters Google used have not, as far as I know, been revealed. And good results from this algorithm can require a lot of data and tuning, and still result in some combinations that a person would consider nonsense, while missing others that a person would consider natural.)
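To make option (2) concrete, here is a minimal sketch of count-based phrase detection using only the Python standard library. It is not the gensim Phrases implementation and not Google's settings; the scoring formula, the `discount` and `threshold` parameters and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

def find_phrases(sentences, discount=1, threshold=0.1):
    """Return word pairs that co-occur more often than their parts suggest.

    Assumed score: (count(a b) - discount) / (count(a) * count(b)).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    phrases = set()
    for (a, b), pair_count in bigrams.items():
        score = (pair_count - discount) / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases

def merge_phrases(tokens, phrases):
    """Join detected pairs into single tokens such as 'input_inheritance'."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy corpus, far too small for real use.
corpus = [
    ["input", "inheritance", "is", "related", "to", "polymorphism"],
    ["java", "supports", "input", "inheritance"],
    ["input", "inheritance", "needs", "care"],
]
phrases = find_phrases(corpus)
print([merge_phrases(tokens, phrases) for tokens in corpus])
```

The merged token lists would then be the training input for the word-vector model, so that the combined token gets its own vector.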

Which way is it better to extract Account number and Balance from a SMS body?

I am planning a task to read all the bank-related SMS from the user's Android mobile inbox and extract their account number and balance from it. I am guessing this could be done in 2 ways:
1. Use regex to extract the data from the SMS body, as described in the linked answer. This certainly has the advantage of giving a generic representation of any bank balance message.
2. Store a template message for every bank in the database and compare it with the SMS that was read to extract the data.
I would like to know which path is more efficient, or whether there is any other way to do it.
The two approaches have different qualities:
Option 1 might lead to many different, complex regular expressions. Just glancing at the answer you linked made my head spin. Meaning: maintaining such a list of regular expressions will not be an easy undertaking from the developer's perspective.
Whereas for option 2, of course you have to keep track of your collection of "templates", but once your infrastructure is in place, the only work required of you is adding new templates or adapting existing ones.
So, from a development-effort point of view I would lean towards option 2, because such "templates" are easier for you to manage. You don't even need much understanding of the Java language in order to deal with such templates. They are just text, containing some defined keywords here and there.
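As a rough illustration of "just text with defined keywords", the sketch below turns such a template into a matcher. It is written in Python for brevity rather than in Java for Android, and the placeholder names ({ACCOUNT}, {BALANCE}), their patterns, the bank name and the sample message are all hypothetical.

```python
import re

# Hypothetical placeholder names and the pattern each one stands for.
FIELD_PATTERNS = {
    "ACCOUNT": r"[Xx*\d]{4,}",         # masked or plain account number
    "BALANCE": r"\d[\d,]*(?:\.\d+)?",  # amount such as 12,345.67
}

def compile_template(template):
    """Turn template text with {PLACEHOLDER} keywords into a compiled regex."""
    # Splitting on a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", template)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))  # literal text from the template
        else:
            pieces.append("(?P<%s>%s)" % (part, FIELD_PATTERNS[part]))
    return re.compile("".join(pieces))

# One template per bank; the bank name and wording are made up.
TEMPLATES = {
    "examplebank": compile_template(
        "Your a/c {ACCOUNT} has an available balance of Rs. {BALANCE} as of today."
    ),
}

def extract(sms, templates=TEMPLATES):
    """Return the bank and extracted fields, or None if no template matches."""
    text = sms.strip()
    for bank, pattern in templates.items():
        match = pattern.fullmatch(text)
        if match:
            return {"bank": bank, **match.groupdict()}
    return None

sample = "Your a/c XXXX1234 has an available balance of Rs. 12,345.67 as of today."
print(extract(sample))
```

Whoever maintains the templates only edits the plain template strings; the regular expression is generated from them.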
One could even think about letting your users define templates themselves! They know what the SMS from their bank looks like, so you could think about some "import" mechanism where your app pulls the SMS text, and then the user tells the app (once) where the relevant parts can be found in there!
Regarding runtime efficiency: I wouldn't rely on people making guesses here. Instead, run experiments with real-world data, and see if matching SMS text against a larger set of complex regular expressions is cheaper or more expensive than matching it against much simpler "templates".
Storing a template for each bank costs more memory (if you load them at start-up for efficiency) and file system storage. Also, as you stated, there is the downside of having to know each bank's template in advance and set up the user's application accordingly.
Using regex costs neither file system storage nor extra memory; however, it could create false positives for something which looks like a bank message but is not. On the other hand, it has the advantage that you do not need to know all the banks out there in order to do it properly.
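The trade-off can be seen in a small sketch of the generic regex approach. The two patterns below are simplified assumptions written for this example (in Python, not Java), not the expressions from the linked answer, and the sample messages are invented.

```python
import re

# Simplified, hypothetical patterns: a keyword followed shortly by a number.
ACCOUNT_RE = re.compile(
    r"\b(?:a/c|acct|account)\b\D{0,15}?([Xx*]*\d{3,})", re.IGNORECASE
)
BALANCE_RE = re.compile(
    r"\bbal(?:ance)?\b\D{0,20}?(\d[\d,]*(?:\.\d{1,2})?)", re.IGNORECASE
)

def generic_extract(sms):
    """Pull an account number and a balance out of any message, if present."""
    account = ACCOUNT_RE.search(sms)
    balance = BALANCE_RE.search(sms)
    return {
        "account": account.group(1) if account else None,
        "balance": balance.group(1) if balance else None,
    }

bank_sms = "Your a/c XXXX1234 has an available balance of Rs. 12,345.67 as of today."
promo_sms = "Your data balance is 500 MB, recharge now"

print(generic_extract(bank_sms))   # works without knowing the bank
print(generic_extract(promo_sms))  # false positive: a data pack, not money
```

No per-bank knowledge is needed, but the second message shows how a non-bank SMS can still yield a "balance".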

How to speed up the model creation process of OpenNLP

I am using the OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus (training set) of 4MM records, but when I create a model out of this corpus using the OpenNLP APIs in Eclipse, the process takes around 3 hrs, which is very time consuming. The model is built with the default parameters, that is, 100 iterations and a cutoff of 5.
So my question is: how can I speed up this process and reduce the time taken to build the model?
Size of the corpus could be the reason for this but just wanted to know if someone came across this kind of problem and if so, then how to solve this.
Please provide some clue.
Thanks in advance!
Usually the first approach to handling such issues is to split the training data into several chunks and let each one create a model of its own. Afterwards you merge the models. I am not sure that this is valid in this case (I'm not an OpenNLP expert), so there's another solution below. Also, as it seems that the OpenNLP API provides only single-threaded train() methods, I would file an issue requesting a multi-threaded option.
For a slow single threaded operation the two main slowing factors are IO and CPU, and both can be handled separately:
IO - which hard drive do you use? Regular (magnetic) or SSD? Moving to an SSD should help.
CPU - which CPU are you using? Moving to a faster CPU will help. Don't pay attention to the number of cores, as here you want raw speed.
An option you may want to consider is to get a high-CPU server from Amazon Web Services or Google Compute Engine and run the training there; you can download the model afterwards. Both give you high-CPU servers utilizing Xeon (Sandy Bridge or Ivy Bridge) CPUs and local SSD storage.
I think you should make algorithm related changes before upgrading the hardware.
Reducing the sentence size
Make sure you don't have unnecessarily long sentences in the training sample. Such sentences don't increase the performance but have a huge impact on computation (I'm not sure of the exact order of complexity). I generally put a cutoff at 200 words/sentence. Also look at the features closely. These are the default feature generators:
two kinds of WindowFeatureGenerator with a default window size of only two
OutcomePriorFeatureGenerator
PreviousMapFeatureGenerator
BigramNameFeatureGenerator
SentenceFeatureGenerator
These feature generators generate the following features in the given sentence for the word Robert.
Sentence: Robert, creeley authored many books such as Life and Death, Echoes and Windows.
Features:
w=robert
n1w=creeley
n2w=authored
wc=ic
w&c=robert,ic
n1wc=lc
n1w&c=creeley,lc
n2wc=lc
n2w&c=authored,lc
def
pd=null
w,nw=Robert,creeley
wc,nc=ic,lc
S=begin
ic is Initial Capital, lc is lower case
Of these features, S=begin is the only sentence-dependent feature; it marks that Robert occurred at the start of the sentence.
My point is to explain the role of a complete sentence in training. You can actually drop the SentenceFeatureGenerator and reduce the sentence size further to accommodate only a few words in the window of the desired entity. This will work just as well.
I am sure this will have a huge impact on complexity and very little on performance.
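One way to shrink the sentences is a preprocessing pass over the training file before it is handed to OpenNLP. The sketch below is Python, not OpenNLP code; it assumes the usual whitespace-tokenized name finder training format with <START:type> ... <END> tags, and the function name and the `window` parameter are hypothetical.

```python
def trim_to_entity_window(line, window=2):
    """Keep each tagged entity plus `window` tokens on either side of it.

    Lines without any entity are returned unchanged (an assumption: they may
    still be useful as negative examples).
    """
    tokens = line.split()
    keep = set()
    inside = False
    for i, token in enumerate(tokens):
        if token.startswith("<START"):
            inside = True
            # The opening tag and the tokens just before it.
            keep.update(range(max(0, i - window), i + 1))
        elif token == "<END>":
            inside = False
            # The closing tag and the tokens just after it.
            keep.update(range(i, min(len(tokens), i + window + 1)))
        elif inside:
            keep.add(i)
    if not keep:
        return line
    return " ".join(tokens[i] for i in sorted(keep))

example = ("<START:person> Robert Creeley <END> authored many books "
           "such as Life and Death , Echoes and Windows .")
print(trim_to_entity_window(example))
```

Note that when a sentence has several entities far apart, the kept fragments are simply joined, so words that were not neighbours become neighbours; whether that matters for a given corpus would need checking.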
Have you considered sampling?
As I have described above, the features are a very sparse representation of the context. Maybe you have many sentences that are duplicates, as seen by the feature generators. Try to detect these and sample in a way that represents sentences with diverse patterns, i.e. it should be impossible to write only a few regular expressions that match them all. In my experience, training samples with diverse patterns did better than those that represent only a few patterns, even though the former had a much smaller number of sentences. Sampling this way should not affect the model performance at all.
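A very rough sketch of such sampling follows, in Python. The "pattern" of a sentence is approximated here by the sequence of token classes (ic, lc and so on), which is only an assumption about what counts as a duplicate; it ignores the word-identity features, so a real signature would probably need to be richer. The function names and `per_pattern` are hypothetical.

```python
from collections import Counter

def token_class(token):
    """Coarse token class, loosely following the ic/lc notation above."""
    if token.startswith("<START") or token == "<END>":
        return token
    if token.isdigit():
        return "num"
    if token[:1].isupper():
        return "ic"
    if token.islower():
        return "lc"
    return "other"

def signature(line):
    """Assumed pattern of a sentence: the sequence of its token classes."""
    return tuple(token_class(token) for token in line.split())

def sample_diverse(lines, per_pattern=1):
    """Keep at most `per_pattern` sentences for each distinct signature."""
    seen = Counter()
    kept = []
    for line in lines:
        sig = signature(line)
        if seen[sig] < per_pattern:
            kept.append(line)
            seen[sig] += 1
    return kept

training_lines = [
    "<START:person> Robert Creeley <END> authored many books",
    "<START:person> John Smith <END> wrote two novels",
    "The book by <START:person> Robert Creeley <END> sold well",
]
print(sample_diverse(training_lines))
```

Raising `per_pattern` keeps more examples of each pattern if one per pattern turns out to be too aggressive.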
Thank you.

NSL KDD Features from Raw Live Packets?

I want to extract raw data using pcap and WinPcap. Since I will be testing it against a neural network trained with the NSL-KDD dataset, I want to know how to get those 41 attributes from raw data. Or, if that is not possible, is it possible to obtain features like src_bytes, dst_host_same_srv_rate, diff_srv_rate, count, dst_host_serror_rate and wrong_fragment from raw live packets captured with pcap?
If someone would like to experiment with KDD '99 features despite the bad reputation of the dataset, I created a tool named kdd99extractor to extract a subset of KDD features from live traffic or a .pcap file.
This tool was created as part of a university project. I haven't found detailed documentation of the KDD '99 features, so the resulting values may be a bit different compared to the original KDD. Some of the sources used are mentioned in the README. Also, the implementation is not complete. For example, the content features dealing with payload are not implemented.
It is available in my github repository.
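To give an idea of what the time-based features involve, here is a small Python sketch for count, same_srv_rate and diff_srv_rate. It is not taken from kdd99extractor. It assumes the commonly cited definitions (connections to the same destination host within the past two seconds), assumes the current connection is included in the window, and takes already-assembled connection records as input; reassembling connections from raw packets is the harder part and is not shown.

```python
from collections import deque

def time_window_features(connections, window=2.0):
    """Compute count, same_srv_rate and diff_srv_rate for each connection.

    `connections` is an iterable of (timestamp_seconds, dst_host, service)
    tuples sorted by timestamp. The current connection is counted as part of
    its own window (an assumption), so count is always at least 1.
    """
    recent = deque()
    features = []
    for ts, host, service in connections:
        recent.append((ts, host, service))
        # Drop connections that are older than the time window.
        while ts - recent[0][0] > window:
            recent.popleft()
        same_host = [conn for conn in recent if conn[1] == host]
        count = len(same_host)
        same_srv = sum(1 for conn in same_host if conn[2] == service)
        features.append({
            "count": count,
            "same_srv_rate": same_srv / count,
            "diff_srv_rate": (count - same_srv) / count,
        })
    return features

# Invented connection records: (timestamp, destination host, service).
records = [
    (0.0, "10.0.0.1", "http"),
    (0.5, "10.0.0.1", "http"),
    (1.0, "10.0.0.1", "ftp"),
    (1.5, "10.0.0.2", "http"),
    (3.2, "10.0.0.1", "http"),
]
for row in time_window_features(records):
    print(row)
```

The dst_host_* features are usually described as similar rates computed over the last 100 connections rather than over a time window, so they would need a separate, count-based window.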
The 1999 KDD Cup Data is flawed and should not be used anymore
Even this "cleaned up" version (NSL KDD) is not realistic.
Furthermore, many of the "cleanups" they did are not sensible. Real data has duplicates, and the frequencies of such records are important. By removing duplicates, you bias your data towards the rarer observations. You must not do this blindly "just because", or even worse: to reduce the data set size.
The biggest issue however remains:
KDD99 is not realistic in any way
It wasn't realistic even in 1999, and the internet has changed a lot since then.
It's not reasonable to use this data set for machine learning. The attacks in it are best detected by simple packet inspection firewall rules. The attacks are well understood, and appropriate detectors - highly efficient, with 100% detection rate and 0% false positives - should be available in many cases on modern routers. They are so omnipresent that these attacks have virtually not existed since 1998 or so.
If you want real attacks, look for SQL injections and similar. But these won't show up in pcap files, yet the largely undocumented way the KDDCup'99 features were extracted from this...
Stop using this data set.
Seriously, it's useless data. Labeled, large, often used, but useless.
It seems that I am late to reply. But, as other people have already answered, the KDD99 data set is outdated.
I don't know about the usefulness of the NSL-KDD dataset. However, there are a couple of things:
When getting information from network traffic, the best you can do is to get statistical information (content-based information is usually encrypted). What you can do is to create your own data set to describe the behaviors you want to consider as "normal". Then, train the neural network to detect deviations from that "normal" behavior.
Keep in mind that even the definition of "normal" behavior changes from network to network and from time to time.
You can have a look at this work, in which I was involved. Besides taking the statistical features of the original KDD, it takes additional features from a real network environment.
The software is available on request and is free for academic purposes! Here are two links to publications:
http://link.springer.com/chapter/10.1007/978-94-007-6818-5_30
http://www.iaeng.org/publication/WCECS2012/WCECS2012_pp30-35.pdf
Thanks!

What practical (and lightweight) techniques are there for semantic/data matching?

I have an application that lets users publish unstructured keywords. Simultaneously, other users can publish items that must be matched to one or more specified keywords. There is no restriction on the keywords either set of users may use, so simply hoping for a collision is likely to mean very few matches, when the reality is users might have used different keywords for the same thing or they are close enough (eg, 'bicycles' and 'cycling', or 'meat' and 'food').
I need this to work on mobile devices (Android), so I'm happy to sacrifice matching accuracy for efficiency and a small footprint. I know about s-match but this relies on a backing dictionary of 15MB, so it isn't ideal.
What other ideas/approaches/frameworks might help with this?
Your example of 'bicycles' and 'cycling' could be addressed by a take on the Levenshtein edit-distance algorithm since the two words are somewhat related. But your example of 'meat' and 'food' would indeed require a sizable backing dictionary, unless of course the concept set or target audience is limited to say, foodies.
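For reference, the edit distance itself needs no dictionary and very little memory. Below is a plain Python sketch of the standard Levenshtein dynamic programme (on Android you would port it to Java or Kotlin); the `similarity` helper and its normalisation by the longer word's length are an assumption about how you might turn the distance into a score.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Hypothetical score in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(similarity("bicycles", "cycling"))
print(similarity("meat", "food"))
```

The first pair gets a non-zero score because the words share letters, while 'meat' and 'food' share none, which is exactly the case where only a dictionary or similar knowledge source can help.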
Have you considered hosting the dictionary as a web service and accessing the data as needed? The drawback of course is that your app would only work while in network coverage.
