I am using the OpenNLP Token Name Finder to parse unstructured data. I have created a corpus (training set) of 4 million records, but building a model from this corpus with the OpenNLP API in Eclipse takes around 3 hours, which is very time-consuming. The model is built with the default parameters, i.e. 100 iterations and a cutoff of 5.
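Roughly, the training call looks like this (just a sketch: the file name and entity type are placeholders, and the exact train(...) overload depends on the OpenNLP version):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNameFinder {
    public static void main(String[] args) throws Exception {
        // Corpus in the OpenNLP name-sample format, one sentence per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("corpus.train")), StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Default parameters: 100 iterations, cutoff 5.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");

        TokenNameFinderModel model = NameFinderME.train(
                "en", "default", samples, params, new TokenNameFinderFactory());

        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("ner-model.bin"))) {
            model.serialize(out);
        }
    }
}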
So my question is: how can I speed up this process and reduce the time it takes to build the model?
The size of the corpus could be the reason, but I just wanted to know if someone has come across this kind of problem and, if so, how they solved it.
Any clues would be appreciated.
Thanks in advance!
Usually the first approach to such issues is to split the training data into several chunks and have each one produce a model of its own, then merge the models. I am not sure this is valid here (I'm not an OpenNLP expert), so there's another suggestion below. Also, since the OpenNLP API seems to provide only single-threaded train() methods, I would file an issue requesting a multi-threaded option.
For a slow single-threaded operation, the two main limiting factors are IO and CPU, and both can be addressed separately:
IO - which hard drive do you use? Regular (magnetic) or SSD? Moving to an SSD should help.
CPU - which CPU are you using? Moving to a faster CPU will help. Don't pay attention to the number of cores, as here you want raw single-core speed.
An option you may want to consider is to get a high-CPU server from Amazon Web Services or Google Compute Engine and run the training there - you can download the model afterwards. Both offer high-CPU servers with Xeon (Sandy Bridge or Ivy Bridge) CPUs and local SSD storage.
I think you should make algorithm-related changes before upgrading the hardware.
Reducing the sentence size
Make sure you don't have unnecessarily long sentences in the training sample. Such sentences don't improve performance but have a huge impact on computation (I'm not sure of the exact order). I generally put a cutoff at 200 words/sentence. Also look at the features closely; these are the default feature generators:
two kinds of WindowFeatureGenerator with a default window size of only two
OutcomePriorFeatureGenerator
PreviousMapFeatureGenerator
BigramNameFeatureGenerator
SentenceFeatureGenerator
These feature generators generate the following features for the word Robert in the given sentence.
Sentence: Robert, creeley authored many books such as Life and Death, Echoes and Windows.
Features:
w=robert
n1w=creeley
n2w=authored
wc=ic
w&c=robert,ic
n1wc=lc
n1w&c=creeley,lc
n2wc=lc
n2w&c=authored,lc
def
pd=null
w,nw=Robert,creeley
wc,nc=ic,lc
S=begin
ic is Initial Capital, lc is lower case
Of these features, S=begin is the only sentence-dependent feature; it marks that Robert occurred at the start of the sentence.
My point is to explain the role of a complete sentence in training. You can actually drop the SentenceFeatureGenerator and reduce the sentence size further, to accommodate only a few words within the window around the desired entity. This will work just as well.
I am sure this will have a huge impact on complexity and very little on performance.
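For illustration, a sketch of a reduced feature-generator set: the same generators as the defaults listed above, minus the SentenceFeatureGenerator (how you plug it into training, either directly into an older NameFinderME.train(...) overload or via a TokenNameFinderFactory, depends on your OpenNLP version):

import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
import opennlp.tools.util.featuregen.BigramNameFeatureGenerator;
import opennlp.tools.util.featuregen.CachedFeatureGenerator;
import opennlp.tools.util.featuregen.OutcomePriorFeatureGenerator;
import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
import opennlp.tools.util.featuregen.TokenFeatureGenerator;
import opennlp.tools.util.featuregen.WindowFeatureGenerator;

// Window of two tokens on each side of the current token, as in the defaults,
// but no sentence-level features.
AdaptiveFeatureGenerator reducedGenerator = new CachedFeatureGenerator(
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator());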
Have you considered sampling?
As described above, the features are a very sparse representation of the context. You may have many sentences that look like duplicates to the feature generators. Try to detect these and sample so as to represent sentences with diverse patterns, i.e. it should be impossible to write only a few regular expressions that match them all. In my experience, training samples with diverse patterns did better than those representing only a few patterns, even though the former had far fewer sentences. Sampling this way should not affect model performance at all.
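A rough, library-agnostic sketch of what I mean by sampling for diversity (the "pattern" here is just the capitalization shape of each token; make it as coarse or as fine as you like):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Bucket sentences by a crude shape signature and keep only a few sentences
// per bucket, so the training sample stays diverse but shrinks considerably.
public class DiversitySampler {

    static String shapeSignature(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            sb.append(!t.isEmpty() && Character.isUpperCase(t.charAt(0)) ? "ic " : "lc ");
        }
        return sb.toString();
    }

    static List<String[]> sample(List<String[]> sentences, int maxPerPattern) {
        Map<String, List<String[]>> buckets = new LinkedHashMap<>();
        for (String[] sentence : sentences) {
            buckets.computeIfAbsent(shapeSignature(sentence), k -> new ArrayList<>()).add(sentence);
        }
        List<String[]> kept = new ArrayList<>();
        for (List<String[]> bucket : buckets.values()) {
            kept.addAll(bucket.subList(0, Math.min(maxPerPattern, bucket.size())));
        }
        return kept;
    }
}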
Thank you.
Is there any pretrained vector set for a particular topic only? For example "java": I want vectors related to Java in a file. I mean, if I give input inheritance then cosine similarity should show me polymorphism and other related stuff only!
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as the corpus, but I am still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of Java source code you can use code2vec, which provides pre-trained word-embedding models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so the vector for 'Java' may be dominated by stories of the Java island of Indonesia as much as the programming language. And many other vector-sets are trained on Wikipedia text, which will be dominated by usages in that particular reference-style of writing. But there could be other sets that better emphasize the word-senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm, which is also available in the gensim Python library as the Phrases class. But the exact parameters Google used have not, as far as I know, been revealed. And good results from this algorithm can require a lot of data and tuning, and may still produce some combinations a person would consider nonsense while missing others a person would consider natural.)
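For the simplest route, option (1) above, a hand-crafted glossary merge could be as plain as the following sketch (the glossary entries and class name are made up purely for illustration):

import java.util.LinkedHashMap;
import java.util.Map;

// Replace known multi-word phrases with single tokens before training
// word-vectors, so e.g. "input inheritance" becomes one token.
public class PhraseGlossary {

    private static final Map<String, String> GLOSSARY = new LinkedHashMap<>();
    static {
        GLOSSARY.put("input inheritance", "input_inheritance");
        GLOSSARY.put("virtual method", "virtual_method");
    }

    public static String mergePhrases(String text) {
        String result = text.toLowerCase();
        for (Map.Entry<String, String> entry : GLOSSARY.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mergePhrases("Input inheritance differs from virtual method dispatch."));
        // -> input_inheritance differs from virtual_method dispatch.
    }
}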
I'm trying to generate product descriptions with the help of NLG. For example, if I specify the properties of a product (say a mobile phone) such as its OS, RAM, processor, display, battery, etc., it should output a readable description of the mobile phone.
I see there are some paid services (Quill, Wordsmith, etc.) which do the same.
Then I came across the open-source Java API for NLG, simplenlg. I see how to create sentences by specifying the sentence phrases and features (such as tense, interrogation, etc.), but I don't see an option to create a description from texts.
Does anyone know how to create a text description from words with simplenlg?
Are there any other tools/frameworks/APIs available to accomplish this task (not limited to Java)?
SimpleNLG is primarily a Surface Realizer. It requires well-formatted input but can then perform tasks such as changing the tense of the sentence. An explanation of the types of tasks a realizer can perform can be found in the SimpleNLG documentation.
Generating sentences like those you describe would require additional components to handle document planning and microplanning. The exact boundaries between these components are blurred, but broadly speaking you define what you want to say in a document plan, then have the microplanner perform tasks such as referring expression generation (choosing whether to say 'it' rather than 'the mobile phone') and aggregation, which is the merging of sentences. SimpleNLG has some support for aggregation.
It is also worth noting that this 3-stage process is not the only way to perform NLG; it is just a common one.
There is no magic solution I am aware of to take some information from a random domain and generate readable and meaningful text. In your mobile phone example it would be trivial to chain descriptions together and form something like:
The iPhone 7 has iOS11, 2GB RAM, a 1960 mA·h Li-ion battery and a $649 retail cost for the 32GB model.
But this would just be simple string concatenation or interpolation from your data. It does not account for nuance like the question of whether it would be better to say:
The iPhone 7 runs iOS11, has 2GB of RAM and is powered by a 1960 mA·h Li-ion battery. It costs $649 retail for the 32GB model.
In this second example I have adjusted verbs (and therefore noun phrases), used the referring expression 'it', and split our long sentence into two (with some further changes because of the split). Making these changes requires knowledge (and therefore computational rules) of the words and their usage within the domain. It becomes non-trivial very quickly.
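For illustration, here is a minimal SimpleNLG sketch of the first clause of that second wording. The realiser handles the verb agreement, but everything else (which verb to pick, when to say 'it', how to split sentences) is still up to your own planning code:

import simplenlg.framework.NLGFactory;
import simplenlg.lexicon.Lexicon;
import simplenlg.phrasespec.SPhraseSpec;
import simplenlg.realiser.english.Realiser;

Lexicon lexicon = Lexicon.getDefaultLexicon();
NLGFactory nlgFactory = new NLGFactory(lexicon);
Realiser realiser = new Realiser(lexicon);

SPhraseSpec clause = nlgFactory.createClause();
clause.setSubject("the iPhone 7");
clause.setVerb("run");              // realised as "runs" (third person singular, present tense)
clause.setObject("iOS11");

System.out.println(realiser.realiseSentence(clause));
// prints roughly: The iPhone 7 runs iOS11.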
If your requirements are as simple as 5 or 6 pieces of information about a phone, you could probably do it pretty well without NLG software: just create some kind of template and make sure all of your data makes sense when inserted. As soon as you go beyond mobile phones, however, describing say cars, you would need to do all this work again for the new domain.
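In the simple case, that template could be nothing more than string formatting, for example:

// Simple template approach: fine for one rigid domain, but every new domain
// (or wording nuance) means writing and maintaining another template.
String template = "The %s runs %s, has %s of RAM and is powered by a %s battery. "
        + "It costs %s retail for the %s model.";

String description = String.format(template,
        "iPhone 7", "iOS11", "2GB", "1960 mA·h Li-ion", "$649", "32GB");

System.out.println(description);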
It would be worthwhile to look at Ehud Reiter's blog (he is the initial author of SimpleNLG). There are also papers such as Albert Gatt's "Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation"; the latter is a bit dense if you are only dabbling in a little programming, but it does give an account of what NLG is, what it can do, and what its current limitations are.
I have previously trained a German classifier using Stanford NER and a training file with 450,000 tokens. Because I had almost 20 classes, this took about 8 hours and I had to cut a lot of features in the prop file.
I now have a gazette file with 16,000,000 unique tagged tokens. I want to retrain my classifier using those tokens, but I keep running into memory issues. The gazette .txt is 386 MB and mostly contains two-token entries (first + last name), all unique.
I have reduced the number of classes to 5, reduced the number of tokens in the gazette by 4 million, and removed all the features listed on the Stanford NER FAQ site from the prop file, but I still run into the "out of memory: Java heap space" error. I have 16 GB of RAM and start the JVM with -mx15g -Xmx14g.
The error occurs about 5 hours into the process.
My problem is that I don't know how to further reduce the memory usage without arbitrarily deleting entries from the gazette. Does anyone have further suggestions on how I could reduce my memory usage?
My prop-file looks like this:
trainFile = ....tsv
serializeTo = ...ser.gz
map = word=0,answer=1
useWordPairs=false
useNGrams=false
useClassFeature=true
useWord=true
noMidNGrams=true
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
saveFeatureIndexToDisk=true
qnSize=2
printFeatures=true
useObservedSequencesOnly=true
cleanGazette=true
gazette=....txt
Hopefully this isn't too troublesome. Thank you in advance!
RegexNER could help you with this:
http://nlp.stanford.edu/static/software/regexner/
Some thoughts:
Start with 1,000,000 entries and see how big a gazetteer you can handle; if 1,000,000 is still too large, shrink it down more.
Sort the entries by how frequently they occur in a large corpus and eliminate the infrequent ones.
Hopefully a lot of the rarer entries in your gazetteer aren't ambiguous, so you can just use RegexNER and have a rule-based layer in your system that automatically tags them as PERSON; a sketch of that layer follows below.
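A sketch of that rule-based layer using the CoreNLP pipeline's regexner annotator (the mapping file name is a placeholder; each mapping line is "entry<TAB>class", and for German text you would also point the pipeline at the German models):

import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// gazette-person.tab would contain lines such as:
//   Hans Müller<TAB>PERSON
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.setProperty("regexner.mapping", "gazette-person.tab");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("Hans Müller traf sich mit Anna Schmidt in Berlin.");
pipeline.annotate(document);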
Here's an update on what I've been doing:
First I tried to train the classifier using all the available data on our university's server, which has 128 GB of RAM. But since progress was incredibly slow (~120 iterations of optimization after 5 days), I decided to filter the gazetteer.
I checked the German Wikipedia for all n-grams in my gazetteer and only kept those that occurred more than once. This reduced the number of PER entries from ~12 million to 260k.
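Roughly, that filtering step looked like the following sketch (file names and formats here are placeholders; the n-gram counts were computed beforehand from the Wikipedia dump):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilterGazette {
    public static void main(String[] args) throws IOException {
        // Precomputed "ngram<TAB>count" pairs from the German Wikipedia.
        Map<String, Long> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("dewiki-ngram-counts.tsv"))) {
            String[] parts = line.split("\t");
            counts.put(parts[0], Long.parseLong(parts[1]));
        }

        // Stanford gazette lines look like "PERSON Hans Müller": the class, then the entry.
        List<String> kept = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("gazette-per.txt"))) {
            String entry = line.substring(line.indexOf(' ') + 1);
            if (counts.getOrDefault(entry, 0L) > 1) {
                kept.add(line);
            }
        }
        Files.write(Paths.get("gazette-per.filtered.txt"), kept);
    }
}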
I only did this for my PER list at first and retrained my classifier. This resulted in an F-value increase of 3 points (from ~70.5% to 73.5%). By now I have filtered the ORG and LOC lists as well, but I am uncertain whether I should use them.
The ORG list contains a lot of acronyms. Those are all written in capital letters, but I don't know whether the training process takes capitalization into account; if it didn't, this would lead to tons of unwanted ambiguity between the acronyms and actual words in German.
I also noticed that whenever I used either the unfiltered ORG or the unfiltered LOC list, the F-value of that one class might have risen a bit, but the F-values of the other classes went down somewhat significantly. This is why I am only using the PER list for now.
This is my progress so far. Thanks again to everyone who helped.
I need to put together a data structure that will efficiently provide keyword search facilities.
My metrics are:
Circa 500,000 products.
Circa 20+ keywords per product (a guess).
Products are identified by an ID of about 10 digits, but this may be any ASCII string going forward.
I would like to try to fit the data structure in memory if possible. I will be on a server so I can assume some significant memory availability.
Speed is important. Using LIKE database queries will not be an acceptable solution.
Any ideas for a data structure?
My thoughts:
TrieMap
Very efficient for the keywords, but there would need to be a list of product IDs hanging off each leaf, so it would be seriously memory-hungry. Any ideas that could help with that?
Compression
Various compression schemes come to mind, but none jumps out as offering significant value.
Has anyone else put something like this together? Could you share your experiences?
The data may change but not often. It would be reasonable to rebuild the structure on a daily basis to accommodate changes.
Have you thought about using Lucene, either in memory or as a file-system index?
It is quite fast and has lots of room for further requirements that might arise in the future.
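A minimal in-memory sketch (Lucene 8.x style; the directory class, analyzer and field names will vary with your Lucene version and data):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ProductKeywordIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // in-memory index
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // One document per product: a stored ID plus an analysed keywords field.
            Document doc = new Document();
            doc.add(new StringField("productId", "0123456789", Field.Store.YES));
            doc.add(new TextField("keywords", "waterproof hiking boot leather", Field.Store.NO));
            writer.addDocument(doc);
        }

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs hits = searcher.search(new TermQuery(new Term("keywords", "hiking")), 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("productId"));
        }
    }
}

Rebuilding such an index daily for ~500,000 products with ~20 keywords each should be cheap.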
What are some generic methods for optimizing a Java program in terms of speed? I am using a DOM parser to parse an XML file, storing certain words in an ArrayList and removing any duplicates, then spell-checking those words by creating a Google search URL for each word, fetching the HTML document, locating the corrected word, and saving it to another ArrayList.
Any help would be appreciated! Thanks.
Why do you need to improve performance? From your explanation, it is pretty obvious that the big bottleneck here (or performance hit) is going to be the IO resulting from the fact that you are accessing a URL.
This will surely dwarf by orders of magnitude any minor improvements you make in data structures or XML frameworks.
It is a good general rule of thumb that your big performance problems will involve IO. Humorously enough, I am at this very moment waiting for a database query to return in a batch process. It has been running for almost an hour. But I welcome any suggested improvements to my XML parsing library nevertheless!
Here are my general methods:
Does your program perform any obviously expensive task from the perspective of latency (IO)? Do you have enough logging to see that this is where the delay is (if significant)?
Is your program prone to lock contention (i.e. can it wait around, doing nothing, waiting for some resource to be "free")? Perhaps you are locking an entire Map whilst you make an expensive calculation for a value to store, blocking other threads from accessing the map.
Is there some obvious algorithm (perhaps for data-matching, or sorting) that might have poor characteristics?
Run up a profiler (e.g. jvisualvm, which ships with the JDK itself) and look at your code hotspots. Where is the JVM spending its time?
SAX is faster than DOM. If you don't want to go through the ArrayList searching for duplicates, put everything in a LinkedHashSet -- no duplicates, and you still get the insertion order that ArrayList gives you.
But the real bottleneck is going to be sending the HTTP request to Google, waiting for the response, and then parsing the response. Use a spell-check library instead.
Edit: But take my educated guesses with a grain of salt. Use a code profiler to see what's really slowing down your program.
Generally the best method is to figure out where your bottleneck is, and fix it. You'll usually find that you spend 90% of your time in a small portion of your code, and that's where you want to focus your efforts.
Once you've figured out what's taking a lot of time, focus on improving your algorithms. For example, removing duplicates from an ArrayList can have O(n²) complexity if you use the most obvious algorithm, but that can be reduced to O(n) if you use the right data structure.
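A small illustration of that particular case (not specific to your program):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Removing duplicates via a hash-based set is O(n) on average,
// versus O(n^2) for repeated list.contains() checks in a loop.
List<String> words = Arrays.asList("foo", "bar", "foo", "baz");
List<String> unique = new ArrayList<>(new LinkedHashSet<>(words)); // keeps first-seen order
System.out.println(unique);  // [foo, bar, baz]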
Once you've figured out which portions of your code are taking the most time, and you can't figure out how best to fix it, I'd suggest narrowing down your question and posting another question here on StackOverflow.
Edit
As @oxbow_lakes so snidely put it, not all performance bottlenecks are to be found in the code's big-O characteristics. I certainly had no intention to imply that they were. Since the question was about "general methods" for optimizing, I tried to stick to general ideas rather than talking about this specific program. But here's how you can apply my advice to this specific program:
See where your bottleneck is. There are a number of ways to profile your code, ranging from high-end, expensive profiling software to really hacky approaches. Chances are, any of these methods will indicate that your program spends 99% of its time waiting for a response from Google.
Focus on algorithms. Right now your algorithm is (roughly):
Parse the XML
Create a list of words
For each word
Ping Google for a spell check.
Return results
Since most of your time is spent in the "ping Google" phase, an obvious way to fix this would be to avoid doing that step more times than necessary. For example:
Parse the XML
Create a list of words
Send list of words to spelling service.
Parse results from spelling service.
Return results
Of course, in this case, the biggest speed boost would probably come from using a spell checker that runs on the same machine, but that isn't always an option. For example, TinyMCE runs as a JavaScript program within the browser, and it can't afford to download the entire dictionary as part of the web page. So it packages all the words into a distinct list and performs a single AJAX request to get a list of the words that aren't in the dictionary.
These folks are probably right, but a few random pauses will turn "probably" into "definitely, and here's why".