I'm trying to use Ragel to implement a simple yes/no fsm. Unfortunately the language specification consists of the union of about a thousand regular expressions, with * operators appearing once or more in the majority of them. So, the number of possible states explodes and it seems it will be impossible to use Ragel to generate an fsm for my language. Is there a tool out there that can do what I need, or should I swap approaches? I need something better than checking input strings against each regular expression in turn. I could chop up the thousand regular expressions into chunks of ~50 and generate an fsm for each, and run every input string against all the machines, but if there's a tool that can handle this kind of job without such a hack I'd be pleased to hear of it.
Thanks!
Well, I've ended up breaking the machine into multiple machines to keep Ragel from eating all available memory. In fact, I had to split it across a couple of separate Ragel files, because the generated Java class had too many constants in it from the huge state tables. I'm still interested in hearing of a better solution for this, if anybody has one!
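For what it's worth, the dispatch side of this workaround is trivial. Here's a minimal sketch, assuming each chunk of ~50 expressions has been compiled by Ragel into its own matcher class; the ChunkMatcher interface is a hypothetical wrapper I'm introducing here, not something Ragel generates:

```java
// Hypothetical common interface over the Ragel-generated machines; each
// chunk of ~50 regular expressions compiles to its own class implementing it.
interface ChunkMatcher {
    boolean match(String input);
}

public class UnionMatcher {
    private final ChunkMatcher[] machines;

    // Pass in one instance per generated machine, e.g. new Chunk1Matcher(), ...
    public UnionMatcher(ChunkMatcher... machines) {
        this.machines = machines;
    }

    // The input is in the language iff at least one sub-machine accepts it.
    public boolean matches(String input) {
        for (ChunkMatcher m : machines) {
            if (m.match(input)) {
                return true;
            }
        }
        return false;
    }
}
```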
I am using the OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus (training set) of 4MM records, but as I am creating a model out of this corpus using the OpenNLP APIs in Eclipse, the process takes around 3 hours, which is very time-consuming. The model is built with the default parameters, i.e. 100 iterations and a cutoff of 5.
So my question is: how can I speed this up and reduce the time the process takes to build the model?
The size of the corpus could be the reason, but I wanted to know if someone has come across this kind of problem and, if so, how they solved it.
Any clues would be appreciated.
Thanks in advance!
Usually the first approach to such issues is to split the training data into several chunks and let each one produce a model of its own; afterwards you merge the models. I am not sure this is valid in this case (I'm not an OpenNLP expert), so there's another solution below as well. Also, since the OpenNLP API seems to provide only a single-threaded train() method, I would file an issue requesting a multi-threaded option.
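The chunking step itself is straightforward. A minimal sketch, assuming line-delimited training data (one sentence per line); the file name and chunk count are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SplitCorpus {
    public static void main(String[] args) throws IOException {
        // Assumed input: one training sentence per line.
        List<String> lines = Files.readAllLines(Paths.get("corpus.train"));
        int chunks = 8;  // arbitrary; pick what your merging strategy needs
        int perChunk = (lines.size() + chunks - 1) / chunks;
        for (int c = 0; c < chunks; c++) {
            int from = c * perChunk;
            if (from >= lines.size()) break;
            int to = Math.min(from + perChunk, lines.size());
            Files.write(Paths.get("corpus.part" + c), lines.subList(from, to));
        }
    }
}
```

(For a 4MM-line corpus you may prefer streaming over readAllLines, but the idea is the same.)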
For a slow single-threaded operation the two main limiting factors are I/O and CPU, and each can be addressed separately:
IO - which drive do you use, regular (magnetic) or SSD? Moving to an SSD should help.
CPU - which CPU are you using? Moving to a faster CPU will help. Don't pay attention to the number of cores; since training is single-threaded, what you want is raw speed.
An option you may want to consider is to get a high-CPU server from Amazon Web Services or Google Compute Engine and run the training there; you can download the model afterwards. Both offer high-CPU servers with Xeon (Sandy Bridge or Ivy Bridge) CPUs and local SSD storage.
I think you should make algorithm related changes before upgrading the hardware.
Reducing the sentence size
Make sure you don't have unnecessarily long sentences in the training sample. Such sentences don't improve performance but have a huge impact on computation (I'm not sure of the exact order). I generally put a cutoff at 200 words/sentence. Also look closely at the features; these are the default feature generators:
two kinds of WindowFeatureGenerator with a default window size of only two
OutcomePriorFeatureGenerator
PreviousMapFeatureGenerator
BigramNameFeatureGenerator
SentenceFeatureGenerator
These feature generators produce the following features in the given sentence for the word Robert.
Sentence: Robert, creeley authored many books such as Life and Death, Echoes and Windows.
Features:
w=robert
n1w=creeley
n2w=authored
wc=ic
w&c=robert,ic
n1wc=lc
n1w&c=creeley,lc
n2wc=lc
n2w&c=authored,lc
def
pd=null
w,nw=Robert,creeley
wc,nc=ic,lc
S=begin
ic is Initial Capital, lc is lower case
Of these features, S=begin is the only sentence-dependent feature, which marks that Robert occurred at the start of the sentence.
My point is to explain the role of a complete sentence in training. You can actually drop the SentenceFeatureGenerator and reduce the sentence size further to accommodate only a few words within the window of the desired entity. This will work just as well; see the sketch below.
I am sure this will have a huge impact on complexity and very little on performance.
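For illustration, here is roughly what that looks like in code: the default aggregate from the list above, minus the SentenceFeatureGenerator. Treat it as a sketch; the exact feature-generator constructors and the NameFinderME.train(...) overloads vary between OpenNLP versions:

```java
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
import opennlp.tools.util.featuregen.BigramNameFeatureGenerator;
import opennlp.tools.util.featuregen.CachedFeatureGenerator;
import opennlp.tools.util.featuregen.OutcomePriorFeatureGenerator;
import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
import opennlp.tools.util.featuregen.TokenFeatureGenerator;
import opennlp.tools.util.featuregen.WindowFeatureGenerator;

public class Features {
    // The default generators from the list above, without SentenceFeatureGenerator.
    static AdaptiveFeatureGenerator defaultsWithoutSentence() {
        return new CachedFeatureGenerator(
                new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
                new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
                new OutcomePriorFeatureGenerator(),
                new PreviousMapFeatureGenerator(),
                new BigramNameFeatureGenerator());
    }
    // Pass the result to the NameFinderME.train(...) overload that accepts
    // a custom AdaptiveFeatureGenerator.
}
```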
Have you considered sampling?
As I have described above, the features are a very sparse representation of the context. Maybe you have many sentences that are duplicates as seen by the feature generators. Try to detect these and sample in a way that represents sentences with diverse patterns, i.e. it should be impossible to write only a few regular expressions that match them all. In my experience, training samples with diverse patterns did better than those that represent only a few patterns, even though the former had a much smaller number of sentences. Sampling this way should not affect the model performance at all.
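A crude sketch of what such sampling might look like: collapse each sentence to a coarse pattern signature (here just the case class of each token, in the spirit of the wc features above) and keep only a handful of sentences per signature. The signature function is deliberately simplistic; in practice you would derive it from the same feature generators the trainer uses:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SampleByPattern {
    // Collapse a sentence to a coarse pattern: ic/lc/num per token.
    static String signature(String sentence) {
        StringBuilder sig = new StringBuilder();
        for (String tok : sentence.split("\\s+")) {
            if (tok.isEmpty()) continue;
            char c = tok.charAt(0);
            sig.append(Character.isDigit(c) ? "num "
                    : Character.isUpperCase(c) ? "ic " : "lc ");
        }
        return sig.toString();
    }

    // Keep at most maxPerPattern sentences for each distinct signature.
    static List<String> sample(List<String> corpus, int maxPerPattern) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String s : corpus) {
            if (seen.merge(signature(s), 1, Integer::sum) <= maxPerPattern) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```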
Thank you.
I have an input that's about 35KB of text that I need to pull a bunch of small bits of data from. I use multiple regexes to find the data, and that part works fine.
My question: should I split the large text into multiple smaller strings and run the appropriate regexes on each string, or just keep it in one big string and reset the matcher for each regex? Which way is best for efficiency?
If it isn't running too slow then go with whatever you currently have that is working fast enough.
Otherwise, you shouldn't be using raw regexes for this task anyway. As soon as you're talking about "multiple regexes" extracting "small bits of data" from an input, you're really writing a parser and should use a decent parsing tool.
As you are using Java, I would recommend starting with jFlex, which is a mature Java implementation of an extremely mature and stable C tool.
For most tasks jFlex will be all you need, but it also integrates smoothly with a number of java parser-generators should the problem prove to be more complicated. My personal preference is the slightly obscure Beaver.
Of course, if you can implement it as a set of regexes it isn't more complicated and jFlex will do the job for you.
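That said, if you do stay with plain regexes, the "one big string" variant from the question is simple enough. A sketch with placeholder patterns, showing one Matcher from java.util.regex being retargeted and reset for each expression:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Extractor {
    // Placeholder patterns; compile each one once, up front.
    private static final Pattern[] PATTERNS = {
            Pattern.compile("id=(\\d+)"),
            Pattern.compile("name=(\\w+)"),
    };

    public static void extract(String bigText) {
        Matcher m = PATTERNS[0].matcher(bigText);
        for (Pattern p : PATTERNS) {
            m.usePattern(p).reset();  // swap the pattern, rewind to the start
            while (m.find()) {
                System.out.println(p.pattern() + " -> " + m.group(1));
            }
        }
    }
}
```

At 35KB either approach is likely fast enough; precompiling the patterns usually matters more than how the text is sliced.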
I want to make a Java program to help people with basic discrete mathematics (that is to say, checking the truth values of statements). To do this, I need to be able to detect how many variables the user inputs, what operators there are, and what quantifiers there are, if any (∃ and ∀). Is there a good algorithm for being able to do all these things?
Just so you know, I don't just want a result; I want full control over their input so I can show them the logical proof. (So doing something like passing it to JavaScript won't work.)
Okay, so, your question is a bit vague, but I think I understand what you'd like to do: an educational aid that processes first-order logic formulas, showing the user step by step how to work with such formulas, right? I think the idea has merit, and it's perfectly doable, even as a one-man project, but it's not terribly easy, and you'll have to learn a lot of new things -- but they're all very interesting things, so even if nothing at all comes out of it, you'd certainly get yourself some valuable knowledge.
I'd suggest you start small. I'd begin by building a recursive descent parser to recognize zero-order logic formulas (a machine that decides whether a formula is valid, i.e. it'd accept "A ^ B" but reject "^ A ^"). Next up, you'd have to devise a way to store the formula, and then you'd be able to actually work on it. Then again, start small: a little machine that accepts valid zero-order logic formulas like TRUE AND NOT (TRUE AND FALSE), and successfully reduces it step by step to true, is already something people can learn from, and it's not too hard to write. If you're feeling adventurous, add variables and make equations: A AND TRUE = TRUE -- it's easy to work these out with reductions and truth tables.
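To make the "start small" suggestion concrete, a toy recursive descent evaluator for that zero-order fragment might look like this (tokens are whitespace-separated here for simplicity, so parentheses need surrounding spaces):

```java
// Grammar:
//   expr   := term ("OR" term)*
//   term   := factor ("AND" factor)*
//   factor := "NOT" factor | "(" expr ")" | "TRUE" | "FALSE"
public class ZeroOrderEval {
    private final String[] toks;
    private int pos = 0;

    ZeroOrderEval(String input) {
        toks = input.trim().split("\\s+");
    }

    private boolean eat(String t) {
        if (pos < toks.length && toks[pos].equals(t)) { pos++; return true; }
        return false;
    }

    boolean expr() {
        boolean v = term();
        while (eat("OR")) v |= term();   // |= always evaluates the operand
        return v;
    }

    boolean term() {
        boolean v = factor();
        while (eat("AND")) v &= factor();
        return v;
    }

    boolean factor() {
        if (eat("NOT")) return !factor();
        if (eat("(")) {
            boolean v = expr();
            if (!eat(")")) throw new IllegalArgumentException("missing ')'");
            return v;
        }
        if (eat("TRUE")) return true;
        if (eat("FALSE")) return false;
        throw new IllegalArgumentException("unexpected token at position " + pos);
    }

    public static void main(String[] args) {
        // Reduces to true, just like the example in the text.
        System.out.println(new ZeroOrderEval("TRUE AND NOT ( TRUE AND FALSE )").expr());
    }
}
```

From there, showing the student each reduction step is a matter of recording what expr/term/factor return as the recursion unwinds.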
Things get tricky with quantifiers that bind variables; that's where automated theorem proving may come into play. But then, it all depends on exactly what you'd like to do: implementing transformations into the various normal forms and showing the process step by step to the student would be fairly easy, and rather useful.
At any rate, I think it's a decent personal project, and you could learn a lot from it. If you're in a university, you could even get some credit for it eventually.
The technique I have used is to parse the input string using a context-free grammar. There are many frameworks to help you do this; I have personally used ANTLR in the past to parse an input string into a discrete logic tree. ANTLR allows you to define a CFG which you can map to Java types. This lets you map to a data structure to store and evaluate the truth value of the expression. Of course, you would also be able to pull out the variables contained in the data structure.
I have a quite general question about java and regular expression.
If we look at embedded use, say mobile phones with J2ME or Android,
how common is it that regex support is included, and how resource-hungry is it?
I mean, regular expressions are a powerful beast, and a lot of magic is done in the background to make them work. My question is whether there is maybe too much magic.
Or is it safe to use with care (like most things)?
Update:
Thanks DigitalRoss for pointing out that java.util.regex is a part of android.
Regex is a programming language -- it's a way of defining a finite state machine, and there's really no upper limit to the complexity of that FSM, beyond your own sanity.
It's not "magic" - you can understand how RE matching works behind the scenes, and once you do that, you'll be in control of how resource-hungry your REs are.
Simple REs are very cheap, but it's possible to write expensive REs that have to look ahead and do lots of backtracking.
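A concrete example of such an expensive RE, using java.util.regex (String.repeat needs Java 11+): nested quantifiers over ambiguous input force the engine to retry exponentially many splits when the overall match is doomed to fail.

```java
import java.util.regex.Pattern;

public class Backtracking {
    public static void main(String[] args) {
        String input = "a".repeat(28) + "!";  // no 'b', so the match must fail
        long t0 = System.nanoTime();
        // (a+)+b is the textbook pathological pattern: each extra 'a'
        // roughly doubles the work before the engine gives up.
        boolean ok = Pattern.matches("(a+)+b", input);
        System.out.printf("%b after %d ms%n", ok, (System.nanoTime() - t0) / 1_000_000);
        // The equivalent pattern a+b matches the same strings instantly.
    }
}
```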
I thoroughly recommend Jeff Friedl's "Mastering Regular Expressions". It's not just for Perl, and you don't have to grind through the whole thing just to lose the idea that REs are magic and learn that they form a programming language you can optimise (or, indeed, write poorly performing code in).
Something about regex solutions bothers me too; probably too many code-golf solutions mapped to a barely-works-on-the-example-case regex.
But they rule, and I love them in vi(1).
java.util.regex is certainly a part of android.
Regular expressions use memory dynamically. Most don't use very much, but here is an expression that can potentially use a lot. Apparently it first came from Perl but is mostly floating around these days set up as a Ruby test:
ruby -wle 'puts "Prime" unless ("1" * ARGV[0].to_i) =~ /^1$|^(11+?)\1+$/' THENUMBER
So, say something like:
ruby -wle 'puts "Prime" unless ("1" * ARGV[0].to_i) =~ /^1$|^(11+?)\1+$/' 8191
Yes, it's a regular expression that tests primality.
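Since this thread is about Java, here is the same trick via java.util.regex (String.repeat needs Java 11+). Note where the memory goes: the unary encoding makes the input string itself n characters long before matching even starts, and the backreference forces heavy backtracking on top of that.

```java
import java.util.regex.Pattern;

public class RegexPrime {
    public static void main(String[] args) {
        int n = 8191;
        // n is composite iff n ones split into >= 2 equal groups of >= 2 ones,
        // which is exactly what (11+?)\1+ anchored at both ends detects.
        boolean prime = !Pattern.matches("^1$|^(11+?)\\1+$", "1".repeat(n));
        System.out.println(n + (prime ? " is prime" : " is not prime"));
    }
}
```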
Most definitely they are faster than any ad hoc string search/replace solution. I would also think they are included in CLDC 1.1/MIDP 2.0, so (if you find them there) we can conclude that their footprint is negligible, and they are probably implemented in an optimized, built-in way, making them virtually free to use.
I now use .split("\\p{Cntrl}") routinely to store / recover from disk - it seems to be a built-in, no-cost tool.
I think REs can be faster than any ad hoc string search/replace solution you could come up with, so I would go ahead with REs. However, as many have pointed out, a badly built regex -- or using one in situations where it shouldn't be used -- can be bad.
I have a set of strings with numbers embedded in them. They look something like /cal/long/3/4/145:999 or /pa/metrics/CosmicRay/24:4:bgp:EnergyKurtosis. I'd like to have an expression parser that is
Easy to use. Given a few examples someone should be able to form a new expression. I want end users to be able to form new expressions to query this set of strings. Some of the potential users are software engineers, others are testers and some are scientists.
Allows for constraints on numbers. Something like '/cal/long/3/4/143:#>100&<1110' to specify that a string prefixed with '/cal/long/3/4/143:' and followed by a number between (100, 1110) is expected.
Supports '|' and grouping. So the expression '/cal/(long|short)/3/4/' would match '/cal/long/3/4/1:2' as well as '/cal/short/3/4/1:2'.
Has a Java implementation available or would be easy to implement in Java.
Interesting alternative ideas would be useful. I'm also entertaining the idea of just implementing the subset of regular expressions that I need plus the numerical constraints.
Thanks!
There's no reason to reinvent the wheel! The core of a regular expression engine is built on a strong foundation of mathematics and computer science; the reason we continue to use regexes today is that they are principally sound and won't be improved in the foreseeable future.
If you do find or create some alternative parsing language that only covers a subset of what regex can express, you will quickly have a user asking for a concept that can be expressed in regex but that your flavor just plain leaves out. Spend your time solving problems that haven't been solved instead!
I'm inclined to agree with Rex M, although your second requirement for numerical constraints complicates things. Unless you only allowed very basic constraints, I'm not aware of a way to succinctly express that in a regular expression. If there is such a way, please disregard the rest of my answer and follow the other suggestions here. :)
You might want to consider a parser generator - things like the classic lex and yacc. I'm not really familiar with the Java choices, but here's a list:
http://java-source.net/open-source/parser-generators
If you're not familiar, the standard approach would be to first create a lexer that turns your strings into tokens. Then you would pass those tokens onto a parser that applies your grammar to them and spits out some kind of result.
In your case, I envision the parser resulting in a combination of a regular expression and additional conditions. For your numerical constraint example, it might give you the regular expression /cal/long/3/4/143:(\d+) and a constraint to apply to the first grouping (the \d+ portion) requiring that the number lie between 100 and 1110. You'd then apply the RE to your strings to find candidates, and apply the constraint to those candidates to find your matches.
It's a pretty complicated approach, so hopefully there's a simpler way. I hope that gives you some ideas, at least.
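A bare-bones sketch of that two-phase idea (the regex proposes candidates, plain Java checks the numeric constraint), using the numbers from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConstrainedMatch {
    private static final Pattern P = Pattern.compile("/cal/long/3/4/143:(\\d+)");

    // Implements the '/cal/long/3/4/143:#>100&<1110' example by hand.
    static boolean matches(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) return false;
        int n = Integer.parseInt(m.group(1));
        return n > 100 && n < 1110;
    }

    public static void main(String[] args) {
        System.out.println(matches("/cal/long/3/4/143:999"));  // true
        System.out.println(matches("/cal/long/3/4/143:99"));   // false
    }
}
```

A real version would let the parser emit the (pattern, constraint) pairs instead of hard-coding them.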
The Java constraint is a severe one. I would recommend using parsing combinators, but you will have to translate the ideas to Java using classes instead of functions. There are many, many papers available on this topic; one of the easiest to approach is Graham Hutton's Higher-Order Functions for Parsing. Hutton's approach makes it especially easy to decide to succeed or fail based on conditions like the magnitude of a number, as you show in your example.
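To give a flavor of what Hutton's combinators look like translated to Java (records and lambdas, so Java 16+; all names here are illustrative, not from any library): a parser is a function from input to an optional (value, rest) pair, and a filter combinator handles conditions like the magnitude check.

```java
import java.util.Optional;
import java.util.function.Predicate;

interface Parser<T> {
    record Result<T>(T value, String rest) {}

    // Empty result means the parser failed on this input.
    Optional<Result<T>> parse(String input);

    // Succeed only if the parsed value passes an extra condition,
    // e.g. a magnitude bound on a number.
    default Parser<T> filter(Predicate<T> cond) {
        return in -> parse(in).filter(r -> cond.test(r.value()));
    }
}

class Parsers {
    // Parses a leading run of digits into an int.
    static Parser<Integer> number() {
        return in -> {
            int i = 0;
            while (i < in.length() && Character.isDigit(in.charAt(i))) i++;
            if (i == 0) return Optional.empty();
            return Optional.of(new Parser.Result<>(
                    Integer.parseInt(in.substring(0, i)), in.substring(i)));
        };
    }

    public static void main(String[] args) {
        Parser<Integer> bounded = number().filter(n -> n > 100 && n < 1110);
        System.out.println(bounded.parse("145:999"));  // present: value=145, rest=":999"
        System.out.println(bounded.parse("99:1"));     // empty: fails the bound
    }
}
```

Sequencing and alternation combinators follow the same shape, which is how the '|' and prefix-matching parts of the language would be built up.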
Unfortunately, not all programmers (myself included) are as familiar with regex as they ought to be. This often means we end up writing our own string-parsing logic where regex could otherwise have served us well.
This isn't always bad. It's possible in some cases to write a DSL (a class, a cohesive set of methods) that's more elegant and readable and meets the precise needs of your problem domain. The trouble is that it can take dozens of iterations to distill the problem into a DSL that is simple and intuitive, and that trouble is only warranted if the DSL will be used far and wide in the application or by a large community. Don't write an elegant solution to a problem that only appears sporadically.
If you're going to go the parser route, check out GOLD Parsing System. It's often a better option than something like YACC, cleaner looking than pure regexes, and supports Java.
http://goldparser.org/about/how-it-works.htm
Actually, what you described is the Java Pattern/Matcher facility, which just happens to use regex as its language.