Generic Date Parsing Library from unstructured text [duplicate] - java

This question already has answers here:
Natural Language date and time parser for java [closed]
(8 answers)
Closed 6 years ago.
Can somebody suggest a Java library that is capable of parsing date/time calendar events from unstructured data?
Example
Starts 10pm Tonight! Sunday feb 10th => 10/Feb/2013 10pm
tomorrow (feb 10th) => 10/Feb/2013
Sunday Feb 10\r\nwith daily screenings till Feb 16th
and so on
The input comes from users, so it may arrive in any format.
I started by identifying all the possible tokens and using regex matches to parse them.
I wonder if someone can suggest a Java library that might help with the parsing.
I looked through other posts on SO, but they seem to suggest techniques; I wonder if somebody knows of a library.
Thanks

You could take some of the trunk source from Apache OpenNLP (natural language processing) at http://opennlp.apache.org/, or set up a callable RESTful web service by running OpenNLP on your server. The benefit of using OpenNLP out of the box is that its name finder interface gives you entity extractors for dates, times, organizations, locations, and people. You could also build an example file with more typical context for your items of interest, annotated with the appropriate entity types, and train the NLP model against it to get a better hit rate for your context. I have a working example of a C# NLP implementation in the apps section of my portfolio at http://www.augmentedintel.com/apps/csharpnlp/extract-names-from-text.aspx.

UTAH (https://github.com/sonalake/utah-parser) is able to handle generic parsing of unstructured text into maps. Once you've done that you should be able to throw that into a formatter.

Related

How to convert ISO code to localized measure unit in java?

I'm currently trying to convert ISO-code measurement units into full-text labels.
For example, I'll receive a string such as "LTR" and want to convert it to "Liter". The labels are in German, so I'm also looking for a way to do this localized.
Is there a library that already does this? Is there an enum somewhere containing all this information?
Otherwise, I guess I'll just have to create one myself.
Thanks a lot.
JSR 363 deals with units of measurement and has been implemented in UOM. You can browse the javadoc to get an idea of what's in there.
There was a project called the JScience project, but it doesn't seem to have been updated for some time.
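If no library fits, the hand-rolled enum the question mentions could look roughly like the sketch below. The type name MeasureUnit, the selection of codes, and the label texts are illustrative assumptions, not taken from any library; only the "LTR" to "Liter" pair comes from the question.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical enum: unit code -> label per language. Only LTR/"Liter" is
// from the question; the other entries are illustrative.
enum MeasureUnit {
    LTR("Liter", "liter"),
    KGM("Kilogramm", "kilogram"),
    MTR("Meter", "meter");

    private final Map<String, String> labels;

    MeasureUnit(String german, String english) {
        this.labels = Map.of("de", german, "en", english);
    }

    // Returns the label for the locale's language, falling back to English.
    String label(Locale locale) {
        return labels.getOrDefault(locale.getLanguage(), labels.get("en"));
    }

    // Looks up a code such as "LTR"; empty if the code is unknown.
    static Optional<MeasureUnit> fromCode(String code) {
        if (code == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(valueOf(code.trim().toUpperCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

A call such as MeasureUnit.fromCode("LTR").map(u -> u.label(Locale.GERMAN)) would then yield the German label. For a larger set of units, moving the labels into resource bundles is probably easier to maintain than constructor arguments.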

Retrieving dates from Google Custom Search

I am currently developing a Java application based on Google Custom Search API, using their Java libraries.
According to Google's documentation, they associate a date to each indexed Web page:
Page Dates: Google estimates the date for a page based on the URL, title, byline date and other features. This date can be used with the sort operator using the special structured data type date, as in &sort=date.
I want to retrieve the date associated with each result returned for a given request. However, I didn't find anything related to this task in Google's documentation: there are parameters one can use to sort the results by date, or to focus on a certain period of time, but nothing about retrieving the precise dates themselves. I couldn't find any reference to this problem on the Web either.
So, I am turning to SO to ask these questions:
Is it even possible to do that through Google's API? How?
Otherwise, is there a workaround?

How to get the update frequency of websites

I need to build a web service that analyzes SEO. The service will show how often a site is updated. I need to figure out how to get the posted date or update frequency from the HTML of the website.
For example, on http://googletesting.blogspot.com/ I can get the date from the tag <span>Wednesday, June 04, 2014</span>. Other websites don't use the same tags and date format, so I can't use the same code to detect those dates.
Dates can have very different formats in different locales, and month names can be written as text or as numbers. I need to match as many dates as possible. Sometimes a date on the page isn't a posted date at all; it's just text in an article.
My algorithm
I try to get the "posted date" from every post and then calculate the update frequency.
For example: first post on 30 May 2012, second post on 29 May 2012, third post on 28 May 2012.
From that I would conclude that the website is updated daily.
In the end, I want to know if each website updates:
Yearly
Monthly
Weekly
Daily
How do I reliably get this from any website?
Instead of parsing the dates in the page, you could download the home page and store it. Then you could come back every day and download the homepage again to see if it changed. This approach would work even for sites that don't publish any dates on their homepage. It would take longer to get your answer though.
Another approach would be to download the RSS feed for the site if it has one. The example site you gave has an XML feed: http://feeds.feedburner.com/blogspot/RLXA?format=xml RSS feeds are meant to be machine readable and the dates are in a consistent format.
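As a rough sketch of the feed-based approach: once the publication dates have been pulled out of the feed, the gaps between them can be bucketed into the categories from the question. This assumes the dates are RFC 1123 style strings, as is usual for RSS pubDate elements; the class name UpdateFrequency and the day thresholds are arbitrary assumptions, and fetching and XML parsing are left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: classifies a site by the median gap between posts.
class UpdateFrequency {

    static String classify(List<String> pubDates) {
        List<Instant> instants = new ArrayList<>();
        for (String s : pubDates) {
            // e.g. "Wed, 04 Jun 2014 10:00:00 GMT"
            instants.add(OffsetDateTime
                    .parse(s.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant());
        }
        if (instants.size() < 2) {
            return "unknown"; // at least two posts are needed for one gap
        }
        Collections.sort(instants);

        List<Long> gapsInHours = new ArrayList<>();
        for (int i = 1; i < instants.size(); i++) {
            gapsInHours.add(Duration.between(instants.get(i - 1), instants.get(i)).toHours());
        }
        Collections.sort(gapsInHours);
        // Upper median, so one unusually long or short gap does not dominate.
        double medianDays = gapsInHours.get(gapsInHours.size() / 2) / 24.0;

        // Thresholds are assumptions chosen for illustration.
        if (medianDays <= 1.5) return "daily";
        if (medianDays <= 10) return "weekly";
        if (medianDays <= 45) return "monthly";
        return "yearly";
    }
}
```

Using the median rather than the mean is a design choice: a blog that posts daily but takes one long holiday would otherwise look like a weekly site.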
You also say that you are using Java. I've found that Java's date parsing libraries are not very flexible. They force you to know the exact format of the date before you parse it. I have written a free, open source, flexible date time parser in Java that you could try: http://ostermiller.org/utils/DateTimeParse.html Once you have found dates on the page (maybe by looking at what comes after "posted on"), you could use my flexible parser to parse dates in a variety of formats.
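To illustrate the "exact format" limitation, a common workaround with the standard java.time API is to try a list of candidate patterns in turn. This is only a minimal sketch and not how the parser linked above works; the class name LenientDateParser and the pattern list are assumptions and would need extending for other locales and layouts.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: tries several known patterns until one matches.
class LenientDateParser {

    private static final List<DateTimeFormatter> FORMATS = new ArrayList<>();

    static {
        // Candidate patterns (illustrative, English month and day names only).
        String[] patterns = {
            "EEEE, MMMM d, yyyy", // Wednesday, June 04, 2014
            "MMMM d, yyyy",       // June 04, 2014
            "d MMM yyyy",         // 30 May 2012
            "yyyy-MM-dd"          // 2014-06-04
        };
        for (String pattern : patterns) {
            FORMATS.add(new DateTimeFormatterBuilder()
                    .parseCaseInsensitive()
                    .appendPattern(pattern)
                    .toFormatter(Locale.ENGLISH));
        }
    }

    // Returns the first successful parse, or empty if no pattern fits.
    static Optional<LocalDate> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text.trim(), format));
            } catch (DateTimeParseException e) {
                // Not this format; fall through to the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

The order of the list matters when patterns are ambiguous (for example day/month versus month/day numeric dates), so the most specific or most likely formats should come first.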

Resume Parsing First Step

I have multiple resumes in the kind of format somebody sends to a company when applying for a job. I need to parse these resumes in Java.
Do I need to convert these resumes to XML first for parsing? Would the example below be a reasonable way to represent a resume in XML?
<Name>Varjhjh</Name>
<Experience>5</Experience>
<Age>7</Age>
.
.
.
Resume parsing isn't a trivial task. I remember implementing one strategy a couple of years ago; the main problem is that everybody structures their CV in their own way.
For example, one person writes Date of Birth, another DOB, and the next Birth Date, so you need some kind of dictionary for these cases.
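A minimal sketch of such a dictionary: every known label variant maps to one canonical field name, and labels are normalized before lookup. The class name ResumeFieldDictionary, the canonical names, and the phone variants are assumptions for illustration; only the three birth-date variants come from the text above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical synonym dictionary for resume field labels.
class ResumeFieldDictionary {

    private static final Map<String, String> CANONICAL = new HashMap<>();

    static {
        register("dateOfBirth", "date of birth", "dob", "birth date");
        register("phone", "phone", "mobile", "landline"); // illustrative variants
    }

    private static void register(String canonical, String... variants) {
        for (String variant : variants) {
            CANONICAL.put(variant, canonical);
        }
    }

    // Normalizes a raw label ("D.O.B.:", "Date of  Birth") and looks it up.
    static Optional<String> canonicalField(String rawLabel) {
        String key = rawLabel.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z ]", "")  // drop punctuation and digits
                .replaceAll("\\s+", " ")    // collapse repeated spaces
                .trim();
        return Optional.ofNullable(CANONICAL.get(key));
    }
}
```

In practice the dictionary would be loaded from a file and grown as new variants turn up, since no fixed list covers every CV.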
Another interesting problem is parsing names, especially if a candidate has a very long name, e.g. Frederick Gerald Hubert Irvim John Kenneth.
Or, for example, a user may list several phone numbers: a landline, a mobile, a first reference's phone, a second reference's, etc.
I remember these guys parsed CVs reasonably well:
www.rchilli.com/
Other Parsing vendors include: Sovren, Daxtra, Burning Glass and Hireability
But I'm not sure whether they offer Java integration, and I'm not sure about their cost.
Anyway, good luck in parsing.
Full disclosure: I work for Sovren, which is a parsing vendor. Resume parsing is not a trivial task. Many companies, including Sovren, HireAbility, Daxtra and Burning Glass, offer installed and SaaS solutions for parsing. The typical workflow is to convert the non-image resume/CV to text and parse it, returning HR-XML, the industry standard.

Are there APIs for text analysis/mining in Java? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I want to know if there is an API to do text analysis in Java. Something that can extract all words in a text, separate words, expressions, etc. Something that can inform if a word found is a number, date, year, name, currency, etc.
I'm just starting with text analysis, so I only need an API to kick off. I made a web crawler; now I need something to analyze the downloaded data. I need methods to count the number of words in a page, find similar words, detect data types, and other resources related to the text.
Are there APIs for text analysis in Java?
EDIT: Text mining. I want to mine the text, and I need a Java API that provides this.
It looks like you're looking for a Named Entity Recogniser.
You have got a couple of choices.
CRFClassifier from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.
GATE (General Architecture for Text Engineering), an open source suite for language processing. Take a look at the screenshots at the page for developers: http://gate.ac.uk/family/developer.html. It should give you a brief idea what this can do. The video tutorial gives you a better overview of what this software has to offer.
You may need to customise one of them to fit your needs.
You also have other options:
simple text extraction via Web services: e.g. Tagthe.net and Yahoo's Term Extractor.
part-of-speech (POS) tagging: extracting part-of-speech (e.g. verbs, nouns) from the text. Here is a post on SO: What is a good Java library for Parts-Of-Speech tagging?.
In terms of training for CRFClassifier, you could find a brief explanation at their FAQ:
...the training data should be in tab-separated columns, and you define the meaning of those columns via a map. One column should be called "answer" and has the NER class, and existing features know about names like "word" and "tag". You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions...
You can also find a code snippet at the javadoc of CRFClassifier:
Typical command-line usage
For running a trained model with a provided serialized classifier on a text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
For example, you might use some classes from the standard library package java.text, or use java.io.StreamTokenizer (which you can customize according to your requirements). But as you know, text from internet sources usually contains many spelling mistakes, and for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities are too limited in that context.
So I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
P.S.
Depending on your needs, you might create a state-machine parser for recognizing templated parts in raw text. A simple state-machine recognizer is a good starting point, and you can build a more advanced parser on the same idea that recognizes much more complex templates.
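As an illustration of the state-machine idea, here is a minimal sketch that walks over tokens and recognizes one template, "day month year" (for example "30 May 2012"). The class name DateTemplateRecognizer, the chosen template, and the English month list are assumptions for the example, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical recognizer for the token template: <day> <month name> <year>.
class DateTemplateRecognizer {

    private enum State { EXPECT_DAY, EXPECT_MONTH, EXPECT_YEAR }

    // English month names and abbreviations (assumption; extend per locale).
    private static final Set<String> MONTHS = new HashSet<>(Arrays.asList(
            "jan", "january", "feb", "february", "mar", "march", "apr", "april",
            "may", "jun", "june", "jul", "july", "aug", "august", "sep", "september",
            "oct", "october", "nov", "november", "dec", "december"));

    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        State state = State.EXPECT_DAY;
        String day = null;
        String month = null;

        for (String raw : text.split("[^\\p{Alnum}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            switch (state) {
                case EXPECT_MONTH:
                    if (MONTHS.contains(token)) {
                        month = token;
                        state = State.EXPECT_YEAR;
                        continue; // transition taken, read next token
                    }
                    break;
                case EXPECT_YEAR:
                    if (token.matches("\\d{4}")) {
                        found.add(day + " " + month + " " + token);
                        state = State.EXPECT_DAY;
                        continue; // template complete
                    }
                    break;
                default:
                    break;
            }
            // No transition matched: restart, using this token as a day if it fits.
            if (token.matches("\\d{1,2}")) {
                day = token;
                state = State.EXPECT_MONTH;
            } else {
                state = State.EXPECT_DAY;
            }
        }
        return found;
    }
}
```

More templates can be supported by adding states and transitions, or by running several small machines like this one over the same token stream.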
If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.
Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
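A minimal sketch of such a Pattern-based classifier, including the 20110723 case: an all-digit token is treated as a date only if it has eight digits and forms a valid calendar date in yyyyMMdd form. The class name TokenClassifier, the Kind categories, and that rule are assumptions; a real analyzer would likely need context from a second pass to decide ambiguous tokens.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical token classifier built on java.util.regex.Pattern.
class TokenClassifier {

    enum Kind { DATE, NUMBER, WORD, OTHER }

    private static final Pattern DIGITS = Pattern.compile("\\d+");
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    static Kind classify(String token) {
        if (DIGITS.matcher(token).matches()) {
            // Assumption: 8 digits that form a valid yyyyMMdd date count as a date.
            return isCompactDate(token) ? Kind.DATE : Kind.NUMBER;
        }
        if (LETTERS.matcher(token).matches()) {
            return Kind.WORD;
        }
        return Kind.OTHER;
    }

    private static boolean isCompactDate(String token) {
        if (token.length() != 8) {
            return false;
        }
        try {
            LocalDate.parse(token, DateTimeFormatter.BASIC_ISO_DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false; // e.g. month 13 or day 40
        }
    }
}
```

With this rule, classify("20110723") is reported as a date while classify("20111340") stays a number, since the latter has no valid month.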
I recommend looking at LingPipe too. If you are OK with webservices then this article has a good summary of different APIs
I'd adapt Lucene's Analysis and Stemmer classes rather than reinventing the wheel. They cover the vast majority of cases. See also the additional and contrib classes.
