I am currently working on a University project under the theme of "search-engine".
For this purpose we were given access to a database of scientific publications
(http://dblp.uni-trier.de)
It is a 2GB XML file which looks something like this:
<article key="GottlobSR96">
<author>Georg Gottlob</author>
<author>Michael Schrefl</author>
<author>Brigitte Röck</author>
<title>Extending Object-Oriented Systems with Roles.</title>
<pages>268-296</pages>
<year>1996</year>
<volume>14</volume>
<journal>TOIS</journal>
<number>3</number>
<url>db/journals/tois/tois14.html#GottlobSR96</url>
</article>
As you can see the "article"-tag contains various information such as author,title of the paper,year of publication. My job now is to implement a Java solution which takes search terms of different categories (author, university,title) as input and provides the user with additional information.
For example if you enter the name of a professor it should return data like his date of birth, the University he works at, number of publications, etc..
I suppose this would work using google api to find for a persons entry on the University homepage and then somehow parsing through the page to find the needed information. For Universities there should be a Wikipedia page.
I already tried using mediawiki api but couldn't figure out how to get only the specific information I want.(I could only get the intro paragraph)
I've never worked on a project of this scale so I don't really have a clue on how to implement foreign API's/libraries etc. into my own code.
So i guess my question is:
How do i get specific information based on a google-search? May it be through wikipedia or otherwise.
Related
I have been working on information extraction and was able to run standAloneAnnie.java
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/StandAloneAnnie.java
My question is, How can I use GATE ANNIE to get similar words like if I input (dine) will get result like (food, eat, dinner, restaurant) ?
More Information:
I am doing a project where I was assigned to develop a simple webpage to take user input and pass to GATE components which will tokenize the query and return a semantic grouping for each phrase in order to make some recommendation.
For example user would enter "I want to have dinner in Kuala Lumpur" and the system will break it down to (Search for :dinner - Required: restaurant, dinner, eat, food - Location: Kuala Lumpur.
ANNIE by default has like 15 annotations, see demo
http://services.gate.ac.uk/annie/
Now I already implemented everything as the demo but my question is. Can I do that using GATE ANNIE, i mean is it possible to find words synonyms or group words based on their type (noun, verbs)?
Plain vanilla ANNIE doesn't support this kind of thing but there are third party plugins such as Phil Gooch's WordNet Suggester that might help. Or if your domain is fairly restricted you might get better results with less effort by simply creating your own gazetteer lists of related terms and a few simple JAPE rules. You may find the training materials available on the GATE Wiki useful if you haven't done much of this before.
I'm looking for a Java library that can do Named entity recognition (NER) with a custom controlled vocabulary, without needing labeled training data first. I searched some on SE, but most questions are rather unspecific.
Consider the following use-case:
an editor is inputting articles in a CMS (about 500 words).
the text may contain references (in plain text) to entities of a specific domain. e.g:
names of points of interest, like bars, restaurants, as well as neighborhoods, etc.
a controlled vocabulary of these entities exist (about 5.000 entities) .
I imagine an entity to be a -tuple in the vocabulary
after finishing the text, the user should be able to save the document.
This triggers the workflow to scan the piece of text against the vocabulary, by comparing against the name of the entity. It's not required to have a 100% match: 97% on Jarao-winkler or whatever (I'm not familiar with what algo's NER uses) may be enough, I need this to be configurable.
Hits are returned to the controller server-side. This in return returns JSON to the client containing of the entities, which are represented as suggested crosslinks to the editor.
Ideally, I'm looking for a project that uses NRE to suggests crosslinks within a CMS-environment to piggyback on. (I'm sure plugins for wordpress exist for example) not so sure if something similar exists in Java.
All other more general pointers to NRE-libraries which work with controlled custom vocabularies are welcome as well.
For people looking this up in the future:
"Approximate Dictionary-Based Chunking"
see: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
(URL edited.)
Unsure if these might be helpful:
http://www-nlp.stanford.edu/software/CRF-NER.shtml
http://cogcomp.cs.illinois.edu/page/software
I have been wondering about this, which is why I have put off learning app development for so long. Let's say I was making a school timetable app, that all the user had to do was enter the name of their course, and then the app shows the timetable for that course..
The questions is can I get information from the college or do I have to hard code it into the database myself?
How does one get information to use if they need it?
Thanks
It depends. Does the college provide you an interface you can use? Probably not one that was meant to be used by a third party app.
If not, then you have to somehow get the information into your database. Either per parsing their online HTML schedules or inputing it by hand (obviously always one of the last options to consider).
If the college had a website that you could view, you could scan the page for class listings and pull that data in - but more than likely that sort of data will need to be entered manually by you when you ship the app.
If college is having its website and the website provides RSS feed for time table you parse that XML file and show the data which is parse or you can save the time table information of which course in the database and display that using cursors.
I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The 2nd portion of this, is what if we want to get out a "reduced" resume from the database. In other words, I want to search the uploaded content I now have, and only print out new resumes for people who have Java programming experience lets say. So I can go into my database, find the people who originally had java as a skill, and output a new set of resumes that are also still in a nice templated format, and only have the relevant info in them, instead of ALL the content.
I have been writing some software to do this in Java that will basically use a docx template, overwriting the items in customXML which are bound to the content controls in the doc, so the new data shows up and can eb saved as a new docx with that custom data.
This seems really cumbersome to me, and has some limitations. For one, lets say my template has a place for 3 Skills, and the particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I dont want to have to go back into my software and edit source code to change that additional data input XML tag to bold instead of italic.
I was doing some reading up on using Infopath to create a form that I could use to get the input, connecting to some sharepoint data source or something to store the stripped out data. However, I can't seem to find out if it is possible using sharepoint to get the data back out, in a nice formatted way. What would the general steps for this be? It seems like I couldnt find very much about this topic with any quick googling.
Thanks
You could set up the skills:
<skills>
<skill>..</skill>
<skill>..</skill>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.
I am looking for some stock symbol look-up API. I could able to query yahoo finance with a symbol & could able to retrieve the stock price & other details.
I am looking for some auto-complete stock look-up API like if i query fo "Go*" ... how can i get all stock symbols starting with GO* = Goog etc ... is there any APi for wildcard stock symbol searches
Any help would be great ..
Thanks
There's a simple problem here and a more complex problem.
If your list of symbols is static, then you could use any typical autocomplete API against a file that you maintain locally.
However, the list of symbols is rarely static. Symbols are constantly being added or erased and changed due to a variety of financial market situations (IPO, acquisitions, mergers, renames, bankrupcies, etc.). Many symbols are also traded on specific exchanges, and some symbols are cross listed. There are also financial instruments that are not simply stocks, such as indices, commodities, etc. The general term for this is "instruments definitions" and a complete list of such definitions is a service provided by companies such as Reuters or Bloomberg.
I am not familiar with any free and open instrument lists that you could get for free, and you need to make sure that you are complying with the licenses of services that allow you to get a current
list.
If you can tolerate a delay of one business day, you might be able to scrape the list from a variety of sources that provide close-of-business listings for all stocks in the US. The WSJ has a printed list (likely also an electronic one). Eoddata provides such a list, etc. But make sure that you are complying with their terms.
yahoo has an api for this:
http://d.yimg.com/autoc.finance.yahoo.com/autoc?query=GOO&callback=YAHOO.Finance.SymbolSuggest.ssCallback
Any autocomplete API or toolkit should work if you put the tickers in as a source. You will probably have to host it yourself as I don't know of any public ones.
You can try using the "Company Search" operation in the Company Fundamentals API here: http://www.mergent.com/servius/