First off let me say that I am a complete newbie with NLP. Although, as you read on, that is probably going to become strikingly apparent.
I'm parsing Wikipedia pages to find all mentions of the page title. I do this by going through the CorefChainAnnotations to find "proper" mentions - I then assume that the most common ones are talking about the page title. I do it by running this:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String content = "Abraham Lincoln was an American politician and lawyer who served as the 16th President of the United States from March 1861 until his assassination in April 1865. Lincoln led the United States through its Civil War—its bloodiest war and perhaps its greatest moral, constitutional, and political crisis.";
Annotation document = new Annotation(content);
pipeline.annotate(document);
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
    List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
    for (CorefChain.CorefMention cm : corefMentions) {
        if (cm.mentionType == Dictionaries.MentionType.PROPER) {
            log("Proper ref using " + cm.mentionSpan + ", " + cm.mentionType);
        }
    }
}
This returns:
Proper ref using the United States
Proper ref using the United States
Proper ref using Abraham Lincoln
Proper ref using Lincoln
I know already that "Abraham Lincoln" is definitely what I am looking for and I can surmise that because "Lincoln" appears a lot as well then that must be another way of talking about the main subject. (I realise right now the most common named entity is "the United States", but once I've fed it the whole page it works fine).
This works great until I have a page like "Gone with the Wind". If I change my code to use that:
String content = "Gone with the Wind has been criticized as historical revisionism glorifying slavery, but nevertheless, it has been credited for triggering changes to the way African-Americans are depicted cinematically.";
then I get no Proper mentions back at all. I suspect this is because none of the words in the title are recognised as named entities.
Is there any way I can get Stanford NLP to recognise "Gone with the Wind" as an already-known named entity? From looking around on the internet it seems to involve training a model, but I want this to be a known named entity just for this single run and I don't want the model to remember this training later.
I can just imagine NLP experts rolling their eyes at the awfulness of this approach, but it gets better! I came up with the great idea of changing any occurrences of the page title to "Thingamijig" before passing the text to Stanford NLP, which works great for "Gone with the Wind" but then fails for "Abraham Lincoln" because (I think) the NER no longer associates "Lincoln" with "Thingamijig" in the corefMentions.
In my dream world I would do something like:
pipeline.addKnownNamedEntity("Gone with the Wind");
But that doesn't seem to be something I can do and I'm not exactly sure how to go about it.
You can submit a dictionary with any phrases you want and have them recognized as named entities.
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping additional.rules -file example.txt -outputFormat text
additional.rules
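Gone with the Wind	MOVIE	MISC	1
Note that the columns above must be tab-delimited, while the tokens inside the phrase are separated by single spaces. As far as I know, matching is case-sensitive by default, so write the phrase with the same casing it has in your text. You can have as many lines as you'd like in the additional.rules file.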
One warning, EVERY TIME that token pattern occurs it will be tagged.
More details here: https://stanfordnlp.github.io/CoreNLP/ner.html
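If the page titles are only known at run time, the rules file can be generated just before the pipeline starts. Here is a minimal sketch in Python, assuming the four-column layout shown above. The helper names `regexner_line` and `write_regexner_rules` and the MOVIE/MISC/1 defaults are made up for illustration and are not part of CoreNLP. It also assumes that splitting the title on whitespace gives the same tokens as the CoreNLP tokenizer, which may not hold for titles containing punctuation.

```python
from pathlib import Path


def regexner_line(phrase, tag="MOVIE", overwritable="MISC", priority=1):
    """Build one rule: tokens joined by single spaces, columns joined by tabs."""
    tokens = " ".join(phrase.split())
    return "\t".join([tokens, tag, overwritable, str(priority)])


def write_regexner_rules(path, phrases, **kwargs):
    """Write one rule per phrase to `path` and return the written lines."""
    lines = [regexner_line(p, **kwargs) for p in phrases]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Calling `write_regexner_rules("additional.rules", ["Gone with the Wind"])` would produce a file you can pass via -ner.additional.regexner.mapping for that single run, with no model training involved.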
I recently discovered the Stanford NLP parser and it seems quite amazing. I currently have a working instance of it running in our project, but I am facing the two problems below.
How can I parse text and then extract only specific part-of-speech labels from the parsed data? For example, how can I extract only NNPS and PRP from the sentence?
Our platform works in both English and German, so the text can always be in either language. How can I accommodate this scenario? Thank you.
Code :
private final String PCG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
private final TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "invertible=true");
public void testParser() {
    LexicalizedParser lp = LexicalizedParser.loadModel(PCG_MODEL);
    String sent = "Complete Howto guide to install EC2 Linux server in Amazon Web services cloud.";
    Tree parse = lp.parse(sent);
    // Each TaggedWord pairs a token with its part-of-speech tag.
    List<TaggedWord> taggedWords = parse.taggedYield();
    System.out.println(taggedWords);
}
The above example works, but as you can see I am loading the English data. Thank you.
Try this:
for (Tree subTree : parse) { // traverse every node of the sentence's parse tree
    if (subTree.label().value().equals("NNPS")) { // the node's label is NNPS
        // Do what you want with subTree here
    }
}
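The same filtering idea can also be applied to the flat list of (word, tag) pairs that taggedYield() prints, rather than to the tree. A rough sketch in Python follows. The function name `filter_by_tags` is hypothetical and the sample pairs are made up for illustration, not real parser output.

```python
def filter_by_tags(tagged_words, wanted=("NNPS", "PRP")):
    """Keep only the (word, tag) pairs whose tag is in `wanted`."""
    wanted = set(wanted)
    return [(word, tag) for word, tag in tagged_words if tag in wanted]


# Hand-written sample pairs, standing in for parser output.
sample = [("They", "PRP"), ("visited", "VBD"), ("the", "DT"), ("Netherlands", "NNPS")]
print(filter_by_tags(sample))  # [('They', 'PRP'), ('Netherlands', 'NNPS')]
```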
For query 1, I don't think stanford-nlp has an option to extract only specific POS tags.
However, using custom-trained models, we can achieve the same. I tried something similar for NER (named entity recognition) with custom models.
My overall goal is to return only clean sentences from a Wikipedia article without any markup. Obviously, there are ways to return JSON, XML, etc., but these are full of markup. My best approach so far is to return what Wikipedia calls raw. For example, the following link returns the raw format for the page "Iron Man":
http://en.wikipedia.org/w/index.php?title=Iron%20Man&action=raw
Here is a snippet of what is returned:
...//I am truncating some markup at the beginning here.
|creative_team_month =
|creative_team_year =
|creators_series =
|TPB =
|ISBN =
|TPB# =
|ISBN# =
|nonUS =
}}
'''Iron Man''' is a fictional character, a [[superhero]] that appears in
[[comic book]]s published by [[Marvel Comics]].
...//I am truncating here everything until the end.
I have stuck to the raw format because I have found it the easiest to clean up. Although what I have written so far in Java cleans up this pretty well, there are a lot of cases that slip by. These cases include markup for Wikipedia timelines, Wikipedia pictures, and other Wikipedia properties which do not appear on all articles. Again, I am working in Java (in particular, I am working on a Tomcat web application).
Question: Is there a better way to get clean, human-readable sentences from Wikipedia articles? Maybe someone already built a library for this which I just can't find?
I will be happy to edit my question to provide details about what I mean by clean and human-readable if it is not clear.
My current Java method which cleans up the raw formatted text is as follows:
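public String cleanRaw(String input){
    // Get rid of references: self-closing tags first, then paired tags.
    // Doing it in this order stops a self-closing <ref ... /> from being
    // matched together with a later </ref>, which would swallow the text between.
    // (?s) lets '.' match newlines so multi-line references are caught too.
    input = input.replaceAll("<ref\\s[^>]*/>", "");
    input = input.replaceAll("(?s)<ref(\\s[^>]*)?>.*?</ref>", "");
    // Remove section headings such as ==History==.
    input = input.replaceAll("==[^=]*==", "");
    // I found that anything between curly braces is not needed.
    // Each pass strips the innermost {{...}} pairs; stop when nothing changes.
    while (input.indexOf("{{") >= 0){
        int prevLength = input.length();
        input = input.replaceAll("\\{\\{[^{}]*\\}\\}", "");
        if (prevLength == input.length()){
            break;
        }
    }
    // Next line gets rid of links to other Wikipedia pages.
    input = input.replaceAll("\\[\\[([^]]*[|])?([^]]*?)\\]\\]", "$2");
    // Remove HTML comments, including multi-line ones.
    input = input.replaceAll("(?s)<!--.*?-->", "");
    // Drop every character that is not a letter, digit, period, comma, or space.
    input = input.replaceAll("[^A-Za-z0-9., ]", "");
    return input;
}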
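As an aside on the curly-brace step: nested templates can also be removed in a single pass with a depth counter instead of repeatedly applying a regex. This is only a sketch in Python of that alternative, not a full wikitext parser. The name `strip_templates` is made up, and it assumes braces always come in `{{` / `}}` pairs (triple braces and unclosed templates are not handled specially, so text after an unclosed `{{` is dropped).

```python
def strip_templates(text):
    """Remove {{...}} templates, including nested ones, in one left-to-right pass."""
    out = []
    depth = 0  # how many {{ are currently open
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:  # keep characters only outside any template
                out.append(text[i])
            i += 1
    return "".join(out)


print(strip_templates("'''Iron Man''' is{{cite|a={{b}}}} here"))
```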
I found a couple of projects that might help. You might be able to run the first one by including a JavaScript engine in your Java code.
txtwiki.js
A javascript library to convert MediaWiki markup to plaintext.
https://github.com/joaomsa/txtwiki.js
WikiExtractor
A Python script that extracts and cleans text from a Wikipedia database dump
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
Source:
http://www.mediawiki.org/wiki/Alternative_parsers
I'm looking for access to financial data from Google services.
I found this URL that gets the stock data for Microsoft.
What are all the possible parameters that Google allows for this kind of HTTP request? I'd like to see all the different information that I could get.
The Google Finance Gadget API was officially deprecated in October 2012, but it was still active as of April 2014. As of March 2022 it is completely dead.
http://www.google.com/finance/info?q=NASDAQ:GOOG
http://www.google.com/finance/info?q=CURRENCY:GBPUSD
http://finance.google.com/finance/info?client=ig&q=AAPL,YHOO
You can also get charts: https://www.google.com/finance/getchart?q=YELP
Note that if your application is for public consumption, using the Google Finance API is against Google's terms of service.
Check google-finance-get-stock-quote-realtime for the complete code in python
There's a whole API for managing portfolios. *Link removed. Google no longer provides a developer API for this.
Getting stock quotes is a little harder. I found one article where someone got stock quotes using Google Spreadsheets.
You can also use the gadgets but I guess that's not what you're after.
The API you mention is interesting but doesn't seem to be documented (as far as I've been able to find anyway).
Here is some information on historical prices, just for reference sake.
I found this site helpful.
http://benjisimon.blogspot.com/2009/01/truly-simple-stock-api.html
It links to an API yahoo seems to offer that is very simple and useful.
For instance:
http://finance.yahoo.com/d/quotes.csv?s=GOOG+AAPL&f=snl1
Full details here:
http://www.gummy-stuff.org/Yahoo-data.htm
Edit: the API call has been removed by Google, so it is no longer functioning.
Agree with Pareshkumar's answer. There is now a Python wrapper, googlefinance, for the URL call.
Install googlefinance
$ pip install googlefinance
It is easy to get current stock price:
>>> from googlefinance import getQuotes
>>> import json
>>> print(json.dumps(getQuotes('AAPL'), indent=2))
[
{
"Index": "NASDAQ",
"LastTradeWithCurrency": "129.09",
"LastTradeDateTime": "2015-03-02T16:04:29Z",
"LastTradePrice": "129.09",
"Yield": "1.46",
"LastTradeTime": "4:04PM EST",
"LastTradeDateTimeLong": "Mar 2, 4:04PM EST",
"Dividend": "0.47",
"StockSymbol": "AAPL",
"ID": "22144"
}
]
Google finance is a source that provides real-time stock data. There are also other APIs from yahoo, such as yahoo-finance, but they are delayed by 15min for NYSE and NASDAQ stocks.
The problem with Yahoo and Google data is that using it commercially violates their terms of service. While your site/app is still small it's no biggie, but as soon as you grow a little you start getting cease and desists from the exchanges.
A licensed solution example is FinancialContent: http://www.financialcontent.com/json.php
or Xignite
You can also pull data from Google Finance directly in Google Sheets via the GOOGLEFINANCE() function, for both current and historical data:
GOOGLEFINANCE("NASDAQ:GOOGL", "price", DATE(2014,1,1), DATE(2014,12,31), "DAILY")
Another way is to use Yahoo Finance instead, via the yfinance package, or with a query like this one, which returns JSON:
https://query1.finance.yahoo.com/v8/finance/chart/MSFT
Code to parse price and panel on the right, and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, json

def scrape_google_finance(ticker: str):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "hl": "en"
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    # https://www.whatismybrowser.com/detect/what-is-my-user-agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
    }

    html = requests.get(f"https://www.google.com/finance/quote/{ticker}", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    # empty dict that the scraped data will be added to
    ticker_data = {
        "ticker_data": {},
        "about_panel": {}
    }

    ticker_data["ticker_data"]["current_price"] = soup.select_one(".AHmHk .fxKbKc").text
    ticker_data["ticker_data"]["quote"] = soup.select_one(".PdOqHc").text.replace(" • ",":")
    ticker_data["ticker_data"]["title"] = soup.select_one(".zzDege").text

    right_panel_keys = soup.select(".gyFHrc .mfs7Fc")
    right_panel_values = soup.select(".gyFHrc .P6K39c")

    # zip stops at the shorter list, so a missing key or value cannot yield None
    for key, value in zip(right_panel_keys, right_panel_values):
        key_value = key.text.lower().replace(" ", "_")
        ticker_data["about_panel"][key_value] = value.text

    return ticker_data

data = scrape_google_finance(ticker="GOOGL:NASDAQ")
print(json.dumps(data, indent=2))
JSON output:
{
"ticker_data": {
"current_price": "$2,534.60",
"quote": "GOOGL:NASDAQ",
"title": "Alphabet Inc Class A"
},
"about_panel": {
"previous_close": "$2,597.88",
"day_range": "$2,532.02 - $2,609.59",
"year_range": "$2,193.62 - $3,030.93",
"market_cap": "1.68T USD",
"volume": "1.56M",
"p/e_ratio": "22.59",
"dividend_yield": "-",
"primary_exchange": "NASDAQ",
"ceo": "Sundar Pichai",
"founded": "Oct 2, 2015",
"headquarters": "Mountain View, CaliforniaUnited States",
"website": "abc.xyz",
"employees": "156,500"
}
}
This is outside the scope of your question, but if you need to parse the whole Google Finance ticker page, there's a line-by-line "Scrape Google Finance Ticker Quote Data in Python" blog post about it at SerpApi.
Perhaps of interest, the Google Finance API documentation includes a section detailing how to access different parameters via JavaScript.
I suppose the JavaScript API might be a wrapper to the JSON request you mention above... perhaps you could check which HTTP requests are being sent.
This is no longer an active API for google, you can try Xignite, although they charge: http://www.xignite.com
The simplest way, as you have explained, is this link, which is for the
'Dow Jones Industrial Average'.
Link 2 is for the 'NASDAQ-100',
and link 3 covers everything related to NASDAQ.
I think this should be it, unless you want the same data in JSON notation, as in your Microsoft example.
Please refer to this old post; I think it will help.
Update:
To get the volume and other details,
I have created a VBScript that uses an IE object to fetch the page from the link and alerts the content of a particular id. (Create a .vbs file and run it.)
Set IE = CreateObject("InternetExplorer.Application")
IE.visible = true
IE.Navigate "https://www.google.com/finance?q=INDEXNASDAQ%3ANDX&sq=NASDAQ&sp=2&ei=B3UoUsiIH5DIlgPEsQE"
' Wait until the page has finished loading (readyState 4 means complete)
while IE.readyState <> 4: WScript.Sleep 10: wend
dim ht
ht= IE.document.getElementById("market-data-div").innerText
msgBox ht
IE.quit
This will alert the values from the page,
like this:
3,124.54 0.00 (0.00%)
Sep 4 - Close
INDEXNASDAQ real-time data - Disclaimer
Range -
52 week 2,494.38 - 3,149.24
Open -
Vol. 0.00
I am sure this will help..
Here is an example that you can use. I haven't got Google Finance working yet, but here is the Yahoo example. You will need the HtmlAgilityPack, which is awesome. Happy symbol hunting.
Call the procedure by using YahooStockRequest(string Symbols);
Where Symbols = a comma-delimited string of symbols, or just one symbol
public string YahooStockRequest(string Symbols,bool UseYahoo=true)
{
{
string StockQuoteUrl = string.Empty;
try
{
// Use Yahoo finance service to download stock data from Yahoo
if (UseYahoo)
{
string YahooSymbolString = Symbols.Replace(",","+");
StockQuoteUrl = @"http://finance.yahoo.com/q?s=" + YahooSymbolString + "&ql=1";
}
else
{
//Going to Put Google Finance here when I Figure it out.
}
// Initialize a new WebRequest.
HttpWebRequest webreq = (HttpWebRequest)WebRequest.Create(StockQuoteUrl);
// Get the response from the Internet resource.
HttpWebResponse webresp = (HttpWebResponse)webreq.GetResponse();
// Read the body of the response from the server.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
string pageSource;
using (StreamReader sr = new StreamReader(webresp.GetResponseStream()))
{
pageSource = sr.ReadToEnd();
}
doc.LoadHtml(pageSource.ToString());
if (UseYahoo)
{
string Results=string.Empty;
//loop through each Symbol that you provided with a "," delimiter
foreach (string SplitSymbol in Symbols.Split(new char[] { ',' }))
{
Results+=SplitSymbol + " : " + doc.GetElementbyId("yfs_l10_" + SplitSymbol).InnerText + Environment.NewLine;
}
return (Results);
}
else
{
return (doc.GetElementbyId("ref_14135_l").InnerText);
}
}
catch (WebException Webex)
{
return("SYSTEM ERROR DOWNLOADING SYMBOL: " + Webex.ToString());
}
}
}
Building upon the shoulders of giants...here's a one-liner I wrote to zap all of Google's current stock data into local Bash shell variables:
stock=$1
# Fetch from Google Finance API, put into local variables
eval $(curl -s "http://www.google.com/ig/api?stock=$stock"|sed 's/</\n</g' |sed '/data=/!d; s/ data=/=/g; s/\/>/; /g; s/</GF_/g' |tee /tmp/stockprice.tmp.log)
echo "$stock,$(date +%Y-%m-%d),$GF_open,$GF_high,$GF_low,$GF_last,$GF_volume"
Then you will have variables like $GF_last $GF_open $GF_volume etc. readily available. Run env or see inside /tmp/stockprice.tmp.log
http://www.google.com/ig/api?stock=TVIX&output=csv by itself returns:
<?xml version="1.0"?>
<xml_api_reply version="1">
<finance module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0" >
<symbol data="TVIX"/>
<pretty_symbol data="TVIX"/>
<symbol_lookup_url data="/finance?client=ig&q=TVIX"/>
<company data="VelocityShares Daily 2x VIX Short Term ETN"/>
<exchange data="AMEX"/>
<exchange_timezone data="ET"/>
<exchange_utc_offset data="+05:00"/>
<exchange_closing data="960"/>
<divisor data="2"/>
<currency data="USD"/>
<last data="57.45"/>
<high data="59.70"/>
<low data="56.85"/>
etc.
So for stock="FBM" /tmp/stockprice.tmp.log (and your environment) will contain:
GF_symbol="FBM";
GF_pretty_symbol="FBM";
GF_symbol_lookup_url="/finance?client=ig&q=FBM";
GF_company="Focus Morningstar Basic Materials Index ETF";
GF_exchange="NYSEARCA";
GF_exchange_timezone="";
GF_exchange_utc_offset="";
GF_exchange_closing="";
GF_divisor="2";
GF_currency="USD";
GF_last="22.82";
GF_high="22.82";
GF_low="22.82";
GF_volume="100";
GF_avg_volume="";
GF_market_cap="4.56";
GF_open="22.82";
GF_y_close="22.80";
GF_change="+0.02";
GF_perc_change="0.09";
GF_delay="0";
GF_trade_timestamp="8 hours ago";
GF_trade_date_utc="20120228";
GF_trade_time_utc="184541";
GF_current_date_utc="20120229";
GF_current_time_utc="033534";
GF_symbol_url="/finance?client=ig&q=FBM";
GF_chart_url="/finance/chart?q=NYSEARCA:FBM&tlf=12";
GF_disclaimer_url="/help/stock_disclaimer.html";
GF_ecn_url="";
GF_isld_last="";
GF_isld_trade_date_utc="";
GF_isld_trade_time_utc="";
GF_brut_last="";
GF_brut_trade_date_utc="";
GF_brut_trade_time_utc="";
GF_daylight_savings="false";
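For anyone who would rather not eval sed output, the same "one variable per element" idea can be sketched in Python with the standard library XML parser. This assumes a reply shaped like the XML shown above, where each child of the finance element carries its value in a data attribute. The function name `finance_fields` is made up, and the sample is a trimmed copy of the reply above rather than a live response, since the endpoint is reported elsewhere in this thread as no longer available.

```python
import xml.etree.ElementTree as ET


def finance_fields(xml_text, prefix="GF_"):
    """Map each child of <finance> to prefix + tag name -> its data attribute."""
    root = ET.fromstring(xml_text)
    finance = root.find("finance")
    if finance is None:
        return {}
    return {prefix + child.tag: child.get("data", "") for child in finance}


# Trimmed sample in the shape shown above (not fetched from the network).
sample = """<?xml version="1.0"?>
<xml_api_reply version="1">
<finance module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0" >
<symbol data="TVIX"/>
<exchange data="AMEX"/>
<last data="57.45"/>
</finance>
</xml_api_reply>"""

fields = finance_fields(sample)
print(fields["GF_symbol"], fields["GF_last"])
```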
The Google stock quote API has gone away. However, Investor's Exchange offers an API that's very easy to use for quote data.
I personally built an app for stock data and fundamentals with Intrinio two years ago, but abandoned the project because I was beaten to market by a competitor.
I built it in Java, but they support multiple stacks. Back then, you could access their API for free for testing purposes, but I think they build packages based on your needs now.
In any case, they were exceptionally helpful and charged low fees from what I remember, and their library is well documented, so pulling data in JSON is very straightforward.
In order to find chart data using the financial data API of Google, one must simply go to Google as if looking for a search term, type finance into the search engine, and a link to Google finance will appear. Once at the Google finance search engine, type the ticker name into the financial data API engine and the result will be displayed. However, it should be noted that all Google finance charts are delayed by 15 minutes, and at most can be used for a better understanding of the ticker's past history, rather than current price.
A solution to the delayed chart information is to obtain a real-time financial data API. An example of one would be the barchartondemand interface that has real-time quote information, along with other detailed features that make it simpler to find the exact chart you're looking for. With fully customizable features, and specific programming tools for the precise trading information you need, barchartondemand's tools outdo Google finance by a wide margin.
Try with this:
http://finance.google.com/finance/info?client=ig&q=NASDAQ:GOOGL
It will return all available details about the given stock.
For example, the output looks like this:
// [ {
"id": "694653"
,"t" : "GOOGL"
,"e" : "NASDAQ"
,"l" : "528.08"
,"l_fix" : "528.08"
,"l_cur" : "528.08"
,"s": "0"
,"ltt":"4:00PM EST"
,"lt" : "Dec 5, 4:00PM EST"
,"lt_dts" : "2014-12-05T16:00:14Z"
,"c" : "-14.50"
,"c_fix" : "-14.50"
,"cp" : "-2.67"
,"cp_fix" : "-2.67"
,"ccol" : "chr"
,"pcls_fix" : "542.58"
}
]
You can have your company stock symbol at the end of this URL to get its details:
http://finance.google.com/finance/info?client=ig&q=<YOUR COMPANY STOCK SYMBOL>
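Note that the response body shown above starts with `//`, so it is not valid JSON until that prefix is removed. Here is a small sketch in Python of parsing a saved response of that shape. The function name `parse_finance_info` is made up, the sample is an abbreviated copy of the output above, and nothing is fetched, because other answers in this thread report that the endpoint has been removed.

```python
import json


def parse_finance_info(body):
    """Strip the leading // guard, then parse the rest as JSON."""
    text = body.lstrip()
    if text.startswith("//"):
        text = text[2:]
    return json.loads(text)


# Abbreviated copy of the response shown above.
sample = """// [ {
"id": "694653"
,"t" : "GOOGL"
,"e" : "NASDAQ"
,"l" : "528.08"
}
]"""

quotes = parse_finance_info(sample)
print(quotes[0]["t"], quotes[0]["l"])  # GOOGL 528.08
```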