Scrape links from a list in scrapy OR create a loop? - java

I want to scrape this website: https://www.racingpost.com/results for the results.
I already have a crawler that scrapes and follows the links on the results page, but I cannot go further back than the six or seven days that are displayed on the site. The older results are available via the "resultsfinder", which is sadly JavaScript, as are other sources for the older races, such as the form of the horses.
I already tried to learn to scrape the JavaScript to get the links, and while it is very interesting, I am wondering if there is not an easier way, as the result page addresses are designed in a very convenient way:
It's simply https://www.racingpost.com/results/ plus something like 1990-02-08 or 2021-02-11 or any other date.
So I thought it might be easier to design the spider to get its links from a loop or a predefined list of links.
How could I design a loop that runs from 1990-01-01 up to today in Scrapy, or is it better to create a predefined list of links?

Generate the dates in the spider and append them to the base URL; there is no need to create a predefined list of links.
from datetime import date, timedelta

# Initialize variables
start_date = date(1990, 1, 1)
end_date = date.today()
crawl_date = start_date
base_url = "https://www.racingpost.com/results/"
links = []

# Generate the links
while crawl_date <= end_date:
    links.append(base_url + str(crawl_date))
    crawl_date += timedelta(days=1)
Then loop through the generated list, or alternatively yield a request (with your parse function as the callback) straight from the while loop instead of adding the links to a list, as in the sketch below.
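For example, a minimal sketch of such a spider, yielding one request per date from start_requests (the spider name and the parse body are placeholders):
from datetime import date, timedelta

import scrapy

class ResultsSpider(scrapy.Spider):
    name = "results"

    def start_requests(self):
        # Yield one request per date; parse() is called for each response.
        crawl_date = date(1990, 1, 1)
        while crawl_date <= date.today():
            url = "https://www.racingpost.com/results/" + str(crawl_date)
            yield scrapy.Request(url, callback=self.parse)
            crawl_date += timedelta(days=1)

    def parse(self, response):
        # Extract the results here, as your existing spider already does.
        ...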
Example results:
>>> links
[
"https://www.racingpost.com/results/1990-01-01",
"https://www.racingpost.com/results/1990-01-02",
"https://www.racingpost.com/results/1990-01-03",
"https://www.racingpost.com/results/1990-01-04",
"https://www.racingpost.com/results/1990-01-05",
...
]

Related

How do I Jsoup query for the value of an HTML key/value pair

Sorry if my terms are off, I haven't done this before.
I'm using jsoup to scrape a single value off a website page.
I am trying to find the "serialno", which is stored within this function (JavaScript?):
function set(obj, val)
{
    document.getElementById(obj).innerHTML = val;
}
called by
{set("modelname", "NPort 5650-16");set("mac", "00:90:E8:22:76:F4");set("serialno", "2583");set("ver", "3.3 Build 08042219");setlabel("NPORT");uptime("264 days, 03h:31m:34s");}
I am unsure how I can use jsoup to extract/print the serialno value, which in this case happens to be 2583. I've tried basic commands using getElementById, but I've never used jsoup before. I am familiar with maps, but not sure how to manipulate them with jsoup, and most of the tutorials online need the actual 'path' to the exact cell within the table where this info is displayed.
You can't use Jsoup to do this. Jsoup can parse HTML, but JavaScript is out of its reach and is treated as text: it can't be executed, and selecting things from JavaScript is not possible.
But if you already have the HTML parsed into a Document and you're looking for an alternative solution, you may try a regular expression to grab this value.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.parse(...); // parse your HTML as before
String html = doc.toString();
Pattern p = Pattern.compile("set\\(\"serialno\", \"(\\d+)\"\\)");
Matcher m = p.matcher(html);
if (m.find()) {
    String serialno = m.group(1);
    System.out.println(serialno); // prints 2583
}

Redisearch query with "begin with" instead of "contains"

I am trying to understand how to perform queries in Redisearch that strictly match "begins with", but I keep getting "contains" matches.
For example, if I have fields with values like 'football', 'myfootball', 'greenfootball' and I provide a search term like this:
> FT.SEARCH myIdx #myfield:foot*
I want to get just 'football', but I keep getting other fields that contain the word instead of beginning with it.
Is there a way to avoid this?
I was trying to use VERBATIM and things like #myfield:^foot*, but nothing worked.
I am using JRedisearch as a client, but eventually I had to enter the DB and perform these queries manually in order to figure out what's happening. That being said, is this possible to do with this client at the moment?
Thanks
EDIT
A sample of my index setup:
Client client = new Client(INDEX_NAME, url, PORT);
Schema sc = new Schema().addSortableTextField("url", 1.0); // using this field for query
client.dropIndex(true);
client.createIndex(sc, Client.IndexOptions.Default());
return client;
Sample document:
id: // random uuid
urlPath: myfootbal
application: web
market: Europe
After checking the RDB you provided, I see that when searching for foot* you are not getting myfootbal. The replies look like this: /dot-com/plp/football/x/index.html. You are getting those replies because the URL is tokenized, and '/' is one of the tokenization characters. If you do not want those URLs to be tokenized, you need to declare the field as TAG and not as TEXT. This way the entire URL is indexed as-is, and when you search for foot* it will not appear in the results.
For more information about TAGS see the FT.CREATE documentation: https://oss.redislabs.com/redisearch/Commands.html
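For illustration, here is a minimal sketch of the TAG setup using the raw commands through redis-py (the index and field names are taken from the question; JRedisearch exposes the same schema options through its Schema class):
import redis

r = redis.Redis(host="localhost", port=6379)

# Declare the url field as TAG so the value is indexed as-is, not tokenized.
r.execute_command("FT.CREATE", "myIdx", "SCHEMA", "url", "TAG")

# Tag fields match whole values, so /dot-com/plp/football/x/index.html
# will no longer turn up for a text-style foot* search:
print(r.execute_command("FT.SEARCH", "myIdx", "@url:{football}"))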

How can I efficiently extract text from a bunch of web pages without extra information

I have a list of around 1 million web pages, and I want to efficiently extract just the text from those pages. Currently I am using the BeautifulSoup library in Python to get text from the HTML, and the requests library to fetch the HTML of a web page. This approach extracts some extra information in addition to the text, such as any JavaScript listed in the body.
Could you please suggest a suitable and efficient way to do the task? I looked at Scrapy, but it looks like it crawls a specific website. Can I pass it a list of specific web pages to get information from?
Thank you in advance.
Yes, you can use Scrapy to crawl a set of URLs in a generic fashion.
You simply need to set them on the start_urls list attribute of your spider, or reimplement the start_requests spider method to yield requests from any data source, and then implement your parse callback to perform the generic content extraction you want.
You can use html-text to extract text from them, and regular Scrapy selectors to extract additional data like the one you mention.
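A minimal sketch of such a spider (the URLs and names here are placeholders; html-text is the pip package of the same name):
import html_text
import scrapy

class TextSpider(scrapy.Spider):
    name = "text"
    # Populate start_urls from your list of ~1 million pages.
    start_urls = [
        "https://example.com/page1",
        "https://example.com/page2",
    ]

    def parse(self, response):
        # html-text returns the visible text of the page, without
        # scripts, styles and other markup noise.
        yield {
            "url": response.url,
            "text": html_text.extract_text(response.text),
        }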
In Scrapy you can also set up your own parser, e.g. BeautifulSoup, and call it from your parse method.
To extract text from generic pages I traverse the body only, excluding comments and some tags like script and style:
import re
import bs4

# soup is the page parsed with bs4.BeautifulSoup
snippets = []
for snippet in soup.find('body').descendants:
    if isinstance(snippet, bs4.element.NavigableString) \
            and not isinstance(snippet, EXCLUDED_STRING_TYPES) \
            and snippet.parent.name not in EXCLUDED_TAGS:
        snippet = re.sub(UNICODE_WHITESPACES, ' ', snippet)
        snippet = snippet.strip()
        if snippet != '':
            snippets.append(snippet)
with
EXCLUDED_STRING_TYPES = (bs4.Comment, bs4.CData, bs4.ProcessingInstruction, bs4.Declaration)
EXCLUDED_TAGS = ['script', 'noscript', 'style', 'pre', 'code']
UNICODE_WHITESPACES = re.compile(u'[\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004'
                                 u'\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+')

Extract data from multiple classes in Selenium using Java

The objective is to extract reviews from an e-commerce website. How should I proceed to extract data from multiple classes using Selenium and then apply a for loop? Do I have to create an XPath with all the classes, and if yes, what should the syntax be? A few of the classes contain data as strings and some as integers.
[Flipkart Reviews - class details]
class="_2xg6Ul" Brilliant
class="qwjRop" Best camera in smartphone period. Have note 8 and iPhone X also but pixel 2 with single lens beats them hands down
class= "_3LYOAd _3sxSiS" Flipkart Customer
class="_3LYOAd" 29 Nov, 2017
class="_1_BQL8" 142
Based on the very, very limited information you provided, this is what I came up with. You'll have to provide more information like the code you already have as well as the full HTML for each element.
// Note: the XPath attribute syntax is @class, not #class.
List<WebElement> list = driver.findElements(By.xpath("//body[contains(@class, '_')]"));
// Iterate through the list and print each review
for (WebElement review : list) {
    System.out.println(review.getText());
}

Workaround for scraping HTML by diving into the JS source code

I learned about jsoup recently and would like to dive more into it. However, I have hit an obstacle handling web pages with JavaScript (I have no knowledge of JS, yet :/).
I have read that HtmlUnit would be the correct tool to perform web-browser actions, but I figured out that I would need no knowledge of JS if I can find the JSON object the web page obtains via its JavaScript.
For example, this page:
among the source files, one of them is tooltips.js. In this file, the variable rgNeededFeeds is built and passed to the method LoadHeropediaData(), which is the method that generates the whole URL for getting the JSON object:
URL = URL + 'jsfeed/heropediadata?feeds='+strFeeds+'&v=3633666222511362823&l=english';
I could not get my mind around what strFeeds actually is. I have tried various combinations, but it doesn't work (it returned an empty array...). Or is my guess totally off?
What I actually need is the data displayed on top when you click one of the "items". The info in the "hover" would do too, but it lacks the "recipe" info. And I'm presuming that by getting the JSON object from the full URL above, basically all the data should be in that JSON.
Anyway, this is only based on what I understand from staring at those source files for hours. Do correct me if I'm wrong. (I'm in Java, by the way.)
P.S. I would also like to take this opportunity to express my thanks to BalusC; he has been everywhere when I have doubts about jsoup. :>
strFeeds is nothing but one of these two strings: itemdata or abilitydata.
You can find this in tooltips.js at lines 38-45:
var rgNeededFeeds = [];
$.each( [ 'item', 'ability' ],
    function( i, ttType ){
        icons = GetIconCollection( ttType );
        if ( icons.length ){
            rgNeededFeeds.push( ttType+'data' );
            //..............
        }
    }
)
ttType takes each value of the array [ 'item', 'ability' ]; concatenated with the string 'data', it is pushed into the array rgNeededFeeds.
The function LoadHeropediaData is called at the end of the function above with rgNeededFeeds as parameter :
LoadHeropediaData( rgNeededFeeds );
Aside note: if you are starting to scrape websites, learning JavaScript will be MANDATORY.
NOTE : you're right, the JSON contains all the information needed...
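To illustrate, a small sketch of fetching such a feed (the host is not shown in the question, so base_url is a placeholder; the version parameter is copied from the URL above):
import requests

base_url = "https://example.com/"  # placeholder: the site root used by tooltips.js

# strFeeds is 'itemdata' or 'abilitydata', per the answer above
params = {"feeds": "itemdata", "v": "3633666222511362823", "l": "english"}
data = requests.get(base_url + "jsfeed/heropediadata", params=params).json()

# The JSON should contain all the item info, including the recipe data.
print(list(data))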
