Workaround on Scraping HTML by diving into js source code

I learned about jsoup recently and would like to dive deeper into it. However, I have hit an obstacle handling webpages that use JavaScript (I have no knowledge of JS, yet :/).
I have read that HtmlUnit would be the correct tool for performing web browser actions, but I figured I would not need any JS knowledge if I could find the JSON object that the page's JavaScript fetches.
For example, this page:
Among its source files, one is tooltips.js. In this file, the variable rgNeededFeeds is built and passed to the method LoadHeropediaData(), which is the method that generates the whole URL for getting the JSON object.
URL = URL + 'jsfeed/heropediadata?feeds='+strFeeds+'&v=3633666222511362823&l=english';
I could not work out what strFeeds actually is. I have tried various combinations but none of them work (the request returned an empty array...). Or is my guess totally off?
What I actually need is the data displayed at the top when you click on one of the "items". The info in the "hover" would do too, but it lacks the "recipe" info. I am presuming that the JSON object returned from the full URL above contains basically all of that data.
Anyway, this is only based on what I understand from staring at those source files for hours. Do correct me if I'm wrong. (I'm working in Java, by the way.)
P.S. I would also like to take this opportunity to express my thanks to BalusC, who has been everywhere whenever I have had doubts about jsoup. :>

strFeeds is simply one of these two strings: itemdata or abilitydata.
You can find this in tooltips.js at lines 38-45:
var rgNeededFeeds = [];
$.each( [ 'item', 'ability' ],
    function( i, ttType ){
        icons = GetIconCollection( ttType );
        if ( icons.length ){
            rgNeededFeeds.push( ttType+'data' );
            //..............
        }
    }
)
ttType is the current value of an iteration over the array [ 'item', 'ability' ]; it is concatenated with the string 'data' and pushed into the array rgNeededFeeds.
The function LoadHeropediaData is called at the end of the function above with rgNeededFeeds as its parameter:
LoadHeropediaData( rgNeededFeeds );
Side note: if you are getting started with scraping websites, learning JavaScript will be MANDATORY.
NOTE: you're right, the JSON contains all the information needed...
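
As a rough illustration of how the request URL from the question is put together once strFeeds is known, here is a minimal Python sketch (Python is used only for brevity; the same string building applies in Java). The function name build_feed_url and the base_url argument are hypothetical, since the page's base URL is not shown here. The path and the v and l parameters are copied from the question, and passing a single feed name follows the answer above.

```python
from urllib.parse import urlencode


def build_feed_url(base_url, feed, version="3633666222511362823", language="english"):
    """Build the JSON feed URL described in the question.

    base_url is a placeholder for the site's root URL (not given in the question);
    feed is expected to be 'itemdata' or 'abilitydata', as the answer states.
    """
    query = urlencode({"feeds": feed, "v": version, "l": language})
    return f"{base_url}jsfeed/heropediadata?{query}"


if __name__ == "__main__":
    # example.invalid is a stand-in host, not the real site.
    print(build_feed_url("https://example.invalid/", "itemdata"))
```

The resulting URL can then be fetched with any HTTP client and the response parsed as JSON, without executing the page's JavaScript.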

Related

Scrape links from a list in scrapy OR create a loop?

I want to scrape this website: https://www.racingpost.com/results for the results.
I already have a crawler that scrapes and follows the links on the results page, but I cannot go further back than the six or seven days that are displayed on the site. The older results are available via the "resultsfinder", which is sadly JavaScript, as are other sources for older races such as the form of the horses.
I already tried to learn how to scrape JavaScript to get the links, and while it is very interesting, I am wondering whether there is an easier way, as the result page addresses are designed in a very convenient way:
It's simply https://www.racingpost.com/results/ + something like 1990-02-08 or 2021-02-11 or any other date.
So I thought it might be easier to design the spider to get its links from a loop or a predefined list of links.
How could I design a loop that runs from 1990-01-01 up to now in Scrapy, or is it better to create a predefined list of links for this?

Generate the dates in the spider and append them to the link, no need to create a predefined list of links.
from datetime import date, timedelta
# Initialize variables
start_date = date(1990, 1, 1)
end_date = date.today()
crawl_date = start_date
base_url = "https://www.racingpost.com/results/"
links = []
# Generate the links, one per day (str(date) gives YYYY-MM-DD)
while crawl_date <= end_date:
    links.append(base_url + str(crawl_date))
    crawl_date += timedelta(days=1)
Then loop through the generated list, or alternatively just call the parse function from the while-loop instead of adding the links to a list.
Example results:
>>> links
[
"https://www.racingpost.com/results/1990-01-01",
"https://www.racingpost.com/results/1990-01-02",
"https://www.racingpost.com/results/1990-01-03",
"https://www.racingpost.com/results/1990-01-04",
"https://www.racingpost.com/results/1990-01-05",
...
]
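
The alternative mentioned above (not building the list first) could look roughly like the following generator, which yields one URL per day on demand. The function name iter_result_urls is made up for this sketch; inside a Scrapy spider, one would presumably loop over it in start_requests and yield a request for each URL, but that part is left out here so the example runs with the standard library only.

```python
from datetime import date, timedelta

BASE_URL = "https://www.racingpost.com/results/"


def iter_result_urls(start_date, end_date, base_url=BASE_URL):
    """Yield one results URL per day from start_date to end_date inclusive."""
    crawl_date = start_date
    while crawl_date <= end_date:
        # date.isoformat() produces YYYY-MM-DD, matching the site's URL scheme.
        yield base_url + crawl_date.isoformat()
        crawl_date += timedelta(days=1)


if __name__ == "__main__":
    for url in iter_result_urls(date(1990, 1, 1), date(1990, 1, 3)):
        print(url)
```

A generator avoids holding more than ten thousand URLs in memory at once, although at this scale a plain list works just as well.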

Redisearch query with "begin with" instead of "contains"

I am trying to understand how to perform queries in RediSearch strictly with "begins with", but I keep getting "contains".
For example, if I have fields with values like 'football', 'myfootball', 'greenfootball' and provide a search term like this:
> FT.SEARCH myIdx @myfield:foot*
I want to get just 'football', but I keep getting other fields that contain the word instead of beginning with it.
Is there a way to avoid this?
I tried using VERBATIM and things like @myfield:^foot*, but nothing worked.
I am using JRediSearch as a client, but eventually I had to enter the DB and perform these queries manually in order to figure out what's happening. That being said, is this possible to do with this client at the moment?
Thanks
EDIT
A sample of my index setup:
Client client = new Client(INDEX_NAME, url, PORT);
Schema sc = new Schema().addSortableTextField("url", 1.0); // using this field for query
client.dropIndex(true);
client.createIndex(sc, Client.IndexOptions.Default());
return client;
Sample document:
id: // random uuid
urlPath: myfootbal
application: web
market: Europe

After checking the RDB you provided, I see that when searching foot* you are not getting myfootbal. The replies look like this: /dot-com/plp/football/x/index.html. You are getting those replies because this URL is tokenized, and '/' is one of the token separator characters. If you do not want those URLs to be tokenized, you need to declare the field as TAG and not as TEXT. This way the entire URL will be indexed as is, and when you search for foot* it will not appear in the results.
For more information about TAG fields, see the FT.CREATE documentation: https://oss.redislabs.com/redisearch/Commands.html
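
To make the difference concrete, here is a small Python model of the two behaviours. It is only a simplified illustration and not RediSearch's actual tokenizer: the separator rule (anything other than letters, digits, or underscore) and both function names are assumptions made for this sketch.

```python
import re


def text_prefix_match(value, prefix):
    """Model of a TEXT field: the value is split into tokens, and the
    prefix may match the beginning of ANY token."""
    tokens = [t for t in re.split(r"[^0-9A-Za-z_]+", value.lower()) if t]
    return any(token.startswith(prefix.lower()) for token in tokens)


def tag_prefix_match(value, prefix):
    """Model of a TAG field: the whole value is kept as one unit, so the
    prefix must match the beginning of the entire value."""
    return value.lower().startswith(prefix.lower())


if __name__ == "__main__":
    url = "/dot-com/plp/football/x/index.html"
    print(text_prefix_match(url, "foot"))  # the 'football' token matches
    print(tag_prefix_match(url, "foot"))   # the whole value starts with '/'
```

In this model the URL matches foot* as TEXT because one of its tokens is football, while as a single untokenized value it does not, which is the behaviour the answer describes.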

How can I efficiently extract text from a bunch of web pages without extra information

I have a list of around 1 million web pages, and I want to efficiently extract just the text from them. Currently I am using the BeautifulSoup library in Python to get text from the HTML, and requests to fetch the HTML of each page. This approach extracts some extra information in addition to the text, such as any JavaScript included in the body.
Could you please suggest a suitable and efficient way to do this? I looked at Scrapy, but it looks like it crawls a specific website. Can we pass it a list of specific web pages to get information from?
Thank you in advance.

Yes, you can use Scrapy to crawl a set of URLs in a generic fashion.
You simply need to set them on the start_urls list attribute of your spider, or reimplement the start_requests spider method to yield requests from any data source, and then implement your parse callback to perform the generic content extraction you want.
You can use html-text to extract text from them, and regular Scrapy selectors to extract additional data like the one you mention.

In Scrapy you can use your own parser, e.g. Beautiful Soup, and call it from your parse method.
To extract text from generic pages, I traverse the body only and exclude comments and similar string types as well as some tags like script, style, etc.:
snippets = []
for snippet in soup.find('body').descendants:
    if isinstance(snippet, bs4.element.NavigableString) \
            and not isinstance(snippet, EXCLUDED_STRING_TYPES) \
            and snippet.parent.name not in EXCLUDED_TAGS:
        snippet = re.sub(UNICODE_WHITESPACES, ' ', snippet)
        snippet = snippet.strip()
        if snippet != '':
            snippets.append(snippet)
with
EXCLUDED_STRING_TYPES = (bs4.Comment, bs4.CData, bs4.ProcessingInstruction, bs4.Declaration)
EXCLUDED_TAGS = ['script', 'noscript', 'style', 'pre', 'code']
UNICODE_WHITESPACES = re.compile(u'[\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004'
u'\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+')
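
For comparison, the same filtering idea can be sketched with only the standard library's html.parser. This is not the code from the answer above: the class and function names are invented for the example, and unlike the Beautiful Soup version it does not restrict itself to the body element, so text from the head (such as the title) would also be collected.

```python
from html.parser import HTMLParser

# Same tag list as in the answer above.
EXCLUDED_TAGS = {"script", "noscript", "style", "pre", "code"}


class TextExtractor(HTMLParser):
    """Collect visible text snippets, skipping text inside excluded tags."""

    def __init__(self):
        super().__init__()
        self.snippets = []
        self._excluded_depth = 0  # > 0 while inside an excluded tag

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_TAGS:
            self._excluded_depth += 1

    def handle_endtag(self, tag):
        if tag in EXCLUDED_TAGS and self._excluded_depth > 0:
            self._excluded_depth -= 1

    def handle_data(self, data):
        if self._excluded_depth == 0:
            # Collapse runs of whitespace and drop empty snippets.
            text = " ".join(data.split())
            if text:
                self.snippets.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.snippets


if __name__ == "__main__":
    page = "<body><p>Hello   world</p><script>var x = 1;</script><p>Bye</p></body>"
    print(extract_text(page))
```

HTML comments never reach handle_data, so they are dropped without any extra check. For badly broken markup, a lenient parser such as Beautiful Soup is likely to be more forgiving than this sketch.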

Passing Jackson JSON object from JSP to JavaScript function

I have a JSON string stored in a database. In one of my JSP pages, I retrieve this string, and I want to be able to pass the string or the JSON object into a JavaScript function. The function is simply this, for test purposes:
function test(h){
    alert(h);
}
Now, I can retrieve the JSON string from the database fine (I have printed it out to the screen to make sure), however when I pass it in like this
<input type="button"
name="setFontButton"
value="Set"
class="form_btn_primary"
onclick="test('<%=theJSON%>'); return false;"/>
nothing happens. I used Firebug to check what was wrong, and it says there is an invalid character.
So I then tried passing in the JSON object like so
Widget widg = mapper.readValue(testing.get(0), Widget.class);
and then passing it in:
onclick="test('<%=widg%>'); return false;"/>
Now this passes in without an error, and it alerts the object name, however I am unable to parse it. The object comes in with the package name of where the Widget class is stored, like so
com.package.mode.Widget@ba8af9
I tried using JSON.stringify, but that doesn't seem to work on this Jackson JSON object.
After all that failed, I tried a last resort of taking the string from the database and encoding it in Base64. However, this too fails if I do this
String test = Base64.encode(theString);
and pass that in. However, if I print it out to the screen, copy what is printed, and send that through, it works, so I don't quite understand why that is.
So could someone please tell me what I am doing wrong? I have tried so many different solutions and nothing is working.
The JSON String is stored in database like this
{
"id":1,
"splits":[
{
"texts":[
{
"value":"Test",
"locationX":3,
"locationY":-153,
"font":{
"type":"Normal",
"size":"Medium",
"bold":false,
"colour":"5a5a5a",
"italics":false
}
}
]
}
]
}
I would be very grateful if someone could point me in the right direction!
Edit:
In case anyone else has the same problem, do this to pass the JSON from the JSP to the JS function:
<%=theJSON.replaceAll("\"", "\\\'")%>
That allows you to pass the JSON in.
Then, to get it back to normal JSON format in JavaScript:
theJSON = theJSON.replace(/'/g,'"');
Should work fine.

I think the combination of double quotes wrapping the onclick and the ones in your JSON may be messing you up. Think of it as if you entered the JSON manually -- it would look like this:
onclick="test('{ "id":1, "splits":[ { "texts":[ { "value":"Test", "locationX":3, "locationY":-153, "font":{ "type":"Normal", "size":"Medium", "bold":false, "colour":"5a5a5a", "italics":false } } ] } ] }'); return false;"
and the opening double quote before id would actually be closing the double quote following onclick= (You should be able to verify this by looking at the page source). Try specifying the onclick as:
onclick='test(\'<%=theJSON%>\'); return false;'
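
The quoting clash described above can also be avoided by HTML-escaping the JSON before it is written into the attribute; the browser decodes the entities again before the JavaScript runs. The following Python sketch only illustrates that idea and is not JSP code: render_onclick is a hypothetical helper, and in a JSP page the equivalent step would be whatever HTML-escaping helper the project already uses.

```python
import html
import json


def render_onclick(json_text):
    """Return an onclick attribute that passes the JSON to test() as an object.

    The JSON is re-serialized (which also validates it) and then HTML-escaped,
    so its double quotes become &quot; and cannot end the attribute early.
    """
    payload = json.dumps(json.loads(json_text))
    return 'onclick="test(' + html.escape(payload, quote=True) + '); return false;"'


if __name__ == "__main__":
    print(render_onclick('{"id": 1, "splits": []}'))
```

Because the JSON is passed as a JavaScript expression rather than inside a quoted string, test() receives an object directly and no quote replacement is needed on the JavaScript side.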

You can follow these steps:
1. Fetch the JSON string.
2. Using Jackson or any other JSON library, convert the JSON string to a JSON array and print it using out.println.
3. Call the JSP that prints the JSON string.
4. Check in Firebug; you will be able to see your JSON.
If the JSON string does not print, there may be a problem with your JSON format.
This is a good website for JSON beautification, http://jsbeautifier.org/ ; it really makes the string simple to read.
Thanks
Abhi

Velocity - How to avoid ParseErrorException when using jQuery?

I'm trying to add a jQuery post to some JavaScript on a web page. The entire page is built up of several Velocity templates. Everything has been fine until I've tried to add the jQuery post, now I get:
org.apache.velocity.exception.ParseErrorException: Encountered "," at line 282, column 24 of /WEB-INF/velocity/www/comments.vm
Was expecting one of:
"(" ...
<RPAREN> ...
<ESCAPE_DIRECTIVE> ...
~~~snip~~~
Line 282 is $.post(... and column 24 appears to be the first "," character. Initially I had the JSON on this line, but I moved it up (to the var myJSONObject ... line) as I thought the error related to invalid JSON (tabs at the start of the line gave a misleading column number).
var myJSONObject = {"body": "", "action": "postcomment", "submitted": "true", "ajax": "true"};
myJSONObject.body = $("body").val();
$.post("$!{articleurl}", myJSONObject, function(result){
    btn.textContent='Comment sent successfully.';
});
Minor Update
I changed the following lines:
var url = "$articleurl";
$.post(url, myJSONObject, function(result){
~~~snip~~~
The parse exception still focuses on the first ",". I'm assuming the issue is that Velocity thinks it should be able to resolve $.post - when in fact, it's jQuery. I've used jQuery in other Velocity VM templates without any problem. Is there a way to get Velocity to ignore certain lines / statements when parsing?
Update 2
I found this link about escaping references in Velocity, but it does not resolve my issue. Adding a "\" before $.post gives me the exact same error, but the column is one extra, because of the character added at the start of the line.

You can wrap your javascript with #[[ ... ]]# which tells Velocity to not parse the enclosed block (new in Velocity 1.7)
#[[
<script>
...
</script>
]]#

OK, there appear to be two solutions for this:
First, with jQuery we can just avoid using the global alias $ and instead use the jQuery object directly:
jQuery.post(url, myJSONObject, function(result){
~~~snip~~~
In my case, the above works great. But I suspect in other scenarios (non-jQuery) this may not be possible. In which case, we can 'hide' our character within a valid Velocity reference like this:
#set( $D = '$' )
${D}
Source: http://velocity.apache.org/engine/devel/user-guide.html#escapinginvalidvtlreferences
I'd still like to know why the backslash escape didn't work, but the above will at least get me moving again. :)

I think this is a bug in version 1.6.x, because it works fine in 1.7 (if it does not, please tell me; I tested it many times). According to the reference, the $ takes effect only when it is followed by a-zA-Z. I wanted to debug what really happens, but the parsing code is generated by the JavaCC tool, and it is too hard to follow the logic...

You must create a JS file with your JavaScript code
and import your JS file into your VM template.

I couldn't get it to work with any of the other fixes like escaping "$" in velocity unfortunately. I got it working by loading an external js-file with the jQuery instead of writing jQuery directly in velocity. Worked out for me at least, hope it helps someone :)
/björn
