I'm trying to retrieve information from a catalog using Jsoup. Each row always has 9 columns; the 6th column is a placeholder that only shows the price once you are actually logged in to the site.
I have the following: (username and password not shown here)
Document doc = null;
String url;

// First request, not logged in (just to inspect the session id)
Response res = Jsoup.connect("https://www.prisa.cl/home/?page=iniciaSesion")
        .method(Method.GET)
        .timeout(10000)
        .execute();
String sessionID = res.cookie("PHPSESSID");
System.out.println(sessionID);

// Log in
res = Jsoup.connect("https://www.prisa.cl/home/?page=iniciaSesion")
        .data("email_address", username, "password", password)
        .method(Method.POST)
        .timeout(10000)
        .execute();
sessionID = res.cookie("PHPSESSID");
System.out.println(sessionID);

for (int page = 1; page <= 1; page++) {
    url = "https://www.prisa.cl/catalog/advanced_search_result.php"
            + "?keywords=%20&enviar=&categories_id=&manufacturers_id=&pfrom=&pto=&sort=2a&&page=" + page;
    doc = Jsoup.connect(url)
            .cookie("PHPSESSID", sessionID)
            .timeout(10000)
            .get();
    for (Element table : doc.select("table table table table table")) {
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");
            if (tds.size() == 9) {
                System.out.println(tds.select("img[src]").attr("src") + ";" +
                        tds.get(1).text() + ";" +
                        tds.get(2).text() + ";" +
                        tds.get(3).text() + ";" +
                        tds.get(4).text() + ";" +
                        tds.get(5).text() + ";" +
                        tds.get(6).text());
            } // end if
        } // rows
    } // tables
    System.out.println("finished page: " + page);
} // pages
What I think/hope is happening here is:
1- I'm getting the PHPSESSID cookie while not logged in (for debugging purposes).
2- I'm getting the PHPSESSID again while logged in (it has different data).
3- I'm iterating over each page in the catalog (only 1 page in the code above) and sending the PHPSESSID cookie with the connection so I retrieve the data as a logged-in user.
4- I'm looking for a TR that has 9 TDs while being 5 tables deep (the page layout is a little confusing).
I am super new to this, but I have spent a couple of days searching for different approaches on Stack Overflow and in the Jsoup documentation, to no avail.
What am I doing wrong?
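For completeness, one variation I have seen suggested elsewhere is to keep the whole cookie map from the login response instead of a single PHPSESSID value, in case the server sets more than one cookie. A rough, unverified sketch of that (same placeholder username/password and form field names as in my code above):

// Log in once and keep every cookie the server set, not just PHPSESSID
Connection.Response login = Jsoup.connect("https://www.prisa.cl/home/?page=iniciaSesion")
        .data("email_address", username, "password", password)
        .method(Method.POST)
        .timeout(10000)
        .execute();

Map<String, String> cookies = login.cookies();

// Send all of those cookies back with the catalog request
Document catalog = Jsoup.connect("https://www.prisa.cl/catalog/advanced_search_result.php?sort=2a&page=1")
        .cookies(cookies)
        .timeout(10000)
        .get();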
I am trying to scrape data from the following table:
Yahoo Finance CBOE Volatility Index
I am using Jsoup for it.
String url = "https://finance.yahoo.com/quote/%5EVIX/history?p=%5EVIX&guccounter=1&guce_referrer=aHR0cHM6Ly9tYWlsLmdvb2dsZS5jb20v&guce_referrer_sig=AQAAAKU5UXnZEhNK_s1k-l6fQ7l-jFaR2xghH5NOhaohsec-HThT1BaEsni-hUlysVCFWpzd4qa2OZ2YZtBDJNQqKw1Uh64_nppDI4RnzPnTgxDGta123-A_SbIBm4SA5B0xopHvDcl5A21esFvWceZnRJPk6ohtud7OGJpWcNLdADYT";
Document doc = Jsoup.connect(url).get();
Element table = doc.getElementById("mrt-node-Col1-1-HistoricalDataTable");
Elements rows=table.select("tr");
Elements first=rows.get(0).select("th,td");
List<String>headers=new ArrayList<>();
for(Element header:first)
headers.add(header.text());
List<Map<String,String>> listMap = new ArrayList<Map<String,String>>();
for(int row=1;row<rows.size()-1;row++) {
Elements colVals = rows.get(row).select("th,td");
int colCount = 0;
Map<String,String> tuple = new LinkedHashMap<String,String>();
for(Element colVal : colVals)
tuple.put(headers.get(colCount++), colVal.text());
listMap.add(tuple);
}
With this approach I only get the first 100 or so rows. This is because the page initially loads that many rows, and more rows are only loaded once you scroll down to them. I could not find any pagination, and nothing helpful in the network calls. The data in those calls seems to be encoded as GIFs (whenever there is a mouse event on scroll).
I found a workaround using the Selenium WebDriver to fetch all the data. I was wondering whether there is any way to solve this with Jsoup alone.
I have an issue - I'm trying to scrape a cinema webpage,
---> https://cinemaxx.dk/koebenhavn
I need data about how many seats are reserved/sold, so I need to extract the last snapshot.
The seats that are reserved/sold are shown in the picture as red squares:
Basically, my logic is this:
I scrape the content using HtmlUnit.
I set HtmlUnit to execute all JS.
I extract the (reservedSeats) BASE64 string.
I convert the BASE64 string to an image.
Then my program analyses the image and counts how many seats are reserved/sold.
My issue is:
As I need the last snapshot of the picture (that is the one that gives the correct figures for how many seats are reserved/sold), I start scraping the website 3 minutes before the movie starts, and keep going until input == null.
I do this by looping my scrape method, but the cinema server automatically reserves 2 seats on each request (and holds them for 10 minutes), so I end up reserving all the seats in the whole cinema... (you can see an example of the 2 reserved seats (blue squares) in the picture above).
I found the JS method in the HTML that reserves the 2 seats on each request. Now I would like HtmlUnit to execute all JS except this one method that reserves these 2 seats via an HTTP request.
I hope all of the above makes sense.
Is there someone out there who can point me in the right direction, or who has had a similar issue?
public void scraper(String url) {
    final String URL = url;
    // Initialize Ghost Browser (FireFox_60):
    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
        // Configure Ghost Browser:
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        // Resynchronize AJAX calls (set before loading the page so it applies to the initial load):
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        // Load Url:
        final HtmlPage page = webClient.getPage(URL);
        webClient.waitForBackgroundJavaScript(3000);
        // Spider JS PATH to BASE64 data:
        final HtmlElement seatPictureRaw = page.querySelector(
                "body > div.page.page--booking.ng-scope > div.relative > div.inner__container.inner__container--content " +
                "> div.seatselect > div > div > div > div:nth-child(2) > div.seatselect__image > img");
        // Terminate current web session:
        webClient.getCurrentWindow().getJobManager().removeAllJobs();
        webClient.close();
        // Process the raw BASE64 data - extract the clean BASE64 string:
        String rawBASE64Data = String.valueOf(seatPictureRaw);
        String[] arrOfStr = rawBASE64Data.split("(?<=> 0\") ");
        String cleanedUpBASE64Data = arrOfStr[1];
        String cleanedUpBASE64Data1 = cleanedUpBASE64Data.replace("src=\"data:image/gif;base64,", "");
        String cleanedUpBASE64Data2 = cleanedUpBASE64Data1.replace("\">]", "");
        //System.out.println(cleanedUpBASE64Data2);
        // Decode BASE64 raw data to image:
        final byte[] decodedBytes = Base64.getDecoder().decode(cleanedUpBASE64Data2);
        System.out.println("Decoded BASE64 data length in bytes: " + decodedBytes.length);
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(decodedBytes));
        // Forward image to the PictureAnalyzer class:
        final PictureAnalyzer pictureAnalyzer = new PictureAnalyzer();
        pictureAnalyzer.analyzePixels(image);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
One option you have is to intercept and modify the server responses and replace the function call with something else. You could:
replace only the function name (this is ugly because it will generate a JS exception at runtime), or
remove the function call from the source, or
replace the function body with {}, or
...
See http://htmlunit.sourceforge.net/faq.html#HowToModifyRequestOrResponse for more.
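For illustration, here is a rough sketch of the "remove the function call from the source" option, following the WebConnectionWrapper pattern from that FAQ entry. The URL test and the function name reserveSeats(...) are placeholders you would have to adapt to the real page:

// Sketch only (classes are from com.gargoylesoftware.htmlunit / .util):
// rewrite downloaded content before HtmlUnit executes the scripts in it.
new WebConnectionWrapper(webClient) {
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("cinemaxx.dk")) {
            String content = response.getContentAsString();
            // "reserveSeats" is a placeholder for the real function name found in the page;
            // void(...) evaluates the arguments but does nothing else
            content = content.replace("reserveSeats(", "void(");
            WebResponseData data = new WebResponseData(
                    content.getBytes("UTF-8"),
                    response.getStatusCode(),
                    response.getStatusMessage(),
                    response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
};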
I'm trying to log in and extract data from a fantasy football website.
I get the following error,
Jul 24, 2015 8:01:12 PM StatsCollector main
SEVERE: null
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://fantasy.premierleague.com/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
at StatsCollector.main(StatsCollector.java:26)
whenever I try this code. Where am I going wrong?
public class StatsCollector {

    public static void main(String[] args) {
        try {
            String url = "http://fantasy.premierleague.com/";
            Connection.Response response = Jsoup.connect(url).method(Connection.Method.GET).execute();
            Response res = Jsoup
                    .connect(url)
                    .data("ismEmail", "example#googlemail.com", "id_password", "examplepassword")
                    .method(Method.POST)
                    .execute();
            Map<String, String> loginCookies = res.cookies();
            Document doc = Jsoup.connect("http://fantasy.premierleague.com/transfers")
                    .cookies(loginCookies)
                    .get();
            String title = doc.title();
            System.out.println(title);
        } catch (IOException ex) {
            Logger.getLogger(StatsCollector.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
Response res = Jsoup
.connect(url)
.data("ismEmail", "example#googlemail.com", "id_password", "examplepassword")
.method(Method.POST)
.execute();
Are you trying to execute this actual code? It seems to be example code with placeholders instead of real login credentials. That would explain the HTTP 403 error you received.
Edit 1
My bad. I took a look at the login form on that site, and it seems to me that you confused the ids of the input elements ("ismEmail" and "id_password") with the names that actually get sent with the form ("email" and "password"). Is this working for you?
Response res = Jsoup
.connect(url)
.data("email", "example#googlemail.com", "password", "examplepassword")
.method(Method.POST)
.execute();
Edit 2
Okay, this was stuck in my head, because signing into a website with Jsoup should not be that hard. I created an account there and tried it for myself. Code first:
String url = "https://users.premierleague.com/PremierUser/j_spring_security_check";
Response res = Jsoup
.connect(url)
.followRedirects(false)
.timeout(2_000)
.data("j_username", "<USER>")
.data("j_password", "<PASSWORD>")
.method(Method.POST)
.execute();
Map<String, String> loginCookies = res.cookies();
Document doc = Jsoup.connect("http://fantasy.premierleague.com/squad-selection/")
.cookies(loginCookies)
.get();
So what is happening here? First I realized that the target of the login form was wrong. The page seems to be built on Spring, so the form fields and target use the Spring Security defaults j_spring_security_check, j_username and j_password. Then I kept running into a read timeout until I set the flag followRedirects(false). I can only guess why this helped; maybe it is a protection against crawlers?
In the end I connect to the squad selection page, and the parsed response contains my personal view and data. This code seems to work for me; would you give it a try?
I have this web page, https://rrtp.comed.com/pricing-table-today/, and from it I need to get only the Time (Hour Ending) and Day-Ahead Hourly Price columns. I tried the following code,
Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();
for (Element table : doc.select("table.prices three-col")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 2) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
but unfortunately I am unable to get the data I need.
Is there something wrong in the code, or can this page not be crawled?
I need some help.
As I said in a comment:
You should hit https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717, because that is the source from which the data on the page you pointed to is loaded.
The data under that link is not a valid HTML document (which is why your approach is not working), but you can easily make it good enough to parse.
All you have to do is get the response, wrap it in <table>..</table> tags, and then parse it as an HTML document.
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element element : doc.select("tr")) {
    System.out.println(element.html());
}
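And if you only need the two columns mentioned in the question, you can pick the first two cells of every row in the same way as your original loop (untested, and it assumes the feed keeps its current row layout with time in the first cell and the day-ahead price in the second):

for (Element row : doc.select("tr")) {
    Elements tds = row.select("td");
    if (tds.size() >= 2) {
        // e.g. Time (Hour Ending) and Day-Ahead Hourly Price
        System.out.println(tds.get(0).text() + " : " + tds.get(1).text());
    }
}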
I am starting from a website's homepage. I parse the entire page, collect all the links on it, and put them in a queue. Then I remove each link from the queue and do the same thing until I find the text I want. However, if I hit a link like youtube.com/something, I end up following all the links on YouTube, which I want to prevent.
I want to crawl within the same domain only. How do I do that?
private void crawler() throws IOException {
    while (!q.isEmpty()) {
        String link = q.remove();
        System.out.println("------" + link);
        Document doc = Jsoup.connect(link).ignoreContentType(true).timeout(0).get();
        if (doc.text().contains("publicly intoxicated behavior or persistence")) {
            System.out.println("************ On this page ******************");
            System.out.println(doc.text());
            return;
        }
        Elements links = doc.select("a[href]");
        for (Element link1 : links) {
            String absUrl = link1.attr("abs:href");
            if (absUrl == null || absUrl.length() == 0) {
                continue;
            }
            // System.out.println(absUrl);
            q.add(absUrl);
        }
    }
}
This article shows how to write a web crawler. The following line forces all crawled links to be on the mit.edu domain.
if(link.attr("href").contains("mit.edu"))
There might be a bug with that line, since relative URLs won't contain the domain. Adding abs: should work better.
if(link.attr("abs:href").contains("mit.edu"))