HtmlUnit - Is it possible to execute only specific JS functions? - java

I have an issue: I'm trying to scrape a cinema webpage,
---> https://cinemaxx.dk/koebenhavn
I need to get data on how many seats are reserved/sold, and I need to extract the last snapshot.
The seats that are reserved/sold are shown on the picture as red squares:
Basically, my logic is this:
I scrape the content using HtmlUnit.
I set HtmlUnit to execute all JS.
I extract the reservedSeats BASE64 string.
I convert the BASE64 string to an image.
Then my program analyses the image and counts how many seats are reserved/sold.
My issue is:
I need the last snapshot of the picture, because that is the picture that gives the correct data on how many seats are reserved/sold. So I start scraping the website 3 minutes before the movie starts, and continue until input == null.
I do this by looping my scrape method, but the cinema server automatically reserves 2 seats on each request (and holds them for 10 minutes). So I end up reserving all the seats in the whole cinema. (You can see an example of the 2 reserved seats as blue squares on the picture above.)
I found the JS method in the HTML that reserves the 2 seats on each request. Now I would like HtmlUnit to execute all JS except this one method that reserves these 2 seats via an HTTP request.
I hope all of the above makes sense.
Is there someone out there who can lead me in the right direction, or who has had a similar issue?
public void scraper(String url) {
    final String URL = url;
    //Initialize Ghost Browser (FireFox_60):
    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
        //Configure Ghost Browser:
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        //Load Url & Configure Ghost Browser:
        final HtmlPage page = webClient.getPage(URL);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.waitForBackgroundJavaScript(3000);
        //Spider JS PATH to BASE64 data:
        final HtmlElement seatPictureRaw = page.querySelector
                ("body > div.page.page--booking.ng-scope > div.relative > div.inner__container.inner__container--content " +
                "> div.seatselect > div > div > div > div:nth-child(2) > div.seatselect__image > img");
        //Terminate Current web session:
        webClient.getCurrentWindow().getJobManager().removeAllJobs();
        webClient.close();
        //Process the raw BASE64 Data - Extract clean BASE64 String:
        String rawBASE64Data = String.valueOf(seatPictureRaw);
        String[] arrOfStr = rawBASE64Data.split("(?<=> 0\") ");
        String cleanedUpBASE64Data = arrOfStr[1];
        String cleanedUpBASE64Data1 = cleanedUpBASE64Data.replace("src=\"data:image/gif;base64,", "");
        String cleanedUpBASE64Data2 = cleanedUpBASE64Data1.replace("\">]", "");
        //System.out.println(cleanedUpBASE64Data2);
        //Decode BASE64 Rawdata to Image:
        final byte[] decodedBytes = Base64.getDecoder().decode(cleanedUpBASE64Data2);
        System.out.println("Numbers Of Caracters in BASE64 String: " + decodedBytes.length);
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(decodedBytes));
        //Forward image for PictureAnalyzer Class...
        final PictureAnalyzer pictureAnalyzer = new PictureAnalyzer();
        pictureAnalyzer.analyzePixels(image);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
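As a side note on the BASE64 extraction step: instead of splitting the element's string form, it may be simpler to read the img element's src attribute (for example via getAttribute("src")) and cut the payload off after the comma of the data URI. The following is only a sketch of that cut-and-decode step using the JDK alone; the class and method names are made up for illustration, and it assumes the src really has the form data:image/gif;base64,...

```java
import java.util.Base64;

final class DataUriDecoder {

    // Decodes the payload of a "data:<mime>;base64,<payload>" URI into raw bytes.
    static byte[] decode(String dataUri) {
        int comma = dataUri.indexOf(',');
        if (!dataUri.startsWith("data:") || comma < 0) {
            throw new IllegalArgumentException("not a data URI");
        }
        // Everything before the comma is the header, e.g. "data:image/gif;base64".
        String header = dataUri.substring(0, comma);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("data URI is not base64-encoded");
        }
        return Base64.getDecoder().decode(dataUri.substring(comma + 1));
    }
}
```

The returned bytes can then be passed to ImageIO.read(new ByteArrayInputStream(bytes)) as in the code above.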

One option is to intercept and modify the server responses and replace the function call with something else. You can:
replace only the function name (this is ugly because it will generate JS exceptions at runtime), or
remove the function call from the source, or
replace the function body with {}, or
....
See http://htmlunit.sourceforge.net/faq.html#HowToModifyRequestOrResponse for more.
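To make the "replace the function body with {}" option more concrete, here is a rough sketch of the text rewrite you could apply to the response content inside such an interceptor. It uses only the JDK; the function name reserveSeats is hypothetical (use whatever the page really calls it), and the brace counting is naive: it does not understand braces inside strings or comments, and it only handles the plain "function name(...) {" declaration form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsFunctionStubber {

    // Replaces the body of "function <name>(...) { ... }" with an empty body.
    // Returns the script unchanged if the declaration is not found.
    static String stubFunction(String script, String functionName) {
        Pattern declaration = Pattern.compile(
                "function\\s+" + Pattern.quote(functionName) + "\\s*\\([^)]*\\)\\s*\\{");
        Matcher m = declaration.matcher(script);
        if (!m.find()) {
            return script;
        }
        int open = m.end() - 1; // index of the opening brace of the body
        int depth = 0;
        for (int i = open; i < script.length(); i++) {
            char c = script.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                // i is the brace that closes the function body
                return script.substring(0, open) + "{}" + script.substring(i + 1);
            }
        }
        return script; // unbalanced braces: leave the script alone
    }
}
```

Because the function still exists after the rewrite, the calls to it keep working and simply do nothing, which avoids the runtime exceptions of the rename approach.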

Related

Unable to get element with id from a URL using HtmlUnit in Java

Unable to get element with id="parcelMailingAddressResult" from https://www.mohavecounty.us/ContentPage.aspx?id=111&cid=869&parcel=10272001 using HTMLUnit in Java
If you go to the above URL, you will see that there is a mailing address. A DOM inspection of the website shows that the address has the above-mentioned ID. I have been trying for several days to get that mailing address using Java/HtmlUnit, and none of my attempts worked.
Below are three methods that I tried within the same code.
System.getProperties().put("org.apache.commons.logging.simplelog.defaultlog", "fatal");
final WebClient webClient = new WebClient();
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.setRefreshHandler(new RefreshHandler() {
    public void handleRefresh(Page page, URL url, int arg) throws IOException {
        System.out.println("handleRefresh");
    }
});
HtmlPage page = (HtmlPage) webClient.getPage("https://www.mohavecounty.us/ContentPage.aspx?id=111&cid=869&parcel=10272001");
DomElement ownerAddresses = page.getElementById("parcelMailingAddressResult");
NodeList nodes = page.getElementsByTagName("parcelMailingAddressResult");
final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@class='container-fluid row']").get(0);
I expected the variables ownerAddresses and nodes to contain information that includes the owner's address. I expected div to contain some other information and, once I changed get(0) to get(<someHigherInteger>), to also contain information about the owner's address.
Instead:
ownerAddresses = null (after execution of ownerAddresses = ...)
nodes is of size 0 (after execution of nodes = ...)
final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@class='container-fluid row']").get(0);
after about 13 seconds, throws the following exception:
Exception:
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0.
which means that the list returned by page.getByXPath("//div[@class='container-fluid row']") was of length 0.
Problem technically solved. Here is the new code:
System.getProperties().put("org.apache.commons.logging.simplelog.defaultlog", "fatal");
final WebClient webClient = new WebClient();
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.setRefreshHandler(new RefreshHandler() {
    public void handleRefresh(Page page, URL url, int arg) throws IOException {
        System.out.println("handleRefresh");
    }
});
HtmlPage page = (HtmlPage) webClient.getPage("https://www.mohavecounty.us/ContentPage.aspx?id=111&cid=869&parcel=10272001");
HtmlElement ownerAddressElement = (HtmlElement) page.getElementById("parcelMailingAddressResult");
String ownerAddress = ownerAddressElement.asText();
I say "technically" because it takes the above code about an hour on my virtual machine to get the ownerAddress. In practice, that makes my code very hard to use. I suspect the following: when you go to https://www.mohavecounty.us/ContentPage.aspx?id=111&cid=869&parcel=10272001 , it takes the page only a few seconds to load. But the blue "Search" button shows the busy signal even after an hour and a half. I suspect the JavaScript of the page has entered some infinite loop, which makes webClient believe that the page is still loading until it loses patience after an hour.
I would like to cut that time from an hour to less than 30 minutes. But that's another question, which I will ask as a separate question at Stack Overflow.

Jsoup fetches wrong results

I'm working with Jsoup. The URL works well in the browser, but it fetches the wrong result on the server. I set maxBodySize to 0 as well, but it still only gets the first few tags. Moreover, the data is even different from what the browser shows. Can you give me a hand?
String queryUrl = "http://www.juso.go.kr/addrlink/addrLinkApi.do?confmKey=U01TX0FVVEgyMDE3MDYyODE0MTYyMzIyMTcw&currentPage=1&countPerPage=20&keyword=연남동";
Document document = Jsoup.connect(queryUrl).maxBodySize(0).get();
Are you aware that this endpoint returns paginated data? Your URL asks for 20 entries from the first page. I assume that the order of these entries is not specified so you can get different data each time you call this endpoint - check if there is a URL parameter that can determine specific sort order.
Anyway, to read all 2037 entries you have to fetch the pages sequentially. Examine the following code:
final String baseUrl = "http://www.juso.go.kr/addrlink/addrLinkApi.do";
final String key = "U01TX0FVVEgyMDE3MDYyODE0MTYyMzIyMTcw";
final String keyword = "연남동";
final int perPage = 100;
int currentPage = 1;
while (true) {
    System.out.println("Downloading data from page " + currentPage);
    final String url = String.format("%s?confmKey=%s&currentPage=%d&countPerPage=%d&keyword=%s", baseUrl, key, currentPage, perPage, keyword);
    final Document document = Jsoup.connect(url).maxBodySize(0).get();
    final Elements jusos = document.getElementsByTag("juso");
    System.out.println("Found " + jusos.size() + " juso entries");
    if (jusos.size() == 0) {
        break;
    }
    currentPage += 1;
}
In this case we ask for 100 entries per page (that's the maximum number this endpoint supports) and keep calling it as long as the requested page returns any <juso> element. With 2037 entries that is 21 pages of data, and the loop stops at the first page that comes back empty. I hope this helps solve your problem.
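If the total number of entries is known up front (2037 here), the number of pages can also be computed instead of discovered by probing for an empty page. The sketch below shows that arithmetic together with a URL builder that percent-encodes the keyword. The class and method names are made up for illustration, and encoding the keyword as UTF-8 is an assumption about what the server expects. It needs Java 10 or later for the URLEncoder.encode(String, Charset) overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class JusoPaging {

    // Number of pages needed to hold totalEntries at perPage entries each (rounds up).
    static int pageCount(int totalEntries, int perPage) {
        return (totalEntries + perPage - 1) / perPage;
    }

    // Builds the URL for one page; the keyword is percent-encoded as UTF-8.
    static String pageUrl(String baseUrl, String key, int page, int perPage, String keyword) {
        return String.format("%s?confmKey=%s&currentPage=%d&countPerPage=%d&keyword=%s",
                baseUrl, key, page, perPage,
                URLEncoder.encode(keyword, StandardCharsets.UTF_8));
    }
}
```

With 2037 entries and 100 per page, pageCount gives 21, which matches the 21 pages of data mentioned above.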

How to restart page number from 1 in different group of BIRT report

Background:
I use Java + BIRT to generate a report.
The report is generated in the viewer, and the user can choose to export it to different formats (pdf, xls, word...).
All the logic is in "Layout"; there is none in "Master Page".
There is 1 "Data Set". The fields in "Layout" refer to this data set.
There is a group in "Layout", grouped by one field.
In "Group Header", I created one cell to use as the page number: "Page : MyPageNumber".
"MyPageNumber" is a field I defined that is incremented by 1 in the Group Header.
Problem:
When I use the 1st method to generate the report, "MyPageNumber" is not shown correctly. Because the group header is only loaded once for each group, it always shows 1.
Question:
As far as I know, Crystal Reports has "restart page number in group". How do I restart the page number in BIRT?
I want to show the data of different groups in 1 report file, with the page number starting from 1 for each group.
You can do it with BIRT reports using page variables. For example:
Add 2 page variables... Group_page, Group_name.
Add 1 report variable... Group_total_page.
In the report beforeFactory add the script:
prevGroupKey = "";
groupPageNumber = 1;
reportContext.setGlobalVariable("gGROUP_NAME", "");
reportContext.setGlobalVariable("gGROUP_PAGE", 1);
In the report onPageEnd add the script:
var groupKey = currGroup;
var prevGroupKey = reportContext.getGlobalVariable("gGROUP_NAME");
var groupPageNumber = reportContext.getGlobalVariable("gGROUP_PAGE");
if (prevGroupKey == null) {
    prevGroupKey = "";
}
if (prevGroupKey == groupKey) {
    if (groupPageNumber != null) {
        groupPageNumber = parseInt(groupPageNumber) + 1;
    }
    else {
        groupPageNumber = 1;
    }
}
else {
    groupPageNumber = 1;
    prevGroupKey = groupKey;
}
reportContext.setPageVariable("GROUP_NAME", groupKey);
reportContext.setPageVariable("GROUP_PAGE", groupPageNumber);
reportContext.setGlobalVariable("gGROUP_NAME", groupKey);
reportContext.setGlobalVariable("gGROUP_PAGE", groupPageNumber);
var groupTotalPage = reportContext.getPageVariable("GROUP_TOTAL_PAGE");
if (groupTotalPage == null) {
    groupTotalPage = new java.util.HashMap();
    reportContext.setPageVariable("GROUP_TOTAL_PAGE", groupTotalPage);
}
groupTotalPage.put(groupKey, groupPageNumber);
In a master page onRender script add the following script:
var totalPage = reportContext.getPageVariable("GROUP_TOTAL_PAGE");
var groupName = reportContext.getPageVariable("GROUP_NAME");
if (totalPage != null) {
    this.text = java.lang.Integer.toString(totalPage.get(groupName));
}
In the table group header onCreate event, add the following script, replacing 'COUNTRY' with the name of the column that you are grouping on:
currGroup = this.getRowData().getColumnValue("COUNTRY");
In the master page add a grid to the header or footer and add an autotext variable for Group_page and Group_total_page. Optionally add the page variable for the Group_name as well.
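The counting logic of the onPageEnd script can be hard to follow inside the report, so here is a small stand-alone Java sketch of the same idea: one call per finished page, a counter that resets when the group key changes, and a map holding the latest (and therefore final) page count per group. This is not BIRT code; the class and its methods are made up purely to illustrate the bookkeeping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class GroupPageCounter {

    private String prevGroupKey = "";
    private int groupPageNumber = 0;
    // Last page number seen per group, i.e. the group's total once the run is finished.
    private final Map<String, Integer> totals = new LinkedHashMap<>();

    // Call once per finished page with the group shown on that page.
    // Returns the page number within that group.
    int onPageEnd(String groupKey) {
        if (groupKey.equals(prevGroupKey)) {
            groupPageNumber++;       // same group as the previous page: keep counting
        } else {
            groupPageNumber = 1;     // new group: restart at 1
            prevGroupKey = groupKey;
        }
        totals.put(groupKey, groupPageNumber);
        return groupPageNumber;
    }

    Map<String, Integer> totals() {
        return totals;
    }
}
```

Feeding it the page sequence A, A, B, B, B, C yields the page numbers 1, 2, 1, 2, 3, 1 and the totals A=2, B=3, C=1.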
Check out these links for more information about BIRT page variables:
https://books.google.ch/books?id=aIjZ4FYJOQkC&pg=PA85&lpg=PA85&dq=birt+change+autotext&source=bl&ots=K0nCmF2hrD&sig=CBOr_otRW0B72sZoFS7LC_1Mrz4&hl=en&sa=X&ei=ZKNAVcnuLYLHsAXRmIHoCw&ved=0CEoQ6AEwBQ#v=onepage&q=birt%20change%20autotext&f=false
https://www.youtube.com/watch?v=lw_k1qHY_gU
http://www.eclipse.org/birt/phoenix/project/notable2.5.php#jump_4
https://bugs.eclipse.org/bugs/show_bug.cgi?id=316173
http://www.eclipse.org/forums/index.php/t/575172/
Alas, this is not supported with BIRT.
That's probably not the answer you've hoped for, but it's the truth.
This is one of the very few aspects where BIRT is way behind other report generator tools.
However, depending on how you have BIRT integrated into your environment, a workaround approach is possible for PDF export that we use in our solution with great success.
The idea is to let BIRT generate a PDF outline based on the grouping.
And the BIRT report creates information in the ReportContext about where and how it wants the page numbers to be displayed.
After BIRT has generated the PDF, a custom PDFPostProcessor uses the PDF outline and the information from the ReportContext to add the page numbers with iText.
If this work-around is viable for you, feel free to contact me.

HtmlUnit to click on specific link from link's reference with same name

I started using HtmlUnit today, so I'm a bit of a noob at the moment.
I've managed to go to IMDB and search for the movie "Sleepers" from 1996, and I get a bunch of results with the same name:
Here are the results from that search
I want to select the first "Sleepers" from the list, which is the correct one, but I don't know how to get that information with HtmlUnit. I looked inside the code and found the link, but I don't know how to extract it.
I guess I could use some regex, but that would defeat the purpose of using HtmlUnit.
This is my code (it has some bits from HtmlUnit's tutorial and some code found here):
public IMdB() {
    try {
        //final WebClient webClient = new WebClient();
        final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8, "10.255.10.34", 8080);
        //set proxy username and password
        final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
        credentialsProvider.addCredentials("xxxx", "xxxx");
        // Get the first page
        final HtmlPage page1 = webClient.getPage("http://www.imdb.com");
        // Get the form that we are dealing with and within that form,
        // find the submit button and the field that we want to change.
        //final HtmlForm form = page1.getFormByName("navbar-form");
        HtmlForm form = page1.getFirstByXPath("//form[@id='navbar-form']");
        //
        HtmlButton button = form.getFirstByXPath("/html/body//form//button[@id='navbar-submit-button']");
        HtmlTextInput textField = form.getFirstByXPath("/html/body//form//input[@id='navbar-query']");
        // Change the value of the text field
        textField.setValueAttribute("Sleepers");
        // Now submit the form by clicking the button and get back the second page.
        HtmlPage page2 = button.click();
        // form = page2.getElementByName("s");
        //page2 = page2.getFirstByXPath("/html/body//form//div//tr[@href]");
        System.out.println("content: " + page2.asText());
        webClient.closeAllWindows();
    } catch (IOException ex) {
        Logger.getLogger(IMdB.class.getName()).log(Level.SEVERE, null, ex);
    }
    System.out.println("END");
}
You should do it this way:
HtmlPage htmlPage = new WebClient().getPage("http://imdb.com/blah");
HtmlAnchor anchor = htmlPage.getFirstByXPath("//td[@class='primary_photo']//a");
System.out.println(anchor.getHrefAttribute());
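To see what that XPath expression selects without hitting IMDb, you can try it on a small hand-written fragment with the JDK's own XPath engine. This is only a sketch: the markup and the title paths in the usage note are invented placeholders, the class name is made up, and real pages are HTML rather than well-formed XML, which is exactly the part HtmlUnit takes care of for you.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class FirstResultLink {

    private static final String EXPR = "//td[@class='primary_photo']//a";

    // Returns the href of the first matching anchor in document order, or null if none matches.
    static String firstHref(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(
                    EXPR, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return node == null ? null : ((Element) node).getAttribute("href");
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate XPath", e);
        }
    }
}
```

Given a table with two result rows, each containing a td with class primary_photo and a link inside it, firstHref returns the href of the first row's link, which mirrors what getFirstByXPath does on the real page.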
I would suggest using the IMDb API rather than doing all that.
The IMDb currently has two public APIs that are, although undocumented, very quick and reliable (used on their own site through AJAX).
A statically cached search suggestions API:
http://sg.media-imdb.com/suggests/a/aa.json
http://sg.media-imdb.com/suggests/h/hello.json
Format: JSONP
More advanced search
Name search (json): http://www.imdb.com/xml/find?json=1&nr=1&nm=on&q=jeniffer+garner
Title search (xml): http://www.imdb.com/xml/find?xml=1&nr=1&tt=on&q=lost
Format: JSON, XML and more
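As a rough sketch of how the suggestions API could be consumed: the two example URLs suggest the pattern "first letter / query .json", and a JSONP response is a JavaScript call wrapping the JSON, so the payload sits between the first "(" and the last ")". The helper names below are made up for illustration, and the URL pattern is inferred only from the two examples above, so treat it as an assumption.

```java
final class ImdbSuggest {

    // Builds a suggestion URL following the pattern of the examples (a/aa.json, h/hello.json).
    static String suggestUrl(String query) {
        if (query.isEmpty()) {
            throw new IllegalArgumentException("query must not be empty");
        }
        return "http://sg.media-imdb.com/suggests/" + query.charAt(0) + "/" + query + ".json";
    }

    // Strips the JSONP wrapper "callback( ... )" and returns the JSON text inside it.
    static String unwrapJsonp(String jsonp) {
        int open = jsonp.indexOf('(');
        int close = jsonp.lastIndexOf(')');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("not a JSONP response");
        }
        return jsonp.substring(open + 1, close);
    }
}
```

The unwrapped string can then be handed to whichever JSON parser you already use.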

Jericho HTML parser error in JSP page

I have written the following code:
String sourceUrlString="http://some url";
Source source=new Source(new URL(sourceUrlString));
Element INFORM = source.getElementById("main").getAllElementsByClass("game").get(i-1);
String INFORM = INFORM.replaceAll("\\s",""); //shows error here
sendResponse(resp,+INFORM);
Now I want the text fetched from Element INFORM to have its whitespace removed. How can I do that? The String INFORM line above shows the error "Duplicate local variable INFORM".
e.g.
the text fetched by Element INFORM is "my name is satish"
but it must be sent in the response as
"mynameissatish"
You have used the name INFORM twice, and that's not possible!
String sourceUrlString = "http://some url";
Source source = new Source(new URL(sourceUrlString));
Element INFORM = source.getElementById("main").getAllElementsByClass("game").get(i-1);
String response = INFORM.getTextExtractor().toString().replaceAll("\\s", ""); // use another name here, and strip whitespace from the element's text
sendResponse(resp, response); // or use '+'; not sure whether it takes 1 or 2 args
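The whitespace removal itself is plain String handling and can be tried without Jericho. A minimal sketch, with a made-up helper name:

```java
final class WhitespaceStripper {

    // Removes every whitespace character (spaces, tabs, line breaks) from the text.
    static String strip(String text) {
        return text.replaceAll("\\s+", "");
    }
}
```

For the example from the question, WhitespaceStripper.strip("my name is satish") gives "mynameissatish".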
