Jsoup fetches wrong results - java

I'm working with Jsoup. The URL works fine in the browser, but it fetches the wrong result on the server. I set maxBodySize to 0 as well, but it still only gets the first few tags. Moreover, the data is different from what the browser returns. Can you give me a hand?
String queryUrl = "http://www.juso.go.kr/addrlink/addrLinkApi.do?confmKey=U01TX0FVVEgyMDE3MDYyODE0MTYyMzIyMTcw&currentPage=1&countPerPage=20&keyword=연남동";
Document document = Jsoup.connect(queryUrl).maxBodySize(0).get();

Are you aware that this endpoint returns paginated data? Your URL asks for 20 entries from the first page. I assume that the order of these entries is not specified, so you can get different data each time you call this endpoint - check whether there is a URL parameter that determines a specific sort order.
Anyway, to read all 2037 entries you have to do it sequentially. Examine the following code:
final String baseUrl = "http://www.juso.go.kr/addrlink/addrLinkApi.do";
final String key = "U01TX0FVVEgyMDE3MDYyODE0MTYyMzIyMTcw";
final String keyword = "연남동";
final int perPage = 100;
int currentPage = 1;
while (true) {
    System.out.println("Downloading data from page " + currentPage);
    final String url = String.format("%s?confmKey=%s&currentPage=%d&countPerPage=%d&keyword=%s", baseUrl, key, currentPage, perPage, keyword);
    final Document document = Jsoup.connect(url).maxBodySize(0).get();
    final Elements jusos = document.getElementsByTag("juso");
    System.out.println("Found " + jusos.size() + " juso entries");
    if (jusos.size() == 0) {
        break;
    }
    currentPage += 1;
}
In this case we are asking for 100 entries per page (the maximum this endpoint supports) and we call it 21 times, for as long as a request for a specific page returns any <juso> element. Hope it helps solve your problem.
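One caveat, as a hedge: depending on how the URL is built, the multi-byte keyword may need to be percent-encoded before it goes into the query string. A minimal sketch with the standard library (the helper name encodeParam is ours, not part of any API):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: percent-encode a query parameter as UTF-8
static String encodeParam(final String value) throws UnsupportedEncodingException {
    return URLEncoder.encode(value, "UTF-8");
}

// Then build the URL with the encoded keyword:
// String.format("%s?confmKey=%s&currentPage=%d&countPerPage=%d&keyword=%s",
//         baseUrl, key, currentPage, perPage, encodeParam(keyword));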

Pagination in CosmosDB Java SDK with continuation token

I'm trying to create, from an async client, a method to retrieve items from a CosmosDB, but I'm afraid I'm full of questions and have little to no documentation from Microsoft's side.
I've created a function that reads a list of items from a CosmosDB page by page, where the continuation depends on a continuation token. The method looks like this. Please be aware there could be some minor mistakes unrelated to the core functionality, which is reading page by page:
#FunctionName("Feed")
public HttpResponseMessage getFeed(
#HttpTrigger(
name = "get",
methods = { HttpMethod.GET },
authLevel = AuthorizationLevel.ANONYMOUS,
route = "Feed"
) final HttpRequestMessage<Optional<String>> request,
#CosmosDBInput(
name = "Feed",
databaseName = Constants.DATABASE_NAME,
collectionName = Constants.LOG_COLLECTION_NAME,
sqlQuery = "SELECT * FROM c", // This won't be used actually as we use our own query
connectionStringSetting = Constants.CONNECTION_STRING_KEY
) final LogEntry[] logEntryArray,
final ExecutionContext context
) {
context
.getLogger()
.info("Query with paging and continuation token");
String query = "SELECT * FROM c"
int pageSize = 10; //No of docs per page
int currentPageNumber = 1;
int documentNumber = 0;
String continuationToken = null;
double requestCharge = 0.0;
// First iteration (continuationToken = null): Receive a batch of query response pages
// Subsequent iterations (continuationToken != null): Receive subsequent batch of query response pages, with continuationToken indicating where the previous iteration left off
do {
context
.getLogger()
.info("Receiving a set of query response pages.");
context
.getLogger()
.info("Continuation Token: " + continuationToken + "\n");
CosmosQueryRequestOptions queryOptions = new CosmosQueryRequestOptions();
Flux<FeedResponse<LogEntry>> feedResponseIterator =
container.queryItems(query, queryOptions, LogEntry.class).byPage(continuationToken,pageSize);
try {
feedResponseIterator.flatMap(fluxResponse -> {
context
.getLogger()
.info("Got a page of query result with " +
fluxResponse.getResults().size() + " items(s)"
+ " and request charge of " + fluxResponse.getRequestCharge());
context
.getLogger()
.info("Item Ids " + fluxResponse
.getResults()
.stream()
.map(LogEntry::getDate)
.collect(Collectors.toList()));
return Flux.empty();
}).blockLast();
} catch (Exception e) {
}
} while (continuationToken != null);
context
.getLogger()
.info(String.format("Total request charge: %f\n", requestCharge));
return request
.createResponseBuilder(HttpStatus.OK)
.header("Content-Type", "application/json")
.body("ALL READ")
.build();
}
For simplicity the read items are merely logged.
First question: We are using an async document client that returns a Flux. Will the client keep track of the token? It is a stateless client in principle. I understand that the sync client could easily take care of this case, but wouldn't the async client reset its memory of tokens after the first page and token have been generated?
Second: Is the while loop even appropriate? My assumption is a big no, as we need to send the token back in a header, and the frontend UI will need to send the token to the Azure Function in a header or some similar fashion. The token should then be extracted from the context.
Third: Is the flatMap and blockLast way of reading the Flux appropriate? I was trying to play with the subscribe method, but again I don't see how it could work for an async client.
Thanks a lot,
Alex.
UPDATE:
I have observed that the Flux only uses the items-per-page value to set the number of items retrieved per batch, but after retrieving one page it doesn't stop and keeps retrieving pages! I don't know how to stop it. I have tried substituting Mono.empty() for the Flux.empty(), and setting a LIMIT clause in the SQL query. The first option behaves the same, and the second apparently freezes the query so it never returns. How can I return one and only one page, along with the continuation token needed to do the following query once the user clicks on the next-page button?
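For reference, here is a minimal sketch of fetching exactly one page and surfacing its continuation token with the async client (container, pageSize and LogEntry as in the code above; the blocking call is only to keep the sketch short):

CosmosQueryRequestOptions options = new CosmosQueryRequestOptions();
// byPage(null, pageSize) yields the first page; pass a previously returned token to resume
FeedResponse<LogEntry> page = container
        .queryItems("SELECT * FROM c", options, LogEntry.class)
        .byPage(continuationToken, pageSize)
        .take(1)        // stop the Flux after a single page
        .blockFirst();  // blocking only for illustration
// Hand this back to the caller (e.g. in a response header); null means there are no more pages
String nextContinuationToken = page.getContinuationToken();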

htmlUnit - Is it possible to execute only specific JS functions?

I have an issue - I'm trying to scrape a cinema webpage,
---> https://cinemaxx.dk/koebenhavn
I need to get data on how many seats are reserved/sold, and I need to extract the last snapshot.
The seats that are reserved/sold are shown in the picture as red squares.
Basically, my logic is this:
I scrape the content using htmlUnit.
I set htmlUnit to execute all JS.
Extract the reservedSeats BASE64 string.
Convert the BASE64 string to an image.
Then my program analyses the image and counts how many seats are reserved/sold.
My issue is:
As I need the last snapshot of the picture - because that is the picture that gives the correct data on how many seats are reserved/sold - I start scraping the website 3 minutes before the movie starts, and keep going until input == null.
I do this by looping my scrape method - but the cinema server automatically reserves 2 seats at each request (and holds them for 10 minutes), so I end up reserving all the seats in the whole cinema... (you can see an example of the 2 reserved seats (blue squares) in the picture above).
I found the JS method in the HTML that reserves the 2 seats per request - now I would like htmlUnit to execute all JS except this one JS method that reserves these 2 seats via an HTTP request.
I hope all of the above makes sense.
Is there someone out there who can maybe lead me in the right direction, or who has had a similar issue?
public void scraper(String url) {
    final String URL = url;
    //Initialize Ghost Browser (FireFox_60):
    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
        //Configure Ghost Browser:
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        //Load URL & configure Ghost Browser:
        final HtmlPage page = webClient.getPage(URL);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.waitForBackgroundJavaScript(3000);
        //Spider JS path to BASE64 data:
        final HtmlElement seatPictureRaw = page.querySelector(
                "body > div.page.page--booking.ng-scope > div.relative > div.inner__container.inner__container--content " +
                "> div.seatselect > div > div > div > div:nth-child(2) > div.seatselect__image > img");
        //Terminate current web session:
        webClient.getCurrentWindow().getJobManager().removeAllJobs();
        webClient.close();
        //Process the raw BASE64 data - extract a clean BASE64 string:
        String rawBASE64Data = String.valueOf(seatPictureRaw);
        String[] arrOfStr = rawBASE64Data.split("(?<=> 0\") ");
        String cleanedUpBASE64Data = arrOfStr[1];
        String cleanedUpBASE64Data1 = cleanedUpBASE64Data.replace("src=\"data:image/gif;base64,", "");
        String cleanedUpBASE64Data2 = cleanedUpBASE64Data1.replace("\">]", "");
        //System.out.println(cleanedUpBASE64Data2);
        //Decode BASE64 raw data to an image:
        final byte[] decodedBytes = Base64.getDecoder().decode(cleanedUpBASE64Data2);
        System.out.println("Number of characters in BASE64 string: " + decodedBytes.length);
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(decodedBytes));
        //Forward image to the PictureAnalyzer class...
        final PictureAnalyzer pictureAnalyzer = new PictureAnalyzer();
        pictureAnalyzer.analyzePixels(image);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
One option you have is to intercept and modify the server responses and replace the function call with something else:
replace only the function name (this is ugly because it will generate JS exceptions at runtime), or
remove the function call from the source, or
replace the function body with {}, or
....
See http://htmlunit.sourceforge.net/faq.html#HowToModifyRequestOrResponse for more
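A rough sketch of that interception approach, following the FAQ's WebConnectionWrapper pattern (the "booking" URL filter and the reserveSeats function name are placeholders for the real ones found in the HTML):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.WebResponseData;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
    // Wrap the connection so every incoming response can be inspected and rewritten
    new WebConnectionWrapper(webClient) {
        @Override
        public WebResponse getResponse(final WebRequest request) throws IOException {
            WebResponse response = super.getResponse(request);
            if (request.getUrl().toExternalForm().contains("booking")) { // placeholder filter
                String content = response.getContentAsString();
                // Rename the reserving function so the reserving call never runs (option 1 above)
                content = content.replace("function reserveSeats", "function reserveSeats_disabled");
                WebResponseData data = new WebResponseData(content.getBytes(StandardCharsets.UTF_8),
                        response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
                response = new WebResponse(data, request, response.getLoadTime());
            }
            return response;
        }
    };
    // ...then load the page as usual, e.g. webClient.getPage(url)
}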

Looping stops when assertion fails

There is an Excel sheet where all URLs (16) are listed in one column. Once each page gets loaded, I need to verify whether the page title matches the expected title, which is also stored in the Excel sheet. I am able to perform this using a for loop. It runs through all URLs if all pass, but stops when one fails. I need it to run completely and give a report of which passed and which failed. I have written the code below.
rowCount = suite_pageload_xls.getRowCount("LoadURL");
for (i = 2, j = 2; i <= rowCount; i++, j++) {
    String urlData = suite_pageload_xls.getCellData("LoadURL", "URL", i);
    Thread.sleep(3000);
    long start = System.currentTimeMillis();
    APP_LOGS.debug(start);
    driver.navigate().to(urlData);
    String actualtitle = driver.getTitle();
    long finish = System.currentTimeMillis();
    APP_LOGS.debug(finish);
    APP_LOGS.debug(urlData + "-----" + driver.getTitle());
    long totalTime = finish - start;
    APP_LOGS.debug("Total time taken is " + totalTime + " ms");
    String expectedtitle = suite_pageload_xls.getCellData("LoadURL", "Label", j);
    Assert.assertEquals(actualtitle, expectedtitle);
    if (actualtitle.equalsIgnoreCase(expectedtitle)) {
        APP_LOGS.debug("PAGE LABEL MATCHING....");
        String resultpass = "PASS";
        APP_LOGS.debug(resultpass);
        APP_LOGS.debug("***********************************************************");
    } else {
        APP_LOGS.debug("PAGE LABEL NOT MATCHING....");
        String resultfail = "FAIL";
        APP_LOGS.debug(resultfail);
        APP_LOGS.debug("***********************************************************");
    }
}
Kindly help me in this regard.
This is the correct behavior of an assertion: it throws an exception when the assertion fails, which aborts the loop.
You could store the actual titles and expected titles in lists and perform the assertion once at the end.
For better assertions I suggest you try AssertJ: you can directly compare the actual and expected lists, and it will report the complete difference.
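A minimal sketch of that idea (suite_pageload_xls, driver and rowCount as in the question; assertThat is AssertJ's entry point):

import static org.assertj.core.api.Assertions.assertThat;
import java.util.ArrayList;
import java.util.List;

List<String> actualTitles = new ArrayList<>();
List<String> expectedTitles = new ArrayList<>();
for (int i = 2; i <= rowCount; i++) {
    driver.navigate().to(suite_pageload_xls.getCellData("LoadURL", "URL", i));
    actualTitles.add(driver.getTitle());
    expectedTitles.add(suite_pageload_xls.getCellData("LoadURL", "Label", i));
}
// A single assertion at the end reports every mismatch instead of stopping at the first
assertThat(actualTitles).containsExactlyElementsOf(expectedTitles);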

Is it possible to "send" a cookie via Jsoup to a server?

I'm trying to retrieve an article price from a website. The problem is that the prices differ depending on whether you choose the online price or the store price. After selecting a store, the website creates a cookie called CP_GEODATA with a specific value. I tried to send the cookie in different ways, but I keep getting the online price.
public class Parser {
    public static void main(String[] args) throws Exception {
        Map<String, String> cookies = new HashMap<String, String>();
        cookies.put("CP_COUNTRY ", "%7B%22country%22%3A%22DE%22%7D ");
        cookies.put("CP_GEODATA ", "%7B%22location%22%3A-1%2C%22firstlocation%22%3A11%2C%22name%22%3A%22Hamburg%22%7D");
        String url = "https://www.cyberport.de/?token=7a2d9b195e32082fec015dca45ba3aa4&sSearchId=565eee12d987b&EVENT=itemsearch&view=liste&query=&filterkategorie=";
        Connection.Response res = Jsoup.connect(url).cookies(cookies).data("query", "4B05-525").execute();
        Document doc = res.parse();
        String tester = doc.select("span[id=articlePrice] > span[class=basis fl]").text();
        String tester2 = doc.select("span[id=articlePrice] > span[class=decimal fl]").text();
        System.out.println(tester + tester2 + " €");
    }
}
The value I'm getting back right now is 2,90 € but it should be 4,90 €. I have already tried everything and searched the internet a lot, but I did not find any solution that works for me.
This is the article I'm receiving the price from:
https://www.cyberport.de/micro-usb-2-0-kabel-usb-a-stecker-micro-b-stecker-0-5m--4B05-525_9374.html
I'm trying to receive the price for the store in Hamburg, Germany.
You can see the cookies I'm setting at the top.
Thank you for any help!
It seems that the zone info is stored in the session, and the zone code is sent to the server in a POST when you select it.
So you need to do the following steps:
Do the POST with the desired zone
Get the session cookies
Using these cookies, do your original POST
Hopefully get the correct results
Here is the code
public static void main(String[] args) throws Exception {
    Connection.Response res;
    //11 is for Hamburg
    String zoneId = "11";
    //Set the zone and get the session cookies
    res = Jsoup.connect("https://www.cyberport.de/newajaxpass/catalog/itemlist/0/costinfo/" + zoneId)
            .ignoreContentType(true)
            .method(Method.POST).execute();
    final Map<String, String> cookies = res.cookies();
    //Print the cookies; we'll see session cookies here
    System.out.println(cookies);
    //If we use those cookies, your code runs OK
    String url = "https://www.cyberport.de/?token=7a2d9b195e32082fec015dca45ba3aa4&sSearchId=565eee12d987b&EVENT=itemsearch&view=liste&query=&filterkategorie=";
    res = Jsoup.connect(url).cookies(cookies).data("query", "4B05-525").execute();
    Document doc = res.parse();
    String tester = doc.select("span[id=articlePrice] > span[class=basis fl]").text();
    String tester2 = doc.select("span[id=articlePrice] > span[class=decimal fl]").text();
    System.out.println(tester + tester2 + " €");
    //Extra check
    System.out.println(doc.select("div.townName").text());
}
You'll see:
{SERVERID=realmN03, SCS=76fe7473007c80ea2cfa059f180c603d, SID=pphdh7otcefvc5apdh2r9g0go2}
4,90 €
Hamburg
Which, I hope, is the desired result.

Is it possible to get a list of workflows the document is attached to in Alfresco

I'm trying to get a list of the workflows a document is attached to in an Alfresco webscript, but I am kind of stuck.
My original problem is that I have a list of files, and the current user may have workflows involving these documents assigned to him. So now I want to create a webscript that will look in a folder, take all the documents there, and assemble a list of documents together with task references, if there are any for the current user.
I know about the "workflow" object that gives me the list of workflows for the current user, but this is not a solution for my problem.
So, can I get a list of the workflows a specific document is attached to?
Well, for future reference, I've found a way to get all the active workflows on a document from JavaScript:
var nodeR = search.findNode('workspace://SpacesStore/' + doc.nodeRef);
for each (wf in nodeR.activeWorkflows) {
    // Do whatever here.
}
I used the packageContains association to find the workflows for a document.
Below I posted code in Alfresco JavaScript for active workflows (as zladuric answered) and also for all workflows:
/*global search, logger, workflow*/
var getWorkflowsForDocument, getActiveWorkflowsForDocument;

getWorkflowsForDocument = function () {
    "use strict";
    var doc, parentAssocs, packages, packagesLen, i, pack, props, workflowId, instance, isActive;
    //
    doc = search.findNode("workspace://SpacesStore/8847ea95-108d-4e08-90ab-34114e7b3977");
    parentAssocs = doc.getParentAssocs();
    packages = parentAssocs["{http://www.alfresco.org/model/bpm/1.0}packageContains"];
    //
    if (packages) {
        packagesLen = packages.length;
        //
        for (i = 0; i < packagesLen; i += 1) {
            pack = packages[i];
            props = pack.getProperties();
            workflowId = props["{http://www.alfresco.org/model/bpm/1.0}workflowInstanceId"];
            instance = workflow.getInstance(workflowId);
            /* instance is org.alfresco.repo.workflow.jscript.JscriptWorkflowInstance */
            isActive = instance.isActive();
            logger.log(" + instance: " + workflowId + " (active: " + isActive + ")");
        }
    }
};

getActiveWorkflowsForDocument = function () {
    "use strict";
    var doc, activeWorkflows, activeWorkflowsLen, i, instance;
    //
    doc = search.findNode("workspace://SpacesStore/8847ea95-108d-4e08-90ab-34114e7b3977");
    activeWorkflows = doc.activeWorkflows;
    activeWorkflowsLen = activeWorkflows.length;
    for (i = 0; i < activeWorkflowsLen; i += 1) {
        instance = activeWorkflows[i];
        /* instance is org.alfresco.repo.workflow.jscript.JscriptWorkflowInstance */
        logger.log(" - instance: " + instance.getId() + " (active: " + instance.isActive() + ")");
    }
};

getWorkflowsForDocument();
getActiveWorkflowsForDocument();
Unfortunately the JavaScript API doesn't expose all the workflow functions. It looks like getting the list of workflow instances that are attached to a document only works in Java (or Java-backed webscripts).
List<WorkflowInstance> workflows = workflowService.getWorkflowsForContent(node.getNodeRef(), true);
A usage of this can be found in the workflow list in the document details: http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/web-client/source/java/org/alfresco/web/ui/repo/component/UINodeWorkflowInfo.java
To get to the users who have tasks assigned you would then need to use getWorkflowPaths and getTasksForWorkflowPath methods of the WorkflowService.
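A minimal sketch of how such a Java-backed webscript bean might look (the class name is ours; workflowService is assumed to be injected via Spring, as is usual in Alfresco):

import java.util.List;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.workflow.WorkflowInstance;
import org.alfresco.service.cmr.workflow.WorkflowService;

public class DocumentWorkflowLister {
    private WorkflowService workflowService; // injected by Spring

    public void setWorkflowService(final WorkflowService workflowService) {
        this.workflowService = workflowService;
    }

    /** Returns the workflow instances the given document is attached to (true = active only). */
    public List<WorkflowInstance> listActiveWorkflows(final String nodeRefStr) {
        final NodeRef nodeRef = new NodeRef(nodeRefStr);
        return workflowService.getWorkflowsForContent(nodeRef, true);
    }
}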
