I am trying to read a lot of HTML pages using jsoup. I have an ArrayList called "allPageLinks" that holds the page links. Here is my code:
Document doc;
for (int i = 0; i < allPageLinks.size(); i++) {
try {
doc = Jsoup.connect(allPageLinks.get(i)).timeout(0).get();
Element page_clips = doc.getElementById("page_clips");
Element page_clip_content = page_clips
.getElementById("content");
Elements product_grid = page_clip_content
.select(".product-list.margin-left-5");
Elements products = product_grid.get(0).children();
for (int j = 0; j < products.size(); j++) {
try {
String productName = products.get(j)
.getElementsByClass("name").text();
String productPrice = products.get(j)
.getElementsByClass("price").text();
String productLink = products.get(j)
.getElementsByClass("image").select("a")
.first().attr("href");
Document newDoc = Jsoup.connect(productLink).get();
Elements elements = newDoc.getElementsByClass("left");
Elements productNameElement = elements.get(0)
.getElementsByClass("colorbox");
String productImage = productNameElement.attr("href");
elements = newDoc.getElementsByClass("right");
String productId = elements.get(0)
.getElementsByClass("field").get(1).text();
writer.append(productName);
writer.append(';');
writer.append(productPrice);
writer.append(';');
writer.append(productId);
writer.append(';');
writer.append(productImage);
writer.append(';');
writer.append(productLink);
writer.append('\n');
} catch (Exception ex) {
System.out.println(ex.getMessage() + " " + i + " "
+ allPageLinks.get(i) + " ICTEKICATCH");
}
}
} catch (Exception ex) {
System.out.println(ex.getMessage() + " " + i + " "
+ allPageLinks.get(i));
}
}
Even though I set the connection timeout to zero, I am getting connect timeout exceptions for most of the links. Can anyone help me get rid of that exception?
Thanks
You forgot to specify the timeout for the connection that is opened inside the inner loop:
Document newDoc = Jsoup.connect(productLink).get();
Should be:
Document newDoc = Jsoup.connect(productLink).timeout(0).get();
That is where the timeout exception is most likely occurring.
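If some requests still time out once both calls carry an explicit timeout, a bounded retry is one option. The sketch below is plain Java and makes no jsoup calls. RetrySketch and withRetries are hypothetical names, and the fetch is passed in as a Callable, so something like () -> Jsoup.connect(productLink).timeout(0).get() could be plugged in. The simulated failure in main is made up for illustration.

```java
import java.util.concurrent.Callable;

final class RetrySketch {

    /** Runs the task, retrying up to maxAttempts times with a fixed pause between attempts. */
    static <T> T withRetries(Callable<T> task, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception ex) {
                last = ex; // remember the failure and try again if attempts remain
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical stand-in for a page fetch: fails twice, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated timeout");
            }
            return "page body";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // page body after 3 attempts
    }
}
```

A fixed pause keeps the sketch short; a growing pause between attempts would be gentler on the server.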
I am currently facing an issue with the method getSurroundingSumGrid(), which is supposed to take data from an earlier grid (built from text file data) and use it to determine new values in the array sumGrid. The STATICGRID array is built with the correct values at first, but as the for loop continues, the STATICGRID values change to whatever I set in sumGrid. I have no code that ever assigns a new value to STATICGRID, and if I did, it should give an error because the variable is final.
public double[][] getSurroundingSumGrid() {
this.sumGrid = getBaseGrid();
for (int rowNum = 0; rowNum < sumGrid.length; rowNum++) {
final double[][] STATICGRID = this.getBaseGrid();
double topNum = 0, botNum = 0, rightNum = 0, leftNum = 0;
for (int colNum = 0; colNum < sumGrid[0].length; colNum++) {
try {
topNum = STATICGRID[rowNum - 1][colNum];
System.out.println("TOPNUM : (" + (rowNum-1) + "," + colNum + ") " + STATICGRID[rowNum-1][colNum]);
} catch (Exception e) {
topNum = STATICGRID[rowNum][colNum];
System.out.println("Top IndexOutOfBoundsException: " + STATICGRID[rowNum][colNum] + " used instead.");
}
try {
botNum = STATICGRID[rowNum + 1][colNum];
System.out.println("BOTNUM : (" + (rowNum+1) + "," + colNum + ") " + STATICGRID[rowNum+1][colNum]);
} catch (Exception e) {
botNum = STATICGRID[rowNum][colNum];
System.out.println("Bot IndexOutOfBoundsException: " + STATICGRID[rowNum][colNum] + " used instead.");
}
try {
leftNum = STATICGRID[rowNum][colNum - 1];
System.out.println("LEFTNUM : (" + rowNum + "," + (colNum-1) + ") " + STATICGRID[rowNum][colNum-1]);
} catch (Exception e) {
leftNum = STATICGRID[rowNum][colNum];
System.out.println("Left IndexOutOfBoundsException: " + STATICGRID[rowNum][colNum] + " used instead.");
}
try {
rightNum = STATICGRID[rowNum][colNum + 1];
System.out.println("RIGHTNUM : (" + rowNum + "," + (colNum+1) + ") " + STATICGRID[rowNum][colNum+1]);
} catch (Exception e) {
rightNum = STATICGRID[rowNum][colNum];
System.out.println("Right IndexOutOfBoundsException: " + STATICGRID[rowNum][colNum] + " used instead.");
}
this.sumGrid[rowNum][colNum] = topNum + botNum + rightNum + leftNum;
System.out.println("STATICGRID NEW NUM : " + STATICGRID[rowNum][colNum]);
System.out.println("SUMGRID NEW NUM : " + sumGrid[rowNum][colNum]);
}
}
return this.sumGrid;
}
When running these tests I can see very clearly that the data in both arrays changes over time, which gives me wrong results. I've spent about 2 hours moving things around and can't figure out how to get this to work properly.
As you can see, I even tried rebuilding the STATICGRID array on every iteration of the outer for loop, and it made no difference. The result is the same wherever STATICGRID is declared (outside the loop or at the top of it) and whether or not it is final. After looking at it for so long I'm beyond confused about why my code isn't working. I have a slight feeling that the try-catch statement is involved, but I wouldn't know why. I don't know a lot about the statement; it is there because the neighbor lookups can throw an IndexOutOfBoundsException, and per the assignment instructions the cell should count its own value in place of each out-of-bounds neighbor.
Thanks and I hope this makes sense.
Alright, thanks to FredK's suggestion to use a deep copy, I did some research and used this method to get better results. This is the unoptimizedDeepCopy by Philip Isenhour. I don't exactly understand it, but I'm going to use it for now and spend some time learning how it works. I'm currently a CS221 student and we haven't gone over deep copies yet.
public static Object deepCopy(Object orig) {
Object obj = null;
try {
// Write the object out to a byte array
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(bos);
out.writeObject(orig);
out.flush();
out.close();
// Make an input stream from the byte array and read
// a copy of the object back in.
ObjectInputStream in = new ObjectInputStream(
new ByteArrayInputStream(bos.toByteArray()));
obj = in.readObject();
}
catch(IOException e) {
e.printStackTrace();
}
catch(ClassNotFoundException cnfe) {
cnfe.printStackTrace();
}
return obj;
}
With this method in place, I was able to stop the array from changing by wrapping each getBaseGrid() call in deepCopy:
public double[][] getSurroundingSumGrid() {
this.sumGrid = (double[][]) GridMonitor.deepCopy(this.getBaseGrid());
double[][] staticGrid = (double[][]) GridMonitor.deepCopy(this.getBaseGrid());
double topNum, botNum, rightNum, leftNum;
for (int rowNum = 0; rowNum < sumGrid.length; rowNum++) {
for (int colNum = 0; colNum < sumGrid[0].length; colNum++) {
try {
topNum = staticGrid[rowNum - 1][colNum];
} catch (Exception e) {
topNum = staticGrid[rowNum][colNum];
}
try {
botNum = staticGrid[rowNum + 1][colNum];
} catch (Exception e) {
botNum = staticGrid[rowNum][colNum];
}
try {
leftNum = staticGrid[rowNum][colNum - 1];
} catch (Exception e) {
leftNum = staticGrid[rowNum][colNum];
}
try {
rightNum = staticGrid[rowNum][colNum + 1];
} catch (Exception e) {
rightNum = staticGrid[rowNum][colNum];
}
this.sumGrid[rowNum][colNum] = topNum + botNum + rightNum + leftNum;
}
}
return this.sumGrid;
}
Thanks!
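For a plain double[][] the serialization round trip is probably more machinery than needed. The sketch below assumes, since getBaseGrid() is not shown in the post, that it returns the stored array itself rather than a copy. In that case sumGrid = getBaseGrid() only creates a second name for the same array, and final only stops the variable from being reassigned, not the cells from being written. GridCopySketch and copyGrid are hypothetical names.

```java
final class GridCopySketch {

    /** Returns a new 2D array whose rows are independent copies of the source rows. */
    static double[][] copyGrid(double[][] source) {
        double[][] copy = new double[source.length][];
        for (int row = 0; row < source.length; row++) {
            copy[row] = source[row].clone(); // clone() on a double[] copies the values
        }
        return copy;
    }

    public static void main(String[] args) {
        double[][] base = {{1, 2}, {3, 4}};
        double[][] alias = base;          // same array object, like sumGrid = getBaseGrid()
        double[][] copy = copyGrid(base); // independent rows

        alias[0][0] = 99;
        System.out.println(base[0][0]);   // 99.0: writing through the alias changed base
        copy[1][1] = -1;
        System.out.println(base[1][1]);   // 4.0: writing to the copy left base alone
    }
}
```

Copying row by row matters because calling clone() on the outer array alone would still share the inner rows.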
I'm scraping data from multiple web pages using Jsoup. How can I save the scraped data to a file without overwriting the data from the previously scraped page?
I've tried searching Stack Overflow and the Jsoup docs for a solution.
int j = 0;
int i = 0;
String URL = ("https://www.ufc.com/athletes/all?gender=All&search=&page="+j);
Document doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
Elements temp = doc.select("div.c-listing-athlete__text");
for (Element fighterList:temp) {
i++;
System.out.println(i + " " + fighterList.getElementsByClass("c-listing-athlete__name").first().text());
}
j++;
URL = ("https://www.ufc.com/athletes/all?gender=All&search=&page="+j);
doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
temp = doc.select("div.c-listing-athlete__text");
for (Element fighterList:temp) {
i++;
System.out.println(i + " " + fighterList.getElementsByClass("c-listing-athlete__name").first().text());
}
If you need to save the data from code, just check this, maybe it can help you:
int i = 0;
int pagesNumber = 10;
String URL = "";
Document doc = null;
Elements temp = null;
try {
// Create file
FileWriter fstream = new FileWriter(System.currentTimeMillis() + "out.txt");
BufferedWriter out = new BufferedWriter(fstream);
for (i=0; i<pagesNumber; i++) {
URL = ("https://www.ufc.com/athletes/all?gender=All&search=&page="+i);
doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
temp = doc.select("div.c-listing-athlete__text");
for (Element fighter : temp) {
out.write(i + " " + fighter.getElementsByClass("c-listing-athlete__name").first().text());
out.newLine(); // one fighter per line; without this all names run together
}
}
//Close the output stream
out.close();
} catch (Exception e) { // Catch exception if any
System.err.println("Error: " + e.getMessage());
}
Hope it helps :)
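The code above avoids overwriting by keeping one writer open for all pages. If the file has to be reopened for each page instead, opening it in append mode is another option. This is a minimal sketch using only java.nio. AppendSketch and appendLines are hypothetical names, and the strings in main are stand-ins for the scraped names.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

final class AppendSketch {

    /** Appends one line per name to the file, creating the file if it does not exist yet. */
    static void appendLines(Path file, List<String> names) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String name : names) {
                out.write(name);
                out.newLine();
            }
        } // try-with-resources closes the writer even if a write fails
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fighters", ".txt");
        appendLines(file, List.of("page 0 name A", "page 0 name B")); // stand-in for page 0
        appendLines(file, List.of("page 1 name C"));                  // stand-in for page 1
        System.out.println(Files.readAllLines(file).size());          // 3
        Files.delete(file);
    }
}
```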
I keep getting the same links from the site I'm testing.
This is my code:
List<WebElement> activeLinks = new ArrayList<WebElement>();
//2.Iterate LinksList: Exclude all the links/images - doesn't have any href attribute and exclude images starting with javascript.
boolean breakIt = true;
for(WebElement link:AllTheLinkList)
{
breakIt = true;
try
{
//System.out.println((link.getAttribute("href")));
if(link.getAttribute("href") != null && !link.getAttribute("href").contains("javascript") && link.getAttribute("href").contains("pharmacy")) //&& !link.getAttribute("href").contains("pharmacy/main#"))
{
activeLinks.add(link);
}
}
catch(org.openqa.selenium.StaleElementReferenceException ex)
{
breakIt = false;
}
if (breakIt)
{
continue;
}
}
//Get total amount of Other links
log.info("Other Links ---> " + (AllTheLinkList.size()-activeLinks.size()));
//Get total amount of links in the page
log.info("Size of active links and images in pharmacy ---> "+ activeLinks.size());
for(int j=0; j<activeLinks.size(); j++) {
HttpURLConnection connection = (HttpURLConnection) new URL(activeLinks.get(j).getAttribute("href")).openConnection();
connection.setConnectTimeout(4000);
connection.connect();
String response = connection.getResponseMessage(); //Ok
int code = connection.getResponseCode();
connection.disconnect();
//System.out.println((j+1) +"/" + activeLinks.size() + " " + activeLinks.get(j).getAttribute("href") + "---> status:" + response + " ----> code:" + code);
log.info((j+1) +"/" + activeLinks.size() + " " + activeLinks.get(j).getAttribute("href") + "---> status:" + response + " ----> code:" + code);
}
And this is my output:
I'm getting the same links again and again, as if they were repeating.
Can anybody help me with this?
Try copying the list items to a Set, because a Set does not allow duplicates.
For example:
WebDriver driver = new ChromeDriver();
List<WebElement> anchors = driver.findElements(By.tagName("a"));
Set<WebElement> hrefs = new HashSet<WebElement>(anchors);
Iterator<WebElement> i = hrefs.iterator();
while(i.hasNext()) {
WebElement anchor = i.next();
if(anchor.getAttribute("href").contains(href)) {
anchor.click();
break;
}
}
Hope this helps.
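One caveat, as far as I can tell: a HashSet<WebElement> only drops elements that compare equal as elements, so two different anchors that point at the same URL would both stay in the set. Collecting the href strings instead removes repeated URLs. The sketch below works on plain strings so that it runs without Selenium. UniqueLinksSketch and uniqueHrefs are hypothetical names, and the URLs are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class UniqueLinksSketch {

    /** Keeps the first occurrence of each non-null href, preserving page order. */
    static List<String> uniqueHrefs(List<String> hrefs) {
        Set<String> seen = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href != null) {
                seen.add(href); // add() ignores values that are already present
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Stand-ins for the values returned by link.getAttribute("href").
        List<String> hrefs = new ArrayList<>();
        hrefs.add("https://example.com/pharmacy/a");
        hrefs.add("https://example.com/pharmacy/b");
        hrefs.add("https://example.com/pharmacy/a");
        hrefs.add(null);
        System.out.println(uniqueHrefs(hrefs));
        // [https://example.com/pharmacy/a, https://example.com/pharmacy/b]
    }
}
```

The status check loop could then iterate over the unique strings rather than over the WebElement list.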
public String generateDataPDF() {
System.out.println("Inside generate PDF");
String filePath = "";
HttpSession sess = ServletActionContext.getRequest().getSession();
try {
sess.setAttribute("msg", "");
if (getCrnListType().equalsIgnoreCase("F")) {
try {
filePath = getModulePath("CRNLIST_BASE_LOCATION") + File.separator + getCrnFileFileName();
System.out.println("File stored path : " + filePath);
target = new File(filePath);
FileUtils.copyFile(crnFile, target);
} catch (Exception e) {
System.out.println("File path Exception " + e);
}
}
System.out.println("Values from jsp are : 1)Mode of Generation : " + getCrnListType() + " 2)Policy Number : " + getCrnNumber() + " 3)Uploaded File Name : " + getCrnFileFileName() + " 4)LogoType : " + getLogoType()
+ " 5)Output Path : " + getOutputPath() + " 6)Type of Generation : " + getOptionId() + " 7)PDF Name : " + getPdfName());
String srtVAL = "";
String arrayVaue[] = new String[]{getCrnListType(), getCrnListType().equalsIgnoreCase("S") ? getCrnNumber() : filePath, getLogoType().equalsIgnoreCase("WL") ? "0" : "1",
getOutputPath(), getGenMode(), getRenType()};
//INS DB Connection
con = getInsjdbcConnection();
ArrayList selectedCRNList = new ArrayList();
String selectedCRNStr = "";
selectedCRNStr = getSelectedVal(selectedCRNStr, arrayVaue[1]);
String[] fileRes = selectedCRNStr.split("\\,");
if (fileRes[0].equalsIgnoreCase("FAIL")) {
System.out.println("fileRes is FAIL beacause of other extension file.");
sess.setAttribute("pr", "Please upload xls or csv file.");
return SUCCESS;
}
System.out.println("List file is : " + selectedCRNStr);
String st[] = srtVAL.split("[*]");
String billDateStr = DateUtil.getStrDateProc(new Date());
Timestamp strtPasrsingTm = new Timestamp(new Date().getTime());
String minAMPM = DateUtil.getTimeDate(new Date());
String str = "";
String batchID = callSequence();
try {
System.out.println("Inside Multiple policy Generation.");
String userName=sess.getAttribute("loginName").toString();
String list = getProcessesdList(userName);
if (list != null) {
System.out.println("list is not null Users previous data is processing.....");
//setTotalPDFgNERATEDmSG("Data is processing please wait.");
sess.setAttribute("pr","Batch Id "+list+" for User " + userName + " is currently running.Please wait till this Process complete.");
return SUCCESS;
}
String[] policyNo = selectedCRNStr.split("\\,");
int l = 0, f = 0,counter=1;
for (int j = 0; j < policyNo.length; j++,counter++) {
String pdfFileName = "";
int uniqueId=counter;
globUniqueId=uniqueId;
insertData(batchID, new Date(), policyNo[j], getOptionId(), userName,uniqueId);
System.out.println("Executing Proc one by one.");
System.out.println("policyNo[j]" + policyNo[j]);
System.out.println("getOptionId()" + getOptionId());
System.out.println("seqValue i.e batchId : " + batchID);
str = callProcedure(policyNo[j], getOptionId(), batchID);
String[] procResponse = str.split("\\|");
for (int i = 0; i < procResponse.length; i++) {
System.out.println("Response is : " + procResponse[i]);
}
if (procResponse[0].equals("SUCCESS")) {
Generator gen = new Generator();
if (getPdfName().equalsIgnoreCase("true")) {
System.out.println("Checkbox is click i.e true");
pdfFileName = procResponse[1];
} else {
System.out.println("Checkbox is not click i.e false");
String POLICY_SCH_GEN_PSS = getDetailsForFileName(userName, policyNo[j], batchID);
String[] fileName = POLICY_SCH_GEN_PSS.split("\\|");
if (getLogoType().equals("0") || getLogoType().equals("2")) {
System.out.println("If logo is O or 1");
pdfFileName = fileName[1];
} else if (getLogoType().equals("1")) {
System.out.println("If logo is 2");
pdfFileName = fileName[0];
}
}
b1 = gen.genStmt(procResponse[1], procResponse[2], "2", getLogoType(), "0", pdfFileName,"1",userName,batchID);
l++;
updateData(uniqueId,batchID, "Y");
} else {
f++;
updateData(uniqueId,batchID, "F");
}
}
sess.setAttribute("pr","Total "+l+" "+getGenericModulePath("PDF_RES1") + " " + " " + getGenericModulePath("PDF_RES2") + " " + f);
}catch (Exception e) {
updateData(globUniqueId,batchID, "F");
System.out.println("Exception in procedure call");
setTotalPDFgNERATEDmSG("Fail");
e.printStackTrace();
sess.setAttribute("pr", "Server Error.");
return SUCCESS;
}
}catch (Exception ex) {
ex.printStackTrace();
sess.setAttribute("pr", "Server Error.");
return SUCCESS;
}
System.out.println("Above second return");
return SUCCESS;
}
The generateDataPDF method generates PDFs based on the request parameters, i.e. the product type (GenMode) and the CRN list (uploaded as an Excel file). The code works fine when only a single user generates PDFs. But if two different users (users and roles are assigned in the application) start the process at the same time, the request parameters get overridden. Suppose the first user requests PDFs for 50 customers for product 1, and while User1's process is still running a second user makes a request for product 2. User1's PDFs are then generated, but for product 2. The batchId is unique for every request, and one table holds the batch_id, all the PDFs, and the generation flags. How do I solve this?
As per your comment, this is what I would do. It's probably not the best way to do it!
First: create a function that collects all your data at the beginning. You should not modify, update, or create anything while you are generating a PDF. For example: array/list collectPDFData(), which should return an array/list.
Second: use a synchronized method, like synchronized boolean generatePDF(array/list).
Synchronized methods use the monitor lock (intrinsic lock) of the object they are called on, so all synchronized methods of the same object share one monitor and only one thread at a time can execute any of them.
NB: if you use synchronized, collecting all your data separately is probably unnecessary, but I think it's good practice to write small functions dedicated to a specific task.
Thus, your code should be refactored a little bit.
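The two steps could look roughly like the sketch below. All names in it (PdfJobSketch, PdfRequest, generate) are hypothetical, and the string built inside the loop is only a placeholder for the real PDF generation. The point is that each request's parameters live in an immutable snapshot rather than in fields shared between users, and that the synchronized method lets only one thread at a time run on a given instance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PdfJobSketch {

    /** Immutable snapshot of one user's request, collected before generation starts. */
    static final class PdfRequest {
        final String userName;
        final String optionId;
        final List<String> policyNumbers;

        PdfRequest(String userName, String optionId, List<String> policyNumbers) {
            this.userName = userName;
            this.optionId = optionId;
            this.policyNumbers = Collections.unmodifiableList(new ArrayList<>(policyNumbers));
        }
    }

    private final List<String> generated = new ArrayList<>();

    /** Only one thread at a time can run this on a given PdfJobSketch instance. */
    synchronized List<String> generate(PdfRequest request) {
        List<String> files = new ArrayList<>();
        for (String policy : request.policyNumbers) {
            // Placeholder for the real PDF generation; it reads only from the snapshot.
            files.add(request.userName + "-" + request.optionId + "-" + policy + ".pdf");
        }
        generated.addAll(files);
        return files;
    }

    synchronized int generatedCount() {
        return generated.size();
    }

    public static void main(String[] args) throws InterruptedException {
        PdfJobSketch job = new PdfJobSketch();
        Thread user1 = new Thread(() ->
                job.generate(new PdfRequest("user1", "product1", List.of("P1", "P2"))));
        Thread user2 = new Thread(() ->
                job.generate(new PdfRequest("user2", "product2", List.of("P3"))));
        user1.start();
        user2.start();
        user1.join();
        user2.join();
        System.out.println(job.generatedCount()); // 3
    }
}
```

Note that synchronized only serializes callers that share the same object. If the framework creates one action instance per request, the lock would have to be on something shared, which is an assumption to check against the actual setup.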
I have a requirement to replace URLs in CSS. So far I have this code, which displays the rules of a CSS file:
@Override
public void parse(String document) {
log.info("Parsing CSS: " + document);
this.document = document;
InputSource source = new InputSource(new StringReader(this.document));
try {
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
log.info("Number of rules: " + ruleList.getLength());
// lets examine the stylesheet contents
for (int i = 0; i < ruleList.getLength(); i++)
{
CSSRule rule = ruleList.item(i);
if (rule instanceof CSSStyleRule) {
CSSStyleRule styleRule=(CSSStyleRule)rule;
log.info("selector: " + styleRule.getSelectorText());
CSSStyleDeclaration styleDeclaration = styleRule.getStyle();
//assertEquals(1, styleDeclaration.getLength());
for (int j = 0; j < styleDeclaration.getLength(); j++) {
String property = styleDeclaration.item(j);
log.info("property: " + property);
log.info("value: " + styleDeclaration.getPropertyCSSValue(property).getCssText());
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
However, I am not sure how to actually replace the URL, since there is not much documentation for CSS Parser.
Here is the modified for loop:
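// Assumes the only URLs in the CSS are image URLs.
// The extensions are grouped so that the alternation applies to the suffix only.
Pattern URL_PATTERN = Pattern.compile("http://.*?\\.(?:jpg|jpeg|png|gif)");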
for (int j = 0; j < styleDeclaration.getLength(); j++) {
String property = styleDeclaration.item(j);
String value = styleDeclaration.getPropertyCSSValue(property).getCssText();
Matcher m = URL_PATTERN.matcher(value);
//CSS property can have multiple URL. Hence do it in while loop.
while(m.find()) {
String originalUrl = m.group(0);
//Now you have the original URL here. Change it however you want.
}
}
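To go from finding the URLs to actually replacing them, the matched text can be rewritten with the Matcher's appendReplacement and appendTail methods. The sketch below works on a plain property value string and uses only java.util.regex. CssUrlRewriteSketch and rewriteUrls are hypothetical names, the host names are made up, and the pattern is slightly stricter than the one above (it stops at quotes, parentheses, and whitespace). The W3C CSSStyleDeclaration interface declares setProperty(name, value, priority), which is presumably how the rewritten value would be stored back, but I have not verified how CSS Parser handles that.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssUrlRewriteSketch {

    // Matches http:// URLs that end in a common image extension.
    private static final Pattern URL_PATTERN =
            Pattern.compile("http://[^\\s\"')]+?\\.(?:jpe?g|png|gif)");

    /** Replaces oldHost with newHost in every image URL found in a CSS property value. */
    static String rewriteUrls(String cssValue, String oldHost, String newHost) {
        Matcher m = URL_PATTERN.matcher(cssValue);
        StringBuffer result = new StringBuffer();
        while (m.find()) {
            String replaced = m.group().replace(oldHost, newHost);
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references
            m.appendReplacement(result, Matcher.quoteReplacement(replaced));
        }
        m.appendTail(result); // copy whatever follows the last match
        return result.toString();
    }

    public static void main(String[] args) {
        String value = "url(http://old.example.com/img/bg.png) no-repeat";
        System.out.println(rewriteUrls(value, "old.example.com", "cdn.example.com"));
        // url(http://cdn.example.com/img/bg.png) no-repeat
    }
}
```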