I'm scraping data from multiple web pages using Jsoup. How can I save the scraped data to a file without it overwriting the data from the previously scraped page?
I've tried searching Stack Overflow and the Jsoup docs for a solution.
int j = 0;
int i = 0;
String URL = "https://www.ufc.com/athletes/all?gender=All&search=&page=" + j;
Document doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
Elements temp = doc.select("div.c-listing-athlete__text");
for (Element fighterList : temp) {
    i++;
    System.out.println(i + " " + fighterList.getElementsByClass("c-listing-athlete__name").first().text());
}

j++;
URL = "https://www.ufc.com/athletes/all?gender=All&search=&page=" + j;
doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
temp = doc.select("div.c-listing-athlete__text");
for (Element fighterList : temp) {
    i++;
    System.out.println(i + " " + fighterList.getElementsByClass("c-listing-athlete__name").first().text());
}
If you need to save the data from your code, this may help: loop over all the pages in one run and write every result to a single file that stays open for the whole run:
int pagesNumber = 10;
// Create the file; try-with-resources closes the writer even if an exception is thrown
try (BufferedWriter out = new BufferedWriter(
        new FileWriter(System.currentTimeMillis() + "out.txt"))) {
    for (int i = 0; i < pagesNumber; i++) {
        String URL = "https://www.ufc.com/athletes/all?gender=All&search=&page=" + i;
        Document doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
        Elements temp = doc.select("div.c-listing-athlete__text");
        for (Element fighter : temp) {
            out.write(i + " " + fighter.getElementsByClass("c-listing-athlete__name").first().text());
            out.newLine(); // one fighter per line; without this everything lands on a single line
        }
    }
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
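One note on the original question: if each run should keep adding to the same file instead of starting a new one, FileWriter also has an append-mode constructor. A minimal sketch (the file name is just an example):

// Opening the writer with append = true adds to the end of the file
// instead of truncating it, so output from earlier runs is preserved.
try (BufferedWriter out = new BufferedWriter(new FileWriter("fighters.txt", true))) {
    out.write("one more scraped line");
    out.newLine();
} catch (IOException e) {
    System.err.println("Error: " + e.getMessage());
}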
Hope it helps :)
I have a scenario where I collect all the div ids and loop over them one by one to complete the iteration. I have the scenario working, but it takes a long time to get through all the ids.
Can you please suggest how to make it faster?
Below is my code snippet.
List<WebElement> listoftab = driver.findElements(By.xpath(".//*[contains(@id, 'tabZ')]/div/div[1]"));
Thread.sleep(1000);
String clas1 = "tablist";
String clas2 = "tabView";
for (int i = 1; i <= 110; i++) {
    boolean present;
    try {
        driver.findElement(By.xpath(".//*[@id='tabZ" + i + "']/div/div[1]"));
        present = true;
        if (clas1.equalsIgnoreCase(driver.findElement(By.xpath(".//*[@id='tabZ" + i + "']/div/div[1]")).getAttribute("class"))) {
            tabloop:
            for (int j = 1; j <= 15; j++) {
                if (clas2.equalsIgnoreCase(driver.findElement(By.xpath(".//*[@id='tabZ" + i + "']/div/div[1]/div[" + j + "]")).getAttribute("class"))) {
                    String ls = driver.findElement(By.xpath(".//*[@id='tabZ" + i + "']/div/div[1]/div[" + j + "]")).getAttribute("id");
                    System.out.println(ls);
                    driver.findElement(By.xpath(".//*[@id='" + ls + "']/div[1]/div[2]/canvas[2]")).click();
                    Thread.sleep(3000);
                    break tabloop;
                }
            }
        }
    } catch (NoSuchElementException e) {
        present = false;
        continue;
    }
}
Try this code. You are calling driver.findElement() multiple times with the same locator; avoid re-finding elements and store them in a variable instead.
List<WebElement> listoftab = driver.findElements(
        By.xpath(".//*[contains(@id, 'tabZ')]/div/div[1]"));
Thread.sleep(1000);
String clas1 = "tablist";
String clas2 = "tabView";
for (int i = 1; i <= 110; i++) {
    boolean present;
    try {
        // Find each element once and reuse the reference
        WebElement element = driver.findElement(
                By.xpath(".//*[@id='tabZ" + i + "']/div/div[1]"));
        present = true;
        if (clas1.equalsIgnoreCase(element.getAttribute("class"))) {
            tabloop:
            for (int j = 1; j <= 15; j++) {
                WebElement element1 = driver.findElement(
                        By.xpath(".//*[@id='tabZ" + i + "']/div/div[1]/div[" + j + "]"));
                if (clas2.equalsIgnoreCase(element1.getAttribute("class"))) {
                    String ls = element1.getAttribute("id");
                    System.out.println(ls);
                    driver.findElement(
                            By.xpath(".//*[@id='" + ls + "']/div[1]/div[2]/canvas[2]")).click();
                    break tabloop;
                }
            }
        }
    } catch (NoSuchElementException e) {
        present = false;
        continue;
    }
}
Also try to avoid hard waits (Thread.sleep); better go with FluentWait.
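For example, a minimal FluentWait sketch (assuming Selenium 3.11+ where withTimeout takes a Duration; the timeout and polling values are arbitrary and should be tuned):

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.FluentWait;
import org.openqa.selenium.support.ui.Wait;

// Polls every 500 ms for up to 30 s instead of sleeping a fixed 3 s,
// so the loop continues as soon as the element actually appears.
Wait<WebDriver> wait = new FluentWait<>(driver)
        .withTimeout(Duration.ofSeconds(30))
        .pollingEvery(Duration.ofMillis(500))
        .ignoring(NoSuchElementException.class);

WebElement tab = wait.until(d -> d.findElement(By.xpath(".//*[@id='tabZ1']/div/div[1]")));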
I am trying to parse a PDF file with iText. What I am trying to achieve is to parse all pages at once.
try {
    PdfReader reader = new PdfReader("D:\\hl_sv\\L04MF.pdf");
    int pages = reader.getNumberOfPages();
    String content = "";
    for (int i = 0; i <= pages; i++) {
        System.out.println("============PAGE NUMBER " + i + "=============");
        content = content + " " + PdfTextExtractor.getTextFromPage(reader, i);
    }
    System.out.println(content);
}
I am getting this error:
Exception in thread "main" java.lang.NullPointerException
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:77)
at com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:74)
at com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:89)
at com.pdf.PDF.main(PDF.java:18)
The other problem I am facing is that the hyphen (-) is being parsed as a question mark (?). How can I fix that?
I appreciate any help.
Edit
It works for me like this, but I still can't solve the hyphen bug.
try {
    PdfReader reader = new PdfReader("D:\\hl_sv\\L04MF.pdf");
    int pages = reader.getNumberOfPages();
    for (int i = 1; i <= pages; i++) {
        System.out.println("============PAGE NUMBER " + i + "=============");
        String line = PdfTextExtractor.getTextFromPage(reader, i);
        System.out.println(line);
    }
}
public static String extractPdfText() throws IOException {
PdfReader pdfReader = new PdfReader("/path/to/file/myfile.pdf");
int pages = pdfReader.getNumberOfPages();
String pdfText = "";
for (int ctr = 1; ctr < pages + 1; ctr++) {
pdfText += PdfTextExtractor.getTextFromPage(pdfReader, ctr); // Page number cannot be 0 or will throw NPE
}
pdfReader.close();
return pdfText;
}
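Regarding the hyphen turning into a question mark: I can't say for sure without the PDF, but that symptom is often the console's default charset replacing characters it can't encode, not iText itself. A sketch of printing with an explicit UTF-8 encoding (assuming your terminal can display UTF-8):

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public static void printUtf8(String text) throws UnsupportedEncodingException {
    // Print with an explicit UTF-8 charset so characters outside the default
    // console charset (e.g. typographic hyphens and dashes) are not replaced by '?'.
    PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
    utf8Out.println(text);
}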
I have a string which contains dynamic HTML. The HTML can contain static images, maps, text, links, etc. You can take a look at this link.
The answer to that question works when the HTML contains only text and links (a href). But if the HTML contains images or maps, it malfunctions and the HTML is not generated as expected.
The methods which I have created to do the job are:
private void createHtmlWeb() {
    String listOfElements = "null"; // normally set if webText contains maps.google.com
    Toast.makeText(getApplicationContext(), "" + mainEditText.getHeight(), Toast.LENGTH_SHORT).show();
    ParseObject postObject = new ParseObject("Post");
    Spannable s = mainEditText.getText();
    String webText = Html.toHtml(s);
    webText = webText.replaceAll("(</?(?:b|i|u)>)\\1+", "$1").replaceAll("</(b|i|u)><\\1>", "");
    // Logic to add center tag before image
    // Document doc = Jsoup.parse(webText);
    // Elements imgs = doc.select("img");
    // for (Element img : imgs) {
    //     img.attr("src", "images/" + img.attr("src")); // or whatever
    // }
    //
    // doc.outerHtml(); // returns the modified HTML

    // Determine link and favourite types to add a favourite class around it.
if (webText.contains("a href")) {
String favourite = "favourite";
// Parse it into jsoup
Document doc = Jsoup.parse(webText);
// Create an array to tackle every type individually as wrap can
// affect whole body types otherwises.
Element[] array = new Element[doc.select("a").size()];
for (int i = 0; i < doc.select("a").size(); i++) {
if (doc.select("a").get(i) != null) {
array[i] = doc.select("a").get(i);
}
}
for (int i = 0; i < array.length; i++) {
// we don't want to wrap link types. Common part links have is
// http. Should update for somethng more secure.
if (array[i].toString().contains("http") == false) {
// wrapping inner href with a tag attributes
Elements link = doc.select("a");
String linkHref = link.attr("href");
Log.e("linkHref",linkHref);
array[i] = array[i].wrap("<a class=" + favourite + " href='"+linkHref+"'></a>");
}
}
// Log.e("From doc.body html *************** ", " " + doc.body());
Element element = doc.body();
Log.e("From element html *************** ", " " + element.html());
//changes to update html ahref
String currentHtml = element.html();
String newHtml = currentHtml.substring(0,currentHtml.indexOf("<a href")+1)+currentHtml.substring(currentHtml.indexOf("font"),currentHtml.indexOf("</a>"))+currentHtml.substring(currentHtml.indexOf("</a>")+4,currentHtml.length());
listOfElements = newHtml;
//refactoring html
listOfElements = wrapImgWithCenter(listOfElements);
//listOfElements = element.html();
}
    // First we need to check whether it is a google maps image
    if (webText.contains("maps.google.com")) {
        Document doc = Jsoup.parse(webText); // Parse it into jsoup
        for (int i = 0; i < doc.select("img").size(); i++) {
            if (doc.select("img").get(i).toString().contains("maps.google.com")) {
                // Get all numbers + full stops + get all numbers
                Pattern noImage = Pattern.compile("(\\-?\\d+(\\.\\d+)?),(\\-?\\d+(\\.\\d+))+%7C(\\-?\\d+(\\.\\d+)?),(\\-?\\d+(\\.\\d+))");
                // Gets the URL SRC basically.. almost.. lets try it
                Matcher matcherer = noImage.matcher(doc.select("img").get(i).toString());
                // Have two options - multi route or single route
                if (matcherer.find()) {
                    for (int j = 0; j < matcherer.groupCount(); j++) {
                        latitude_to = Double.parseDouble(matcherer.group(1));
                        longitude_to = Double.parseDouble(matcherer.group(3));
                        latitude_from = Double.parseDouble(matcherer.group(5));
                        longitude_from = Double.parseDouble(matcherer.group(7));
                    }
                    String coOrds = "" + latitude_to + "," + longitude_to + "," + latitude_from + "," + longitude_from;
                    Element ele = doc.body();
                    ele.select("img").get(i).wrap("");
                    listOfElements = ele.html();
                    listOfElements = listOfElements.replace("&amp;", "&");
                } else {
                    noImage = Pattern.compile("(\\-?\\d+(\\.\\d+)?),\\s*(\\-?\\d+(\\.\\d+)?)");
                    matcherer = noImage.matcher(doc.select("img").get(i).toString());
                    Toast.makeText(getApplicationContext(), "Regex Count:" + matcherer.groupCount(), Toast.LENGTH_LONG).show();
                    if (matcherer.find()) {
                        for (int j = 0; j < matcherer.groupCount(); j++) {
                            latitude = Double.parseDouble(matcherer.group(1));
                            parseGeoPoint.setLatitude(latitude);
                            longitude = Double.parseDouble(matcherer.group(3));
                            parseGeoPoint.setLongitude(longitude);
                        }
                    }
                    String coOrds = "" + latitude + "," + longitude;
                    Element ele = doc.body();
                    ele.select("img").get(i).wrap("");
                    listOfElements = ele.html();
                    listOfElements = listOfElements.replace("&amp;", "&");
                }
            } else {
                // Standard photo
                Element ele = doc.body();
                ele.select("img").get(i);
                listOfElements = ele.html();
            }
        }
        Log.e("listOfElements", listOfElements);
        // Refactoring html
        listOfElements = wrapImgWithCenter(listOfElements);
        // Put new value in htmlContent
        postObject.put("htmlContent", listOfElements);
    } else {
        // Refactoring html
        webText = wrapImgWithCenter(webText);
        postObject.put("htmlContent", webText);
    }
    mainEditText.getViewTreeObserver().addOnGlobalLayoutListener(new ViewTreeObserver.OnGlobalLayoutListener() {
        @Override
        public void onGlobalLayout() {
            Rect r = new Rect();
            mainEditText.getWindowVisibleDisplayFrame(r);
            // int screenHeight = mainEditText.getRootView().getHeight();
            // int heightDifference = screenHeight - (r.bottom - r.top);
        }
    });
    // See if a trip exists
    if (finalTrip != null) {
    }
    // Want to put the location in the location section
    // if parseGeoPoint != null -- old information
    if (latitude != -10000 && longitude != -10000) {
        // Toast.makeText(getApplicationContext(),
        //         "Adding in location co-ords: " + latitude + " : " + longitude,
        //         Toast.LENGTH_SHORT).show();
        postObject.put("location", parseGeoPoint);
    }
    postObject.put("type", Post.PostType.HTML.getPostVal());
    postObject.put("user", ParseObject.createWithoutData("_User", user.getObjectId()));
    // Transfer these details
    Intent i = new Intent(getApplicationContext(), WriteStoryAnimation.class);
    i.putExtra("listOfElements", listOfElements);
    i.putExtra("webText", webText);
    i.putExtra("finalTrip", finalTrip);
    i.putExtra("latitude", latitude);
    i.putExtra("longitude", longitude);
    if (mainEditText.length() > 0) {
        finish();
        // Conflict was here from html merge.
        startActivity(i);
    } else {
        Toast.makeText(getApplicationContext(), "Your story is empty", Toast.LENGTH_SHORT).show();
    }
    // finish();
    // Toast.makeText(getApplicationContext(), "EditText Size: " + height +
    //         " : " + desiredHeight, Toast.LENGTH_LONG).show();
}
// Method to refactor html
public String wrapImgWithCenter(String html) {
    Document doc = Jsoup.parse(html);
    // Adding center tag around images
    doc.select("img").wrap("<center></center>");
    // Adding a gap (two <br>) after the last p tag, if there is one
    Element lastP = doc.select("p").last();
    if (lastP != null) { // guard against html without any <p>, which would NPE
        for (int i = 0; i <= 1; i++) {
            lastP.after("<br>");
        }
    }
    Log.e("Wrapping", doc.html());
    return doc.html();
}
You have to read the question in the link to understand the input and the output.
Here is another output, with an image and links, for your reference:
<html>
<head></head>
<body>
<p dir="ltr">
<center>
<img src="http://files.parsetfss.com/bcff7108-cbce-4ab8-b5d1-1f82827e6519/tfss-9fca384a-2f7b-4632-a585-65c78f40842a-file" />
</center><br /> <font color="#009a49">Rohit Lalwani</font><br /> <a href="45.5033204,-99.8865083">
<center>
<img src="http://maps.google.com/maps/api/staticmap?center=45.5033204,-99.8865083&zoom=15&size=960x540&sensor=false&markers=color:blue%7Clabel:!%7C45.5033204,-99.8865083" />
</center></a><br /> </p>
<br />
<br />
</body>
</html>
There you can see that the class="favourite" on the anchor tag is missing. This is what I need to rectify. Please suggest what to do.
Reading your original question, I see that you can achieve what you want this way:
You have an anchor (a.favourite).
You pick its grandchild (a font in this particular case, but it could be an img or whatever).
You delete the children of the original anchor.
Then you append the grandchild back as a new child.
This may sound complicated, but it is very easy. Here is a code example:
String html = "<a class=\"favourite\" href=\"LixWQfueLU\"><font color=\"#009a49\">Rohit Lalwani</font></a>";
Document doc = Jsoup.parse(html);
// The original anchor
Element afav = doc.select(".favourite").first();
// The grandchild
Element select = doc.select("font").first();
// Detach the anchor from the document, then re-append the grandchild to it
afav.remove();
afav.appendChild(select);
System.out.println(afav);
Output:
<a class="favourite" href="LixWQfueLU"><font color="#009a49">Rohit Lalwani</font></a>
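If the goal is only to restore the missing class on anchors that already exist in the document, a simpler alternative (my own suggestion, not part of the original answer) is to add the class directly with Jsoup's addClass:

// Add the class directly to every anchor instead of re-wrapping it.
Document doc = Jsoup.parse(listOfElements);
for (Element a : doc.select("a")) {
    a.addClass("favourite");
}
listOfElements = doc.body().html();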
Hope it helps!
I am trying to read a lot of HTML pages using jsoup. I have an ArrayList called "allPageLinks" that keeps the HTML page links. Here is my code:
Document doc;
for (int i = 0; i < allPageLinks.size(); i++) {
    try {
        doc = Jsoup.connect(allPageLinks.get(i)).timeout(0).get();
        Element page_clips = doc.getElementById("page_clips");
        Element page_clip_content = page_clips.getElementById("content");
        Elements product_grid = page_clip_content.select(".product-list.margin-left-5");
        Elements products = product_grid.get(0).children();
        for (int j = 0; j < products.size(); j++) {
            try {
                String productName = products.get(j).getElementsByClass("name").text();
                String productPrice = products.get(j).getElementsByClass("price").text();
                String productLink = products.get(j).getElementsByClass("image").select("a").first().attr("href");
                Document newDoc = Jsoup.connect(productLink).get();
                Elements elements = newDoc.getElementsByClass("left");
                Elements productNameElement = elements.get(0).getElementsByClass("colorbox");
                String productImage = productNameElement.attr("href");
                elements = newDoc.getElementsByClass("right");
                String productId = elements.get(0).getElementsByClass("field").get(1).text();
                writer.append(productName);
                writer.append(';');
                writer.append(productPrice);
                writer.append(';');
                writer.append(productId);
                writer.append(';');
                writer.append(productImage);
                writer.append(';');
                writer.append(productLink);
                writer.append('\n');
            } catch (Exception ex) {
                System.out.println(ex.getMessage() + " " + i + " " + allPageLinks.get(i) + " ICTEKICATCH");
            }
        }
    } catch (Exception ex) {
        System.out.println(ex.getMessage() + " " + i + " " + allPageLinks.get(i));
    }
}
Even though I set the connection timeout to zero, I am getting a lot of connect timeout exceptions for most of the links. Can anyone help me get rid of that exception?
Thanks
You forgot to specify the timeout for this connection inside the loop:
Document newDoc = Jsoup.connect(productLink).get();
Should be:
Document newDoc = Jsoup.connect(productLink).timeout(0).get();
That is where the timeout exception is most likely occurring.
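A side note beyond the original answer: timeout(0) means wait forever, so one dead link can stall the whole crawl. If that bites you, a bounded timeout plus a small retry loop is a common pattern; a sketch with arbitrary values (3 attempts, 10 seconds each):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Retry a fetch up to 3 times with a 10-second timeout per attempt.
static Document fetchWithRetry(String url) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt < 3; attempt++) {
        try {
            return Jsoup.connect(url).timeout(10_000).get(); // timeout in milliseconds
        } catch (IOException e) {
            last = e; // remember the failure and try again
        }
    }
    throw last; // all attempts failed
}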