Hello, I have to parse pages whose URI is resolved by a server redirect.
Example:
I have http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56671b7020 which redirects to http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera%20convocati%20villar%20news%2010agosto2013?pragma=no-cache
This is the URI of the page that I have to parse. The problem is that the redirect URI contains spaces. Here's the code:
String url = "http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56671b7020";
Document doc = Jsoup.connect(url).get();
Element img = doc.select(".juveShareImage").first();
String imgurl = img.absUrl("src");
System.out.println(imgurl);
I get this error on the second line:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera convocati villar news 10agosto2013?pragma=no-cache
which contains the redirected URL, so Jsoup does follow the redirect correctly. Is there a way to replace the ' ' with %20 so I can parse without problems?
Thanks!
You are right, this is the problem. The only solution I see is to handle the redirects manually. I wrote this small recursive method that does it for you:
public static void main(String[] args) throws IOException
{
    String url = "http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56671b7020";
    Document document = manualRedirectHandler(url);
    Elements elements = document.getElementsByClass("juveShareImage");
    for (Element element : elements)
    {
        System.out.println(element.attr("src"));
    }
}

private static Document manualRedirectHandler(String url) throws IOException
{
    // Escape spaces before connecting, and stop Jsoup from following redirects itself
    Response response = Jsoup.connect(url.replaceAll(" ", "%20")).followRedirects(false).execute();
    int status = response.statusCode();
    if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM || status == HttpURLConnection.HTTP_SEE_OTHER)
    {
        String redirectUrl = response.header("location");
        System.out.println("Redirect to: " + redirectUrl);
        return manualRedirectHandler(redirectUrl);
    }
    return Jsoup.parse(response.body());
}
This will print:
Redirect to: http://www.juventus.com:80/wps/portal/!ut/p/b0/DcdJDoAgEATAF00GXFC8-QqVWwMuJLLEGP2-1q3Y8Mwm4Qk77pATzv_L6-KQgx-09FDeWmpEr6nRThCk36hGq1QnbScqwRMbNuXCHsFLyuTgjpVLjOMHyfCBUg!!/
Redirect to: http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera convocati villar news 10agosto2013?pragma=no-cache
/resources/images/news/inlined/42d386ef-1443-488d-8f3e-583b1e5eef61.jpg
I also added a patch for Jsoup for that:
https://github.com/jhy/jsoup/pull/354
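One side note on the handler above: Jsoup.parse(response.body()) discards the base URI, so a later call like absUrl("src") would come back empty. If you need absolute URLs, the Response object can parse itself, which keeps the final URL as the base URI; the last line would then become:

    return response.parse(); // keeps the response URL as base URI, so absUrl() works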
Try this instead:
String url = "http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera%20convocati%20villar%20news%2010agosto2013";
Document doc = Jsoup.connect(url)
        .data("pragma", "no-cache")
        .get();
Element img = doc.select(".juveShareImage").first();
String imgurl = img.absUrl("src");
System.out.println(imgurl);
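Note that for a GET request Jsoup serializes the .data() pairs into the query string, so this is equivalent to requesting the already percent-encoded URL with ?pragma=no-cache appended.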
I'm using Selenium to get a link from an <a> element, and I wanted to check whether it is a download link.
For that I used this code, which I wrote with URL and URLConnection:
final WebElement element = driver.findElement(By.xpath(pathToFile));
URL url = null;
final String urlFileToDownload = element.getAttribute("href");
URLConnection myCon = null;
String contentDisposition = "";
try {
    url = new URL(urlFileToDownload);
    myCon = url.openConnection();
    contentDisposition = myCon.getHeaderField("Content-Disposition");
    if (!contentDisposition.contains("attachment;filename=")) {
        assertTrue(false, "The link isn't a download link.");
    }
} catch (final MalformedURLException e) {
    throw new TestIntegrationException("Error while creating URL : " + e.getMessage());
} catch (final IOException e) {
    throw new TestIntegrationException("Error while connecting to the URL : " + e.getMessage());
}
assertTrue(true, "Link is a download link.");
The problem is that my link is a download link, as you can see in this picture: Image-link-download (the picture is a screenshot of the console).
And when I open the connection with url.openConnection(),
myCon.getHeaderField("Content-Disposition") is null.
I've searched for a way to do this, but every time my header field is empty, and I can't find the problem, because when I check in the console the header field isn't empty...
EDIT: I'm launching my Selenium test on a Docker server; I think that's an important point to know.
Try this:
driver.get("https://i.stack.imgur.com/64qFG.png");
// wait5s is assumed to be a WebDriverWait with a 5-second timeout
WebElement img = wait5s.until(ExpectedConditions.elementToBeClickable(By.xpath("/html/body/img")));
Dimension h = img.getSize();
Assert.assertNotEquals(0, h);
Instead of looking for attachments, why don't you look at the MIME type?
String contentType = myCon.getContentType();
if (contentType.startsWith("text/")) {
    assertTrue("The link isn't a download link.", false);
}
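If you'd rather not download any of the body while checking, a rough sketch using a HEAD request (reusing the url variable from the question) would be:

    HttpURLConnection head = (HttpURLConnection) url.openConnection();
    head.setRequestMethod("HEAD"); // ask for headers only, no response body
    String contentType = head.getContentType();
    head.disconnect();

Keep in mind some servers answer HEAD differently from GET, so treat this as a heuristic.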
My problem was caused by my session, which was different when going through url.openConnection().
To correct the problem I collected my JSESSIONID cookie using Selenium, like this:
String cookieTarget = null;
for (final Cookie cookie : this.kSupTestCase.getDriver().manage().getCookies()) {
    if (StringUtils.equalsIgnoreCase(cookie.getName(), "JSESSIONID")) {
        cookieTarget = cookie.getName() + "=" + cookie.getValue();
        break;
    }
}
Then I set the cookie on the opened connection:
try {
    url = new URL(urlFileToDownload);
    myCon = url.openConnection();
    myCon.setRequestProperty("Cookie", cookieTarget);
    contentDisposition = myCon.getHeaderField("Content-Disposition");
    if (!contentDisposition.contains("attachment;filename=")) {
        assertTrue(false, "The link isn't a download link.");
    }
} catch [...]
That way I got the right session, and my URL was recognized as a download link.
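In case cookies other than JSESSIONID matter too, a minimal variant that forwards every Selenium cookie to the connection (assuming driver is the active WebDriver) could look like:

    StringBuilder cookieHeader = new StringBuilder();
    for (final Cookie cookie : driver.manage().getCookies()) {
        if (cookieHeader.length() > 0) {
            cookieHeader.append("; "); // cookies are joined with "; " in the header
        }
        cookieHeader.append(cookie.getName()).append("=").append(cookie.getValue());
    }
    myCon.setRequestProperty("Cookie", cookieHeader.toString());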
I'm trying to extract information from websites using Jsoup, but I don't get the same HTML code as in my browser.
I tried to use .userAgent(), but it didn't work. I currently use the following function, which works for Amazon.com:
public static String getHTML(String urlToRead) throws Exception {
    StringBuilder result = new StringBuilder();
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0");
    conn.setRequestMethod("GET");
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = rd.readLine()) != null) {
        result.append(line);
    }
    rd.close();
    return result.toString();
}
The website I'm trying to parse is http://www.asos.com/, but the price of the product is always missing.
I found this topic, which is pretty close to mine, but I would like to do it using only Java and no external app.
So after a little playing around with the site I came up with a solution.
The site uses API responses to fetch the price for each item; that is why the prices are missing from the HTML you receive via Jsoup. Unfortunately there's a little more code than first expected, and you'll have to work out how it should determine which product ID to use instead of the hardcoded value. Other than that, the following code should work in your case.
I've included comments that hopefully explain each step, and I recommend taking a look at the API response, as there may be other data you require; the same may apply to the product details and description, as further data will need to be parsed out of the elementById field.
Good luck and let me know if you need any further help!
import org.json.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Main
{
    final String productID = "8513070";
    final String productURL = "http://www.asos.com/prd/";
    final Product product = new Product();

    public static void main( String[] args )
    {
        new Main();
    }

    private Main()
    {
        getProductDetails( productURL, productID );
        System.out.println( "ID: " + product.productID + ", Name: " + product.productName + ", Price: " + product.productPrice );
    }

    private void getProductDetails( String url, String productID )
    {
        try
        {
            // Append the product id to the product url to retrieve the product HTML
            final String appendedURL = url + productID;
            // Using Jsoup we'll connect to the url and get the HTML
            Document document = Jsoup.connect( appendedURL ).get();
            // We parse the HTML only looking for the product section
            Element elementById = document.getElementById( "asos-product" );
            // To simply get the title we look for the H1 tag
            Elements h1 = elementById.getElementsByTag( "h1" );
            // Because more than one H1 tag is returned we only want the tag that isn't empty
            if ( !h1.text().isEmpty() )
            {
                // Add all data to the Product object
                product.productID = productID;
                product.productName = h1.text().trim();
                product.productPrice = getProductPrice( productID );
            }
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
    }

    private String getProductPrice( String productID )
    {
        try
        {
            // Append the product id to the api url to retrieve the product price JSON document
            final String apiURL = "http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=" + productID + "&store=COM";
            // Using Jsoup again we connect to the URL, ignoring the content type, and retrieve the body
            String jsonDoc = Jsoup.connect( apiURL ).ignoreContentType( true ).execute().body();
            // As it's JSON we parse the JSONArray until we get to the current price, and return it
            JSONArray jsonArray = new JSONArray( jsonDoc );
            JSONObject currentProductPriceObj = jsonArray
                    .getJSONObject( 0 )
                    .getJSONObject( "productPrice" )
                    .getJSONObject( "current" );
            return currentProductPriceObj.getString( "text" );
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
        return "";
    }

    // Simple Product object to store the data
    class Product
    {
        String productID;
        String productName;
        String productPrice;
    }
}
Oh, and you'll also need org.json to parse the JSON response from the API.
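If you use Maven, the coordinates are org.json:json (pick a current version); otherwise just drop the org.json jar on your classpath.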
I want to read a JSP page and write it to an HTML page. I have 3 methods in my parse class: first readHTMLBody(), second WriteNewHTML(), third ZipToEpub().
When I call these methods from the parse class, everything works. But when they are called from a JSP or web service, UTF-8 characters look like "?" in readHTMLBody(). How can I fix it?
public String readHTMLBody() {
    try {
        String url = "http://localhost:8080/Library/part.jsp";
        Document doc = Jsoup.parse((new URL(url)).openStream(), "utf-8", url);
        String body = doc.html();
        Elements title = doc.select("xxx");
        linkURI = title.toString();
        linkURI = linkURI.replaceAll("<xxx>", "");
        linkURI = linkURI.replaceAll("</xxx>", "");
        linkURI = linkURI.replaceAll("\\s", "");
        resultBody = body;
        resultBody = resultBody.replaceAll("part/" + linkURI + "/assets/", "assets/");
    } catch (IOException e) {
        // note: this empty catch silently swallows I/O errors
    }
    return resultBody;
}
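A common cause of "?" characters in this kind of pipeline is that some step encodes with the platform default charset instead of UTF-8; the read above already forces "utf-8", so the write side is worth checking. A minimal sketch of an explicitly UTF-8 write (the file name is hypothetical, and whether this applies depends on what WriteNewHTML() actually does):

    try (Writer out = new OutputStreamWriter(new FileOutputStream("part.html"), StandardCharsets.UTF_8)) {
        out.write(resultBody); // encode explicitly as UTF-8, not the platform default
    }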
I tried to make an image-link downloader with Jsoup. I made the HTML downloading part, and when I did the parsing part I noticed that links to images sometimes appeared without their main part. So I found the absUrl solution, but for some reason it did not work (it gave me null). Then I tried uri.resolve(), but it gave me an unchanged result. So now I do not know how to solve it. I attached the part of my code that is responsible for parsing and writing the URLs to a string:
public static String finalcode(String textin) throws Exception {
    String text = source(textin);
    Document doc = Jsoup.parse(text);
    Elements images = doc.getElementsByTag("img");
    String Simages = images.toString();
    int Limages = countLines(Simages);
    StringBuilder src = new StringBuilder();
    while (Limages > 0) {
        Limages--;
        Element image = images.get(Limages);
        String href = image.attr("src");
        src.append(href);
        src.append("\n");
    }
    String result = src.toString();
    return result;
}
It looks like you are parsing the HTML from a String, not from a URL. Because of that, jsoup can't know which URL this HTML comes from, so it can't create absolute paths.
To set this URL for the Document you should parse it using the Jsoup.parse(String html, String baseUri) overload, like:
String url = "http://server/pages/document.htlm";
String text = "<img src = '../images/image_name1.jpg'/><img src = '../images/image_name2.jpg'/>'";
Document doc = Jsoup.parse(text, url);
Elements images = doc.getElementsByTag("img");
for (Element image : images){
System.out.println(image.attr("src")+" -> "+image.attr("abs:src"));
}
Output:
../images/image_name1.jpg -> http://server/images/image_name1.jpg
../images/image_name2.jpg -> http://server/images/image_name2.jpg
Another option would be to let Jsoup fetch the page directly by supplying the URL instead of a String of HTML:
Document doc = Jsoup.connect("http://example.com").get();
This way the Document will know which URL it came from, so it will be able to create absolute paths.
I am trying to run the Google search API from the SO link below:
How can you search Google Programmatically Java API
Here is my code:
public class RetrieveArticles {

    public static void main(String[] args) throws UnsupportedEncodingException, IOException {
        String google = "http://www.google.com/news?&start=1&q=";
        String search = "Police Violence in USA";
        String charset = "UTF-8";
        String userAgent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"; // Change this to your company's name and bot homepage!
        Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().children();
        for (Element link : links) {
            String title = link.text();
            String url = link.absUrl("href"); // Google returns URLs in the format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
            url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
            if (!url.startsWith("http")) {
                continue; // Ads/news/etc.
            }
            System.out.println("Title: " + title);
            System.out.println("URL: " + url);
        }
    }
}
When I try to run this I get the error below. Can anyone please help me fix it?
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1911)
at google.api.search.RetrieveArticles.main(RetrieveArticles.java:34)
Thanks in advance.
The problem is here:
url.substring(url.indexOf('=') + 1, url.indexOf('&'))
Either url.indexOf('=') or url.indexOf('&') returned -1, which is an illegal argument for substring.
You should validate the URL you are parsing before assuming that it contains '=' and '&'; see the sketch below.
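For example, a minimal guard (reusing the loop variables from the question's code) might look like:

    int eq = url.indexOf('=');
    int amp = url.indexOf('&');
    if (eq == -1 || amp == -1 || amp < eq) {
        continue; // skip URLs that don't have the expected "?q=<url>&..." shape
    }
    url = URLDecoder.decode(url.substring(eq + 1, amp), "UTF-8");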
Alternatively, add System.out.println(url); before the line
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
Then you will know whether the url string contains '=' and '&' or not.