When running the following code:
try {
Document doc = Jsoup.connect("https://pomofocus.io/").get();
Elements text = doc.select("div.sc-kEYyzF");
System.out.println(text.text());
}
catch (IOException e) {
e.printStackTrace();
}
No output occurs. When changing the println to:
System.out.println(text.first().text());
I get a NullPointerException but nothing else.
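jsoup doesn't execute JavaScript; it only parses the HTML that the server returns. The div with class sc-kEYyzF is presumably generated in the browser by a script, so it is not in that HTML: the selection is empty, text.text() prints an empty line, and text.first() returns null. You can check View Source (as opposed to Inspect) to see the server's response and what is selectable.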
I am building a web scraper using Java and JavaFX, and I already have a JavaFX application running.
I am following a procedure similar to this blog: https://ksah.in/introduction-to-web-scraping-with-java/
However, instead of using a fixed URL, I want to scrape any URL the user enters. For this, I need to handle the case where the URL is not found by displaying "Page not found" in my application console.
Here is the code where I read the URL:
void search() {
List<Course> v = scraper.scrape(textfieldURL.getText(), textfieldTerm.getText(),textfieldSubject.getText());
...
}
and then I do:
try {
HtmlPage page = client.getPage(baseurl + "/" + term + "/subject/" + sub);
...
}catch (Exception e) {
System.out.println(e);
}
in the scraper file.
It seems that the API throws a FailingHttpStatusCodeException if you set it up correctly. According to the documentation, it is thrown:
"if the server returns a failing status code AND the property WebClientOptions.setThrowExceptionOnFailingStatusCode(boolean) is set to true."
You can also get the WebResponse from the Page and call getStatusCode() to get the HTTP status code.
The tutorial you added contains the following code:
.....
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
try {
String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query=" + URLEncoder.encode(searchQuery, "UTF-8");
HtmlPage page = client.getPage(searchUrl);
}catch(Exception e){
e.printStackTrace();
}
.....
With this code, any error thrown by client.getPage, for example for a 404, is caught and printed to the console.
You stated that you want to print "Page not found", which means we have to catch a specific exception and log the message. The library used in the tutorial is net.sourceforge.htmlunit, and as you can see here (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/WebClient.html#getPage-java.lang.String-) the getPage method throws a FailingHttpStatusCodeException, which contains the status code of the HTTP response. (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/FailingHttpStatusCodeException.html)
This means we have to catch the FailingHttpStatusCodeException and check whether the status code is 404. If it is, log the message; if not, print the stack trace, for example.
Just for the sake of clean code, try not to catch them all (like in Pokémon) as the tutorial does; instead use specific catch blocks for the IOException, FailingHttpStatusCodeException and MalformedURLException that getPage can throw.
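To make the catch ordering concrete, here is a minimal sketch of that idea. So that it compiles without HtmlUnit on the classpath, it uses two hypothetical stand-ins that are not part of any library: StatusException plays the role of FailingHttpStatusCodeException (which also offers getStatusCode()), and Fetcher plays the role of client.getPage(url). The class name, method names and messages are assumptions for illustration only.

```java
import java.io.IOException;
import java.net.MalformedURLException;

class PageFetchSketch {

    /** Hypothetical stand-in for HtmlUnit's FailingHttpStatusCodeException. */
    static class StatusException extends RuntimeException {
        private final int statusCode;

        StatusException(int statusCode) {
            super("HTTP status " + statusCode);
            this.statusCode = statusCode;
        }

        int getStatusCode() {
            return statusCode;
        }
    }

    /** Hypothetical fetcher; in the scraper this role is played by client.getPage(url). */
    interface Fetcher {
        String fetch(String url) throws IOException;
    }

    /** Returns the page content, or a console message describing what went wrong. */
    static String load(String url, Fetcher fetcher) {
        try {
            return fetcher.fetch(url);
        } catch (StatusException e) {
            // The server answered, but with a failing status code.
            return e.getStatusCode() == 404 ? "Page not found" : "HTTP error " + e.getStatusCode();
        } catch (MalformedURLException e) {
            // Must come before IOException, because it is a subclass of it.
            return "Invalid URL: " + url;
        } catch (IOException e) {
            // Network-level failure: no usable response at all.
            return "Connection problem: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Simulate a server that answers 404 for every request.
        Fetcher notFound = url -> { throw new StatusException(404); };
        System.out.println(load("https://example.com/missing", notFound)); // prints "Page not found"
    }
}
```

With HtmlUnit itself, the same three catch blocks would wrap the client.getPage call, and the 404 branch only triggers if throwing on failing status codes is enabled in the client options.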
At the moment I am trying to write a program that reads a link from an XML file. I use jsoup; my current code is the following:
public static String XmlReader() {
InputStream is = RestService.getInstance().getWsilFile();
try {
Document doc = Jsoup.parse(is, null, "", Parser.xmlParser());
} catch (Exception e) {
e.printStackTrace();
return null;
}
}
}
I would like to read the following part from an XML file:
<wsil:service>
<wsil:abstract>Read the full documentation on: https://host/sap/bc/mdrs/cdo?type=psm_isi_r&objname=II_QUERY_PROJECT_IN&saml2=disabled</wsil:abstract>
<wsil:name>Query Projects</wsil:name>
<wsil:description location="host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled" referencedNamespace="http://schemas.xmlsoap.org/wsdl/"/>
</wsil:service>
I want to return the following URL as a String:
host/sap/bc/srt/wsdl/srvc_00163E5E1FED1EE897C188AB4A5723EF/wsdl11/allinone/ws_policy/document?sap-vhost=host&saml2=disabled
How can I do that?
Thank you
If there is only one tag wsil:description then you can use this code:
doc.outputSettings().escapeMode(EscapeMode.xhtml);
String val = doc.select("wsil|description").attr("location");
The escape mode should be changed because you are working with XML, not regular HTML.
If there is more than one tag with that name, you can search for a distinguishing neighbouring element and find the required tag relative to it:
String val = doc.select("wsil|name:contains(Query Projects)").first().parent().select("wsil|description").attr("location");
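For comparison, here is a rough sketch of the same "find the service by name, then read its location" lookup using only the JDK's DOM parser instead of jsoup. The class and method names are made up for illustration, and the namespace URI urn:example:wsil in the sample is a placeholder, not the real one. Note one assumption that differs from the jsoup version: a strict XML parser only accepts the attribute if the & is written as &amp;, whereas jsoup's parser is lenient about it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class WsilLocationSketch {

    /** Returns the location of the service with the given name, or null if there is none. */
    static String locationFor(String xml, String serviceName) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        // "*" matches any namespace URI, so only the local names matter here.
        NodeList services = doc.getElementsByTagNameNS("*", "service");
        for (int i = 0; i < services.getLength(); i++) {
            Element service = (Element) services.item(i);
            NodeList names = service.getElementsByTagNameNS("*", "name");
            NodeList descriptions = service.getElementsByTagNameNS("*", "description");
            if (names.getLength() > 0 && descriptions.getLength() > 0
                    && serviceName.equals(names.item(0).getTextContent().trim())) {
                return ((Element) descriptions.item(0)).getAttribute("location");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the namespace URI is a placeholder.
        String xml = "<wsil:inspection xmlns:wsil=\"urn:example:wsil\">"
                + "<wsil:service><wsil:name>Query Projects</wsil:name>"
                + "<wsil:description location=\"host/doc?sap-vhost=host&amp;saml2=disabled\"/>"
                + "</wsil:service></wsil:inspection>";
        System.out.println(locationFor(xml, "Query Projects"));
    }
}
```

Returning null when nothing matches keeps the caller in charge of the "not found" case, instead of risking a NullPointerException from chained calls.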
I'm working on a plugin for Eclipse and I created a StructuredTextEditor. The editor contains an XML document, and I want to format the code nicely (indentation etc.). I am looking for a way to apply Eclipse's standard "Format" function (Shift+Ctrl+F).
I found a code snippet that does exactly this, but I couldn't get it to work:
String commandId = IJavaEditorActionDefinitionIds.FORMAT;
IHandlerService handlerService = (IHandlerService)PlatformUI.getWorkbench().getService(IHandlerService.class);
try {
handlerService.executeCommand(commandId, null);
} catch (Exception e1) {
e1.printStackTrace();
}
I always get the following Exception:
org.eclipse.core.commands.NotHandledException: There is no handler to execute for command org.eclipse.jdt.ui.edit.text.java.format
Can anyone help me get this code running, or suggest another way to format the XML content? It is important that the result uses the same formatting as the Eclipse formatter.
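Thanks to greg-449 I searched for the correct command to call and found it. The ID from the snippet in the question belongs to the Java editor (org.eclipse.jdt.ui.edit.text.java.format), which is why no handler was found for it in a StructuredTextEditor.
Here is my function that works with StructuredTextEditor: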
private void formatString() {
String commandId = "org.eclipse.wst.sse.ui.format.document";
IHandlerService handlerService = (IHandlerService) PlatformUI.getWorkbench().getService(IHandlerService.class);
try {
handlerService.executeCommand(commandId, null);
} catch (Exception e1) {
e1.printStackTrace(); // report the failure instead of silently swallowing it
}
}
I have an HTML page that I am reading.
If the element I am looking for is not present on that page, I want to skip it and continue with the next page, but that is not working.
Can you please let me know what I am missing?
try
{
Document doc = Jsoup.connect(urlget).get();
Element tables = doc.select("div.itembody");
websiteaddress= tables.text();
}
catch (IOException ee)
{
}
If the page does not contain an itembody div, I get an exception:
Exception in thread "main" java.lang.NullPointerException
I want the loop to continue rather than the program exiting when there is an exception.
doc.select returns an object of type Elements (a list of Element objects), not Element. If no element in your HTML matches the query, you get an empty list, and calling first() on it returns null. Change your code to:
try
{
Document doc = Jsoup.connect(urlget).get();
Elements tables = doc.select("div.itembody");
if(tables.isEmpty())
noDivItembodyInHTML();
else
websiteaddress = tables.first().text();
}
catch (IOException ee)
{
}
I have a Swing application that reads HTML pages using the following code:
String urlzip = null;
try {
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
if (link.attr("abs:href").contains("BcfiHtm.zip")) {
urlzip = link.attr("abs:href").toString();
}
}
} catch (IOException e) {
textAreaStatus.append("Failed to get new file from internet:"+e.getMessage()+"\n");
e.printStackTrace();
}
return urlzip;
My Swing application then returns a string. It works fine and reads any HTML page that I give it. However, sometimes the application gives me an error of type "Exception report". How can I increase the timeout?
The timeout is set on the connection, in milliseconds:
Jsoup.connect("http://example.com").timeout(3000)
This error occurs when reading the data cannot complete in time, either because the response is large or because of a connection problem. I would suggest increasing the timeout to at least 1 minute, using the method shown above, like this:
Jsoup.connect("http://example.com").timeout(60000);