HTMLunit suppress errors: deprecated? - java

I am trying to suppress the JavaScript errors that HTMLunit almost always shows when loading a page.
But strangely enough, the following code does not work:
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlPasswordInput;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;
public class HttpClientLogin {
public static void main(String[] args) throws Exception
{
HttpClientLogin logInNow = new HttpClientLogin();
logInNow.loadPage();
}
public void loadPage() throws Exception {
WebClient webClient = new WebClient();
HtmlPage currentPage = webClient.getPage("the url link here");
webClient.setThrowExceptionOnFailingStatusCode(false);
String textSource = currentPage.asText();
String xmlSource = currentPage.asXml();
System.out.println(xmlSource);
}
}
It gives the following error:
The method setThrowExceptionOnFailingStatusCode(boolean) is undefined for the type WebClient
Are these methods deprecated or am I using the wrong package?

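The method setThrowExceptionOnFailingStatusCode(boolean) is defined in the WebClientOptions class, not in WebClient. Get the options from the client and call it there, before getPage, so that the setting applies to the request: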
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/WebClientOptions.html#setThrowExceptionOnFailingStatusCode(boolean)
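Since the question is about silencing the JavaScript noise in general, a complementary sketch is to turn down the logging itself. It assumes that HtmlUnit's messages end up in java.util.logging (Commons Logging typically falls back to it when no other backend such as Log4j is on the classpath) and that the loggers live under the com.gargoylesoftware namespace, as the imports in the question suggest. The class name QuietHtmlUnitLogging is made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of HtmlUnit.
final class QuietHtmlUnitLogging {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger HTMLUNIT_ROOT = Logger.getLogger("com.gargoylesoftware");

    private QuietHtmlUnitLogging() {
    }

    // Turns off every java.util.logging logger below com.gargoylesoftware
    // that does not set a level of its own.
    static void silence() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Assumed logger name, used only to show that children inherit OFF.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.javascript");
        System.out.println("SEVERE still logged? " + child.isLoggable(Level.SEVERE));
    }
}
```

Call QuietHtmlUnitLogging.silence() before creating the WebClient. This only hides log output; whether a failing status code throws an exception is still controlled by the WebClientOptions setter shown above.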

The method setThrowExceptionOnFailingStatusCode(boolean) is defined on both the WebClient and WebClientOptions classes.

Related

Get string with translation from Google translate

I need to integrate the Google Translate API into my project, but I'm new to this and can't figure out how to do it properly. The code below is just an example.
What happens now when I launch it: the program waits a few seconds for input and then closes.
What I want: I type my input and get the translation in the console (and in an array, if possible).
I also created a "libs" folder and added gson-2.8.5.jar to it.
Thank you in advance.
package com.company;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
public class Connect {
public void gogo() throws IOException, InterruptedException {
String query = "key=AIzaSyB2HijQLlsmI1udH9ARl45oC5eAj4XfjTw"
+"&source=en"
+"&target=uk"
+"&q=hello";
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create("https://www.googlepis.com/language/translate/v2?"+ query))
.header("Referer", "https://www.daytranslations.com/free-translation-online/")
.GET()
.build();
String responseJson = HttpClient.newHttpClient()
.send(request, HttpResponse.BodyHandlers.ofString())
.body();
System.out.println(responseJson);
}
}
package com.company;
import java.io.IOException;
public class Main {
public static void main(String[] args) throws IOException, InterruptedException {
Connect connect = new Connect();
connect.gogo();
}
}
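There seems to be a typo in your request URL (googlepis instead of googleapis). Try "https://www.googleapis.com/language/translate/v2?" instead.
A basic way to use Gson to deserialize the API response is shown below. Note that the static JsonParser.parseString method requires Gson 2.8.6 or later; with the gson-2.8.5.jar from the question, new JsonParser().parse(responseJson) is the equivalent call: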
JsonParser.parseString(responseJson).getAsJsonObject()
.get("data").getAsJsonObject()
.get("translations").getAsJsonArray()
.get(0).getAsJsonObject()
.get("translatedText").getAsString();
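On the request side, the question builds the query string by plain concatenation, which only works while the text contains no spaces or reserved characters. The sketch below shows one way to assemble the same four parameters (key, source, target, q) with form-style encoding, using only the JDK. The class name TranslateQuery and the placeholder key are assumptions made for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any Google client library.
final class TranslateQuery {

    private static final String ENDPOINT = "https://www.googleapis.com/language/translate/v2";

    private TranslateQuery() {
    }

    // Builds the GET URI, encoding every parameter value.
    static URI build(String apiKey, String source, String target, String text) {
        String query = "key=" + encode(apiKey)
                + "&source=" + encode(source)
                + "&target=" + encode(target)
                + "&q=" + encode(text);
        return URI.create(ENDPOINT + "?" + query);
    }

    private static String encode(String value) {
        // Form-style encoding: spaces become '+', reserved characters become %XX.
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // YOUR_API_KEY is a placeholder, not a working key.
        System.out.println(build("YOUR_API_KEY", "en", "uk", "hello world & more"));
    }
}
```

The resulting URI can be passed to HttpRequest.newBuilder().uri(...) in the same way as in the question.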

Web Scraping with Java using HTMLUnit

I am trying to scrape https://www.nba.com/standings#/
I am using page.getByXPath("//caption[@class='standings__header']/span"),
which should return Eastern Conference and Western Conference, but it returns nothing. Is my XPath wrong?
Here is my code:
package Standings;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSpan;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class Standings {
private static final String baseUrl = "https://www.nba.com/standings#/";
public static void main(String[] args) {
WebClient client = new WebClient();
client.getOptions().setJavaScriptEnabled(false);
client.getOptions().setCssEnabled(false);
client.getOptions().setUseInsecureSSL(true);
String jsonString = "";
ObjectMapper mapper = new ObjectMapper();
try {
HtmlPage page = client.getPage(baseUrl);
System.out.println(page.asXml());
page.getByXPath("//caption[@class='standings__header']/span");
} catch (IOException e) {
e.printStackTrace();
}
}
}
I used this code to reproduce your problem:
public static void main(String[] args) throws IOException {
final String url = "https://www.nba.com/standings#/";
try (final WebClient webClient = new WebClient()) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10000);
System.out.println(page.asXml());
}
}
When running this, I got a bunch of warnings and errors in the log.
(BTW: the page also produces many errors and warnings in real browsers. It seems the maintainer of the page has an interesting view on quality.)
I guess the problematic error is this one:
TypeError: Cannot modify readonly property: constructor. (https://www.nba.com/ng/game/main.js#1)
There is a known bug in the JavaScript support of HtmlUnit (https://sourceforge.net/p/htmlunit/bugs/1897/). Because the error is thrown from main.js, I guess it stops the processing of the page's JavaScript before the content you are looking for is generated.
So far I have found no time to fix this (it looks like this has to be fixed in Rhino), but it is on the list.
Have a look at https://twitter.com/HtmlUnit to stay informed about updates.
The page you are trying to scrape needs JavaScript to display properly. If you disable it, most of the elements won't load.
Changing the line
client.getOptions().setJavaScriptEnabled(false);
to
client.getOptions().setJavaScriptEnabled(true);
should do the trick
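One way to separate the two possible causes (a wrong XPath versus content that was never rendered) is to run the expression against a small static fragment. The sketch below uses the JDK's own javax.xml.xpath rather than HtmlUnit, and the SAMPLE markup is an assumption about what the rendered tables might look like, not a copy of the real page. The class name XPathCheck is likewise made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; it does not touch the real page.
final class XPathCheck {

    // Hand-written stand-in for the rendered markup (assumed structure).
    static final String SAMPLE = "<div>"
            + "<table><caption class='standings__header'><span>Eastern Conference</span></caption></table>"
            + "<table><caption class='standings__header'><span>Western Conference</span></caption></table>"
            + "</div>";

    private XPathCheck() {
    }

    // Returns the text content of every node matched by the expression.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "//caption[@class='standings__header']/span"));
    }
}
```

If the expression matches here but page.getByXPath returns nothing, the elements are most likely missing from the DOM that HtmlUnit built, which points at JavaScript rendering rather than at the XPath.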

HtmlUnit (JUnit) test is returning an error in the code

First of all, I should clarify that I am using Google Translate. I am a Spanish speaker and do not know much English.
Here is what I need to do.
I'm trying to make this code work, but it gives me an error, even though I wrote it the same way as on the official website:
Official website: http://htmlunit.sourceforge.net/gettingStarted.html
package serieflv;
import org.junit.Test;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import junit.framework.Assert;
public class webClient {
@Test
public void homePage() throws Exception {
final WebClient webClient = new WebClient();
try (final WebClient webClient = new WebClient()) {
final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
Assert.assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());
final String pageAsXml = page.asXml();
Assert.assertTrue(pageAsXml.contains("<body class=\"composite\">"));
final String pageAsText = page.asText();
Assert.assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
}
}
}
These are the errors it gives me:
You seem to have imported a JUnit 3 class here, while your test case is clearly a JUnit 4 test case.
Change the following line
import junit.framework.Assert;
to
import org.junit.Assert;

Grizzly and ServletContainerContext

I'm trying to get hold of some injected context (for example Session or HttpServletRequest) in a Servlet I've written, running on Grizzly, but nothing I do seems to work. The whole process seems to stall rather prematurely with the following error:
SEVERE: Missing dependency for field: javax.servlet.http.HttpServletRequest com.test.server.LolCat.hsr
The server is dead simple, it consists of two files, the static entry point (Main.java):
package com.test.server;
import java.io.IOException;
import java.net.URI;
import javax.ws.rs.core.UriBuilder;
import org.glassfish.grizzly.http.server.HttpServer;
import com.sun.jersey.api.container.grizzly2.GrizzlyServerFactory;
import com.sun.jersey.api.core.ClassNamesResourceConfig;
import com.sun.jersey.api.core.ResourceConfig;
public class Main {
private static URI getBaseURI() {
return UriBuilder.fromUri("http://localhost/").port(8080).build();
}
public static final URI BASE_URI = getBaseURI();
public static void main(String[] args) throws IOException {
ResourceConfig rc = new ClassNamesResourceConfig(LolCat.class);
HttpServer httpServer = GrizzlyServerFactory.createHttpServer(BASE_URI, rc);
System.in.read();
httpServer.stop();
}
}
and the servlet (LolCat.java):
package com.test.server;
import javax.servlet.http.HttpServletRequest;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.core.Context;
@Path(value = "/lol")
public class LolCat {
@Context HttpServletRequest hsr;
@GET
@Path(value="/cat")
public String list() {
return "meow";
}
}
Specifically, it's the @Context line in the above source file that is the source of, and solution to, all my problems. I need it, and according to everything I've read about Jersey and Servlets it should work, but alas it does not. I've also tried using GrizzlyWebContainerFactory instead of GrizzlyServerFactory, but to no avail.
For reference, the project is compiled with the following dependencies:
org.glassfish.grizzly:grizzly-framework:jar:2.2.21
org.glassfish.grizzly:grizzly-http:jar:2.2.21
org.glassfish.grizzly:grizzly-http-servlet:jar:2.2.21
org.glassfish.grizzly:grizzly-http-server:jar:2.2.21
com.sun.jersey:jersey-server:jar:1.17
com.sun.jersey:jersey-servlet:jar:1.17
com.sun.jersey:jersey-core:jar:1.17
javax.servlet:javax.servlet-api:jar:2.5.0
com.sun.jersey:jersey-grizzly2:jar:1.17
com.sun.jersey:jersey-grizzly2-servlet:jar:1.17
asm:asm:jar:3.3.1
This Main class works fine for me:
package com.test.server;
import com.sun.jersey.api.container.grizzly2.GrizzlyServerFactory;
import java.io.IOException;
import java.net.URI;
import javax.ws.rs.core.UriBuilder;
import com.sun.jersey.api.core.ClassNamesResourceConfig;
import com.sun.jersey.spi.container.servlet.ServletContainer;
import org.glassfish.grizzly.http.server.HttpHandler;
import org.glassfish.grizzly.http.server.HttpServer;
import org.glassfish.grizzly.http.server.Request;
import org.glassfish.grizzly.http.server.Response;
import org.glassfish.grizzly.servlet.ServletRegistration;
import org.glassfish.grizzly.servlet.WebappContext;
public class Main {
private static final String JERSEY_SERVLET_CONTEXT_PATH = "";
private static URI getBaseURI() {
return UriBuilder.fromUri("http://localhost").port(8080).path("/").build();
}
public static final URI BASE_URI = getBaseURI();
public static void main(String[] args) throws IOException {
// Create HttpServer and register dummy "not found" HttpHandler
HttpServer httpServer = GrizzlyServerFactory.createHttpServer(BASE_URI, new HttpHandler() {
@Override
public void service(Request rqst, Response rspns) throws Exception {
rspns.setStatus(404, "Not found");
rspns.getWriter().write("404: not found");
}
});
// Initialize and register Jersey Servlet
WebappContext context = new WebappContext("WebappContext", JERSEY_SERVLET_CONTEXT_PATH);
ServletRegistration registration = context.addServlet("ServletContainer", ServletContainer.class);
registration.setInitParameter(ServletContainer.RESOURCE_CONFIG_CLASS,
ClassNamesResourceConfig.class.getName());
registration.setInitParameter(ClassNamesResourceConfig.PROPERTY_CLASSNAMES, LolCat.class.getName());
registration.addMapping("/*");
context.deploy(httpServer);
System.in.read();
httpServer.stop();
}
}
Try http://localhost:8080/lol/cat in your browser.
You can change JERSEY_SERVLET_CONTEXT_PATH to update Servlet's context-path.
According to the developers' explanation, Grizzly is not fully compliant with JAX-RS 2.0, so there will be no official context injection or wrapping. See Jersey Bug-1960.
This applies to Jersey + Grizzly version 2.7+.
Luckily, there is a way to inject Grizzly request/response objects. It is somewhat tricky, but it works.
A code sample is provided in one of Jersey's unit tests. See the Jersey container test.
So the code fragment will be:
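import javax.inject.Inject;
import javax.inject.Provider;
// Assumed import: the original snippet omitted it; Request is taken to be Grizzly's request type
import org.glassfish.grizzly.http.server.Request;
public class SomeClass {
@Inject
private Provider<Request> grizzlyRequestProvider;
public void method() {
if (grizzlyRequestProvider != null) {
Request httpRequest = grizzlyRequestProvider.get();
// Extract what you need
}
}
}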
This works fine for both filters and service methods.
You can also manually register a ResourceContext:
HttpServer httpServer = GrizzlyHttpServerFactory.createHttpServer(getBaseURI());
WebappContext context = new WebappContext("WebappContext", "/api");
ServletRegistration registration = context.addServlet("ServletContainer",
new ServletContainer(config));
registration.addMapping("/*");
context.deploy(httpServer);
Here, config is your resource context.
Try something like this:
public class Main {
private static URI getBaseURI() {
return UriBuilder.fromUri("http://localhost/").port(8080).build();
}
public static void main(String[] args) throws IOException {
ResourceConfig rc = new ResourceConfig().packages("com.example"); // package that contains your resource classes
HttpServer httpServer = GrizzlyHttpServerFactory.createHttpServer(getBaseURI(), rc);
System.in.read();
httpServer.stop();
}
}

How to scrape the images from web pages?

I am using HtmlUnit to scrape the images from web pages. I am a beginner with HtmlUnit. I have written some code, but I don't know how to get the images. Below is my code.
import java.io.*;
import java.net.URL;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class urlscrap {
public static void main(String[] args) throws Exception
{
//WebClient webClient = new WebClient(Opera);
WebClient webClient = new WebClient();
HtmlPage currentPage = (HtmlPage) webClient.getPage(new URL("http://www.google.com"));
System.out.println(currentPage.asText());
//webClient.closeAllWindows();
}
}
Does this work for you?
import java.net.URL;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlImage;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class urlscrap {
public static void main(String[] args) throws Exception
{
//WebClient webClient = new WebClient(Opera);
WebClient webClient = new WebClient();
HtmlPage currentPage = (HtmlPage) webClient.getPage(new URL("http://www.google.com"));
// get a list of all img elements
final List<?> images = currentPage.getByXPath("//img");
for (Object imageObject : images) {
HtmlImage image = (HtmlImage) imageObject;
System.out.println(image.getSrcAttribute());
}
//webClient.closeAllWindows();
}
}
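As far as I know, getSrcAttribute returns the attribute as it is written in the page, so the values may be relative paths. A minimal sketch of turning them into absolute URLs with the JDK's java.net.URI follows; the class name ImageUrlResolver and the example.com addresses are made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of HtmlUnit.
final class ImageUrlResolver {

    private ImageUrlResolver() {
    }

    // Resolves each src value against the URL of the page it came from.
    // URI.resolve throws IllegalArgumentException for values that are not
    // valid URIs (for example, ones containing unescaped spaces).
    static List<URI> resolveAll(URI pageUrl, List<String> srcValues) {
        List<URI> resolved = new ArrayList<>();
        for (String src : srcValues) {
            resolved.add(pageUrl.resolve(src.trim()));
        }
        return resolved;
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.example.com/gallery/index.html");
        List<String> srcValues = List.of("img/cat.png", "/logo.png", "http://cdn.example.org/a.gif");
        for (URI image : resolveAll(page, srcValues)) {
            System.out.println(image);
        }
    }
}
```

In the loop above, the strings printed from image.getSrcAttribute() could be collected into a list and passed to resolveAll together with the page URL; the absolute URLs can then be downloaded with whatever HTTP client you prefer.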
Looks like you're getting the text of the page, which is indeed the first step. What's your question? Are you having a problem finding all the images referenced within the page? I recommend looking up how to do DOM parsing in Java, and use it to extract all the img tags from the page.
If you don't mind switching languages, then I would recommend Python's Scrapy. It is the best framework I've used so far to scrape web content, including images (it can even create thumbnails for you automatically). Personally, I would not use Java for such tasks.
