WebScraping with HTML Unit Issue with apache lang3 - java

UPDATE: I ended up using ghost.py, but I would still appreciate a response.
I have been using plain Java with Apache httpd and NIO to crawl most pages recently, but I came across what I expected to be a simple issue and apparently is not. I am trying to use HtmlUnit to crawl a page, but every time I run the code below I get the error shown after the code, which seems to say a jar is missing. Unfortunately, I could not find my answer here, as there is a weird part to this question.
So, here is the weird part. I have the jar (lang3), it is up to date, and it contains a method StringUtils.startsWithIgnoreCase(String string, String prefix) that works. I would really like to avoid Selenium, as I need to crawl (if my sampling is right) about 1000 pages on the same site over several months.
Is there a particular version I need? All I saw was the note to update to 3-1, which I have. Is there a method of installation that works?
Thanks.
The code I am running is:
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.RefreshHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
public class crawl {
public crawl()
{
//TODO Constructor
crawl_page();
}
public void crawl_page()
{
//TODO control the crawling
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_10);
webClient.setRefreshHandler(new RefreshHandler() {
public void handleRefresh(Page page, URL url, int arg) throws IOException {
System.out.println("handleRefresh");
}
});
//the url for CA's Megan's law sex off
String url = "http://www.myurl.com"; // not my url
HtmlPage page;
try {
page = (HtmlPage) webClient.getPage(url);
HtmlForm form=page.getFormByName("_ctl0");
form.getInputByName("cbAgree").setChecked(true);
page=form.getButtonByName("Continue").click();
System.out.println(page.asText());
} catch (FailingHttpStatusCodeException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
The error is:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.lang3.StringUtils.startsWithIgnoreCase(Ljava/lang/CharSequence;Ljava/lang/CharSequence;)Z
at com.gargoylesoftware.htmlunit.util.URLCreator$URLCreatorStandard.toUrlUnsafeClassic(URLCreator.java:66)
at com.gargoylesoftware.htmlunit.util.UrlUtils.toUrlUnsafe(UrlUtils.java:193)
at com.gargoylesoftware.htmlunit.util.UrlUtils.toUrlSafe(UrlUtils.java:171)
at com.gargoylesoftware.htmlunit.WebClient.<clinit>(WebClient.java:159)
at ca__soc.crawl.crawl_page(crawl.java:34)
at ca__soc.crawl.<init>(crawl.java:24)
at ca__soc.us_ca_ca_soc.main(us_ca_ca_soc.java:17)

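According to the documentation for StringUtils.startsWithIgnoreCase:
Since:
2.4, 3.0 Changed signature from startsWithIgnoreCase(String, String) to startsWithIgnoreCase(CharSequence, CharSequence)
HtmlUnit calls the (CharSequence, CharSequence) variant, as the stack trace shows, while the jar you describe has the (String, String) one. So you probably have two similar jars on your classpath, and the older one is being picked up at runtime.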
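One way to check which jar a class really comes from at runtime is to ask the JVM for the class's code source. The sketch below is only an illustration: the class name JarLocator and the method locate are made up for this example, and you would pass org.apache.commons.lang3.StringUtils as the argument when running it with your crawler's classpath.

```java
import java.security.CodeSource;

class JarLocator {

    // Returns the jar or directory a class was loaded from, or a fixed
    // marker string when the JVM exposes no code source (as it does for
    // core classes such as java.lang.String).
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Assumption: the fully qualified class name is given as the first
        // argument, e.g. org.apache.commons.lang3.StringUtils.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + locate(Class.forName(name)));
    }
}
```

If the printed location is not the lang3 jar you expect, that jar (or whatever bundles an older copy of the class) is the one to remove from the classpath.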

Related

Cannot submit a website form through Selenium

This is the second post on Stack Overflow on my quest to access this godforsaken website: https://portal.mcpsmd.org/guardian/home.html
import org.openqa.selenium.By;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
public class WebAccessor {
public static void main(String[] args) {
WebDriver driver = new HtmlUnitDriver();
driver.get("https://portal.mcpsmd.org/public/");
System.out.println(driver.getCurrentUrl());
// Find the text input element by its name
WebElement username = driver.findElement(By.id("fieldAccount"));
WebElement password = driver.findElement(By.id("fieldPassword"));
// Enter something to search for
username.sendKeys("");
password.sendKeys("");
WebElement submitBtn = driver.findElement(By.id("btn-enter"));
submitBtn.click();
System.out.println(driver.getCurrentUrl());
driver.quit();
}
}
This code is tested and works on Facebook.
I am sure that my button is being pressed as when I click submit, the site URL changes from
https://portal.mcpsmd.org/public/
to
https://portal.mcpsmd.org/guardian/home.html
When I type in a username and password (the actual credentials cannot be disclosed, for obvious reasons), the page tacks another 20 or so characters onto the end of the password field. (You can see this by typing in any random username and password and clicking submit.)
This has led me to believe there is some sort of front-end encryption going on. Is there any feasible way to log in?
Many thanks in advance.
Due to the lack of credentials, my answer is just a guess.
But I think you should request the landing page after logging in, with a small tweak to avoid exceptions, like this:
import java.io.IOException;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class WebAccessor {
public static void main(String[] args) {
WebClient WEB_CLIENT = new WebClient(BrowserVersion.CHROME);
WEB_CLIENT.getCookieManager().setCookiesEnabled(true);
HtmlPage loginPage;
try {
loginPage = WEB_CLIENT.getPage("https://portal.mcpsmd.org/public/");
HtmlForm loginForm = loginPage.getFirstByXPath("//form[@id='LoginForm']");
loginForm.getInputByName("account").setValueAttribute("YOURUSERNAME");
loginForm.getInputByName("pw").setValueAttribute("YOURPASSWORD");
loginForm.getElementsByTagName("button").get(0).click();
HtmlPage landing = WEB_CLIENT.getPage("https://portal.mcpsmd.org/guardian/home.html#/termGrades");
System.out.println(landing.getTitleText());
} catch (FailingHttpStatusCodeException e) {
// TODO Auto-generated catch block
//e.printStackTrace();
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
//e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
//e.printStackTrace();
}
}
}
My output is: Student and Parent Sign In. But if you set correct attributes, it should be ok.

validate webhook using java Event.validateReceivedEvent always fails signature validation

I prepared a servlet in my web site to be notified from PayPal webhook. The development version of the servlet logs the http headers and the body. Here is a screen capture with one example:
I've created a "self contained test application" that shows the problem.
package com.rsws.renew;
import java.io.InputStream;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import java.security.SignatureException;
import java.util.HashMap;
import java.util.Map;
import com.paypal.api.payments.Event;
import com.paypal.base.Constants;
import com.paypal.base.rest.APIContext;
import com.paypal.base.rest.PayPalRESTException;
import com.paypal.base.rest.PayPalResource;
/**
* @author Ignacio
*
*/
public class TestWebHook {
public static void main(String[] argv) {
try {
InputStream is = TestWebHook.class
.getResourceAsStream("/sdk_config.properties");
try {
PayPalResource.initConfig(is);
} catch (PayPalRESTException e) {
e.printStackTrace();
}
APIContext apiContext = new APIContext();
Map<String, String> map = new HashMap<>(PayPalResource.getConfigurations());
apiContext.setConfigurationMap(map);
Map<String,String> headers = new HashMap<String,String>();
// this is the data provided by PayPal sandbox
map.put(Constants.PAYPAL_WEBHOOK_ID, "3W2725225F637605K");
String payload = "{\"id\":\"WH-0T490472X6099635W-7LJ29748BW389372K\",\"create_time\":\"2015-09-25T23:14:14Z\",\"resource_type\":\"invoices\",\"event_type\":\"INVOICING.INVOICE.PAID\",\"summary\":\"An invoice was created\",\"resource\":{\"id\":\"INV2-8FSD-3HT6-BRHR-UHYV\",\"number\":\"MM00063\",\"status\":\"PAID\",\"merchant_info\":{\"email\":\"example#outlook.com\",\"first_name\":\"Dennis\",\"last_name\":\"Doctor\",\"business_name\":\"Medical Professional LLC\",\"address\":{\"line1\":\"1234 Main St\",\"line2\":\"Apt 302\",\"city\":\"Portland\",\"state\":\"OR\",\"postal_code\":\"97217\",\"country_code\":\"US\"}},\"billing_info\":[{\"email\":\"example#example.com\",\"business_name\":\"Medical Professionals LLC\",\"language\":\"en_US\"}],\"items\":[{\"name\":\"Sample Item\",\"quantity\":1,\"unit_price\":{\"currency\":\"USD\",\"value\":\"1.00\"},\"unit_of_measure\":\"QUANTITY\"}],\"invoice_date\":\"2015-09-28 PDT\",\"payment_term\":{\"term_type\":\"DUE_ON_RECEIPT\",\"due_date\":\"2015-09-28 PDT\"},\"tax_calculated_after_discount\":true,\"tax_inclusive\":false,\"total_amount\":{\"currency\":\"USD\",\"value\":\"1.00\"},\"payments\":[{\"type\":\"PAYPAL\",\"transaction_id\":\"22592127VV907111U\",\"transaction_type\":\"SALE\",\"method\":\"PAYPAL\",\"date\":\"2015-09-28 14:37:13 PDT\"}],\"metadata\":{\"created_date\":\"2015-09-28 14:35:46 PDT\",\"last_updated_date\":\"2015-09-28 14:37:13 PDT\",\"first_sent_date\":\"2015-09-28 14:35:47 PDT\",\"last_sent_date\":\"2015-09-28 14:35:47 
PDT\"},\"paid_amount\":{\"paypal\":{\"currency\":\"USD\",\"value\":\"1.00\"}},\"links\":[{\"rel\":\"self\",\"href\":\"https://api.paypal.com/v1/invoicing/invoices/INV2-8FSD-3HT6-BRHR-UHYV\",\"method\":\"GET\"}]},\"links\":[{\"href\":\"https://api.paypal.com/v1/notifications/webhooks-events/WH-0T490472X6099635W-7LJ29748BW389372K\",\"rel\":\"self\",\"method\":\"GET\"},{\"href\":\"https://api.paypal.com/v1/notifications/webhooks-events/WH-0T490472X6099635W-7LJ29748BW389372K/resend\",\"rel\":\"resend\",\"method\":\"POST\"}]}";
headers.put("PAYPAL-CERT-URL", "https://api.paypal.com/v1/notifications/certs/CERT-360caa42-fca2a594-df8cd2d5");
headers.put("PAYPAL-TRANSMISSION-ID", "464163d0-e0ae-11e5-af72-51ae350aaff1");
headers.put("PAYPAL-TRANSMISSION-TIME", "2016-03-02T19:38:01Z");
headers.put("PAYPAL-AUTH-ALGO", "SHA256withRSA");
headers.put("PAYPAL-TRANSMISSION-SIG", "S3AjY87GLp1MP/UsGAWPoEes+laa7xbV4X7pMi9PdC0QR7MoNC/L/O2UThAh1IBzDZ5DGXvkEDvXK9fF0IfoS2QtLJUBm5+UFoo1jJMlH+QCiJUEHSuio2UrFGbxoqaIPcA1PN0tmd5FwikDRPCnpht6pvMvCZV1FEQbBMr9ld3d3XoWBKeWQG+oxAWSTNYJiKQIrM6l/8+hKVQ1LZID8dtR3c7y6eFxNFsDQ3WgwChZZ15vpyhDWQ4t08m3PsWFyjvsQmNRyXQyUeAC8xw96sBwGmHsgwKJwbAamVrWicQqQ/tXuUcqx9Y0pg3P4LuGNPFKzktq9L3ZImTEJxpRLA==");
// this shows invalid
System.out.println(Event.validateReceivedEvent(apiContext, headers, payload) ? "valid" : "invalid");
// this is the data provided in the sdk examples https://github.com/paypal/PayPal-Java-SDK/blob/master/rest-api-sdk/src/test/java/com/paypal/base/ValidateCertTest.java
map.put(Constants.PAYPAL_WEBHOOK_ID, "3RN13029J36659323");
payload = "{\"id\":\"WH-2W7266712B616591M-36507203HX6402335\",\"create_time\":\"2015-05-12T18:14:14Z\",\"resource_type\":\"sale\",\"event_type\":\"PAYMENT.SALE.COMPLETED\",\"summary\":\"Payment completed for $ 20.0 USD\",\"resource\":{\"id\":\"7DW85331GX749735N\",\"create_time\":\"2015-05-12T18:13:18Z\",\"update_time\":\"2015-05-12T18:13:36Z\",\"amount\":{\"total\":\"20.00\",\"currency\":\"USD\"},\"payment_mode\":\"INSTANT_TRANSFER\",\"state\":\"completed\",\"protection_eligibility\":\"ELIGIBLE\",\"protection_eligibility_type\":\"ITEM_NOT_RECEIVED_ELIGIBLE,UNAUTHORIZED_PAYMENT_ELIGIBLE\",\"parent_payment\":\"PAY-1A142943SV880364LKVJEFPQ\",\"transaction_fee\":{\"value\":\"0.88\",\"currency\":\"USD\"},\"links\":[{\"href\":\"https://api.sandbox.paypal.com/v1/payments/sale/7DW85331GX749735N\",\"rel\":\"self\",\"method\":\"GET\"},{\"href\":\"https://api.sandbox.paypal.com/v1/payments/sale/7DW85331GX749735N/refund\",\"rel\":\"refund\",\"method\":\"POST\"},{\"href\":\"https://api.sandbox.paypal.com/v1/payments/payment/PAY-1A142943SV880364LKVJEFPQ\",\"rel\":\"parent_payment\",\"method\":\"GET\"}]},\"links\":[{\"href\":\"https://api.sandbox.paypal.com/v1/notifications/webhooks-events/WH-2W7266712B616591M-36507203HX6402335\",\"rel\":\"self\",\"method\":\"GET\"},{\"href\":\"https://api.sandbox.paypal.com/v1/notifications/webhooks-events/WH-2W7266712B616591M-36507203HX6402335/resend\",\"rel\":\"resend\",\"method\":\"POST\"}]}";
headers.put("PAYPAL-CERT-URL", "https://api.sandbox.paypal.com/v1/notifications/certs/CERT-360caa42-fca2a594-a5cafa77");
headers.put("PAYPAL-TRANSMISSION-ID", "b2384410-f8d2-11e4-8bf3-77339302725b");
headers.put("PAYPAL-TRANSMISSION-TIME", "2015-05-12T18:14:14Z");
headers.put("PAYPAL-AUTH-ALGO", "SHA256withRSA");
headers.put("PAYPAL-TRANSMISSION-SIG", "vSOIQFIZQHv8G2vpbOpD/4fSC4/MYhdHyv+AmgJyeJQq6q5avWyHIe/zL6qO5hle192HSqKbYveLoFXGJun2od2zXN3Q45VBXwdX3woXYGaNq532flAtiYin+tQ/0pNwRDsVIufCxa3a8HskaXy+YEfXNnwCSL287esD3HgOHmuAs0mYKQdbR4e8Evk8XOOQaZzGeV7GNXXz19gzzvyHbsbHmDz5VoRl9so5OoHqvnc5RtgjZfG8KA9lXh2MTPSbtdTLQb9ikKYnOGM+FasFMxk5stJisgmxaefpO9Q1qm3rCjaJ29aAOyDNr3Q7WkeN3w4bSXtFMwyRBOF28pJg9g==");
// this shows valid
System.out.println(Event.validateReceivedEvent(apiContext, headers, payload) ? "valid" : "invalid");
} catch (InvalidKeyException e) {
e.printStackTrace();
} catch (NoSuchAlgorithmException e) {
e.printStackTrace();
} catch (SignatureException e) {
e.printStackTrace();
} catch (PayPalRESTException e) {
e.printStackTrace();
}
}
}
The code prints "valid" when the data is taken from the SDK examples and "invalid" when the data comes from the PayPal website.
I wonder why the second event cannot be validated. Any help is welcome.
You may want to test the validation with actual sandbox transactions and webhook events. The simulator's mock data may not be signed with the current sandbox algorithm, so the simulator is recommended only for testing whether your script's URL is reachable.

Run Java program inside PHP code [duplicate]

This question already has answers here:
How to run java code (.class) using php and display on the same web page
(2 answers)
Closed 7 years ago.
I am trying to make a simple recommender system, and I found that with Mahout it is pretty easy to make one. I have the following code (I am running it in Eclipse and everything works great):
package com.predictionmarketing.RecommenderApp;
import java.io.File;
import java.io.IOException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
/**
* Java's application, user based recommender system
*
*/
public class App
{
public static void main( String[] args )
{
// Model
DataModel model = null;
// Initialize similarity
UserSimilarity similarity = null;
// Read the .csv file: userID, itemID, value
try {
model = new FileDataModel(new File("data/dataset.csv"));
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
// Compute the similarity matrix
try {
similarity = new PearsonCorrelationSimilarity(model);
} catch (TasteException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
java.util.List<RecommendedItem> recommendations = null;
try {
recommendations = recommender.recommend(2, 3);
} catch (TasteException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// Show recommendations
for (RecommendedItem recommendation : recommendations) {
System.out.println(recommendation.getItemID());
}
}
}
However, I need to run this code online because I am building the application in PHP, and that is where my problem arises. Is there a way to run this code from PHP so that I can use the "recommendations" variable?
You can run this Java code (compiled first) from PHP with shell_exec.
But a better solution is to build a REST service (or something similar), so that the integration is language agnostic.
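For the shell_exec route, the Java side needs to print its result in a form PHP can parse. Here is a minimal sketch, assuming the recommendations are written to standard output as a JSON array. The class name RecommendCli is made up, and recommend is a hypothetical stand-in that returns fixed item IDs; a real program would call the Mahout recommender from the question at that point.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class RecommendCli {

    // Hypothetical stand-in for recommender.recommend(userId, howMany).
    // It returns fixed placeholder item IDs so the sketch runs without Mahout.
    static List<Long> recommend(long userId, int howMany) {
        List<Long> placeholder = Arrays.asList(101L, 102L, 103L);
        return placeholder.subList(0, Math.min(howMany, placeholder.size()));
    }

    // Renders the item IDs as a JSON array of numbers.
    static String toJson(List<Long> itemIds) {
        return itemIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        // The user ID comes from the command line; it defaults to 2,
        // the user ID used in the question.
        long userId = args.length > 0 ? Long.parseLong(args[0]) : 2L;
        System.out.println(toJson(recommend(userId, 3)));
    }
}
```

On the PHP side, the string returned by shell_exec for a command such as java RecommendCli 2 could then be passed to json_decode. The exact command line, including the classpath, depends on your setup.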
There is no simple solution for this. To make it communicate with PHP, you have to create some interface for it. For example, create a Java servlet and deploy it in a servlet container (a Java web server). This is the simplest approach I can see right now.
Another solution you could consider is a REST or SOAP service to exchange data between this Java code and your PHP application. This will also need a Java EE container.

How to catch all links that are in Windows clipboard?

I want to write a method that captures all the links in the Windows clipboard when I select and copy HTML text, but I have not found any example of how to do it.
I already know how to get a string from the clipboard, but when I try to print it (or paste it), I lose the formatting (and the associated href).
Any ideas?
@VGR: I found your answer very helpful. I used it to create this class, which copies all the HTML data. Now I just need to write or find a parser method to extract the links, and the problem is solved.
import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.UnsupportedFlavorException;
import java.io.IOException;
public class main
{
public static void main(String[] args)
{
Clipboard clipboard=Toolkit.getDefaultToolkit().getSystemClipboard();
DataFlavor df=DataFlavor.allHtmlFlavor;
try
{
System.out.println("HTML of selected text="+clipboard.getData(df));
}
catch(UnsupportedFlavorException|IOException exception)
{
exception.printStackTrace();
}
}
}
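For the remaining parsing step, here is a minimal sketch that pulls href values out of an HTML string with a regular expression. LinkExtractor and extractLinks are made-up names. A regex only approximates HTML parsing and assumes quoted href attributes inside <a> tags, so a real HTML parser would be more robust for messy markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every href value found in the given HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }
}
```

With the class above, the value returned by clipboard.getData(DataFlavor.allHtmlFlavor) can be cast to String and passed to extractLinks.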

FTPClient (commons net) Upload problem

I use the following piece of code to upload a photo to a ftp host. But the photo seems to be corrupted after being uploaded:
There are narrow gray lines at the bottom of the photo.
The size of gray lines could be decreased by decreasing the Buffer Size of the FTPClient object.
import java.io.File;
import java.io.FileInputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.net.ftp.FTPClient;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPReply;
public class FtpConnectDemo1 {
public static void main(String[] args) {
FTPClient client = new FTPClient();
try {
client.connect("ftp.ftpsite.com");
//
// When login success the login method returns true.
//
boolean login = client.login("user@ftpsite.com", "pass");
if (login) {
System.out.println("Login success...");
int reply = client.getReplyCode();
if (FTPReply.isPositiveCompletion(reply)) {
File file = new File("C:\\Users\\e.behravesh\\Pictures\\me2_rect.jpg");
FileInputStream input = new FileInputStream(file);
client.setFileType(FTP.BINARY_FILE_TYPE);
if (!client.storeFile(file.getName(), input)) {
System.out.println("upload failed!");
}
input.close();
}
//
// When logout success the logout method returns true.
//
boolean logout = client.logout();
if (logout) {
System.out.println("Logout from FTP server...");
}
} else {
System.out.println("Login fail...");
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
//
// Closes the connection to the FTP server
//
client.disconnect();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
This is a known bug that was fixed in a newer version of the library:
http://commons.apache.org/net/changes-report.html#a3.0.1
I have never heard of corruption of that type, but are you uploading from behind a firewall? Try calling client.enterLocalPassiveMode(); before calling storeFile.
I've just tried your code on my local computer and it works. I didn't see any gray lines.
So I guess this is either a passive mode issue, as Femi suggests, or some network, firewall, or lower-level problem.
Probably late, but this could help someone avoid wasting time.
Check the configuration file and permissions! On Unix with vsftpd, check that
write_enable=YES
is uncommented.
Check with another FTP client whether it is possible to upload files.
FTP file transfer is not atomic, meaning that if the connection drops, only part of the file will have been sent. I suggest uploading under a temporary name and renaming the file once the transfer has finished, so you know when it is complete.
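The upload-then-rename idea can be sketched as follows. To keep the example runnable without commons-net, the two FTP calls sit behind a hypothetical Remote interface; FTPClient.storeFile(String, InputStream) and FTPClient.rename(String, String) have the same shape, so an adapter around the client from the question could implement it. The names AtomicUpload, Remote, and the .part suffix are made up for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;

class AtomicUpload {

    // Hypothetical minimal view of an FTP connection, mirroring the
    // storeFile and rename methods of commons-net's FTPClient.
    interface Remote {
        boolean storeFile(String remoteName, InputStream data) throws IOException;
        boolean rename(String from, String to) throws IOException;
    }

    // Uploads under a temporary name and renames only after the transfer
    // reported success, so the final name never refers to a partial file.
    static boolean upload(Remote remote, String finalName, InputStream data)
            throws IOException {
        String tempName = finalName + ".part";
        if (!remote.storeFile(tempName, data)) {
            return false; // transfer failed; the final name was never created
        }
        return remote.rename(tempName, finalName);
    }
}
```

A consumer that only looks for the final file name then never picks up a half-written upload. Leftover .part files indicate interrupted transfers.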
