I would like to use crawler4j to crawl only URLs that start with a certain prefix.
For example, a URL that starts with http://url1.com/timer/image is valid, e.g. http://url1.com/timer/image/text.php.
This URL is not valid: http://test1.com/timer/image
I tried to implement it like that:
public boolean shouldVisit(Page page, WebURL url) {
String href = url.getURL().toLowerCase();
String adrs1 = "http://url1.com/timer/image";
String adrs2 = "http://url2.com/house/image";
if (!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))) {
return false;
}
if (filters.matcher(href).matches()) {
return false;
}
for (String crawlDomain : myCrawlDomains) {
if (href.startsWith(crawlDomain)) {
return true;
}
}
return false;
}
However, this does not seem to work, because the crawler also visits other URLs.
Any recommendation on what I could do?
I appreciate your answer!
Basically, you can keep an array of the prefixes that you want to crawl. Inside your method, just traverse the array and return true only if the URL matches one of the allowed prefixes. That means you don't have to list any domains that you don't want to crawl.
public boolean shouldVisit(Page page, WebURL url) {
String href = url.getURL().toLowerCase();
// prefixes that you want to crawl
String allowedPrefixes[] = {"http://url1.com", "http://url2.com"};
for (String allowedPrefix : allowedPrefixes) {
if (href.startsWith(allowedPrefix)) {
return true;
}
}
return false;
}
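Your code is not working because your condition is incorrect:
(!(href.startsWith(adrs1)) || !(href.startsWith(adrs2)))
A URL can never start with both prefixes, so at least one of the two negated tests is always true and the method returns false for every URL. The two tests have to be combined with && instead of ||.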
Another reason might be that you have not configured crawlerDomains. They are configured during startup of your application by calling CrawlController#setCustomData(crawler1Domains);
Look at sample source code of crawler4j, crawlerDomains are set here: MultipleCrawlerController.java#79
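To see the effect of the operator outside the crawler, the two variants of the condition can be compared in isolation. This is only a minimal sketch using the two prefixes from the question; the class and method names are made up for illustration and are not part of crawler4j.

```java
// Compares the question's condition with a corrected one (illustrative names).
final class PrefixCondition {
    static final String ADRS1 = "http://url1.com/timer/image";
    static final String ADRS2 = "http://url2.com/house/image";

    // Condition from the question: true whenever at least one prefix does not
    // match, which holds for every URL because no URL starts with both prefixes.
    static boolean rejectedByOriginal(String href) {
        return !href.startsWith(ADRS1) || !href.startsWith(ADRS2);
    }

    // Corrected condition: true only when neither prefix matches.
    static boolean rejectedByFixed(String href) {
        return !href.startsWith(ADRS1) && !href.startsWith(ADRS2);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://url1.com/timer/image/text.php",
            "http://url2.com/house/image/a.png",
            "http://test1.com/timer/image"
        };
        for (String href : samples) {
            System.out.println(href
                    + " | original rejects: " + rejectedByOriginal(href)
                    + " | fixed rejects: " + rejectedByFixed(href));
        }
    }
}
```

With the || version every sample is rejected, including the URL the question calls valid; with && only the test1.com URL is rejected.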
Look at the code below. It may help you.
public boolean shouldVisit(Page page,WebURL url) {
String href = url.getURL().toLowerCase();
String adrs1 = "http://url1.com/timer/image";
String adrs2 = "http://url2.com/house/image";
return !FILTERS.matcher(href).matches() && (href.startsWith(adrs1) || href.startsWith(adrs2));
}
I wrote a program in Java to find all pages of a website, starting with the URL of the startpage (using Jsoup as webcrawler). It is ok for small websites but too slow for sites with 200 or more pages:
public class SiteInspector {
private ObservableSet<String> allUrlsOfDomain; // all URLS found for site
private Set<String> toVisit; // pages that were found but not visited yet
private Set<String> visited; // URLS that were visited
private List<String> invalid; // broken URLs
public SiteInspector() {...}
public void getAllWebPagesOfSite(String entry) //entry must be startpage of a site
{
toVisit.add(entry);
allUrlsOfDomain.add(entry);
while(!toVisit.isEmpty())
{
String next = popElement(toVisit);
getAllLinksOfPage(next); //expensive
toVisit.remove(next);
}
}
public void getAllLinksOfPage(String pageURL) {
try {
if (urlIsValid(pageURL)) {
visited.add(pageURL);
Document document = Jsoup.connect(pageURL).get(); //connect to pageURL (expensive network operation)
Elements links = document.select("a"); //get all links from page
for(Element link : links)
{
String nextUrl = link.attr("abs:href"); // "http://..."
if(nextUrl.contains(new URL(pageURL).getHost())) //ignore URLs to external hosts
{
if(!isForbiddenForCrawlers(nextUrl)) // URLS forbidden by robots.txt
{
if(!visited.contains(nextUrl))
{
toVisit.add(nextUrl);
}
}
allUrlsOfDomain.add(nextUrl);
}
}
}
else
{
invalid.add(pageURL); //URL-validation fails
}
}
catch (IOException e) {
e.printStackTrace();
}
}
private boolean isForbiddenForCrawlers(String url){...}
private boolean urlIsValid(String url) {...}
public String popElement(Set<String> set) {...}
}
I know I have to run the expensive network-operation in extra threads.
Document document = Jsoup.connect(pageURL).get(); //connect to pageURL
My problem is that I have no idea how to properly move this operation into other threads while keeping the sets consistent (how do I synchronize?). If possible, I want to use a ThreadPoolExecutor to control the number of threads that are started during the process. Do you have an idea how to solve this? Thanks in advance.
To use threads and still keep the sets consistent, give every thread its own result field: the thread fills that field when it is done, and the main thread copies the results into the Set only after all threads have finished. That way only the main thread ever touches the Set.
A simple example of that could be:
Main.class
Semaphore barrier = new Semaphore(0);
List<WebDownloadThreadHandler> handlers = new ArrayList<>();
for (String link : links) {
WebDownloadThreadHandler handler = new WebDownloadThreadHandler(link, barrier);
handlers.add(handler);
new Thread(handler).start();
}
// blocks until every handler has released its permit
barrier.acquireUninterruptibly(links.size());
for (WebDownloadThreadHandler handler : handlers) {
if (handler.getValidUrl() != null) {
allUrlsOfDomain.add(handler.getValidUrl());
}
}
WebDownloadThreadHandler.class
import java.io.IOException;
import java.util.concurrent.Semaphore;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class WebDownloadThreadHandler implements Runnable {
private final String link;
private final Semaphore barrier;
private String validUrl; // filled in by run(), stays null on failure
public WebDownloadThreadHandler(String link, Semaphore barrier) {
this.link = link;
this.barrier = barrier;
}
public String getValidUrl() {
return validUrl;
}
@Override
public void run() {
try {
Document document = Jsoup.connect(this.link).userAgent("Mozilla/5.0").get();
Elements elements = document.select(YOUR CSS QUERY);
/*
YOUR JSOUP CODE GOES HERE, AND STORE THE VALID URL IN: this.validUrl = THE VALUE YOU GET;
*/
} catch (IOException e) {
e.printStackTrace();
} finally {
// release even on failure, otherwise the main thread waits forever
this.barrier.release();
}
}
}
What you are doing here is creating one thread for every page you want to get the links from, and storing each result in a field of its handler. If you want to retrieve more than one valid link from every page, store a Set in the handler instead and append it to the global set with addAll. The important point is that the thread has to write its result into its own field (this.validUrl): assigning a new value to a String that was passed in as a parameter is not visible to the caller, because Java passes references by value.
Hope it helps! If you need anything else feel free to ask me!
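Since the question asks for a ThreadPoolExecutor, here is an alternative sketch built on a fixed-size pool (Executors.newFixedThreadPool returns a ThreadPoolExecutor). It is only one possible design under a few assumptions: PooledCrawler and fetchLinks are hypothetical names, and fetchLinks stands in for the Jsoup call plus the host and robots.txt filtering from the question. The visited set is a concurrent set, so claiming a URL with add() is atomic and no explicit synchronized block is needed.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Single-use crawler: create a new instance for every crawl.
final class PooledCrawler {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final AtomicInteger pending = new AtomicInteger();   // tasks submitted but not finished
    private final CountDownLatch done = new CountDownLatch(1);   // opened when pending drops to 0
    private final ExecutorService pool;
    private final Function<String, Collection<String>> fetchLinks; // hypothetical page loader

    PooledCrawler(int threads, Function<String, Collection<String>> fetchLinks) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.fetchLinks = fetchLinks;
    }

    Set<String> crawl(String entry) {
        submit(entry);
        try {
            done.await(); // wait until the last task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return visited;
    }

    private void submit(String url) {
        if (!visited.add(url)) {
            return; // another task already claimed this URL
        }
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String next : fetchLinks.apply(url)) {
                    submit(next); // children are counted before this task is uncounted
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website: page -> links found on that page.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/", "/c"));
        site.put("/b", Arrays.asList("/c"));
        PooledCrawler crawler = new PooledCrawler(4,
                url -> site.getOrDefault(url, Collections.emptyList()));
        System.out.println(crawler.crawl("/"));
    }
}
```

The pool size caps how many pages are downloaded at the same time, and the counter plus latch replaces the toVisit set: the crawl is finished exactly when no submitted task is still running.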
We are internationalizing an old application that contains some dirty code. For example, we have a DTO called InstrumentDto:
private String label;
private Quotation quotation;
private ExprQuote quoteExp;
public String getTypeCouponCouru() {
if (this.quoteExp.getCode().equals(Constants.INS_QUOTE_EXPR_PCT_NOMINAL_CPN_INCLUS)
|| this.quoteExp.getCode().equals(Constants.INS_QUOTE_EXPR_PCT_NOMINAL_INTERET)) {
return "Coupon attaché";
} else if(this.quoteExp.getCode().equals(Constants.INS_QUOTE_EXPR_PCT_NOMINAL_PIED_CPN)){
return "Coupon détaché";
} else {
return "";
}
}
public String getFormattedLabel() {
StringBuilder formattedLabel = new StringBuilder(this.label);
Quotation quote = this.quotation;
if (this.quotation != null) {
formattedLabel.append(" ");
formattedLabel.append(FormatUtil.formatDecimal(this.quotation.getCryQuote()));
if (this.quoteExp.getType().equals("PERCENT")) {
formattedLabel.append(" %");
} else {
formattedLabel.append(" ");
formattedLabel.append(this.quotation.getCurrency().getCode());
}
formattedLabel.append(" le ");
formattedLabel.append(DateUtil.formatDate(this.quotation.getValoDate()));
}
return formattedLabel.toString();
}
Then, those methods are used in JSPs. For example, for getFormattedLabel(), we have:
<s:select name = "orderUnitaryForm.fieldInstrument"
id = "fieldInstrument"
list = "orderUnitaryForm.instrumentList"
listKey = "id"
listValue = "formattedLabel" />
IMO, the first method does not belong in the DTO. We expect the view to decide which label to print, and in the view (the JSP) it is no problem to translate those words.
Additionally, this method is only used in 2 JSPs, so it is not a problem to "repeat" the conditional tests.
But it is more difficult for getFormattedLabel(): this method is used in a lot of JSPs, and building the formatted label is "complicated". And it is not possible to have the i18n service in the DTO.
So how can we do that?
Your code in getFormattedLabel() seems to be business logic.
A DTO is a simple object without any complex test/behavior (see wiki definition).
IMO, you should move this chunk of code to your Action and split your *.properties file like this:
Your *.properties:
message1= {0} % le {1}
message2= {0} {1} le {2}
Your Action:
public class MyAction extends ActionSupport {
public String execute(){
//useful code here
InstrumentDto dto = new InstrumentDto();
StringBuilder formattedLabel = new StringBuilder(label);
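if (this.quotation != null) {
// whitespace right after "=" in a .properties file is dropped on load, so add the separator here
formattedLabel.append(" ");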
String cryQuote = FormatUtil.formatDecimal(this.quotation.getCryQuote());
String date = DateUtil.formatDate(this.quotation.getValoDate());
if (this.quoteExp.getType().equals("PERCENT")) {
formattedLabel.append(getText("message1", new String[] { cryQuote, date }));
} else {
String cryCode = this.quotation.getCurrency().getCode();
formattedLabel.append(getText("message2", new String[] { cryQuote, cryCode, date }));
}
}
dto.setFormattedLabel(formattedLabel.toString());
return SUCCESS;
}
}
Hope this will help ;)
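To show how the placeholders in the two messages get filled, here is a small stand-alone sketch that uses java.text.MessageFormat directly instead of Struts' getText. The class QuoteLabelFormatter, its method names and the sample values are assumptions made for illustration only; the bundle is loaded from a string so that the example runs without a .properties file on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class QuoteLabelFormatter {

    // Parses properties syntax from a string (stand-in for a real .properties file).
    static ResourceBundle bundleOf(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds "<label> <localized quote text>"; all values arrive pre-formatted as strings.
    static String format(ResourceBundle bundle, String label, String quote,
                         boolean percent, String currencyCode, String date) {
        String suffix = percent
                ? MessageFormat.format(bundle.getString("message1"), quote, date)
                : MessageFormat.format(bundle.getString("message2"), quote, currencyCode, date);
        return label + " " + suffix;
    }

    public static void main(String[] args) {
        ResourceBundle french = bundleOf("message1= {0} % le {1}\nmessage2= {0} {1} le {2}\n");
        ResourceBundle english = bundleOf("message1={0} % on {1}\nmessage2={0} {1} on {2}\n");
        System.out.println(format(french, "OAT 2025", "101.25", true, "EUR", "12/03/2014"));
        System.out.println(format(english, "OAT 2025", "101.25", false, "EUR", "12/03/2014"));
    }
}
```

Only the bundle changes between locales; the code that decides between the percent and the currency message stays the same.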
In my code I have a lot of instances like this:
if (!valid){
validate();
}
if (valid){
// execute some code
}
and I was wondering if there was a better way to do this? First off, it's annoying to have to write a bunch of these consecutive if statements, and secondly, part of my code in validate() requires that I load a webview with a login page, and then login. Once I've reached the logged in page, I retrieve a value using JavaScript which then changes the value of valid to true if it matches. There's no real convenient until function, and using while(!valid) doesn't quite give me what I want.
Here is my validate()
private void validate(){
class MyJavaScriptInterface {
@JavascriptInterface
public void showHTML(String content) {
// grants access based on authorization level
loggedIn = true;
if(content.contains("OK")){
valid = true;
Toast.makeText(getApplicationContext(), "Log In Successful",
Toast.LENGTH_SHORT).show();
}else {
valid = false;
Toast.makeText(getApplicationContext(), "No Access Granted", Toast.LENGTH_SHORT)
.show();
}
updateMenuTitles();
}
}
// open up the login page
final WebView wv = (WebView)findViewById(R.id.login_webview);
wv.getSettings().setJavaScriptEnabled(true);
wv.addJavascriptInterface(new MyJavaScriptInterface(), "HTMLOUT");
wv.setWebViewClient(new WebViewClient() {
@Override
public void onPageFinished(WebView view, String url) {
//once page is finished loading, check id="role" pass that value to showHTML
if(url.contains(getString(R.string.loginURL))) {
wv.loadUrl("javascript:(function() { " +
"window.HTMLOUT.showHTML(document.getElementById('course-eval-status')" +
".innerHTML);})()");
wv.setVisibility(View.GONE);
closeWebview.setVisibility(View.GONE);
}
}
@Override
public void onReceivedError(WebView view, int errorCode, String description,
String failingUrl) {
Log.w("LoginActivity: ", description);
}
});
wv.loadUrl(getString(R.string.loginURL));
if(!loggedIn) {
wv.setVisibility(View.VISIBLE);
closeWebview.setVisibility(View.VISIBLE);
}else{
closeWebview.setVisibility(View.GONE);
wv.setVisibility(View.GONE);
}
}
Your question is not clear but could this be what you are looking for?
boolean valid = false;
private boolean validate() {
System.out.println("Validating");
return true;
}
public void test() {
if (valid || (valid = validate())) {
System.out.println("Try 1");
}
if (valid || (valid = validate())) {
System.out.println("Try 2");
}
}
This will only call validate once. However - a better mechanism would be something like:
private boolean valid() {
return valid || (valid = validate());
}
public void test2() {
if (valid()) {
System.out.println("Try 1");
}
if (valid()) {
System.out.println("Try 2");
}
}
If validate() changes the value of valid, then I don't really see a way around this. It seems to me that the issue isn't so much syntactical as it is business logic. You mention that your code "doesn't wait around" -- couldn't you change it to make it wait? If not, then perhaps you need some unit tests to better be able to validate your code. Regardless, the issue doesn't seem like it's with the valid flag.
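One way to "make it wait" without blocking is to hand the code that depends on valid to a small gate object, which runs it as soon as validation has succeeded. The following is only a sketch in plain Java: ValidationGate and whenValid are hypothetical names, the validator is passed in as a function that reports its result through a callback (in the question, showHTML would be the place to invoke that callback), and the gate assumes that all calls arrive on one thread, so on Android the result would typically have to be posted back to the UI thread first.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Runs actions only once validation has succeeded; not thread-safe by design.
final class ValidationGate {
    private final Consumer<Consumer<Boolean>> startValidation; // hypothetical async validator
    private final Queue<Runnable> waiting = new ArrayDeque<>();
    private boolean valid;
    private boolean inProgress;

    ValidationGate(Consumer<Consumer<Boolean>> startValidation) {
        this.startValidation = startValidation;
    }

    // Replaces the "if (!valid) validate(); if (valid) { ... }" pairs.
    void whenValid(Runnable action) {
        if (valid) {
            action.run();
            return;
        }
        waiting.add(action);
        if (!inProgress) {           // start at most one validation at a time
            inProgress = true;
            startValidation.accept(this::onResult);
        }
    }

    // Called by the validator when the login check has finished.
    private void onResult(boolean ok) {
        inProgress = false;
        valid = ok;
        if (!ok) {
            waiting.clear();         // drop the actions that required a valid login
            return;
        }
        Runnable next;
        while ((next = waiting.poll()) != null) {
            next.run();
        }
    }

    public static void main(String[] args) {
        // Stand-in validator that answers immediately; a real one would answer later.
        ValidationGate gate = new ValidationGate(callback -> callback.accept(true));
        gate.whenValid(() -> System.out.println("Try 1"));
        gate.whenValid(() -> System.out.println("Try 2"));
    }
}
```

Each call site then shrinks to a single gate.whenValid(...) call, and the validation is started only once even if several actions are requested before the login page has answered.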
I'm using the Upload component of Vaadin (7.1.9). My trouble is that I want to restrict which kinds of files can be sent to the server with the Upload component, but I haven't found any API for that purpose. The only way I see is to discard files of the wrong type after the upload.
public OutputStream receiveUpload(String filename, String mimeType) {
if(!checkIfAValidType(filename)){
upload.interruptUpload();
}
return out;
}
Is this a correct way?
No, it's not the correct way. Vaadin provides several useful interfaces that you can use to monitor when an upload has started, been interrupted, finished or failed. Here is a list:
com.vaadin.ui.Upload.FailedListener;
com.vaadin.ui.Upload.FinishedListener;
com.vaadin.ui.Upload.ProgressListener;
com.vaadin.ui.Upload.Receiver;
com.vaadin.ui.Upload.StartedListener;
Here is a code snippet to give you an example:
@Override
public void uploadStarted(StartedEvent event) {
// TODO Auto-generated method stub
System.out.println("***Upload: uploadStarted()");
String contentType = event.getMIMEType();
boolean allowed = false;
for(int i=0;i<allowedMimeTypes.size();i++){
if(contentType.equalsIgnoreCase(allowedMimeTypes.get(i))){
allowed = true;
break;
}
}
if(allowed){
fileNameLabel.setValue(event.getFilename());
progressBar.setValue(0f);
progressBar.setVisible(true);
cancelButton.setVisible(true);
upload.setEnabled(false);
}else{
Notification.show("Error", "\nAllowed MIME: "+allowedMimeTypes, Type.ERROR_MESSAGE);
upload.interruptUpload();
}
}
Here, allowedMimeTypes is a list of MIME type strings.
ArrayList<String> allowedMimeTypes = new ArrayList<String>();
allowedMimeTypes.add("image/jpeg");
allowedMimeTypes.add("image/png");
I hope it helps you.
It can be done.
You can add the following and it will work (it relies on the HTML5 accept attribute, which most browsers now support). This example is for .csv files:
upload.setButtonCaption("Import");
JavaScript.getCurrent().execute("document.getElementsByClassName('gwt-FileUpload')[0].setAttribute('accept', '.csv')");
I think it's better to throw a custom exception from the Receiver's receiveUpload:
Upload upload = new Upload(null, new Upload.Receiver() {
@Override
public OutputStream receiveUpload(String filename, String mimeType) {
boolean typeSupported = /* do your check*/;
if (!typeSupported) {
throw new UnsupportedImageTypeException();
}
// continue returning correct stream
}
});
The exception is just a simple custom exception:
public class UnsupportedImageTypeException extends RuntimeException {
}
Then you just simply add a listener if the upload fails and check whether the reason is your exception:
upload.addFailedListener(new Upload.FailedListener() {
@Override
public void uploadFailed(Upload.FailedEvent event) {
if (event.getReason() instanceof UnsupportedImageTypeException) {
// do your stuff but probably don't log it as an error since it's not 'real' error
// better would be to show sth like a notification to inform your user
} else {
LOGGER.error("Upload failed, source={}, component={}", event.getSource(), event.getComponent());
}
}
});
public static boolean checkFileType(String mimeTypeToCheck) {
ArrayList<String> allowedMimeTypes = new ArrayList<String>();
allowedMimeTypes.add("image/jpeg");
allowedMimeTypes.add("application/pdf");
allowedMimeTypes.add("application/vnd.openxmlformats-officedocument.wordprocessingml.document");
allowedMimeTypes.add("image/png");
allowedMimeTypes.add("application/vnd.openxmlformats-officedocument.presentationml.presentation");
allowedMimeTypes.add("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
for (int i = 0; i < allowedMimeTypes.size(); i++) {
String temp = allowedMimeTypes.get(i);
if (temp.equalsIgnoreCase(mimeTypeToCheck)) {
return true;
}
}
return false;
}
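The same check can also be written as a set lookup, so that the list is not rebuilt on every call. This is just a sketch: MimeTypeFilter is a made-up helper and not part of Vaadin, and stripping parameters such as "; charset=..." before comparing is an assumption about what the browser may send. Keep in mind that the MIME type is reported by the client, so it is a convenience filter rather than a security check.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Case-insensitive whitelist of MIME types (illustrative helper).
final class MimeTypeFilter {
    private final Set<String> allowed = new HashSet<>();

    MimeTypeFilter(String... mimeTypes) {
        for (String mimeType : mimeTypes) {
            allowed.add(normalize(mimeType));
        }
    }

    boolean isAllowed(String mimeType) {
        return mimeType != null && allowed.contains(normalize(mimeType));
    }

    // Drops parameters after ';', trims and lower-cases the type.
    private static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        MimeTypeFilter filter = new MimeTypeFilter("image/jpeg", "image/png", "application/pdf");
        System.out.println(filter.isAllowed("IMAGE/PNG"));
        System.out.println(filter.isAllowed("text/html"));
    }
}
```

An instance can be created once and reused from uploadStarted or receiveUpload.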
I am working with Vaadin 8, and there is no change in the Upload class.
FileUploader receiver = new FileUploader();
Upload upload = new Upload();
upload.setAcceptMimeTypes("application/json");
upload.setButtonCaption("Open");
upload.setReceiver(receiver);
upload.addSucceededListener(receiver);
FileUploader is the class that I created that handles the upload process. Let me know if you need to see the implementation.
I have a form which uses a CustomValidator to check for a non-empty field whenever we try to add a record (PARAMETER, VALUE).
I'm looking for a way to disable form validation when I'm trying to delete a record (the user can delete an empty listGridRecord if he changes his mind and no longer needs to add it).
I'm using this custom validator:
CustomValidator validatorParameter = new CustomValidator() {
@Override
protected boolean condition(Object value) {
parameterName = (String) value;
if ((value == null || ((String) value).trim().isEmpty())) {
rowIsValidate = false;
return false;
} else {
rowIsValidate = true;
return true;
}
}
};
which I'm setting in an init() method this way:
parametersListGrid.getField(PARAMETER).setValidators(validatorParameter);
I tried setting a flag "noValidation" to true whenever I detect a click on the Delete button, and used it this way:
CustomValidator validatorParameter = new CustomValidator() {
@Override
protected boolean condition(Object value) {
parameterName = (String) value;
if (((value == null || ((String) value).trim().isEmpty())) && !noValidation){
rowIsValidate = false;
return false;
} else {
rowIsValidate = true;
return true;
}
}
};
but I figured out that this flag is only set after the validation has already happened, so
rowIsValidate stays false and we can't delete the empty record because of the errors shown after validation.
Any idea how to skip this validation step just in the deletion scenario?
Call discardEdits(rowNum) before deleting a record.
The same question was asked on the SmartClient Forums.