Concurrency for recursive webcrawler-algorithm in Java

I wrote a program in Java to find all pages of a website, starting with the URL of the startpage (using Jsoup as webcrawler). It is ok for small websites but too slow for sites with 200 or more pages:
public class SiteInspector {
private ObservableSet<String> allUrlsOfDomain; // all URLS found for site
private Set<String> toVisit; // pages that were found but not visited yet
private Set<String> visited; // URLS that were visited
private List<String> invalid; // broken URLs
public SiteInspector() {...}
public void getAllWebPagesOfSite(String entry) //entry must be startpage of a site
{
toVisit.add(entry);
allUrlsOfDomain.add(entry);
while(!toVisit.isEmpty())
{
String next = popElement(toVisit);
getAllLinksOfPage(next); //expensive
toVisit.remove(next);
}
}
public void getAllLinksOfPage(String pageURL) {
try {
if (urlIsValid(pageURL)) {
visited.add(pageURL);
Document document = Jsoup.connect(pageURL).get(); //connect to pageURL (expensive network operation)
Elements links = document.select("a"); //get all links from page
for(Element link : links)
{
String nextUrl = link.attr("abs:href"); // "http://..."
if(nextUrl.contains(new URL(pageURL).getHost())) //ignore URLs to external hosts
{
if(!isForbiddenForCrawlers(nextUrl)) // URLS forbidden by robots.txt
{
if(!visited.contains(nextUrl))
{
toVisit.add(nextUrl);
}
}
allUrlsOfDomain.add(nextUrl);
}
}
}
else
{
invalid.add(pageURL); //URL-validation fails
}
}
catch (IOException e) {
e.printStackTrace();
}
}
private boolean isForbiddenForCrawlers(String url){...}
private boolean urlIsValid(String url) {...}
public String popElement(Set<String> set) {...}
I know I have to run the expensive network-operation in extra threads.
Document document = Jsoup.connect(pageURL).get(); //connect to pageURL
My problem is that I have no idea how to properly move this operation into other threads while keeping the sets consistent (how do I synchronize?). If possible, I want to use a ThreadPoolExecutor to control the number of threads started during the process. Do you have an idea how to solve this? Thanks in advance.

To use threads and still keep the sets consistent, let each thread store its result in its own variable, which starts out empty; once the thread is done, that value is added to the Set.
A simple example of that could be:
Main.class
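// One handler per link; each handler keeps its own result.
Semaphore barrier = new Semaphore(0);
List<WebDownloadThreadHandler> handlers = new ArrayList<>();
for (String link : links) {
    WebDownloadThreadHandler handler = new WebDownloadThreadHandler(link, barrier);
    handlers.add(handler);
    new Thread(handler).start();
}
// Block until every worker has released its permit, i.e. has finished.
barrier.acquireUninterruptibly(links.size());
for (WebDownloadThreadHandler handler : handlers) {
    if (handler.getValidUrl() != null) {
        allUrlsOfDomain.add(handler.getValidUrl());
    }
}
WebDownloadThreadHandler.class
public class WebDownloadThreadHandler implements Runnable {
    private final String link;
    private final Semaphore barrier;
    private String validUrl; // filled in by run(); read only after the barrier is acquired
    public WebDownloadThreadHandler(String link, Semaphore barrier) {
        this.link = link;
        this.barrier = barrier;
    }
    public String getValidUrl() {
        return validUrl;
    }
    public void run() {
        try {
            Document document = Jsoup.connect(this.link).userAgent("Mozilla/5.0").get();
            Elements elements = document.select(YOUR CSS QUERY);
            /*
            YOUR JSOUP CODE GOES HERE, AND STORE THE VALID URL IN: this.validUrl = THE VALUE YOU GET;
            */
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            this.barrier.release(); // always signal completion, even on failure
        }
    }
}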
What you are doing here is creating a thread for every page you want to collect links from and storing the results in variables. If you want to retrieve more than one valid link from each page, have each thread collect them in its own Set and then add that Set to the global set. The key to keeping your code consistent is that each thread stores the retrieved values in its own field, using the this keyword, and those values are read only after the thread has finished.
Hope it helps! If you need anything else feel free to ask me!
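For comparison, here is a minimal sketch of the bounded thread pool the question asks about, instead of one thread per link. It assumes a thread-safe set (ConcurrentHashMap.newKeySet()) in place of the plain sets, and a Phaser to detect when no page is left in flight. The class name ConcurrentCrawler and the fetchLinks function are hypothetical; fetchLinks stands in for the Jsoup download plus the host and robots.txt filtering from the question.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;

// Hypothetical sketch: crawls a link graph with a fixed number of worker threads.
final class ConcurrentCrawler {
    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe set
    private final Phaser pending = new Phaser(1); // 1 party = the thread calling crawl()
    private final ExecutorService pool;
    private final Function<String, List<String>> fetchLinks; // assumption: wraps the Jsoup call

    ConcurrentCrawler(int threads, Function<String, List<String>> fetchLinks) {
        this.pool = Executors.newFixedThreadPool(threads); // backed by a ThreadPoolExecutor
        this.fetchLinks = fetchLinks;
    }

    // Single-use: the pool is shut down once the crawl is complete.
    Set<String> crawl(String entry) {
        submit(entry);
        pending.arriveAndAwaitAdvance(); // returns once every submitted task has finished
        pool.shutdown();
        return seen;
    }

    private void submit(String url) {
        if (!seen.add(url)) {
            return; // add() is atomic, so each URL is scheduled exactly once
        }
        pending.register(); // one more task in flight
        pool.execute(() -> {
            try {
                for (String next : fetchLinks.apply(url)) {
                    submit(next);
                }
            } finally {
                pending.arriveAndDeregister(); // task done, even if fetching failed
            }
        });
    }
}
```

With Jsoup, fetchLinks could be a lambda that downloads the page and returns the abs:href values of its links; any IOException would have to be caught inside that lambda.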

Related

Create file structure from strings

I am trying to create a BitBucket plugin to get the repository structure and print it out in a structured format. The plugin creates a button on the repo page and when clicked it connects with a servlet to produce an output, however I cannot get my formatting code to work.
E.g
Instead of:
Folder 1
File 1
File 2
I want it to indent children:
Folder 1
    File 1
    File 2
I currently have a JS file which controls the button and makes an ajax call to a Java file, and also passes the servlet URL including the parameters for the repo (Project, Repo).
In my Java file I have a doGet which gets the repo from the parameters and uses a custom contentTreeCallback() to get the files within the repo in order to print them out, using callback.getFiles(). Within this same Java file, I have defined a node class which creates a linked hash map which takes each file, splits it into components, and with a recursive loop appends children to nested lists in order to create the file structure. This should work, however my custom contentTreeCallback() gets a string rather than the file array it needs to return. I cannot figure out what changes I need to make to get this to work. I'm guessing I either adjust the callback to get the files or I move the node class functionality into the callback class. I would prefer the second option since this class already splits the string, it seems a bit redundant to do it twice.
The servlet java class:
protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
// Get values from the URL
projectName= req.getParameter("project");
repoName = req.getParameter("repository");
repo = repositoryService.getBySlug(projectName, repoName);
// ByteArrayOutputStream out = new ByteArrayOutputStream();
MyContentTreeCallback callback = new MyContentTreeCallback();
PageRequestImpl pr = new PageRequestImpl(0, 1000);
// Get information from the defined location, store in ByteArrayOutputStream
contentService.streamDirectory(repo, "Master", "", true, callback, pr);
resp.setContentType("text/html");
resp.getWriter().print("<html><body><p>Repository: " + repo.getName() + "</p>");
Node root = new Node(null);
for(int i = 0; i < callback.getFiles().size(); i++) {
root.add(callback.getFiles().get(i));
}
root.writeTo(resp.getWriter());
resp.getWriter().print("</body></html>");
}
static final class Node {
final String name;
final Map<String, Node> children = new LinkedHashMap<>();
Node(String name) {
this.name = name;
}
void add(File file) {
Node n = this;
for(String component: file.getPath().getComponents())
n = n.children.computeIfAbsent(component, Node::new);
}
void writeTo(Appendable w) throws IOException {
if(name != null) w.append("<li><a href='/'>").append(name).append("</a></li>\n");
if(!children.isEmpty()) {
w.append("<ul>\n");
for(Node ch: children.values()) ch.writeTo(w);
w.append("</ul>\n");
}
}
}
And the custom callback class:
public class MyContentTreeCallback extends AbstractContentTreeCallback {
ArrayList<File> files = new ArrayList<File>();
ContentTreeSummary fileSummary;
public MyContentTreeCallback() {
}
@Override
public void onEnd(@Nonnull ContentTreeSummary summary) {
fileSummary = summary;
}
@Override
public void onStart(@Nonnull ContentTreeContext context) {
System.out.print("On start");
}
@Override
public boolean onTreeNode(@Nonnull ContentTreeNode node) {
String filePath = "";
if (node.getPath().getComponents().length>1) {
for(int i=0;i<node.getPath().getComponents().length;i++) {
filePath+=node.getPath().getComponents()[i]+"/";
//filePath=filePath.substring(0,filePath.length() - 1)
}
}
else {
filePath+=node.getPath().getName();
}
String lastChar = String.valueOf(filePath.charAt(filePath.length() - 1));
if(lastChar.equals("/")){ filePath=filePath.substring(0,filePath.length() - 1); }
files.add(filePath);
return true;
}
public ArrayList<File> getFiles(){
return files;
}
}
The line files.add(filePath); is where the issue is in the callback class.
I'm sure it's simpler than I am making it out to be... Thanks for any help you can give
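One possible direction, sketched under assumptions: if the callback collects plain path strings (for example in a List<String> rather than a list of File objects), the tree can be built from those strings directly. The PathTree class below is hypothetical and independent of the Bitbucket API; it splits each path on "/" and renders children indented under their parent as plain text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Bitbucket API: builds a tree from
// slash-separated path strings and renders it with indentation.
final class PathTree {
    // LinkedHashMap keeps children in insertion order.
    private final Map<String, PathTree> children = new LinkedHashMap<>();

    // Splits "Folder 1/File 1" into components and creates one node per component.
    void add(String path) {
        PathTree node = this;
        for (String component : path.split("/")) {
            if (component.isEmpty()) {
                continue; // tolerate leading or doubled slashes
            }
            node = node.children.computeIfAbsent(component, c -> new PathTree());
        }
    }

    // Renders every entry on its own line, indented two spaces per nesting level.
    String render() {
        StringBuilder out = new StringBuilder();
        render(out, 0);
        return out.toString();
    }

    private void render(StringBuilder out, int depth) {
        for (Map.Entry<String, PathTree> entry : children.entrySet()) {
            out.append("  ".repeat(depth)).append(entry.getKey()).append('\n');
            entry.getValue().render(out, depth + 1);
        }
    }
}
```

For HTML output the same recursion applies; the nested list would go inside the parent's list item instead of using spaces.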

Matlab & Java: Execute matlab asynchronously

So, here is my problem for today:
First of all, please note that I do NOT have the Matlab parallel toolbox available.
I am running Java code which interacts with Matlab. Sometimes Matlab directly calls some Java functions, sometimes it is the opposite. In this case, we use a notification system which comes from here:
http://undocumentedmatlab.com/blog/matlab-callbacks-for-java-events
We then address the notification in proper callbacks.
Here is a simple use case:
My user selects a configuration file using the Java interface, and it is loaded into Matlab.
Using an interface listener, we notify Matlab that the configuration file has been selected; it then runs a certain number of functions that analyze the file.
Once the analysis is done, it is pushed into the Java runtime, which populates the interface tables with the result. This step involves Matlab calling a Java function.
Finally, Java requests the interface to switch to an arbitrarily chosen tab.
This is the order in which things would happen in an ideal world; however, here is the code of the listener's actionPerformed method:
@Override
public void actionPerformed(ActionEvent arg0) {
Model wModel = controller.getModel();
Window wWindow = controller.getWindow();
MatlabStructure wStructure = new MatlabStructure();
if(null != wModel) {
wModel.readMatlabData(wStructure);
wModel.notifyMatlab(wStructure, MatlabAction.UpdateCircuit);
}
if(null != wWindow) {
wWindow.getTabContainer().setSelectedComponent(wWindow.getInfosPannel());
}
}
What happens here is that when the notifyMatlab method is called, the code does not wait for it to complete before continuing. So actionPerformed finishes and switches to an empty interface page (setSelectedComponent), and only then is the component filled with values.
What I would like is for Java to wait until notifyMatlab returns an "I have completed!" signal, and then proceed. This involves asynchronous code, since Matlab will call Java methods during its execution too ...
So far here is what I tried:
In the MatlabEventObject class, I added an isAcknowledged member, so now the class (which I originally found at the above link) looks like this (I removed all unchanged code from the original class):
public class MatlabEventObject extends java.util.EventObject {
private static final long serialVersionUID = 1L;
private boolean isAcknowledged = false;
public void onNotificationReceived() {
if (source instanceof MatlabEvent) {
System.out.println("Catched a MatlabEvent Pokemon !");
MatlabEvent wSource = (MatlabEvent) source;
wSource.onNotificationReceived();
}
}
public boolean isAcknowledged() {
return isAcknowledged;
}
public void acknowledge() {
isAcknowledged = true;
}
}
In the MatlabEvent class, I have added a future task whose goal is to wait for acknowledgement. The methods now look like this:
public class MatlabEvent {
private Vector<IMatlabListener> data = new Vector<IMatlabListener>();
private Vector<MatlabEventObject> matlabEvents = new Vector<MatlabEventObject>();
public void notifyMatlab(final Object obj, final MatlabAction action) {
final Vector<IMatlabListener> dataCopy;
matlabEvents.clear();
synchronized (this) {
dataCopy = new Vector<IMatlabListener>(data);
}
for (int i = 0; i < dataCopy.size(); i++) {
matlabEvents.add(new MatlabEventObject(this, obj, action));
((IMatlabListener) dataCopy.elementAt(i)).testEvent(matlabEvents.get(i));
}
}
public void onNotificationReceived() {
ExecutorService service = Executors.newSingleThreadExecutor();
long timeout = 15;
System.out.println("Executing runnable.");
Runnable r = new Runnable() {
@Override
public void run() {
waitForAcknowledgement(matlabEvents);
}
};
try {
Future<?> task = service.submit(r);
task.get(timeout, TimeUnit.SECONDS);
System.out.println("Notification acknowledged.");
} catch (Exception e) {
e.printStackTrace();
}
}
private void waitForAcknowledgement(final Vector<MatlabEventObject> matlabEvents) {
boolean allEventsAcknowledged = false;
while(!allEventsAcknowledged) {
allEventsAcknowledged = true;
for(MatlabEventObject eventObject : matlabEvents) {
if(!eventObject.isAcknowledged()) {
allEventsAcknowledged = false;
}
break;
}
}
}
}
What happens is that I discovered that Matlab actually WAITS for the Java code to complete. So my waitForAcknowledgement method always waits until it times out.
In addition, I must say that I have very little knowledge of parallel computing, but I think our Java is single-threaded, so having Java wait for Matlab code to complete while Matlab is issuing calls to Java functions may be an issue. But I can't be sure : ]
If you have any idea how to solve this issue in a robust way, it would be much appreciated.
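As a partial illustration only: the busy-wait loop over a plain boolean flag could be replaced with a CountDownLatch, which blocks without spinning and makes the acknowledgement visible across threads. The Acknowledgement class below is a hypothetical stand-in for the acknowledgement part of MatlabEventObject. This sketch does not by itself resolve the situation described above, where Matlab is waiting for the very Java call that is waiting for Matlab; the wait would have to happen on a thread that Matlab is not blocked on.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the acknowledgement part of MatlabEventObject.
final class Acknowledgement {
    private final CountDownLatch latch = new CountDownLatch(1);

    // Called by the side that has finished its work (here: the Matlab callback).
    void acknowledge() {
        latch.countDown();
    }

    // Blocks until acknowledge() has been called or the timeout elapses.
    // Returns true if the acknowledgement arrived in time.
    boolean await(long timeout, TimeUnit unit) {
        try {
            return latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt status for callers
            return false;
        }
    }
}
```

A caller would create one Acknowledgement per event, hand it to the listener, and call await(15, TimeUnit.SECONDS) from a background thread rather than from the thread that dispatched the event.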

Java function returns empty string [duplicate]

This question already has answers here:
How to use Jsoup with Volley?
(3 answers)
Closed 6 years ago.
I'm trying to parse data from my server in Java with jsoup. I wrote a new function and it should return data in string format, but it returns blank string. Here is my code:
public String doc;
public String pare(final String url){
Thread downloadThread = new Thread() {
public void run() {
try {
doc = Jsoup.connect(url).get().toString();
}
catch (IOException e) {
e.printStackTrace();
}
}
};
downloadThread.start();
return doc;
}
You're returning the doc object immediately, before the thread has had a chance to add any data to it, so it should be no surprise that this returns empty. You can't return threaded information in this way, and instead will need to use some type of call-back mechanism, one that notifies you when the thread is done and when data is ready to be consumed.
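A minimal sketch of such a callback mechanism on plain Java 8 or later, using CompletableFuture. The class name AsyncFetch and both parameters are hypothetical: download stands in for the blocking call (for example a lambda wrapping Jsoup.connect(url).get().toString() with its IOException handled), and onReady is the callback that consumes the result once it exists.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper: runs a blocking download in the background and
// hands the result to a callback when it is ready.
final class AsyncFetch {
    static CompletableFuture<Void> fetch(Supplier<String> download, Consumer<String> onReady) {
        // supplyAsync runs the download on a background thread (the common pool by default);
        // thenAccept invokes the callback only after the download has produced a value.
        return CompletableFuture.supplyAsync(download).thenAccept(onReady);
    }
}
```

The method returns immediately; the caller never reads a half-filled field, because the document only ever reaches the callback after the download has completed.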
On the Android platform, you shouldn't ask Jsoup to download anything for you. Under the hood, Jsoup makes use of HttpURLConnection. This class is notoriously slow and has some known issues.
Use a faster alternative instead: Volley.
Here is the function in your post taking advantage of Volley. In the following sample code, I'm using a CountDownLatch to wait for the data.
private static RequestQueue myRequestQueue = null;
public String pare(final String url) throws Exception {
final String[] doc = new String[1];
final CountDownLatch cdl = new CountDownLatch(1);
StringRequest documentRequest = new StringRequest( //
Request.Method.GET, //
url, //
new Response.Listener<String>() {
@Override
public void onResponse(String response) {
doc[0] = Jsoup.parse(response).html();
cdl.countDown();
}
}, //
new Response.ErrorListener() {
@Override
public void onErrorResponse(VolleyError error) {
Log.e("MyActivity", "Error while fetching " + url, error);
cdl.countDown(); // release the waiting caller on failure too; doc[0] stays null
}
} //
);
if (myRequestQueue == null) {
myRequestQueue = Volley.newRequestQueue(this);
}
// Add the request to the queue...
myRequestQueue.add(documentRequest);
// ... and wait for the document.
// NOTE: User experience can be a concern here. We shouldn't freeze the app...
cdl.await();
return doc[0];
}
I totally agree with the above answer. You can follow any of the below tutorials for fetching data from server
http://www.androidhive.info/2014/05/android-working-with-volley-library-1/
http://www.vogella.com/tutorials/Retrofit/article.html
These two are the best libraries for Network calls in android
Before the return statement, add downloadThread.join(). This waits until the thread has finished and has put the response into doc. But in doing so you lose all the benefit of asynchronous execution; it behaves the same as if you had simply written:
public String pare(final String url) throws IOException {
return Jsoup.connect(url).get().toString();
}

How to copy notes item using Java

I would like to copy a Notes item from one Notes document to another using Java. Below is the LotusScript version of what I want to achieve in Java:
Sub CopyItem(FromDoc As NotesDocument, ToDoc As NotesDocument, itemName As String)
Dim FromItem As NotesItem
Dim ToItem As NotesItem
If Not (FromDoc.Hasitem(itemName)) Then Exit Sub
Set FromItem = FromDoc.GetFirstItem(itemName)
If Not ToDoc.hasitem(itemName) Then Set ToItem = ToDoc.CreateItem(itemName)
ToItem.Values = FromDoc.Values
End Sub
I have tried the below:
public static void copyAnItem(Document FromDoc, Document ToDoc, String sItemName){
Vector<String> FromItem = new Vector<String>();
Vector<String> ToItem = new Vector<String>();
if(!FromDoc.hasItem((itemName))){
return;
}
FromItem = FromDoc.getItemValue(itemName);
if(!ToDoc.hasItem(sItemName)){
ToItem.add(itemName);
}
ToItem.addAll(FromDoc);
}
public static void copyAnItem(Document fromDoc, Document toDoc, String itemName){
try {
if(fromDoc.hasItem(itemName)) {
toDoc.copyItem(fromDoc.getFirstItem(itemName));
}
} catch (NotesException e) {
// your exception handling
}
}
You can get the whole item including all properties from fromDoc with getFirstItem and can copy it to toDoc with copyItem in just one line of code.
public static void copyAnItem(Document FromDoc, Document ToDoc, String sItemName){
if(FromDoc.hasItem(sItemName)){
ToDoc.replaceItemValue(sItemName, FromDoc.getItemValue(sItemName));
}
}
Note that this won't work with Authors or Readers items, so Knut's solution is the better one :)

Restricting file types upload component

I'm using the Upload component of Vaadin (7.1.9). My trouble is that I'm not able to restrict which kinds of files can be sent to the server with the Upload component, and I haven't found any API for that purpose. The only way seems to be discarding files of the wrong type after the upload.
public OutputStream receiveUpload(String filename, String mimeType) {
if(!checkIfAValidType(filename)){
upload.interruptUpload();
}
return out;
}
Is this a correct way?
No, it's not the correct way. Vaadin provides several useful interfaces that you can use to monitor when the upload has started, been interrupted, finished, or failed. Here is a list:
com.vaadin.ui.Upload.FailedListener;
com.vaadin.ui.Upload.FinishedListener;
com.vaadin.ui.Upload.ProgressListener;
com.vaadin.ui.Upload.Receiver;
com.vaadin.ui.Upload.StartedListener;
Here is a code snippet to give you an example:
@Override
public void uploadStarted(StartedEvent event) {
// TODO Auto-generated method stub
System.out.println("***Upload: uploadStarted()");
String contentType = event.getMIMEType();
boolean allowed = false;
for(int i=0;i<allowedMimeTypes.size();i++){
if(contentType.equalsIgnoreCase(allowedMimeTypes.get(i))){
allowed = true;
break;
}
}
if(allowed){
fileNameLabel.setValue(event.getFilename());
progressBar.setValue(0f);
progressBar.setVisible(true);
cancelButton.setVisible(true);
upload.setEnabled(false);
}else{
Notification.show("Error", "\nAllowed MIME: "+allowedMimeTypes, Type.ERROR_MESSAGE);
upload.interruptUpload();
}
}
Here, allowedMimeTypes is a list of MIME-type strings.
ArrayList<String> allowedMimeTypes = new ArrayList<String>();
allowedMimeTypes.add("image/jpeg");
allowedMimeTypes.add("image/png");
I hope it helps you.
Can be done.
You can add this and it will work (all done by HTML 5 and most browsers now support accept attribute) - this is example for .csv files:
upload.setButtonCaption("Import");
JavaScript.getCurrent().execute("document.getElementsByClassName('gwt-FileUpload')[0].setAttribute('accept', '.csv')");
I think it's better to throw custom exception from Receiver's receiveUpload:
Upload upload = new Upload(null, new Upload.Receiver() {
@Override
public OutputStream receiveUpload(String filename, String mimeType) {
boolean typeSupported = /* do your check*/;
if (!typeSupported) {
throw new UnsupportedImageTypeException();
}
// continue returning correct stream
}
});
The exception is just a simple custom exception:
public class UnsupportedImageTypeException extends RuntimeException {
}
Then you just simply add a listener if the upload fails and check whether the reason is your exception:
upload.addFailedListener(new Upload.FailedListener() {
@Override
public void uploadFailed(Upload.FailedEvent event) {
if (event.getReason() instanceof UnsupportedImageTypeException) {
// do your stuff but probably don't log it as an error since it's not 'real' error
// better would be to show sth like a notification to inform your user
} else {
LOGGER.error("Upload failed, source={}, component={}", event.getSource(), event.getComponent());
}
}
});
public static boolean checkFileType(String mimeTypeToCheck) {
ArrayList<String> allowedMimeTypes = new ArrayList<>(); // typed list, so get(i) returns a String
allowedMimeTypes.add("image/jpeg");
allowedMimeTypes.add("application/pdf");
allowedMimeTypes.add("application/vnd.openxmlformats-officedocument.wordprocessingml.document");
allowedMimeTypes.add("image/png");
allowedMimeTypes.add("application/vnd.openxmlformats-officedocument.presentationml.presentation");
allowedMimeTypes.add("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
for (int i = 0; i < allowedMimeTypes.size(); i++) {
String temp = allowedMimeTypes.get(i);
if (temp.equalsIgnoreCase(mimeTypeToCheck)) {
return true;
}
}
return false;
}
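The same whitelist check can be sketched more compactly with an immutable Set, which avoids rebuilding the list on every call. The class name MimeTypeFilter and the particular MIME types are illustrative assumptions; the lookup lower-cases the input so the comparison stays case-insensitive, like equalsIgnoreCase in the snippets above.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: case-insensitive whitelist check for MIME types.
final class MimeTypeFilter {
    // Entries must be lower case, because the lookup lower-cases its input.
    private static final Set<String> ALLOWED = Set.of(
            "image/jpeg",
            "image/png",
            "application/pdf");

    static boolean isAllowed(String mimeType) {
        return mimeType != null && ALLOWED.contains(mimeType.toLowerCase(Locale.ROOT));
    }
}
```

Such a method could be called from uploadStarted or receiveUpload with the MIME type reported by the browser; since that value is supplied by the client, it is a convenience filter rather than a security check.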
I am working with Vaadin 8 and there is no change in the Upload class.
FileUploader receiver = new FileUploader();
Upload upload = new Upload();
upload.setAcceptMimeTypes("application/json");
upload.setButtonCaption("Open");
upload.setReceiver(receiver);
upload.addSucceededListener(receiver);
FileUploader is the class that I created that handles the upload process. Let me know if you need to see the implementation.
