Nutch - fetch new discovered domains


We're using Nutch 1.6 to crawl the web. According to the Nutch configuration, one supplies a seed list and a domain URL filter to restrict the crawl to specified domains. However, we want to fetch newly discovered URLs if their domain extension is, say, co.uk (and only that extension). We could manage this by adding the domains of newly discovered URLs to a file (or a database), stopping the crawler, updating the domain URL filter and the seed list, and then restarting it. But how can we do this dynamically, without stopping the crawler?
Thanks in advance.
P.S.: The co.uk domain extension is just an example; we might also allow more than one extension.

Got it.
You can add suffixes such as "gov.uk" to domain-urlfilter.txt. The DomainURLFilter source code (lines 186-189) reads:
if (domainSet.contains(suffix) || domainSet.contains(domain)
    || domainSet.contains(host)) {
  return url;
}
It checks the suffix, the domain and the host.
Alternatively, you can keep the domain URLs in an HBase table and manage them with your own filter plugin instead of using DomainURLFilter.
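To make the three-way check concrete, here is a simplified, self-contained sketch. It is not Nutch's code: Nutch derives the suffix and domain with its own URL utilities, whereas this sketch assumes it is enough to test every dot-separated tail of the host against the allow set, which covers host, domain and suffix entries alike. The class and method names are hypothetical.

```java
import java.net.URI;
import java.util.Set;

final class DomainAllowList {
    private final Set<String> allowed;

    DomainAllowList(Set<String> allowed) {
        this.allowed = allowed;
    }

    /** Returns the URL if its host, or any dot-separated tail of the host, is listed; otherwise null. */
    String filter(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        String candidate = host.toLowerCase();
        while (true) {
            if (allowed.contains(candidate)) {
                return url;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null; // no tail left to test
            }
            candidate = candidate.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        DomainAllowList filter = new DomainAllowList(Set.of("co.uk", "example.org"));
        System.out.println(filter.filter("http://news.bbc.co.uk/page")); // accepted via the co.uk entry
        System.out.println(filter.filter("http://example.com/"));        // rejected, prints null
    }
}
```

With an entry such as co.uk in the set, any newly discovered host under that suffix passes without its domain having to be listed individually.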

Related

How to list all root folder and shares on the Google Drive API v3 while paging and using order by?

I've seen a few questions about this but none cover my scenario.
Basically what I want is to use tokens to do paging and also list all folders and files in the root folder including shared files and folders.
This appears to work, but once I add orderBy the results are no longer sorted properly. Sorting works fine if I remove or sharedWithMe = true from the query, but once I add it back, the shared items don't seem to be sorted.
What am I doing wrong?
This is my code (Kotlin and on Android):
val response =
    gDriveClient.files()
        .list()
        .setSpaces("drive")
        .setCorpora("user")
        .setFields("files(id, name, size, modifiedTime, mimeType, parents, quotaBytesUsed),nextPageToken")
        .setQ("('root' in parents or sharedWithMe = true) and trashed = false")
        .setOrderBy("folder,name")
        .setPageSize(params.loadSize)
        .setPageToken(token)
        .execute() // sends the request; without it, the value is the unsent request

Unfortunately, the behaviour you are experiencing seems to be a bug: your query and request are formatted correctly and are what is needed to obtain exactly what you are looking for. I have reported this behaviour here: https://issuetracker.google.com/issues/174476354 . Please consider starring the report to indicate that it affects you too.
Workaround
A possible workaround is to order and filter the response yourself after the request has been executed. Unfortunately, this means you cannot rely on server-side pagination for this purpose, because ordering everything requires fetching all the files first.
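A minimal sketch of that client-side ordering, written in Java for illustration. It assumes every page has already been fetched and collected into one list. DriveItem is a hypothetical stand-in for the API's file objects, and the comparator only approximates "folder,name" (folders first, then case-insensitive name); it is not guaranteed to match Drive's own collation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DriveSortSketch {
    // MIME type that the Drive API reports for folders.
    static final String FOLDER_MIME = "application/vnd.google-apps.folder";

    /** Hypothetical stand-in for the fields of a Drive file that matter here. */
    static final class DriveItem {
        final String name;
        final String mimeType;

        DriveItem(String name, String mimeType) {
            this.name = name;
            this.mimeType = mimeType;
        }

        boolean isFolder() {
            return FOLDER_MIME.equals(mimeType);
        }
    }

    /** Returns a sorted copy: folders first, then by name ignoring case. */
    static List<DriveItem> sortFoldersFirst(List<DriveItem> allItems) {
        List<DriveItem> sorted = new ArrayList<>(allItems);
        sorted.sort(Comparator
                .comparing((DriveItem item) -> !item.isFolder()) // false sorts before true
                .thenComparing(item -> item.name, String.CASE_INSENSITIVE_ORDER));
        return sorted;
    }

    /** Emulates paging in memory once everything is sorted. */
    static List<DriveItem> page(List<DriveItem> sorted, int offset, int pageSize) {
        int from = Math.min(offset, sorted.size());
        int to = Math.min(from + pageSize, sorted.size());
        return sorted.subList(from, to);
    }
}
```

The page helper stands in for nextPageToken: the UI can still load items in chunks, but only after the full listing has been downloaded once.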
References
Drive.Files.list()
Query parameter sharedWithMe

How we can get forest data directory in MarkLogic

I am trying to get the forest data directory in MarkLogic. I used the following approach, running the query as admin through the Server Evaluation Call interface. If this is not the right way, please let me know how I can get the forest data directory.
ServerEvaluationCall forestDataDirCall = client.newServerEval()
    .xquery("admin:forest-get-data-directory(admin:get-configuration(), admin:forest-get-id(admin:get-configuration(), \"" + forestName + "\"))");
for (EvalResult forestDataDirResult : forestDataDirCall.eval()) {
    String forestDataDir = forestDataDirResult.getString();
    System.out.println("forestDataDir is " + forestDataDir);
}

I see no reason to hit the server evaluation endpoint to ask the server this question. MarkLogic comes with a robust REST-based Management API that includes getters for almost all items of interest.
Knowing that, you can use what is documented here:
http://yourserver:8002/manage/v2/forests
Results can be returned as JSON, XML or HTML.
This is the getter for forest configurations. You can find the forests you care about either by iterating over all forests or by going through the database configuration and from there to its forests. It all depends on what you already know from the outside.
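As a rough sketch, a request for a single forest could be built as shown here. The /properties resource, the format=json parameter and the data-directory property name are assumptions based on the Management API documentation and should be verified against your MarkLogic version. Authentication is left out entirely, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class ForestPropertiesRequest {
    /** Builds the Management API URI for one forest's properties (assumed path, see the note above). */
    static URI forestPropertiesUri(String host, String forestName) {
        // URLEncoder targets form encoding, so turn "+" back into "%20" for use in a path segment.
        String encoded = URLEncoder.encode(forestName, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create("http://" + host + ":8002/manage/v2/forests/" + encoded
                + "/properties?format=json");
    }

    /** Builds (but does not send) a GET request; sending it needs a client set up with admin credentials. */
    static HttpRequest buildRequest(String host, String forestName) {
        return HttpRequest.newBuilder(forestPropertiesUri(host, forestName))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        System.out.println(buildRequest("yourserver", "Documents").uri());
    }
}
```

If the assumption holds, the JSON response should carry the forest's data-directory among its properties, so no XQuery needs to be evaluated on the server.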
References:
Management API
Scripting Administrative Tasks

How to pass elastic query to kibana dashboard

I have to integrate a Kibana dashboard (in an iframe) with my own Elasticsearch query.
Using rison-node, how can I pass the query to the dashboard through the URL?
Here is what I have tried:
https://discuss.elastic.co/t/dashboard-search-parameter-via-url/84385/2

This is not the best solution, but it is a quick and dirty one.
I would start by getting two URLs from the browser: the first links to the plain dashboard, the second to the dashboard with a filter applied.
Now compare the two URLs, either online or with a tool like BeyondCompare. This reveals the changes required to add a filter.
All words no code :|
For example, I tried this on my own dashboard URL. Here is the part of this huge URL that changed:
filters:!(),options:(darkTheme:!f),panels:!((col:1,id:AWbJ883y-laqWN-SkuG2,panelIndex:1,row:4,size_x:6,size_y:3,type:visualization),(col:7,id:AWbJ9BBX-laqWN-SkuG3,panelIndex:2,row:1,size_x:6,size_y:3,type:vis
filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:AWbJsP0d-laqWN-SkuGu,key:user.keyword,negate:!f,type:phrase,value:aditya),query:(match:(user.keyword:(query:aditya,type:phrase))))),options:(darkTheme
As you can see, the filters section is empty in the first case, while in the second case it contains my filter query. You can now easily create dynamic URLs based on this approach.
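The same idea can be sketched in code: build the filter fragment as a string and splice it into the empty filters:!() section. This sketch copies the structure of the filtered URL above verbatim. It assumes the field and value are plain tokens that need no Rison quoting or URL encoding, and the class and method names are hypothetical.

```java
final class KibanaFilterUrl {
    /** Builds one phrase filter in the same shape as the filtered URL fragment above. */
    static String phraseFilter(String indexId, String field, String value) {
        return "('$state':(store:appState),"
                + "meta:(alias:!n,disabled:!f,index:" + indexId + ",key:" + field
                + ",negate:!f,type:phrase,value:" + value + "),"
                + "query:(match:(" + field + ":(query:" + value + ",type:phrase))))";
    }

    /** Replaces the empty filter list in a dashboard URL with a single filter. */
    static String withFilter(String dashboardUrl, String filter) {
        return dashboardUrl.replace("filters:!()", "filters:!(" + filter + ")");
    }

    public static void main(String[] args) {
        String plain = "filters:!(),options:(darkTheme:!f)";
        String filter = phraseFilter("AWbJsP0d-laqWN-SkuGu", "user.keyword", "aditya");
        System.out.println(withFilter(plain, filter));
    }
}
```

Values containing spaces or punctuation would need proper Rison encoding, which is where a library such as rison-node comes in.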

Java Servlet as a HTTP Proxy

I have read hundreds of SO posts and studied several of the available Java HTTP proxy sources, but I could not find a solution to my problem.
I wrote a web app that proxies HTTP requests. The web app works, but links and referrers break because the "root" of the proxied page points to the root of my server and not to the path of my proxy servlet.
To make it more clear:
My ProxyServlet gets a Request "http://myserver.com/proxy/ProxyServlet?foo=bar"
The ProxyServlet now fetches the pagecontent from ServerX (e.g. "http://original.com/test.html")
The content of the page is delivered to the browser by just reading and writing from one stream to the other and copying the headers.
The browser displays the page, and the URL it shows is the original request ("http://myserver.com/proxy/ProxyServlet?foo=bar"), but all relative links now point to
"http://myserver.com/XXX.html" instead of "http://myserver.com/proxy/ProxyServlet/XXX.html"
Is there a response-header where I can change the "path" so that relative links correctly point to my ProxyServlet?
(Rewriting the page-content and replacing links would be too difficult, because the page contains relatively addressed elements such as javascript code and other active content...)
(Changing the mapping for my Servlet to "/*" is also not possible... it must be accessed via this path...)

You are inventing a "reverse proxy" and are missing the "URL rewriting" feature.
Off the top of my search results, here's an open source proxy servlet that does this:
http://j2ep.sourceforge.net/docs/rewrite.html
You should also know that there is probably something wrong with the system architecture if you have to do this. Dropping in a standalone proxy such as Apache, nginx or Varnish should always be an option, as you will HAVE to add one (or more!) once you start scaling.

It sounds like the page you're proxying uses absolute-path links, e.g. <a href="/XXX.html">, which means "no matter where this link is found, resolve it against the document root". If you control the proxied page, the best fix is for it to be more lenient in its linking and use <a href="XXX.html"> instead. If you can't do that, you need to rewrite these URLs. Some example code, using Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Parse the body with the URL that links should be resolved against.
Document doc = Jsoup.parse(rawBody, getDisplayUrl());
// Stylesheets and anchors carry their target in "href".
for (Element cssALink : doc.select("link[rel=stylesheet],a[href]")) {
    cssALink.attr("href", cssALink.absUrl("href"));
}
// Scripts and images carry their target in "src".
for (Element imgJsLink : doc.select("script[src],img[src]")) {
    imgJsLink.attr("src", imgJsLink.absUrl("src"));
}
return doc.toString();
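To see how a browser resolves the two link styles against the proxy URL, here is a small sketch using java.net.URI, which implements standard relative-reference resolution. The URLs are the ones from the question; the class name is hypothetical.

```java
import java.net.URI;

final class LinkResolutionDemo {
    /** Resolves an href the way a browser would against the URL of the displayed page. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://myserver.com/proxy/ProxyServlet?foo=bar";
        // Absolute-path link: always resolved against the server root.
        System.out.println(resolve(page, "/XXX.html")); // http://myserver.com/XXX.html
        // Document-relative link: replaces the last path segment of the page URL.
        System.out.println(resolve(page, "XXX.html"));  // http://myserver.com/proxy/XXX.html
        // Only a page URL ending in a slash keeps "ProxyServlet" in the resolved path.
        System.out.println(resolve("http://myserver.com/proxy/ProxyServlet/", "XXX.html"));
        // http://myserver.com/proxy/ProxyServlet/XXX.html
    }
}
```

Note that even document-relative links land on /proxy/XXX.html unless the page is served from a URL with a trailing slash after ProxyServlet, so rewriting the links is the more robust route.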

Strategy to consolidate Java webapp configuration files for multiple deployments

I apologize if this is a duplicate; I couldn't find anything describing exactly what I wanted. I'm building a webapp with a number of properties that need to change depending on the environment, plus a number of .properties configuration files that need to change as well. Right now I have a global enum (DEVELOPMENT, STAGING, and PRODUCTION) that determines which string constants are used in the application, and I use a bunch of comments in the configuration files to switch between database servers, etc. There has got to be a better way to do this. Ideally, I'd like to make one change in one file (a large block comment would be fine) to adjust these configurations. I saw a post where the answer is to use JNDI, which I really like, but it seems I would need to call it from a servlet that starts up, or from a bean that gets initialized on start, in order to use it for my log4j or JDBC configuration files.
Does anybody have any strategies for dealing with this?
Thanks!

I'm not sure whether this strategy applies to your situation, but in the past I've successfully used our build tool (Ant in that case) to build different war files depending on the profile. You would keep multiple log4j configuration files in your source tree and delete the ones you don't want from the final build, depending on the profile used to build it.
Traceability becomes slightly harder (it is sometimes difficult to figure out which profile a given war was built with), but from your code's perspective it's a very clean solution, since it's all done in your build script.

We store all our default configuration values in a single XML file. During deployment we apply an XML patch (RFC-5261) with values specific to the environment.
https://www.rfc-editor.org/rfc/rfc5261
http://xmlpatch.sourceforge.net/

I am going to assume that your properties files are made up of 95% name=value pairs that are identical across all your deployment environments and 5% of name=value pairs that change from one deployment environment to another.
If this assumption is correct, then you could try something like the following pseudocode.
void generateRuntimeConfigFiles(int deploymentMode)
{
    // Alternating search/replace strings for the selected environment.
    String[] searchAndReplacePairs;
    if (deploymentMode == Constants.PRODUCTION) {
        searchAndReplacePairs = ...
    } else if (deploymentMode == Constants.STAGING) {
        searchAndReplacePairs = ...
    } else { // Constants.DEVELOPMENT
        searchAndReplacePairs = ...
    }
    // Alternating template file / generated file names.
    String[] filePairs = new String[] {
        "log4j-template.properties", "log4j.properties",
        "jdbc-template.properties", "jdbc.properties",
        "foo-template.xml", "foo.xml",
        ...
    };
    for (int i = 0; i < filePairs.length; i += 2) {
        String inFile = filePairs[i + 0];
        String outFile = filePairs[i + 1];
        searchAndReplaceInFile(inFile, outFile,
                               searchAndReplacePairs);
    }
}
Your application calls generateRuntimeConfigFiles() before initialising anything else that might rely on properties/XML files.
Now the only problem you have to deal with is how to store and retrieve different settings for searchAndReplacePairs. Perhaps you could obtain them from files with names such as production.properties, staging.properties and development.properties.
If the above approach appeals to you, email me for the source code of searchAndReplaceInFile() to save you from reinventing the wheel. You can find my email address in the "info" box of my Stack Overflow profile.
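For illustration only, a helper with that signature might look like the following minimal sketch. It is not the answerer's implementation: it assumes the files are small enough to read into memory, are encoded as UTF-8, and that the pairs array alternates search and replacement strings as in the pseudocode. The class name is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ConfigTemplates {
    /**
     * Copies inFile to outFile, replacing every occurrence of pairs[i]
     * with pairs[i + 1] for i = 0, 2, 4, ...
     */
    static void searchAndReplaceInFile(String inFile, String outFile, String[] pairs)
            throws IOException {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("pairs must alternate search and replacement strings");
        }
        String content = Files.readString(Path.of(inFile), StandardCharsets.UTF_8);
        for (int i = 0; i < pairs.length; i += 2) {
            // Literal replacement, no regular expressions involved.
            content = content.replace(pairs[i], pairs[i + 1]);
        }
        Files.writeString(Path.of(outFile), content, StandardCharsets.UTF_8);
    }
}
```

A template line such as db.url=@DB_URL@ (the placeholder syntax is an assumption) would then be rewritten by passing the pair "@DB_URL@" and the environment's real JDBC URL.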

I suggest using Apache Commons Configuration. It provides all the plumbing for dealing with different configurations depending on your environment.
http://commons.apache.org/configuration
