StormCrawler: The URL Database Specifications

StormCrawler: The URL Database Specifications - java

I am quite new to StormCrawler - as I have been exploring the documentation, as well as the READMEs and additional resources, I have noticed that it is often referred to a "URL database" which should handle storing information concerning the the URLs from the run of the crawler (for example here).
I have, however, not found anywhere of what type this database is, nor how to customize it or replace it with custom modules. I have been following the code and got to IOOutputController, which has some quite confusing methods and with the lack of docstrings, it is quite challenging to actually even determine the class responsible for handling this.
I would be very grateful for any guidance!
Thank you for your time, Matyáš

The most commonly used storage for the URLs in StormCrawler is Elasticsearch. This is illustrated in the tutorials. There are other ones available such as SQL or SOLR, see enter link description here; StormCrawler is not limited to a specific database.
In most cases, people just use an existing backend implementation such as the Elasticsearch one.

Related

name of a concept with data persistence

I am looking a name of a concept, where you can configure
any type of persistence methods like RDBMS, XML Databases, RESTful APIs or a formatted file like CSV file to a programming language, such as Java.
I am quite sure that this concept has a name, but I just can't remember what it's called...
It's similar to "DataSource" in CakePHP, but as I mentioned, it wouldn't be restricted only to RDBMS.
Does anyone got an idea?
EDIT: and know an actual implementation of this concept to Java?

I think "Data Source" already describes it very well. And data sources in CakePHP are not limited to just RDBMS. See https://github.com/cakephp/datasources/tree/2.0/Model/Datasource
I do no think there is a real name for that, what you're looking for is more a pattern and I think this level of abstraction can be approached by using different ways or patterns. Check how CakePHP or any other good framework is doing it or directly check the Spring framework for java. If you want to write your own abstraction layer lets inspire you be them.

How about DataServiceProvider and Abstract/Persist through one of the DataAdapters "types"
Which your business objects, or application services can locate this via a common service locator.
We have a similar implementation, though practically it is not being used much as people would prefer their own tools to source data for other data types.

Is it possible to build a web application as a generic product rather than a one-time customized solution?

There is a business problem that needs to be solved. The obvious solution is an enterprise web application - a locally hosted website that provides the desired functionality.
I want to build this web application, but build it such that -
Its more of a product than a one-time solution; such that it can be customized for different clients
It is possible to provide 'fixes' for this web application, so that bugs can be removed and enhancements added with minimum impact on operations
The web app should be capable of working with different databases and existing authentication systems
Is this even possible? Is it a common enough approach that there is a known way of going about this? Would it be better to use an application framework like Spring or try and keep dependencies on frameworks to minimal?
Also, any links or references to books that will guide me will be greatly appreciated.
Thanks in advance StackOverflow!
(I feel like I dont know all what I need to know before embarking on this project, please feel free to point out things I haven't and should consider)

Developing software, esp. for re-usability, requires analyzing which parts/functions are common between use cases and which aren't, drawing the line between re-usable (library) and customized/specialized code.
If you know what use cases you expect or want to support in the future this can be feasible.
If you don't, you should not start trying to generalize arbitrary functionality in the first place, because you cannot know what you will be needing in the future.
Java provides some good abstractions of various functionalities, like universal DB support via JDBC.
If you didn't already, have a look at application servers like JBoss or Glassfish. They provide plenty of basic functionality for web applications, support very loose coupling between components, and are highly configurable. To switch from one DBMS to another, for instance, it is enough to alter a single line of configuration (given the supported SQL is similar enough). Deploying applications or parts can often be done on the fly ("hot deployment") without even stopping the server.
Plus: There is a vast amout of supporting libraries and frameworks out there to help you standardize your application design.

I have been working for a while on a webapp that can be deployed in multiple locations: it is designed to be instantiated on many hosts. It's entirely possible to do this, but it is difficult. Writing the code so that it can work this way takes a great deal of care.
The key to doing it is to make all your dependencies on things explicit and all your configuration driven by properties that can be set during installation. Spring makes this quite a lot easier! In particular, the org.springframework.web.context.support.ServletContextPropertyPlaceholderConfigurer class allows you to use the servlet context as a source of values that you can then inject into your beans (e.g., via #Value annotations). It's far harder to do all that yourself. Here's (a simplified version of) what I use:
<bean class="org.springframework.web.context.support.ServletContextPropertyPlaceholderConfigurer">
<property name="contextOverride" value="true" />
<property name="location" value="/WEB-INF/default.properties" />
</bean>
This merges the servlet context's properties on top of the ones you provide as defaults inside your webapp (definitely a good practice if most things aren't going to need to be modified most of the time) and then uses them to define properties. I then apply a configuration property (e.g., foo.bar) to a bean property using a placeholder, like this:
#Value("${foo.bar}")
public void setFoobar(String foobar) { ... }
Things to configure that way include the database configuration, absolute locations of files holding things that can't be packaged inside the webapp, etc. You'll have to use your skill and knowledge of the application domain to work out what things need to be listed.
Other key principles are to keep as much as possible inside the webapp (so reducing the opportunity for the deployer to mess it up), to be very careful about documenting everything, and to try it with multiple servlet containers. Remember, the person deploying your webapp does not have access to the contents of your thoughts: you have to write it down and tell them exactly what to do. (Too many instructions are at the level of “click this, click that, magic happens” but those are poor instructions since the exact method will vary over time: saying why will help far more because its more portable.)

We are currently developing a product that can be deployed internally for multiple clients and also as a public portal solution. Here is our experience.
As others have pointed out, there are different factors to keep in mind.
Security
Security that is associated with your product, and how you would manage the product functional requirements to external security roles.
Security, authentication and authorization should not be as part of the base product. Once authorized the roles need to be mapped to product roles for achieving said functionality.
Images and logos, that require customization.
Internationalization.
For working with multiple databases, assuming a product has typically two different views, persistence and querying. Our experience was to use hibernate to support multiple databases, but theoretically we have used only two databases in the past. db2 and mysql.
Testing for multiple databases for every release of your product is a pain. Your test cases goes 3 fold or atleast once in a while to support multiple databases.
Using custom databases and functions are a big no, you can use some general functions but custom database specific functions in your query are going to be a pain and have to be very diligent to avoid them.
Supported browsers in your product.
Licenses of the third party jars may not be compatible / acceptable to all institutions so you have to watch out for that carefully.
As much as possible, enable properties or configuration to customize all variables.
Caching strategy and properties initialization strategies.
A framework helps the team to keep on the same page, rather than an internal framework. There are many advantages to use a well established framework like Spring for performance and other consideration.
Cheers!

Restricting hibernate queried data based upon the owning group of a user

The standard example is probably where you offer a service to multiple companies on the same hosted instance and want employees to be able to see data only from other employees of the same company, not of potentially competitive companies.
I'm using JBossAS7 with Hibernate 4.x.
I could push the company information down from the UI layer and have the (stateless) persistence layer filter on that, but it seems like a bad idea to me, I'd rather have it done in one place closer to the database.
I'm guessing there must be a standard, secure solution for this, maybe around security domains or hibernate sessions? Thoughts? Thanks in advance.

You seem to be building a "multi-tenant application". Hibernate's support for multi-tenancy is quite restricted at the moment, with feature request 5697 having been recently completed, in 4.0.0.Alpha2. Note that this feature request does not address addition of tenant discriminator columns in the entities, which going by the discussion in JIRA, would arrive in 4.0.0.Alpha3 or 4.1.0 (going by JIRA). At the moment, you can store the data related to various tenants in different databases or schemas.
You can also read this related blog post, on various options regarding achieving multi-tenancy in Hibernate; this is quite old compared to the work done in HHH-5697, and does not discuss how one would create a multi-tenant application with tenant discriminator columns in the entity model.

I'm not sure of any standard, but have worked on two systems where it was important. These pre-dated tools like Hibernate and our use of J2EE.
In all systems I've worked on we've had to code this ourselves - using company as part of our keys in requests.
One possibility is a whole different "whatever your database calls its partition" for each customer. (Schema if you're in Oracle). Sounds more complex but it does guarantee isolation between companies and it does also allow some management of scaling or new/delete company. In my previous place of work I remember legal types felt nervous if anyone mentioned keeping more than one company's data in the same table - so that kept them happy.
You could either have your app server connect to the database as a trusted user who can access all, or make sure you pass the end user's credentials down when you connect. I've heard of this. It sounds good from a security point of view and means in a database like Oracle the right thing will just happen. I've not seen it done and wonder how well connection pooling would work if at all.
Edit: Vineet's answer above seems to cover it well. It's an area I'll have to look at more. We've probably got too much legacy code here to change.

use compass-lucene as caching technique

Any example of scenarios other than doing search for which I could use "compass"?
Lets say we have a page that list top 10 most view article. How to use compass to show this kind of results. Any demo/sample project on this to refer to? definitely Jira would be a good example but its source code is not available. I want to know how to maximize the benefits of using compass-lucene in an application.
May i know where can i download spring-compass jpa #annotated example? The nightly built i downloaded is xml-based.

Any example of scenarios other than doing search for which I could use "compass"?
Well, AFAIK, this is what is has been designed for.
Lets say we have a page that list top 10 most view article. How to use compass to show this kind of results.
I'm not sure this is a good use-case, Compass is in my opinion useful when you want to search results across the whole application business domain model i.e. not only lets say articles (in that case, you can just query the database).
Now, let's imagine that your domains objects are searchable classes and have a searchable property numberOfViews, I guess it would be possible to search on this property (refer to the whole Searching section).
Any demo/sample project on this to refer to?
The compass distribution includes samples: a Library basic example that highlights the main features of Compass::Core and a Petclinic sample that shows how to add compass to an existing application (the Spring petclinic sample).
I want to know how to maximize the benefits of using compass-lucene in an application.
Read the author's blog, Compass wiki and the reference documentation :)
May i know where I can download spring-compass jpa #annotated example?
As mentioned, the spring-compass sample included in compass distribution is based on Spring's petclinic which doesn't use annotations (see SPR-2960). Just in case, petclinic annotated entities are attached to SPR-2960 so feel free to use them.

What JDBC tools do you use for synchronization of data sources?

I'm hoping to find out what tools folks use to synchronize data between databases. I'm looking for a JDBC solution that can be used as a command-line tool.
There used to be a tool called Sync4J that used the SyncML framework but this seems to have fallen by the wayside.

I have heard that the Data Replication Service provided by Db4O is really good. It allows you to use Hibernate to back onto a RDBMS - I don't think it supports JDBC tho (http://www.db4o.com/about/productinformation/drs/Default.aspx?AspxAutoDetectCookieSupport=1)
There is an open source project called Daffodil, but I haven't investigated it at all. (https://daffodilreplicator.dev.java.net/)
The one I am currently considering using is called SymmetricDS (http://symmetricds.sourceforge.net/)
There are others, they each do it slightly differently. Some use triggers, some poll, some use intercepting JDBC drivers. You need to decide what technical limitations you are under to determine which one you really want to use.
Wikipedia provides a nice overview of different techniques (http://en.wikipedia.org/wiki/Multi-master_replication) and also provides a link to another alternative DBReplicator (http://dbreplicator.org/).

If you have a model and DAO layer that exists already for your codebase, you can just create your own sync framework, it isn't hard.
Copy data is as simple as:
read an object from database A
remove database metadata (uuid, etc)
insert into database B
Syncing has some level of knowledge about what has been synced already. You can either do it at runtime by getting a list of uuids from TableInA and TableInB and working out which entries are new, or you can have a table of items that need to be synced (populate with a trigger upon insert/update in TableInA), and run from that. Your tool can be a TimerTask so databases are kept synced at the time granularity that you desire.
However there is probably some tool out there that does it all without any of this implementation faff, and each implementation would be different based on business needs anyway. In addition at the database level there will be replication tools.

True synchronization requires some data that I hope your database schema has (you can read the SyncML doc to see how they proceed). Sync4J won't help you much, it's really high-level and XML oriented. If you don't foresee any conflicts (which means: really easy synchronisation), you could try with a lightweight ETL like Enhydra Octopus.

I'm primarily using Oracle at the moment, and the most full-featured route I've come across is Red Gate's Data Compare:
http://www.red-gate.com/products/oracle-development/data-compare-for-oracle/
This old blog gives a good summary of the solution routes available:
http://www.novell.com/coolsolutions/feature/17995.html
The JDBC-specific offerings I've come across have been very basic. The solution mentioned by Aidos seems the most feature complete if you want to go down the publish-subscribe route:
http://symmetricds.codehaus.org/
Hope this helps.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.