Heritrix single-site scrape, including required off-site assets - java

I believe I need help composing Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules
I need to scrape an entire copy of a website (listed in the crawler-beans.cxml seed list) without scraping any external (off-site) pages. Any external resources needed to render the site's pages should be downloaded, but links to off-site pages should not be followed; only the assets for the current page/domain are wanted.
For example, CDN content required to render a page might be hosted on an external domain (maybe AWS or Cloudflare). I need to download that content and follow all on-domain links, but not follow any links to pages outside the current domain.

You could use three decide rules:
The first accepts all non-HTML content, using a ContentTypeNotMatchesRegexDecideRule.
The second accepts all URLs in the current domain.
The third rejects all pages that are neither in the domain nor directly reached from the domain (the alsoCheckVia option).
Something like this:
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<!-- Begin by REJECTing all... -->
<bean class="org.archive.modules.deciderules.RejectDecideRule" />
<bean class="org.archive.modules.deciderules.ContentTypeNotMatchesRegexDecideRule">
<property name="decision" value="ACCEPT"/>
<property name="regex" value="(?i)html|wml"/>
</bean>
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
<property name="decision" value="ACCEPT"/>
<property name="surtsSource">
<bean class="org.archive.spring.ConfigString">
<property name="value">
<value>
http://(org,yoursite,
</value>
</property>
</bean>
</property>
</bean>
<bean class="org.archive.modules.deciderules.surt.NotSurtPrefixedDecideRule">
<property name="decision" value="REJECT"/>
<property name="alsoCheckVia" value="true"/>
<property name="surtsSource">
<bean class="org.archive.spring.ConfigString">
<property name="value">
<value>
http://(org,yoursite,
</value>
</property>
</bean>
</property>
</bean>
</list>
</property>
</bean>
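To see why the order of these rules matters, here is a minimal sketch of how such a sequence is evaluated: every rule is consulted in turn, and the last rule that returns something other than NONE decides. The classes below are a toy model, not the Heritrix API; `Candidate`, `inDomain` and the idea that the content type is already known when the rule runs are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Toy model of a decide-rule sequence; these types are NOT Heritrix classes. */
final class ScopeSketch {
    enum Decision { ACCEPT, REJECT, NONE }

    /** Hypothetical candidate URI: its host, the host of the page it was found on, and whether it is HTML. */
    static final class Candidate {
        final String host;
        final String viaHost;
        final boolean html;

        Candidate(String host, String viaHost, boolean html) {
            this.host = host;
            this.viaHost = viaHost;
            this.html = html;
        }
    }

    /** Stand-in for the SURT prefix http://(org,yoursite, */
    static boolean inDomain(String host) {
        return host != null && (host.equals("yoursite.org") || host.endsWith(".yoursite.org"));
    }

    /** The four rules from the configuration above, in the same order. */
    static final List<Function<Candidate, Decision>> RULES = Arrays.asList(
        c -> Decision.REJECT,                                         // reject everything by default
        c -> !c.html ? Decision.ACCEPT : Decision.NONE,               // accept non-HTML content
        c -> inDomain(c.host) ? Decision.ACCEPT : Decision.NONE,      // accept on-domain URLs
        c -> (!inDomain(c.host) && !inDomain(c.viaHost))              // reject if neither the URL nor
                ? Decision.REJECT : Decision.NONE                     // its referrer is on-domain
    );

    /** Runs every rule; the last decision other than NONE wins. */
    static Decision decide(Candidate c) {
        Decision result = Decision.NONE;
        for (Function<Candidate, Decision> rule : RULES) {
            Decision d = rule.apply(c);
            if (d != Decision.NONE) {
                result = d;
            }
        }
        return result;
    }
}
```

In this model a CDN image found on an on-domain page is accepted by the second rule and left alone by the fourth, while an off-domain HTML page linked from the site is never accepted by any rule and so keeps the initial REJECT.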

I asked a related question, "Crawling rules in heritrix, how to load embedded content?", and came up with a solution there. Later I found this post, so I am submitting my solution here as well.
Note: I know the question is old, so it was most likely written for an older Heritrix version. I am using 3.4.
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.AcceptDecideRule" />
<bean class="org.archive.modules.deciderules.NotMatchesListRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regexList">
<list>
<value>.*site\.domain/path/.*</value>
</list>
</property>
</bean>
<bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
<property name="decision" value="ACCEPT"/>
<property name="regex" value="(E|X)" />
</bean>
<!-- Below are some of the "standard" rules set up on a fresh job, it behaves the same with and without them when it comes to not loading embedded stuff -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
<!-- <property name="maxHops" value="20" /> -->
</bean>
<!-- ...and REJECT those with suspicious repeating path-segments... -->
<bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
<!-- <property name="maxRepetitions" value="2" /> -->
</bean>
<!-- ...and REJECT those with more than threshold number of path-segments... -->
<bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
<!-- <property name="maxPathDepth" value="20" /> -->
</bean>
<!-- ...but always ACCEPT those marked as prerequisite for another URI... -->
<bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
</bean>
<!-- ...but always REJECT those with unsupported URI schemes -->
<bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
</bean>
</list>
</property>
</bean>
Adjust <value>.*site\.domain/path/.*</value> to match your site, and path if any.
You can also adjust <property name="regex" value="(E|X)" />: E|X can be reduced to just E if you only want the resources known to be included in the page, such as images and CSS. X is a bit experimental and also tries things found in JavaScript files.
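For orientation, Heritrix records how a URI was reached as a hop path with one letter per hop, for example L for a followed link, E for an embed and X for a speculative embed. The sketch below only uses java.util.regex and does not call Heritrix. Whether the rule applies the regex to the whole hop path or searches within it is an assumption worth checking against your Heritrix version, so both behaviours are shown; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

/** Illustrates regex matching on hop-path strings; not Heritrix code. */
final class HopsPathSketch {
    // The regex from the configuration above.
    static final Pattern EMBED = Pattern.compile("(E|X)");
    // A variant that only looks at the final hop of a longer path.
    static final Pattern ENDS_WITH_EMBED = Pattern.compile(".*(E|X)");

    /** True only when the entire hop path is a single E or X. */
    static boolean wholePathMatches(String hopsPath) {
        return EMBED.matcher(hopsPath).matches();
    }

    /** True when the last hop is E or X, whatever came before it. */
    static boolean lastHopIsEmbed(String hopsPath) {
        return ENDS_WITH_EMBED.matcher(hopsPath).matches();
    }
}
```

If embedded resources deeper in the site are not being fetched, comparing the two patterns against the hop paths shown in the crawl log is a quick way to tell which behaviour applies.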

Related

Broadleaf Commerce - My workflow activities are being executed twice

It seems that the built-in workflow activities are being executed twice. I am testing the checkout workflow, and the DecrementInventoryActivity is removing the quantity from the SKU twice.
Is this a known bug or am I doing something wrong?
I created the workflow like so:
<!-- Checkout Workflow Configuration -->
<bean id="blCheckoutWorkflow" class="org.broadleafcommerce.core.workflow.SequenceProcessor">
<property name="processContextFactory">
<bean class="org.broadleafcommerce.core.checkout.service.workflow.CheckoutProcessContextFactory"/>
</property>
<property name="activities">
<list>
<bean p:order="6000" id="blDecrementInventoryActivity" class="org.broadleafcommerce.core.checkout.service.workflow.DecrementInventoryActivity">
<property name="rollbackHandler" ref="blDecrementInventoryRollbackHandler" />
</bean>
<bean p:order="7000" id="blCompleteOrderActivity" class="org.broadleafcommerce.core.checkout.service.workflow.CompleteOrderActivity">
<property name="rollbackHandler" ref="blCompleteOrderRollbackHandler" />
</bean>
<bean p:order="9999999" class="com.mycompany.workflow.checkout.NotifyExternalInventorySystem" />
</list>
</property>
<property name="defaultErrorHandler">
<bean class="org.broadleafcommerce.core.workflow.DefaultErrorHandler">
<property name="unloggedExceptionClasses">
<list>
<value>org.broadleafcommerce.core.inventory.service.InventoryUnavailableException</value>
</list>
</property>
</bean>
</property>
</bean>
Starting with Broadleaf 4.0, the DecrementInventoryActivity was added by default to the blCheckoutWorkflow. See the 3.1.10-4.0.0 migration notes at http://www.broadleafcommerce.com/docs/core/4.0/migration-notes/3.1-to-4.0-migration/3.1.10-to-4.0-migration, in the section "Inventory Management".
This also goes for the defaultErrorHandler, and you can remove the blCompleteOrderActivity (that has always been managed in the framework). Basically, your customized blCheckoutWorkflow bean should change to:
<bean id="blCheckoutWorkflow" class="org.broadleafcommerce.core.workflow.SequenceProcessor">
<property name="activities">
<list>
<bean p:order="9999999" class="com.mycompany.workflow.checkout.NotifyExternalInventorySystem" />
</list>
</property>
</bean>
Starting with Broadleaf 3.0, any modification to the blCheckoutWorkflow bean goes through Broadleaf's XML merge processing, which merges the contents of beans with the same id (here, blCheckoutWorkflow's list of activities). In your case, the DecrementInventoryActivity is already defined in the core framework XML file and your definition of blCheckoutWorkflow is merged with it, so the final result contains 2 instances of the DecrementInventoryActivity.
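The effect of that merge can be pictured with a small sketch. This is a plain-Java model of appending one activity list to another, not Broadleaf's merge code; `MergeSketch` and its methods are hypothetical, and the framework list here contains only the two activities named above rather than the real default workflow.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of merging two activity lists by appending; not Broadleaf code. */
final class MergeSketch {
    /** Appends the custom activities to the framework ones without de-duplicating. */
    static List<String> merge(List<String> framework, List<String> custom) {
        List<String> merged = new ArrayList<>(framework);
        merged.addAll(custom);
        return merged;
    }

    /** Counts how many times an activity appears in the merged workflow. */
    static int count(List<String> activities, String name) {
        int n = 0;
        for (String activity : activities) {
            if (activity.equals(name)) {
                n++;
            }
        }
        return n;
    }
}
```

Redeclaring DecrementInventoryActivity in the custom list yields a count of two after the merge, whereas a custom list containing only NotifyExternalInventorySystem leaves it at one.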

Spring config properties from database and properties

I asked a similar question, but based on the responses, I did a bad job describing what I am after. I have a Spring 4 webapp that loads properties from a properties file. We consume those properties both via "${prop.name}" expressions in Spring and by injecting a properties object into some of our classes.
We want to move most of the properties to a database table and make them reloadable. However, a few need to stay in local properties, potentially overriding the database setting. These should also be loaded dynamically after the app is running.
I know that once a particular bean is injected, it won't get reloaded. That doesn't concern me; it's up to that module to handle it. But I am having trouble getting the behavior I want. In particular, I have implemented an AbstractConfiguration from Apache Commons Configuration to get the dual source and overriding I am after. While it works for injecting the properties object, expressions loaded with "${prop.name}" don't work at all.
How can I get them to work? Did I override the wrong thing? Is it just some config detail?
<bean id="sysProperties" class="org.springframework.beans.factory.config.MethodInvokingFactoryBean">
<property name="targetObject" ref="databaseConfigurator" />
<property name="targetMethod" value="getProperties"/>
</bean>
<bean id="databaseConfigurator" class="my.util.config.MyDatabaseConfigurator">
<property name="datasource" ref="dataSource" />
<property name="propertyFile" value="/WEB-INF/my.properties" />
<property name="applicationName" value="ThisApp" />
</bean>
<bean id="dbConfigFactory" class="org.apache.commons.configuration.ConfigurationConverter" factory-method="getProperties">
<constructor-arg ref="databaseConfigurator" />
</bean>
I haven't tested this, but I think it might work.
<bean id="sysProperties" class="org.springframework.beans.factory.config.MethodInvokingFactoryBean">
<property name="targetObject" ref="databaseConfigurator" />
<property name="targetMethod" value="getProperties"/>
</bean>
<bean id="databaseConfigurator" class="my.util.config.MyDatabaseConfigurator">
<property name="datasource" ref="dataSource" />
<property name="propertyFile" value="/WEB-INF/my.properties" />
<property name="applicationName" value="ThisApp" />
</bean>
<bean name="PropertyPlaceholderConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="properties" ref="CommonsConfigurationFactoryBean"/>
</bean>
<bean name="CommonsConfigurationFactoryBean" class="org.springmodules.commons.configuration.CommonsConfigurationFactoryBean">
<constructor-arg ref="databaseConfigurator"/>
</bean>
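Independently of the Spring wiring, the lookup order the question asks for (a local file overriding database values) can be sketched with plain java.util.Properties, which supports a fallback set of defaults. This is only an illustration: `LayeredPropertiesSketch` is a made-up name, and the two inputs stand in for whatever MyDatabaseConfigurator actually loads.

```java
import java.util.Properties;

/** Local properties take precedence; database properties act as fallback defaults. */
final class LayeredPropertiesSketch {
    static Properties layered(Properties fromDatabase, Properties fromLocalFile) {
        // The database values are kept by reference as defaults, so later
        // changes to them are visible through getProperty on the result.
        Properties result = new Properties(fromDatabase);
        // Local entries are copied on top and win over the defaults.
        result.putAll(fromLocalFile);
        return result;
    }
}
```

Keep in mind that PropertyPlaceholderConfigurer resolves "${...}" placeholders when the application context starts, so a layered lookup like this helps code that reads the properties object at runtime, not beans whose values were already injected.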

Custom timeout configuration per operation using Spring

I am using a SOAP web service and I have to customize the timeout configuration per operation. The timeouts are currently configured with CXF and its http-conf:conduit, which cannot be customized at the operation level.
My current configuration is:
<bean id="proxyFactory" class="org.apache.cxf.jaxws.JaxWsProxyFactoryBean">
<property name="serviceClass" value="com.package.PortType" />
<property name="address" ref="URL_WS" />
</bean>
<bean id="URL_WS" class="java.lang.String">
<constructor-arg value="http://serveraddress/Service"/>
</bean>
<http-conf:conduit name="http://serveraddress/Service.*">
<http-conf:client ConnectionTimeout="10000" ReceiveTimeout="10000"/>
</http-conf:conduit>
With this configuration, all timeouts for this web service are set to 10000 ms.
As explained above, I would like to customize this at the operation level. I found this link and tried to follow the process, but I ran into an implementation problem: the only QNameUtils on my classpath is com.ibm.wsdl.util.xml.QNameUtils, whose factory method is:
public static QName newQName(Node paramNode), which takes an org.w3c.dom.Node.
I tried to adapt the configuration accordingly and ended up with:
<bean id="proxyFactory" class="org.apache.cxf.jaxws.JaxWsProxyFactoryBean">
<property name="delegate">
<jaxws:client serviceClass="com.package.PortType" address="URL_WS" >
<jaxws:outInterceptors>
<bean class="com.package.CustomTimeoutInterceptor">
<property name="receiveTimeoutByOperationName">
<map key-type="javax.xml.namespace.QName" value-type="java.lang.Long">
<entry value="10">
<key>
<bean class="com.ibm.wsdl.util.xml.QNameUtils" factory-method="newQName">
<!-- I don't know what to put here -->
</bean>
</key>
</entry>
</map>
</property>
</bean>
</jaxws:outInterceptors>
</jaxws:client>
</property>
</bean>
The Node implementation I have is com.sun.org.apache.xerces.internal.dom.NodeImpl. I don't know which NodeImpl subclass to use or how to create it as a bean; I'm getting lost in the documentation with all the different implementations and DOM levels.
I would just like to either create an object of a Node subclass that works with this QNameUtils method
OR
find a different way to customize my configuration.
I finally solved this problem. Here is the working solution:
I kept the CustomTimeoutInterceptor from the link and combined it with the approach from this other link.
I also kept my initial configuration, and I found that javax.xml.namespace.QName has a factory method. I just added this part to my configuration:
<!-- Creation of the bean for the interceptor -->
<bean id="timeoutSetter" class="com.package.CustomTimeoutInterceptor">
<property name="receiveTimeoutByOperationName">
<map key-type="javax.xml.namespace.QName" value-type="java.lang.Long">
<entry value="20000">
<key>
<bean class="javax.xml.namespace.QName" factory-method="valueOf">
<constructor-arg value="{http://serveraddress/Service}Operation1" />
</bean>
</key>
</entry>
</map>
</property>
</bean>
<!-- I add the interceptor to the list of outInterceptors -->
<cxf:bus>
<cxf:outInterceptors>
<ref bean="timeoutSetter"/>
</cxf:outInterceptors>
</cxf:bus>
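To show what the map key in that bean definition amounts to, here is a small stand-alone sketch. QName.valueOf parses the "{namespace}localPart" form into a QName, which can then be used to look up a per-operation timeout. The CustomTimeoutInterceptor itself is not shown in this post, so `TimeoutLookupSketch`, `timeoutFor` and the fallback of 10000 ms (taken from the conduit configuration) are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;

/** Per-operation timeout lookup keyed by QName; not the actual interceptor. */
final class TimeoutLookupSketch {
    static final long DEFAULT_TIMEOUT_MS = 10000L;

    /** Builds the same single-entry map as the XML configuration above. */
    static Map<QName, Long> timeouts() {
        Map<QName, Long> map = new HashMap<>();
        map.put(QName.valueOf("{http://serveraddress/Service}Operation1"), 20000L);
        return map;
    }

    /** Returns the configured timeout for an operation, or the default if none is set. */
    static long timeoutFor(QName operation) {
        return timeouts().getOrDefault(operation, DEFAULT_TIMEOUT_MS);
    }
}
```

Because QName equality is based on the namespace URI and the local part, a QName built with the two-argument constructor matches the key produced by valueOf.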

Spring Data JPA with Hibernate mapping files

I want to use Spring Data JPA with Hibernate mapping files and without JPA annotations.
But I am facing this exception on server startup (Tomcat):
java.lang.IllegalStateException: No persistence units parsed from {classpath*:META-INF/persistence.xml}
at org.springframework.orm.jpa.persistenceunit.DefaultPersistenceUnitManager.obtainDefaultPersistenceUnitInfo(DefaultPersistenceUnitManager.java:547)
at org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean.determinePersistenceUnitInfo(LocalContainerEntityManagerFactoryBean.java:311)
at org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean.createNativeEntityManagerFactory(LocalContainerEntityManagerFactoryBean.java:260)
My dispatch-servlet.xml looks like the following:
<bean id="entityManagerFactory"
class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
<!--<property name="persistenceUnitName" value="BLUPP" />-->
<property name="dataSource" ref="dataSource" />
<property name="jpaVendorAdapter" ref="hibernateJpaVendorAdapter" />
<!-- <property name="packagesToScan" value="org.cleanyourway.server.beans" />-->
<property name="persistenceUnitPostProcessors">
<list>
<bean
class="org.springframework.data.jpa.support.ClasspathScanningPersistenceUnitPostProcessor">
<constructor-arg value="org.xxxxxx.server.beans" />
<property name="mappingFileNamePattern" value="**hbm.xml" />
</bean>
</list>
</property>
</bean>
Is it possible to use Hibernate mapping files with the ClasspathScanningPersistenceUnitPostProcessor?
I can get it running with
<property name="packagesToScan" value="org.xxxxxxx.server.beans" />
and JPA Annotations.
Thanks for your help!
Briefly
Your problem probably comes from the mappingFileNamePattern you provide. Try **/*.hbm.xml instead of **hbm.xml.
Complete snippet:
<bean id="entityManagerFactory"
class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
<!--<property name="persistenceUnitName" value="BLUPP" />-->
<property name="dataSource" ref="dataSource" />
<property name="jpaVendorAdapter" ref="hibernateJpaVendorAdapter" />
<!-- <property name="packagesToScan" value="org.cleanyourway.server.beans" />-->
<property name="persistenceUnitPostProcessors">
<list>
<bean
class="org.springframework.data.jpa.support.ClasspathScanningPersistenceUnitPostProcessor">
<constructor-arg name="basePackage" value="org.xxxxxx.server.beans" />
<property name="mappingFileNamePattern" value="**/*.hbm.xml" />
</bean>
</list>
</property>
</bean>
In detail
Ant path patterns
Spring uses Ant-style path patterns. You can find good documentation on these patterns on the Ant website. The double asterisk wildcard means: recurse into subdirectories. It should be followed by a slash because it stands for a directory.
ClasspathScanningPersistenceUnitPostProcessor
The mapping file detection part of ClasspathScanningPersistenceUnitPostProcessor takes both parameters into account: basePackage (your constructor argument) and mappingFileNamePattern. With the suggested correction, Spring will search for all *.hbm.xml files in org/xxxxxx/server/beans/ and its subfolders on the classpath.
In other words, you cannot expect mappingFileNamePattern to be interpreted on its own for the search.
Below is the code snippet of ClasspathScanningPersistenceUnitPostProcessor that does the job:
String path = ResourcePatternResolver.CLASSPATH_ALL_URL_PREFIX
+ basePackage.replace('.', File.separatorChar)
+ File.separator + mappingFileNamePattern;
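As a quick illustration of what that concatenation produces, the sketch below rebuilds the search path as a plain string. It is not the Spring class: `MappingPathSketch` is a made-up name, the prefix is written out as the literal "classpath*:", and a forward slash is used instead of File.separatorChar so the result is the same on every platform.

```java
/** Rebuilds the resource search path from a base package and a file name pattern. */
final class MappingPathSketch {
    // Literal value of ResourcePatternResolver.CLASSPATH_ALL_URL_PREFIX.
    static final String CLASSPATH_ALL_URL_PREFIX = "classpath*:";

    static String searchPath(String basePackage, String mappingFileNamePattern) {
        // Dots in the package name become directory separators.
        return CLASSPATH_ALL_URL_PREFIX
                + basePackage.replace('.', '/')
                + "/" + mappingFileNamePattern;
    }
}
```

With the base package from the question and the pattern **/*.hbm.xml, this gives classpath*:org/xxxxxx/server/beans/**/*.hbm.xml, which makes it visible that the pattern is always resolved relative to the base package.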
Small limitation of ClasspathScanningPersistenceUnitPostProcessor
You cannot scan for HBM files located at the root of JAR files in your classpath. basePackage cannot be empty and does not work with just a "." value.
Moreover, the underlying PathMatchingResourcePatternResolver doesn't work with an Ant-style path pattern containing a wildcard (* in your case) without a root directory (here and here (first warning in Other notes)).
Bug in ClasspathScanningPersistenceUnitPostProcessor
This class has never worked with Hibernate.
In the pre-1.4.x releases, there was this bug.
With this pull request, there seems to be a new bug that prevents me from getting the whole thing working with HBM files in JARs. I got a NullPointerException at line 146 because resource.getURI().getPath() doesn't seem to work with a URI that has two colons in the protocol (jar:file:/ in this case) and returns a null path.
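The null path can be reproduced with java.net.URI alone. A jar:file:/... URI is what the URI class calls opaque (the part after the scheme does not start with a slash), and getPath() is documented to return null for opaque URIs. The file names below are invented for the example.

```java
import java.net.URI;

/** Shows that java.net.URI returns a null path for jar: URIs. */
final class JarUriSketch {
    /** Returns the path component, or null when the URI is opaque. */
    static String pathOf(String uri) {
        return URI.create(uri).getPath();
    }

    /** True when the URI has no hierarchical path to extract. */
    static boolean isOpaque(String uri) {
        return URI.create(uri).isOpaque();
    }
}
```

A plain file:/app/lib/model.jar URI yields the path /app/lib/model.jar, while jar:file:/app/lib/model.jar!/Foo.hbm.xml yields null, so any code that calls a String method on the result fails with a NullPointerException.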
(I will update my answer with a link to a bug report once I have found or filed one.)

OAuth for Spring Security - Howto implement resource declaration

I am trying to understand the next steps I have to take, starting from the reference application at
http://svn.codehaus.org/spring-security-oauth/trunk/sparklr/
in order to create my own implementation. What I do not understand is where and how to declare dynamic resources for OAuth. In the reference app, resources are hard-coded in the XML config:
<bean id="photoServices" class="org.springframework.security.oauth.examples.sparklr.impl.PhotoServiceImpl">
<property name="photos">
<list>
<bean class="org.springframework.security.oauth.examples.sparklr.PhotoInfo">
<property name="id" value="1"/>
<property name="name" value="photo1.jpg"/>
<property name="userId" value="marissa"/>
<property name="resourceURL"
value="/org/springframework/security/oauth/examples/sparklr/impl/resources/photo1.jpg"/>
</bean>
<bean class="org.springframework.security.oauth.examples.sparklr.PhotoInfo">
<property name="id" value="2"/>
<property name="name" value="photo2.jpg"/>
<property name="userId" value="paul"/>
<property name="resourceURL"
value="/org/springframework/security/oauth/examples/sparklr/impl/resources/photo2.jpg"/>
</bean>
<!-- some more -->
</list>
</property>
</bean>
I guess this is not how resources created by users are handled in the real world. So: how is this supposed to be done?
In the example shown above, it looks like the beans are pre-configured at design time and pre-loaded by Spring.
Did you consider actually creating and loading the beans dynamically at runtime?
That way, whenever a new dynamic resource is created, you can access the "photos" list and add the resource to it directly.
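A minimal sketch of that idea, assuming a service that keeps its photos in a list that can grow while the application is running. None of these classes come from the sparklr sample: `DynamicPhotoServiceSketch`, `Photo`, `addPhoto` and `photosForUser` are hypothetical, and a real application would more likely back this with a database than with an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** In-memory photo registry that accepts new resources at runtime; illustrative only. */
final class DynamicPhotoServiceSketch {
    /** Hypothetical counterpart of the PhotoInfo beans in the XML above. */
    static final class Photo {
        final String id;
        final String name;
        final String userId;

        Photo(String id, String name, String userId) {
            this.id = id;
            this.name = name;
            this.userId = userId;
        }
    }

    // Thread-safe list so uploads and reads can happen concurrently.
    private final List<Photo> photos = new CopyOnWriteArrayList<>();

    /** Registers a resource created by a user after startup. */
    void addPhoto(Photo photo) {
        photos.add(photo);
    }

    /** Returns the photos owned by the given user. */
    List<Photo> photosForUser(String userId) {
        List<Photo> result = new ArrayList<>();
        for (Photo photo : photos) {
            if (photo.userId.equals(userId)) {
                result.add(photo);
            }
        }
        return result;
    }
}
```

The protected endpoint would then call something like photosForUser with the authenticated user's name instead of reading a list fixed in the XML.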
