How to customize Apache Nutch 2.3 generate step

I want Nutch to select specific URLs according to my own rules, which has to happen at generate time. I know how to write a parser or indexer plugin, but how do I do this at generate time? My Nutch version is the 2.3 series.

The Nutch generator is not really an extension point in Nutch, so you cannot write plugins to customize it. Nevertheless, nothing stops you from writing your own generator with your own logic.
You would need to adjust the bin/nutch and bin/crawl scripts to call your own generator instead of the default one. Keep in mind that other parts of Nutch rely on parts of the generator implementation (SegmentMerger, for instance). If you customize those parts, you will need to update some other classes as well.
The generator uses the ScoringFilter.generatorSortValue() method when deciding which elements to return, so a custom scoring filter is one alternative that doesn't require changing the generator.
Side note: this is not entirely uncommon; I've seen some clients requiring customized generators.
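
To make the scoring-filter route a bit more concrete, here is a minimal sketch in plain Java with no Nutch dependency. It only shows the kind of URL-based rule you might put inside your own generatorSortValue() implementation. The class UrlSortRule, its sortValue method, and the two example rules are hypothetical, and it assumes that higher sort values are preferred by the generator; take the real method signature from the ScoringFilter interface of your Nutch version.

```java
import java.net.URI;

/** Hypothetical helper: derives a generator sort value from the URL alone. */
final class UrlSortRule {

    /**
     * Boosts URLs we want selected first, demotes URLs we want pushed back,
     * and leaves everything else at the incoming value.
     */
    static float sortValue(String url, float initSort) {
        String host;
        String path;
        try {
            URI uri = URI.create(url);
            host = uri.getHost() == null ? "" : uri.getHost();
            path = uri.getPath() == null ? "" : uri.getPath();
        } catch (IllegalArgumentException e) {
            return 0.0f; // malformed URL: lowest priority
        }
        // Example rule 1 (assumption): news pages on one site come first.
        if (host.endsWith("example.org") && path.startsWith("/news/")) {
            return initSort * 10.0f;
        }
        // Example rule 2 (assumption): PDFs are pushed to the back.
        if (path.endsWith(".pdf")) {
            return initSort * 0.1f;
        }
        return initSort;
    }
}
```

Inside a real scoring filter you would call something like this from generatorSortValue() and return its result.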

As suggested by Jorge, you could write a scoring filter that assigns scores to pages based on your own logic, and filter during the generate step based on those scores. Alternatively, if your selection rules can be determined from the URL alone, you could use a bespoke URL normaliser with a scope of generate (or whatever the value is) that rewrites the unwanted URLs into something the URL filters then discard. You'd need to activate filtering as part of the generate step. This is an ugly hack.
Nutch 2.x is really awkward and I am not sure you could create a copy of your table based on a filter of the original one.
What Gora backend do you use?
StormCrawler is a lot more flexible for this and we've recently added a mechanism for filtering URLs at the spout level, which is exactly what you'd need. You could do a similar thing in Nutch 2.x but that would probably mean changing things in GORA as well.

Related

OpenAPI - generate server code for a changing api?

I am maintaining a Java application where we're constantly adding new features (changes in the api). I want to move towards using OpenAPI as a way to document the api. I see two schools of thought:
Write the code, use some annotations to generate the OpenAPI spec.
Write the OpenAPI, use it to generate some server code.
While both seem fine and dandy, the generated server code is simply stubbed out and would then require a lot of manual plugging in of services. That seems fine as a one-time cost, but the next time I update the interface, it seems to me the only two options are:
Generate them all again, re-do all the manual wiring.
Hand edit the previously generated classes to match the new spec file (potentially introducing errors).
Am I correct with those options? If so, it seems that using the code to generate the api spec file is really the only sane choice.

I would recommend an API-first approach, where you describe your API in the YAML file and regenerate the code with each new addition.
Now, how do you deal with the generator overwriting manual work?
You could use inheritance: create your own models and controllers that extend the generated code.
You can also use the ignore file provided with the generator (.openapi-generator-ignore, or .swagger-codegen-ignore for Swagger Codegen) if you want to be sure certain files are not overwritten.
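
A minimal sketch of the inheritance idea, assuming the generator emits an abstract base class that you are free to extend. The names GeneratedPetApi, PetApiImpl, getPetById and handleGet are hypothetical; what your generator actually produces depends on the generator and template you use.

```java
/** Stands in for a class the generator rewrites on every run (hypothetical). */
abstract class GeneratedPetApi {

    /** Generated stub: subclasses supply the real behaviour. */
    abstract String getPetById(long petId);

    /** Generated plumbing that the handwritten code never touches. */
    final String handleGet(String rawId) {
        return getPetById(Long.parseLong(rawId));
    }
}

/** Handwritten class, kept outside the generator's output directory. */
class PetApiImpl extends GeneratedPetApi {
    @Override
    String getPetById(long petId) {
        return "pet-" + petId; // would delegate to a real service
    }
}
```

When the spec changes, only GeneratedPetApi is regenerated. The compiler then points at every place in PetApiImpl that no longer matches, so there is no manual re-wiring of generated files.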

How to validate API in tests with Swagger?

I'm trying to figure out the best way to have my API documentation be the source of truth and use it to validate the actual Java REST code ideally through integration testing or something of that sort. We're using the contract first or consumer contract type of approach, so we don't want the documentation to be generated from annotated code necessarily and updating every time a developer makes a change.
One thought has been to use Swagger, but I'm not sure how best to make it be used for validating the API. Ideally, it'd be good to have the validation occur in the build or integration testing process to see if the real response (and request if possible) match what's expected. I know there are a lot of uses and tools for Swagger and just trying to wrap my head around it. Or if there is a better alternative to work with Java code.

Recently, we (the swagger-codegen community) started adding automatic test case generation to API clients (C#, PHP, Ruby). We've not added that to Java yet. Here are some example test cases generated by Swagger-Codegen for C#:
https://github.com/swagger-api/swagger-codegen/tree/master/samples/client/petstore/csharp/SwaggerClient/src/IO.Swagger.Test
It's still very preliminary and we would like to hear feedback from you to see if that's what you're looking for.

I think you should try swagger-request-validator:
https://bitbucket.org/atlassian/swagger-request-validator
Here are some examples of how to use it:
https://bitbucket.org/atlassian/swagger-request-validator/src/master/swagger-request-validator-examples/
Another alternative is assertj-swagger:
https://github.com/RobWin/assertj-swagger

You may want to look at Spring Cloud Contract. It offers you a DSL, where you can describe the scenarios (more or less what is the response I get for a given request) and it seems to fit well to what you described as a requirement...

If you're using the Spring Framework, I'd highly recommend checking out Spring RestDocs which allow you to generate

How to generate sequence diagrams automatically on executing junit

I have been given the task to "generate sequence diagrams automatically on execution of junit/test case" in Eclipse. I am learning UML. I have found tools that can generate a sequence diagram, and I am familiar with JUnit, but how do I combine the two?
The tools I found good were UMLet, ModelGoon UML, and ObjectAid. I zeroed in on ModelGoon, which I found simple and easy to use. How do I automate this task? Please guide me.
If there are any other tools available, please point me to them.

First: This is a very good idea, and there are several ways to go about it. I will assume that you are working in a JVM language (e.g. Kotlin or Java), so the suggestions I make are biased by that.
Direct approach
1) Set up your logging to log as JSON; it makes the rest much simpler: https://www.baeldung.com/java-log-json-output
2) Make a library that logs the name of the component/method you are in and the session you are processing. There are many ways of doing this, but a simple one is to use a thread-local variable: set the variable to contain the name of the thing you are tracing ("usecase foobar") and some unique ID (UUIDs are a decent choice). Another is to generate a tracing ID (or get one from an external interaction) and pass it as a parameter to all involved methods. Both will work; which is simpler in practice depends on the architecture of your application.
3) In the methods you want to trace, write a log entry that contains that tracing information (name of use case, trace ID, or any combination thereof), the location where the log entry was written, and any other information you want to add to your sequence diagram.
4) Run your test normally. A log will be produced. You need to be able to retrieve that log. There are many ways this can be done, use one :-)
5) Filter the log entries so you get only the ones you are interested in. The "jq" utility is a decent choice.
6) Process the filtered output to generate PlantUML input files (http://plantuml.com/) for sequence diagrams.
7) Process the PlantUML files to get sequence diagrams.
8) Done.
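
A minimal sketch of the direct approach, under simplifying assumptions: it keeps the trace entries in memory instead of writing JSON log lines, so the log retrieval and jq filtering steps are replaced by a simple filter on the trace ID. The class Trace and its methods are hypothetical names; only the thread-local trace ID and the PlantUML output format come from the steps above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

/** Hypothetical tracing helper: records calls per trace ID and renders PlantUML. */
final class Trace {
    // Trace ID of the use case currently running on this thread.
    private static final ThreadLocal<String> CURRENT_ID = new ThreadLocal<>();
    // Each entry is {traceId, caller, callee, message}; stands in for the log file.
    private static final List<String[]> ENTRIES =
            Collections.synchronizedList(new ArrayList<String[]>());

    /** Starts a new trace on this thread and returns its ID. */
    static String begin() {
        String id = UUID.randomUUID().toString();
        CURRENT_ID.set(id);
        return id;
    }

    /** Records one interaction; call this in the methods you want to trace. */
    static void call(String caller, String callee, String message) {
        ENTRIES.add(new String[] {CURRENT_ID.get(), caller, callee, message});
    }

    /** Clears the thread-local so the ID does not leak into the next test. */
    static void end() {
        CURRENT_ID.remove();
    }

    /** Renders the entries of one trace as PlantUML sequence diagram source. */
    static String toPlantUml(String traceId) {
        StringBuilder sb = new StringBuilder("@startuml\n");
        synchronized (ENTRIES) {
            for (String[] e : ENTRIES) {
                if (traceId.equals(e[0])) {
                    sb.append(e[1]).append(" -> ").append(e[2])
                      .append(": ").append(e[3]).append('\n');
                }
            }
        }
        return sb.append("@enduml\n").toString();
    }
}
```

In a JUnit test you could call Trace.begin() in a setup method, Trace.end() in a teardown method, and write the result of Trace.toPlantUml(id) to a .puml file for PlantUML to render.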
Industrial approach
Use some standard tooling for tracing, such as https://opentracing.io/, instrument your application with it, and extract your diagrams using that standard tooling.
This will also work in production and will probably scale much better than the direct approach, but if scaling isn't your thing, then the direct approach may be what you want.

Introduce per-customer personalization in java application

I've searched on the internet and here on SO, but couldn't wrap my mind around the various options.
What I need is a way to be able to introduce customer specific customization in any point of my app, in an "external" way, external as in "add drop-in jar and get the customized behavior".
I know that I should implement some sort of plugin system, but I really don't know where to start.
I've read some comment about spring, osgi, etc, can't figure out what is the best approach.
Currently, I have a really simple structure like this:
com.mycompany.module.client.jar // client side applet
com.mycompany.module.server.jar // server side services
I need a way of doing something like:
1) extend com.mycompany.module.client.MyClass as com.mycompany.module.client.MyCustomerClass
2) jar it separately from the "standard" jars: com.mycompany.customer.client.jar
3) drop in the jar
4) start the application, and have MyCustomerClass used everywhere the original MyClass was used.
Also, since the existing application is pretty big and based on a custom third-party framework, I can't introduce devastating changes.
Which is the best way of doing this?
Also, I need the solution to be doable with java 1.5, since the above mentioned third-party framework requires it.

Spring 3.1 is probably the easiest way to go about implementing this, as their dependency injection framework provides exactly what you need. With Spring 3.1's introduction of Bean Profiles, separating concerns can be even easier.
But integrating Spring into an existing project can be challenging, as there is some core architecture that must be created. If you are looking for a quick and non-invasive solution, using Spring containers programmatically may be an ideal approach.
Once you've initialized your Spring container in your startup code, you can explicitly access beans of a given interface by simply querying the container. Placing a single jar file with the necessary configuration classes on the classpath will essentially automatically include them.

Personalization depends strongly on the application design. You can search the Internet for pluggable applications and read a good article (for example: http://solitarygeek.com/java/a-simple-pluggable-java-application). In a pluggable application, features can be added or removed as the user decides. One way to build a pluggable application is to use interfaces to decouple the API layer from its implementation.
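
A minimal sketch of that interface-based decoupling, assuming the customer jar announces its implementation by class name (for example through a system property or a properties file on the classpath; that lookup is not shown). Greeter, StandardGreeter, CustomerGreeter and GreeterFactory are hypothetical names. The code sticks to language features available in Java 1.5.

```java
/** Extension point shared by the standard and customer-specific jars (hypothetical). */
interface Greeter {
    String greet(String name);
}

/** Default behaviour shipped in the standard jar. */
class StandardGreeter implements Greeter {
    public String greet(String name) {
        return "Hello, " + name;
    }
}

/** Would live in the drop-in customer jar. */
class CustomerGreeter extends StandardGreeter {
    @Override
    public String greet(String name) {
        return super.greet(name) + " (customer edition)";
    }
}

/** The application asks the factory instead of calling "new" directly. */
final class GreeterFactory {
    static Greeter create(String implClassName) {
        if (implClassName != null) {
            try {
                Class<?> cls = Class.forName(implClassName);
                return (Greeter) cls.getDeclaredConstructor().newInstance();
            } catch (Exception e) {
                // Class missing or not instantiable: fall back to the standard behaviour.
            }
        }
        return new StandardGreeter();
    }
}
```

The catch is that every place that currently instantiates the original class has to go through the factory, which is exactly the kind of change that needs to be in the design.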

User personalisation is something that needs to be in the design. What you can change as an afterthought, if the main body of code cannot be changed, is likely to be very limited.
You need to start by identifying what can be changed on a per-user basis. As it appears the main code cannot be changed, this is your main limiting factor. From this list, determine what would be useful to change and implement that.

what is the benefit in dynamically generating java bean classes from xml?

I have written a lot of Java bean classes using my IDE. Another person suggests a different approach: put the bean definitions in an XML file, then use JAXB or XSLT to generate the classes at build time. Though it's a novel and interesting approach, I do not see any major benefit in it.
I see only one benefit in the suggested approach: the Java bean classes need not be maintained under configuration control. Any bean change requires only an update to the XML file.
Are there any major benefits in dynamically generating Java classes? Is there any other reason why this approach is taken?

I agree with @Akhilss. My experiences have been in large scale Java EE projects where code generation is common.
It all depends on your project. If you are coding only a few beans and only need basic functionality, then I don't see the need to start with XML (which is often overused anyway), especially if you don't actually need the XML as well.
However, if you are building a system which needs the XML, an example being a SOAP web service with WSDL and schema, then generation is a good idea because it saves you from manually keeping schemas and beans in sync. It also provides factory classes and other support code.
As a counter argument, with EJB3 and similar standards, it's now often easier to write the beans and generate the messy XML stuff on the fly, i.e. let the server do the grunt work.
Another reason to consider code generation is if you require more complex functionality in your beans because they represent data structures. A few years ago I trialled the Apache Tuscany project for generating SDO beans from XML. The nice thing about that was that I could generate functionality like property change notifications so when I modified any of the bean's properties (including collections), other parts of your program could be notified automatically. Generated functionality like that can save you a lot of time and money if you need it.
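
To show the kind of code a generator can spare you from writing by hand, here is a minimal sketch of a bean with property change notification. It uses the JDK's java.beans.PropertyChangeSupport, not SDO or Tuscany, and CustomerBean with its name property is a hypothetical example.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

/** Hypothetical bean whose setter notifies registered listeners. */
class CustomerBean {
    private final PropertyChangeSupport changes = new PropertyChangeSupport(this);
    private String name;

    void addPropertyChangeListener(PropertyChangeListener listener) {
        changes.addPropertyChangeListener(listener);
    }

    String getName() {
        return name;
    }

    void setName(String newName) {
        String oldName = this.name;
        this.name = newName;
        // No event is fired when old and new values are equal and non-null.
        changes.firePropertyChange("name", oldName, newName);
    }
}
```

Multiply this boilerplate by every property of every bean and the appeal of generating it becomes clear.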
Ultimately, I'd suggest adhering to the KISS principle. So don't add what you don't need. Generated code from XML is useful if it helps you in the long run. But like any technology, be sure you are adding it for the right reasons.

I have used Jibx and its generator in my project. My experience has been mixed.
The usual case for using JAXB's (XJC) generator is referred to in http://static.springsource.org/spring-ws/site/reference/html/why-contract-first.html
Conversion to and from XML makes it possible to store beans in the DB and retrieve them for future use, as well as to use them as test case input for functional tests.
Using any kind of generator (JAXB, Jibx, XMLBeans, custom) might make sense for large projects. It allows for standardization of data types (like BigDecimal for financial amounts, or ArrayList for all lists) and for forcing interfaces (like Serializable or Cloneable). This enforces good practices and reduces the need for reviews of generated files.
It allows for injection of code through XSLT or post-processing of the generated Java files. An example is injecting rounding code to a specific decimal size (2, 6, 9) with a specific policy (UP, DOWN, NEAR) into the setter method of each field of type financialAmount. Forcing such behavior does reduce the number of bugs (such as incorrect financial values, which companies are liable for).
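
A minimal sketch of what such an injected setter could look like. InvoiceBean, the scale of 2 and the HALF_UP policy are assumptions chosen for illustration; a generator would take scale and policy from the schema or a configuration file.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical generated bean with rounding injected into the setter. */
class InvoiceBean {
    private static final int SCALE = 2;                              // assumed decimal size
    private static final RoundingMode POLICY = RoundingMode.HALF_UP; // assumed policy

    private BigDecimal amount;

    BigDecimal getAmount() {
        return amount;
    }

    void setAmount(BigDecimal amount) {
        // Every stored value is normalised to the same scale and rounding policy.
        this.amount = (amount == null) ? null : amount.setScale(SCALE, POLICY);
    }
}
```

Because the rounding lives in the setter, no caller can store an amount with the wrong precision by accident.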
The disadvantages are:
Usually each Java class can only be a bean class. Any customization made will be overwritten, since (in my case) the generator is tied into the build process and the classes get generated with every build.
You cannot implement your custom interfaces on a bean class or add annotations for your own or third-party frameworks.
You cannot easily implement patterns like a factory method, since default constructors are usually generated. Refactoring is usually difficult since generators do not usually support it.
You may not be able to generate enums where they would be most applicable (not sure now, but this was true a couple of years ago for Jibx).
You may not be able to override the default data type with your own, regardless of the need: for example, a CopyOnWrite list instead of an ArrayList for a variable shared across threads, or a custom List implementation which also implements the Observer pattern.
The benefits of a generator outweigh the costs for large distributed projects (large in persons, not code; think 150 developers in three locations). You can work around the disadvantages by defining custom classes which contain the bean and implement the behaviour, or by post-processing (adding additional code) with further metadata picked up from XSD annotations or another configuration file. Remember that support and maintenance of the generator become critical, since the entire project depends on it. Use it with CAUTION.
For smaller projects I personally would write my own classes. For larger projects I personally would not use it in the middle tier, mostly because of the lack of refactoring support. It can be used for simple beans meant to be bound to UI frameworks.
