Committing Java jar files into a repository (CVS, SVN, ...)

Why it's a bad idea to commit Java jar files into a repository (CVS, SVN..)

Because you can rebuild them from the source. On the other hand, if you are talking about third-party JAR files required by your project, then it is a good idea to commit them into the repository so that the project is self-contained.

So, you have a project that uses some external dependencies. These dependencies are well known. They all have:
A group (typically, the organization/forge creating them)
An identifier (their name)
A version
In Maven terminology, this information is called the artifact's (your JAR's) coordinates.
The dependencies I was talking about are either internal (for a web application, this can be your service/domain layer) or external (log4j, a JDBC driver, a Java EE framework, you name it). All those dependencies (also called artifacts) are, at their lowest level, binary files (JAR/WAR/EAR) that your CVS/SVN/Git won't be able to store efficiently. Indeed, SCMs work on the assumption that versioned content (the content for which diff operations are most efficient) is text only. As a consequence, when binary data is stored, there is rarely any storage optimization (contrary to text, where only the differences between versions are stored).
As a consequence, what I would tend to recommend is to use a dependency-management build system, like Maven, Ivy, or Gradle. Using such a tool, you will declare all your dependencies (in fact, their artifact coordinates) in a text (or maybe XML) file, which will be in your SCM. BUT your dependencies themselves won't be in the SCM. Rather, each developer will download them to his or her dev machine.
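As a rough sketch (assuming Maven; log4j is just an illustrative artifact and the version shown is arbitrary), such a declaration in a pom.xml looks like this:

    <!-- pom.xml: the three coordinates (group, identifier, version) of one dependency -->
    <dependencies>
      <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
      </dependency>
    </dependencies>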
This transfers some network load from the SCM server to the internet (whose bandwidth is often more limited than that of an internal enterprise network), and raises the question of the long-term availability of artifacts. Both of these issues are solved (at least in the Maven world, but I believe both Ivy and Gradle are able to connect to such tools, and it seems some questions have been asked on this very subject) by using enterprise repository proxies, like Nexus, Artifactory and others.
The beauty of these tools is that they make a view of all required artifacts available on the internal network, going as far as allowing you to deploy your own artifacts to these repositories, making sharing your code both easy and independent of the source (which may be an advantage).
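For illustration, pointing every developer's Maven installation at such a proxy is typically a one-time entry in settings.xml. The URL below is a made-up example; the exact repository path depends on how your Nexus/Artifactory instance is set up:

    <!-- ~/.m2/settings.xml: route all artifact downloads through the corporate proxy -->
    <settings>
      <mirrors>
        <mirror>
          <id>corporate-proxy</id>
          <mirrorOf>*</mirrorOf>
          <url>https://nexus.example.com/repository/maven-public/</url>
        </mirror>
      </mirrors>
    </settings>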
To sum up this long reply: use Ivy/Maven/Gradle instead of a plain Ant build. These tools will allow you to define your dependencies, and will do all the work of downloading those dependencies and ensuring you use the declared versions.
On a personal note, the day I discovered those tools, my view of dependency handling in Java went from nightmare to heaven, as I now only have to say that I use this particular version of this tool, and Maven (in my case) does all the background work of downloading it and storing it at the right location on my computer.

Source control systems are designed for holding the text source code. They can hold binary files, but that isn't really what they are designed for. In some cases it makes sense to put a binary file in source control, but java dependencies are generally better managed in a different way.
The ideal setup is one that lets you manage your dependencies outside of source control. You should be able to manage your dependencies outside of the source and simply "point" to the desired dependency from within the source. This has several advantages:
You can have a number of projects dependent on the same binaries without keeping a separate copy of each binary. It is common for a medium sized project to have hundreds of binaries it depends on. This can result in a great deal of duplication which wastes local and backup resources.
Versions of binaries can be managed centrally within your local environment or within the corporate entity.
In many situations, the source control server is not a local resource. Adding a bunch of binary files will slow things down because it increases the amount of data that needs to be sent across a slower connection.
If you are creating a war, there may be some jars you need for development, but not deployment and vice versa. A good dependency management tool lets you handle these types of issues easily and efficiently.
If you are depending on a binary file that comes from another one of your projects, it may change frequently. This means you could be constantly overwriting the binary with a new version. Since version control is going to keep every copy, it could quickly grow to an unmanageable size--particularly if you have any type of continuous integration or automated build scripts creating these binaries.
A dependency management system offers a certain level of flexibility in how you depend on binaries. For example, on your local machine, you may want to depend on the latest version of a dependency as it sits on your file system. However, when you deploy your application, you want the dependency packaged as a jar and included in your deployable archive.
Maven's dependency management features solve these issues for you and can help you locate and retrieve binary dependencies as needed. Ivy is another tool that does this as well, but for Ant.
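To illustrate the development-versus-deployment point with Maven (the coordinates below are common examples chosen for illustration, not taken from the question): a provided dependency is available at compile time but is not packaged into the war, while a test dependency never leaves the build at all.

    <!-- needed to compile, but supplied by the servlet container at runtime -->
    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>javax.servlet-api</artifactId>
      <version>3.1.0</version>
      <scope>provided</scope>
    </dependency>
    <!-- only used by the test code; never ends up in the war -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>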

They are binary files:
It's better to reference the source, since that's what you're using source control for.
The system can't show you the differences between the files.
They become a source of merge conflicts if they are compiled from source in the same repository.
Some systems (e.g. SVN) don't deal quite well with large binary files.
In other words, better reference the source, and adjust your build scripts to make everything work.

The decision to commit jar files to SCM is usually influenced by the build tool being used. If using Maven in a conventional manner then you don't really have the choice. But if your build system allows you the choice, I think it is a good idea to commit your dependencies to SCM alongside the source code that depends on them.
This applies to third-party jars and in-house jars that are on a separate release cycle to your project. For example, if you have an in-house jar file containing common utility classes, I would commit that to SCM under each project that uses it.
If using CVS, be aware that it does not handle binary files efficiently. An SVN repository makes no distinction between binary and text files.
http://svnbook.red-bean.com/en/1.5/svn.forcvs.binary-and-trans.html
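If you do keep the jars in SCM with a plain Ant build, the build script simply picks them up from the checked-in directory. A minimal sketch, assuming the jars live in a lib/ folder next to the sources:

    <!-- build.xml fragment: compile against the jars committed under lib/ -->
    <path id="compile.classpath">
      <fileset dir="lib">
        <include name="**/*.jar"/>
      </fileset>
    </path>
    <javac srcdir="src" destdir="build/classes" classpathref="compile.classpath"/>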
Update in response to the answer posted by Mark:
WRT bullet point 1: I would say it is not very common for even a large project to have hundreds of dependencies. In any case, disk usage (by keeping a separate copy of a dependency in each project that uses it) should not be your major concern. Disk space is cheap compared with the amount of time lost dealing with the complexities of a Maven repository. In any case, a local Maven repository will consume far more disk space than just the dependencies you actually use.
Bullet 3: Maven will not save you time waiting for network traffic; the opposite is true. With your dependencies in source control, you do a checkout, then you switch from one branch to another. You will very rarely need to check out the same jars again, and if you do, it will take only minutes. The main reason Maven is a slow build tool is all the network access it does even when there is no need.
Bullet Point 4: Your point here is not an argument against storing jars in SCM. Maven is only easy once you have learned it, and it is only efficient up to the point when something goes wrong; then it becomes difficult, and your efficiency gains can disappear quickly. In terms of efficiency, Maven has a small upside when things work correctly and a big downside when they don't.
Bullet Point 5: Version control systems like SVN do not keep a separate copy of every version of every file. It stores them efficiently as deltas. It is very unlikely that your SVN repository will grow to an 'unmanageable' size.
Bullet Point 6: Your point here is not an argument against storing files in SCM. The use case you mention can be handled just as easily by a custom Ant build.


What are the advantages of jar (war) compression?

You can skip the wall of text and go straight to the questions listed below, if you are so inclined.
Some background:
I'm currently doing some work on a large scale, highly modular Spring application. The application consists of multiple stand-alone Maven projects which are built separately. When compiling the whole application, these projects are pulled in as dependencies and overlaid onto the resulting 'super WAR' file.
The issue:
The build process (shortly) described in the preceding paragraph works well, but is very slow, even when all dependencies are already compiled and can be fetched from the local maven repository.
Some simple testing reveals that build-time of the 'super WAR' is cut in ~half when jar-compression is turned off entirely, at the cost of a comparatively small (~10%) increase in file size.
This is no surprise, really, as the build requires all the dependencies to be built/compressed and later decompressed, overlaid, and then compressed again (as a huge, unified war file).
Adding to this, a fair few of the "sub-projects" are pure web applications which contain no Java code needing compilation (or compression) at all (only static resources).
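For reference, turning compression off in a Maven build is usually done through the archiver configuration of the packaging plugin; a hedged sketch, assuming the standard maven-war-plugin is used:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-war-plugin</artifactId>
      <configuration>
        <archive>
          <!-- store entries uncompressed: larger archive, faster packaging -->
          <compress>false</compress>
        </archive>
      </configuration>
    </plugin>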
Questions:
What are the advantages of jar (war, really) compression, except for the (negligibly) reduced file size?
In the case of Java EE or Spring web applications, are there other (performance) issues introduced when turning off compression entirely? I'd think it has the potential to help both build time and JVM-startup.
Any suggestions on how to handle the build process of non-java applications with maven more efficiently are welcome as well. I've considered bundling them as resources, but am not sure how to achieve this while ensuring they are still buildable as stand-alone war files.
Besides the sometimes negligible reduction in the file size and the simplicity of having to manage only one file instead of an entire directory tree, there are still a few advantages:
Reduced copy time, as per this answer: https://superuser.com/a/360532/145340. I can also back this up from personal experience: copying or moving many small files is much slower than copying or moving a single file of the same total size.
Portability: The JAR file format is clearly defined, leaving no room for incompatible implementations.
Security: You can digitally sign the contents of a JAR file, ensuring the integrity and authenticity of the contents.
Package Sealing: Enforce version consistency, since all classes defined in a package must be found in the same JAR file.
Package Versioning: hold data such as vendor and version information.
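Sealing and version information are manifest attributes; with Maven they can be added through the jar plugin's archive configuration. A minimal sketch (plugin version omitted; the Sealed entry here applies to all packages in the jar):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <!-- write Implementation-Title/-Version/-Vendor from the POM -->
            <addDefaultImplementationEntries>true</addDefaultImplementationEntries>
          </manifest>
          <manifestEntries>
            <!-- seal the jar: each package's classes must all come from this jar -->
            <Sealed>true</Sealed>
          </manifestEntries>
        </archive>
      </configuration>
    </plugin>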

Is there any guideline of "What to share on Github" regarding RCP developments?

I recently started developing a plugin, which consists of several Eclipse plugin projects. I use Maven/Tycho as the build tool and GitHub as the version control system.
Now I was wondering what to push to my GitHub repositories. Should the POM files and the feature/update-site projects go in as well? It seems that this stuff is:
very user specific (the paths are tied to the file structure on my computer)
Also, do other developers need that stuff, or should I give them the freedom of choosing their own build tools?
To clarify, I have right now 6 Eclipse projects:
*.plugin1
*.plugin1.tests
*.plugin2
*.releng
*.feature
*.p2updatesite
Would it be good practice to share everything? From my feeling, I would say I would only share plugin1 + tests & item # 2 (without the pom files), so that everyone can take care of building themselves.
You don't have to add to your repo:
anything that can easily be regenerated
anything with local paths specific to a user
Regarding building, ideally you should version at least a script able to generate the right pom, in order for a new contributor to be able to get going as fast as possible after cloning it.
If you can have config files with only relative paths (instead of absolute ones), then you can include those for others to use.
See for instance ".classpath and .project - check into version control or not?".

Is Nexus the right place to archive releases (jar-with-dependencies, WAR files, tar.gz, zip, etc.)?

The way I understand it, Nexus is responsible for storing JAR files that reference other dependency JARs via their pom. And, in turn, the original JAR files can be used as dependencies as well.
However, should we store release artifacts in Nexus? These are files that will never be used as dependencies. They include jar-with-dependencies, WAR files, zip/tar.gz files, etc. What's the right place to store them?
A simple file system HTTP server like http://archive.apache.org/dist/ seems to be the right idea. But Nexus is indeed just a manager on top of that.
Nexus is definitely a good place to store these artifacts, since it has long evolved beyond a pure Maven repository server. It gives you a nice UI for download, security and much more.
If you are already using Nexus, I would definitely not waste time with yet another server or infrastructure component to store these artifacts, especially if you are building your artifacts with Maven: deployment then comes pretty much for free.
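A minimal sketch of what that looks like: declare the Nexus repositories in the pom's distributionManagement section (the URLs below are placeholders for your own Nexus host; credentials go into matching server entries in settings.xml), and then mvn deploy publishes the artifact as part of the release.

    <distributionManagement>
      <repository>
        <id>releases</id>
        <url>https://nexus.example.com/repository/maven-releases/</url>
      </repository>
      <snapshotRepository>
        <id>snapshots</id>
        <url>https://nexus.example.com/repository/maven-snapshots/</url>
      </snapshotRepository>
    </distributionManagement>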
I agree with Manser.
From my point of view, each time you release something, it should be in your Repository (Nexus, Archiva, Artifactory, whatever you want).
This is true for enterprise repositories, but especially true when you publish something on the Internet, to keep a built version ready to use (and not ready to build ;) — that's your SCM's job). Even if it's a jar-with-dependencies, it's a convenient way to distribute your release.
The only downside is that it takes up a lot of space in the repo, duplicating jars (for war, ear or jar-with-dependencies artifacts). But it is (at the moment) the only way to have a full version ready to execute.

Determining what minimal jars are needed for a feature

How do you determine what jars are needed for such and such feature of a framework? For example, what jars would be needed out of all those available for Spring in order to support only dependency injection?
There are tools that create minimal JARs by figuring out which classes are actually used in an application by statically analyzing the code, then creating a new JAR containing only those classes. (I recall using Zelix Classmaster to do this, but there are many alternatives.)
The problems with using these tools for a DI framework like Spring include:
The existing tools only trace static dependencies. If you dynamically load classes, you have to specifically tell the analyser about each one. DI frameworks in general, and Spring in particular, are replete with dynamic loading, including dynamic loading that is opaque to application code.
The existing tools work by creating a new output JAR, not by telling you which of the input JARs are not used. While repackaging the JARs is OK if you are creating a shrink-wrapped application from a closed-source codebase, it is undesirable in general, and potentially problematic with some open-source licenses. Certainly you don't want to do this with Spring.
In theory, someone could write a tool to help. In practice, the tool would need to (for example) know how to extract dynamic class dependencies from Spring configurations expressed in annotations, XML and from bean descriptors created at runtime from higher-order configuration (Spring Security does this, for example). That is a big ask. And even then you have the problem that a "small" change to the wirings made on the installation platform could fail due to a required JAR having been left out by the JAR-pruning process.
In my view, the more practical alternatives are:
If you use Maven / Ivy to manage your dependencies, look at the dependency graphs, strip out dependencies that appear to be no longer needed ... and test, test, test (see the sketch after this list for the Maven case).
Manually strip out JARs that appear to be unused ... and test, test, test.
Don't worry about it. A moderate level of unused JAR cruft might add a second or three to deployment and webapp startup times, but that generally doesn't matter. (But if it does ... see above.)
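For the Maven case, a hedged sketch: mvn dependency:tree prints the dependency graph, and the maven-dependency-plugin's analyze-only goal can be bound into the build to warn about declared-but-unused (and used-but-undeclared) dependencies. Note that it, too, only sees static references, so for a Spring application its warnings still need the "test, test, test" step.

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <executions>
        <execution>
          <id>analyze</id>
          <goals>
            <goal>analyze-only</goal>
          </goals>
          <configuration>
            <!-- report only; do not fail the build on warnings -->
            <failOnWarning>false</failOnWarning>
          </configuration>
        </execution>
      </executions>
    </plugin>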
This is why some older Java projects end up having 600 Jars and a 200 MB war file, for a 10,000 line application. Kind of a pain if you don't manage it carefully...
You should really ask the framework provider or read the documentation. Statically analyzing what jars are required might not be enough in some cases (dynamic loading), and sometimes you might end up with too many jars.
I once added some FTP helper code to a sort of "utility" library. It depended on an Apache FTP jar. If you never used the FTP features in the library you would not need the FTP jar, but static analysis of the code might say you need it. This is something you should document.
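In Maven terms, this is roughly what the optional flag on a dependency expresses: the utility library needs the FTP jar to compile, but consumers only have to pull it in if they actually use the FTP helpers. A sketch (commons-net is only an illustrative stand-in for "some Apache FTP jar"):

    <!-- in the utility library's pom.xml -->
    <dependency>
      <groupId>commons-net</groupId>
      <artifactId>commons-net</artifactId>
      <version>3.6</version>
      <optional>true</optional>
    </dependency>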

Do you follow any guidelines (java) in packaging?

Do you follow any design guidelines in Java packaging?
Is proper packaging part of the design skill? Are there any documents about it?
Edit: How should packages depend on each other? Are cyclic package dependencies unavoidable? This is not about jar or war files.
My approach that I try to follow normally looks like this:
Have packages of reasonable size. Fewer than 3 classes is strange. Fewer than 10 is good. More than 30 is not acceptable. I'm normally not very strict about this.
Don't have dependency cycles between packages. This one is tough since many developers have a hard time figuring out any way to keep the dependencies cycle free. BUT doing so teases out a lot of hidden structure in the code. It becomes easier to think about the structure of the code and easier to evolve it.
Define layers and modules and how they are represented in the code. Often I end up with something like <domain>.<application>.<module>.<layer>.<arbitrary substructure as needed> as the template for package names.
No cycles between layers; no cycles between modules.
In order to avoid cycles, one has to have checks. Many tools do that (JDepend, Sonar, ...). Unfortunately they don't help much with finding ways to fix cycles. That's why I started to work on Degraph, which should help with that by visualizing dependencies between classes, packages, modules and layers.
Packaging is normally about release management, and the general guidelines are:
consistency: when you are releasing several deliveries into an integration, pre-production or production environment, you want them organized (or "packaged") exactly the same way
small number of files: when you have to copy a set of files from one environment to another, you want as few of them as possible; if their number is reasonable (10-20 max per component to deliver), you can just copy them (even if those files are large)
So you want to define a common structure for each delivery like:
aDelivery/
lib // all jar, ear, war, ...
bin // all scripts used to launch your application: sh, bat, ant files, ...
config // all properties files, config files
src // all sources zipped into jars
docs // javadoc zipped
...
Plus, all those common directory structures should be stored in one common repository (a VCS, or a Maven repo, or ...), in order to be queried without having to rebuild them every time you need them (you do not need that if you have only one or two delivery components, but when you have 40 to 60 of them... a full rebuild is out of the question).
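A hedged sketch of how such a delivery layout can be produced with the maven-assembly-plugin; the source directories below are assumptions, so adapt them to wherever your scripts and config files actually live:

    <!-- src/assembly/delivery.xml: build a zip with lib/, bin/ and config/ -->
    <assembly>
      <id>delivery</id>
      <formats>
        <format>zip</format>
      </formats>
      <dependencySets>
        <dependencySet>
          <!-- all jars/wars the component depends on -->
          <outputDirectory>lib</outputDirectory>
        </dependencySet>
      </dependencySets>
      <fileSets>
        <fileSet>
          <directory>src/main/scripts</directory>
          <outputDirectory>bin</outputDirectory>
        </fileSet>
        <fileSet>
          <directory>src/main/config</directory>
          <outputDirectory>config</outputDirectory>
        </fileSet>
      </fileSets>
    </assembly>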
You can find a lot of information here:
What strategy do you use for package naming in Java projects and why?
The problem with packaging in Java is that it has very little relation to what you would like to do. For example, I like following the Eclipse convention of having packages marked internal, but then I can't define their classes with a "package" protection level.
