I'm running Spark 3.3.0 on Windows 10 using Java 11. I'm not using Hadoop. Every time I run something, it gives errors like this:
java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:735)
at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:270)
at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:286)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978)
First of all, even the link https://wiki.apache.org/hadoop/WindowsProblems in the error message is broken. The updated link is apparently https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems, which basically says that Hadoop needs Winutils. But I'm not using Hadoop. I'm just using Spark to process some CSV files locally.
Secondly, I want my project to build with Maven and run with pure Java, without requiring the user to install some third-party software. If this Winutils stuff needs to be installed, it should be included in some Maven dependency.
Why is all this Hadoop/Winutils stuff needed if I'm not using Hadoop, and how do I get around it so that my project will build in Maven and run with pure Java like a Java project should?
TL;DR
I have created a local implementation of Hadoop FileSystem that bypasses Winutils on Windows (and indeed should work on any Java platform). The GlobalMentor Hadoop Bare Naked Local FileSystem source code is available on GitHub and can be specified as a dependency from Maven Central.
If you have an application that needs Hadoop local FileSystem support without relying on Winutils, import the latest com.globalmentor:hadoop-bare-naked-local-fs library into your project, e.g. in Maven for v0.1.0:
<dependency>
  <groupId>com.globalmentor</groupId>
  <artifactId>hadoop-bare-naked-local-fs</artifactId>
  <version>0.1.0</version>
</dependency>
Then specify that you want to use the Bare Local File System implementation com.globalmentor.apache.hadoop.fs.BareLocalFileSystem for the file scheme. (BareLocalFileSystem internally uses NakedLocalFileSystem.) The following example does this for Spark in Java:
SparkSession spark = SparkSession.builder().appName("Foo Bar").master("local").getOrCreate();
spark.sparkContext().hadoopConfiguration().setClass("fs.file.impl", BareLocalFileSystem.class, FileSystem.class);
Note that you may still get warnings that "HADOOP_HOME and hadoop.home.dir are unset" and "Did not find winutils.exe". This is because the Winutils kludge permeates the Hadoop code and is hard-coded at a low level, executed statically upon class loading, even for code completely unrelated to file access. More explanation can be found on the project page on GitHub. (See also HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.)
How Spark uses Hadoop FileSystem
Spark uses the Hadoop FileSystem API as a means for writing output to disk, e.g. for local CSV or JSON output. It pulls in the entire Hadoop client libraries (currently org.apache.hadoop:hadoop-client-api:3.3.2), which contain various FileSystem implementations. These implementations use the Java service loader framework to automatically register themselves for several schemes, including among others:
org.apache.hadoop.fs.LocalFileSystem
org.apache.hadoop.fs.viewfs.ViewFileSystem
org.apache.hadoop.fs.http.HttpFileSystem
org.apache.hadoop.fs.http.HttpsFileSystem
org.apache.hadoop.hdfs.DistributedFileSystem
…
Each of these file systems indicates which scheme it supports. In particular org.apache.hadoop.fs.LocalFileSystem indicates it supports the file scheme, and it is used by default to access the local file system. It in turn uses the org.apache.hadoop.fs.RawLocalFileSystem internally, which is the FileSystem implementation ultimately responsible for requiring Winutils.
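For illustration, here is a minimal sketch (assuming the Hadoop client libraries are on the classpath) showing which implementation Hadoop resolves for the file scheme under the default configuration:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// With no override, the service-loaded implementation for the "file" scheme is used;
// under the defaults this prints org.apache.hadoop.fs.LocalFileSystem.
FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
System.out.println(fs.getClass().getName());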
But it is possible to override the Hadoop configuration and specify another FileSystem implementation. Spark creates a special Configuration for Hadoop in org.apache.spark.sql.internal.SessionState.newHadoopConf(…) ultimately combining all the sources core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, and __spark_hadoop_conf__.xml, if any are present. Then Hadoop's FileSystem.getFileSystemClass(String scheme, Configuration conf) looks for the FileSystem implementation to use by looking up a configuration for the scheme (in this case file) in the form fs.${scheme}.impl (i.e. fs.file.impl in this case).
Thus if you want to specify another local file system implementation to use, you'll need to somehow get fs.file.impl into the configuration. Rather than creating a local configuration file if you are accessing Spark programmatically, you can set it via the Spark session, as explained in the introduction.
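For example, the same override can be passed through Spark's spark.hadoop.* configuration prefix, which Spark copies into the Hadoop Configuration it builds; this sketch is equivalent to the session-based example in the introduction:

SparkSession spark = SparkSession.builder()
    .appName("Foo Bar")
    .master("local")
    // Any spark.hadoop.* property is copied into the Hadoop Configuration,
    // so this ends up as fs.file.impl for the "file" scheme.
    .config("spark.hadoop.fs.file.impl", BareLocalFileSystem.class.getName())
    .getOrCreate();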
Why Winutils
The Hadoop FileSystem API in large part assumes a *nix file system. The current Hadoop local FileSystem implementation uses native *nix libraries or opens shell processes and directly runs *nix commands. The current local FileSystem implementation for Windows limps along with a huge kludge: a set of binary artifacts called Winutils that a Hadoop contributor created, providing a special back-door subsystem on Windows that Hadoop can access instead of *nix libraries and shell commands. (See HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.)
However, detection and required support of Winutils is actually hard-coded in Hadoop at a low level, even in code that has nothing to do with the file system! For example, when Spark starts up, even a simple Configuration initialization in the Hadoop code invokes StringUtils.equalsIgnoreCase("true", valueString), and the StringUtils class has a static reference to Shell, which has a static initialization block that looks for Winutils and produces a warning if not found. 🤦‍♂️ (In fact this is the source of the warnings that were the motivation for this Stack Overflow question in the first place.)
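You can see this with a trivial sketch that never touches the file system (the property name here is made up); on Windows without HADOOP_HOME set, the Winutils warning is logged anyway:

// No FileSystem or file access at all; loading Shell (reached statically via StringUtils
// during Configuration processing) is already enough to log the Winutils warning.
Configuration conf = new Configuration();
conf.getBoolean("some.flag", false); // internally calls StringUtils.equalsIgnoreCase("true", …)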
Workaround to use FileSystem without Winutils
Irrespective of the warnings, the larger issue is getting FileSystem to work without needing Winutils. This is paradoxically both a simpler and a much more complex project than it would first appear. On one hand it is not too difficult to use updated Java API calls instead of Winutils to access the local file system; I have done that already in the GlobalMentor Hadoop Bare Naked Local FileSystem. But weeding out Winutils completely is much more complex and difficult. The current LocalFileSystem and RawLocalFileSystem implementations have evolved haphazardly, with halfway-implemented features scattered about, special-case code for ill-documented corner cases, and implementation-specific assumptions permeating the design itself.
The example was already given above of Configuration accessing Shell and trying to pull in Winutils just upon class loading during startup. At the FileSystem level, Winutils-related logic isn't contained to RawLocalFileSystem, which would have allowed it to be easily overridden; instead it relies on the static FileUtil class, which is like a separate file system implementation that relies on Winutils and can't be modified. For example, here is FileUtil code that would need to be updated, unfortunately independently of the FileSystem implementation:
public static String readLink(File f) {
  /* NB: Use readSymbolicLink in java.nio.file.Path once available. Could
   * use getCanonicalPath in File to get the target of the symlink but that
   * does not indicate if the given path refers to a symlink.
   */
  …
  try {
    return Shell.execCommand(
        Shell.getReadlinkCommand(f.toString())).trim();
  } catch (IOException x) {
    return "";
  }
}
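For comparison, here is a rough sketch (not Hadoop's code, and not necessarily how the Bare Naked Local FileSystem implements it) of the same operation done with plain java.nio.file, no shell process required:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public static String readLink(File f) {
  try {
    Path path = f.toPath();
    if (!Files.isSymbolicLink(path)) {
      return ""; // mirror Hadoop's convention of returning "" when there is nothing to read
    }
    return Files.readSymbolicLink(path).toString();
  } catch (IOException x) {
    return "";
  }
}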
Apparently there is a "new Stat based implementation" of many methods, but RawLocalFileSystem instead uses deprecated implementations such as DeprecatedRawLocalFileStatus, which is full of workarounds and special cases, is package-private so it can't be accessed by subclasses, yet can't be removed because of HADOOP-9652. The useDeprecatedFileStatus switch is hard-coded so that it can't be modified by a subclass, forcing a re-implementation of everything it touches. In other words, even the new, less-kludgey approach is switched off in the code, has been for years, and no one seems to be paying it any mind.
Summary
In summary, Winutils is hard-coded at a low-level throughout the code, even in logic unrelated to file access, and the current implementation is a hodge-podge of deprecated and undeprecated code switched on or off by hard-coded flags that were put in place when errors appeared with new changes. It's a mess, and it's been that way for years. No one really cares, and instead keeps building on unstable sand (ViewFs anyone?) rather than going back and fixing the foundation. If Hadoop can't even fix large swathes of deprecated file access code consolidated in one place, do you think they are going to fix the Winutils kludge that permeates multiple classes at a low level?
I'm not holding my breath. Instead I'll be content with the workaround I've written which writes to the file system via the Java API, bypassing Winutils for the most common use case (writing to the local file system without using symlinks and without the need for the sticky bit permission), which is sufficient to get Spark accessing the local file system on Windows.
There is a longstanding JIRA for this... for anyone running Spark standalone on a laptop there's no need to provide those POSIX permissions, is there?
LocalFS to support ability to disable permission get/set; remove need for winutils
This is related to HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.
It is only people running Spark on Windows who hit this problem, and nobody is putting in the work to fix it. If someone were, I would help review and nurture it in.
Spark is a replacement execution framework for MapReduce, not a "Hadoop replacement".
Spark uses the Hadoop libraries for FileSystem access, including the local file system, as shown by org.apache.hadoop.fs.RawLocalFileSystem in your error.
It also uses Winutils as a sort of shim to implement Unix (POSIX?) chown/chmod commands for determining file permissions on top of Windows directories.
"tell Spark to use a different file system implementation than RawLocalFileSystem?"
Yes: use a different URI scheme than the default file://.
E.g. spark.read().csv("nfs://path/file.csv")
Or use s3a, or install HDFS, GlusterFS, etc. for a distributed file system. After all, Spark is meant to be a distributed processing engine; if you're only handling small local files, it's not the best tool.
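For instance, reading from an S3-compatible store through the s3a connector looks roughly like this (the bucket and path are placeholders; you also need the hadoop-aws module on the classpath and credentials configured):

Dataset<Row> df = spark.read()
    .option("header", "true") // example option
    .csv("s3a://my-bucket/path/file.csv");
df.show();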
How do I resolve integrated files with the Perforce Java API? I am trying to merge files from one branch to another. I am using the following code to merge:
tmpClient.integrateFiles(file, toFile, "17.2", opts );
After executing this command, the files need to be resolved. How can I do that?
Look at IClient.resolveFile() and IClient.resolveFilesAuto():
https://www.perforce.com/perforce/r15.1/manuals/p4java-javadoc/com/perforce/p4java/client/IClient.html
The specific method you use will depend on whether you expect to always be able to rely on auto-resolve (generally a bad expectation) and on what sort of system your app has for handling conflicting merges.
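As a rough sketch of the auto-resolve path, reusing tmpClient, file, toFile, and opts from the question (class and option names are from the p4java 15.x API as I recall them; double-check against the Javadoc above):

import java.util.List;
import com.perforce.p4java.core.file.IFileSpec;
import com.perforce.p4java.option.client.ResolveFilesAutoOptions;

// Integrate, then auto-resolve whatever the integrate scheduled for resolve.
List<IFileSpec> integrated = tmpClient.integrateFiles(file, toFile, "17.2", opts);
ResolveFilesAutoOptions resolveOpts = new ResolveFilesAutoOptions()
        .setSafeMerge(true); // only accept merges that have no conflicts
List<IFileSpec> resolved = tmpClient.resolveFilesAuto(integrated, resolveOpts);
// Anything still unresolved afterwards needs IClient.resolveFile() with an explicit merge strategy.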
I developed a transformation that generates a Kermeta XMI file.
My problem is that I would like to run this transformation (.kmt file) in the background, that is to say from a Java program.
I have tried a lot of code, but always without result.
Can someone help me?
Thank you in advance for your help.
Running a Kermeta program depends on the version of Kermeta and on its execution context.
With Kermeta 2, the kmt code is compiled into Scala. This Scala code can be called directly from Java thanks to Scala's compatibility with Java.
You first need to declare at least one main operation in order to generate the appropriate initialization operation.
Let's say you have a method "transformFooToBar" in a class "org::example::FooToBar" in your kermeta code.
Declare it as a main entry point using the tags
#mainClass "org::example::FooToBar"
#mainOperation "transformFooToBar"
This main operation must have only String parameters.
This will generate a utility Scala class org.example.MainRunner which contains useful initialization methods and a main operation that you can call from the command line.
Case 1/ Kermeta 2 code called from an Eclipse plugin:
This is the simplest case: call the method org.example.MainRunner.init4eclipse()
before any other call. Then call your main method org.example.KerRichFactory.createFooToBar.transformFooToBar()
You can call any other classes or methods (i.e. even with parameters that aren't Strings).
This is quite convenient for building Eclipse-based tools on top of a Kermeta transformation.
Case 2/ Kermeta 2 code called from a standard Java application (i.e. not running in an Eclipse plugin):
The initialization method is then org.example.MainRunner.init()
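Putting case 2 together, the standalone call looks roughly like this (a sketch based on the generated class names described above; the exact factory call, the parameters, and the Java syntax for invoking the generated Scala code depend on your own transformation):

// Standalone (non-Eclipse) usage, assuming the #mainClass/#mainOperation tags shown above:
org.example.MainRunner.init(); // initialize the Kermeta 2 runtime outside Eclipse
org.example.KerRichFactory.createFooToBar()
        .transformFooToBar("file:/path/to/input.xmi", "file:/path/to/output.xmi"); // hypothetical String parameters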
A common trap for case 2: many transformations that run standalone still need to manipulate models using Eclipse URI schemes in their internal referencing system, i.e. platform:/plugin/..., platform:/resource/..., pathmap:/..., or even more complex URI mappings (typically using custom protocols). (You can check that easily by looking at the XMI files as text.)
In that case, since the Eclipse platform isn't running to provide the lookup mechanism, you need to manually provide the equivalent URI mapping to map these URIs to your local system URIs (i.e. to file:/... or jar:file:/... URIs).
One possibility is to use a urimap.properties file that provides such a mapping. By default, when running a Kermeta program in Eclipse, a urimap.properties file is generated for the current Eclipse configuration.
When deploying a Kermeta program to another computer, or using custom deployment packaging, you will have to provide or compute an equivalent file for the target computer.
For convenience, you can set the location of this urimap.properties file via the system property "urimap.file.location".
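For example (the path below is just a placeholder for wherever you ship the mapping file; set it before the runtime initializes, e.g. before calling MainRunner.init()):

System.setProperty("urimap.file.location", "C:/myapp/conf/urimap.properties");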
Makefile that compiles all Java files.
The way I have done this multiple times in the past is to generate a Java file depending on the flag. If you are using Ant, then this code generation is very simple. Otherwise, you can use a template file with a placeholder and do some shell scripting or similar to generate the file.
In Ant you can use the replace task to modify files as part of your build.
We do this in our builds, but we use it to modify a Java .properties file which the application reads for its configurable behavior.
I've written rather pleasant flag-controlled systems using a combination of Google Guice and Apache Commons CLI to inject flag-controlled variables into constructors.
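As a sketch of that pattern (the flag and class names here are made up; it assumes Guice and Apache Commons CLI are on the classpath):

import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;
import com.google.inject.name.Named;
import com.google.inject.name.Names;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Options;

public class FlagDemo {
  // A component whose behavior is controlled by an injected flag value.
  static class Worker {
    private final boolean verbose;

    @Inject
    Worker(@Named("verbose") boolean verbose) {
      this.verbose = verbose;
    }

    void run() {
      if (verbose) {
        System.out.println("Running in verbose mode");
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Parse the command-line flag with Commons CLI...
    Options options = new Options();
    options.addOption("v", "verbose", false, "enable verbose output");
    CommandLine cmd = new DefaultParser().parse(options, args);
    boolean verbose = cmd.hasOption("v");

    // ...and bind it as a constant so Guice can inject it into constructors.
    Injector injector = Guice.createInjector(new AbstractModule() {
      @Override
      protected void configure() {
        bindConstant().annotatedWith(Names.named("verbose")).to(verbose);
      }
    });
    injector.getInstance(Worker.class).run();
  }
}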
I have a web application using Java Servlets in which the user can upload files. What can I do to prevent malicious files and viruses from being uploaded?
The ClamAV antivirus team provides a very easy interface for integrating the clamd daemon into your own programs. It is socket-based instead of API-based, so you might need to write some convenience wrappers to make it look "natural" in your code, but the end result is that they do not need to maintain a dozen or more language bindings.
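For instance, a bare-bones client for clamd's INSTREAM command might look something like this (a sketch based on the documented clamd socket protocol; host, port, chunking, and error handling are simplified):

import java.io.DataOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ClamdScanner {
  /** Streams the given bytes to clamd via INSTREAM and returns clamd's reply line. */
  public static String scan(byte[] data, String host, int port) throws Exception {
    try (Socket socket = new Socket(host, port);
         OutputStream rawOut = socket.getOutputStream();
         DataOutputStream out = new DataOutputStream(rawOut);
         InputStream in = socket.getInputStream()) {
      // Null-terminated command (the "z" prefix asks clamd for null-terminated replies).
      out.write("zINSTREAM\0".getBytes(StandardCharsets.US_ASCII));
      // Each chunk: 4-byte big-endian length followed by the data; a zero-length chunk ends the stream.
      out.writeInt(data.length);
      out.write(data);
      out.writeInt(0);
      out.flush();
      // Reply looks like "stream: OK" (clean) or "stream: Some-Signature FOUND".
      byte[] reply = in.readAllBytes();
      return new String(reply, StandardCharsets.US_ASCII).trim();
    }
  }
}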
Alternatively, if you have enough access to the machine in question, you could simply call a command-line application to do the scanning. There is plenty of info on starting command-line applications, and most if not all locally installed virus scanners have a command-line interface. This has the advantage that not every IP packet has to pass through the scanner (but you will have to read and parse the output of the virus scanner). It also makes sure you have the info available in your Java application so you can warn the user.
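For example, invoking the ClamAV command-line scanner from Java (uploadedFile is assumed to be the java.io.File you saved the upload to; the exit-code meanings are ClamAV's documented ones, so verify them for whatever scanner you use):

// Scan one uploaded file with clamscan; exit code 0 = clean, 1 = virus found, 2 = scanner error.
ProcessBuilder pb = new ProcessBuilder("clamscan", "--no-summary", uploadedFile.getAbsolutePath());
pb.redirectErrorStream(true);
Process process = pb.start();
String output = new String(process.getInputStream().readAllBytes(), java.nio.charset.StandardCharsets.UTF_8);
int exitCode = process.waitFor();
if (exitCode == 1) {
    // reject the upload and warn the user; 'output' contains the detected signature name
}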
You also need to protect against path traversal (making sure users cannot upload files to a place where they do not belong, such as overwriting a JAR file on the classpath or a DLL on the path).