How to run the Mahout k-means algorithm in local mode - Java

Is it possible to run a Mahout k-means Java program locally, so that it reads the data from the local file system and saves it back there instead of HDFS?
All the examples on the internet work against HDFS.
https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch07/SimpleKMeansClustering.java

Yes, it is possible - check out SequenceFile.Writer (there is also a blog post that describes this in great detail). See the following code example, which writes clustered data points to a file:
public static void writePointsToFile(List<Vector> points,
                                     String fileName,
                                     FileSystem fs,
                                     Configuration conf) throws IOException {
    Path path = new Path(fileName);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
            path, LongWritable.class, VectorWritable.class);
    long recNum = 0;
    VectorWritable vec = new VectorWritable();
    for (Vector point : points) {
        vec.set(point);
        writer.append(new LongWritable(recNum++), vec);
    }
    writer.close();
}
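To keep everything on the local disk, you can hand this method a LocalFileSystem instead of an HDFS handle. A minimal sketch, assuming Mahout and Hadoop are on the classpath (the file names below are just placeholders):

Configuration conf = new Configuration();
// FileSystem.getLocal(conf) returns a LocalFileSystem, so every Path below
// resolves against the local disk rather than hdfs://.
FileSystem fs = FileSystem.getLocal(conf);

List<Vector> points = Arrays.<Vector>asList(
        new DenseVector(new double[] {1.0, 1.0}),
        new DenseVector(new double[] {2.0, 2.0}));

// Written to ./testdata/points/file1 on the local file system.
writePointsToFile(points, "testdata/points/file1", fs, conf);

Note that with a default Configuration (no core-site.xml on the classpath) FileSystem.get(conf) already resolves to file:///, but FileSystem.getLocal makes the intent explicit.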

Related

Check if file exists on remote HDFS from local spark-submit

I'm working on a Java program dedicated to working with Spark on an HDFS filesystem (located at HDFS_IP).
One of my goals is to check whether a file exists on HDFS at the path hdfs://HDFS_IP:HDFS_PORT/path/to/file.json. While debugging my program locally, I figured out that I can't access this remote file using the following code:
private boolean existsOnHDFS(String path) {
    Configuration conf = new Configuration();
    FileSystem fs;
    Boolean fileDoesExist = false;
    try {
        fs = FileSystem.get(conf);
        fileDoesExist = fs.exists(new Path(path));
    } catch (IOException e) {
        e.printStackTrace();
    }
    return fileDoesExist;
}
Actually, fs.exists tries to look for the file hdfs://HDFS_IP:HDFS_PORT/path/to/file.json on my local FS and not on HDFS. By the way, keeping the hdfs://HDFS_IP:HDFS_PORT prefix makes fs.exists crash, and removing it makes it answer false, because /path/to/file.json does not exist locally.
What would be the appropriate configuration of fs to get things to work properly both locally and when executing the Java program from a Hadoop cluster?
EDIT: I finally gave up and passed the bugfix to someone else in my team. Thanks to the people who tried to help me though !
The problem is that you are passing an empty Configuration to FileSystem.
You should create your FileSystem like this:
FileSystem.get(spark.sparkContext().hadoopConfiguration());
where spark is the SparkSession object.
As you can see in the code of FileSystem:
/**
 * Returns the configured filesystem implementation.
 * @param conf the configuration to use
 */
public static FileSystem get(Configuration conf) throws IOException {
    return get(getDefaultUri(conf), conf);
}

/** Get the default filesystem URI from a configuration.
 * @param conf the configuration to use
 * @return the uri of the default filesystem
 */
public static URI getDefaultUri(Configuration conf) {
    return URI.create(fixName(conf.get(FS_DEFAULT_NAME_KEY, DEFAULT_FS)));
}
It creates the URI based on the configuration passed as a parameter: it looks up the key FS_DEFAULT_NAME_KEY (fs.defaultFS), where DEFAULT_FS is:
public static final String FS_DEFAULT_NAME_DEFAULT = "file:///";
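In other words, with an empty Configuration the default filesystem is file:///, which is why the lookup lands on the local disk. A minimal sketch of pointing the check at the remote cluster instead (HDFS_IP and HDFS_PORT are the placeholders from the question):

// Either set the default filesystem explicitly ...
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://HDFS_IP:HDFS_PORT");
FileSystem fs = FileSystem.get(conf);

// ... or pass the URI directly and leave the default alone:
FileSystem remoteFs = FileSystem.get(
        URI.create("hdfs://HDFS_IP:HDFS_PORT"), new Configuration());

boolean exists = remoteFs.exists(new Path("/path/to/file.json"));

When the program runs on the cluster (or inside Spark, as in the answer above), the Hadoop configuration on the classpath already carries fs.defaultFS, so FileSystem.get(conf) resolves to HDFS without any hard-coded address.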

Find file path in Windows using Java

Is there a way to find a particular file path in Windows? For example, say I want to find the location of putty.exe on my machine: is there any way I can get the full file path?
I tried the following code using the Apache commons-io utility, but it takes a lot of time because there are a lot of files on the local disk.
File dir = new File("C:\\");
String[] extensions = new String[]{"exe"};
IOFileFilter filter = new SuffixFileFilter(extensions, IOCase.INSENSITIVE);
List<File> fileList = (List<File>) FileUtils.listFiles(dir, filter, DirectoryFileFilter.DIRECTORY);
System.out.println("file list size " + fileList.size());
for (File file : fileList) {
    if (file.getName().toLowerCase().contains("putty")) {
        System.out.println(file.getPath());
    }
}
Is there another faster way?
Edit: I want to find putty.exe in particular.
No program is ever expected to scan the entire disk looking for a file it needs.
Programs use one of the following techniques:
Look at the directories in the PATH environment variable when looking for an executable.
Require the user to provide the locations of the files at installation time and store them somewhere known. This could be a .ini file stored in the program's home directory, or, on Windows, the system registry or a User or System environment variable (which ends up in the registry as well).
The installer creates a launcher shell script that sets an application-specific environment variable which is read by the program.
There are probably several others I haven't thought of. The idea is to limit the places the program has to search.
When the executable was installed in a regular fashion, its directory is on the PATH variable, so you can search PATH for it:
public static Optional<Path> exePath(String exeName) {
    String pathVar = System.getenv("PATH");
    Pattern varPattern = Pattern.compile("%(\\w+)%");
    boolean tryVars = true;
    while (tryVars) {
        tryVars = false;
        Matcher m = varPattern.matcher(pathVar);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            tryVars = true;
            m.appendReplacement(sb, System.getenv(m.group(1)));
        }
        m.appendTail(sb);
        pathVar = sb.toString();
    }
    String[] dirs = pathVar.split("\\s*;\\s*");
    for (String dir : dirs) {
        Path path = Paths.get(dir, exeName);
        if (Files.exists(path)) {
            return Optional.of(path);
        }
    }
    return Optional.empty();
}

System.out.println(exePath("java.exe"));
System.out.println(exePath("java.exe"));
The %VAR% pattern substitution is probably already done automatically by the OS (%JAVA_HOME% and such), but the loop above handles any that remain.
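If shelling out is acceptable, Windows also ships a built-in where command that performs the same PATH search; a small sketch (not part of the original answer, just an alternative):

// Delegate the lookup to the Windows "where" command.
Process p = new ProcessBuilder("where", "putty.exe").start();
try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream()))) {
    String line;
    while ((line = r.readLine()) != null) {
        System.out.println(line);   // each line is one full path found on the PATH
    }
}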

file not adding to DistributedCache

I am running Hadoop on my local system, in the Eclipse environment.
I tried to put a local file from the workspace into the distributed cache in the driver function:
DistributedCache.addCacheFile(new Path(
        "/home/hduser/workspace/myDir/myFile").toUri(), conf);
but when I try to access it from the Mapper, it returns null.
Inside the mapper, I checked whether the file was cached:
System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
This prints "null", and
Path[] cacheFilesLocal = DistributedCache.getLocalCacheFiles(context.getConfiguration());
also returns null.
What's going wrong?
It's because you can only add files to the Distributed Cache from HDFS, not the local file system, so the Path doesn't exist. Put the file on HDFS and use the HDFS path to refer to it when adding it to the DistributedCache.
See DistributedCache for more information.
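A sketch of that approach, assuming fs is a FileSystem handle pointing at HDFS and /cache/myFile is just an illustrative target path:

// Copy the local file onto HDFS first, then cache the HDFS copy.
fs.copyFromLocalFile(new Path("/home/hduser/workspace/myDir/myFile"),
        new Path("/cache/myFile"));
DistributedCache.addCacheFile(new Path("/cache/myFile").toUri(), conf);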
Add the file:// scheme to the path when you add the cache file:
DistributedCache.addCacheFile(new Path("file:///home/hduser/workspace/myDir/myFile").toUri(), conf);
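On Hadoop 2.x the DistributedCache class is deprecated in favour of methods on Job and the task context; a hedged sketch of the equivalent calls (adjust to your Hadoop version):

// Driver: "job" is an org.apache.hadoop.mapreduce.Job
job.addCacheFile(new Path("file:///home/hduser/workspace/myDir/myFile").toUri());

// Mapper.setup(): the cached files come back as URIs
URI[] cacheFiles = context.getCacheFiles();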
Try this
DRIVER Class
Path p = new Path("your/file/path");   // placeholder path
FileStatus[] list = fs.globStatus(p);
for (FileStatus status : list) {
    /* Storing file to distributed cache */
    DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}
Mapper class
public void setup(Context context) throws IOException {
    /* Accessing data in file */
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
    /* Accessing 0th cached file */
    Path getPath = new Path(cacheFiles[0].getPath());
    /* Read data */
    BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
    String setupData = null;
    while ((setupData = bf.readLine()) != null) {
        /* Print file content */
        System.out.println("Setup Line " + setupData);
    }
    bf.close();
}

public void map() {
}

Append data to existing file in HDFS Java

I'm having trouble appending data to an existing file in HDFS. I want to append a line if the file exists, and if not, create a new file with the given name.
Here's my method to write into HDFS.
if (!file.exists(path)) {
    file.createNewFile(path);
}
FSDataOutputStream fileOutputStream = file.append(path);
BufferedWriter br = new BufferedWriter(new OutputStreamWriter(fileOutputStream));
br.append("Content: " + content + "\n");
br.close();
Actually, this method writes into HDFS and creates a file, but as I mentioned, it is not appending.
This is how I test my method:
RunTimeCalculationHdfsWrite.hdfsWriteFile("RunTimeParserLoaderMapperTest2", "Error message test 2.2", context, null);
The first param is the name of the file, the second is the message, and the other two params are not important.
So anyone have an idea what I'm missing or doing wrong?
Actually, you can append to an HDFS file:
From the perspective of the client, an append operation first calls append on DistributedFileSystem; this operation returns a stream object FSDataOutputStream out. If the client needs to append data to this file, it can call out.write to write and out.close to close.
I checked the HDFS sources; there is a DistributedFileSystem#append method:
FSDataOutputStream append(Path f, final int bufferSize, final Progressable progress) throws IOException
For details, see the presentation.
Also you can append through command line:
hdfs dfs -appendToFile <localsrc> ... <dst>
Add lines directly from stdin:
echo "Line-to-add" | hdfs dfs -appendToFile - <dst>
Solved..!!
Append is supported in HDFS.
You just have to do some configuration and write simple code as shown below:
Step 1: Set dfs.support.append to true in hdfs-site.xml:
<property>
    <name>dfs.support.append</name>
    <value>true</value>
</property>
Stop all your daemon services using stop-all.sh and restart them using start-all.sh.
Step 2 (optional): Only if you have a single-node cluster do you have to set the replication factor to 1, as below:
Through the command line:
./hdfs dfs -setrep -R 1 filepath/directory
Or you can do the same at run time through Java code:
fileSystem.setReplication(new Path(filePath), (short) 1);
Step 3: Code for creating/appending data into the file:
public void createAppendHDFS() throws IOException {
    Configuration hadoopConfig = new Configuration();
    hadoopConfig.set("fs.defaultFS", hdfsuri);   // hdfsuri, e.g. "hdfs://namenode:8020"
    FileSystem fileSystem = FileSystem.get(hadoopConfig);
    String filePath = "/test/doc.txt";
    Path hdfsPath = new Path(filePath);
    fileSystem.setReplication(hdfsPath, (short) 1);   // optional, single-node cluster
    FSDataOutputStream fileOutputStream = null;
    try {
        if (fileSystem.exists(hdfsPath)) {
            fileOutputStream = fileSystem.append(hdfsPath);
            fileOutputStream.writeBytes("appending into file. \n");
        } else {
            fileOutputStream = fileSystem.create(hdfsPath);
            fileOutputStream.writeBytes("creating and writing into file\n");
        }
    } finally {
        // Close the stream before closing the filesystem it was opened on.
        if (fileOutputStream != null) {
            fileOutputStream.close();
        }
        if (fileSystem != null) {
            fileSystem.close();
        }
    }
}
Kindly let me know for any other help.
Cheers.!!
HDFS does not allow append operations. One way to implement the same functionality as appending is:
Check if the file exists.
If the file doesn't exist, create a new file & write to it.
If the file exists, create a temporary file.
Read each line from the original file & write that same line to the temporary file (don't forget the newline).
Write the lines you want to append to the temporary file.
Finally, delete the original file & move (rename) the temporary file to the original file name (see the sketch below).
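A minimal sketch of that workaround with the FileSystem API, assuming fs is an already initialized org.apache.hadoop.fs.FileSystem and the paths are placeholders:

Path original = new Path("/data/log.txt");
Path temp = new Path("/data/log.txt.tmp");

FSDataOutputStream out = fs.create(temp, true);
if (fs.exists(original)) {
    FSDataInputStream in = fs.open(original);
    IOUtils.copyBytes(in, out, 4096, false);   // copy the existing content
    in.close();
}
out.writeBytes("line to append\n");            // then the new line(s)
out.close();

fs.delete(original, false);                    // replace the original
fs.rename(temp, original);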

How to create a copy of a file in the same directory in java?

I want all the features of File.renameTo in Java, but without the source file getting deleted.
For eg:
Say I have a file report.doc and I want to create the file report.xml without report.doc getting deleted. Also, the contents of both the files should be the same. (A simple copy)
How do I go about doing this?
I know this might be trivial but some basic searching didn't help.
For filesystem operations, Apache Commons IO provides useful shortcuts.
See:
FileUtils
FileUtils#copyFile
You can create a new File with the same content as the original.
Use Java NIO (Java 1.4 or later) for that:
private static void copy(File source, File destination) throws IOException {
    long length = source.length();
    FileChannel input = new FileInputStream(source).getChannel();
    try {
        FileChannel output = new FileOutputStream(destination).getChannel();
        try {
            for (long position = 0; position < length; ) {
                position += input.transferTo(position, length - position, output);
            }
        } finally {
            output.close();
        }
    } finally {
        input.close();
    }
}
See the answers to this question for more.
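On Java 7 and later there is also a one-call alternative in the standard library (java.nio.file); a small sketch using the file names from the question:

// Copies report.doc to report.xml in the same directory, overwriting if present.
Path source = Paths.get("report.doc");
Path target = Paths.get("report.xml");
Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);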
