BlockMissingException while remotely reading HDFS file from Java in Hadoop 2 - java

I'm using Hadoop 2.6, and I have a cluster of virtual machines where I installed HDFS. I'm trying to remotely read a file from HDFS through some Java code running on my local machine, in the basic way, with a BufferedReader:
FileSystem fs = null;
String hadoopLocalPath = "/path/to/my/hadoop/local/folder/etc/hadoop";
Configuration hConf = new Configuration();
hConf.addResource(new Path(hadoopLocalPath + File.separator + "core-site.xml"));
hConf.addResource(new Path(hadoopLocalPath + File.separator + "hdfs-site.xml"));
try {
    fs = FileSystem.get(URI.create("hdfs://10.0.0.1:54310/"), hConf);
} catch (IOException e1) {
    e1.printStackTrace();
    System.exit(-1);
}
Path startPath = new Path("/user/myuser/path/to/my/file.txt");
FileStatus[] fileStatus;
try {
    fileStatus = fs.listStatus(startPath);
    Path[] paths = FileUtil.stat2Paths(fileStatus);
    for (Path path : paths) {
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        br.close();
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
The program can access HDFS correctly (no exceptions are raised). If I list files and directories via code, it can read them without problems.
Now, the issue is that if I try to read a file (as in the code shown), it gets stuck in the while loop until it raises the BlockMissingException:
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2005327120-10.1.1.55-1467731650291:blk_1073741836_1015 file=/user/myuser/path/to/my/file.txt
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:888)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:568)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
at java.io.DataInputStream.read(DataInputStream.java:149)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at uk.ou.kmi.med.datoolkit.tests.access.HDFSAccessTest.main(HDFSAccessTest.java:55)
What I already know:
I tried the same code directly on the machine running the namenode, and it works perfectly
I already checked the log of the namenode, and added the user of my local machine to the group managing the HDFS (as suggested by this thread, and other related threads)
There should not be issues with fully-qualified domain names, as suggested by this thread, since I'm using static IPs. On the other hand, "your cluster runs in a VM and its virtualized network access to the client is blocked" could be an option; but if that were the case, I would expect not to be able to do any action on HDFS at all (see next point)
The cluster runs on a network with a firewall, and I have correctly opened and forwarded port 54310 (I can access HDFS for other purposes, such as creating files and directories and listing their content). I wonder whether there are other ports that need to be open for reading files

Can you make sure that the DataNodes are also accessible from the client? I had a similar issue when connecting to Hadoop configured in AWS. I was able to resolve the issue by confirming connectivity between all DataNodes and my client system.
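That would match the symptoms here: listing and metadata operations only talk to the NameNode on port 54310, while actual block data is read directly from the DataNodes on their data-transfer port (50010 by default in Hadoop 2.x, set by dfs.datanode.address). A minimal sketch to check reachability from the client; the DataNode address below is a placeholder, not taken from the question:
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class DataNodeReachabilityCheck {
    public static void main(String[] args) {
        // Placeholder address: use the DataNode hosts reported by the NameNode.
        // 50010 is the Hadoop 2.x default for dfs.datanode.address (block data transfer).
        String dataNodeHost = "10.0.0.2";
        int dataTransferPort = 50010;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(dataNodeHost, dataTransferPort), 5000);
            System.out.println("DataNode reachable at " + dataNodeHost + ":" + dataTransferPort);
        } catch (IOException e) {
            System.out.println("DataNode NOT reachable: " + e.getMessage());
        }
    }
}
If this check fails for any DataNode holding a block of the file, the client will eventually give up with the BlockMissingException shown above.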

Related

Flink job submission throws java.nio.file.NoSuchFileException while the file actually exists

I tried to submit a Flink job that is already packaged in a JAR. Basically it consumes a Kafka topic protected by SASL authentication, so it requires a .jks file, which I already include in the JAR and read in the code as:
try (InputStream resourceStream = loader.getResourceAsStream(configFile)) {
    properties.load(resourceStream);
    properties.setProperty("ssl.truststore.location",
            loader.getResource(properties.getProperty("ssl.truststore.location")).toURI().getPath());
} catch (Exception e) {
    System.out.println("Failed to load config");
}
I tried to submit the job on two different standalone servers (with different VM specs) for the sake of testing. One server runs successfully, but the other throws a java.nio.file.NoSuchFileException, saying that my .jks file is not found. Can someone please point out the possible issue?
Here, Flink is deployed in standalone cluster mode with the following versions:
Flink version: 1.14.0
Java version: 11.0.13
I realize my question was really silly. This part actually returns null and triggers the exception:
loader.getResource(properties.getProperty("ssl.truststore.location")).toURI().getPath()
The problem was that I submitted the job through the web UI, so I couldn't see the printed message. The filename therefore resolves to the original value stored in configFile, which is a relative path. Why does one machine work and the other not? Because I previously happened to have the .jks in my home directory from another test :).
For others not to fall into this mistake, here is a summary of what .getResource() resolves to when run from the IDE (gradle run task) and from a JAR, respectively.
// file:home/gradle-demo/build/resources/main/kafka-client.truststore.jks
// jar:file:home/gradle-demo/build/libs/gradle-demo-1.0-SNAPSHOT.jar!/kafka-client.truststore.jks
System.out.println(loader.getResource("kafka-client.truststore.jks").toString());

// home/gradle-demo/build/resources/main/kafka-client.truststore.jks
// file:home/gradle-demo/build/libs/gradle-demo-1.0-SNAPSHOT.jar!/kafka-client.truststore.jks
System.out.println(loader.getResource("kafka-client.truststore.jks").getPath());

// home/gradle-demo/build/resources/main/kafka-client.truststore.jks
// null
System.out.println(loader.getResource("kafka-client.truststore.jks").toURI().getPath());

// file:home/gradle-demo/build/resources/main/kafka-client.truststore.jks
// jar:file:home/gradle-demo/build/libs/gradle-demo-1.0-SNAPSHOT.jar!/kafka-client.truststore.jks
System.out.println(loader.getResource("kafka-client.truststore.jks").toURI());
For reference, kafka-client 2.4.1 loads the truststore in org.apache.kafka.common.security.ssl.SslEngineBuilder (around line 285):
try (InputStream in = Files.newInputStream(Paths.get(path))) {
    KeyStore ks = KeyStore.getInstance(type);
    // If a password is not set access to the truststore is still available, but integrity checking is disabled.
    char[] passwordChars = password != null ? password.value().toCharArray() : null;
    ks.load(in, passwordChars);
    return ks;
} catch (GeneralSecurityException | IOException e) {
    throw new KafkaException("Failed to load SSL keystore " + path + " of type " + type, e);
}
It looks like we should put the .jks file on a file system (NFS or HDFS) that the task manager can access via an absolute path.
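If changing the deployment layout is not an option, a common alternative (not from the original answer; a minimal sketch reusing the loader and properties variables from the question's snippet) is to copy the truststore out of the JAR into a temporary file at startup and point Kafka at that absolute path, e.g. inside the question's try block:
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Extract the bundled truststore to a temp file so kafka-clients can open it by
// absolute path, regardless of whether the resource lives in a directory or inside the JAR.
try (InputStream trustStore =
         loader.getResourceAsStream(properties.getProperty("ssl.truststore.location"))) {
    Path tmp = Files.createTempFile("kafka-client-truststore", ".jks");
    tmp.toFile().deleteOnExit();
    Files.copy(trustStore, tmp, StandardCopyOption.REPLACE_EXISTING);
    properties.setProperty("ssl.truststore.location", tmp.toAbsolutePath().toString());
}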

In Java, is there a way to tell on which physical computer a file resides?

I have an Eclipse RCP product that is run by multiple people at our company. All PCs are running some version of Windows. We have access to a shared PC which different people have mapped to different drive letters. That means the same file may be referred to in many different ways depending on the PC on which the program is run, e.g.
\communalPC\Shared\foo.txt
Y:\Shared\foo.txt
Z:\Shared\foo.txt
I want to programmatically check whether an arbitrary file is on the communal PC. Is there a robust way to do this in Java?
Our current solution below is a bit of a hack. It is not robust because people map to different drive letters, change drive letters, it is not portable, etc.
private static boolean isOnCommunalPc(File file) {
    // The path is lower-cased, so the prefixes must be lower-case as well.
    String path = file.getAbsolutePath().toLowerCase();
    if (path.startsWith("\\\\communalpc")) {
        return true;
    }
    if (path.startsWith("y:")) {
        return true;
    }
    if (path.startsWith("z:")) {
        return true;
    }
    return false;
}
Java cannot tell which machine the file is on, as Windows abstracts that layer away from the JVM. You can, however, be explicit with your connection.
Is there a reason why you couldn't have an FTP or HTTP server (or even a custom Java server!) on the communal PC, and access it via a hostname or an IP? That way it doesn't matter where the user has mapped the network drive; you connect via a static address.
Accessing a remote file in Java is as easy as:
// hostName should include the protocol, e.g. "http://communal-pc".
URL remoteUrl = new URL(String.format("%s/%s", hostName, fileName));
InputStream remoteInputStream = remoteUrl.openConnection().getInputStream();
//copyStreamToFile(remoteInputStream, new File(destinationPath), false);
If you need the file to be local for a library or code you would prefer not to change, you could:
void copyStreamToFile(InputStream in, File outputFile, boolean doDeleteOnExit) throws IOException {
    // Clean up the file after VM exit, if needed.
    if (doDeleteOnExit)
        outputFile.deleteOnExit();
    FileOutputStream outputStream = new FileOutputStream(outputFile);
    ReadableByteChannel inputChannel = Channels.newChannel(in);
    WritableByteChannel outputChannel = Channels.newChannel(outputStream);
    ChannelTools.fastChannelCopy(inputChannel, outputChannel);
    inputChannel.close();
    outputChannel.close();
}
EDIT: Accessing a remote file via Samba with JCIFS is as easy as:
String domain = ""; // Your domain; only set if needed.
NtlmPasswordAuthentication npa = new NtlmPasswordAuthentication(domain, userName, password);
SmbFile remoteFile = new SmbFile(String.format("smb://%s/%s", hostName, fileName), npa);
//copyStreamToFile(new SmbFileInputStream(remoteFile), new File(destinationPath), false);
This will probably be the most pragmatic solution, as it requires the least amount of work on the Windows server. This plugs into the existing server framework in Windows, instead of installing more.

Is there a simple method to check if there are changes on an SFTP server?

My objective is to poll the SFTP server for changes. My first thought is to check whether the number of files in the directory has changed, and then maybe run some additional checks for changes within the directory.
Currently I'm using the following:
try {
    FileSystemOptions opts = new FileSystemOptions();
    SftpFileSystemConfigBuilder.getInstance().setStrictHostKeyChecking(opts, "no");
    SftpFileSystemConfigBuilder.getInstance().setUserDirIsRoot(opts, true);
    SftpFileSystemConfigBuilder.getInstance().setTimeout(opts, 60000);
    FileSystemManager manager = VFS.getManager();
    FileObject remoteFile = manager.resolveFile(SFTP_URL, opts);
    FileObject[] fileObjects = remoteFile.getChildren();
    System.out.println(DateTime.now() + " --> total number of files: " + fileObjects.length);
    for (FileObject fileObject : fileObjects) {
        if (fileObject.getName().getBaseName().startsWith("zzzz")) {
            System.out.println("found one: " + fileObject.getName().getBaseName());
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
This is using Apache Commons VFS2 2.2.0. It works "fine", but when the server has too many files, it takes minutes just to get the count (currently it takes over 2 minutes for a server that has ~10k files). Is there any way to get the count, or other changes on the server, faster?
Unfortunately there's no simple way in the SFTP protocol to get the changes. If you can have some daemon running on the server, or if the source of the new files can create/update a helper file, creating such a file with the last modification time in its name or contents can be an option.
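For example, if the producer of the files can touch a single marker file after each upload (the name .last-change below is hypothetical), the client only has to stat that one file per poll instead of listing ~10k children. A sketch reusing the manager and opts from the question, where previouslySeenTimestamp is whatever state the poller keeps between runs:
// Resolve just the marker file rather than listing the whole directory.
FileObject marker = manager.resolveFile(SFTP_URL + "/.last-change", opts);
long lastModified = marker.getContent().getLastModifiedTime();
if (lastModified > previouslySeenTimestamp) {
    // Something changed since the last poll; only now do the expensive full listing.
    previouslySeenTimestamp = lastModified;
}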
I know the SFTP protocol fairly well, having developed commercial SFTP clients and an SFTP server (CompleteFTP), and as far as I know there's no way within the protocol to get a count of files in a directory without listing it. Some servers, such as ours, provide ways of adding custom commands that you can invoke from the client, so it would be possible to add a custom command that returns the number of files in a directory. CompleteFTP also allows you to write custom file-systems, so you could potentially write one that only shows files that have changed after a given timestamp when you do a listing, which might be another approach. Our server only runs on Windows though, so that might be a show-stopper for you.

Java code to connect to Alfresco through an FTPS connection

When I try to connect to our Alfresco through SFTP, it is not able to connect. It hangs the explorer, and no error goes to the log file either.
public void FTPTest() throws SocketException, IOException, NoSuchAlgorithmException {
    FTPSClient ftp = new FTPSClient("SSL");
    System.out.println("1");
    ftp.connect("172.17.178.144", 2121); // or "localhost" in your case
    System.out.println("2" + ftp.getReplyString());
    System.out.println("login: " + ftp.login("admin", "admin"));
    System.out.println("3" + ftp.getReplyString());
    ftp.changeWorkingDirectory("/alfresco");
    // list the files of the current directory
    FTPFile[] files = ftp.listFiles();
    System.out.println("Listed " + files.length + " files.");
    for (FTPFile file : files) {
        System.out.println(file.getName());
    }
    // let's pretend there is a JPEG image in the current folder that we want to copy to the desktop (on a Windows machine)
    ftp.setFileType(FTPClient.BINARY_FILE_TYPE); // don't forget to change to binary mode! or you will have a scrambled image!
    FileOutputStream br = new FileOutputStream("C:\\Documents and Settings\\casonkl\\Desktop\\my_downloaded_image_new_name.jpg");
    ftp.retrieveFile("name_of_image_on_server.jpg", br);
    ftp.disconnect();
}
I only got the following output in the console:
1
At the execution of ftp.connect("172.17.178.144", 2121) the program hangs, and no error appears in the console.
I am able to connect to my Alfresco through SFTP with the FileZilla FTP client. Can anyone help me resolve this issue?
If I'm not mistaken, Alfresco went with FTPS (not SFTP).
So try it with the following code here: http://alvinalexander.com/java/jwarehouse/commons-net-2.2/src/main/java/examples/ftp/FTPSExample.java.shtml
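The gist of that example, adapted to the host and port from the question (a sketch only; check whether your Alfresco FTP endpoint is actually configured for explicit FTPS on port 2121):
import java.io.IOException;
import org.apache.commons.net.ftp.FTPSClient;

// Explicit FTPS: connect in plain text first, then upgrade the session to TLS.
FTPSClient ftps = new FTPSClient(false);   // false = explicit mode
ftps.connect("172.17.178.144", 2121);
ftps.execPBSZ(0);                          // protection buffer size, required before PROT
ftps.execPROT("P");                        // encrypt the data channel
if (!ftps.login("admin", "admin")) {
    throw new IOException("Login failed: " + ftps.getReplyString());
}
ftps.enterLocalPassiveMode();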

Apache FTPClient - incomplete file retrieval on Linux, works on Windows

I have a Java application on WebSphere that is using Apache Commons FTPClient to retrieve files from a Windows server via FTP. When I deploy the application to WebSphere running in a Windows environment, I am able to retrieve all of the files cleanly. However, when I deploy the same application to WebSphere on Linux, there are cases where I get an incomplete or corrupt file. These cases are consistent, in that the same files fail every time and come back with the same number of bytes (usually just a few bytes less than what I should be getting). I would say that I can read approximately 95% of the files successfully on Linux.
Here's the relevant code...
ftpc = new FTPClient();
ftpc.enterLocalPassiveMode();
// set the timeouts to 30 seconds
ftpc.setDefaultTimeout(30000);
ftpc.setDataTimeout(30000);
try {
    String ftpServer = CoreApplication.getProperty("ftp.server");
    String ftpUserID = CoreApplication.getProperty("ftp.userid");
    String ftpPassword = CoreApplication.getProperty("ftp.password");
    log.debug("attempting to connect to ftp server = " + ftpServer);
    log.debug("credentials = " + ftpUserID + "/" + ftpPassword);
    ftpc.connect(ftpServer);
    boolean login = ftpc.login(ftpUserID, ftpPassword);
    if (login) {
        log.debug("Login success...");
    } else {
        log.error("Login failed - connecting to FTP server = " + ftpServer + ", with credentials " + ftpUserID + "/" + ftpPassword);
        throw new Exception("Login failed - connecting to FTP server = " + ftpServer + ", with credentials " + ftpUserID + "/" + ftpPassword);
    }
    is = ftpc.retrieveFileStream(fileName);
    ByteArrayOutputStream out = null;
    try {
        out = new ByteArrayOutputStream();
        IOUtils.copy(is, out);
    } finally {
        IOUtils.closeQuietly(is);
        IOUtils.closeQuietly(out);
    }
    byte[] bytes = out.toByteArray();
    log.info("got bytes from input stream - byte[] size is " + bytes.length);
Any assistance with this would be greatly appreciated.
Thanks.
I have a suspicion that the FTP transfer might be using ASCII rather than binary transfer mode, and mapping what it thinks are Windows end-of-line sequences in the files to Unix end-of-lines. For files that are really text, this will work. For files that are really binary, the result will be corruption and a slightly shorter file if the file contains certain sequences of bytes.
See FTPClient.setFileType(...).
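Concretely, that means switching the client to binary (image) mode before calling retrieveFileStream(...); a minimal sketch against Apache Commons Net, using the ftpc instance from the question:
import org.apache.commons.net.ftp.FTP;

// ASCII is the default transfer type; binary mode stops the client/server
// from rewriting end-of-line sequences during the transfer.
ftpc.setFileType(FTP.BINARY_FILE_TYPE);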
FOLLOWUP
... so why this would work on Windows and not Linux remains a mystery for another day.
The mystery is easy to explain. You were FTP'ing files from a Windows machine to a Windows machine, so there was no need to change the end-of-line markers.
