Hadoop File Empty after Write - java

We have an application that retrieves data from MongoDB and writes it to a Hadoop cluster.
The data is a list of strings that are converted to JSON and written to Hadoop using the following logic:
Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
conf.set("fs.defaultFS", HadoopConstants.HDFS_HOST + HadoopConstants.HDFS_DEFAULT_FS);

FSDataOutputStream out = null;
FileSystem fileSystem = FileSystem.get(conf);

// Create Hadoop FS path and directory structure
if (!fileSystem.exists(new Path(dir))) {
    // Create new directory
    fileSystem.mkdirs(new Path(dir), FsPermission.getDefault());
    out = fileSystem.create(new Path(filepath));
} else {
    if (!fileSystem.exists(new Path(filepath))) {
        out = fileSystem.create(new Path(filepath));
    } else {
        // Should not reach here: overwrite any existing file
        fileSystem.delete(new Path(filepath), true);
        out = fileSystem.create(new Path(filepath));
    }
}

for (Iterator<String> it = list.iterator(); it.hasNext();) {
    String node = it.next();
    out.writeBytes(node);
    out.writeBytes("\n");
}

LOGGER.debug("Write to HDFS successful");
out.close();
The application works well in the QA and Staging environments.
In the production environment, which sits behind an additional firewall (the firewall has now been opened to grant write access), the following problem is seen: the file is created, but the final Hadoop file is empty, i.e. its size is 0 bytes.
The hadoop fs -du and hadoop fsck output for the file being written is attached in the screenshot (and summarised below). The size after replication increases to 384M during the write but then drops back to 0.
Is this because out.close() in the code above is not being called?
That would not explain why the QA data is written correctly.
Could it be a firewall issue?
The file is being created correctly, so it does not seem to be a connectivity issue, unless the data written after the file is created and opened is not being flushed correctly and therefore never saved.
These are the file's details during the write:
$ hadoop fs -du -h file.json
0 384M ...
The "size after replication" value above increases to 384M and then drops back to 0 after a while. Does this mean data is arriving but not being flushed to disk correctly?
$ hadoop fsck
What are some ways I could verify, on the Hadoop side, whether the data is actually arriving?
**** UPDATE ****
The following exception is thrown in the client logs during execution of this line:
out.close();
HDFSWriter ::Write Failed :: Could not get block locations. Source file "part-m-2017102304-0000.json" - Aborting...
The Hadoop httpfs.out log has the following:
hadoop-httpfs ... INFO httpfsaudit: [/part-m-2017102304-0000.json] offset [0] len [204800]

This means that you have firewall access to the namenode (which is why the file can be created), but not to the datanodes (which are needed to actually write data into the file).
Get the firewall rules updated so that you also have access to the datanodes.
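As a quick client-side check, you can write a small probe file, flush it, and then ask the namenode which datanodes hold its block and whether those host:port pairs are reachable from your client machine. This is only a rough sketch (the probe path and timeout are arbitrary choices), but if the raw connections fail, the firewall is the likely culprit:

import java.net.InetSocketAddress;
import java.net.Socket;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DatanodeProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a tiny probe file and push it out to the datanode pipeline.
        Path probe = new Path("/tmp/datanode-probe.txt"); // arbitrary test path
        try (FSDataOutputStream out = fs.create(probe, true)) {
            out.writeBytes("probe\n");
            out.hflush(); // force the bytes to the datanodes, not just the client buffer
        }

        // Ask the namenode where the block lives, then try a raw TCP connect
        // to each datanode to see whether the firewall lets the client through.
        FileStatus status = fs.getFileStatus(probe);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            for (String hostPort : loc.getNames()) { // each entry is "host:port" of a datanode
                String host = hostPort.split(":")[0];
                int port = Integer.parseInt(hostPort.split(":")[1]);
                try (Socket s = new Socket()) {
                    s.connect(new InetSocketAddress(host, port), 5000);
                    System.out.println("Reachable: " + hostPort);
                } catch (Exception e) {
                    System.out.println("NOT reachable: " + hostPort + " (" + e + ")");
                }
            }
        }
    }
}

If even the probe write fails with the same "Could not get block locations" error, that already points at the datanodes being unreachable.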

Related

NoSuchMethodError exception when read the protobuf file from HDFS

I am writing a Java program to count the lines of a protobuf file stored in HDFS, and I run it with "hadoop jar countLine.jar".
However, I get the exception
Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.shouldDiscardUnknownFields()Z at com.google.protobuf.GeneratedMessageV3.parseUnknownField(GeneratedMessageV3.java:290)
This only happens with some of the protobuf files. Files with a different schema do not have this issue.
My protobuf file is gzipped (pb.gz).
// Here is the code
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path(<HDFS path to file>);
InputStream input = new GZIPInputStream(fs.open(path));
Message m;
while ((m = defaultMsg.getParserForType().parseDelimitedFrom(input)) != null) {
    recordCount++;
}
If I put the file on the local filesystem, everything works fine:
InputStream input = new GZIPInputStream(new FileInputStream(path_to_local_file));
Message m;
while ((m = defaultMsg.getParserForType().parseDelimitedFrom(input)) != null) {
    recordCount++;
}
Does anyone have an idea? Could the file size be causing this issue?
Thanks
David
Thank you @jwismar for the hints.
The issue happens when I run "hadoop jar countLine.jar" from the command line. The Hadoop classloader loads the protobuf library that ships with Hadoop, which is a lower version than the protoc I used to generate the Java files. Once I downgraded protoc to the lower version and re-generated the Java files, the issue was gone.
Thanks
David
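A quick way to confirm this kind of classpath conflict is to print where the protobuf classes are actually loaded from when the program is launched with "hadoop jar". A minimal sketch (only the com.google.protobuf class comes from the stack trace above; the rest is illustrative):

import com.google.protobuf.CodedInputStream;

public class WhichProtobuf {
    public static void main(String[] args) {
        // Prints the jar (or directory) the protobuf runtime was loaded from.
        // Launched via "hadoop jar", this typically points at the protobuf jar
        // bundled with the Hadoop distribution rather than the one you compiled against.
        System.out.println(
            CodedInputStream.class.getProtectionDomain().getCodeSource().getLocation());
    }
}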

Load csv file in neo4j embedded in Java

I have a Neo4j embedded DB in my application and I want to load a .csv file to fill the database. I've managed to create the .csv file in the /import folder, but when I try to load it, I get a "Couldn't load the external resource at: file:/csv_file.csv" error.
I've read it may be a permissions issue, but I cannot change them since the whole folder structure is recreated every time I run my application (I tried changing them after creating the file with Runtime.getRuntime().exec("chmod 777 /path_to_file/csv_file.csv"), but it never works).
This is my code:
public void addCSVtoDB() {
    try ( Transaction ignored = graphDb.beginTx();
          // the rest of the Cypher statement is omitted here
          Result result = graphDb.execute( "LOAD CSV WITH HEADERS FROM \"file:///csv_file.csv\" AS csvLine\n" ) )
    {
    }
}
I'm using Mac OS X 10.11, so the forward slashes should be all right. Any idea?
OK, so I finally solved it: everywhere it says that the .csv files have to be in the /import folder, but that does not apply to embedded databases. The path to the file is absolute, so I put the .csv file in my /Users/MY_USER/ folder and wrote the Cypher query as follows:
public void addCSVtoDB() {
    try ( Transaction ignored = graphDb.beginTx();
          // the rest of the Cypher statement is omitted here
          Result result = graphDb.execute( "LOAD CSV WITH HEADERS FROM \"file:////Users/My_User/csv_file.csv\" AS csvLine\n" ) )
    {
    }
}
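For completeness, a full statement would pair the LOAD CSV line with a clause that actually does something with each row. A minimal sketch, assuming a hypothetical name column in the CSV and the same graphDb field as above (Neo4j 3.x embedded API):

// Assumes the same graphDb (GraphDatabaseService) field used in the code above.
public void addCSVtoDB() {
    try ( Transaction tx = graphDb.beginTx();
          Result result = graphDb.execute(
              "LOAD CSV WITH HEADERS FROM \"file:////Users/My_User/csv_file.csv\" AS csvLine "
              + "CREATE (:Person { name: csvLine.name })" ) )
    {
        tx.success(); // mark the transaction as successful so the created nodes are committed
    }
}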

Overwriting HDFS file/directory through Spark

Problem
I have a file saved in HDFS, and all I want to do is run my Spark application, compute a result JavaRDD, and use saveAsTextFile() to store the new "file" in HDFS.
However, Spark's saveAsTextFile() does not work if the file already exists; it does not overwrite it.
What I tried
So I searched for a solution and found that one possible way to make it work is to delete the file through the HDFS API before trying to save the new one.
I added this code:
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}
filerdd.saveAsTextFile("/hdfs/" + filename);
When I try to run my Spark application, the file is deleted, but I get a FileNotFoundException.
Considering that this exception occurs when something tries to read a file from a path that does not exist, this makes no sense, because after deleting the file there is no code that tries to read it.
Part of my code
JavaRDD<String> filerdd = sc.textFile("/hdfs/" + filename); // load the file here
...
// Transformations here
filerdd = filerdd.map(....);
...
// Delete old file here
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}

// Write new file here
filerdd.saveAsTextFile("/hdfs/" + filename);
I am trying to do the simplest thing here, but I have no idea why this does not work. Is filerdd somehow still tied to the path?
The problem is that you use the same path for input and output. Spark's RDDs are evaluated lazily: nothing actually runs until you call saveAsTextFile, and at that point you have already deleted newFolderPath, so filerdd will complain.
In any case, you should not use the same path for input and output.
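If the result really has to end up at the original path, one workaround is to save to a temporary path first, which forces the RDD to be computed while the input still exists, and only then delete the old data and rename the new output into place. A rough sketch, reusing filerdd and filename from the question (the _tmp suffix is just an illustrative choice):

Path finalPath = new Path("/hdfs/" + filename);          // original location
Path tempPath  = new Path("/hdfs/" + filename + "_tmp"); // hypothetical temporary location

// Writing to the temp path triggers the actual computation, which still
// reads from the original file, because nothing has been deleted yet.
filerdd.saveAsTextFile(tempPath.toString());

// Now it is safe to replace the old data with the new output.
FileSystem hdfs = FileSystem.get(new Configuration());
if (hdfs.exists(finalPath)) {
    hdfs.delete(finalPath, true);
}
hdfs.rename(tempPath, finalPath);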

Reading files from a windows shared directory in parallel

I have a server that reads a list of text files from a Windows shared directory and saves their contents to the DB before it starts to accept user messages. This server will be running on multiple machines at the same time.
I see that when I run the server on multiple machines, the instance that starts processing the files first processes all of them, and the others keep waiting to get access to the files in that directory.
My code does this (I cannot post the actual code due to security policy):
Get a list of all files in the shared directory.
Sort them by modified date (it is saving time-series data).
Loop (while(true)) as long as more files exist in the directory:
Get the first file in the list, move it to an InProgress folder and read it.
Save the contents to the database.
Move the file to an Archive directory.
Process the next file.
I see that when I run the same program on 2 different machines, one of them gets hold of the files first and loads them all. The other keeps waiting to get a handle on the files, and by the time it gets one, they have already been processed, so it moves on.
My aim is for the process, when run on two or more machines, to work through all the files in parallel and finish faster. For now I am testing with 500 files on disk, but there can be more at any given time.
Pseudocode:
if (files exist on disk) {
    LOGGER.info("Files exist on disk. Lets process them up first....");
    while (true) {
        File dir = new File(directory);
        List<File> fileList = new LinkedList<File>(Arrays.asList(dir.listFiles((FileFilter) FileFileFilter.FILE)));
        LOGGER.info("No of files in this process: " + fileList.size());
        if (fileList.size() > 0) {
            Collections.sort(fileList, new Server().new FileComparator());
            File file = fileList.get(0);
            String fileName = file.getName();
            // If I cannot rename the file in the same directory, the file may be open, so I move on to the next file
            if (!file.renameTo(file.getAbsoluteFile())) {
                LOGGER.info("Read next file...");
                continue;
            }
            LOGGER.info("Get file handle...");
            if (file.exists()) {
                // Move the file into the InProgress folder and process it from there
                File inprogressFile = new File(dataDirName + FileBackupOnDisk.INPROGRESS + fileName);
                file.renameTo(inprogressFile);
                boolean savedToDB = saveToDB(inprogressFile);
                if (savedToDB) {
                    if (inprogressFile.renameTo(new File(dataDirName + ARCHIVE + fileName)))
                        LOGGER.info("Moved file to archive - " + fileName);
                    else
                        LOGGER.error("Move file " + fileName + " to failed directory!");
                }
            }
        }
    }
}
This is my file comparator code. It cannot be what is keeping files open:
final Map<File, Long> staticLastModifiedTimes = new HashMap<File, Long>();
for (final File f : sortedFileList) {
    staticLastModifiedTimes.put(f, f.lastModified());
}
Collections.sort(sortedFileList, new Comparator<File>() {
    @Override
    public int compare(final File f1, final File f2) {
        return staticLastModifiedTimes.get(f1).compareTo(staticLastModifiedTimes.get(f2));
    }
});
How do I make sure that multiple servers running on different machines are able to access the shared directory in parallel? Right now it looks like the second process finds that files exist in the directory but hangs at some point waiting to get a file handle.
Let me know if anyone has done this before, and how.
It turns out that my solution above works perfectly fine!
It's just that running one instance from my Eclipse and another from a machine on the network was causing the latency issues.
If I run the program on 2 machines in the same network it works fine; my computer was just slower. Both instances read the files whenever they are able to get a handle on them.
Thank you all for your help.
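As a side note, if the file-claiming step ever needs to be more robust than renameTo on a network share, one option is to treat the move into the InProgress folder itself as the claim, since only one machine's move can succeed. A minimal java.nio sketch (the class and method names here are made up for illustration):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class FileClaimer {
    // Try to claim a file by moving it into this machine's InProgress folder.
    // If another server has already moved (claimed) it, the move fails and we skip it.
    static boolean tryClaim(Path file, Path inProgressDir) {
        try {
            Files.move(file, inProgressDir.resolve(file.getFileName()),
                       StandardCopyOption.ATOMIC_MOVE);
            return true;  // this machine owns the file now
        } catch (IOException e) {
            // e.g. NoSuchFileException or FileAlreadyExistsException when another server
            // got there first, or AtomicMoveNotSupportedException on some shares
            return false;
        }
    }
}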

Copying HDFS directory to local node

I'm working on a single node Hadoop 2.4 cluster.
I'm able to copy a directory and all its content from HDFS using hadoop fs -copyToLocal myDirectory .
However, I'm unable to do the same operation successfully via this Java code:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Configuration conf = new Configuration(true);
    FileSystem hdfs = FileSystem.get(conf);
    hdfs.copyToLocalFile(false, new Path("myDirectory"),
            new Path("C:/tmp"));
}
This code only copies a part of myDirectory. I also receive some error messages:
14/08/13 14:57:42 INFO mapreduce.Job: Task Id : attempt_1407917640600_0013_m_000001_2, Status : FAILED
Error: java.io.IOException: Target C:/tmp/myDirectory is a directory
My guess is that multiple instances of the mapper are trying to copy the same file to the same node at the same time. However, I don't see why not all of the content gets copied.
Is that the reason for my errors, and how could I solve it?
You can use the DistributedCache (documentation) to make your files available on all the worker nodes, or you could try copying the files in your mapper's setup() method.
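A minimal sketch of the second suggestion, doing the copy once in the mapper's setup() instead of inside map(). The HDFS source ("myDirectory") and the local target ("C:/tmp") come from the question; copying into a per-task-attempt subdirectory is an assumption to keep parallel tasks on the same node from overwriting each other:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CopyingMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem hdfs = FileSystem.get(conf);

        // Copy the directory once per task attempt, into a task-specific folder,
        // so parallel map tasks on the same node do not write over each other.
        Path local = new Path("C:/tmp/" + context.getTaskAttemptID().toString());
        hdfs.copyToLocalFile(false, new Path("myDirectory"), local);
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... normal per-record work here, using the locally copied data ...
    }
}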
