Read and Write Parquet and HDFS files from Java

Currently I can read and write HDFS files in Java, but I don't know how to read and write Apache Parquet files (besides plain HDFS files). My idea is to be able to read and write both kinds of file in Java.
package com.leerhdfs;

//import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

import java.io.*;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class ReadWriteHDFSExample {
    public static void main(String[] args) throws IOException {
        String localsrc = args[0];
        String destinosrc = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localsrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(destinosrc), conf);
        OutputStream out = fs.create(new Path(destinosrc), new Progressable() {
            public void progress() {
                System.out.println(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
Please help me!!!
Thanks!!
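For the Parquet part, the usual route from plain Java is the parquet-avro module. Below is a minimal sketch, assuming org.apache.parquet:parquet-avro (1.8+) and the Hadoop client jars are on the classpath; the schema, the "text" field, and the file location are made up for illustration:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetReadWriteExample {
    // Hypothetical one-field schema, just for the sketch
    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Line\",\"fields\":[{\"name\":\"text\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Works with hdfs:// or file:// URIs alike, depending on the Configuration
        Path path = new Path(args[0]);

        // Write one record (the writer fails if the file already exists)
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(path)
                .withSchema(SCHEMA)
                .withConf(conf)
                .build()) {
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("text", "hello parquet");
            writer.write(record);
        }

        // Read the records back
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(path)
                .withConf(conf)
                .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record.get("text"));
            }
        }
    }
}

Since the Path is resolved through the Hadoop Configuration, the same code covers both local files and HDFS, which is what the question asks for.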

Related

Hadoop is complaining for nonexistent anonymous class (NoClassDefFoundError)

Consider a simple Java file which creates a BufferedInputStream to copy a local file 1400-8.txt to Hadoop HDFS and prints some dots as a progress status. The example is Example 3-3 from the Hadoop book here.
// cc FileCopyWithProgress Copies a local file to a Hadoop filesystem, and shows progress
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

// vv FileCopyWithProgress
public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
// ^^ FileCopyWithProgress
I compile the code and create the JAR file with
hadoop com.sun.tools.javac.Main FileCopyWithProgress.java
jar cf FileCopyWithProgress.jar FileCopyWithProgress.class
The above commands generate the files FileCopyWithProgress.class, FileCopyWithProgress$1.class and FileCopyWithProgress.jar. Then, I try to run it
hadoop jar FileCopyWithProgress.jar FileCopyWithProgress 1400-8.txt hdfs://localhost:9000/user/kostas/1400-8.txt
But, I receive the error
Exception in thread "main" java.lang.NoClassDefFoundError:
FileCopyWithProgress$1
To my understanding, the FileCopyWithProgress$1.class is due to the anonymous callback function the program declares. But since the file exists, what is the issue here? Am I running the correct sequence of commands?
I found the issue, so I am just posting in case it helps someone: I had to include the class FileCopyWithProgress$1.class in the JAR. The correct command should be
jar cf FileCopyWithProgress.jar FileCopyWithProgress*.class
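As a quick sanity check (not part of the original answer), you can list the archive's contents to confirm the anonymous class made it in:
jar tf FileCopyWithProgress.jar
Both FileCopyWithProgress.class and FileCopyWithProgress$1.class should appear in the listing.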

Convert docx file to pdf in Java issue

I am developing a project which needs a docx file to be converted to PDF. I found the same question already posted and used the code which was provided by "Kishan C S". It uses docx4j 2.8.1.
The code is working fine and the PDF is generated, but the only problem I am facing is that the docx file contains a logo.jpg (an image in the header part) which is not converted. Only the textual content is converted to PDF.
I am posting the code which I have used. Please let me know how I can solve the problem.
P.S.: the link I referred to is Convert docx file into PDF with Java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.List;
import org.apache.log4j.Level;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.fonts.IdentityPlusMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class DocxConverter {
    public static void main(String[] args) throws FileNotFoundException, Docx4JException, Exception {
        InputStream is = new FileInputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.docx"));
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);
        List sections = wordMLPackage.getDocumentModel().getSections();
        for (int i = 0; i < sections.size(); i++) {
            wordMLPackage.getDocumentModel().getSections().get(i).getPageDimensions();
        }
        Mapper fontMapper = new IdentityPlusMapper();
        PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS"); // set your desired font
        fontMapper.getFontMappings().put("Algerian", font);
        wordMLPackage.setFontMapper(fontMapper);
        PdfSettings pdfSettings = new PdfSettings();
        org.docx4j.convert.out.pdf.PdfConversion conversion =
                new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
        // To turn off logger
        List<Logger> loggers = Collections.<Logger>list(LogManager.getCurrentLoggers());
        loggers.add(LogManager.getRootLogger());
        for (Logger logger : loggers) {
            logger.setLevel(Level.OFF);
        }
        OutputStream out = new FileOutputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.pdf"));
        conversion.output(out, pdfSettings);
        System.out.println("DONE!!");
    }
}
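A side note, not from the original thread: docx4j 3.x replaced the org.docx4j.convert.out.pdf.viaXSLFO classes used above with a single facade method, org.docx4j.Docx4J.toPDF. If upgrading is an option, it may be worth checking whether the newer export path picks up the header image. A minimal sketch, assuming docx4j 3.x (with its FO export module) is on the classpath:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.docx4j.Docx4J;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class DocxToPdf {
    public static void main(String[] args) throws Exception {
        WordprocessingMLPackage pkg =
                WordprocessingMLPackage.load(new File("D:\\Test\\C_IN0004_AppointmentLetter.docx"));
        try (OutputStream out = new FileOutputStream("D:\\Test\\C_IN0004_AppointmentLetter.pdf")) {
            Docx4J.toPDF(pkg, out); // single-call PDF export in docx4j 3.x
        }
    }
}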

Why Java Files.createFile with permissions doesn't work correctly?

I need to create a file and set permissions (-rwxrw-r) on it; the permission of the parent dir is (drwxrwxr--). The problem is that the write permission is not set on the created files. The user that ran this application is the owner of the parent dir.
Below is my test class that exhibits the same problem. When I run this program, the permissions of the generated file are (-rwxr--r--) even though the class sets the permissions (-rwxrw-rw-). Why is the write permission not set, and why don't I see any exception?
Any idea?
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.attribute.FileAttribute;
import java.nio.file.attribute.PosixFileAttributes;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

public class TestPermission {
    static String parentDir = "/tmp/test/";
    static Set<PosixFilePermission> defaultPosixPermissions = null;

    static {
        defaultPosixPermissions = new HashSet<>();
        defaultPosixPermissions.add(PosixFilePermission.OWNER_READ);
        defaultPosixPermissions.add(PosixFilePermission.OWNER_WRITE);
        defaultPosixPermissions.add(PosixFilePermission.OWNER_EXECUTE);
        defaultPosixPermissions.add(PosixFilePermission.GROUP_READ);
        defaultPosixPermissions.add(PosixFilePermission.GROUP_WRITE);
        // Others have read permission so that an ftp user who doesn't belong to the group can fetch the file
        defaultPosixPermissions.add(PosixFilePermission.OTHERS_READ);
        defaultPosixPermissions.add(PosixFilePermission.OTHERS_WRITE);
    }

    public static void createFileWithPermission(String fileName) throws IOException {
        // File parentFolder = new File(parentDir);
        // PosixFileAttributes attrs = Files.readAttributes(parentFolder.toPath(), PosixFileAttributes.class);
        // System.out.format("parentfolder permissions: %s %s %s%n",
        //         attrs.owner().getName(),
        //         attrs.group().getName(),
        //         PosixFilePermissions.toString(attrs.permissions()));
        // FileAttribute<Set<PosixFilePermission>> attr = PosixFilePermissions.asFileAttribute(attrs.permissions());
        FileAttribute<Set<PosixFilePermission>> attr = PosixFilePermissions.asFileAttribute(defaultPosixPermissions);
        File file = new File(fileName);
        Files.createFile(file.toPath(), attr);
    }

    public static void main(String[] args) throws IOException {
        String fileName = parentDir + "testPermission_" + System.currentTimeMillis();
        createFileWithPermission(fileName);
    }
}
I believe the catch here is
The check for the existence of the file and the creation of the new
file if it does not exist are a single operation that is atomic with
respect to all other filesystem activities that might affect the
directory.
as mentioned in Class Files
This might be because of OS operations that happen after the file is created. The following modification to the code should make things work:
File file = new File(fileName);
Files.createFile(file.toPath(), attr);
Files.setPosixFilePermissions(file.toPath(), defaultPosixPermissions); //Assure the permissions again after the file is created
It turns out that the reason is that my OS has a umask of 0027 (u=rwx,g=rx,o=), which means the application has no way to set permissions for the "others" group at creation time.

Files.createFile(file.toPath(), attr);

In the above line, instead of using Files.createFile, use file.createNewFile(); if it returns true, then the file was created.
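A small way to see the umask in action (a sketch, not from the original answers; the /tmp path is hypothetical) is to read the permissions back after creation and compare:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.EnumSet;
import java.util.Set;

public class PermissionCheck {
    public static void main(String[] args) throws IOException {
        Path p = Paths.get("/tmp/testPermission_check"); // hypothetical path
        Set<PosixFilePermission> wanted = EnumSet.of(
                PosixFilePermission.OWNER_READ, PosixFilePermission.OWNER_WRITE,
                PosixFilePermission.OWNER_EXECUTE, PosixFilePermission.GROUP_READ,
                PosixFilePermission.GROUP_WRITE, PosixFilePermission.OTHERS_READ);
        // On POSIX systems the mode given at creation is filtered through the process umask
        Files.createFile(p, PosixFilePermissions.asFileAttribute(wanted));
        System.out.println("requested: " + PosixFilePermissions.toString(wanted));
        System.out.println("actual:    " + PosixFilePermissions.toString(Files.getPosixFilePermissions(p)));
        // Setting permissions explicitly after creation is not subject to the umask
        Files.setPosixFilePermissions(p, wanted);
        System.out.println("after set: " + PosixFilePermissions.toString(Files.getPosixFilePermissions(p)));
    }
}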

How to access the text from a text file present in the Hadoop cache

I have 2 files which need to be accessed by the Hadoop cluster. Those two files are good.txt and bad.txt respectively.
Firstly, since both these files need to be accessed from different nodes, I place the two files in the distributed cache in the driver class as follows:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/good.txt"),conf);
DistributedCache.addCacheFile(new URI("/user/training/Rakshith/bad.txt"),conf);
Job job = new Job(conf);
Now both the good and bad files are placed in the distributed cache. I access the distributed cache in the mapper class as follows:
public class LetterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private Path[] files;

    @Override
    protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
            throws IOException, InterruptedException {
        files = DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
    }
I need to check if a word is present in good.txt or bad.txt. So I use something like this:
File file = new File(files[0].toString()); // to access good.txt
BufferedReader br = new BufferedReader(new FileReader(file));
StringBuilder sb = new StringBuilder();
String input = null;
while ((input = br.readLine()) != null) {
    sb.append(input);
}
input = sb.toString();
I am supposed to get the content of the good file in my input variable, but I don't. Have I missed anything?
Does the job finish successfully? The map task may fail because you are using JobConf in this line:
files=DistributedCache.getLocalCacheFiles(new JobConf(context.getConfiguration()));
If you change it like this it should work; I don't see any problem with the remaining code you posted in the question.
files=DistributedCache.getLocalCacheFiles(context.getConfiguration());
or
files=DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
@rVr this is my driver class:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgWordLength {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Usage: AvgWordLength <input dir> <output dir>\n");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        DistributedCache.addCacheFile(new URI("/user/training/Rakshith/good.txt"), conf);
        DistributedCache.addCacheFile(new URI("/user/training/Rakshith/bad.txt"), conf);
        Job job = new Job(conf);
        job.setJarByClass(AvgWordLength.class);
        job.setJobName("Average Word Length");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(LetterMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
And my mapper class is
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class LetterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private Path[] files;

    @Override
    protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
            throws IOException, InterruptedException {
        files = DistributedCache.getLocalCacheFiles(new Configuration(context.getConfiguration()));
        System.out.println("in setup()" + files.toString());
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println("in map----->>" + files.toString()); // added just to view logs
        HashMap<String, String> h = new HashMap<String, String>();
        String input = value.toString();
        // isPresent takes a Path, so pass the cache file Path directly
        if (isPresent(input, files[0])) {
            h.put(input, "good");
        } else if (isPresent(input, files[1])) {
            h.put(input, "bad");
        }
    }

    public static boolean isPresent(String n, Path files2) throws IOException {
        File file = new File(files2.toString());
        BufferedReader br = new BufferedReader(new FileReader(file));
        StringBuilder sb = new StringBuilder();
        String input = null;
        // readLine() already returns a String (or null at end of file), so no toString() call
        while ((input = br.readLine()) != null) {
            sb.append(input);
        }
        br.close();
        input = sb.toString();
        // System.out.println(input);
        Pattern pattern = Pattern.compile(n);
        Matcher matcher = pattern.matcher(input);
        return matcher.find();
    }
}
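For what it's worth, DistributedCache is deprecated in Hadoop 2.x in favour of methods on Job and the task Context. A minimal sketch of the replacement calls (fragments, not from the original thread):

// Driver side (Hadoop 2.x): register the cache files directly on the Job
Job job = Job.getInstance(conf, "Average Word Length");
job.addCacheFile(new URI("/user/training/Rakshith/good.txt"));
job.addCacheFile(new URI("/user/training/Rakshith/bad.txt"));

// Mapper side: resolve the localized copies in setup()
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();
    // cacheFiles[0] / cacheFiles[1] point at the localized good.txt / bad.txt
}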

copying directory from local system to hdfs java code

I'm having a problem trying to copy a directory from my local system to HDFS using java code. I'm able to move individual files but can't figure out a way to move an entire directory with sub-folders and files. Can anyone help me with that? Thanks in advance.
Just use the FileSystem's copyFromLocalFile method. If the source Path is a local directory it will be copied to the HDFS destination:
...
Configuration conf = new Configuration();
conf.addResource(new Path("/home/user/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/home/user/hadoop/conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("/home/user/directory/"),
        new Path("/user/hadoop/dir"));
...
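For completeness, a self-contained version of that snippet (a sketch; the config file locations and the source/destination paths are assumptions to adjust for your installation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyDirToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed locations of the cluster config files
        conf.addResource(new Path("/home/user/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/home/user/hadoop/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(conf);
        // If the source is a directory, its sub-folders and files are copied recursively
        fs.copyFromLocalFile(new Path("/home/user/directory/"), new Path("/user/hadoop/dir"));
    }
}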
Here is the full working code to read from and write into HDFS. It takes two arguments:
Input path (local / HDFS)
Output path (HDFS)
I used the Cloudera sandbox.
package hdfsread;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadingAFileFromHDFS {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        InputStream in = null;
        Configuration myConf = new Configuration();
        Path outputPath = new Path(args[1]);
        myConf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020");
        FileSystem fSystem = FileSystem.get(URI.create(uri), myConf);
        OutputStream os = fSystem.create(outputPath);
        try {
            in = new BufferedInputStream(new FileInputStream(uri));
            IOUtils.copyBytes(in, os, 4096, false);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the stream that was actually opened (the original opened "is"
            // inside the try block but closed a different, never-assigned "in")
            IOUtils.closeStream(in);
            IOUtils.closeStream(os);
        }
    }
}
