I want to combine several small bzip2 files into a sequence file. I saw some code to create a sequence file and tried it, but it gives strange output, shown below. Is this because it is unable to read bzip2 files?
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
�*org.apache.hadoop.io.compress.DefaultCodec����gWŒ‚ÊO≈îbº¡vœÖ��� ���
.DS_StorexúÌò±
¬0EÔ4.S∫a�6∞¢0P∞=0ì·‡/d)ÄDï˛ì¨w≈ù7÷ùØ›⁄ÖüO;≥X¬`’∂µóÆ Æâ¡=Ñ B±lP6Û˛ÜbÅå˜C¢3}ª‘�Lp¥oä"ùËL?jK�&:⁄”Åét¢3]Î
º∑¿˘¸68§ÄÉùø:µ√™*é-¿fifi>!~¯·0Ùˆú ¶ eõ¯c‡ÍÉa◊':”ÍÑòù;I1•�∂©���00.json.bz2xúL\gWTK∞%
,Y
ä( HJFêúsŒ\PrRrŒ9ÁCŒ9√0ÃZUÏÌÊΩÔ≤Ù‚Ãô”’UªvÌÍÓ3£oˆä2ä<˝”-”ãȧπË/d;u¥Û£üV;ÀÒÛ¯Ú˜ˇ˚…≥2¢5Í0‰˝8M⁄,S¸¢`f•†`O<ëüD£≈tÃ¥ó`•´D˚~aº˝«õ˜v'≠)(F|§fiÆÕ ?y¬àœTÒÊYåb…U%E?⁄§efiWˇÒY#üÛÓÓ‚
⁄è„ÍåÚÊU5‡ æ‚Â?q‘°�À{©?íWyü÷ÈûF<[˘éŒhãd>x_ÅÁ
fiÒ_eâ5-—|-M)˙)¸R·ªCÆßs„F>UŒ©ß{o„uÔ&∫˚˚Ÿ?Ä©ßW,”◊Ê∫â«õxã¸[yûgÈñFmx|‡ªÍ¶”¶‡Óp-∆ú§ı
<JN t «F4™#Àä¥Jœ¥‰√|E„‘œ„&º§#g|ˆá{iõOx
The code is:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
public class cinput {
/**
* @param args
* @throws IOException
* @throws IllegalAccessException
* @throws InstantiationException
*/
public static void main(String[] args) throws IOException,
InstantiationException, IllegalAccessException {
// TODO Auto-generated method stub
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
Path inputFile = new Path(otherArgs[0]);
Path outputFile = new Path(otherArgs[1]);
FSDataInputStream inputStream;
Text key = new Text();
Text value = new Text();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
outputFile, key.getClass(), value.getClass());
FileStatus[] fStatus = fs.listStatus(inputFile);
for (FileStatus fst : fStatus) {
String str = "";
System.out.println("Processing file : " + fst.getPath().getName() + " and the size is : " + fst.getPath().getName().length());
inputStream = fs.open(fst.getPath());
key.set(fst.getPath().getName());
while(inputStream.available()>0) {
str = str+inputStream.readLine();
// System.out.println(str);
}
value.set(str);
writer.append(key, value);
}
IOUtils.closeStream(writer); // close the writer before the FileSystem it writes through
fs.close();
System.out.println("SEQUENCE FILE CREATED SUCCESSFULLY........");
}
}
The input I am passing is JSON .bz2 files. Could someone please point out why I am getting this strange output?
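The output strongly suggests the loop is storing the raw compressed bytes: readLine() on the FSDataInputStream never decompresses anything, so the bzip2 binary (and even the .DS_Store file sitting in the folder) lands verbatim in the values. A minimal sketch of the same loop with decompression added, reusing conf, fs, key, value and writer from the code above (it additionally needs org.apache.hadoop.io.compress.CompressionCodec, CompressionCodecFactory, java.io.BufferedReader and java.io.InputStreamReader):
// Sketch: decompress each input through the codec Hadoop infers from the
// file extension (.bz2 -> BZip2Codec) before appending it to the SequenceFile.
CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
for (FileStatus fst : fs.listStatus(inputFile)) {
    CompressionCodec codec = codecs.getCodec(fst.getPath());
    if (codec == null) {
        continue; // skip files that are not compressed (e.g. .DS_Store)
    }
    StringBuilder sb = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            codec.createInputStream(fs.open(fst.getPath()))))) {
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
    }
    key.set(fst.getPath().getName());
    value.set(sb.toString());
    writer.append(key, value);
}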
(Folder structure and console output were shown in screenshots, not reproduced here.)
I'd like to write a test class for the two methods below:
package jfe;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.utils.IOUtils;
public class JThreadFile {
/**
* uncompresses .tar file
* @param in
* @param out
* @throws IOException
*/
public static void decompressTar(String in, File out) throws IOException {
try (TarArchiveInputStream tin = new TarArchiveInputStream(new FileInputStream(in))){
TarArchiveEntry entry;
while ((entry = tin.getNextTarEntry()) != null) {
if (entry.isDirectory()) {
continue;
}
File curfile = new File(out, entry.getName());
File parent = curfile.getParentFile();
if (!parent.exists()) {
parent.mkdirs();
}
try (FileOutputStream fos = new FileOutputStream(curfile)) {
    IOUtils.copy(tin, fos); // close the stream so the file is fully flushed to disk
}
}
}
}
/**
* uncompresses .7z file
* @param in
* @param destination
* @throws IOException
*/
public static void decompressSevenz(String in, File destination) throws IOException {
//@SuppressWarnings("resource")
SevenZFile sevenZFile = new SevenZFile(new File(in));
SevenZArchiveEntry entry;
while ((entry = sevenZFile.getNextEntry()) != null){
if (entry.isDirectory()){
continue;
}
File curfile = new File(destination, entry.getName());
File parent = curfile.getParentFile();
if (!parent.exists()) {
parent.mkdirs();
}
try (FileOutputStream out = new FileOutputStream(curfile)) {
    byte[] content = new byte[(int) entry.getSize()];
    sevenZFile.read(content, 0, content.length);
    out.write(content);
}
}
sevenZFile.close();
}
}
I use TestNG and try to read from a folder, filter it for certain extensions (.tar and .7z), feed those files to the decompress methods, and compare the result to the expected output folder with assertEquals. I manage to read the file names from the folder (see console output) but can't feed them to decompressTar(String in, File out). Is this because "result" is a String array and I need a String? I have no clue how TestNG's DataProvider handles data. Any help would be appreciated :) Thank you :)
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import org.testng.Assert;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
public class JThreadFileTest {
protected static final File ACT_INPUT = new File("c:\\Test\\TestInput\\"); //input directory
protected static final File EXP_OUTPUT = new File("c:\\Test\\ExpectedOutput\\"); //expected output directory
protected static final File TEST_OUTPUT = new File("c:\\Test\\TestOutput\\");
@DataProvider(name = "tarJobs")
public Object[] getTarJobs() {
//1
String[] tarFiles = ACT_INPUT.list(new FilenameFilter()
{
public boolean accept(File dir, String name)
{
return name.endsWith(".tar");
}
});
//2
Object[] result = new Object[tarFiles.length];
int i = 0;
for (String filename : tarFiles) {
result[i] = filename;
i++;
}
return result;
}
@Test(dataProvider = "tarJobs")
public void testTar(String result) throws IOException {
System.out.println("Running test" + result);
JThreadFile.decompressTar(result, TEST_OUTPUT); // <-- this is the call that fails
Assert.assertEquals(TEST_OUTPUT, EXP_OUTPUT);
}
}
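Two details in this test class are worth checking, with the caveat that the following is a sketch rather than a drop-in fix: classic TestNG data providers are expected to return Object[][] (or Iterator<Object[]>), one inner array per test invocation, and File.list() returns bare file names without the directory, so decompressTar(String in, ...) cannot open them. Something like this data provider, assuming the same constants, addresses both:
@DataProvider(name = "tarJobs")
public Object[][] getTarJobs() {
    // listFiles() (unlike list()) keeps the directory in the result,
    // so the decompress methods can actually open the file
    File[] tarFiles = ACT_INPUT.listFiles(new FilenameFilter() {
        public boolean accept(File dir, String name) {
            return name.endsWith(".tar");
        }
    });
    // one inner array per test invocation, one element per test parameter
    Object[][] result = new Object[tarFiles.length][1];
    for (int i = 0; i < tarFiles.length; i++) {
        result[i][0] = tarFiles[i].getAbsolutePath();
    }
    return result;
}
Note also that Assert.assertEquals(TEST_OUTPUT, EXP_OUTPUT) compares the two File objects (effectively their paths), not the directory contents; verifying the extraction would need a walk over both trees comparing files.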
I’m currently using PDFBox to read the text of a set of pdfs that I’ve inherited.
I’m only interested in reading the text, not making any changes to the file.
The code that works for most of the files is:
File pdfFile = myPath.toFile();
PDDocument document = PDDocument.load(pdfFile);
Writer sw = new StringWriter();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage( 1 );
stripper.writeText( document, sw );
String documentText = sw.toString();
For most files, I wind up with the text in the documentText field.
But, for 3 of the 24 files, the documentText content for the first file is “\r\n”, for the second “\r\n\r\n”, and for the third “\r\n\r\n\r\n”. The three files are not consecutive; multiple good files sit between each of them.
The File is derived from a java.nio.Path. The WindowsFileAttribute that is part of the Path has a size of 279K, so the file is not empty on disk.
I can open the file and view the data, and it looks like the other files that my code reads.
I’m using Java 8.0.121, and PDFBox 2.0.4. (this is the latest version, I believe.)
Any suggestions? Is there a better way to read the text? (I’m not interested in the formatting, or fonts used, just the text.)
Thanks.
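One common cause of this symptom (not the only possible one) is that the affected files are scanned: each page is an image with no embedded text layer, so PDFTextStripper emits only line separators. A quick per-page diagnostic along these lines can confirm it:
// Sketch: report how much text each page actually yields. If every page
// prints ~0 characters, the pages are likely images with no text layer,
// and OCR would be needed to get at the text.
try (PDDocument document = PDDocument.load(myPath.toFile())) {
    PDFTextStripper stripper = new PDFTextStripper();
    for (int page = 1; page <= document.getNumberOfPages(); page++) {
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        String pageText = stripper.getText(document);
        System.out.printf("page %d: %d characters%n", page, pageText.trim().length());
    }
}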
Reading multiple PDF docs using pdfbox in java
package readwordfile;
import java.io.BufferedReader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
/**
* This is an example of how to extract words from a PDF document
*
* @author saravanan
*/
public class GetWordsFromPDF extends PDFTextStripper {
static List<String> words = new ArrayList<String>();
public GetWordsFromPDF() throws IOException {
}
/**
* @param args
* @throws IOException If there is an error parsing the document.
*/
public static void main(String[] args) throws IOException {
String files;
// FileWriter fs = new FileWriter("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
// FileInputStream fstream1 = new FileInputStream("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
// DataInputStream in1 = new DataInputStream(fstream1);
// BufferedReader br1 = new BufferedReader(new InputStreamReader(in1));
String path = "C:\\Users\\saravanan\\Desktop\\New folder\\"; //local folder path name
File folder = new File(path);
File[] listOfFiles = folder.listFiles();
for (int i = 0; i < listOfFiles.length; i++) {
if (listOfFiles[i].isFile()) {
files = listOfFiles[i].getName();
if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
String nfiles = "C:\\Users\\saravanan\\Desktop\\New folder\\";
String fileName1 = nfiles + files;
System.out.print("\n\n" + files+"\n");
PDDocument document = null;
try {
document = PDDocument.load(new File(fileName1));
PDFTextStripper stripper = new GetWordsFromPDF();
stripper.setSortByPosition(true);
stripper.setStartPage(0);
stripper.setEndPage(document.getNumberOfPages());
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
int x = 0;
System.out.println("");
for (String word : words) {
if (word.startsWith("xxxxxx")) { //here you can give your pdf doc starting word
x = 1;
}
if (x == 1) {
if (!(word.endsWith("YYYYYY"))) { //here you can give your pdf doc ending word
System.out.print(word + " ");
// fs.write(word);
} else {
x = 0;
break;
}
}
}
} finally {
if (document != null) {
document.close();
words.clear();
}
}
}
}
}
}
/**
* Override the default functionality of PDFTextStripper.writeString()
*
* @param str
* @param textPositions
* @throws java.io.IOException
*/
@Override
protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
String[] wordsInStream = str.split(getWordSeparator());
if (wordsInStream != null) {
for (String word : wordsInStream) {
words.add(word); //store the pdf content into the List
}
}
}
}
I have a requirement wherein the MapReduce code should read the local file system on each node. The program will be running on HDFS and I cannot change the FileSystem property for Hadoop in the xml configuration files.
I have tried the following solutions, but none gave me results.
Approach 1
Configuration config = new Configuration();
config.set("fs.defaultFS", "file:///"); // the setting belongs on the Configuration, before FileSystem.get()
FileSystem localFileSystem = FileSystem.get(config);
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(new Path("/user/input/localFile"))));
Approach 2
Configuration config = new Configuration();
LocalFileSystem localFileSystem = FileSystem.getLocal(config);
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(new Path("/user/input/localFile"))));
Approach 3
Configuration config = new Configuration();
config.set("fs.defaultFS", "file:///");
LocalFileSystem localFileSystem = FileSystem.getLocal(config);
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(new Path("/user/input/localFile"))));
Approach 4
Configuration config = new Configuration();
LocalFileSystem localFileSystem = FileSystem.getLocal(config);
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.getRaw().open(new Path("/user/input/localFile"))));
This did not work either: Reading HDFS and local files in Java
Each of them gave the error: No such file exists
Error Stack
attempt_201406050021_0018_m_000000_2: java.io.FileNotFoundException: File /home/cloudera/sftp/id_rsa does not exist
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:468)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:380)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:231)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:183)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.LocalFileSystem.copyFromLocalFile(LocalFileSystem.java:81)
attempt_201406050021_0018_m_000000_2: at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1934)
attempt_201406050021_0018_m_000000_2: at com.skanda.ecomm.sftp.FTPMapper.configure(FTPMapper.java:91)
I am hoping to get a positive solution here. Let me know where I am going wrong.
Main class (Driver class)
/*
* SFTPClient.java, May 20, 2014
*
*
*/
package com.skanda.ecomm.sftp;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
*
* <p>
* SFTPClient Class
* </p>
*
* @author skanda
* @version 1.0
*
*/
public class SFTPClient extends Configured implements Tool {
public int run(String[] args) throws Exception {
Configuration config = getConf();
String inputPath = config.get(ApplicationConstants.INPUT_PATH);
String outputPath = config.get(ApplicationConstants.OUTPUT_PATH);
String configPath = config.get(ApplicationConstants.CONFIG_PATH);
int reducers = Integer.parseInt(config.get(ApplicationConstants.REDUCERS));
if(outputPath == null || inputPath == null || configPath == null) {
throw new Exception("Usage: \n" + "-D configPath=<configPath> -D inputPath=<inputPath> -D reducers=<reducers" +
"-D outputPath=<path>");
}
JobConf conf = new JobConf(SFTPClient.class);
conf.setJobName("SFTP Injection client");
DistributedCache.addCacheFile(new URI(configPath),conf);
conf.setMapperClass(FTPMapper.class);
conf.setReducerClass(FTPReducer.class);
conf.setMapOutputKeyClass(IntWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(IntWritable.class);
// configuration should contain reference to your namenode
FileSystem fs = FileSystem.get(new Configuration());
fs.delete(new Path(outputPath), true); // true stands for recursively, deleting the folder you gave
conf.setStrings(ApplicationConstants.INPUT_PATH, inputPath);
conf.setStrings(ApplicationConstants.OUTPUT_PATH, outputPath);
FileInputFormat.setInputPaths(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
conf.setNumReduceTasks(reducers);
conf.setInt(ApplicationConstants.NUNBER_OF_REDUCERS, reducers);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SFTPClient(), args);
System.exit(exitCode);
}
}
Mapper
/*
* FTPMapper.java, May 20, 2014
*
*
*/
package com.skanda.ecomm.sftp;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import com.ftp.mapreduce.CommonUtility;
import com.ftp.mapreduce.RetrieveFileNames;
import com.jcraft.jsch.Channel;
/**
*
* <p>
* FTP Mapper Class
* </p>
*
* @author skanda
* @version 1.0
*
*/
#SuppressWarnings("unused")
public class FTPMapper extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
private URI[] localFiles;
private String userName;
private String hostName;
private String folderPath;
private int reducers;
private byte[] pvtKey;
private String fileName;
private String startDate;
private String endDate;
private String sshKeyPath;
private String password;
public void configure(JobConf job) {
Properties properties = new Properties();
try {
localFiles = DistributedCache.getCacheFiles(job);
if (localFiles != null && localFiles.length == 1) {
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(localFiles[0], conf);
BufferedReader bufferRedaer=new BufferedReader(new InputStreamReader(fileSystem.open(new Path(localFiles[0]))));
properties.load(bufferRedaer);
userName = properties.getProperty(ApplicationConstants.USER_NAME);
reducers = job.getInt(ApplicationConstants.NUNBER_OF_REDUCERS, 30);
hostName = properties.getProperty(ApplicationConstants.SFTP_SERVER_HOST);
folderPath = properties.getProperty(ApplicationConstants.HOSTFILE_DIRECTORY_PATH);
fileName = properties.getProperty(ApplicationConstants.FILE_NAME_PATTERN);
startDate = properties.getProperty(ApplicationConstants.FILE_START_DATE);
endDate = properties.getProperty(ApplicationConstants.FILE_END_DATE);
sshKeyPath = properties.getProperty(ApplicationConstants.SSH_KEY_PATH);
password = properties.getProperty(ApplicationConstants.PASSWORD);
System.out.println("--------------------------------------------------");
/*FileSystem fs = FileSystem.getLocal(conf);
//Path inputPath = fs.makeQualified(new Path(sshKeyPath));
String inputPath = new Path("file:///home/cloudera/"+sshKeyPath).toUri().getPath();
fs.copyFromLocalFile(new Path(inputPath), new Path("outputSFTP/idFile") );*/
try{
Configuration conf1 = new Configuration();
Path pt = new Path("file:///home/cloudera/.ssh/id_rsa");
FileSystem fs = FileSystem.get( new URI("file:///home/cloudera/.ssh/id_rsa"), conf);
LocalFileSystem localFileSystem = fs.getLocal(conf1);
BufferedReader bufferRedaer1 = new BufferedReader(new InputStreamReader(localFileSystem.open(pt)));
String str = null;
while ((str = bufferRedaer1.readLine())!= null)
{
System.out.println("-----------");
System.out.println(str);
}
}catch(Exception e){
System.out.println("failed again");
String computername=InetAddress.getLocalHost().getHostName();
System.out.println(computername);
e.printStackTrace();
}
System.out.println("--------------------------------------------------");
Configuration config = new Configuration();
config.set("fs.defaultFS", "file:////");
LocalFileSystem localFileSystem = FileSystem.getLocal(config);
bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(new Path(sshKeyPath))));
/*Configuration config = new Configuration();
//config.set("fs.defaultFS", "file:///home/cloudera/.ssh/id_rsa");
LocalFileSystem fileSystm = FileSystem.getLocal(config);
Path path = fileSystm.makeQualified(new Path("/home/cloudera/.ssh/id_rsa"));*/
//FileInputFormat.setInputPaths(job, path);
//bufferRedaer = new BufferedReader(new InputStreamReader(fileSystem.open(path)));
String key = "";
try {
String line = "";
while ((line = bufferRedaer.readLine()) != null) {
key += line + "\n";
}
pvtKey = key.getBytes();
} catch(Exception e){
e.printStackTrace();
} finally {
//fileSystem.close();
//bufferRedaer.close();
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter)
throws IOException {
List<String> filterFileNamesList = new ArrayList<String>();
Channel channel = CommonUtility.connectSFTP(userName, hostName, pvtKey);
Map<String, String> fileNamesMap = CommonUtility.getFileNames(channel, folderPath);
List<String> filterFileNameList_output = RetrieveFileNames.FILTER_BY_NAME.retrieveFileNames(fileNamesMap, filterFileNamesList,
fileName, startDate, endDate);
for (int i = 0; i < filterFileNameList_output.size(); i++) {
int keyGroup = i % reducers;
output.collect(new IntWritable(keyGroup), new Text(filterFileNameList_output.get(i)));
}
}
}
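The stack trace above is telling: copyFromLocalFile fails because /home/cloudera/sftp/id_rsa does not exist on the node running that task attempt. All four approaches resolve the local file system correctly, but they can only find the file if it is actually present on every worker node. Since configure() runs inside the task JVM on the node itself, plain java.io is the simplest way to read a node-local file; a sketch, assuming the key really is provisioned at the same path on every node (it needs java.io.File and java.io.FileReader):
// Sketch: read the node-local key file with plain java.io instead of the
// Hadoop FileSystem API. This still requires the file to exist at this
// path on every node that can run the task.
File keyFile = new File(sshKeyPath); // e.g. /home/cloudera/.ssh/id_rsa
StringBuilder keyText = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new FileReader(keyFile))) {
    String line;
    while ((line = reader.readLine()) != null) {
        keyText.append(line).append('\n');
    }
}
pvtKey = keyText.toString().getBytes();
The alternative that avoids per-node provisioning entirely is to ship the key through the DistributedCache, exactly as the properties file already travels in this job.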
This code works for me when the program runs on HDFS and my txt file is at this location:
/home/Rishi/Documents/RishiFile/r.txt
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
public class HadoopRead {
public static void main(String[] args) {
try{
Configuration conf = new Configuration();
Path pt = new Path("/home/Rishi/Documents/RishiFile/r.txt");
FileSystem fs = FileSystem.get( new URI("/home/Rishi/Documents/RishiFile"), conf);
LocalFileSystem localFileSystem = fs.getLocal(conf);
BufferedReader bufferRedaer = new BufferedReader(new InputStreamReader(localFileSystem.open(pt)));
String str = null;
while ((str = bufferRedaer.readLine())!= null)
{
System.out.println("-----------");
System.out.println(str);
}
}catch(Exception e){
e.printStackTrace();
}
}
}
Word count example for reading a local file on HDFS.
my main class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class FileDriver extends Configured implements Tool {
public static void main(String[] args) {
try{
ToolRunner.run(new Configuration(), new FileDriver(), args);
System.exit(0);
}catch(Exception e){
e.printStackTrace();
}
}
public int run(String[] arg0) throws Exception {
Configuration conf = new Configuration();
Path pt = new Path("file:///home/winoria/Documents/Ri/r");
Job job = new Job(conf, "new Job");
job.setJarByClass(FileDriver.class);
job.setMapperClass(FileMapper.class);
job.setReducerClass(FileReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, pt);
FileSystem.get(job.getConfiguration()).delete(new Path("Output2"), true);
FileOutputFormat.setOutputPath(job, new Path("Output2"));
job.waitForCompletion(true);
return 0;
}
}
Mapper class:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class FileMapper extends Mapper<LongWritable, Text, Text, Text> {
protected void map(LongWritable key, Text value,Context context) throws java.io.IOException ,InterruptedException {
String str[] = value.toString().split(" ");
for(int i =0; i<str.length;i++){
context.write(new Text(str[i]), new Text());
}
};
}
Reducer class:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class FileReducer extends Reducer<Text, Text, Text, Text> {
protected void reduce(Text key,Iterable<Text> value,Context context) throws java.io.IOException ,InterruptedException {
int count=0;
for (Text text : value) {
count++;
}
context.write(key, new Text(count+""));
};
}
Hello everyone,
I am trying to use Apache Lucene in my log parser project. I have used the code below, but I am not able to import org.apache.lucene.analysis.SimpleAnalyzer.
I have included the Lucene 4.1 jar.
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package logsearchengine;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
/**
*
* @author maclean
*/
public class LogSearchEngine {
private int index(File indexDir, File dataDir, String suffix) throws Exception {
IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), new SimpleAnalyzer(),true,IndexWriter.MaxFieldLength.LIMITED);
indexWriter.setUseCompoundFile(false);
indexDirectory(indexWriter, dataDir, suffix);
int numIndexed = indexWriter.maxDoc();
indexWriter.optimize();
indexWriter.close();
return numIndexed;
}
private void indexDirectory(IndexWriter indexWriter, File dataDir,
String suffix) throws IOException {
File[] files = dataDir.listFiles();
for (int i = 0; i < files.length; i++) {
File f = files[i];
if (f.isDirectory()) {
indexDirectory(indexWriter, f, suffix);
}
else {
indexFileWithIndexWriter(indexWriter, f, suffix);
}
}
}
private void indexFileWithIndexWriter(IndexWriter indexWriter, File f,
String suffix) throws IOException {
if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
return;
}
if (suffix!=null && !f.getName().endsWith(suffix)) {
return;
}
System.out.println("Indexing file " + f.getCanonicalPath());
Document doc = new Document();
doc.add(new Field("contents", new FileReader(f)));
doc.add(new Field("filename", f.getCanonicalPath(),
Field.Store.YES, Field.Index.ANALYZED));
indexWriter.addDocument(doc);
}
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws Exception {
// TODO code application logic here
File indexDir = new File("C:/index/");
File dataDir = new File("C:/programs/eclipse/workspace/");
String suffix = "java";
LogSearchEngine indexer = new LogSearchEngine();
int numIndex = indexer.index(indexDir, dataDir, suffix);
System.out.println("Total files indexed " + numIndex);
}
}
I added only one jar file, lucene-core-4.1.0.jar. I am not able to import SimpleAnalyzer, and as a result I cannot create the IndexWriter. What am I missing? I am new to Lucene.
You also need lucene-analyzers-common-4.1.0.jar.
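Beyond the extra jar, note that in Lucene 4.x SimpleAnalyzer moved to org.apache.lucene.analysis.core, and the 3.x IndexWriter constructor used above (with MaxFieldLength) as well as optimize() no longer exist; setUseCompoundFile likewise moved to the merge policy. A sketch of the 4.1-style setup for the index() method, assuming both jars are on the classpath:
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// IndexWriter is configured through IndexWriterConfig in 4.x
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41,
        new SimpleAnalyzer(Version.LUCENE_41));
IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), config);
indexDirectory(indexWriter, dataDir, suffix);
int numIndexed = indexWriter.maxDoc();
indexWriter.forceMerge(1); // optimize() was removed; forceMerge(1) is the closest equivalent
indexWriter.close();
The Field(String, Reader) constructor is also gone in 4.x; new TextField("contents", new FileReader(f)) is the usual replacement for indexing a reader's contents.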
How can I attach a file to an existing PDF, including a link to the attachment on the main page?
I am using iText and so far have managed to attach at the document level only.
Use the code below to create the attachments. Please go through it, understand it, and remove the parts you don't need. There is a RESOURCE variable which points to all the PDFs or files you want to attach; in this case it is public static final String RESOURCE = "chapter16/%s.pdf", where the chapter16 folder holds the PDFs you want to attach and %s is replaced with the name of the file you provide (the format is .pdf if you are attaching a PDF file). It works and is tested on my system. I referred to this code from here.
// I create the list which holds the names of the files I want to attach
ArrayList<String> strList = new ArrayList<String>();
strList.add("Strings"); // it's the same file I'm attaching four times
strList.add("Strings"); // where "Strings" is the name of the file
strList.add("Strings");
strList.add("Strings");
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.sql.SQLException;
import java.util.ArrayList;
import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.List;
import com.itextpdf.text.ListItem;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfAnnotation;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
/**
* Creates a PDF listing of attachments
* The attachments can be extracted.
*/
public class AttachFiles {
/** Path to the resources. */
public static final String RESOURCE = "chapter16/%s.pdf";
/** The filename of the resulting PDF. */
public static final String FILENAME = "ResultingFile.pdf";
/** The path to the resulting PDFs */
public static final String PATH = "chapter16/%s";
/** The filename of the PDF */
public static final String RESULT = String.format(PATH, FILENAME);
/**
* Creates a PDF listing
* @throws IOException
* @throws DocumentException
* @throws SQLException
*/
public static void main(String[] args) throws IOException, DocumentException, SQLException {
AttachFiles attachFiles = new AttachFiles();
FileOutputStream os = new FileOutputStream(RESULT);
os.write(attachFiles.createPdf());
os.flush();
os.close();
attachFiles.extractAttachments(RESULT);
}
/**
* Extracts attachments from an existing PDF.
* @param src the path to the existing PDF
* @throws IOException
*/
public void extractAttachments(String src) throws IOException {
PdfReader reader = new PdfReader(src);
PdfArray array;
PdfDictionary annot;
PdfDictionary fs;
PdfDictionary refs;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
array = reader.getPageN(i).getAsArray(PdfName.ANNOTS);
if (array == null) continue;
for (int j = 0; j < array.size(); j++) {
annot = array.getAsDict(j);
if (PdfName.FILEATTACHMENT.equals(annot.getAsName(PdfName.SUBTYPE))) {
fs = annot.getAsDict(PdfName.FS);
refs = fs.getAsDict(PdfName.EF);
for (PdfName name : refs.getKeys()) {
FileOutputStream fos
= new FileOutputStream(String.format(PATH, fs.getAsString(name).toString()));
fos.write(PdfReader.getStreamBytes((PRStream)refs.getAsStream(name)));
fos.flush();
fos.close();
}
}
}
}
}
/**
* Creates the PDF.
* @return the bytes of a PDF file.
* @throws DocumentException
* @throws IOException
* @throws SQLException
*/
public byte[] createPdf() throws DocumentException, IOException, SQLException {
// step 1
Document document = new Document();
// step 2
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PdfWriter writer = PdfWriter.getInstance(document, baos);
// step 3
document.open();
// step 4
document.add(new Paragraph("This is a list pdf attachments."));
PdfAnnotation annot;
ListItem item;
Chunk chunk;
List list = new List();
// I create the list which holds the names of the files I want to attach
ArrayList<String> strList = new ArrayList<String>();
strList.add("Strings"); // it's the same file I'm attaching four times
strList.add("Strings");
strList.add("Strings");
strList.add("Strings");
for (String strWord : strList) {
annot = PdfAnnotation.createFileAttachment(
writer, null, "Name", null,
String.format(RESOURCE, strWord), String.format("%s.pdf", strWord));
item = new ListItem("Name");
item.add("\u00a0\u00a0");
chunk = new Chunk("\u00a0\u00a0\u00a0\u00a0");
chunk.setAnnotation(annot);
item.add(chunk);
list.add(item);
}
document.add(list);
// step 5
document.close();
return baos.toByteArray();
}
}
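For the actual question, attaching to an existing PDF rather than a newly created one: the same kind of file-attachment annotation can be added with PdfStamper, iText's tool for modifying existing documents. A sketch, where the file names and the icon rectangle are placeholders (it additionally needs com.itextpdf.text.Rectangle and com.itextpdf.text.pdf.PdfStamper imports):
// Sketch: add a file attachment, with a visible annotation on page 1,
// to an existing PDF. "existing.pdf", "stamped.pdf" and "attachment.pdf"
// are placeholder names.
PdfReader reader = new PdfReader("existing.pdf");
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("stamped.pdf"));
PdfAnnotation annot = PdfAnnotation.createFileAttachment(
        stamper.getWriter(),
        new Rectangle(36, 760, 56, 780), // where the attachment icon sits on the page
        "Attached file", null,
        "attachment.pdf", "attachment.pdf");
stamper.addAnnotation(annot, 1); // 1 = the first ("main") page
stamper.close();
reader.close();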