Hadoop MapReduce whole-file input format - Java

I am trying to use Hadoop MapReduce, but instead of mapping each line at a time in my Mapper, I would like to map a whole file at once.
So I have found these two classes
(https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/?r=3)
that are supposed to help me do this.
But I get a compilation error that says:
The method setInputFormat(Class<? extends InputFormat>) in the type
JobConf is not applicable for the arguments
(Class<WholeFileInputFormat>) Driver.java /ex2/src line 33 Java
Problem
I changed my Driver class to this:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import forma.WholeFileInputFormat;
/*
* Driver
* The Driver class is responsible for creating the job and submitting it.
*/
public class Driver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Driver.class);
        conf.setJobName("Get minimum for each month");
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        // previously it was:
        // conf.setInputFormat(TextInputFormat.class);
        // and it was changed to:
        conf.setInputFormat(WholeFileInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));
        System.out.println("Starting Job...");
        JobClient.runJob(conf);
        System.out.println("Job Done!");
    }
}
What am I doing wrong?

Make sure your WholeFileInputFormat class has the correct imports. You are using the old MapReduce API in your job driver, and I think you imported the new-API FileInputFormat in your WholeFileInputFormat class. If I am right, you should import org.apache.hadoop.mapred.FileInputFormat in your WholeFileInputFormat class instead of org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
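For reference, here is a minimal sketch of what an old-API (mapred) whole-file input format looks like, modeled on the well-known whole-file example pattern rather than the exact classes in the linked repository; the BytesWritable value type and the package name (taken from the question's import) are assumptions:

package forma;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat; // old API, matches JobConf
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split: one whole file per mapper
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf job;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
        this.split = split;
        this.job = job;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) {
            return false;
        }
        // Read the entire file into the value in one shot.
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() throws IOException { }
}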
Hope this helps.

The easiest way to do this is to gzip your input file. This makes FileInputFormat.isSplitable() return false, so each file is processed by a single mapper.

We ran into something similar and used an alternative out-of-the-box approach.
Say you need to process 100 large files (f1, f2, ..., f100) such that each one must be read wholly in the map function. Instead of using the "WholeInputFileFormat" reader approach, we created 10 equivalent text files (p1, p2, ..., p10), each containing the HDFS URLs or web URLs of the f1-f100 files.
Thus p1 contains URLs for f1-f10, p2 contains URLs for f11-f20, and so on.
These new files p1 through p10 are then used as input to the mappers. Thus the mapper m1 processing file p1 opens files f1 through f10 one at a time and processes each one wholly, as in the sketch below.
This approach allowed us to control the number of mappers and write more exhaustive and complex application logic in the MapReduce application. For example, we could run NLP on PDF files using this approach.
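A minimal sketch of such a mapper, assuming the old (mapred) API used elsewhere on this page and assuming each input line is an HDFS path (the class and names here are hypothetical, not the poster's actual code):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UrlListMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private JobConf job;

    @Override
    public void configure(JobConf job) {
        this.job = job; // keep the conf so we can open HDFS files later
    }

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        Path file = new Path(value.toString()); // one file URL per input line
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream in = fs.open(file);
        try {
            // read and process the whole file here,
            // emitting results via output.collect(...)
        } finally {
            in.close();
        }
    }
}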

Related

A very simple case does not write a PDF file using iText on IBM i

I'm trying to use iText (5.5.13) on IBM i (AKA iSeries, Power, long ago AS/400). It can be done by embedding Java code into RPG ILE procedures, or by executing plain Java. We have used Apache POI for Excel for a while, and it works well. We are testing iText now, but some issues persist.
Given that, I'm trying to test iText in plain Java on IBM i. I prepared a very simple example, taken from listing 1.1 of "iText in Action", and ran it. It seems to work well, but nothing is generated: no PDF file results, and no error appears while running.
Am I forgetting something? Are there other aspects to take into account?
here is the code:
package QOpenSys.CONSUM.Testjeu;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
public class test1 {
    public static final String filePdf = "/QOpenSys/MyFolder/Testjeu/PdfRead1.pdf";

    public static void main(String[] args)
            throws DocumentException, IOException {
        ///QOpenSys/MyFolder/Test/WrkBookRead1.pdf
        //pdfDocument = new_DocumentVoid()
        Document pdfDocument = new Document();
        //pdfWriter = get_PdfWriter( pdfDocument: pdfFilePath );
        PdfWriter.getInstance(pdfDocument, new FileOutputStream(filePdf));
        //jItxDocumentOpen( pdfDocument );
        pdfDocument.open();
        //pdfParagraph = new_PdfParagraphStr( PhraseString );
        Paragraph jItxParagraph = new Paragraph("Hola, pdf");
        //addToDocPg = jItxDocumentAddParagraph( pdfDocument: pdfParagraph );
        pdfDocument.add(jItxParagraph);
        //jItxDocumentClose( pdfDocument );
        pdfDocument.close();
    }
}
Solved. As said before, there was a first issue: the Java program seemed to run well because no errors or warnings were visible in QShell. That was misleading: errors were sent to an OUTQ and were available in a spool file. Once those were reviewed, it turned out to be a simple classpath issue. It took a full day to figure out what was failing in the classpath.
Now it works, and the PDF is created. I ran it in QShell, declaring environment variables for JAVA_HOME (three JVMs are run concurrently by several applications), for CLASSPATH, and a couple required for tracing. The classpath declares my class first and the iText classes second; the remaining classes come from the JRE. I have a full list of classes loaded by the class loader. I hope it will help to find what fails in our embedded RPG ILE call to iText.
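For future readers, a tiny check of my own devising (not part of the original program) that can be appended to test1.main after pdfDocument.close(); it makes success or failure visible even when stdout and errors end up in an OUTQ or spool file:

// Sanity check: confirm the PDF actually landed on disk.
File result = new File(filePdf);
System.out.println("PDF written: " + result.exists()
        + " (" + result.length() + " bytes at " + result.getAbsolutePath() + ")");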

JMeter Bean Shell Sampler error "...Static method get( java.lang.String ) not found in class'java.nio.file.Paths" when copying files

I am attempting to copy and rename a file on my local machine (Windows 7) using the BeanShell Sampler in JMeter 3.0 (Java 1.8). The idea is to create the new file with a unique name and save the name as a variable that can be used in place of the file name in an FTP PUT request.
Here is the code I am using for the copy and rename:
import java.text.*;
import java.nio.file.StandardCopyOption.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
String filename = new SimpleDateFormat("dd-MM-yyyy_hh:mm:ss").format(new Date())+".xlsx";
log.info(filename);
Path source = Paths.get("C:/dropfile/qatp/QATP_GuestRecords.xlsx");
Path target = Paths.get("C:/dropfile/qatp/"+filename);
Files.copy(source, target, REPLACE_EXISTING);
The error I am receiving in the log:
ERROR - jmeter.util.BeanShellInterpreter: Error invoking bsh method:
eval Sourced file: inline evaluation of: ``import java.text.*; import
java.nio.file.StandardCopyOption.*; import java.io.IO . . . '' : Typed
variable declaration : Error in method invocation: Static method get(
java.lang.String ) not found in class'java.nio.file.Paths'
I have been searching for an answer to this issue and came across a solution where the suggestion was:
"My guess is that the problem is that it's not populating the varargs parameter. Try:
Path target = Paths.get(filename, new String[0]);"
I tried this solution by modifying my code like so:
import java.text.*;
import java.nio.file.StandardCopyOption.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
String filename = new SimpleDateFormat("dd-MM-yyyy_hh:mm:ss").format(new Date())+".xlsx";
log.info(filename);
Path source = Paths.get("C:/dropfile/qatp/QATP_GuestRecords.xlsx", new String[0]);
Path target = Paths.get("C:/dropfile/qatp/"+filename, new String[0]);
Files.copy(source, target, REPLACE_EXISTING);
And received this error:
ERROR - jmeter.util.BeanShellInterpreter: Error invoking bsh method:
eval Sourced file: inline evaluation of: ``import java.text.*; import
java.nio.file.StandardCopyOption.*; import java.io.IO . . . '' : Typed
variable declaration : Method Invocation Paths.get
Does anyone know why I am hitting this error and how to get around it?
Even in plain old Java this is a misleading use of Paths.get, which takes a URI, or an array of strings (varargs). See the javadoc.
In Java, what you tried works because static typing allows the compiler to determine that you are passing an array of a single String. Apparently BeanShell does not, and gets confused. The trick suggested in the other answer is not a good one in my opinion: again, in Java it would work, by joining the two strings (the 2nd one is empty, so the result is the 1st string, which is what you want), but it confuses BeanShell all the same because there is another static get method that takes two arguments.
If you already have the path as a single String, try this instead:
Path source = new File("C:/dropfile/qatp/QATP_GuestRecords.xlsx").toPath();
Alternatively, you could use Paths.get like this:
Path source = Paths.get("C:", "dropfile", "qatp", "QATP_GuestRecords.xlsx");
Or like this (varargs is syntactic sugar to help pass an array):
Path source = Paths.get(new String [] { "C:/dropfile/qatp/QATP_GuestRecords.xlsx" });
It's perfectly valid to pass fragments of the path as arguments, or the entire path string as a single argument, but that seems to trip up BeanShell, so it is better to avoid Paths.get in BeanShell unless you pass an array explicitly, as in the last example.
Beanshell != Java; it doesn't support all Java features (think of it as roughly Java 1.5) and amend your code accordingly.
So I would recommend switching to the JSR223 Sampler and the Groovy language; Groovy is much more Java-compliant and performs much better.
Also be aware that you can use the FileUtils.copyFile() method, which will work in both Beanshell and Groovy:
import org.apache.commons.io.FileUtils;
import java.text.SimpleDateFormat;
String filename = new SimpleDateFormat("dd-MM-yyyy_hh:mm:ss").format(new Date()) + ".xlsx";
FileUtils.copyFile(new File("/tmp/QATP_GuestRecords.xlsx"), new File("/tmp/" + filename));
See the Groovy Is the New Black article for more information on using the Groovy language in JMeter test scripts.

Java print multiple copies but only one ends up at the printer

I am trying to print multiple copies of a PDF document. After googling around a bit I found that I have to put a Copies attribute in a PrintRequestAttributeSet. But after doing this, only 1 copy is printed instead of the amount I specified.
During debugging I can see that the print object changes its copies variable from 0 to 2, so I would assume I am doing everything correctly. I've also been playing around a bit with the collation and multiple-document-handling attributes, but the end result stays the same.
Does anyone know how I can get it to print the correct number of copies?
My code:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.print.Doc;
import javax.print.DocFlavor;
import javax.print.DocPrintJob;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;
import javax.print.SimpleDoc;
import javax.print.attribute.HashPrintRequestAttributeSet;
import javax.print.attribute.PrintRequestAttributeSet;
import javax.print.attribute.standard.Copies;
import javax.print.attribute.standard.MultipleDocumentHandling;
import javax.print.attribute.standard.SheetCollate;
public class PrintTest {
    public static void main(String[] args) throws Exception {
        InputStream is = new BufferedInputStream(
                new FileInputStream("<Insert pdf file here>"));
        DocFlavor flavor = DocFlavor.INPUT_STREAM.AUTOSENSE;

        Copies copies = new Copies(2);
        SheetCollate collate = SheetCollate.COLLATED;
        MultipleDocumentHandling handling = MultipleDocumentHandling.SEPARATE_DOCUMENTS_COLLATED_COPIES;

        PrintRequestAttributeSet pras = new HashPrintRequestAttributeSet();
        pras.add(copies);
        pras.add(collate);
        pras.add(handling);

        PrintService service = PrintServiceLookup.lookupDefaultPrintService();
        DocPrintJob printJob = service.createPrintJob();
        Doc doc = new SimpleDoc(is, flavor, null);
        printJob.print(doc, pras);
    }
}
So I've been playing around a bit more. I added a few sysout statements, through which I found out there is something called Fidelity, which can be used to force the print job to be rejected if it cannot be printed exactly as specified. But there are some issues with this. After adding the fidelity setting I end up with the following output:
[class javax.print.attribute.standard.JobName, class javax.print.attribute.standard.RequestingUserName, class javax.print.attribute.standard.Copies, class javax.print.attribute.standard.Destination, class javax.print.attribute.standard.OrientationRequested, class javax.print.attribute.standard.PageRanges, class javax.print.attribute.standard.Media, class javax.print.attribute.standard.MediaPrintableArea, class javax.print.attribute.standard.Fidelity, class javax.print.attribute.standard.SheetCollate, class sun.print.SunAlternateMedia, class javax.print.attribute.standard.Chromaticity, class javax.print.attribute.standard.Sides, class javax.print.attribute.standard.PrinterResolution]
[]
Exception in thread "main" sun.print.PrintJobAttributeException: unsupported attribute: collated
at sun.print.Win32PrintJob.getAttributeValues(Win32PrintJob.java:667)
at sun.print.Win32PrintJob.print(Win32PrintJob.java:332)
at net.pearlchain.print.distribute.jasper.PrintTest.main(PrintTest.java:52)
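For reference, here is a reconstruction of the fidelity experiment described above (a sketch of mine, not the original code); these lines would slot into PrintTest.main just before printJob.print(doc, pras), with java.util.Arrays and javax.print.attribute.standard.Fidelity added to the imports:

// Force the job to be rejected instead of silently downgraded.
pras.add(Fidelity.FIDELITY_TRUE);
// Which attribute categories does the printer claim to support?
System.out.println(Arrays.toString(service.getSupportedAttributeCategories()));
// Which of the requested attributes can it NOT honor? (null means all are supported)
System.out.println(service.getUnsupportedAttributes(flavor, pras));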
The unsupported attribute is different on each execution, but it is always one of the attributes I have set. I have tried running it on Java 6 and Java 7, and the only difference is the line on which the exception is thrown: on Java 6 it is line 667 and on Java 7 it is line 685. Looking at the code on grepcode I can see where the exception is thrown, but the actual reason is unclear.
OK, I've found out why this happens: the flavor I selected does not support multiple copies. Setting the flavor to PDF instead leads to a flavor-not-supported exception, because I have no printers installed that support printing from a PDF source.
It's been a long time and I forgot to post my solution here for future visitors.
I solved this by adding a 3rd-party PDF library (Apache PDFBox), which provided me with an input stream that I could send to the printer with all the settings I required.
http://pdfbox.apache.org/
I no longer have access to the code, but this could be useful for future visitors. :)
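Since the original code is gone, here is a hedged sketch of one way PDFBox can drive the printer so that Copies is honored. Note this uses PDFBox 2.x's PDFPageable rendering path rather than an input stream, and the file path is a placeholder:

import java.awt.print.PrinterJob;
import java.io.File;

import javax.print.attribute.HashPrintRequestAttributeSet;
import javax.print.attribute.PrintRequestAttributeSet;
import javax.print.attribute.standard.Copies;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.printing.PDFPageable;

public class PdfBoxPrintTest {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("<Insert pdf file here>"))) {
            PrinterJob job = PrinterJob.getPrinterJob();
            // The JVM renders each page itself, so attributes like Copies
            // no longer depend on the INPUT_STREAM flavor supporting them.
            job.setPageable(new PDFPageable(document));
            PrintRequestAttributeSet attrs = new HashPrintRequestAttributeSet();
            attrs.add(new Copies(2));
            job.print(attrs);
        }
    }
}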

How to use a .jar in a Pig file

I have two input files, smt.txt and smo.txt. The jar file reads the text files and splits the data according to a rule described in the Java file, and the Pig script stores these data into output files via MapReduce.
register 'maprfs:///user/username/fl.jar';
DEFINE FixedLoader fl();
mt = load 'maprfs:///user/username/smt.txt' using FixedLoader('-30','30-33',...........) AS (.........);
mo = load 'maprfs:///user/username/smo.txt*' using FixedLoader('-30','30-33',.....) AS (......);
store mt into 'maprfs:///user/username/mt_out' using JsonStorage();
store mo into 'maprfs:///user/username/mo_out' using JsonStorage();
and a part of java code like in the following. (The content of methods are not neccessary I believe):
package com.mapr.util;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
import org.apache.pig.*;
import org.apache.pig.data.*;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.*;
import java.util.*;
import java.io.*;
public class FixedLoader extends LoadFunc
{
............
}
When I run this Pig program in a terminal with the command "pig -x mapreduce sample.pig", I get an error message:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve com.mapr.util.FixedLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
How can I import these into my project, or are there any suggestions/solutions for running this program?
You need to DEFINE FixedLoader with its fully qualified class name:
register 'maprfs:///user/username/fl.jar';
DEFINE FixedLoader com.mapr.util.FixedLoader();
...
Also register all of the 3rd-party dependency jars that your custom UDF uses.
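For context, a skeletal LoadFunc in the shape FixedLoader must have. This is a sketch only: the method bodies are placeholders illustrating the contract, not the asker's actual fixed-width parsing logic.

package com.mapr.util;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class FixedLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat(); // or a custom fixed-width input format
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of input
            }
            Text line = (Text) reader.getCurrentValue();
            // Real fixed-width splitting of `line` would happen here.
            Tuple t = tupleFactory.newTuple(1);
            t.set(0, line.toString());
            return t;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}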

Cannot use the XmlInputFormat that extends TextInputFormat in Java

I am trying to do a WordCount using Hadoop. I want to use XmlInputFormat.class to split the file based on XML tags. The XmlInputFormat class is here.
XmlInputFormat extends TextInputFormat.
Job job = new Job(getConf());
job.setInputFormatClass(XmlInputFormat.class);
It shows the error:
The method setInputFormatClass(Class<? extends InputFormat>) in the type Job is not applicable for the arguments (Class<XmlInputFormat>)
But it's OK when I use
Job job = new Job(getConf());
job.setInputFormatClass(TextInputFormat.class);
Why can't we use the subclass? Or did I do something wrong?
That looks like an issue with your Hadoop version. Did you check that the XMLInputFormat class you are using is actually the right one for your Hadoop version?
I think the Hadoop tutorial that uses the mapred library is outdated; take a look at:
http://wiki.apache.org/hadoop/WordCount
I could successfully run XmlInputFormat after a slight modification of the code above.
Please ignore this answer. I think the cause is that I was using a deprecated version of MapReduce that uses mapred.*.
I had the same problem, and it was resolved when I modified one of the imports:
From:
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
To:
import org.apache.hadoop.mapred.TextInputFormat;
Maybe you are importing the wrong XmlInputFormat class in your code. The same happened to me with TextInputFormat: I was using the wrong import of the class, which Eclipse automatically pulled in. The correct class to import was:
org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
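For completeness, a minimal new-API driver sketch for a Mahout-style XmlInputFormat. The xmlinput.start/xmlinput.end keys and the <page> tags are assumptions based on that class, not details from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Tell the input format which tag delimits a record.
conf.set("xmlinput.start", "<page>");
conf.set("xmlinput.end", "</page>");
Job job = new Job(conf);
// New-API class: works with Job.setInputFormatClass, unlike the mapred one.
job.setInputFormatClass(XmlInputFormat.class);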
