Java UDF Date Regex Extractor for Pig? - java

I am trying to create a UDF for importing into Pig that matches a regex pattern on a date. The regex has been tested and works as expected, but I am having trouble with the following code:
package com.date.format;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class DATERANGE extends EvalFunc<String> {
    @Override
    public String exec(Tuple arg0) throws IOException {
        try {
            String pattern = "(Oct\\W(?:1[5-9]|2[0-3])\\W(?:(?:0?9|10):\\d{2}:\\d{2}|11:00:00))";
            Pattern pat = Pattern.compile(pattern);
            Matcher match = pat.matcher((String) arg0.get(0));
            if (match.find()) {
                return match.group(0);
            } else {
                return "none";
            }
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
After compiling the above Java code, exporting it as a jar, and running it inside Hadoop using the following Pig script:
register 'DATEFormat.jar';
ld = LOAD 'dates/date_data_three' AS (date:chararray);
loop = foreach ld generate com.date.format.DATERANGE(date) as d:chararray;
dump loop;
I get the following error:
ERROR 2078: Caught error from UDF: com.date.format.DATERANGE [Caught exception
processing input row ]
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator
for alias loop
at org.apache.pig.PigServer.openIterator(PigServer.java:912)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:752)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.loadScript(GruntParser.java:566)
at org.apache.pig.tools.grunt.GruntParser.processScript(GruntParser.java:513)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Script(PigScriptParser.java:1014)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:550)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:228)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:542)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias loop
at org.apache.pig.PigServer.storeEx(PigServer.java:1015)
at org.apache.pig.PigServer.store(PigServer.java:974)
at org.apache.pig.PigServer.openIterator(PigServer.java:887)
... 16 more
The data file contains dates as shown below:
Wed Oct 15 09:26:09 BST 2014
Wed Oct 15 19:26:09 BST 2014
Wed Oct 18 08:26:09 BST 2014
Wed Oct 23 10:26:09 BST 2014
Sun Oct 05 09:26:09 BST 2014
Wed Nov 20 19:26:09 BST 2014
Does anybody know the correct way to implement a Java UDF for Pig that would work with the Regex I have provided?
Thanks

I recommend using the built-in REGEX_EXTRACT function; it is much simpler than writing a UDF.
ld = LOAD 'input.txt' AS (date:chararray);
loop = foreach ld generate REGEX_EXTRACT(date,'(Oct\\W(?:1[5-9]|2[0-3])\\W(?:(?:0?9|10):\\d{2}:\\d{2}|11:00:00))',1) as d:chararray;
C = FILTER loop by d is not null;
D = FOREACH C GENERATE $0;
DUMP D;
Output:
(Oct 15 09:26:09)
(Oct 23 10:26:09)
Your regex UDF also works fine for me. I just copied your input and Java code and executed it locally, and it works perfectly. Please see the output below that I got from your UDF code. I suspect you need to check whether your classpath is set properly.
(Oct 15 09:26:09)
(none)
(none)
(Oct 23 10:26:09)
(none)
(none)
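For anyone who wants to sanity-check the UDF outside Pig, here is a minimal local test sketch (the class name and inputs are illustrative, and it assumes the Pig jar is on the classpath); it wraps each line in a single-field Tuple the same way Pig would:
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class DateRangeLocalTest {
    public static void main(String[] args) throws Exception {
        com.date.format.DATERANGE udf = new com.date.format.DATERANGE();
        String[] lines = {
            "Wed Oct 15 09:26:09 BST 2014",
            "Wed Nov 20 19:26:09 BST 2014"
        };
        for (String line : lines) {
            // Wrap each input line in a one-field tuple, as Pig does for each record
            Tuple t = TupleFactory.getInstance().newTuple(1);
            t.set(0, line);
            System.out.println(udf.exec(t)); // prints the matched substring or "none"
        }
    }
}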

Even better, you could use ToDate. Load your data into filtered_raw_financings_csvs with close_date as a chararray:
financings_csvs = FOREACH filtered_raw_financings_csvs
GENERATE name,
city,
state,
(close_date==''?NULL:ToDate(close_date, 'dd-MMM-yy')) AS close_date
;
Build your date format string as described here:
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
This snippet is shown in context here:
http://nathan.vertile.com/blog/2015/04/17/handling-dates-in-hadoop-pig/
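As a quick way to verify such a format string before handing it to ToDate, here is a small sketch (the sample value is illustrative, not taken from the post above):
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class CloseDateFormatCheck {
    public static void main(String[] args) throws Exception {
        // Same pattern as the ToDate call above: day, abbreviated month, two-digit year
        SimpleDateFormat fmt = new SimpleDateFormat("dd-MMM-yy", Locale.ENGLISH);
        Date parsed = fmt.parse("15-Oct-14");
        System.out.println(parsed);
    }
}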

Related

Is there any way to ignore errors of the google-closure JSCompiler?

I tried to compress the Inputmask.js file. Here is my code:
import java.io.FileWriter;
import java.util.logging.Level;

import com.google.javascript.jscomp.CheckLevel;
import com.google.javascript.jscomp.CompilationLevel;
import com.google.javascript.jscomp.CompilerOptions;
import com.google.javascript.jscomp.JSError;
import com.google.javascript.jscomp.SourceFile;
import com.google.javascript.jscomp.WarningLevel;

public class JSFileMinifyTest {
    public static void main(String[] args) throws Exception {
        String sourceFileName = "D:\\temp\\jquery.inputmask.bundle.js";
        String outputFilename = "D:\\temp\\combined.min.js";
        com.google.javascript.jscomp.Compiler.setLoggingLevel(Level.INFO);
        com.google.javascript.jscomp.Compiler compiler = new com.google.javascript.jscomp.Compiler();
        CompilerOptions options = new CompilerOptions();
        CompilationLevel.WHITESPACE_ONLY.setOptionsForCompilationLevel(options);
        options.setAggressiveVarCheck(CheckLevel.OFF);
        options.setRuntimeTypeCheck(false);
        options.setCheckRequires(CheckLevel.OFF);
        options.setCheckProvides(CheckLevel.OFF);
        options.setReserveRawExports(false);
        WarningLevel.VERBOSE.setOptionsForWarningLevel(options);

        // To get the complete set of externs, the logic in
        // CompilerRunner.getDefaultExterns() should be used here.
        SourceFile extern = SourceFile.fromCode("externs.js", "function alert(x) {}");
        SourceFile jsFile = SourceFile.fromFile(sourceFileName);
        compiler.compile(extern, jsFile, options);

        for (JSError message : compiler.getWarnings()) {
            System.err.println("Warning message: " + message.toString());
        }
        for (JSError message : compiler.getErrors()) {
            System.err.println("Error message: " + message.toString());
        }

        FileWriter outputFile = new FileWriter(outputFilename);
        outputFile.write(compiler.toSource());
        outputFile.close();
    }
}
But an error occurred while compressing this JS file.
Apr 09, 2018 11:29:18 AM com.google.javascript.jscomp.parsing.ParserRunner parse
INFO: Error parsing D:\temp\jquery.inputmask.bundle.js: Compilation produced 3 syntax errors. (D:\temp\jquery.inputmask.bundle.js#1)
Apr 09, 2018 11:29:18 AM com.google.javascript.jscomp.LoggerErrorManager println
SEVERE: D:\temp\jquery.inputmask.bundle.js:1002: ERROR - Parse error. identifier is a reserved word
static || null !== test.fn && void 0 !== testPos.input ? static && null !== test.fn && void 0 !== testPos.input && (static = !1,
^
Apr 09, 2018 11:29:18 AM com.google.javascript.jscomp.LoggerErrorManager println
SEVERE: D:\temp\jquery.inputmask.bundle.js:1003: ERROR - Parse error. syntax error
maskTemplate += "</span>") : (static = !0, maskTemplate += "<span class='im-static''>");
^
Apr 09, 2018 11:29:18 AM com.google.javascript.jscomp.LoggerErrorManager println
SEVERE: D:\temp\jquery.inputmask.bundle.js:1010: ERROR - Parse error. missing variable name
var maskTemplate = "", static = !1;
^
Apr 09, 2018 11:29:18 AM com.google.javascript.jscomp.LoggerErrorManager printSummary
WARNING: 3 error(s), 0 warning(s)
Error message: JSC_PARSE_ERROR. Parse error. identifier is a reserved word at D:\temp\jquery.inputmask.bundle.js line 1002 : 16
Error message: JSC_PARSE_ERROR. Parse error. syntax error at D:\temp\jquery.inputmask.bundle.js line 1003 : 29
Error message: JSC_PARSE_ERROR. Parse error. missing variable name at D:\temp\jquery.inputmask.bundle.js line 1010 : 39
Is there any way to skip syntax validation or ignore errors and continue compressing?
By default, the compiler parses the code as 'use strict', and 'static' is a reserved word in strict mode. You can either change the language mode or simply rename the variable from 'static' to something else.
The strict-mode reserved words are documented here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode#Paving_the_way_for_future_ECMAScript_versions
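If you go the language-mode route, here is a sketch of the relevant options (method and enum names as I recall them from the Closure Compiler API; check them against the version you are using):
import com.google.javascript.jscomp.CompilationLevel;
import com.google.javascript.jscomp.CompilerOptions;

public class RelaxedOptions {
    // Build options that parse the input as non-strict ES5, so identifiers
    // such as `static` are not rejected as strict-mode reserved words.
    static CompilerOptions build() {
        CompilerOptions options = new CompilerOptions();
        options.setLanguageIn(CompilerOptions.LanguageMode.ECMASCRIPT5);
        CompilationLevel.WHITESPACE_ONLY.setOptionsForCompilationLevel(options);
        return options;
    }
}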

How to display omdb function returned value in webpage

Can we store the value returned from the omdb function searchOneMovie(moviename) so that it can be displayed in a JSP page when a button is clicked? I am developing a web application where clicking the button should show the details of the movie the user searched for, but with searchOneMovie() I am getting the output on the console, not in the web page. Please tell me how to display the searchOneMovie() value in the JSP page.
CODE:
public void search(HttpServletResponse response, String moviename)
        throws OmdbSyntaxErrorException, OmdbConnectionErrorException,
               OmdbMovieNotFoundException, IOException {
    Omdb o = new Omdb();
    try {
        PrintWriter out = response.getWriter();
        out.print(o.searchOneMovie(moviename));
    } catch (Exception ignore) {
    }
}
OUTPUT: This output appears on the console, but I want it on the web page.
Mar 13, 2017 8:13:51 PM com.omdbapi.RestClient execute
INFO: executing GET http://www.omdbapi.com/?t=star+wars HTTP/1.1
Mar 13, 2017 8:13:52 PM com.omdbapi.Omdb resultToJson
INFO: received {"Title":"Star Wars: Episode IV - A New Hope","Year":"1977","Rated":"PG","Released":"25 May 1977","Runtime":"121 min","Genre":"Action, Adventure, Fantasy","Director":"George Lucas","Writer":"George Lucas","Actors":"Mark Hamill, Harrison Ford, Carrie Fisher, Peter Cushing","Plot":"Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a wookiee and two droids to save the galaxy from the Empire's world-destroying battle-station, while also attempting to rescue Princess Leia from the evil Darth Vader.","Language":"English","Country":"USA","Awards":"Won 6 Oscars. Another 50 wins & 28 nominations.","Poster":"https://images-na.ssl-images-amazon.com/images/M/MV5BYzQ2OTk4N2QtOGQwNy00MmI3LWEwNmEtOTk0OTY3NDk2MGJkL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjc1NTYyMjg#._V1_SX300.jpg","Metascore":"92","imdbRating":"8.7","imdbVotes":"963,318","imdbID":"tt0076759","Type":"movie","Response":"True"}

OutputStream.write is too slow

I am encountering a scenario like this:
My project has a servlet that catches a request from Perl. The request is to download a file, and it is a multipart request.
@RequestMapping(value = "/*", method = RequestMethod.POST)
public void tdRequest(@RequestHeader("Authorization") String authenticate,
                      HttpServletResponse response,
                      HttpServletRequest request) throws Exception
{
    if (ServletFileUpload.isMultipartContent(request))
    {
        ServletFileUpload sfu = new ServletFileUpload();
        FileItemIterator items = sfu.getItemIterator(request);
        while (items.hasNext())
        {
            FileItemStream item = items.next();
            if (("action").equals(item.getFieldName()))
            {
                InputStream stream = item.openStream();
                String value = Streams.asString(stream);
                if (("upload").equals(value))
                {
                    uploadRequest(items, response);
                    return;
                }
                else if (("download").equals(value))
                {
                    downloadRequest(items, response);
                    return;
                }
The problem is not here; it appears in the downloadRequest() function.
void downloadRequest(FileItemIterator items,
                     HttpServletResponse response) throws Exception
{
    log.info("Start downloadRequest.......");
    OutputStream os = response.getOutputStream();
    File file = new File("D:\\clip.mp4");
    FileInputStream fileIn = new FileInputStream(file);
    //while ((datablock = dataOutputStreamServiceImpl.readBlock()) != null)
    byte[] outputByte = new byte[ONE_MEGABYE];
    while (fileIn.read(outputByte) != -1)
    {
        System.out.println("--------" + (i = i + 1) + "--------");
        System.out.println(new Date());
        //dataContent = datablock.getContent();
        System.out.println("Start write " + new Date());
        os.write(outputByte, 0, outputByte.length);
        System.out.println("End write " + new Date());
        //System.out.println("----------------------");
    }
    os.close();
}
}
I read and write the file in 1 MB blocks, but downloading the whole file takes far too long (in my case, 20 minutes for a 100 MB file).
I added System.out logging and saw results like this.
The first few blocks read and write data really fast:
--------1--------
Mon Dec 07 16:24:20 ICT 2015
Start write Mon Dec 07 16:24:20 ICT 2015
End write Mon Dec 07 16:24:21 ICT 2015
--------2--------
Mon Dec 07 16:24:21 ICT 2015
Start write Mon Dec 07 16:24:21 ICT 2015
End write Mon Dec 07 16:24:21 ICT 2015
--------3--------
Mon Dec 07 16:24:21 ICT 2015
Start write Mon Dec 07 16:24:21 ICT 2015
End write Mon Dec 07 16:24:21 ICT 2015
But each subsequent block is slower than the previous one:
--------72--------
Mon Dec 07 16:29:22 ICT 2015
Start write Mon Dec 07 16:29:22 ICT 2015
End write Mon Dec 07 16:29:29 ICT 2015
--------73--------
Mon Dec 07 16:29:29 ICT 2015
Start write Mon Dec 07 16:29:29 ICT 2015
End write Mon Dec 07 16:29:37 ICT 2015
--------124--------
Mon Dec 07 16:38:22 ICT 2015
Start write Mon Dec 07 16:38:22 ICT 2015
End write Mon Dec 07 16:38:35 ICT 2015
--------125--------
Mon Dec 07 16:38:35 ICT 2015
Start write Mon Dec 07 16:38:35 ICT 2015
End write Mon Dec 07 16:38:48 ICT 2015
The problem is in os.write(). I really cannot understand how the OutputStream writes; why does it take such a long time, or have I made some mistake?
Sorry for my bad English. I really need your support. Thanks in advance!
This is the Perl code from the client side:
# ----- get connected to download the file
#
$Response = $ua->request(POST $remoteHost ,
Content_Type => 'form-data',
Authorization => $Authorization,
'Proxy-Authorization' => $Proxy_Authorization ,
Content => [ DOS => 1 ,
action => 'download' ,
first_run => 0 ,
dl_filename => $dl_filename ,
delivery_dir => $delivery_dir ,
verbose => $Verbose ,
debug => $debug ,
version => $VERSION
]
);
unless ($Response->is_success) {
my $Msg = $Response->error_as_HTML;
# Remove HTML tags - we're in a DOS shell!
$Msg =~ s/<[^>]+>//g;
print "ERROR! SERVER RESPONSE:\n$Msg\n";
print "$remoteHost\n\n" if $Options{'v'};
Error "Could not connect to " . $remoteHost ;
}
my $Result2 = $Response->content();
Error "Abnormal termination...\n$Result2" if $Result2 =~ /_APP_ERROR_/;
open(F, ">$dl_filename") or Error "Could not open '$dl_filename'!";
binmode F; # unless $dl_filename =~ /\.txt$|\.htm$/;
print F $Result2;
close F;
print "received.\n";
}
One problem is that fileIn.read(outputByte) can read an arbitrary number of bytes, not necessarily a full outputByte buffer. You may read only a few KB yet still write the full 1 MB buffer, so the client receives more data than the file actually contains. Try this; note the readed variable.
void downloadRequest(FileItemIterator items,
                     HttpServletResponse response) throws Exception
{
    log.info("Start downloadRequest.......");
    OutputStream os = response.getOutputStream();
    File file = new File("D:\\clip.mp4");
    FileInputStream fileIn = new FileInputStream(file);
    //while ((datablock = dataOutputStreamServiceImpl.readBlock()) != null)
    byte[] outputByte = new byte[ONE_MEGABYE];
    int readed = 0;
    while ((readed = fileIn.read(outputByte)) != -1)
    {
        System.out.println("--------" + (i = i + 1) + "--------");
        System.out.println(new Date());
        //dataContent = datablock.getContent();
        System.out.println("Start write " + new Date());
        os.write(outputByte, 0, readed);
        System.out.println("End write " + new Date());
        //System.out.println("----------------------");
    }
    os.close();
}
}
It looks like your download performance gets slower and slower the further you get into the download. You start out at one second or less per block, by block 72 it is 7+ seconds per block, and by block 124 it is about 13 seconds per block.
There is nothing on the server side to explain this. Rather, it has the "smell" of the client side doing something wrong. My guess is that the client side is reading the data from the socket into an in-memory data structure, and that data structure (maybe just a String or StringBuffer or StringBuilder) is getting larger and larger. Either the time taken to expand it is getting larger, or your memory footprint is growing and the GC is taking longer and longer. (Or both.)
If you showed us the client-side code .....
UPDATE
As I suspected, this line of code will be reading the entire content into the Perl equivalent of a string builder before turning it into a string.
my $Result2 = $Response->content();
Depending on how it is implemented under the hood, this will lead to repeated copying of the data as the builder runs out of buffer space and needs to be expanded. Depending on the buffer expansion strategy that Perl employs for this, it could give O(N^2) behavior, where N is the size of the file you are transferring. (The evidence is that you are not getting O(N) behavior ...)
If you want faster downloads, you need to stream the data on the client side. Read the response content in chunks and write them to the output file. (I'm not a Perl expert, so I can't offer you code.) This will also reduce the memory footprint on the client side ... which could be important if your file sizes increase.
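For comparison, here is what that chunked streaming looks like in Java (only an analogue, since the actual client is Perl; the URL and file name are placeholders):
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StreamingDownloadClient {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.com/download").openConnection();
        try (InputStream in = conn.getInputStream();
             OutputStream out = new FileOutputStream("clip.mp4")) {
            byte[] buf = new byte[8192];
            int n;
            // Copy the response in small chunks instead of buffering the whole body
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // write only the bytes actually read
            }
        }
    }
}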

Opening OrientDB Database in Eclipse Maven project throws errors

The error I am getting is as follows:
" Aug 25, 2015 1:47:41 PM com.orientechnologies.common.log.OLogManager log
INFO: OrientDB auto-config DISKCACHE=4,161MB (heap=1,776MB os=7,985MB disk=416,444MB)
Aug 25, 2015 1:47:41 PM com.orientechnologies.common.log.OLogManager log
WARNING: segment file 'database.ocf' was not closed correctly last time
Exception in thread "main" com.orientechnologies.common.exception.OException: Error on creation
of shared resource
at com.orientechnologies.common.concur.resource.OSharedContainerImpl.getResource(OSharedContainerImpl.java:55)
at com.orientechnologies.orient.core.metadata.OMetadataDefault.init(OMetadataDefault.java:175)
at com.orientechnologies.orient.core.metadata.OMetadataDefault.load(OMetadataDefault.java:77)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.initAtFirstOpen(ODatabaseDocumentTx.java:2633)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.open(ODatabaseDocumentTx.java:254)
at arss.db.main(db.java:17)
Caused by: com.orientechnologies.orient.core.exception.ORecordNotFoundException:
The record with id '#0:1' not found
at com.orientechnologies.orient.core.record.ORecordAbstract.reload(ORecordAbstract.java:266)
at com.orientechnologies.orient.core.record.impl.ODocument.reload(ODocument.java:665)
at com.orientechnologies.orient.core.type.ODocumentWrapper.reload(ODocumentWrapper.java:91)
at com.orientechnologies.orient.core.type.ODocumentWrapperNoClass.reload(ODocumentWrapperNoClass.java:73)
at com.orientechnologies.orient.core.metadata.schema.OSchemaShared.load(OSchemaShared.java:786)
at com.orientechnologies.orient.core.metadata.OMetadataDefault$1.call(OMetadataDefault.java:180)
at com.orientechnologies.orient.core.metadata.OMetadataDefault$1.call(OMetadataDefault.java:175)
at com.orientechnologies.common.concur.resource.OSharedContainerImpl.getResource(OSharedContainerImpl.java:53)
... 5 more
Caused by: com.orientechnologies.orient.core.exception.ODatabaseException: Error
on retrieving record #0:1 (cluster: internal)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.executeReadRecord(ODatabaseDocumentTx.java:1605)
at com.orientechnologies.orient.core.tx.OTransactionNoTx.loadRecord(OTransactionNoTx.java:80)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.reload(ODatabaseDocumentTx.java:1453)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.reload(ODatabaseDocumentTx.java:117)
at com.orientechnologies.orient.core.record.ORecordAbstract.reload(ORecordAbstract.java:260)
... 12 more
Caused by: java.lang.NoSuchMethodError: com.orientechnologies.common.concur.lock.ONewLockManager.tryAcquireSharedLock(Ljava/lang/Object;J)Z
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.acquireReadLock(OAbstractPaginatedStorage.java:1301)
at com.orientechnologies.orient.core.tx.OTransactionAbstract.lockRecord(OTransactionAbstract.java:120)
at com.orientechnologies.orient.core.id.ORecordId.lock(ORecordId.java:282)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.lockRecord(OAbstractPaginatedStorage.java:1784)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:1424)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:697)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.executeReadRecord(ODatabaseDocumentTx.java:1572)
... 16 more"
The code that I use is:
package arss;
import com.orientechnologies.orient.core.config.OGlobalConfiguration;
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.serialization.serializer.record.ORecordSerializerFactory;
import com.orientechnologies.orient.core.serialization.serializer.record.binary.ORecordSerializerBinary;
import com.orientechnologies.orient.core.serialization.serializer.record.string.ORecordSerializerSchemaAware2CSV;
public class db {
    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:C:/AR/AR/Newfolder/orientdb-community-2.0.3_S/databases/GratefulDeadConcerts").open("admin", "admin");
        try {
            // CREATE A NEW DOCUMENT AND FILL IT
            ODocument doc = new ODocument("Person");
            doc.field("name", "Luke");
            doc.field("surname", "Skywalker");
            doc.field("city", new ODocument("City").field("name", "Rome").field("country", "Italy"));
            // SAVE THE DOCUMENT
            doc.save();
            db.close();
        } finally {
            db.close();
        }
    }
}
I'm not sure whether you need .flush() with ODocuments; you should look that up (or whether save() is equivalent and that's okay).
From those two error lines:
WARNING: segment file 'database.ocf' was not closed correctly last time
and
com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.executeReadRecord(ODatabaseDocumentTx.java:1572) ... 16 more"
I think it has something to do with the database.ocf file itself. I don't know if this helps, but try opening it manually, preferably with/without admin, and closing it again. ("Have you tried turning it off and on again?")
If there is still an error, check whether it is a different one.

Hadoop MapReduce job starts but cannot find Map class?

My MapReduce app counts usage of field values in a Hive table. I managed to build and run it from Eclipse after including all jars from the /usr/lib/hadoop, /usr/lib/hive and /usr/lib/hcatalog directories. It works.
After many frustrations I have also managed to compile and run it as a Maven project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.bigdata.hadoop</groupId>
<artifactId>FieldCounts</artifactId>
<packaging>jar</packaging>
<name>FieldCounts</name>
<version>0.0.1-SNAPSHOT</version>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hcatalog</groupId>
<artifactId>hcatalog-core</artifactId>
<version>0.11.0</version>
</dependency>
</dependencies>
</project>
To run the job from the command line, the following script has to be used:
#!/bin/sh
export LIBJARS=/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/jdo-api-3.0.1.jar,/usr/lib/hive/lib/antlr-runtime-3.4.jar,/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar,/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:.:/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar:/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/libfb303-0.9.0.jar:/usr/lib/hive/lib/jdo-api-3.0.1.jar:/usr/lib/hive/lib/antlr-runtime-3.4.jar:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar:/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
hadoop jar FieldCounts-0.0.1-SNAPSHOT.jar com.bigdata.hadoop.FieldCounts -libjars ${LIBJARS} simple simpout
Now Hadoop creates and starts the job, which then fails because Hadoop cannot find the Map class:
14/03/26 16:25:58 INFO mapreduce.Job: Running job: job_1395407010870_0007
14/03/26 16:26:07 INFO mapreduce.Job: Job job_1395407010870_0007 running in uber mode : false
14/03/26 16:26:07 INFO mapreduce.Job: map 0% reduce 0%
14/03/26 16:26:13 INFO mapreduce.Job: Task Id : attempt_1395407010870_0007_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
Why does this happen? The job jar contains all classes, including Map:
jar tvf FieldCounts-0.0.1-SNAPSHOT.jar
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/
121 Wed Mar 26 15:51:04 MSK 2014 META-INF/MANIFEST.MF
0 Wed Mar 26 14:29:58 MSK 2014 com/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/
3992 Fri Mar 21 17:29:22 MSK 2014 hive-site.xml
4093 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts.class
2961 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Reduce.class
1621 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/TableFieldValueKey.class
4186 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Map.class
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/
1030 Wed Mar 26 14:28:22 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.xml
123 Wed Mar 26 14:30:02 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.properties
What is wrong? Should I put Map and Reduce classes in separate files?
MapReduce code:
package com.bigdata.hadoop;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hcatalog.mapreduce.*;
import org.apache.hcatalog.data.*;
import org.apache.hcatalog.data.schema.*;
import org.apache.log4j.Logger;
public class FieldCounts extends Configured implements Tool {
public static class Map extends Mapper<WritableComparable, HCatRecord, TableFieldValueKey, IntWritable> {
static Logger logger = Logger.getLogger("com.foo.Bar");
static boolean firstMapRun = true;
static List<String> fieldNameList = new LinkedList<String>();
/**
 * Return a list of field names not containing the `id` field name
 * @param schema
 * @return
 */
static List<String> getFieldNames(HCatSchema schema) {
// Filter out `id` name just once
if (firstMapRun) {
firstMapRun = false;
List<String> fieldNames = schema.getFieldNames();
for (String fieldName : fieldNames) {
if (!fieldName.equals("id")) {
fieldNameList.add(fieldName);
}
}
} // if (firstMapRun)
return fieldNameList;
}
@Override
protected void map( WritableComparable key,
HCatRecord hcatRecord,
//org.apache.hadoop.mapreduce.Mapper
//<WritableComparable, HCatRecord, Text, IntWritable>.Context context)
Context context)
throws IOException, InterruptedException {
HCatSchema schema = HCatBaseInputFormat.getTableSchema(context.getConfiguration());
//String schemaTypeStr = schema.getSchemaAsTypeString();
//logger.info("******** schemaTypeStr ********** : "+schemaTypeStr);
//List<String> fieldNames = schema.getFieldNames();
List<String> fieldNames = getFieldNames(schema);
for (String fieldName : fieldNames) {
Object value = hcatRecord.get(fieldName, schema);
String fieldValue = null;
if (null == value) {
fieldValue = "<NULL>";
} else {
fieldValue = value.toString();
}
//String fieldNameValue = fieldName+"."+fieldValue;
//context.write(new Text(fieldNameValue), new IntWritable(1));
TableFieldValueKey fieldKey = new TableFieldValueKey();
fieldKey.fieldName = fieldName;
fieldKey.fieldValue = fieldValue;
context.write(fieldKey, new IntWritable(1));
}
}
}
public static class Reduce extends Reducer<TableFieldValueKey, IntWritable,
WritableComparable, HCatRecord> {
protected void reduce( TableFieldValueKey key,
java.lang.Iterable<IntWritable> values,
Context context)
//org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
//WritableComparable, HCatRecord>.Context context)
throws IOException, InterruptedException {
Iterator<IntWritable> iter = values.iterator();
int sum = 0;
// Sum up occurrences of the given key
while (iter.hasNext()) {
IntWritable iw = iter.next();
sum = sum + iw.get();
}
HCatRecord record = new DefaultHCatRecord(3);
record.set(0, key.fieldName);
record.set(1, key.fieldValue);
record.set(2, sum);
context.write(null, record);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
args = new GenericOptionsParser(conf, args).getRemainingArgs();
// To fix Hadoop "META-INFO" (http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file)
conf.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName());
// Get the input and output table names as arguments
String inputTableName = args[0];
String outputTableName = args[1];
// Assume the default database
String dbName = null;
Job job = new Job(conf, "FieldCounts");
HCatInputFormat.setInput(job,
InputJobInfo.create(dbName, inputTableName, null));
job.setJarByClass(FieldCounts.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits TableFieldValueKey as key and an integer as value
job.setMapOutputKeyClass(TableFieldValueKey.class);
job.setMapOutputValueClass(IntWritable.class);
// Ignore the key for the reducer output; emitting an HCatalog record as
// value
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);
job.setOutputFormatClass(HCatOutputFormat.class);
HCatOutputFormat.setOutput(job,
OutputJobInfo.create(dbName, outputTableName, null));
HCatSchema s = HCatOutputFormat.getTableSchema(job);
System.err.println("INFO: output schema explicitly set for writing:"
+ s);
HCatOutputFormat.setSchema(job, s);
return (job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
String classpath = System.getProperty("java.class.path");
System.out.println("*** CLASSPATH: "+classpath);
int exitCode = ToolRunner.run(new FieldCounts(), args);
System.exit(exitCode);
}
}
As I found out, the problem was in the permissions of the directory where the MapReduce jar was located. The jar was built in the home directory of a regular user, not the hdfs user. Since this MapReduce job writes its results directly into a Hive table, it should be run as the hdfs user; when such a job is run as a regular user, it has no permission to write data into the Hive table!
On the other hand, the home directory of a regular user in CentOS has 700 permissions, so when you run the hadoop jar ... command as a user different from the one who owns that home directory, access to the MapReduce jar gets denied somewhere in the process of Hadoop loading classes. That's why, under the hdfs user, this job results in java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.MyMap not found.
Recursively changing the permissions of the home directory where the MapReduce jar was built from 700 to 755 solves this problem.
Yet a more important problem remains: how to run the job as a regular user so that it has permission to write data into a Hive table?
I've found the following:
Set hadoop system user for client embedded in Java webapp
That allowed me to connect as the expected hadoop user, but the jar is not yet uploaded or executed... the ClassNotFound error remains.
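For reference, here is a minimal sketch of that approach (assuming simple, non-Kerberos authentication; the user name is illustrative), wrapping the existing ToolRunner call from the question:
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.ToolRunner;

public class RunAsHdfs {
    public static void main(String[] args) throws Exception {
        final String[] jobArgs = args;
        // Submit the job as the hdfs user (simple authentication, no Kerberos)
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hdfs");
        int exitCode = ugi.doAs(new PrivilegedExceptionAction<Integer>() {
            public Integer run() throws Exception {
                return ToolRunner.run(new com.bigdata.hadoop.FieldCounts(), jobArgs);
            }
        });
        System.exit(exitCode);
    }
}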
