How to execute a perl program inside Map Reduce in Hadoop? - java

I have a Perl program that takes an input file, processes it, and produces an output file. I now need to run this program on Hadoop, so that it processes the data chunks stored on the edge nodes, but I must not modify the Perl code. I don't know where to start. Can someone please give me any advice or suggestions?
Can I write a Java program that calls the Perl program from the mapper class using ProcessBuilder and combines the results in the reducer class?
Is there any other way to achieve this?

I believe you can do this with Hadoop Streaming.
Tom White, author of Hadoop: The Definitive Guide (3rd edition), shows this in Appendix C, page 622.
He used Hadoop to execute a bash shell script as a mapper.
In your case, use your Perl script in place of that bash script.
Use case: he has a lot of small files (one big tar file as input), and his shell script converts them into a few big files (one big tar file as output).
He used Hadoop to process them in parallel by supplying the bash script as the mapper, so the mapper works on the input files in parallel and produces the results.
Example hadoop command (copied from the book):
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-D mapred.reduce.tasks=0 \
-D mapred.map.tasks.speculative.execution=false \
-D mapred.task.timeout=12000000 \
-input ncdc_files.txt \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-output output \
-mapper load_ncdc_map.sh \
-file load_ncdc_map.sh
Replace load_ncdc_map.sh with your xyz.perl in both places (the last two lines of the command).
Replace ncdc_files.txt with another text file that lists the input files to be processed (the -input line, fifth from the bottom).
Assumptions: you have a fully functional Hadoop cluster running and your Perl script is error-free.
Please try it and let me know.
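If the Perl program expects file names as arguments rather than data on stdin, a small wrapper script can act as the mapper instead. The following is a minimal sketch, not the book's script. It assumes each record reaches the mapper as a byte offset, a tab, and one path from the list file, and that the program is called as "program input output". The names FETCH, STORE, PROGRAM, OUT_DIR, and run_mapper are made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper mapper: runs an unmodified program once per listed path.
# FETCH, STORE, PROGRAM and OUT_DIR are illustrative knobs, not Hadoop settings.
FETCH=${FETCH:-"hadoop fs -get"}      # copies an HDFS file to local disk
STORE=${STORE:-"hadoop fs -put"}      # copies a local result back to HDFS
PROGRAM=${PROGRAM:-"perl xyz.perl"}   # the untouched Perl program: <input> <output>
OUT_DIR=${OUT_DIR:-/user/me/results}  # assumed HDFS directory for results

run_mapper() {
    # Assumption: each record is "<byte offset><TAB><path>".
    while read -r offset path; do
        [ -n "$path" ] || continue
        name=$(basename "$path")
        work=$(mktemp -d) || return 1
        # The command variables are deliberately unquoted so they split into words.
        # stdin is redirected so the commands cannot swallow the remaining records.
        {
            $FETCH "$path" "$work/$name" &&
            $PROGRAM "$work/$name" "$work/$name.out" &&
            $STORE "$work/$name.out" "$OUT_DIR/$name.out"
        } </dev/null && echo "processed $path" || echo "failed $path" >&2
        rm -rf "$work"
    done
}
```

To use it as the mapper, end the file with a call to run_mapper and ship both this script and the Perl program with the job.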

ProcessBuilder is what a Java program uses to call non-Java applications or scripts, and it should work when called from the mapper class. You need to make sure that the Perl script, the perl executable, and the Perl libraries are available to all mappers.

Bit late to the party...
I'm about to start using Hadoop::Streaming. This seems to be the consensus module to use.

Related

Mahout: where can I find the java class executed by a bash shell script?

I'm trying to write a Java program using some functions from Mahout. I know that I can execute some Mahout functions from the command line, but I also want to know where I can find those functions in the .java files.
https://chimpler.wordpress.com/2013/04/17/generating-eigenfaces-with-mahout-svd-to-recognize-person-faces/
It seems like I can execute a java class with this command: $ mahout cleansvd -ci covariance.seq -ei output -o output2
So I checked the bash file and found this:
exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
However I cannot find any definition or assignment of $CLASS, and I don't know where the "cleansvd" class is.
Also, I can execute this command to perform a Singular Value Decomposition with 5 arguments:
$ mahout svd --input covariance.seq --numRows 150 --numCols 150 --rank 50 --output output
And I did find class SingularValueDecomposition in the source file, which takes only one argument and cannot reduce rank.
I really want to know what happened and how shell scripts locate java classes.
First of all, that's a very old blog post.
I wrote this one to use with "new Mahout".
https://rawkintrevo.org/2016/11/10/deep-magic-volume-3-eigenfaces/
It uses Scala, not Java, but the code is very simple and straightforward. You could easily make a jar and import it into a Java program.
The blog also shows you how the whole eigenfaces thing works: you basically just need to do SVD / DS-SVD on a matrix of faces-as-vectors.
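On the original question of how a launcher script locates a Java class: one common pattern is that the script maps the first word of the command line to a fully qualified class name, stores it in CLASS, and passes the remaining arguments through. The sketch below illustrates that pattern only. It is not Mahout's actual code, and the com.example class names and function names are invented.

```shell
#!/bin/sh
# Hypothetical launcher sketch; the com.example.* class names are invented.
resolve_class() {
    # Map a short command name to a fully qualified class name.
    case "$1" in
        svd)      echo com.example.decomposer.SvdJob ;;
        cleansvd) echo com.example.decomposer.CleanSvdJob ;;
        *)        echo "unknown command: $1" >&2; return 1 ;;
    esac
}

launch() {
    CLASS=$(resolve_class "$1") || return 1
    shift   # drop the short name, keep the remaining arguments
    # Same shape as the exec line quoted in the question.
    exec "${JAVA:-java}" -classpath "${CLASSPATH:-.}" "$CLASS" "$@"
}
```

With a script like this, the options on the command line are parsed by the class that gets launched, not by the shell, which would explain why they need not match the constructor of any one library class.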

How to install Java application to my linux system

I have written a Java application that analyzes my phone bills and calculates the average. At the moment I execute it like:
$ java -jar analyze.jar bill_1.pdf bill_2.pdf
But I want to install the application on my system, so that I can just fire up a terminal, type in the name of the application and the arguments, and hit enter, like any other "normal" program.
$ analyze bill_1.pdf bill_2.pdf bill_3.pdf
I know I can write a shell script and install it to "/usr/bin/", but I can't believe that there is no "native" way.
So please help, and sorry for the dumb question.
Thanks in advance.
One neat little trick is that you can append a basic shell script to the start of the jar file which will run it appropriately. See here for the full example but the basics are:
stub.sh
#!/bin/sh
MYSELF=`which "$0" 2>/dev/null`
[ $? -gt 0 -a -f "$0" ] && MYSELF="./$0"
java=java
if test -n "$JAVA_HOME"; then
java="$JAVA_HOME/bin/java"
fi
exec "$java" $java_args -jar "$MYSELF" "$@"
exit 1
Then do...
cat stub.sh helloworld.jar > helloworld.run && chmod +x helloworld.run
And you should be all set! Now you can just call the script-ified jar directly.
./helloworld.run
What you did so far is basically "the native" way.
You have to keep in mind that Java applications are compiled to bytecode. There simply is no binary for your application that you could invoke. You do need this detour of calling some JVM installation with a pointer to the main class you want to run. In other words, that is what the vast majority of Java applications are doing.
Theoretically, there are products out there that you could use to create a "true" binary from your application, but that isn't an easy path (see here for starters), and given your statement that you're just looking for more "convenience", it is definitely inappropriate here.
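For reference, the shell-script route mentioned in the question can be kept very small. This is a sketch under assumptions: the jar location /usr/local/lib/analyze/analyze.jar, the ANALYZE_JAR override, and the function name launch are all hypothetical.

```shell
#!/bin/sh
# Hypothetical launcher, e.g. installed as /usr/local/bin/analyze.
launch() {
    java_bin=java
    # Prefer the JVM from JAVA_HOME when it is set.
    if [ -n "${JAVA_HOME:-}" ]; then
        java_bin="$JAVA_HOME/bin/java"
    fi
    # "$@" forwards every argument unchanged, including names with spaces.
    exec "$java_bin" -jar "${ANALYZE_JAR:-/usr/local/lib/analyze/analyze.jar}" "$@"
}
```

End the installed file with the line launch "$@" and make it executable. After that, analyze bill_1.pdf bill_2.pdf behaves like any other command on the PATH.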

How do I directly execute a jar in linux?

I just want to be able to do ./whatever.jar instead of java -jar whatever.jar.
I've found a way:
#!/bin/bash
java -jar $0 $*
exit
# jar goes here...
but it doesn't work. Java just complains that it's an invalid/corrupt jarfile.
I also tried piping:
#!/bin/bash
tail -n +4 $0 | java -jar
exit
# jar goes here...
but this doesn't work.
One way to do it is to somehow split the file into two separate parts (the script part and the jar part), and then execute the jar, but that'd be redundant. You might as well make a script that executes the jar and execute that script.
So I need to figure out how to somehow tail it and fake the file.
I thought I could do it using /dev/stdout:
#!/bin/bash
java -jar /dev/stdout
tail -n +5 $0
exit
# jar goes here...
That doesn't work either. It just prints the contents of the jar and java complains that it's invalid. (I figured out later that there's nothing to read in /dev/stdout)
So I need to read from stdout some other way. I really wish I could pipe it though. It would make things SO much easier :)
You need a service called jexec. Some Linux distros come with it installed; check for /etc/init.d/jexec. My CentOS 5.5 definitely has it.
What it does is register the jexec interpreter with the binfmt system.
For more information you might want to have a quick read about binfmt_misc.
Assuming you have the kernel source code installed, check out /usr/src/linux/Documentation/java.txt for a way to run Java code directly using the kernel's BINFMT_MISC support (assuming it's compiled into the version of the kernel you're running, but I think it is on most major distros). If you don't have the source installed, you should be able to find it online easy enough (here's one example).
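A binfmt_misc registration is a single string of the form :name:type:offset:magic:mask:interpreter:flags. As a minimal sketch, the helper below builds a rule that matches on file extension (type E). The function name binfmt_ext_rule is made up, and the rule name and wrapper path in the usage note are placeholders.

```shell
#!/bin/sh
# Build a binfmt_misc rule that matches on file extension (type E).
# Fields: :name:type:offset:magic:mask:interpreter:flags
binfmt_ext_rule() {
    # $1 = rule name, $2 = extension without the dot, $3 = absolute interpreter path
    printf ':%s:E::%s::%s:\n' "$1" "$2" "$3"
}
```

As root, with binfmt_misc mounted, the rule would be registered with something like binfmt_ext_rule ExecutableJAR jar /usr/local/bin/jarwrapper > /proc/sys/fs/binfmt_misc/register, where jarwrapper is a small script of your own that runs java -jar on the file it is given.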
FYI, if you wanted to do it your original way it would go like this:
$ cat jar.sh
#!/usr/bin/env bash
java -jar <(tail -n +4 "$0")
exit
$ cat jar.sh runme.jar > works.jar
$ chmod a+x works.jar
$ ./works.jar
This presumes a recent bash with support for <() process substitution.
java -jar does not work with stdin; apparently it does some seeks rather than straight reads.
On a system you can't modify, you have to use a temporary file, for example:
#!/bin/bash
JF=/tmp/junk$$.jar
(uudecode -o /dev/stdout >$JF;java -jar $JF;unlink $JF) <<JAR
begin-base64 644 junk.jar
UEsDBBQACAAIAEaBz0AAAAAAAAAAAAAAAAAJAAQATUVUQS1JTkYv/soAAAMA
UEsHCAAAAAACAAAAAAAAAFBLAwQUAAgACABGgc9AAAAAAAAAAAAAAAAAFAAA
AE1FVEEtSU5GL01BTklGRVNULk1G803My0xLLS7RDUstKs7Mz7NSMNQz4OVy
LkpNLElN0XWqBAmY6RnEG5koaASX5in4ZiYX5RdXFpek5hYreOYl62nycvkm
ZubpOuckFhdbKWSV5mXzcvFyAQBQSwcIBHn3CVkAAABZAAAAUEsDBBQACAAI
AEd+z0AAAAAAAAAAAAAAAAAKAAAAanVuay5jbGFzc21QTUvDQBB926RJE6Ot
ramfhXoQogcDXiteBPFQVIjowdOmXcrGZCMxEfxZelDw4A/wR4mzUShCF3Zn
9s2bt2/26/vjE8ARBi4stB10sNpC10UPazZ8G30G61gqWZ4wGMH+DYN5mk8F
Q3sslbioslgU1zxOCTEzLhVDP7gbJ/yJhylXszAqC6lmI93oRnlVTMSZ1GQn
qdT9oeZ5sNGyse5hA5sM3rlI03x4mxfpdNfGlodt7JC45jN05sqXcSIm5T8o
en4sRUZG84oK/q8NmYdX5KEkJ4JnI4beApjBftC3lAbwg0X+MUSTvkivBm3y
DJqCsgFFRrF58A72QglNSqdVg5qyBO+PuketGnVe0egabzDndLdWNUjVJGS5
fmXlB1BLBwjDUWL/IAEAAJ4BAABQSwECFAAUAAgACABGgc9AAAAAAAIAAAAA
AAAACQAEAAAAAAAAAAAAAAAAAAAATUVUQS1JTkYv/soAAFBLAQIUABQACAAI
AEaBz0AEefcJWQAAAFkAAAAUAAAAAAAAAAAAAAAAAD0AAABNRVRBLUlORi9N
QU5JRkVTVC5NRlBLAQIUABQACAAIAEd+z0DDUWL/IAEAAJ4BAAAKAAAAAAAA
AAAAAAAAANgAAABqdW5rLmNsYXNzUEsFBgAAAAADAAMAtQAAADACAAAAAA==
====
JAR
Or I could just install the jarwrapper (Ubuntu) package.
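The temporary-file idea also works without uudecode when the jar is appended to the script after a marker line. This is a sketch under assumptions: the marker text, the function names, and the use of grep -a and -m1 (available in GNU and BSD grep) are my choices, not part of the answers above.

```shell
#!/bin/bash
# Sketch: copy a payload appended after a marker line into a seekable file.
# MARKER and the function names are illustrative.
MARKER='__PAYLOAD_BELOW__'

# Print the number of the first line after the marker in file $1.
payload_start() {
    local n
    n=$(grep -an -m1 "^${MARKER}\$" "$1" | cut -d: -f1)
    [ -n "$n" ] || return 1
    echo $((n + 1))
}

# Copy everything after the marker line of file $1 into file $2.
extract_payload() {
    local start
    start=$(payload_start "$1") || return 1
    tail -n +"$start" "$1" > "$2"
}
```

In a real script the body would create a file with mktemp, call extract_payload "$0" on it, run java -jar on the temporary file, remove it, and then exit before the marker line is reached.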
Write a separate shell script:
whatever.sh
#!/bin/bash
java -jar whatever.jar "$@"
You can't make the JAR file directly executable because it's not an executable file. It's Java bytecode which can't be read directly by the machine nor any standard shell interpreter that I know of.

Embed a Executable Binary in a shell script

First, I already googled but only found examples where a compressed file (say a .tar.gz) is embedded into a shell script.
Basically if I have a C program (hello.c) that prints a string, say Hello World!.
I compile it to get an executable binary
gcc hello.c -o hello
Now I have a shell script testEmbed.sh
What I am asking is if it is possible to embed the binary (hello) inside the shell script so that when I run
./testEmbed.sh
it executes the binary to print Hello World!.
Clarification:
One alternative is that I compress the executable into an archive and then extract it when the script runs. What I am asking is if it is possible to run the program without that.
Up until now, I was trying the method here. But it does not work for me. I guess the author was using some other distribution on another architecture. So, basically this did not work for me. :P
Also, if the workflow for a C program differs from a Java jar, I would like to know that too!
Yes, this can be done. It's actually quite similar in concept to your linked article. The trick is to use uuencode to encode the binary into text format then tack it on to the end of your script.
Your script is then written in such a way that it runs uudecode on itself to create a binary file, change the permissions then execute it.
uuencode and uudecode were originally created for shifting binary content around on the precursor to the internet, which didn't handle binary information that well. The conversion into text means that it can be shipped as a shell script as well. If, for some reason, your distribution complains when you try to run uuencode, it probably means you have to install it. For example, on Debian Squeeze:
sudo aptitude install sharutils
will get the relevant executables for you. Here's the process I went through. First create and compile your C program hello.c:
pax> cat hello.c
#include <stdio.h>
int main (void) {
printf ("Hello\n");
return 0;
}
pax> gcc -o hello hello.c
Then create a shell script testEmbed.sh, which will decode itself:
pax> cat testEmbed.sh
#!/bin/bash
rm -f hello
uudecode $0
./hello
rm -f hello
exit
The first rm statement demonstrates that the hello executable is being created anew by this script, not left hanging around from your compilation. Since you need the payload in the file as well, attach the encoded executable to the end of it:
pax> uuencode hello hello >>testEmbed.sh
Afterwards, when you execute the script testEmbed.sh, it extracts the executable and runs it.
The reason this works is because uudecode looks for certain marker lines in its input (begin and end) which are put there by uuencode, so it only tries to decode the encoded program, not the entire script:
pax> cat testEmbed.sh
#!/bin/bash
rm -f hello
uudecode $0
./hello
rm -f hello
exit
begin 755 hello
M?T5,1#$!`0````````````(``P`!````$(,$"#0```#`!#```````#0`(``'
M`"#`'#`;``8````T````-(`$"#2`!`C#````X`````4````$`````P```!0!
: : :
M:&%N9&QE`%]?1%1/4E]%3D1?7P!?7VQI8F-?8W-U7VEN:70`7U]B<W-?<W1A
M<G0`7V5N9`!P=71S0$!'3$E"0U\R+C``7V5D871A`%]?:38X-BYG971?<&-?
4=&AU;FLN8G#`;6%I;#!?:6YI=```
`
end
There are other things you should probably worry about, such as the possibility that your program may require shared libraries that don't exist on the target system, but the process above is basically what you need.
The process for a JAR file is very similar, except that the way you run it is different. It's still a single file but you need to replace the line:
./hello
with something capable of running JAR files, such as:
java -jar hello.jar
I think makeself is what you're describing.
The portable way to do this is with the printf command and octal escapes:
printf '\001\002\003'
to print bytes 1, 2, and 3. Since you probably don't want to write that all by hand, the od -b command can be used to generate an octal dump of the file, then you can use a sed script to strip off the junk and put the right backslashes in place.
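The od and sed pipeline described above might look like the following sketch. The function name octal_escapes is made up for illustration. od -An -v -b prints every byte as three octal digits without the address column.

```shell
#!/bin/sh
# Turn a file into a string of printf octal escapes such as \110\151\012.
octal_escapes() {
    od -An -v -b "$1" |    # octal dump, one byte per field, no addresses
        tr -s ' ' '\n' |   # put each field on its own line
        sed -e '/^$/d' -e 's/^/\\/' |   # drop blank lines, prefix a backslash
        tr -d '\n'         # join everything into one string
}
```

Every byte is escaped, so characters such as % cannot be misread as format directives. The generated string can be pasted into the script as the argument of printf, and printf "$(octal_escapes hello)" > hello.copy reproduces the file byte for byte.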

Shebang line parsing problems in Ubuntu

What is the accepted, portable way to include interpreter options in the shebang line, i.e. how can I do something like
#!/usr/bin/env python -c
or (more importantly) something like
#!/usr/bin/env java -cp "./jars/*:./src" -Xmn1G -Xms1G -server
and get it to be parsed correctly? Right now Ubuntu seems to just glom the whole thing together, although other systems will parse this with no problem.
http://en.wikipedia.org/wiki/Shebang_%28Unix%29
describes the problem but offers no solution.
There's no good solution, as different unices treat multi-word #! lines differently. Portable #! use limits you to at most one argument to the interpreter on the #! line, and no whitespace in the interpreter or argument.
If the language allows it, you can make the script a shell script which takes care of loading the interpreter with whatever command line it likes. For example, in Perl, from the perl manual:
#!/bin/sh -- # -*- perl -*- -p
eval 'exec perl -wS "$0" ${1+"$@"}'
if $running_under_some_shell;
The shell stops processing after the second line, and Perl sees lines 2–3 as an instruction that does nothing. Some lisp/scheme dialects make #!...!# a comment, allowing you to write
#!/bin/sh
exec guile -s "$0" "$@"
!# ;; scheme code starts here
In general, the only solutions involve two files. You can write #!/usr/bin/env mywrapper where mywrapper is a program (it can be a script) that calls the actual interpreter with whatever argument it wants. Or you can make the executable itself the wrapper script and keep the interpreted file separate. The second solution has the advantage of working even if the interpreter doesn't accept a leading #! line.
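The first two-file variant can be sketched as follows. The wrapper name mywrapper comes from the paragraph above. The INTERPRETER variable and the python -u option are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper, saved as "mywrapper" somewhere on PATH.
# A script that starts with "#!/usr/bin/env mywrapper" is run as:
#   mywrapper /path/to/script [script arguments...]
mywrapper() {
    # The interpreter options live here, so the shebang line needs only one word.
    exec "${INTERPRETER:-python}" -u "$@"
}
```

End the wrapper file with the line mywrapper "$@" and make it executable. The shebang line then carries a single argument, which stays within the portable limit described above.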
