Optimization of Pig Script - java

I have written a Pig script that processes sequence files given as input.
It works fine, but there is one problem, described below.
My Pig script contains repetitive statements like these:
Filtered_Data_1 = FILTER BagName BY ($0 matches 'RegEx-1');
Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
So on…
Question:
Is there any way to write the above statement once and
then loop through all possible regexes, substituting each one into the Pig script?
For example:
Filtered_Data_X = FILTER BagName BY ($0 matches 'RegEx'); ( have this statement once )
( loop through all possible RegEx values and substitute each into the statement )
Right now I am calling the Pig script from a shell script, so a solution in the shell script would also be welcome, or even a Java wrapper.
Thanks in advance.
Happy Pigging!!!!
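One way a Java wrapper could approach this is to generate the repeated FILTER statements from a list of regexes and hand the resulting text to Pig. The following is only a minimal sketch: the class name PigFilterGenerator and the method generate are hypothetical, and it assumes each regex is already written exactly as it should appear inside a Pig single-quoted string literal (no extra escaping is applied).

```java
import java.util.List;

// Hypothetical helper: builds one FILTER statement per regex.
final class PigFilterGenerator {

    // relation: the alias being filtered (e.g. "BagName").
    // regexes: patterns already in the form they should take inside a Pig literal (assumption).
    static String generate(String relation, List<String> regexes) {
        StringBuilder script = new StringBuilder();
        for (int i = 0; i < regexes.size(); i++) {
            // Aliases are numbered from 1 to mirror Filtered_Data_1, Filtered_Data_2, ...
            script.append("Filtered_Data_").append(i + 1)
                  .append(" = FILTER ").append(relation)
                  .append(" BY ($0 matches '").append(regexes.get(i)).append("');\n");
        }
        return script.toString();
    }

    public static void main(String[] args) {
        // Prints the generated statements, one per line.
        System.out.print(generate("BagName", List.of("RegEx-1", "RegEx-2", "RegEx-3")));
    }
}
```

The generated text could be written to a file and passed to the pig command from the existing shell script, or appended to the fixed part of the script before it is run.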

Related

Why is ANTLR not printing set of tokens correctly?

I am testing to see if ANTLR-4.7.1 is working properly by using a sample, provided by my professor, to match these results for the same printed set of tokens:
% java -jar ./antlr-4.7.1-complete.jar HelloExample.g4
% javac -cp antlr-4.7.1-complete.jar HelloExample*.java
% java -cp .:antlr-4.7.1-complete.jar org.antlr.v4.gui.TestRig HelloExample greeting helloworld.greeting -tokens
[#0,0:4='Hello',<1>,1:0]
[#1,6:10='World',<3>,1:6]
[#2,12:12='!',<2>,1:12]
[#3,14:13='<EOF>',<-1>,2:0]
(greeting Hello World !)
However, after getting to the 3rd command, my output was instead:
[#0,0:4='Hello',<'Hello'>,1:0]
[#1,6:10='World',<Name>,1:6]
[#2,12:12='!',<'!'>,1:12]
[#3,13:12='<EOF>',<EOF>,1:13]
In my output, there are no numbers inside < >, which I believe should come from the HelloExample.tokens file, which contains:
Hello=1
Bang=2
Name=3
WS=4
'Hello'=1
'!'=2
I get no error messages, and ANTLR seems to have generated all the files I needed, so I don't know where to look to resolve this. Please help. I'm not sure if it will be of use, but my working directory started with helloworld.greeting and HelloExample.g4, and it now contains:
helloworld.greeting
HelloExample.g4
HelloExample.interp
HelloExample.tokens
HelloExampleBaseListener.class
HelloExampleBaseListener.java
HelloExampleLexer.class
HelloExampleLexer.interp
HelloExampleLexer.java
HelloExampleLexer.tokens
HelloExampleListener.class
HelloExampleListener.java
HelloExampleParser$GreetingContext.class
HelloExampleParser.class
HelloExampleParser.java
As rici already pointed out in the comments, getting the actual rule names instead of their numbers in the token output is a feature and shouldn't worry you.
In order to get the (greeting Hello World !) output at the end, you'll want to add the -tree flag after -tokens.

Get specific java version with powershell

I have some issues with getting the java version out as a string.
In a batch script I have done it like this:
for /f tokens^=2-5^ delims^=.-_^" %%j in ('%EXTRACTPATH%\Java\jdk_extract\bin\java -fullversion 2^>^&1') do set "JAVAVER=%%j.%%k.%%l_%%m"
The output is: 1.8.0_121
Now I want to do this in PowerShell, but my output is 1.8.0_12; one "1" is missing at the end. I have tried Trim and Split, but nothing gives me the right output. Can someone help me out?
This is what I've got so far with PowerShell:
$javaVersion = (& $extractPath\Java\jdk_extract\bin\java.exe -fullversion 2>&1)
$javaVersion = "$javaVersion".Trim("java full version """).TrimEnd("-b13")
The full output is: java full version "1.8.0_121-b13"
TrimEnd() works a little differently than you might expect:
'1.8.0_191-b12'.TrimEnd('-b12')
results in: 1.8.0_19 and so does:
'1.8.0_191-b12'.TrimEnd('1-b2')
The reason is that TrimEnd() removes a trailing set of characters, not a substring. So .TrimEnd('-b12') means: remove every character from the end of the string for as long as it belongs to the set '-', 'b', '1', '2'. That includes the last '1' before the '-'.
A better solution in your case would be -replace:
'java full version "1.8.0_191-b12"' -replace 'java full version "(.+)-b\d+"','$1'
Use a regular expression for matching and extracting the version number:
$javaVersion = if ((& java -fullversion 2>&1) -match '\d+\.\d+\.\d+_\d+') {
    $matches[0]
}
or
$javaVersion = (& java -fullversion 2>&1 | Select-String '\d+\.\d+\.\d+_\d+').Matches[0].Groups[0].Value
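The same idea (match the version with a regular expression instead of trimming character sets) can be sketched in Java as well. This is an illustration only: the class name JavaVersionExtractor and the method extract are hypothetical, and the input is assumed to be the text printed by java -fullversion as quoted in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls "major.minor.patch_update" out of the -fullversion text.
final class JavaVersionExtractor {

    // Same pattern as in the PowerShell answers above.
    private static final Pattern VERSION = Pattern.compile("\\d+\\.\\d+\\.\\d+_\\d+");

    // Returns the first match, or null when the text contains no such version.
    static String extract(String fullVersionOutput) {
        Matcher m = VERSION.matcher(fullVersionOutput);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // prints 1.8.0_121
        System.out.println(extract("java full version \"1.8.0_121-b13\""));
    }
}
```

Because the pattern describes what to keep rather than what to strip, the trailing build suffix (-b13, -b12, and so on) never affects the result.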

Velocity parser crashes when parsing java code template

When trying to use a java source code as template for Velocity, it crashes at this line of the template:
/* #see panama.form.Validator#validate(java.lang.Object) */
with this Exception:
Exception in thread "main" org.apache.velocity.exception.ParseErrorException: Lexical error, Encountered: "l" (108), after : "." at *unset*[line 23, column 53]
at org.apache.velocity.runtime.RuntimeInstance.evaluate(RuntimeInstance.java:1301)
at org.apache.velocity.runtime.RuntimeInstance.evaluate(RuntimeInstance.java:1265)
at org.apache.velocity.app.VelocityEngine.evaluate(VelocityEngine.java:199)
Apparently it takes the #validate for a macro and crashes when it tries to parse the arguments for the macro. Is there anything one could do about this?
I'm using Velocity 1.7.
Edit
I know I could escape the # characters in the template files, but there are quite a number of them which also might change now and then, so I would prefer a way that would not require manual changes on the files.
First option
Try this solution from here: Escaping VTL Directives
VTL directives can be escaped with the backslash character ("\") in a manner similar to valid VTL references.
## #include( "a.txt" ) renders as <contents of a.txt>
#include( "a.txt" )
## \#include( "a.txt" ) renders as #include( "a.txt" )
\#include( "a.txt" )
## \\#include ( "a.txt" ) renders as \<contents of a.txt>
\\#include ( "a.txt" )
Second option
You can use the EscapeTool.
Tool for working with escaping in Velocity templates.
It provides methods to escape outputs for Java, JavaScript, HTML, XML and SQL. Also provides methods to render VTL characters that otherwise needs escaping.
Third option:
You may also try this workaround; I haven't used it, but it should work:
You can first read your template as a String and then pre-process it. For example, replace every # with \#, or add the following to the beginning of the file:
#set( $H = '#' )
$H$H
See this answer: How to escape a # in velocity. Then create a Template from that pre-processed String as described in this answer: How to use String as Velocity Template?
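The pre-processing step of the third option might look roughly like the sketch below. It is an assumption-laden illustration, not a tested Velocity recipe: the class VelocityHashEscaper and the method escapeHashes are hypothetical, the template is assumed to contain no # characters that are meant to be real Velocity directives, and the caller is assumed to define a reference named H with the value "#" (for example with the #set line shown above, or in the context) before evaluating the result.

```java
// Hypothetical helper: rewrites every '#' so Velocity sees a reference instead of a directive.
final class VelocityHashEscaper {

    // Replaces each '#' with the formal reference ${H}.
    // Assumption: H is bound to "#" when the template is later evaluated.
    static String escapeHashes(String template) {
        return template.replace("#", "${H}");
    }

    public static void main(String[] args) {
        String line = "panama.form.Validator#validate(java.lang.Object)";
        // prints panama.form.Validator${H}validate(java.lang.Object)
        System.out.println(escapeHashes(line));
    }
}
```

The formal ${H} notation is used so that the reference name ends at the closing brace, even when it is directly followed by letters such as validate.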

Remove shell control and non-printable characters from a String (Linux output)

In a web scanner application, I need to parse the output of some scripts to extract information, but the problem is that I don't get the same output in the Linux shell and in Java. Let me describe it (this example uses whatweb on one of the websites I need to scan at work, but I have the same problem whenever a script produces colored output in the shell):
Here is what I get in the Linux shell (with some colors):
http://www.ceris-ingenierie.com [200] Apache[2.2.9], Cookies[ca67a6ac78ebedd257fb0b4d64ce9388,jfcookie,jfcookie%5Blang%5D,lang], Country[EUROPEAN UNION][EU], HTTPServer[Fedora Linux][Apache/2.2.9 (Fedora)], IP[185.13.64.116], Joomla[1.5], Meta-Author[Administrator], MetaGenerator[Joomla! 1.5 - Open Source Content Management], PHP[5.2.6,], Plesk[Lin], Script[text/javascript], Title[Accueil ], X-Powered-By[PHP/5.2.6, PleskLin]
And here is what I get in Java:
[1m[34mhttp://www.ceris-ingenierie.com[0m [200] [1m[37mApache[0m[[1m[32m2.2.9[0m], [1m[37mCookies[0m[[1m[33mca67a6ac78ebedd257fb0b4d64ce9388,jfcookie,jfcookie%5Blang%5D,lang[0m], [1m[37mCountry[0m[[1m[33mEUROPEAN UNION[0m][[1m[35mEU[0m], [1m[37mHTTPServer[0m[[1m[31mFedora Linux[0m][[1m[36mApache/2.2.9 (Fedora)[0m], [1m[37mIP[0m[[1m[33m185.13.64.116[0m], [1m[37mJoomla[0m[[1m[32m1.5[0m], [1m[37mMeta-Author[0m[[1m[33mAdministrator[0m], [1m[37mMetaGenerator[0m[[1m[33mJoomla! 1.5 - Open Source Content Management[0m], [1m[37mPHP[0m[[1m[32m5.2.6,[0m], [1m[37mPlesk[0m[[1m[33mLin[0m], [1m[37mScript[0m[[1m[33mtext/javascript[0m], [1m[37mTitle[0m[[32mAccueil [0m], [1m[37mX-Powered-By[0m[[1m[33mPHP/5.2.6, PleskLin[0m]
My guess is that the colors in the Linux shell are generated by those unknown characters, but they are a real pain to parse in Java.
I get this output by running the script in a new thread and doing raw_data+=data; (where raw_data is a String) whenever there is a new line in the output, and finally sending raw_data to my parser.
How can I avoid getting those annoying characters and get friendlier output, like the one I see in the Linux shell?
In your Java code, where you are executing the shell script, you can add an extra sed filter to filter out the shell-control characters.
# filter out shell control characters
./my_script | sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g"
Use tr -dc '[[:print:]]' to remove non-printable characters, like this:
# filter out shell control characters
./my_script | \
sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g" | \
tr -dc '[[:print:]]'
You could even add a wrapper script around the original script to do this, and call the wrapper script instead. This allows you to do any other pre-processing before feeding the output into the Java program, and it keeps the Java code free of unnecessary filtering so you can focus on the core logic of the application.
If you can't add a wrapper script for any reason and would like to add the filter in Java, note that Java doesn't support pipes in the command directly. You'll have to pass your command as an argument to a shell, like this:
String[] cmd = {
"/bin/sh",
"-c",
"./my_script | sed -r 's/\\x1B\\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g'"
};
Process p = Runtime.getRuntime().exec(cmd);
Don't forget to escape all the '\' when you use the regex in Java.
Source and description for the sed filter: http://www.commandlinefu.com/commands/view/3584/remove-color-codes-special-characters-with-sed
You can use a regex here:
String raw_data= ...;
String cleaned_raw_data = raw_data.replaceAll("\\[\\d+m", "");
This will remove any sequence of characters that starts with a [, ends with an m, and has one or more digits (\\d+) in between.
Note that [ is preceded by \\ because [ has a special meaning in regular expressions (it's a meta-character). If the raw string still contains the ESC character (0x1B) in front of each [, this pattern leaves it in place.
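A variant that also removes the leading ESC character and handles semicolon-separated parameters could be sketched as follows. The class name AnsiStripper and the method strip are hypothetical, and the pattern only targets the common "ESC [ parameters letter" form of escape sequence, not every possible terminal control code.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes ANSI "ESC [ ... letter" sequences such as color codes.
final class AnsiStripper {

    // ESC (0x1B), '[', optional digits and semicolons, then one terminating letter (e.g. m or K).
    private static final Pattern ANSI_CSI = Pattern.compile("\u001B\\[[0-9;]*[A-Za-z]");

    static String strip(String raw) {
        return ANSI_CSI.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        String colored = "\u001B[1m\u001B[34mhttp://www.ceris-ingenierie.com\u001B[0m [200]";
        // prints http://www.ceris-ingenierie.com [200]
        System.out.println(strip(colored));
    }
}
```

Because the pattern requires the ESC character, ordinary bracketed text in the output such as [200] or Apache[2.2.9] is left untouched.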

error 1200 mismatched input 'as' expecting SEMI_COLON when using DayExtractor in Pig

I'm trying to follow this tutorial to analyze Apache access log files using Pig:
http://venkatarun-n.blogspot.com/2013/01/analyzing-apache-logs-with-pig.html
I'm stuck on this Pig statement:
grpd = GROUP logs BY DayExtractor(dt) as day;
When I execute that in the grunt terminal, I get the following error:
ERROR 1200: mismatched input 'as' expecting SEMI_COLON
Failed to parse: mismatched input 'as' expecting SEMI_COLON
Function DayExtractor is defined from piggybank.jar in this manner:
DEFINE DayExtractor
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
Ideas anyone?
I've been searching for a while about this. Any help would be greatly appreciated.
I am not sure how the author of the blog post got it to work, but as far as I know, you cannot use as in GROUP BY in Pig. Also, I don't think you can use UDFs in GROUP BY. Maybe the author had a different version of Pig that supported such operations. To get the same effect, you can split it into two steps:
logs_day = FOREACH logs GENERATE ....., DayExtractor(dt) as day;
grpd = GROUP logs_day BY day;
