How do I convert the following Python function, longToDigitArray, to a HiveQL UDAF? I am not familiar with Java.
# Convert a source value (e.g. 2305843012434919424) into a list of source flags
# Desired behavior:
# Input: longToDigitArray(2305843012434919424)
# Output: [31, 32, 62]
def longToDigitArray(x):
    a = []
    i = 1
    try:
        x = long(x)
    except:
        return a
    while x != 0:
        if x & 1:  # bitwise AND &: 1 in each bit position where both operands have a 1
            a.append(i)
        x = x >> 1  # bitwise right-shift by 1
        i = i + 1
    return a
Any insight is appreciated.
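Since the function maps one input value to one output array (rather than aggregating many rows), a plain Hive UDF, not a UDAF, is probably what's needed here. Below is a minimal, untested Java sketch using the classic org.apache.hadoop.hive.ql.exec.UDF base class; the class name and the registration statement in the comment are my own choices, not anything prescribed:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDF;

// Hypothetical UDF equivalent of longToDigitArray; register it with e.g.
// CREATE TEMPORARY FUNCTION long_to_digit_array AS 'LongToDigitArray';
public class LongToDigitArray extends UDF {
    public List<Integer> evaluate(Long x) {
        List<Integer> a = new ArrayList<Integer>();
        if (x == null) {
            return a; // mirrors the Python try/except returning []
        }
        long v = x;
        int i = 1;
        while (v != 0) {
            if ((v & 1L) != 0) {
                a.add(i); // bit at 1-based position i is set
            }
            v >>>= 1; // unsigned shift so negative inputs also terminate
            i++;
        }
        return a;
    }
}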
I tried using RediSearch, but it only supports fuzzy search, while I need to perform a regex search like:
key: "12345"
value: { name: "Maruti"}
searching "aru" will give the result "Mumbai", basically the regex formed is *aru*. Can anyone help me out how can I achieve it using Redis ?
This can be done, but I do not recommend it - performance will be greatly impacted.
If you must, however, you can use RedisGears for ad-hoc regex queries like so:
127.0.0.1:6379> HSET mykey name Maruti
(integer) 1
127.0.0.1:6379> HSET anotherkey name Moana
(integer) 1
127.0.0.1:6379> RG.PYEXECUTE "import re\np = re.compile('.*aru.*')\nGearsBuilder().filter(lambda x: p.match(x['value']['name'])).map(lambda x: x['key']).run()"
1) 1) "mykey"
2) (empty array)
Here's the Python code for readability:
import re
p = re.compile('.*aru.*')
GearsBuilder() \
    .filter(lambda x: p.match(x['value']['name'])) \
    .map(lambda x: x['key']) \
    .run()
I'm trying to write a Python script to build an origin-destination matrix using OpenTripPlanner (OTP), but I'm very new to Python and OTP.
I am trying to use OTP scripting in Jython/Python to build an origin-destination matrix with travel times between pairs of locations. In short, the idea is to launch a Jython jar file to call the test.py Python script, but I'm struggling to get the Python script to do what I want.
A light and simple reproducible example is provided here, and below is the Python script I've tried.
Python Script
#!/usr/bin/jython
from org.opentripplanner.scripting.api import *

# Instantiate an OtpsEntryPoint
otp = OtpsEntryPoint.fromArgs(['--graphs', 'C:/Users/rafa/Desktop/jython_portland',
                               '--router', 'portland'])

# Get the default router
router = otp.getRouter('portland')

# Create a default request for a given time
req = otp.createRequest()
req.setDateTime(2015, 9, 15, 10, 00, 00)
req.setMaxTimeSec(1800)
req.setModes('WALK,BUS,RAIL')

# Read points of origin
points = otp.loadCSVPopulation('points.csv', 'Y', 'X')

# Read points of destination
dests = otp.loadCSVPopulation('points.csv', 'Y', 'X')

# Create a CSV output
matrixCsv = otp.createCSVOutput()
matrixCsv.setHeader(['Origin', 'Destination', 'min_time'])

# Start loop
for origin in points:
    print "Processing: ", origin
    req.setOrigin(origin)
    spt = router.plan(req)
    if spt is None: continue
    # Evaluate the SPT for all points
    result = spt.eval(dests)
    # Find the time to other points
    if len(result) == 0: minTime = -1
    else: minTime = min([r.getTime() for r in result])
    # Add a new row of results to the CSV output
    matrixCsv.addRow([origin.getStringData('GEOID'), r.getIndividual().getStringData('GEOID'), minTime])

# Save the result
matrixCsv.save('traveltime_matrix.csv')
The output should look something like this:
GEOID GEOID travel_time
1 1 0
1 2 7
1 3 6
2 1 10
2 2 0
2 3 12
3 1 5
3 2 10
3 3 0
P.S. I tried to create a new tag, opentripplanner, for this question, but I don't have enough reputation to do that.
Laurent Grégoire has kindly answered the question on GitHub, so I reproduce his solution here.
This code works, but it would still take a long time to compute large OD matrices (say, more than 1 million pairs). Hence, any alternative answers that improve the speed/efficiency of the code would be welcome!
#!/usr/bin/jython
from org.opentripplanner.scripting.api import OtpsEntryPoint

# Instantiate an OtpsEntryPoint
otp = OtpsEntryPoint.fromArgs(['--graphs', '.',
                               '--router', 'portland'])

# Start timing the code
import time
start_time = time.time()

# Get the default router
# Could also be called: router = otp.getRouter('paris')
router = otp.getRouter('portland')

# Create a default request for a given time
req = otp.createRequest()
req.setDateTime(2015, 9, 15, 10, 00, 00)
req.setMaxTimeSec(7200)
req.setModes('WALK,BUS,RAIL')

# The file points.csv contains the columns GEOID, X and Y.
points = otp.loadCSVPopulation('points.csv', 'Y', 'X')
dests = otp.loadCSVPopulation('points.csv', 'Y', 'X')

# Create a CSV output
matrixCsv = otp.createCSVOutput()
matrixCsv.setHeader(['Origin', 'Destination', 'Walk_distance', 'Travel_time'])

# Start loop
for origin in points:
    print "Processing origin: ", origin
    req.setOrigin(origin)
    spt = router.plan(req)
    if spt is None: continue
    # Evaluate the SPT for all points
    result = spt.eval(dests)
    # Add a new row of results to the CSV output
    for r in result:
        matrixCsv.addRow([origin.getStringData('GEOID'), r.getIndividual().getStringData('GEOID'), r.getWalkDistance(), r.getTime()])

# Save the result
matrixCsv.save('traveltime_matrix.csv')

# Stop timing the code
print("Elapsed time was %g seconds" % (time.time() - start_time))
I developed a function in Clojure to fill in an empty column with the value from the last non-empty row above it. I'm assuming this works, given
(:require [flambo.api :as f])
(defn replicate-val
  [rdd input]
  (let [{:keys [col]} input
        result (reductions (fn [a b]
                             (if (empty? (nth b col))
                               (assoc b col (nth a col))
                               b))
                           rdd)]
    (println "Result type is: " (type result))))
Got this:
;=> "Result type is: clojure.lang.LazySeq"
The question is how to convert this back to a JavaRDD, using flambo (a Spark wrapper).
I tried (f/map result #(.toJavaRDD %)) in the let form to attempt the conversion to a JavaRDD, and got this error:
"No matching method found: map for class clojure.lang.LazySeq"
which is expected, because result is of type clojure.lang.LazySeq.
The question is how do I make this conversion, or how can I refactor the code to accommodate this?
Here is a sample input rdd:
(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"
But it looks like:
[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]
Required output is:
[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]
Thanks.
First of all, RDDs are not iterable (they don't implement ISeq), so you cannot use reductions. Ignoring that, the whole idea of accessing the previous record is rather tricky: you cannot directly access values from another partition, and only transformations which don't require shuffling preserve order.
The simplest approach here would be to use DataFrames and window functions with an explicit order, but as far as I know flambo doesn't implement the required methods. It is always possible to use raw SQL or to access the Java/Scala API, but if you want to avoid that you can try the following pipeline.
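For reference, if you do go through the Java API, the window-function route might look roughly like the sketch below. The column names idx and col2 are hypothetical, it assumes a Spark version whose DataFrame API exposes window functions, and the empty strings would need to be replaced by nulls first:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.last;

class FillForward {
    // df is assumed to carry an explicit ordering column "idx" and a sparse
    // column "col2" whose empty values have been converted to nulls.
    static Dataset<Row> fill(Dataset<Row> df) {
        // Note: a window without partitionBy pulls all rows into one partition.
        WindowSpec w = Window.orderBy("idx")
                .rowsBetween(Window.unboundedPreceding(), Window.currentRow());
        // last(..., true) ignores nulls, i.e. takes the last non-null value so far.
        return df.withColumn("col2_filled", last(col("col2"), true).over(w));
    }
}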
First, let's create a broadcast variable with the last value per partition:
(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)
(def last-per-part
  (f/fn [it]
    (let [context (TaskContext/get) xs (iterator-seq it)]
      [[(.partitionId context) (last xs)]])))

(def last-vals-bd
  (bd/broadcast sc
    (into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
Next, some helpers for the actual job:
(defn fill-pair [col]
  (fn [x] (let [[a b] x] (if (empty? (nth b col)) (assoc b col (nth a col)) b))))

(def fill-pairs
  (f/fn [it]
    (let [part-id (.partitionId (TaskContext/get)) ;; Get the partition ID
          xs (iterator-seq it)                     ;; Convert the input to a seq
          prev (if (zero? part-id)                 ;; Find the previous element
                 (first xs)
                 ((bd/value last-vals-bd) part-id))
          ;; Create a seq of (prev, current) pairs
          pairs (partition 2 1 (cons prev xs))
          ;; Same as before
          {:keys [col]} input
          ;; Prepare the mapping function
          mapper (fill-pair col)]
      (map mapper pairs))))
Finally, you can use fill-pairs with map-partitions:
(-> rdd (f/map-partitions fill-pairs) (f/collect))
A hidden assumption here is that the order of the partitions follows the order of the values. That may or may not hold in the general case, but without explicit ordering it is probably the best you can get.
An alternative approach is to zipWithIndex, swap the order of the values, and perform a join with an offset.
(require '[flambo.tuple :as tp])
(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %)))
(def rdd-idx-offset
  (f/map-to-pair rdd-idx
    (fn [t] (let [p (f/untuple t)] (tp/tuple (dec' (first p)) (second p))))))
(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)
Next, you can map using a similar approach as before.
Edit
A quick note on using atoms: the problem there is the lack of referential transparency, and that you're leveraging incidental properties of a given implementation, not a contract. There is nothing in the map semantics that requires elements to be processed in a given order; if the internal implementation changes, it may no longer be valid. Using Clojure:
(def a (atom 0))
(defn foo [x] (let [aa @a] (swap! a (fn [& args] x)) aa))
(map foo (range 1 20))
compared to:
(def a (atom 0))
(pmap foo (range 1 20))
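With the sequential map, each call typically sees the value left by the previous one, so you get something like (0 1 2 ... 18); with pmap the calls run in parallel and the swap!s interleave unpredictably, so the output varies from run to run. That unpredictability is exactly the incidental, non-contractual ordering the note above warns about.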
Hi, I'm trying to write a class that turns some text into well-defined tokens.
The strings are somewhat similar to code, like: (brown) "fox" 'c';. What I would like to get (either a token from Scanner or an array after splitting; I think both would work just fine) is ( , brown , ) , "fox" , 'c' , ; separately, as they are all potential tokens, which include:
quoted text with ' and "
number with or without a decimal point
parentheses, braces, semicolons, equals, sharp, ||, <=, &&
Currently I'm doing it with a Scanner. I've had some problems with the delimiter not being able to give me (, ) etc. separately, so I've used the following delimiter: \s+|(?=[;\{\}\(\)]|\b). The thing is, now I get " and ' as separate tokens as well, and I'd really like to avoid that. I've tried adding some negative lookaheads for variations of ", but no luck.
I've tried using StreamTokenizer, but it does not keep the different quotes.
P.S.
I did search the site and tried to Google it, but even though there are many Scanner-related/regex-related questions, I couldn't find anything that solves my problem.
EDIT 1:
So far I came up with \s+|^|(?=[;{}()])|(?<![.\-/'"])(?=\b)(?![.\-/'"])
I might not have been clear enough, but when I have something like:
"foo";'bar')(;{
gray fox=-56565.4546;
foo boo="hello"{
I'd like to get:
"foo" ,; ,'bar',) , (,; ,{
gray,fox,=,-56565.4546,;
foo,boo,=,"hello",{
But instead I have:
"foo" ,;'bar',) , (,; ,{
gray,fox,=-56565.4546,;
foo,boo,="hello",{
Note that when there are spaces between the = and the rest, e.g. gray fox = -56565.4546;, it leads to:
gray,fox,=,-56565.4546,;
What I'm doing with the above-mentioned regex is:
Scanner scanner = new Scanner(line);
scanner.useDelimiter(MY_MENTIONED_REGEX_HERE);
while (scanner.hasNext()) {
    System.out.println("Got: `" + scanner.next() + "`");
    // Some work here
}
Description
Since you are looking for all alphanumeric text, which might include a decimal point, why not just "ignore" the delimiters? The following regex will pull all the alphanumeric-with-decimal-point chunks from your input string. This works because your sample text was:
"foo";'bar')(;{
gray fox=-56565.4546;
foo boo="hello"{
Regex: (?:(["']?)[-]?[a-z0-9-.]*\1|(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))|(?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$)))
Summary
The regex has three alternative paths:
(["']?)[-]?[a-z0-9-.]*\1 captures an optional opening quote, followed by a minus sign if one exists, followed by text or numbers; this continues until it reaches the matching closing quote (the back-reference \1). This captures any text or numbers with a decimal point. The numbers are not validated, so 12.32.1 would match. If your input text also contains numbers prefixed with a plus sign, change [-] to [+-].
(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$)) looks behind for a non-alphanumeric: if the previous character is a symbol, this character is a symbol, and the next character is also a symbol or the end of the string, then grab the current symbol. This captures any free-floating symbols which are not quotes, and multiple symbols in a row like )(;{.
(?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$)) if the current character is not an alphanumeric or a quote, then look behind for an alphanumeric or quote and look ahead for a non-alphanumeric, non-quote, or the end of the line. This captures any symbols after a quote which would not be captured by the previous expressions, like the { after "hello".
Full Explanation
(?: starts a non-capturing group; inside it, the alternatives are separated by the or character |.

1st alternative: (["']?)[-]?[a-z0-9-.]*\1
  1st capturing group (["']?)
    Char class ["'] matched 0 or 1 times: one of " '
  Char class [-] matched 0 or 1 times: -
  Char class [a-z0-9-.] matched 0 or more times: one of a-z 0-9 - .
  \1 matches the text saved in back-reference 1

2nd alternative: (?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))
  (?<=[^a-z0-9]) positive lookbehind
    Negated char class [^a-z0-9] matches any char except a-z 0-9
  Negated char class [^a-z0-9] matches any char except a-z 0-9
  (?=(?:[^a-z0-9]|$)) positive lookahead; the sub-alternatives are separated by |
    Group (?:[^a-z0-9]|$)
      1st alternative [^a-z0-9]: any char except a-z 0-9
      2nd alternative $: end of string

3rd alternative: (?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$))
  (?<=[a-z0-9"']) positive lookbehind
    Char class [a-z0-9"'] matches one of a-z 0-9 " '
  Negated char class [^a-z0-9"'] matches any char except a-z 0-9 " '
  (?=(?:[^a-z0-9]|['"]|$)) positive lookahead; the sub-alternatives are separated by |
    Group (?:[^a-z0-9]|['"]|$)
      1st alternative [^a-z0-9]: any char except a-z 0-9
      2nd alternative ['"]: one of ' "
      3rd alternative $: end of string

) ends the non-capturing group.
Groups
Group 0 gets the entire matched string, whereas group 1 gets the opening quote delimiter, if it exists, to ensure a matching closing quote.
Java Code Example:
Note that some of the empty values in the array come from the newline characters, and some are introduced by the expression. You can apply the expression and some basic logic to ensure your output array has only non-empty values.
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
    public static void main(String[] args) {
        String sourcestring = "\"foo\";'bar')(;{\n"
                + "gray fox=-56565.4546;\n"
                + "foo boo=\"hello\"{";
        Pattern re = Pattern.compile(
                "(?:([\"']?)[-]?[a-z0-9-.]*\\1|(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))|(?<=[a-z0-9\"'])[^a-z0-9\"'](?=(?:[^a-z0-9]|['\"]|$)))",
                Pattern.CASE_INSENSITIVE);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}
$matches Array:
(
[0] => Array
(
[0] => "foo"
[1] =>
[2] => ;
[3] => 'bar'
[4] =>
[5] => )
[6] =>
[7] => (
[8] =>
[9] => ;
[10] =>
[11] => {
[12] =>
[13] =>
[14] =>
[15] => gray
[16] =>
[17] => fox
[18] =>
[19] => =
[20] => -56565.4546
[21] =>
[22] => ;
[23] =>
[24] =>
[25] =>
[26] => foo
[27] =>
[28] => boo
[29] =>
[30] => =
[31] => "hello"
[32] =>
[33] => {
[34] =>
)
[1] => Array
(
[0] => "
[1] =>
[2] =>
[3] => '
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
[16] =>
[17] =>
[18] =>
[19] =>
[20] =>
[21] =>
[22] =>
[23] =>
[24] =>
[25] =>
[26] =>
[27] =>
[28] =>
[29] =>
[30] =>
[31] => "
[32] =>
[33] =>
[34] =>
)
)
The idea is to start from the particular cases and move to the general. Try this expression:
Java string:
"([\"'])(?:[^\"']+|(?!\\1)[\"'])*\\1|\\|\\||<=|&&|[()\\[\\]{};=#]|[\\w.-]+"
Raw pattern:
(["'])(?:[^"']+|(?!\1)["'])*\1|\|\||<=|&&|[()\[\]{};=#]|[\w.-]+
The goal here isn't to split on a hypothetical delimiter, but to match entity by entity. Note that the order of the alternatives defines the priority (you can't put = before <=).
Example with your new specifications (you need to import Pattern and Matcher):
String s = "(brown) \"fox\" 'c';foo bar || 55.555;\"foo\";'bar')(;{ gray fox=-56565.4546; foo boo=\"hello\"{";
Pattern p = Pattern.compile("([\"'])(?:[^\"']+|(?!\\1)[\"'])*\\1|\\|\\||<=|&&|[()\\[\\]{};=#]|[\\w.-]+");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println("item = `" + m.group() + "`");
}
Your problem is largely that you are trying to do too much with one regular expression, and consequently you can't understand the interactions of the parts. As humans, we all have this trouble.
What you are doing has a standard treatment in the compiler business, called "lexing". A lexer generator accepts a regular expression for each individual token of interest to you, and builds a complex set of states that will pick out the individual lexemes, if they are distinguishable. Separate lexical definitions per token make them easy and un-confusing to write individually, and the lexer generator makes it "easy" and efficient to recognize all of them. (If you want to define a lexeme that has specific quotes included, it is easy to do that.)
See any of the widely available parser generators; they all include lexing engines, e.g., JCup, ANTLR, JavaCC, ...
Perhaps using a scanner generator such as JFlex will make it easier to achieve your goal than a regular expression.
Even if you prefer to write the code by hand, I think it would be better to structure it somewhat more. One simple solution would be to create separate methods which each try to "consume" one type of token from your text. Each such method can report whether it succeeded. This way you have several smaller chunks of code, each responsible for a different kind of token, instead of one big piece of code which is harder to understand and to write; a sketch of that structure follows.
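For illustration only, here is a minimal hand-rolled sketch of that structure; the class and method names are my own, error handling (e.g. unterminated quotes) is omitted, and two-character operators like <= and && are left out for brevity:

import java.util.ArrayList;
import java.util.List;

class SimpleLexer {
    private final String text;
    private int pos = 0;

    SimpleLexer(String text) { this.text = text; }

    List<String> tokenize() {
        List<String> tokens = new ArrayList<String>();
        while (pos < text.length()) {
            if (Character.isWhitespace(text.charAt(pos))) { pos++; continue; }
            String t = consumeQuoted();
            if (t == null) t = consumeNumber();
            if (t == null) t = consumeWord();
            if (t == null) t = consumeSymbol(); // fallback: single symbol like ; { } ( ) =
            tokens.add(t);
        }
        return tokens;
    }

    // 'x' or "x", with the quotes kept as part of the token; null if not at a quote.
    private String consumeQuoted() {
        char q = text.charAt(pos);
        if (q != '"' && q != '\'') return null;
        int end = text.indexOf(q, pos + 1);
        String t = text.substring(pos, end + 1);
        pos = end + 1;
        return t;
    }

    // Optional minus sign, then digits and decimal points; null if not a number.
    private String consumeNumber() {
        int p = pos;
        if (p < text.length() && text.charAt(p) == '-') p++;
        if (p >= text.length() || !Character.isDigit(text.charAt(p))) return null;
        while (p < text.length()
                && (Character.isDigit(text.charAt(p)) || text.charAt(p) == '.')) p++;
        String t = text.substring(pos, p);
        pos = p;
        return t;
    }

    // A run of letters; null if not at a letter.
    private String consumeWord() {
        int start = pos;
        while (pos < text.length() && Character.isLetter(text.charAt(pos))) pos++;
        return pos > start ? text.substring(start, pos) : null;
    }

    // Any other single character becomes its own token.
    private String consumeSymbol() {
        return String.valueOf(text.charAt(pos++));
    }
}

For instance, new SimpleLexer("gray fox = -56565.4546;").tokenize() would produce gray, fox, =, -56565.4546 and ;, with the quotes kept on quoted tokens.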
Table: BLOCK (has a composite unique index on both columns)
IP_ADDRESS CIDR_SIZE
========= ==========
10.10 16
15.0 16
67.7 16
18.0 8
Requirements:
Sub-blocks are not allowed. For example, 67.7.1 with CIDR size 24 is not allowed, as it is a child of 67.7. In other words, if any IP address in the database matches the beginning portion of the new IP, the insert should fail. Is it possible for me to do this using an Oracle SQL query?
I was thinking of doing it by:
1. Selecting all records into memory.
2. Converting each IP into its binary bits:
10.10 = 00001010.00001010
15.0 = 00001111.00000000
67.7 = 01000011.00000111
18.0 = 00010010.00000000
3. Converting the new IP into binary bits: 67.7.1 = 01000011.00000111.00000001
4. Checking whether the new IP's binary bits start with any existing IP's binary bits.
5. If true, the new record already exists in the database.
For example, the new binary bits 01000011.00000111.00000001 do start with the existing IP's (67.7) binary bits 01000011.00000111; the rest of the records don't match.
I am looking to see if there is an Oracle query that can do this for me, that is, return the matching IP addresses from the database. I checked out Oracle's Text API, but didn't find anything yet.
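As a plain illustration of the bit-prefix idea sketched in the steps above (done client-side; the class and method names here are hypothetical):

public class IpPrefixCheck {
    // Convert a dotted prefix like "67.7" into a bit string, 8 bits per octet.
    static String toBits(String ip) {
        StringBuilder sb = new StringBuilder();
        for (String octet : ip.split("\\.")) {
            String b = Integer.toBinaryString(Integer.parseInt(octet));
            sb.append("00000000".substring(b.length())).append(b); // left-pad to 8 bits
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String candidate = toBits("67.7.1"); // 010000110000011100000001
        String existing = toBits("67.7");    // 0100001100000111
        // true -> the candidate is a sub-block of an existing entry and should be rejected
        System.out.println(candidate.startsWith(existing));
    }
}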
Is there a reason you can't use the INSTR function?
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions068.htm#i77598
I'd do something like a NOT EXISTS clause that checks for INSTR(b_outer.IP_ADDRESS,b_inner.IP_ADDRESS) <> 1
*edit: thinking about this, you'd probably need to check whether the result is 1 (meaning the potential IP address matches starting at the first character of an existing IP address), as opposed to the general substring search I originally had.
Yes, you can do it in SQL by converting the IPs to numbers and then ensuring there is no record with a smaller CIDR size that gives the same ipnum when using its CIDR size.
WITH ipv AS
( SELECT IP.*
, NVL(REGEXP_SUBSTR( ip, '\d+', 1, 1 ),0) * 256 * 256 * 256 -- octet1
+ NVL(REGEXP_SUBSTR( ip, '\d+', 1, 2 ),0) * 256 * 256 -- octet2
+ NVL(REGEXP_SUBSTR( ip, '\d+', 1, 3 ),0) * 256 -- octet3
+ NVL(REGEXP_SUBSTR( ip, '\d+', 1, 4 ),0) AS ipnum -- octet4
, 32-bits AS ignorebits
FROM ips IP
)
SELECT IP1.ip, IP1.bits
FROM ipv IP1
WHERE NOT EXISTS
( SELECT 1
FROM ipv IP2
WHERE IP2.bits < IP1.bits
AND TRUNC( IP2.ipnum / POWER( 2, IP2.ignorebits ) )
= TRUNC( IP1.ipnum / POWER( 2, IP2.ignorebits ) )
)
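To make the TRUNC/POWER trick concrete: with 67.7/16 already present, its ignorebits is 32 - 16 = 16, and TRUNC(ipnum / POWER(2, 16)) evaluates to 17159 for both 67.7 and a candidate 67.7.1/24, so the NOT EXISTS subquery finds a match and the 67.7.1 row is filtered out.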
Note: my example uses a table equivalent to yours:
SQL> desc ips
Name Null? Type
----------------------------------------- -------- ----------------------------
IP NOT NULL VARCHAR2(16)
BITS NOT NULL NUMBER