Hive UDF's treatment of URLs - java

I've created a Hive UDF that parses a URL. The URL contains query parameters. When I parse the input in my UDF, however, characters like '=' and '&' are converted to gibberish.
Initially, I relied on Text's toString() method to convert the Hive Text to a Java String, but the characters above came out as gibberish. I then tried new String(bytes, StandardCharsets.UTF_8) to do the conversion. This worked at first, but then it started producing gibberish as well.
My method is shown below. Any ideas on what I might not be doing right?
public Text evaluate(final Text requestInput, final Text referrerInput) {
    if (requestInput == null || referrerInput == null)
        return null;
    final String request = new String(requestInput.getBytes(), StandardCharsets.UTF_8); // converts '=' and '&' in URL strings to gibberish
    final String referrer = new String(referrerInput.getBytes(), StandardCharsets.UTF_8); // converts '=' and '&' in URL strings to gibberish
}
When I run HQL in Hive:
SELECT get_json_object(json, '$.base.request_url') FROM events
I get this:
GET /api/get_info?id=1465473313746 HTTP/1.1
In my UDF, the toString() method (no additional processing) produces the following output:
GET /api/get_info?id\u003d1465473313746 HTTP/1.1

I learned that the = and & characters were arriving as their Unicode escape sequences (\u003d and \u0026). Why this was happening is still unclear to me. Using the Apache Commons StringEscapeUtils utility, the fix was simple:
StringEscapeUtils.unescapeJava(requestInput.toString())
solved the issue.
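For readers who cannot add Commons to the UDF's classpath, the same idea can be sketched with the standard library alone. This is a minimal sketch, not what the answer above uses: the class and method names (UnicodeUnescape, unescapeUnicode) are hypothetical, and unlike unescapeJava it only handles four-digit Unicode escapes, not sequences such as \n or \t, and it does not treat a doubled backslash specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each four-digit Unicode escape
    // with the character it denotes and leaves everything else as-is.
    public static String unescapeUnicode(String input) {
        Matcher m = UNICODE_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and backslash in the result.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash is a literal backslash in the Java string.
        String raw = "GET /api/get_info?id\\u003d1465473313746 HTTP/1.1";
        System.out.println(unescapeUnicode(raw));
    }
}
```

In the UDF, a helper like this would be applied to requestInput.toString() in place of the unescapeJava call.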

Related

How to convert utf8 string to escape string in JSON Java?

I want to convert a UTF-8 string to the \uXXXX escape format in the value of a JSON object.
I tried both JSONObject and Gson, but neither worked for me in this case:
JSONObject js = new JSONObject();
js.put("lastReason","nguyễn");
System.out.println(js.toString());
and
Gson gson = new Gson();
String new_js = gson.toJson(js.toString());
System.out.println(new_js);
Output: {"test":"nguyễn"}
But I expect the result to be:
Expected Output: {"test":"nguy\u1EC5n"}
Is there a solution for this case? Please help me resolve it.
You can use the Apache commons-text library to change a string to use Unicode escape sequences. Use org.apache.commons.text.StringEscapeUtils to translate the text before adding it to the JSONObject.
StringEscapeUtils.escapeJava("nguyễn")
will produce
nguy\u1EC5n
One possible problem with StringEscapeUtils is that it escapes control characters as well. If there is a tab character at the end of the string, it is translated to \t. For example:
StringEscapeUtils.escapeJava("nguyễn\t")
will produce a string in which the tab has also been escaped:
nguy\u1EC5n\t
You can use org.apache.commons.text.translate.UnicodeEscaper to get around this, but a default instance translates every character in the string to a Unicode escape sequence. For example:
UnicodeEscaper ue = new UnicodeEscaper();
ue.translate(rawString);
will produce
\u006E\u0067\u0075\u0079\u1EC5\u006E
or
\u006E\u0067\u0075\u0079\u1EC5\u006E\u0009
Whether it is a problem or not is up to you to decide.
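If neither behaviour is acceptable, a middle ground is to escape only the characters outside the ASCII range. The following is a minimal standard-library sketch of that idea; the class and method names (NonAsciiEscaper, escapeNonAscii) are hypothetical and not part of commons-text.

```java
public class NonAsciiEscaper {
    // Hypothetical helper: escapes every UTF-16 code unit above 127 as a
    // four-digit Unicode escape and copies ASCII (including tabs) unchanged.
    public static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c > 127) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escape in this literal is the letter U+1EC5; the tab stays a tab.
        String raw = "nguy\u1EC5n\t";
        System.out.println(escapeNonAscii(raw));
    }
}
```

For the input above this yields nguy\u1EC5n followed by an unescaped tab. One caveat, stated as an assumption about typical JSON libraries: when an already-escaped string is stored as a JSON value, the serializer usually escapes the backslash itself, so a helper like this is probably more useful when applied to the serialized JSON text (for example the result of js.toString()) than to the value before it is added.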

How to url encode data in Java

So I'm trying to translate working Python code into Java. One of the required steps is to URL-encode the data, but when I encode the data in Java it looks different from the data encoded in Python.
One block of the Python code contains this:
data = {'request-json': json}
print('Sending form data:', data)
data = urlencode(data)
data = data.encode('utf-8')
print('Sending data:', data)
The Output
Sending form data: {'request-json': '{"apikey": "xewpjipcpovwiiql"}'}
The output after being encoded
Sending data: b'request-json=%7B%22apikey%22%3A+%22xewpjipcpovwiiql%22%7D'
So this is what I'm trying to do in Java. As you can imagine, Java is more involved. I used Gson to convert to JSON:
Gson gson = new Gson();
API_Key key = new API_Key("xewpjipcpovwiiql");
String jsonInputString = gson.toJson(key);
Data data = new Data(key);
String request_form = gson.toJson(data);
System.out.println(request_form);
String urlencoded = URLEncoder.encode(request_form,StandardCharsets.UTF_8);
System.out.println(urlencoded);
The output:
Sending form data: {"request-json":{"apikey":"xewpjipcpovwiiql"}}
The output of the encoded string:
%7B%22requestjson%22%3A%7B%22apikey%22%3A%22xewpjipcpovwiiql%22%7D%7D
They don't look the same, so why do they come out differently? How do I get the same encoded string in Java as in Python? I noticed that Python used a combination of single and double quotes while Java uses only double quotes, so I don't know if that makes a difference.
Thank You!
On the Python side: the data.encode('utf-8') call does not change the URL encoding. It only converts the already-encoded str to bytes, which is why the output starts with b' (see https://docs.python.org/3/library/stdtypes.html#str.encode).
The outer brackets are missing because urlencode treats request-json as the URL parameter name and encodes only its value. This may be easier to see if you add a second top-level property: you would end up with request-json=%7B%22apikey%22%3A+%22xewpjipcpovwiiql%22%7D&second-property=<second-property-value>.
On the Java side: request_form is encoded as a single value, parameter name included, so the result can only be used as the value of some parameter in a URL, as in: https://host:port?some-parameter-name=%7B%22requestjson%22%3A%7B%22apikey%22%3A%22xewpjipcpovwiiql%22%7D%7D
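To reproduce the Python result in Java, the parameter name and the JSON value have to be encoded separately and joined with =. A minimal sketch, assuming Java 10 or later for the URLEncoder.encode(String, Charset) overload; the class and method names (FormEncodingSketch, encodeFormField) are hypothetical, and the JSON literal is copied from the Python output, including the space after the colon.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FormEncodingSketch {
    // Hypothetical helper: encodes one form field as name=value,
    // the same shape Python's urlencode produces for a one-entry dict.
    public static String encodeFormField(String name, String value) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // JSON text as Python's json.dumps formats it (space after the colon).
        String json = "{\"apikey\": \"xewpjipcpovwiiql\"}";
        System.out.println(encodeFormField("request-json", json));
    }
}
```

This prints request-json=%7B%22apikey%22%3A+%22xewpjipcpovwiiql%22%7D. Gson writes no space after the colon by default, so with Gson-generated JSON the + would be absent; that difference is in the JSON text, not in the URL encoding.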

Trying to figure out what kind of Unicode encoding I should use

I'm working on a Spring Boot project that fetches data from the database and then sends it through an HTTP POST request. Everything works with Latin text, but the data in my database is encoded with ISO 8859-6. I have re-encoded it to UTF-8 and UTF-16, but it still comes back as unreadable text: question marks and special characters.
Test example in Arabic:
مرحبا
It should look like this to be valid after the POST method:
06450631062d06280627
I can't figure out what kind of encoding happened here. I'm now doing an integration from .NET to Java.
This is what they used in .NET:
public static String UnicodeStr2HexStr(String strMessage)
{
    byte[] ba = Encoding.BigEndianUnicode.GetBytes(strMessage);
    String strHex = BitConverter.ToString(ba);
    strHex = strHex.Replace("-", "");
    return strHex;
}
I just need to know what kind of encoding happened here so I can apply it in Java, and it would be helpful if someone could show me a way to do it.
I have tried this, but it returns a different value:
String encodedWithISO88591 = "مرحبا";
String decodedToUTF8 = new String(encodedWithISO88591.getBytes("ISO-8859-1"), "UTF-8");
What you're looking for is the hex representation of the Arabic string encoded as UTF-16BE, which is what .NET's Encoding.BigEndianUnicode produces:
String yourVal = "مرحبا";
System.out.println(DatatypeConverter.printHexBinary(yourVal.getBytes(StandardCharsets.UTF_16BE)));
The output will be:
06450631062D06280627
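DatatypeConverter lives in javax.xml.bind, which is no longer bundled with the JDK from Java 11 onward. A minimal sketch of the same conversion using only java.lang and java.nio follows; the class and method names (Utf16BeHex, toUtf16BeHex) are hypothetical, and the Arabic text is written with Unicode escapes only so that the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BeHex {
    // Hypothetical helper: encodes the text as UTF-16BE and renders
    // each byte as two uppercase hex digits.
    public static String toUtf16BeHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes format as two digits.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The five Arabic letters from the question: U+0645 U+0631 U+062D U+0628 U+0627.
        String greeting = "\u0645\u0631\u062D\u0628\u0627";
        System.out.println(toUtf16BeHex(greeting));
    }
}
```

This prints 06450631062D06280627, matching the .NET output apart from letter case.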

Convert an object that contains UTF-8 strings to a String with the proper encoding

I'm processing an MMS and got its text part with:
mmsBodyPart.getContent();
It's simply an Object. Now I need to convert it to a String using UTF-8. I have tried:
String contentText = (String) mmsBodyPart.getContent();
but it doesn't work with specific characters, and some strange characters appear.
I also tried:
String content = new String(contentText.getBytes("UTF-8"), "UTF-8");
Not surprisingly, that failed too.
How can this be done?
EDIT: The problem was caused by bad encoding in the file. Nothing was wrong in the code; I just didn't think of that in the first place...
Strings don't have an encoding in Java. If you need one, decode a byte[] with an explicit encoding to get a String.
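The point that only bytes carry an encoding can be illustrated with a short sketch. The class and method names (DecodeBytesSketch, decodeUtf8) are hypothetical, and the two bytes are simply an example of a UTF-8 sequence for a non-ASCII letter; they are not taken from the MMS in the question.

```java
import java.nio.charset.StandardCharsets;

public class DecodeBytesSketch {
    // Hypothetical helper: the charset is chosen at the moment the
    // bytes are decoded, which is the only place it matters.
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of the single letter U+0142 (l with stroke).
        byte[] raw = {(byte) 0xC5, (byte) 0x82};

        String correct = decodeUtf8(raw);
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println(correct.length()); // one character
        System.out.println(wrong.length());   // two unrelated characters
    }
}
```

Once the wrong charset has been used for decoding, re-encoding the resulting String as UTF-8 and decoding it again (as in the second attempt above) returns the same String, so it cannot repair anything.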

ISO-8859-1 encoded strings out of /into JSON in Java

My application has a Java servlet that reads a JSONObject out of the request and constructs some Java objects that are used elsewhere. I'm running into a problem because there are strings in the JSON that are encoded in ISO-8859-1. When I extract them into Java strings, the encoding appears to get interpreted as UTF-16. I need to be able to get the correctly encoded string back at some point to put into another JSON object.
I've tried mucking around with ByteBuffers and CharBuffers, but then I don't get any characters at all. I can't change the encoding, as I have to play nicely with other applications that use ISO-8859-1.
Any tips would be greatly appreciated.
It's a legacy application using Struts 1.3.8. I'm using net.sf.json 2.2.4 for JSONObject and JSONArray.
A snippet of the parsing code is:
final JSONObject a = (JSONObject) i;
final JSONObject attr = a.getJSONObject("attribute");
final String category = attr.getString("category");
final String value = attr.getString("value");
I then create POJOs using that information, that are retrieved by another action class to create JSON to pass to the client for display, or to pass to other applications.
So to clarify, if the JSON contains the string "Juan Guzmán", the Java String contains something like Juan Guzm?_An (I don't have the exact one in front of me). I'm not sure how to get the correct diacritical back. I believe that if I can get a Java String that contains the correct representation, that Mezzie's solution, below, will allow me to create the string with the correct encoding to put back into the JSON to serve back.
I had the same issue, and I am using the same technology as you are. In our case the encoding was UTF-8, so in your case just change that to UTF-16:
public static String UTF8toISO( String str )
{
    try
    {
        return new String( str.getBytes( "ISO-8859-1" ), "UTF-8" );
    }
    catch ( UnsupportedEncodingException e )
    {
        e.printStackTrace();
    }
    return str;
}
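What the method above does is turn the String back into the bytes it was wrongly decoded from and then decode those bytes with the charset they were actually written in. A minimal round-trip sketch of that repair follows, under the assumption that the text was UTF-8 that got decoded as ISO-8859-1; the class and method names (MojibakeRepair, reinterpret) are hypothetical, and the accented letter is written as a Unicode escape so the example does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Hypothetical helper: recovers the original bytes using the charset
    // that was wrongly applied, then decodes them with the actual one.
    public static String reinterpret(String garbled, Charset wronglyUsed, Charset actual) {
        return new String(garbled.getBytes(wronglyUsed), actual);
    }

    public static void main(String[] args) {
        // "Juan Guzman" with an acute accent on the second a (U+00E1).
        String original = "Juan Guzm\u00E1n";

        // Simulate the bug: UTF-8 bytes decoded as ISO-8859-1.
        String garbled = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        String repaired = reinterpret(
                garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(garbled.length());          // one longer than the original
        System.out.println(repaired.equals(original)); // true
    }
}
```

The repair is only lossless because ISO-8859-1 maps every byte value to a character. If the wrong decoding already replaced bytes with ? or a replacement character, the original text cannot be recovered this way, and the decoding has to be fixed where the request is read.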
