I implemented a Java pipeline that stores some objects in PostgreSQL (9.5).
Objects are first serialized to a JSON string with Jackson (2.5.4) and then sent over JDBC to the DBMS, into a column of type JSON.
When needed, objects are read from the database, deserialized and used by the application.
The problem is that when I try to query my data with psql, I get a strange error that seems related to UTF-8:
kronos=# select payload from messages where payload->>'errorMessage' = 'some error';
ERROR: unsupported Unicode escape sequence
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1:
...globalIdResolution":"GENERATED","authorFullName":...
I am not particularly worried, since my application can write and read all the data in the RDBMS. Is this a psql issue, or am I doing something wrong? Is there a way to be sure that my data is accessible from any application, even one that is not based on Java and Jackson?
I don't think this is Jackson's fault: I tried a unit test, and the string doesn't seem to contain a \u0000 character:
package it.celi.dd.kronos.json;
import static org.junit.Assert.*;
import java.util.HashMap;
import java.util.Map;
import org.junit.Test;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
public class JacksonUTF8Test {
@Test
public void test() throws JsonProcessingException {
final ObjectMapper objectMapper = new ObjectMapper();
final String utf8 = "؏ـــطــہ";
final Map<String, Object> map = new HashMap<>();
map.put("content", txt(1600) + utf8 + txt(500));
final String json = objectMapper.writeValueAsString(map);
final int jsonBoilerplate = "{\"content\":\"".length() + "\"}".length();
assertEquals(1600 + utf8.length() + 500 + jsonBoilerplate, json.length());
System.out.println(json.length());
assertEquals("{\"content\":\"" + txt(1600) + utf8 + txt(500) + "\"}", json);
}
private String txt(final int chars) {
final StringBuilder builder = new StringBuilder();
for(int i = 0; i < chars; i ++) {
builder.append('a');
}
return builder.toString();
}
}
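A note on the test above: it only exercises Arabic text, so it cannot show whether some real payload carries a NUL character. The DETAIL line complains about \u0000, the JSON escape for U+0000, which PostgreSQL's text type cannot hold. Below is a minimal sketch, in plain Java, of how one might scan a serialized payload for that escape and strip NUL characters from string values before serializing. The class and method names (NulCheck, containsNulEscape, stripNul) are made up for illustration; they are not part of Jackson or JDBC.

```java
public class NulCheck {

    // Returns true if the JSON text contains the six-character escape for U+0000.
    // A backslash that is itself escaped (two backslashes in a row) is skipped,
    // so the literal text "u0000" after an escaped backslash is not reported.
    static boolean containsNulEscape(String json) {
        for (int i = 0; i < json.length() - 1; i++) {
            if (json.charAt(i) != '\\') {
                continue;
            }
            if (json.startsWith("u0000", i + 1)) {
                return true;
            }
            i++; // skip the character that this backslash escapes
        }
        return false;
    }

    // Removes raw NUL characters from a value before it is handed to the serializer.
    static String stripNul(String value) {
        return value.replace("\0", "");
    }

    public static void main(String[] args) {
        // Hypothetical payload: the escape sits inside a string value.
        String bad = "{\"authorFullName\":\"Jo\\u0000hn\"}";
        String good = "{\"authorFullName\":\"John\"}";
        System.out.println(containsNulEscape(bad));
        System.out.println(containsNulEscape(good));
        System.out.println(stripNul("Jo\0hn"));
    }
}
```

Running a check like this over the strings the pipeline sends to the database would show whether the NUL comes from the source data rather than from Jackson.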
I am attempting to read a large text file (2 to 3 GB). I need to read it line by line and convert each line into a JSON object. I have tried using .collect() and .toLocalIterator() to read through the file. collect() is fine for small files but does not work for large ones. I know that .toLocalIterator() brings data scattered around the cluster onto a single node. According to the documentation, .toLocalIterator() is ineffective when dealing with large RDDs, as it will run into memory issues. Is there an efficient way to read large text files in a multi-node cluster?
Below is a method with my various attempts at reading through the file and converting each line into JSON.
public static void jsonConversion() {
JavaRDD<String> lines = sc.textFile(path);
String newrows = lines.first(); //<--- This reads the first line of the text file
// Reading through with
// tolocaliterator--------------------------------------------
Iterator<String> newstuff = lines.toLocalIterator();
System.out.println("line 1 " + newstuff.next());
System.out.println("line 2 " + newstuff.next());
// Inserting lines in a list.
// Note: .collect() is appropriate for small files
// only.-------------------------
List<String> rows = lines.collect();
// Sets loop limit based on the number on lines in text file.
int count = (int) lines.count();
System.out.println("Number of lines are " + count);
// Using google's library to create a Json builder.
GsonBuilder gsonBuilder = new GsonBuilder();
Gson gson = new GsonBuilder().setLenient().create();
// Created an array list to insert json objects.
ArrayList<String> jsonList = new ArrayList<>();
// Converting each line of the text file into a Json formatted string and
// inserting into the array list 'jsonList'
for (int i = 0; i <= count - 1; i++) {
String JSONObject = gson.toJson(rows.get(i));
Gson prettyGson = new GsonBuilder().setPrettyPrinting().create();
String prettyJson = prettyGson.toJson(rows.get(i));
jsonList.add(prettyJson);
}
// For printing out the all the json objects
int lineNumber = 1;
for (int i = 0; i <= count - 1; i++) {
System.out.println("line " + lineNumber + "-->" + jsonList.get(i));
lineNumber++;
}
}
Below is a list of libraries that I am using
//Spark Libraries
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
//Java Libraries
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
//Json Builder Libraries
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
You can use the map function on the RDD instead of collecting all the results.
JavaRDD<String> lines = sc.textFile(path);
JavaRDD<String> jsonList = lines.map(line -> <<all your json transformations>>)
That way, the transformation of your data is distributed across the cluster.
Converting the data to a list or an array forces all of it to be collected on one node. If you want Spark to distribute the computation, you need to keep working with an RDD, a DataFrame or a Dataset.
JavaRDD<String> lines = sc.textFile(path);
JavaRDD<String[]> fields = lines.map(line -> line.split("/"));
Or you can use a multi-statement lambda inside map:
JavaRDD<String> jsonList = lines.map(line -> {
String newline = line.replace("", "");
return newline;
});
Then convert the JavaRDD to a DataFrame (see "Converting JavaRDD to DataFrame in Spark java") and write it out as JSON:
dfTobeSaved.write().format("json").save("/root/data.json");
For my apprenticeship project I have set up a Java program that takes in a JSON file of English strings and outputs a JSON file in a different language, which is chosen in the console. Some languages, like French and Italian, are output with the correct translations, whereas Russian or Japanese are output as question marks.
I searched around and saw that I needed to get the bytes of my string and then encode them as UTF-8. I did this but was still getting question marks, so I started to use the standard charsets built into Java and tried different ways of encoding and decoding the string.
One attempt at re-encoding the string gave me a different output: Ð?Ñ?ивеÑ?
package com.bis.propertyfiletranslator;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.List;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.googleapis.json.GoogleJsonResponseException;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.translate.Translate;
import com.google.api.services.translate.model.TranslationsListResponse;
import com.google.api.services.translate.model.TranslationsResource;
public class Translator {
public static Translate.Translations.List list;
private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final Charset ISO = Charset.forName("ISO-8859-1");
public static void translateJSONMapThroughGoogle(String input, String output, String API, String language,
List<String> subLists) throws IOException, GeneralSecurityException {
Translate t = new Translate.Builder(GoogleNetHttpTransport.newTrustedTransport(),
JacksonFactory.getDefaultInstance(), null).setApplicationName("PhoenUX-Google-Translate").build();
try {
list = t.new Translations().list(subLists, language).setFormat("text");
list.setKey(API);
} catch (GoogleJsonResponseException e) {
if (e.getDetails().getMessage().equals("Invalid Value")) {
System.err.println(
"\n Language not currently supported, check the accepted language codes and try again.\n\n Language Requested: "
+ language);
} else {
System.out.println(e.getDetails().getMessage());
}
}
for (TranslationsResource translationsResource : response.getTranslations()) {
for (String key : JSONFunctions.jsonHashMap.keySet()) {
JSONFunctions.jsonHashMap.remove(key);
String value = translationsResource.getTranslatedText();
String encoded = new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
JSONFunctions.jsonHashMap.put(key, encoded);
System.out.println(encoded);
break;
}
}
JSONFunctions.outputTranslationsBackToJson(output);
}
}
This uses the Google Cloud library. I added a System.out.println so I could see the results of what I had tried, so this code should be all you need to replicate it.
I expect the output for "Hello" to be "Привет" (Russian); the actual output is ???? or Ð?Ñ?ивеÑ?, depending on the encoding I use.
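The line String encoded = new String(...) is simply wrong: it takes the UTF-8 bytes of a correct string and decodes them as ISO-8859-1, which corrupts every non-ASCII character. Just store the value as it is:
put(key, value);
Note that System.out.println may still show question marks, because the console encoding of the OS might be some Windows ANSI code page that cannot represent all Unicode characters, whereas a Java String always holds Unicode.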
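To make the point concrete, here is a small sketch using only the JDK. It shows that encoding to UTF-8 and decoding as ISO-8859-1 turns each Cyrillic letter into two unrelated characters, while decoding with the same charset gives the original string back. EncodingDemo, mangle and roundTrip are illustrative names, not part of the translation library. The explicit UTF-8 PrintStream is one possible way to print the text, assuming the terminal itself displays UTF-8.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // What the question's code does: UTF-8 bytes decoded with the wrong charset.
    static String mangle(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Encoding and decoding with the same charset is lossless.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes spell the Russian word from the question,
        // so the encoding of this source file does not matter.
        String privet = "\u041F\u0440\u0438\u0432\u0435\u0442";

        // Print through a stream with an explicit encoding instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(roundTrip(privet));
        out.println("original length: " + privet.length());
        out.println("mangled length:  " + mangle(privet).length());
    }
}
```

The same rule applies when writing the output JSON file: pick UTF-8 explicitly on the writer instead of relying on the platform default.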
I have a large JSON file (2.5MB) containing about 80000 lines.
It looks like this:
{
"a": 123,
"b": 0.26,
"c": [HUGE irrelevant object],
"d": 32
}
I only want the integer values stored for keys a, b and d and ignore the rest of the JSON (i.e. ignore whatever is there in the c value).
I cannot modify the original JSON as it is created by a 3rd party service, which I download from its server.
How do I do this without loading the entire file in memory?
I tried using gson library and created the bean like this:
public class MyJsonBean {
@SerializedName("a")
@Expose
public Integer a;
@SerializedName("b")
@Expose
public Double b;
@SerializedName("d")
@Expose
public Integer d;
}
But even then, in order to deserialize it with Gson, it seems I need to download the whole file, read it into memory and then pass it to Gson as a string:
File myFile = new File(<FILENAME>);
myFile.createNewFile();
URL url = new URL(<URL>);
OutputStream out = new BufferedOutputStream(new FileOutputStream(myFile));
URLConnection conn = url.openConnection();
HttpURLConnection httpConn = (HttpURLConnection) conn;
InputStream in = conn.getInputStream();
byte[] buffer = new byte[1024];
int numRead;
while ((numRead = in.read(buffer)) != -1) {
out.write(buffer, 0, numRead);
}
FileInputStream fis = new FileInputStream(myFile);
byte[] data = new byte[(int) myFile.length()];
fis.read(data);
String str = new String(data, "UTF-8");
Gson gson = new Gson();
MyJsonBean response = gson.fromJson(str, MyJsonBean.class);
System.out.println("a: " + response.a + "" + response.b + "" + response.d);
Is there any way to avoid loading the whole file and just get the relevant values that I need?
You should definitely compare different approaches and libraries. If performance really matters, try Gson, Jackson and JsonPath and choose the fastest one. You will have to download the whole JSON file to local disk, probably into a temp folder, and parse it after that.
A simple JsonPath solution could look like this:
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import java.io.File;
public class JsonPathApp {
public static void main(String[] args) throws Exception {
File jsonFile = new File("./resource/test.json").getAbsoluteFile();
DocumentContext documentContext = JsonPath.parse(jsonFile);
System.out.println("" + documentContext.read("$.a"));
System.out.println("" + documentContext.read("$.b"));
System.out.println("" + documentContext.read("$.d"));
}
}
Notice that I do not create any POJO; I just read the given values using JSONPath, much like XPath. You can do the same with Jackson:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
public class JsonPathApp {
public static void main(String[] args) throws Exception {
File jsonFile = new File("./resource/test.json").getAbsoluteFile();
ObjectMapper mapper = new ObjectMapper();
JsonNode root = mapper.readTree(jsonFile);
System.out.println(root.get("a"));
System.out.println(root.get("b"));
System.out.println(root.get("d"));
}
}
We do not need JSONPath here because the values we need sit directly in the root node. As you can see, the API looks almost the same. We can also create a POJO structure:
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.math.BigDecimal;
public class JsonPathApp {
public static void main(String[] args) throws Exception {
File jsonFile = new File("./resource/test.json").getAbsoluteFile();
ObjectMapper mapper = new ObjectMapper();
Pojo pojo = mapper.readValue(jsonFile, Pojo.class);
System.out.println(pojo);
}
}
@JsonIgnoreProperties(ignoreUnknown = true)
class Pojo {
private Integer a;
private BigDecimal b;
private Integer d;
// getters, setters
}
Even though both libraries can read a JSON payload directly from a URL, I suggest downloading it in a separate step, using the best approach you can find. For more info, read the article "Download a File From an URL in Java".
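The separate download step could look roughly like this with only the JDK: the response is streamed to a temp file in small chunks, so it is never held in memory as a whole, and the resulting file can then be handed to whichever parser you choose. This is a sketch under assumptions: DownloadSketch, saveToTemp and download are hypothetical names, and timeouts, HTTP status checks and temp file cleanup are left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DownloadSketch {

    // Copies the stream to a new temp file; Files.copy streams the bytes
    // instead of building one big byte array or String.
    static Path saveToTemp(InputStream in) {
        try {
            Path target = Files.createTempFile("payload", ".json");
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the URL and stores its body in a temp file.
    static Path download(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return saveToTemp(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 1) {
            Path file = download(args[0]);
            System.out.println("Saved to " + file + " (" + file.toFile().length() + " bytes)");
        } else {
            System.out.println("usage: java DownloadSketch <url>");
        }
    }
}
```

The returned path can be passed as a File to the JsonPath or Jackson examples above via file.toFile().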
There are some excellent libraries for parsing large JSON files with minimal resources. One is the popular GSON library. It lets you treat the file as a stream and as objects at the same time: it handles each record as it passes and then discards it, which keeps memory usage low.
If you’re interested in the GSON approach, there is a detailed tutorial available for it.
I only want the integer values stored for keys a, b and d and ignore the rest of the JSON (i.e. ignore whatever is there in the c value). ... How do I do this without loading the entire file in memory?
One way would be to use jq's so-called streaming parser, invoked with the --stream option. This does exactly what you want, but there is a trade-off between space and time, and using the streaming parser is usually more difficult.
In the present case, for example, using the non-streaming (i.e., default) parser, one could simply write:
jq '.a, .b, .d' big.json
Using the streaming parser, you would have to write something like:
jq --stream 'select(length==2 and .[0][-1] == ("a","b","d"))[1]' big.json
or if you prefer:
jq -c --stream '["a","b","d"] as $keys | select(length==2 and (.[0][-1] | IN($keys[])))[1]' big.json
In certain cases, you could achieve significant speedup by wrapping the filter in a call to limit, e.g.
["a","b","d"] as $keys
| limit($keys|length;
select(length==2 and .[0][-1] == ("a","b","d"))[1])
Note on Java and jq
Although there are Java bindings for jq (see e.g. "𝑸: What language bindings are available for Java?" in the jq FAQ), I do not know any that work with the --stream option.
However, since 2.5MB is tiny for jq, you could use one of the available Java-jq bindings without bothering with the streaming parser.
I'm reading and writing to a ByteBuffer
import org.assertj.core.api.Assertions;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
public class Solution{
public static void main(String[] args) throws Exception{
final CharsetEncoder messageEncoder = Charset.forName("ISO-8859-1").newEncoder();
String message = "TRANSACTION IGNORED";
String carrierName= "CARR00AB";
int messageLength = message.length()+carrierName.length()+8;
System.out.println(" --------Fill data---------");
ByteBuffer messageBuffer = ByteBuffer.allocate(4096);
messageBuffer.order(ByteOrder.BIG_ENDIAN);
messageBuffer.putInt(messageLength);
messageBuffer.put(messageEncoder.encode(CharBuffer.wrap(carrierName)));
messageBuffer.put(messageEncoder.encode(CharBuffer.wrap(message)));
messageBuffer.put((byte) 0x2b);
messageBuffer.flip();
System.out.println("------------Extract Data Approach 1--------");
CharsetDecoder messageDecoder = Charset.forName("ISO-8859-1").newDecoder();
int lengthField = messageBuffer.getInt();
System.out.println("lengthField="+lengthField);
int responseLength = lengthField - 12;
System.out.println("responseLength="+responseLength);
String messageDecoded= messageDecoder.decode(messageBuffer).toString();
System.out.println("messageDecoded="+messageDecoded);
String decodedCarrier = messageDecoded.substring(0, carrierName.length());
System.out.println("decodedCarrier="+ decodedCarrier);
String decodedBody = messageDecoded.substring(carrierName.length(), messageDecoded.length() - 1);
System.out.println("decodedBody="+decodedBody);
Assertions.assertThat(messageLength).isEqualTo(lengthField);
Assertions.assertThat(decodedBody).isEqualTo(message);
Assertions.assertThat(decodedBody).isEqualTo(message);
ByteBuffer messageBuffer2 = ByteBuffer.allocate(4096);
messageBuffer2.order(ByteOrder.BIG_ENDIAN);
messageBuffer2.putInt(messageLength);
messageBuffer2.put(messageEncoder.encode(CharBuffer.wrap(carrierName)));
messageBuffer2.put(messageEncoder.encode(CharBuffer.wrap(message)));
messageBuffer2.put((byte) 0x2b);
messageBuffer2.flip();
System.out.println("---------Extract Data Approach 2--------");
byte [] data = new byte[messageBuffer2.limit()];
messageBuffer2.get(data);
String dataString =new String(data, "ISO-8859-1");
System.out.println(dataString);
}
}
It works fine, but then I thought I would refactor it. Please see approach 2 in the code above:
byte [] data = new byte[messageBuffer.limit()];
messageBuffer.get(data);
String dataString =new String(data, "ISO-8859-1");
System.out.println(dataString);
Output: #CARR00ABTRANSACTION IGNORED+
Could you help me understand why the integer goes missing in the second approach while decoding?
Is there any way to extract the integer in the second approach?
You are reading an int from the buffer, which consumes 4 bytes, and then trying to get the whole data after those 4 bytes have already been read.
To resolve this, I call messageBuffer2.clear() after reading the int. Here is the full code:
System.out.println(messageBuffer2.getInt());
byte[] data = new byte[messageBuffer2.limit()];
messageBuffer2.clear();
messageBuffer2.get(data);
String dataString = new String(data, StandardCharsets.ISO_8859_1);
System.out.println(dataString);
Output is:
35
#CARR0033TRANSACTION IGNORED+
Edit: Calling clear() does not erase the data. It resets the buffer's position to zero (and its limit to the capacity), so the next get starts again from the beginning of the buffer, and that is how it fixes the problem.
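It may also help to look at what approach 2 actually prints. The length field is 19 + 8 + 8 = 35, which is stored as the four bytes 00 00 00 23. Decoded as ISO-8859-1, those are three NUL characters, which most consoles do not show, followed by '#' (0x23). That matches the '#' at the start of the output, so the integer is arguably not missing but printed as text. The sketch below reads the int with an absolute get and decodes only the bytes after it. BufferDemo and its methods are illustrative names, and it uses String.getBytes instead of the CharsetEncoder from the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferDemo {

    // Builds the same frame as the question: length, carrier, message, '+'.
    static ByteBuffer build(String carrier, String message) {
        byte[] c = carrier.getBytes(StandardCharsets.ISO_8859_1);
        byte[] m = message.getBytes(StandardCharsets.ISO_8859_1);
        ByteBuffer buf = ByteBuffer.allocate(4096); // big-endian by default
        buf.putInt(c.length + m.length + 8);
        buf.put(c).put(m).put((byte) 0x2b);
        buf.flip();
        return buf;
    }

    // Absolute get: reads the int at index 0 without moving the position.
    static int readLength(ByteBuffer buf) {
        return buf.getInt(0);
    }

    // Decodes everything after the 4-byte length field.
    static String readText(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate(); // own position, shared content
        view.position(4);
        byte[] data = new byte[view.remaining()];
        view.get(data);
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Decodes the whole frame, including the length bytes, as approach 2 does.
    static String decodeAll(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        view.rewind();
        byte[] data = new byte[view.remaining()];
        view.get(data);
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        ByteBuffer buf = build("CARR00AB", "TRANSACTION IGNORED");
        System.out.println("length = " + readLength(buf));
        System.out.println("text   = " + readText(buf));
        System.out.println("first four chars as numbers:");
        for (char ch : decodeAll(buf).substring(0, 4).toCharArray()) {
            System.out.println((int) ch);
        }
    }
}
```

If the goal is only to go back to the start of already flipped data, rewind() is another option: it resets the position to zero but leaves the limit where flip() put it.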
We've been trying to calculate the Amazon Signature for the past few days using the Java code provided on Amazon's site.
I'm not a seasoned Java developer, but we've managed to use their code, found at http://docs.developer.amazonservices.com/en_US/dev_guide/DG_ClientLibraries.html#DG_OwnClientLibrary__Signatures, as a basis for generating a signature. However, we keep getting a "signature doesn't match" error, and we've hit a wall debugging it. Here is our current Java code, with account-specific information omitted:
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import java.security.SignatureException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;
import java.sql.*;
import java.util.Date;
import java.sql.Timestamp;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.Date;
public class Main {
private static final String CHARACTER_ENCODING = "UTF-8";
final static String ALGORITHM = "HmacSHA256";
public static void main(String[] args) throws Exception {
// Change this secret key to yours
String secretKey = "XXXXXXXXXXXXXXXXXXXXXX"; // This secret key does have a forward slash in it (3rd to last character), would that adversely affect anything?
// Use the endpoint for your marketplace
String serviceUrl = "https://mws.amazonservices.com/Orders/2011-01-01";
// Create set of parameters needed and store in a map
HashMap<String, String> parameters = new HashMap<String,String>();
// Add required parameters. Change these as needed.
parameters.put("AWSAccessKeyId", urlEncode("XXXXXXXXXXX"));
parameters.put("Action", urlEncode("ListOrders"));
parameters.put("MarketplaceId.Id.1", urlEncode("ATVPDKIKX0DER"));
parameters.put("Merchant", urlEncode("ALZYDHGPLQNLD"));
parameters.put("OrderStatus.Status.1", urlEncode("PartiallyShipped"));
parameters.put("OrderStatus.Status.2", urlEncode("Unshipped"));
parameters.put("SignatureMethod", urlEncode(ALGORITHM));
parameters.put("SignatureVersion", urlEncode("2"));
parameters.put("Timestamp", urlEncode("2014-01-29T22:11:00Z"));
parameters.put("Version", urlEncode("2011-01-01"));
//parameters.put("SubmittedFromDate", urlEncode("2014-01-28T15:05:00Z"));
// Format the parameters as they will appear in final format
// (without the signature parameter)
String formattedParameters = calculateStringToSignV2(parameters, serviceUrl);
//System.out.println(formattedParameters);
String signature = sign(formattedParameters, secretKey);
System.out.println(urlEncode(signature));
// Add signature to the parameters and display final results
parameters.put("Signature", urlEncode(signature));
System.out.println(calculateStringToSignV2(parameters, serviceUrl));
// TEST AREA
// Signiture sig = new Signiture();
// String HMAC = sig.calculateRFC2104HMAC(parameters, secretKey);
// TEST AREA
}
/* If Signature Version is 2, string to sign is based on following:
*
* 1. The HTTP Request Method followed by an ASCII newline (%0A)
*
* 2. The HTTP Host header in the form of lowercase host,
* followed by an ASCII newline.
*
* 3. The URL encoded HTTP absolute path component of the URI
* (up to but not including the query string parameters);
* if this is empty use a forward '/'. This parameter is followed
* by an ASCII newline.
*
* 4. The concatenation of all query string components (names and
* values) as UTF-8 characters which are URL encoded as per RFC
* 3986 (hex characters MUST be uppercase), sorted using
* lexicographic byte ordering. Parameter names are separated from
* their values by the '=' character (ASCII character 61), even if
* the value is empty. Pairs of parameter and values are separated
* by the '&' character (ASCII code 38).
*
*/
private static String calculateStringToSignV2(Map<String, String> parameters, String serviceUrl)
throws SignatureException, URISyntaxException {
// Sort the parameters alphabetically by storing
// in TreeMap structure
Map<String, String> sorted = new TreeMap<String, String>();
sorted.putAll(parameters);
// Set endpoint value
URI endpoint = new URI(serviceUrl.toLowerCase());
// Create flattened (String) representation
StringBuilder data = new StringBuilder();
/*data.append("POST\n");
data.append(endpoint.getHost());
data.append("\n/");
data.append("\n");*/
Iterator<Entry<String, String>> pairs = sorted.entrySet().iterator();
while (pairs.hasNext()) {
Map.Entry<String, String> pair = pairs.next();
if (pair.getValue() != null) {
data.append( pair.getKey() + "=" + pair.getValue());
}
else {
data.append( pair.getKey() + "=");
}
// Delimit parameters with ampersand (&)
if (pairs.hasNext()) {
data.append( "&");
}
}
return data.toString();
}
/*
* Sign the text with the given secret key and convert to base64
*/
private static String sign(String data, String secretKey)
throws NoSuchAlgorithmException, InvalidKeyException,
IllegalStateException, UnsupportedEncodingException {
Mac mac = Mac.getInstance(ALGORITHM);
//System.out.println(mac);//
mac.init(new SecretKeySpec(secretKey.getBytes(CHARACTER_ENCODING), ALGORITHM));
//System.out.println(mac);//
byte[] signature = mac.doFinal(data.getBytes(CHARACTER_ENCODING));
//System.out.println(signature);//
String signatureBase64 = new String(Base64.encodeBase64(signature), CHARACTER_ENCODING);
System.out.println(signatureBase64);
return new String(signatureBase64);
}
private static String urlEncode(String rawValue) {
String value = (rawValue == null) ? "" : rawValue;
String encoded = null;
try {
encoded = URLEncoder.encode(value, CHARACTER_ENCODING)
.replace("+", "%20")
.replace("*", "%2A")
.replace("%7E","~");
} catch (UnsupportedEncodingException e) {
System.err.println("Unknown encoding: " + CHARACTER_ENCODING);
e.printStackTrace();
}
return encoded;
}
}
Right now we're testing this by hand because I'm trying to submit requests/manage the data through a Filemaker Database, so I may be adding the signature back to the URL query incorrectly. I'm assuming that 1) all of the parameters need to be listed in the query alphabetically and 2) the signature is the only exception that gets appended, last.
For example:
AWSAccessKeyId=AKIAITPQPJO62G4LAE7Q&Action=ListOrders&MarketplaceId.Id.1=XXXXXXXXXXX&Merchant=XXXXXXXXXXX&OrderStatus.Status.1=PartiallyShipped&OrderStatus.Status.2=Unshipped&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2014-01-29T22%3A11%3A00Z&Version=2011-01-01&Signature=tNufJeONZlscTlHs%2FLAWBs7zwsfpIaQcUK%2B5XIPJpcQ%3D
But the response I keep getting is this:
<?xml version="1.0"?>
<ErrorResponse xmlns="https://mws.amazonservices.com/Orders/2011-01-01">
<Error>
<Type>Sender</Type>
<Code>SignatureDoesNotMatch</Code>
<Message>The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.</Message>
</Error>
<RequestID>7532e668-c660-4db6-b129-f5fe5d3fad63</RequestID>
</ErrorResponse>
Any insight would be greatly appreciated. Debugging is fun but I need to get past this already!
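One thing that stands out when comparing the code with its own comment block: the comment lists four parts of the Version 2 string to sign (HTTP method, host, path, query string), but the lines that append the first three are commented out in calculateStringToSignV2, so only the query string is signed. Whether that is the cause of the SignatureDoesNotMatch response is not confirmed here. The following is a sketch that follows the four steps from that comment, assuming a POST request and raw (not yet encoded) parameter values. SigV2Sketch and its method names are hypothetical, and java.util.Base64 stands in for the commons-codec class used above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SigV2Sketch {

    // Percent-encoding with the same three fix-ups as urlEncode in the question.
    static String encode(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8")
                    .replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 4: sorted name=value pairs joined with '&'.
    // TreeMap order equals byte order for the ASCII names used here.
    static String canonicalQuery(Map<String, String> rawParams) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(rawParams).entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // Steps 1 to 4, separated by newlines; only the host is lower-cased.
    static String stringToSign(String method, String host, String path,
                               Map<String, String> rawParams) {
        return method + "\n" + host.toLowerCase() + "\n"
                + (path.isEmpty() ? "/" : path) + "\n" + canonicalQuery(rawParams);
    }

    // HMAC-SHA256 over the string to sign, Base64 encoded.
    static String sign(String data, String secretKey) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return Base64.getEncoder().encodeToString(mac.doFinal(data.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("Action", "ListOrders");
        params.put("Timestamp", "2014-01-29T22:11:00Z");
        String toSign = stringToSign("POST", "mws.amazonservices.com", "/Orders/2011-01-01", params);
        System.out.println(toSign);
        // "not-a-real-key" is a placeholder secret for the demo only.
        System.out.println(encode(sign(toSign, "not-a-real-key")));
    }
}
```

With this layout the signature is computed once over the four-part string, percent-encoded once, and then appended to the request as the Signature parameter.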