I am using AWS S3 as backup storage for data coming into our Spark cluster. Data comes in every second and is processed once 10 seconds of data have been read. The RDD containing the 10 seconds of data is stored to S3 using
rdd.saveAsObjectFile(s3URL + dateFormat.format(new Date()));
This means that we get a lot of files added to S3 each day, named in the format
S3URL/2017/07/23/12/00/10, S3URL/2017/07/23/12/00/20, etc.
From here it is easy to restore the RDD, which is a
JavaRDD<byte[]>
using either
sc.objectFile or the AmazonS3 API
The problem is that, to reduce the number of files we need to iterate through, we run a daily cron job that goes through each file created during the day, bunches the data together, and stores the new RDD to S3. This is done as follows:
List<byte[]> dataList = new ArrayList<>(); // A list of all read messages

/* Get all messages from S3 and store them in the above list */
try {
    final ListObjectsV2Request req = new ListObjectsV2Request()
            .withBucketName("bucketname")
            .withPrefix("logs/" + dateString);
    ListObjectsV2Result result;
    do {
        result = s3Client.listObjectsV2(req);
        for (S3ObjectSummary objectSummary : result.getObjectSummaries()) {
            System.out.println(" - " + objectSummary.getKey()
                    + " (size = " + objectSummary.getSize() + ")");
            if (objectSummary.getKey().contains("part-00000")) { // The messages are stored in files named "part-00000"
                S3Object object = s3Client.getObject(
                        new GetObjectRequest(objectSummary.getBucketName(), objectSummary.getKey()));
                InputStream objectData = object.getObjectContent();
                byte[] byteData = new byte[(int) objectSummary.getSize()]; // The size of the messages differs
                // A single read() call may return before filling the buffer,
                // so loop until the whole object has been read
                int offset = 0;
                while (offset < byteData.length) {
                    int n = objectData.read(byteData, offset, byteData.length - offset);
                    if (n == -1) break;
                    offset += n;
                }
                dataList.add(byteData); // Add the message to the list
                objectData.close();
            }
        }
        /* Listing results come back one page at a time; the continuation token
         * has to be passed back in until every page has been iterated through. */
        System.out.println("Next Continuation Token : " + result.getNextContinuationToken());
        req.setContinuationToken(result.getNextContinuationToken());
    } while (result.isTruncated());
} catch (AmazonServiceException ase) {
    System.out.println("Caught an AmazonServiceException, which means your request made it "
            + "to Amazon S3, but was rejected with an error response for some reason.");
    System.out.println("Error Message: " + ase.getMessage());
    System.out.println("HTTP Status Code: " + ase.getStatusCode());
    System.out.println("AWS Error Code: " + ase.getErrorCode());
    System.out.println("Error Type: " + ase.getErrorType());
    System.out.println("Request ID: " + ase.getRequestId());
} catch (AmazonClientException ace) {
    System.out.println("Caught an AmazonClientException, which means the client encountered "
            + "an internal error while trying to communicate with S3, "
            + "such as not being able to access the network.");
    System.out.println("Error Message: " + ace.getMessage());
} catch (IOException e) {
    e.printStackTrace();
}
JavaRDD<byte[]> messages = sc.parallelize(dataList); // Loads the messages into an RDD
messages.saveAsObjectFile("S3URL/daily_logs/" + dateString);
This all works fine, but now I am not sure how to actually restore the data to a manageable state again. If I use
sc.objectFile
to restore the RDD, I end up with a JavaRDD<byte[]> where each byte[] is actually a JavaRDD<byte[]> in itself. How can I restore the nested JavaRDD from the byte[] elements inside the outer JavaRDD<byte[]>?
I hope this somehow makes sense, and I am grateful for any help. In the worst case I will have to come up with another way to back up the data.
Best regards
Mathias
I solved it: instead of storing a nested RDD, I flatMapped all the byte[] into a single JavaRDD and stored that one instead.
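The same flatMap idea, stripped of Spark: collapse one level of nesting so each element is a single message again. A minimal JDK-streams sketch (the FlattenSketch class is illustrative, not from the original job, which would call rdd.flatMap with equivalent logic):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: each inner list stands in for the messages recovered from
// one 10-second interval; flatMap removes the nesting so the result is one
// flat collection of messages, ready to be parallelized and saved once.
public class FlattenSketch {
    public static List<byte[]> flatten(List<List<byte[]>> nested) {
        return nested.stream()
                .flatMap(List::stream) // one level of nesting removed
                .collect(Collectors.toList());
    }
}
```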
Related
I had been trying to detect faces in a video stored on Amazon S3; the faces have to be matched against the collection that holds the faces to be searched for in the video.
I have used Amazon Rekognition's video detection (VideoDetect).
My piece of code, goes like this:
CreateCollection createCollection = new CreateCollection(collection);
createCollection.makeCollection();

AddFacesToCollection addFacesToCollection = new AddFacesToCollection(collection, bucketName, image);
addFacesToCollection.addFaces();

VideoDetect videoDetect = new VideoDetect(video, bucketName, collection);
videoDetect.CreateTopicandQueue();
try {
    videoDetect.StartFaceSearchCollection(bucketName, video, collection);
    if (videoDetect.GetSQSMessageSuccess()) {
        videoDetect.GetFaceSearchCollectionResults();
    }
} catch (Exception e) {
    e.printStackTrace();
    return false;
}
videoDetect.DeleteTopicandQueue();
return true;
Things seem to work fine up to StartFaceSearchCollection: a jobId is created, and a queue as well. But when it comes around to GetSQSMessageSuccess, no message is ever returned.
The code which is trying to fetch the message is :
ReceiveMessageRequest.Builder receiveMessageRequest = ReceiveMessageRequest.builder().queueUrl(sqsQueueUrl);
messages = sqs.receiveMessage(receiveMessageRequest.build()).messages();
It has the correct sqsQueueUrl, which exists, but I never get anything in the messages list.
On timeout it gives me this exception:
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: sqs.region.amazonaws.com
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
Caused by: java.net.UnknownHostException: sqs.region.amazonaws.com
So is there an alternative to this: instead of the SQS message, can we track/poll the jobId some other way? Or am I missing something?
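Before giving up on SQS polling, note that the exception itself is telling: the hostname it failed to resolve is literally sqs.region.amazonaws.com, which suggests the placeholder string "region" ended up in the endpoint instead of an actual region name. A hedged configuration sketch of pinning the region explicitly when building the v2 client (us-east-1 is only an illustrative choice; use the region your queue actually lives in):

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sqs.SqsClient;

// Illustrative configuration sketch: an explicit region makes the SDK resolve
// a real endpoint such as sqs.us-east-1.amazonaws.com.
SqsClient sqs = SqsClient.builder()
        .region(Region.US_EAST_1)
        .build();
```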
For reference, a simple working snippet that receives SQS messages given a valid sqsQueueUrl:
ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(sqsQueueUrl);
final List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
for (final Message message : messages) {
    System.out.println("Message");
    System.out.println("  MessageId:     " + message.getMessageId());
    System.out.println("  ReceiptHandle: " + message.getReceiptHandle());
    System.out.println("  MD5OfBody:     " + message.getMD5OfBody());
    System.out.println("  Body:          " + message.getBody());
    for (final Entry<String, String> entry : message.getAttributes().entrySet()) {
        System.out.println("Attribute");
        System.out.println("  Name:  " + entry.getKey());
        System.out.println("  Value: " + entry.getValue());
    }
}
System.out.println();
I'm maintaining some legacy Android code which reads "calibration" values for our product from a text file. In theory, the user can manually adjust the calibration file, so every 10 seconds we reload the file to check for new values. The user can also screw the file up, so if the file is found to be unreadable/unparseable we write out a default version of the "calibration".
Here's the relevant bit of code where we read and parse terms:
public boolean ReadCalibFromFile() {
    boolean res = true;
    String title = null;
    String data = null;
    try {
        File direct = new File(WSDataStorageUtils.getInstance().getCurrentDirLocation());
        if (!direct.exists()) {
            File filesDirectory = new File(WSDataStorageUtils.getInstance().getCurrentDirLocation());
            filesDirectory.mkdirs();
        }
        File calib = new File(new File(WSDataStorageUtils.getInstance().getCurrentDirLocation()), currentCalibFile + ".txt");
        Logger.i(tag, "Using calibration file: " + currentCalibFile + ".txt");
        if (!calib.exists()) {
            Logger.i(tag, "Calibration file doesn't exist! Recreating...");
            SaveCalibFiles();
            // having saved the standard files, if we can't find the file we might have a wrong name
            if (!calib.exists()) {
                Logger.e(tag, "This file: " + currentCalibFile + ", doesn't exist, returning to default");
                currentCalibFile = DEFAULT_CALIB_FILE;
                calib = new File(new File(WSDataStorageUtils.getInstance().getCurrentDirLocation()), currentCalibFile + ".txt");
            }
            if (!calib.exists()) { // even after returning to default value
                // we are really screwed
                Logger.e(tag, "Can't even recover to default file! This is bad!");
                return false;
            }
        }
        Scanner read = new Scanner(calib);
        read.useDelimiter("=|\\n");
        while (read.hasNext()) {
            title = read.next();
            data = read.next();
            if (!nextData(data, title)) {
                Logger.e(tag, "Error in reading from Calibration file");
                Logger.e(tag, "title = " + title + " data = " + data);
                res = false;
                break;
            }
        }
        read.close();
        if (!VersionChecked) {
            res = false;
        }
    } catch (Exception e) {
        e.printStackTrace();
        Logger.e("ReadCalibFromfile", "Received an exception: " + e.getMessage());
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        e.printStackTrace(pw);
        String stackTrace = sw.toString();
        Logger.e("ReadCalibFromFile", stackTrace);
        Logger.e(tag, "Last good values from file were:");
        Logger.e(tag, "title = " + title + " data = " + data);
        res = false;
    }
    return res;
}
Most of the time this works fine. We left the app running over the weekend (no-one was editing the calibration file) and in the middle of the night the file somehow produced an error. The code "fixed" the error by overwriting the old file, but it's the error itself that I don't understand.
Logs excerpted:
2018-07-13 00:25:12.441 [Thread-240] INFO Calibration: Using calibration file: name.txt
2018-07-13 00:25:12.470 [Thread-240] ERROR ReadCalibFromfile: Received an exception: null
2018-07-13 00:25:12.470 [Thread-240] ERROR ReadCalibFromFile: java.util.NoSuchElementException
at java.util.Scanner.next(Scanner.java:968)
at java.util.Scanner.next(Scanner.java:941)
at com.mycompany.myapp.Calibration.ReadCalibFromFile(Calibration.java:190)
at com.mycompany.myapp.Calibration.run(Calibration.java:108)
at java.lang.Thread.run(Thread.java:818)
2018-07-13 00:25:12.471 [Thread-240] ERROR Calibration: Last good values from file were:
2018-07-13 00:25:12.471 [Thread-240] ERROR Calibration: title = Version=1.1
firstRegionLimit=8
secondRegionLimit=50
CoafPressRegion0_0=-3.0
CoafPressRegion0_1=3.0
CoafPressRegion0_2=0.01
CoafPressRegion0_3=0.0
CoafPressRegion1_0=1.5
CoafPressRegion1_1=1.4
CoafPressRegion1_2=0.02
CoafPressRegion1_3=0.0
CoafPressRegion2_0=-10.0
CoafPressRegion2_1=5.0
CoafPressRegion2_2=0.015
CoafPressRegion2_3=0.0
*= " Version : VERSION NUMBER - PLEASE DON'T CHANGE"
*= " firstRegionLimit : the high limit of the first range "
*= " secondRegionLimit : the high limit of the second range "
*= " mCoafPressRegion : the coefficient for a certain region "
data = null
As you can see, the app fails to find any delimiters, although the delimiters have been set to "=" and "\n". To be clear, this code works on exactly this file 99% of the time. The entire first "read", which is supposed to stop at a delimiter, holds text containing many delimiters (even if you don't believe me that there are real '\n' between the lines, you can see the '=' characters).
When this function returns false I rewrite the file and all works fine afterwards... for a few more hours it is read correctly every 10 seconds, and then a little over 8 hours later it has the same problem.
Has anyone else encountered such an issue? I can't figure out how it could be a timing issue, since the file is read once every 10 seconds and was correctly closed the time before.
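One suspect in the loop above is the hasNext()/next() pairing: hasNext() only guarantees that ONE more token exists, yet the body calls next() twice. A file with an odd number of tokens (e.g. one caught mid-write during a reload) throws exactly this NoSuchElementException. A self-contained sketch of the pitfall (the parsePairs helper is hypothetical, not from the original code):

```java
import java.util.NoSuchElementException;
import java.util.Scanner;

// Demonstrates the hasNext()/double-next() pitfall: with an odd token count,
// the second next() in the loop body throws NoSuchElementException, matching
// the logged stack trace.
public class ScannerPitfall {
    public static boolean parsePairs(String text) {
        Scanner read = new Scanner(text);
        read.useDelimiter("=|\\n");
        try {
            while (read.hasNext()) {
                String title = read.next();
                String data = read.next(); // throws if the last pair is incomplete
            }
            return true;
        } catch (NoSuchElementException e) {
            return false; // incomplete final title/data pair
        } finally {
            read.close();
        }
    }
}
```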
I am trying to download a video file from an S3 account. It neither gives me an error nor can I see the downloaded file, so I don't know its location. I suspect the file is not actually being downloaded. Please suggest a solution. Thanks in advance. Here is my code:
try {
    System.out.println("Downloading an object");
    S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
    System.out.println("Content-Type: " + object.getObjectMetadata().getContentType());
    object.getObjectContent();
} catch (AmazonServiceException ex) {
    System.out.println("Caught an AmazonServiceException, which means your request made it "
            + "to Amazon S3, but was rejected with an error response for some reason.");
    System.out.println("Error Message: " + ex.getMessage());
    System.out.println("HTTP Status Code: " + ex.getStatusCode());
    System.out.println("AWS Error Code: " + ex.getErrorCode());
    System.out.println("Error Type: " + ex.getErrorType());
    System.out.println("Request ID: " + ex.getRequestId());
} catch (AmazonClientException ace) {
    System.out.println("Caught an AmazonClientException, which means the client encountered "
            + "a serious internal problem while trying to communicate with S3, "
            + "such as not being able to access the network.");
    System.out.println("Error Message: " + ace.getMessage());
}
object.getObjectContent() gives the stream of data from the HTTP connection. Is there any exception when reading from the stream? You can also try s3.getObjectAsString().
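To elaborate: getObjectContent() only hands back a stream, and nothing lands on disk until the caller copies that stream somewhere. A minimal copy helper using plain JDK streams (the StreamToFile name and the video.mp4 destination below are illustrative, not part of the AWS SDK):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Copies every byte from an input stream to an output stream in 8 KB chunks
// and returns the number of bytes transferred.
public class StreamToFile {
    public static long copy(InputStream in, OutputStream out) {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        try {
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // forward each chunk as it arrives
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total; // bytes copied
    }
}
```

Usage would look roughly like: try (InputStream in = object.getObjectContent(); OutputStream out = new FileOutputStream("video.mp4")) { StreamToFile.copy(in, out); } — and the stream should be closed either way, or the underlying HTTP connection stays open.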
I would like to use the batchWriteItem method of the Amazon SDK to put a lot of items into a table.
I retrieve the items from Kinesis, and it has a lot of shards.
I used this method for one item:
public static void addSingleRecord(Item thingRecord) {
    // Add an item
    try {
        DynamoDB dynamo = new DynamoDB(dynamoDB);
        Table table = dynamo.getTable(dataTable);
        table.putItem(thingRecord);
    } catch (AmazonServiceException ase) {
        System.out.println("addThingsData request to AWS was rejected with an error response for some reason.");
        System.out.println("Error Message: " + ase.getMessage());
        System.out.println("HTTP Status Code: " + ase.getStatusCode());
        System.out.println("AWS Error Code: " + ase.getErrorCode());
        System.out.println("Error Type: " + ase.getErrorType());
        System.out.println("Request ID: " + ase.getRequestId());
    } catch (AmazonClientException ace) {
        System.out.println("addThingsData - Caught an AmazonClientException, which means the client encountered "
                + "a serious internal problem while trying to communicate with AWS, "
                + "such as not being able to access the network.");
        System.out.println("Error Message: " + ace.getMessage());
    }
}
public static void addThings(String thingDatum) {
    Item itemJ2 = Item.fromJSON(thingDatum);
    addSingleRecord(itemJ2);
}
The item is passed from:
private void processSingleRecord(Record record) {
    // TODO Add your own record processing logic here
    String data = null;
    try {
        // For this app, we interpret the payload as UTF-8 chars.
        data = decoder.decode(record.getData()).toString();
        System.out.println("**processSingleRecord - data " + data);
        AmazonDynamoDBSample.addThings(data);
    } catch (NumberFormatException e) {
        LOG.info("Record does not match sample record format. Ignoring record with data; " + data);
    } catch (CharacterCodingException e) {
        LOG.error("Malformed data: " + data, e);
    }
}
Now, if I want to put a lot of records, I will use:
public static void writeMultipleItemsBatchWrite(Item thingRecord) {
    try {
        dataTableWriteItems.addItemToPut(thingRecord);
        System.out.println("Making the request.");
        BatchWriteItemOutcome outcome = dynamo.batchWriteItem(dataTableWriteItems);
        do {
            // Check for unprocessed keys, which can happen if you exceed provisioned throughput
            Map<String, List<WriteRequest>> unprocessedItems = outcome.getUnprocessedItems();
            if (outcome.getUnprocessedItems().size() == 0) {
                System.out.println("No unprocessed items found");
            } else {
                System.out.println("Retrieving the unprocessed items");
                outcome = dynamo.batchWriteItemUnprocessed(unprocessedItems);
            }
        } while (outcome.getUnprocessedItems().size() > 0);
    } catch (Exception e) {
        System.err.println("Failed to retrieve items: ");
        e.printStackTrace(System.err);
    }
}
But how can I send the last group? I only send when I have accumulated 25 items, and at the end of the stream the count is lower than that.
You can write items to your DynamoDB table one at a time using the Document SDK in a Lambda function attached to your Kinesis Stream using PutItem or UpdateItem. This way, you can react to Stream Records as they appear in the Stream without worrying about whether there are any more records to process. Behind the scenes, BatchWriteItem consumes the same amount of write capacity units as the corresponding PutItem calls. A BatchWriteItem will be as latent as the PUT in the batch that takes the longest. Therefore, using BatchWriteItem, you may experience higher average latency than with parallel PutItem/UpdateItem calls.
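If you do stay with BatchWriteItem, the 25-item limit means the final, smaller group has to be flushed explicitly once the input is exhausted. A sketch of that buffering logic with plain lists (BatchBuffer and the batchWriter callback are illustrative stand-ins for the real dynamo.batchWriteItem call):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers items and hands them to batchWriter in groups of at most 25;
// flush() must be called once at end-of-input to send the last partial batch.
public class BatchBuffer<T> {
    private static final int MAX_BATCH = 25; // DynamoDB BatchWriteItem limit
    private final List<T> buffer = new ArrayList<>();
    private final Consumer<List<T>> batchWriter;
    private int batchesSent = 0;

    public BatchBuffer(Consumer<List<T>> batchWriter) {
        this.batchWriter = batchWriter;
    }

    public void add(T item) {
        buffer.add(item);
        if (buffer.size() == MAX_BATCH) {
            flush(); // full batch: send immediately
        }
    }

    public void flush() { // call at end-of-input for the last, smaller group
        if (!buffer.isEmpty()) {
            batchWriter.accept(new ArrayList<>(buffer));
            batchesSent++;
            buffer.clear();
        }
    }

    public int getBatchesSent() {
        return batchesSent;
    }
}
```

The same shutdown hook that stops the Kinesis record processor would call flush() so the tail of the stream is not lost.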
My Android program isn't working. I am using normal client-server sockets. I have tested my server with telnet and it works fine, but when I try it with my Android program, it doesn't work (more details in a second). Here's my code:
Socket s = null;
try {
    String SocketServerAddress = db.getPhSsServerAddress();
    Integer SocketServerPort = db.getPhSsServerPort();
    s = new Socket(SocketServerAddress, SocketServerPort);
    Log.d(MY_DEBUG_TAG, "Setting up Socket: " + SocketServerAddress + ":" + SocketServerPort);
    DataOutputStream out = new DataOutputStream(s.getOutputStream());
    DataInputStream in = new DataInputStream(s.getInputStream());
    Log.d(MY_DEBUG_TAG, "Connected to: " + s.getInetAddress() + " on port " + s.getPort());
    out.writeUTF("Helo, Server");
    out.flush();
    Log.d(MY_DEBUG_TAG, "Bytes written: " + out.size());
    String st = in.readUTF();
    Log.d(MY_DEBUG_TAG, "SocketServerResponse: " + st);
} catch (UnknownHostException e) {
    Log.e(MY_ERROR_TAG, "UnknownHostException: " + e.getMessage() + "; " + e.getCause());
} catch (IOException e) {
    Log.e(MY_ERROR_TAG, "IOException: " + e.getMessage() + "; " + e.getCause() + "; " + e.getLocalizedMessage());
} finally {
    try {
        s.close();
    } catch (IOException e) {
        Log.e(MY_ERROR_TAG, "IOException on socket.close(): " + e.getMessage() + "; " + e.getCause());
    }
}
All I ever get here is a thrown IOException with no message or cause attached. The specific line causing the error is the String st = in.readUTF() line. If I comment out that line, my code runs fine (no exceptions thrown), but my server does not acknowledge that any data has been sent to it. And of course I don't get any data back, since that line is commented out.
So, how can I figure out what the problem is? Tonight I am going to look at what is being passed with Wireshark to see if that gives any insight.
Is the server using readUTF() and writeUTF() too? writeUTF() writes data in a unique format that can only be understood by readUTF(), which won't understand anything else.
EDIT: EOFException means that there is no more data. You should catch it separately and handle it by closing the socket, etc. It can certainly be caused spuriously by readUTF() trying to read data that wasn't written with writeUTF().
And deciding it was an IOException when it was really an EOFException means you didn't print out or log the exception itself, just its message. Always use the log methods provided for exceptions, or at least use Exception.toString().
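The framing described above is easy to verify offline: writeUTF() prefixes the payload with a two-byte length and encodes it in "modified UTF-8", and only readUTF() understands that framing, which is why a plain-text server sees nothing it can parse. A self-contained round-trip sketch over in-memory streams (no sockets involved; the UtfRoundTrip class is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Shows the writeUTF()/readUTF() framing: a 2-byte length prefix followed by
// the modified-UTF-8 payload, readable only by readUTF().
public class UtfRoundTrip {
    public static byte[] encode(String msg) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(msg); // 2-byte length + payload
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams should not fail
        }
    }

    public static String decode(byte[] framed) {
        try {
            return new DataInputStream(new ByteArrayInputStream(framed)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the server expects plain lines instead, sending with a plain OutputStreamWriter and a newline (and reading with BufferedReader) on both ends would avoid the framing mismatch entirely.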
As I remember, I once had a problem with DataInputStream... try doing this:
in = new DataInputStream(new BufferedInputStream(socket.getInputStream()));