Batching multiple files to Amazon S3 using the Java SDK

I'm trying to upload multiple files to Amazon S3 under the same key, by appending the files. I have a list of file names and want to upload/append the files in that order. I am pretty much exactly following this tutorial, but I am looping through each file and uploading it as a part. Because the files are on HDFS (the Path is actually org.apache.hadoop.fs.Path), I am using an input stream to send the file data. Some pseudocode is below (I am commenting the blocks that are word for word from the tutorial):
// Create a list of UploadPartResponse objects. You get one of these for
// each part upload.
List<PartETag> partETags = new ArrayList<PartETag>();

// Step 1: Initialize.
InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(
        bk.getBucket(), bk.getKey());
InitiateMultipartUploadResult initResponse =
        s3Client.initiateMultipartUpload(initRequest);

try {
    int i = 1; // part number
    for (String file : files) {
        Path filePath = new Path(file);

        // Get the input stream and content length
        long contentLength = fss.get(branch).getFileStatus(filePath).getLen();
        InputStream is = fss.get(branch).open(filePath);

        long filePosition = 0;
        while (filePosition < contentLength) {
            // create request
            // upload part and add response to our list
            i++;
        }
    }

    // Step 3: Complete.
    CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
            bk.getBucket(),
            bk.getKey(),
            initResponse.getUploadId(),
            partETags);
    s3Client.completeMultipartUpload(compRequest);
} catch (Exception e) {
    //...
}
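For reference, here is roughly what the elided "create request / upload part" step looks like in the AWS low-level multipart example this question follows; this is a sketch only (partSize is a variable I introduce here), and note that S3 requires every part except the last to be at least 5 MB:

// Sketch of the elided step, following the AWS SDK v1 low-level multipart pattern.
// bk, s3Client, is, i, contentLength and filePosition come from the snippet above.
long partSize = Math.min(5L * 1024 * 1024, contentLength - filePosition);
UploadPartRequest uploadRequest = new UploadPartRequest()
        .withBucketName(bk.getBucket())
        .withKey(bk.getKey())
        .withUploadId(initResponse.getUploadId())
        .withPartNumber(i)
        .withInputStream(is)
        .withPartSize(partSize);
partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
filePosition += partSize;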
However, I am getting the following error:
com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 2C1126E838F65BB9), S3 Extended Request ID: QmpybmrqepaNtTVxWRM1g2w/fYW+8DPrDwUEK1XeorNKtnUKbnJeVM6qmeNcrPwc
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1109)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:741)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:461)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:296)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3743)
at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2617)
If anyone knows what the cause of this error might be, that would be greatly appreciated. Alternatively, if there is a better way to concatenate a bunch of files into one S3 key, that would be great as well. I tried using Java's built-in SequenceInputStream, but that did not work. For reference, the total size of all the files could be as large as 10-15 GB.

I know it's probably a bit late but worth giving my contribution.
I've managed to solve a similar problem using the SequenceInputStream.
The trick is being able to calculate the total size of the result file and then feeding the SequenceInputStream with an Enumeration<InputStream>.
Here's some example code that might help:
public void combineFiles() {
    List<String> files = getFiles();
    long totalFileSize = files.stream()
            .map(this::getContentLength)
            .reduce(0L, (f, s) -> f + s);

    try {
        try (InputStream partialFile = new SequenceInputStream(getInputStreamEnumeration(files))) {
            ObjectMetadata resultFileMetadata = new ObjectMetadata();
            resultFileMetadata.setContentLength(totalFileSize);
            s3Client.putObject("bucketName", "resultFilePath", partialFile, resultFileMetadata);
        }
    } catch (IOException e) {
        LOG.error("An error occurred while combining files.", e);
    }
}

private Enumeration<? extends InputStream> getInputStreamEnumeration(List<String> files) {
    return new Enumeration<InputStream>() {
        private Iterator<String> fileNamesIterator = files.iterator();

        @Override
        public boolean hasMoreElements() {
            return fileNamesIterator.hasNext();
        }

        @Override
        public InputStream nextElement() {
            try {
                return new FileInputStream(Paths.get(fileNamesIterator.next()).toFile());
            } catch (FileNotFoundException e) {
                System.err.println(e.getMessage());
                throw new RuntimeException(e);
            }
        }
    };
}
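The snippet assumes getFiles() and getContentLength(String) helpers that are not shown; a minimal sketch of what they might look like for local files (purely illustrative, names and paths hypothetical):

// Illustrative implementations of the helpers assumed above (java.util.Arrays, java.io.File).
private List<String> getFiles() {
    // Return the file paths to concatenate, in the order they should appear.
    return Arrays.asList("/data/part-00000", "/data/part-00001");
}

private long getContentLength(String file) {
    // For local files; for HDFS you would ask the FileSystem for the FileStatus instead.
    return new File(file).length();
}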
Hope this helps!

Related

Huffman code - can't decompress BMP files using BitSet

I built a classic Huffman coder, with an encoder and decoder. I noticed I have a problem: I use a BitSet to compress the input file, but the BitSet approach does not decode all the files I send it. For example, when I send a txt file it works great, but when I send other files, like BMP, it doesn't work.
Before I used BitSet the code worked, but without any compression, so I'm afraid the problem is with BitSet.
The decoder I built is:
public void Decompress(String[] input_names, String[] output_names) {
    HuffmanVerticle tree = new HuffmanVerticle();
    tree = readTreeFile(output_names);
    restoreInput(tree, output_names, input_names);
}

public static void restoreInput(HuffmanVerticle tree, String[] binary_names, String[] original_names) {
    BitSet huffmanCodeBit;
    try {
        FileOutputStream to_original = new FileOutputStream(original_names[0]);
        FileInputStream binary = new FileInputStream(binary_names[0]);
        ObjectInputStream s = new ObjectInputStream(binary);
        huffmanCodeBit = (BitSet) s.readObject();
        System.out.println(huffmanCodeBit.toString());
        int index = 0;
        while (huffmanCodeBit.length() > index) {
            HuffmanVerticle tmp = tree;
            while (!tmp.isNullTree()) {
                boolean bit = huffmanCodeBit.get(index);
                index++;
                System.out.println(bit);
                if (!bit)
                    tmp = tmp.left;
                else
                    tmp = tmp.right;
            }
            to_original.write(tmp.character);
        }
        binary.close();
        to_original.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
What am I missing here? Why doesn't the code work for certain files? The files that come back are unusable.
The code does not finish for BMP files at all, even after half an hour, whereas for txt files, for example, it runs very fast.
Thanks for your help.
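One detail worth checking (not from the original post): BitSet.length() returns the index of the highest set bit plus one, so any trailing zero bits in the compressed stream are silently lost when only the BitSet is serialized, and the decode loop cannot know where the real data ends. A minimal, self-contained sketch of storing an explicit bit count alongside the BitSet (all names hypothetical):

import java.io.*;
import java.util.BitSet;

public class BitSetRoundTrip {
    public static void main(String[] args) throws Exception {
        BitSet bits = new BitSet();
        bits.set(0);            // bit 0 is 1; bits 1..7 stay 0
        int bitCount = 8;       // the real number of encoded bits, including trailing zeros

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeInt(bitCount);   // store the exact count next to the BitSet
            out.writeObject(bits);
        }

        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            int count = in.readInt();
            BitSet restored = (BitSet) in.readObject();
            for (int index = 0; index < count; index++) {   // loop over count, not length()
                boolean bit = restored.get(index);
                System.out.println(index + ": " + bit);     // feed 'bit' into the tree walk here
            }
        }
    }
}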

PDF file encoding to Base64 takes too much time when 100k documents need to be encoded

I am trying to encode PDF documents to Base64. If the number is small (like 2000 documents) it works nicely, but I have 100k plus documents to encode.
It takes a long time to encode all those files. Is there a better approach for encoding a large data set?
Please find my current approach below:
String filepath = doc.getPath().concat(doc.getFilename());
file = new File(filepath);
if (file.exists() && !file.isDirectory()) {
    try {
        FileInputStream fileInputStreamReader = new FileInputStream(file);
        byte[] bytes = new byte[(int) file.length()];
        fileInputStreamReader.read(bytes);
        encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
        fileInputStreamReader.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
}
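As a side note (not part of the original question): InputStream.read(byte[]) is not guaranteed to fill the whole array in one call, so a minimal sketch using Files.readAllBytes avoids that and is shorter:

// Sketch only: read the whole file and Base64-encode it (java.nio.file.Files, java.util.Base64).
// Variable names mirror the snippet above; error handling omitted for brevity.
byte[] bytes = Files.readAllBytes(file.toPath());
String encodedfile = Base64.getEncoder().encodeToString(bytes);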
Try this:
Figure out how many files you need to encode.
long files = Files.list(Paths.get(directory)).count();
Split them up into a reasonable amount that a thread can handle in Java. E.g. if you have 100k files to encode, split them into 100 lists of 1,000, something like that.
int currentIndex = 0;
for (File file : filesInDir) {
    if (fileMap.get(currentIndex).size() >= cap)
        currentIndex++;
    fileMap.get(currentIndex).add(file);
}
/** It's going to take a little more effort than this, but it's the idea I'm trying to show you */
Execute each worker thread one after another if the computer's resources are available.
for (Integer key : fileMap.keySet()) {
    new WorkerThread(fileMap.get(key)).start();
}
You can check the current resources available with:
public boolean areResourcesAvailable() {
    return imNotThatNice();
}

/**
 * Gets the resource utility instance
 *
 * @return the current instance of the resource utility
 */
private static OperatingSystemMXBean getInstance() {
    if (ResourceUtil.instance == null) {
        ResourceUtil.instance = ManagementFactory.getOperatingSystemMXBean();
    }
    return ResourceUtil.instance;
}
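The answer refers to a WorkerThread class it never shows; a minimal sketch of what such a worker might look like for this task (the class and its behaviour are my assumptions, not from the original answer):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical worker: Base64-encodes its own slice of the files.
class WorkerThread extends Thread {
    private final List<File> files;
    private final Map<String, String> results = new HashMap<String, String>();

    WorkerThread(List<File> files) {
        this.files = files;
    }

    @Override
    public void run() {
        for (File file : files) {
            try {
                byte[] bytes = Files.readAllBytes(file.toPath());
                results.put(file.getName(), Base64.getEncoder().encodeToString(bytes));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}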

GATE - Loading a gapp file is taking time. How can I reduce it?

I am working on a GATE-related project. I am creating a pipeline through GATE and using the resulting .gapp file in my Java code. Loading the .gapp file takes around 10 seconds, which is too much for my application.
How can I solve this issue?
The second problem is that I have to call System.exit after processing a document to release the memory; if I don't, I get an OutOfMemoryError.
So, how can I solve these issues?
My code is like:
public class GateMainClass {
    CorpusController application = null;
    Corpus corpus = null;

    public void processApp(String gateHome, String gapFilePath, String docSourcePath, String isSingleDocument) throws ResourceInstantiationException {
        try {
            if (!Gate.isInitialised()) {
                Gate.runInSandbox(true);
                Gate.setGateHome(new File(gateHome));
                Gate.setPluginsHome(new File(gateHome, "plugins"));
                Gate.init();
            }
            application = (CorpusController) PersistenceManager.loadObjectFromFile(new File(gapFilePath));
            corpus = Factory.newCorpus("main");
            application.setCorpus(corpus);
            if (isSingleDocument.equals(Boolean.TRUE.toString())) {
                Document doc = Factory.newDocument(new File(docSourcePath).toURI().toURL());
                corpus.add(doc);
            } else {
                File[] files;
                File folder = new File(docSourcePath);
                files = folder.listFiles(new FileUtil.CustomFileNameFilter(".xml"));
                Arrays.sort(files, LastModifiedFileComparator.LASTMODIFIED_REVERSE);
                for (int i = 0; i < files.length; i++) {
                    Document doc = Factory.newDocument(files[i].toURI().toURL());
                    corpus.add(doc);
                }
            }
            application.execute();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            corpus.clear();
        }
    }
}
And my gapp file (pipeline) contains:
1. Document Reset PR
2. ANNIE English Tokenizer
3. ANNIE Gazetteer
4. ANNIE Sentence Splitter
5. ANNIE POS Tagger
6. GATE Morphological Analyser
7. Flexible Gazetteer
8. HTML Markup Transfer
9. Main JAPE file
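One common pattern for the memory problem (not from the original post) is to process one document at a time and unload it with Factory.deleteResource after each run, instead of exiting the JVM. A minimal sketch, reusing the application and corpus fields and the files array from the code above (exception handling omitted):

// Sketch only: run the pipeline document by document and free each document
// afterwards so memory does not accumulate across runs.
for (File f : files) {
    Document doc = Factory.newDocument(f.toURI().toURL());
    corpus.add(doc);
    try {
        application.execute();
    } finally {
        corpus.clear();
        Factory.deleteResource(doc);   // releases the document from GATE
    }
}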

PDFBox - options to increase the performance

I have 2 questions regarding the PDFBox library (Java):
I have just started using the PDFBox library and, though it's working well, I couldn't help noticing that it runs slower than iText (the other PDF library I used) when using the ut.mergeDocuments() method (against concat_pdf.main(..) of iText).
Does anyone know if/how I can increase the performance of this tool?
I see that PDFBox is more sensitive to encrypted files. iText allows me to merge encrypted PDFs, but PDFBox throws an exception stating:
"PDFBoxConcat failedjava.io.IOException: Error: destination PDF is encrypted, can't append encrypted PDF documents."
Does anyone know why it works in iText but not in PDFBox?
My guess is that iText is more sophisticated, knowing exactly what is encrypted and allowing actions accordingly, while PDFBox just checks whether the file is encrypted or not.
Can anyone confirm this for me?
I found this code (open source) in PDFBox's mergeDocuments() method, where you can see the check for encryption:
if (destination.isEncrypted())
{
    throw new IOException("Error: destination PDF is encrypted, can't append encrypted PDF documents.");
}
I tried commenting this check out, but the merged document came out as gibberish.
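Not from the original thread, but one workaround worth sketching: if you have the document passwords (or the files only carry an owner password), you can remove the encryption with PDFBox itself before merging, rather than bypassing the check. A sketch against the PDFBox 2.x API with hypothetical file names:

// Sketch only (org.apache.pdfbox.pdmodel.PDDocument, PDFBox 2.x):
// strip encryption from a source PDF, then merge the decrypted copy as usual.
PDDocument doc = PDDocument.load(new File("encrypted.pdf"), "password");
doc.setAllSecurityToBeRemoved(true);
doc.save(new File("decrypted.pdf"));
doc.close();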
Just adding some code examples of my attempts to improve performance.
These are the 3 different ways I tried to do this:
private static void PDFBoxConcat(String filePath) {
    PDFMergerUtility ut = new PDFMergerUtility();
    for (int i = 0; i < 50; i++) {
        ut.addSource(filePath);
    }
    ut.setDestinationFileName("C:\\amdocs\\sensis\\dlv858\\pdfBox" + testNum + ".pdf");
    try {
        ut.mergeDocuments();
    } catch (Exception e) {
        System.out.println("PDFBoxConcat failed");
        e.printStackTrace();
    }
}

private static void PDFBoxConcat2(String filePath) {
    String[] fileNamesArray = new String[51];
    int i = 0;
    for (i = 0; i < 50; i++) {
        fileNamesArray[i] = filePath;
    }
    fileNamesArray[i] = "C:\\amdocs\\sensis\\dlv858\\pdfM" + testNum + ".pdf";
    try {
        PDFMerger.main(fileNamesArray);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

private static void PDFBoxConcat3(String filePath) throws FileNotFoundException {
    ArrayList<InputStream> list = new ArrayList<InputStream>();
    PDFMergerUtility ut = new PDFMergerUtility();
    for (int i = 0; i < 50; i++) {
        InputStream inputStream = new FileInputStream(filePath);
        list.add(inputStream);
    }
    ut.addSources(list);
    try {
        ut.mergeDocuments();
    } catch (Exception e) {
        System.out.println("PDFBoxConcat failed");
        e.printStackTrace();
    }
}
Concerning your first question: Does anyone know if/how I can increase the performance of this tool (= Apache PDFMergerUtility)?
The following setting helped me reduce the merge time by ~75%:
pdfMergerUtility.setDocumentMergeMode(PDFMergerUtility.DocumentMergeMode.OPTIMIZE_RESOURCES_MODE);
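For context, a minimal self-contained sketch of how that setting might be combined with a MemoryUsageSetting that buffers to temp files instead of main memory (PDFBox 2.x; the file names are hypothetical, and whether this helps in your case is an assumption):

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergeSketch {
    public static void main(String[] args) throws IOException {
        PDFMergerUtility ut = new PDFMergerUtility();
        // Use the optimized resource handling mentioned in the answer above.
        ut.setDocumentMergeMode(PDFMergerUtility.DocumentMergeMode.OPTIMIZE_RESOURCES_MODE);
        ut.addSource(new File("in1.pdf"));   // hypothetical input files
        ut.addSource(new File("in2.pdf"));
        ut.setDestinationFileName("merged.pdf");
        // Buffer intermediate data in temp files rather than holding everything in RAM.
        ut.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }
}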

NullPointerException using ImageIO.read

I'm getting an NPE while trying to read in an image file, and I can't for the life of me figure out why. Here is my line:
BufferedImage source = ImageIO.read(new File(imgPath));
imgPath is basically guaranteed to be valid, and right before it gets here the code copies the file from the server. When it hits that line, I get this stack trace:
Exception in thread "Thread-26" java.lang.NullPointerException
at com.ctreber.aclib.image.ico.ICOReader.getICOEntry(ICOReader.java:120)
at com.ctreber.aclib.image.ico.ICOReader.read(ICOReader.java:89)
at javax.imageio.ImageIO.read(ImageIO.java:1400)
at javax.imageio.ImageIO.read(ImageIO.java:1286)
at PrintServer.resizeImage(PrintServer.java:981) <---My function
<Stack of rest of my application here>
Also, this is thrown into my output window:
Can't create ICOFile: Can't read bytes: 2
I have no idea what is going on, especially since the File constructor is succeeding. I can't seem to find anybody who has had a similar problem. Anybody have any ideas? (Java 5 if that makes any difference)
I poked around some more and found that you can specify which ImageReader ImageIO will use and read it in that way. I poked around our codebase and found that we already had a function in place for doing EXACTLY what I was trying to accomplish here. Just for anybody else who runs into a similar issue, here is the crux of the code (some of the crap is defined above, but this should help anybody who tries to do it):
File imageFile = new File(filename);
Iterator<ImageReader> imageReaders = ImageIO.getImageReadersByFormatName("jpeg");
if (imageReaders.hasNext()) {
    imageReader = (ImageReader) imageReaders.next();
    stream = ImageIO.createImageInputStream(imageFile);
    imageReader.setInput(stream, true);
    ImageReadParam param = imageReader.getDefaultReadParam();
    curImage = imageReader.read(0, param);
}
Thanks for the suggestions and help all.
The File constructor will almost certainly succeed, regardless of whether it points to a valid/existing file. At the very least, I'd check whether your underlying file exists via the exists() method.
Also note that ImageIO.read is not thread-safe (it reuses cached ImageReaders which are not thread-safe).
This means you can't easily read multiple files in parallel. To do that, you'll have to deal with ImageReaders yourself.
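A minimal sketch of that approach (not from the original answer; assumes JPEG input, and the class/method names are hypothetical):

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

final class JpegLoader {
    // Each call gets its own ImageReader, so concurrent calls never share cached readers.
    static BufferedImage readJpeg(File file) throws IOException {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName("jpeg");
        if (!readers.hasNext()) {
            throw new IOException("No JPEG ImageReader available");
        }
        ImageReader reader = readers.next();
        ImageInputStream stream = ImageIO.createImageInputStream(file);
        try {
            reader.setInput(stream, true);
            return reader.read(0, reader.getDefaultReadParam());
        } finally {
            reader.dispose();
            stream.close();
        }
    }
}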
Have you considered that the file may simply be corrupted, or that ImageIO is trying to read it as the wrong type of file?
Googling for the ICOReader class results in one hit: IconsFactory from jide-common.
Apparently they had the same problem:
// Using ImageIO approach results in exception like this.
// Exception in thread "main" java.lang.NullPointerException
// at com.ctreber.aclib.image.ico.ICOReader.getICOEntry(ICOReader.java:120)
// at com.ctreber.aclib.image.ico.ICOReader.read(ICOReader.java:89)
// at javax.imageio.ImageIO.read(ImageIO.java:1400)
// at javax.imageio.ImageIO.read(ImageIO.java:1322)
// at com.jidesoft.icons.IconsFactory.b(Unknown Source)
// at com.jidesoft.icons.IconsFactory.a(Unknown Source)
// at com.jidesoft.icons.IconsFactory.getImageIcon(Unknown Source)
// at com.jidesoft.plaf.vsnet.VsnetMetalUtils.initComponentDefaults(Unknown Source)
// private static ImageIcon createImageIconWithException(final Class<?> baseClass, final String file) throws IOException {
// try {
// InputStream resource =
// baseClass.getResourceAsStream(file);
// if (resource == null) {
// throw new IOException("File " + file + " not found");
// }
// BufferedInputStream in =
// new BufferedInputStream(resource);
// return new ImageIcon(ImageIO.read(in));
// }
// catch (IOException ioe) {
// throw ioe;
// }
// }
What did they do instead?
private static ImageIcon createImageIconWithException(
        final Class<?> baseClass, final String file)
        throws IOException {
    InputStream resource = baseClass.getResourceAsStream(file);
    final byte[][] buffer = new byte[1][];
    try {
        if (resource == null) {
            throw new IOException("File " + file + " not found");
        }
        BufferedInputStream in = new BufferedInputStream(resource);
        ByteArrayOutputStream out = new ByteArrayOutputStream(1024);
        buffer[0] = new byte[1024];
        int n;
        while ((n = in.read(buffer[0])) > 0) {
            out.write(buffer[0], 0, n);
        }
        in.close();
        out.flush();
        buffer[0] = out.toByteArray();
    } catch (IOException ioe) {
        throw ioe;
    }
    if (buffer[0] == null) {
        throw new IOException(baseClass.getName() + "/" + file
                + " not found.");
    }
    if (buffer[0].length == 0) {
        throw new IOException("Warning: " + file
                + " is zero-length");
    }
    return new ImageIcon(Toolkit.getDefaultToolkit().createImage(
            buffer[0]));
}
So you might want to try the same approach: read the raw bytes and use Toolkit to create an image from them.
"it's a jpeg but doesn't have a jpeg
extension."
That might be it.
It appears that the library AC.lib-ICO is throwing the NPE. Since this library is intended to read the Microsoft ICO file format, a JPEG might be a problem for it.
Consider explicitly providing the format using an alternative method.
