File Compression in Java (Hadoop DefaultCodec) - how to make it human readable?

I have a file that was compressed with org.apache.hadoop.io.compress.DefaultCodec. I'd like to revert this file to its original format, which is a string in JSON format.
I'm not sure how to use DefaultCodec's documentation to make this happen. Can someone give me an example of what this would look like? Here's what I have so far; I have no idea if I'm on the right track...
//grab my file (it's on S3)
S3Object fileOnS3 = s3Service.getObject("mys3bucket", "myfilename");
DefaultCodec codec = new DefaultCodec();
Decompressor decompressor = codec.createDecompressor();
//does the following line create a input stream that parses DefaultCodec into uncompressed form?
CompressionInputStream is = codec.createInputStream(fileOnS3.getDataInputStream(), decompressor);
//also, I have no idea what to do from here.
I'd like to store the uncompressed version in a String variable, since I know the file is a small one-liner.

I would try the following:
1. Decompress the file using the HDFS shell command -text and a Unix shell redirect, like this:
hadoop dfs -text /path/on/hdfs/ > /local/path/for/local/raw/file
2. Load the file using SequenceFileInputFormat as the input format and TextOutputFormat as the output format, with an identity mapper (and zero reducers).
I would go for the first option, especially if you say that the input file is a small string. If you want to load this file in a String variable, you can either load the file (which seems unnecessarily expensive), or store the output of -text command immediately in a String (skipping the part after >).
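If you would rather stay in Java, as the question asks, here is a minimal sketch using only the JDK. It assumes the file is a bare zlib-format DEFLATE stream (which is what DefaultCodec normally writes) and not a SequenceFile container, and that the JSON is UTF-8. The class and method names are made up for illustration, and the deflate helper exists only so the example can be tried without Hadoop.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class DeflateToString {

    // Reads a zlib-format stream to the end and returns the content as a String.
    // Assumption: the uncompressed content is UTF-8 text.
    static String inflateToString(InputStream compressed) throws IOException {
        try (InputStream in = new InflaterInputStream(compressed);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Helper for trying the sketch locally: produces zlib-format bytes from text.
    static byte[] deflate(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }
}
```

With the question's code, the call would be something like inflateToString(fileOnS3.getDataInputStream()). The codec.createInputStream(...) stream from the question can be drained with the same read loop if you prefer to keep the Hadoop classes.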

Related

Creating a text file with java without using absolute path

following the question I asked before How to have my java project to use some files without using their absolute path? I found the solution but another problem popped up in creating text files that I want to write into.here's my code:
private String pathProvider() throws Exception {
//finding the location where the jar file has been located
String jarPath=URLDecoder.decode(getClass().getProtectionDomain().getCodeSource().getLocation().getPath(), "UTF-8");
//creating the full and final path
String completePath=jarPath.substring(0,jarPath.lastIndexOf("/"))+File.separator+"Records.txt";
return completePath;
}
public void writeRecord() {
try(Formatter writer=new Formatter(new FileWriter(new File(pathProvider()),true))) {
writer.format("%s %s %s %s %s %s %s %s %n", whichIsChecked(),nameInput.getText(),lastNameInput.getText()
,idInput.getText(),fieldOfStudyInput.getText(),date.getSelectedItem().toString()
,month.getSelectedItem().toString(),year.getSelectedItem().toString());
successful();
} catch (Exception e) {
failure();
}
}
This works and creates the text file wherever the jar file is running from. My problem is that when the information is written to the file, the numbers, symbols, and English characters are preserved, but the Persian characters are turned into question marks, like: ????? 111 ????? ????. Running the app in Eclipse doesn't cause this problem; running the jar does.
Note: I found the code inside the pathProvider method in someone else's question.

Your pasted code and the linked question are complete red herrings - they have nothing whatsoever to do with the error you ran into. Also, that protection domain stuff is a hack and you've been told before not to write data files next to your jar files, it's not how OSes (are supposed to) work. Use user.home for this.
There is nothing in this method that explains the question marks - the string, as returned, has plenty of issues (see above), but NOT that it will result in question marks in the output.
Files are fundamentally bytes. Strings are fundamentally characters. Therefore, when you write code that writes a string to a file, some code somewhere is converting chars to bytes.
Make sure the place where that happens includes a charset encoding.
Use the new API (I think you've also been told to do this, by me, in an earlier question of yours) which defaults to UTF-8. Alternatively, specify UTF-8 when you write. Note that the usage of UTF-8 here is about the file name, not the contents of it (as in, if you put persian symbols in the file name, it's not about persian symbols in the contents of the file / in the contents you want to write).
Because you didn't paste the code, I can't give you specific details as there are hundreds of ways to do this, and I do not know which one you used.
To write to a file given a String representing its path:
Path p = Paths.get(completePath);
Files.writeString(p, "Hello, World!"); // Java 11+; before that: Files.write(p, "Hello, World!".getBytes(StandardCharsets.UTF_8))
is all you need. This will write as UTF_8, which can handle persian symbols (because the Files API defaults to UTF-8 if you specify no encoding, unlike e.g. new File, FileOutputStream, FileWriter, etc).
If you're using outdated APIs: new BufferedWriter(new OutputStreamWriter(new FileOutputStream(thePath), StandardCharsets.UTF_8)) - but note that this is a resource leak bug unless you add the appropriate try-with-resources.
If you're using FileWriter: FileWriter is broken, never use this class. Use something else.
If you're converting the string on its own, it's str.getBytes(StandardCharsets.UTF_8), not str.getBytes().
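Since the question's code appends to the file, here is a minimal sketch of that with an explicit charset. The class and method names are made up for illustration; the point is only that UTF_8 is named at the place where chars become bytes, so the result does not depend on the platform default.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class Utf8Append {

    // Appends one line to the file, creating the file if it does not exist yet.
    // The charset is stated explicitly instead of relying on the JVM default.
    static void appendLine(Path file, String line) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(
                file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            writer.write(line);
            writer.newLine();
        }
    }
}
```

In the question's writeRecord method, the Formatter could wrap a writer opened this way instead of a FileWriter.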

JMeter not decoding base64 correctly - Results in blank PDF

In JMeter, I get a base64 encoded PDF in a Response that I extract using the RegEx Extractor. That is all working great.
Then I need to decode that base64 encoded document and write it out to a file, which I'm doing with the following in a BeanShell Post Processor:
import org.apache.commons.io.FileUtils;
import org.apache.commons.codec.binary.Base64;
// Set the response variable
String response = vars.get("documentText");
// Remove the carriage return hex code and condense to single string
String encodedFile = response.replace("&#xd;","").replaceAll("[\n]+","");
// Decode the encoded string
vars.put("decodedFile",new String(Base64.decodeBase64(encodedFile)));
// Write out the decoded file
Output = vars.get("decodedFile");
f = new FileOutputStream("C:\\Users\\user\\Desktop\\decodedFile.pdf");
p = new PrintStream(f);
this.interpreter.setOut(p);
print(Output);
p.flush();
f.close();
My problem is that the file that gets decoded and written out opens as a blank PDF.
In troubleshooting this, I wrote out a file with the encoded string from JMeter and then manually decoded it using this base64 tool. When I manually decoded the file, it opened as expected.
I then compared the text of the file that was produced by JMeter and the one I decoded with the tool and noticed that the file produced by JMeter included random ?'s throughout
I am assuming this must be the culprit, however, I do not know what is causing these to show up or how to fix it.

JMeter is not decoding Base64 correctly because JMeter cannot decode Base64. If you are using custom code to do it, I would suggest looking into that code first.
Given you need to do this magic:
String encodedFile = response.replace("&#xd;","").replaceAll("[\n]+","");
my expectation is that either your regular expression or the server response is broken.
Given that you use a scripting-based post-processor, you don't need this "regex" interim step; you should be able to access the parent sampler's response from the Beanshell PostProcessor via the data shorthand.
So your script can be reduced to something like:
FileUtils.writeByteArrayToFile(new File("C:\\Users\\user\\Desktop\\decodedFile.pdf"), Base64.decodeBase64(data));
As a fallback option you can execute this decb64.exe program using OS Process Sampler.
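The random ?'s in the question come from pushing the decoded bytes through new String(...) and a PrintStream: a PDF is binary, and bytes that are not valid text in the charset get replaced. A minimal sketch of the difference, using the JDK's java.util.Base64 rather than the commons-codec class from the question (a swapped-in but equivalent decoder); the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

class Base64ToFile {

    // The MIME decoder skips line breaks inside the encoded text,
    // so no manual stripping of \r or \n is needed.
    static byte[] decode(String encoded) {
        return Base64.getMimeDecoder().decode(encoded);
    }

    // Writes the decoded bytes as they are, never converting them to a String.
    static void writeDecoded(String encoded, Path target) throws IOException {
        Files.write(target, decode(encoded));
    }

    // Shows the failure mode: returns false when bytes -> String -> bytes
    // does not give back the original data.
    static boolean survivesStringRoundTrip(byte[] raw) {
        byte[] viaString = new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, viaString);
    }
}
```

The writeByteArrayToFile call above does the same thing as writeDecoded: bytes in, bytes out.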

Java saving a file with special characters in file name

I'm having a problem on Java file encoding.
I have a Java program that saves an input stream as a file with a given file name. The code snippet looks like this:
File out = new File(strFileName);
Files.copy(inStream, out.toPath());
It works fine on Windows unless the file name contains special characters like Ö. With these characters in the file name, the saved file displays a garbled file name on Windows.
I understand that applying the JVM option -Dfile.encoding=UTF-8 can fix this issue, but I would rather have a solution in my code than ask all my users to change their JVM options.
While debugging the program I can see that the file name string always shows the correct character, so I guess the problem is not about internal encoding.
Could someone please explain what went wrong behind the scenes? Is there a way to avoid this problem programmatically? I tried getting the bytes from the string and changing the encoding, but that doesn't work.
Thanks.

Using the URLEncoder class would work:
String name = URLEncoder.encode("fileName#", "UTF-8");
File output = new File(name);
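Be aware of what this does to the name on disk: the file ends up with a percent-encoded ASCII name, and you have to decode it again whenever you want to show the original. A small sketch of that round trip (the class and method names are made up for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class EncodedFileName {

    // Turns every character outside the URL-safe set into %XX escapes,
    // so the resulting file name is plain ASCII.
    static String encode(String fileName) throws UnsupportedEncodingException {
        return URLEncoder.encode(fileName, "UTF-8");
    }

    // Recovers the original name, e.g. for display to the user.
    static String decode(String encodedName) throws UnsupportedEncodingException {
        return URLDecoder.decode(encodedName, "UTF-8");
    }
}
```

For example, encode("Ö.txt") gives "%C3%96.txt", which is the name that will appear in Windows Explorer.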

Reading path from ini file, backslash escape character disappearing?

I'm reading in an absolute pathname from an ini file and storing the pathname as a String value in my program. However, when I do this, the value that gets stored somehow seems to be losing the backslash so that the path just comes out one big jumbled mess? For example, the ini file would have key, value of:
key=C:\folder\folder2\filename.extension
and the value that gets stored is coming out as C:folderfolder2filename.extension.
Would anyone know how to escape the keys before it gets read in?
Let's also assume that changing the values of the ini file is not an alternative because it's not a file that I create.

Try setting the escape property to false in Ini4j.
http://ini4j.sourceforge.net/apidocs/org/ini4j/Config.html#setEscape%28boolean%29
You can try:
Config.getGlobal().setEscape(false);

If you read the file and translate each \ to a / before processing, that would work. The library you are using has a method Ini#load(InputStream) that takes the INI file contents; call it like this:
byte[] data = Files.readAllBytes(Paths.get("directory", "file.ini"));
String contents = new String(data).replaceAll("\\\\", "/");
InputStream stream = new ByteArrayInputStream(contents.getBytes());
ini.load(stream);
The processor must be doing the interpretation of the back-slashes, so this will give it data with forward-slashes instead. Or, you could escape the back-slashes before processing, like this:
String contents = new String(data).replaceAll("\\\\", "\\\\\\\\");
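To see the escaping idea work without the INI library, here is a minimal sketch that uses java.util.Properties as a stand-in parser. That is an assumption for illustration only: Properties also treats a backslash as an escape character, but it is not ini4j and its exact rules differ. The class and method names are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class BackslashEscaping {

    // Doubles every backslash so an escape-aware parser reads it as a literal one.
    // String.replace works on literal text, so no regex quoting is involved.
    static String escapeBackslashes(String raw) {
        return raw.replace("\\", "\\\\");
    }

    // Stand-in for the INI parser: java.util.Properties also consumes
    // single backslashes as escape characters.
    static String loadValue(String contents, String key) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(contents));
        return props.getProperty(key);
    }
}
```

Calling loadValue(escapeBackslashes(contents), "key") on the line from the question returns the path with its backslashes intact, while loading the unescaped contents loses them.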

How to convert byte array to file

I have connected to an ftp location using;
URL url = new URL("ftp://user:password@mydomain.com/" + file_name +";type=i");
I read the content into a byte array as shown below;
byte[] buffer = new byte[1024];
int count = 0;
while((count = fis.read(buffer)) > 0)
{
//check if bytes in buffer is a file
}
I want to be able to check if the bytes in buffer is a file without explicitly passing a specific file to write to it like;
File xfile= new File("dir1/");
FileOutputStream fos = new FileOutputStream(xfile);
fos.write(bytes);
if(xfile.isFile())
{
}
In an Ideal world something like this;
File xfile = new File(buffer);//Note: you cannot do this in java
if(xfile.isFile())
{
}
Here isFile() is meant to check whether the bytes read from the FTP server are a file. I don't want to pass an explicit file name, as I do not know the name of the file at the FTP location.
Any solutions available?

What is a file?
A computer file is a block of arbitrary information [...] which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished.
Your bytes that are stored in the byte array will be a part of a file if you write them on some kind of durable storage.
Sure, we often say that we read a file or write a file, but basically we read bytes from a file and write bytes to a file.
So we can't test whether a byte array's content is a file or not, simply because every byte array can be used to create a file (even an empty array).
BTW - the ftp server does not send a file, it (1) reads bytes and (2) a filename and (3) sends the bytes and (4) the filename so that a client can (5) read the bytes and (6) the filename and use both datasets to (7) create a file. The ftp server doesn't have to access a file, it can take bytes and names from a database or create both in memory...

I guess you cannot check whether the byte[] array is a file or not. Why don't you just use an already written and tested library, for example: http://commons.apache.org/net/

There is no way to do that easily.
A file is a byte array on a disk and a byte array will be a file if you write it to disk. There is no reliable way of telling what is in the data you just received, without parsing the data and checking if you can find a valid file header in it.
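As a rough sketch of what "checking for a valid file header" can look like, the code below compares the start of the buffer against a few well-known signatures (PDF, PNG, ZIP). The class and method names are made up for illustration. A match only suggests a format; it says nothing about whether the bytes came from a file, and data without a recognizable header simply comes back as "unknown".

```java
class MagicNumbers {

    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    // True if the first bytes of data (only 'length' bytes are valid) equal the signature.
    static boolean startsWith(byte[] data, int length, byte[] signature) {
        if (length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    // Guesses a format from the first chunk read from the stream;
    // 'count' is the value returned by read(buffer).
    static String guessType(byte[] buffer, int count) {
        if (startsWith(buffer, count, PDF)) return "pdf";
        if (startsWith(buffer, count, PNG)) return "png";
        if (startsWith(buffer, count, ZIP)) return "zip";
        return "unknown";
    }
}
```

In the question's read loop, guessType(buffer, count) would be called once, on the first chunk only.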

Where isFile() means the content fetched from the ftp stream is a file.
The answer to that is simple. You can't do it because it doesn't make any sense.
What you have read from the stream IS a sequence of bytes stored in memory.
A file is a sequence of bytes stored on a disk (typically).
These are not the same thing. (Or if you want to get all theoretical / philosophical, you have to answer the question "when is a sequence of bytes a file, and when is it not a file".)
Now a more sensible question to ask might be:
How do I know if the stuff I fetched by FTP is the contents of a file on the FTP server.
(... as distinct from a rendering of a directory or something).
The answer is that you can't be sure if you fetched the file by opening a URLConnection to the FTP server ... like you have done. It is like asking "is '(123) 555-5555' a phone number?". It could be a phone number, or it could just be a sequence of characters that looks like a phone number.
