Java: Telling whether a byte array is a zip file

I have a server written in Java that, in a single request, gets a whole file from the client. The file is passed to the server as a list of bytes and ends up represented in the Java server as a byte array.
Is there some standard way / standard library that could tell whether a file represented by a byte array is a valid zip file?

Files are typically identified by magic numbers at the beginning of the file.
To make an educated guess about a given file, Java has a built-in method for detecting some file types: Files.probeContentType. Plus, there are various third-party libraries: simplemagic or Apache Tika (which supports more than only magic numbers).
But content detection alone won't tell you whether the file is valid. For that, you'd need something that actually knows how to read Zip files, such as Java's ZipFile.
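For a byte array specifically, one option is to wrap it in a ZipInputStream and see whether it yields any entries. A rough sketch (the class and method names are mine; note that it rejects empty archives and does not verify CRCs unless you also read each entry to the end):
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipInputStream;

public class ZipValidator {

    // Treats the array as a zip if ZipInputStream can read at least one
    // entry from it without throwing.
    public static boolean looksLikeZip(byte[] data) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(data))) {
            return zis.getNextEntry() != null;
        } catch (IOException e) {
            return false;
        }
    }
}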

If you want a standard way to implement this, you can look at the serialization API. Here are two articles I found while searching this topic:
Article 1 - javaworld
Article 2 - developer.com

Check out the Zip4j library. It is really easy to use, and its ZipFile class has an isValidZipFile() method.
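A rough sketch of how that could look with the byte array from the question (the temporary-file step is my assumption, since Zip4j operates on files; the package name assumes Zip4j 2.x):
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import net.lingala.zip4j.ZipFile;

public class Zip4jCheck {

    public static boolean isValidZip(byte[] data) throws IOException {
        // Zip4j works on files, so write the bytes to a temporary file first.
        File temp = Files.createTempFile("upload", ".zip").toFile();
        try {
            Files.write(temp.toPath(), data);
            return new ZipFile(temp).isValidZipFile();
        } finally {
            temp.delete();
        }
    }
}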

The easiest way is to check the "PK" magic at the beginning of the byte array.
Something like this:
"PK".equals(new String(array, 0,2))

Related

Zip folder to Array of byte using Scala

I am working on an application where I have to convert a .zip file to an array of bytes, using Scala and the Play framework.
As of now I'm using:
val byteOfArray = Source.fromFile("resultZip.zip", "UTF-8").map(_.toByte).toArray
But when I perform operations with byteOfArray I get an error.
I printed byteOfArray and got the result below:
empty parser
Can you please let me know whether this is the correct way to convert a .zip file to an array of bytes?
Also let me know if there is a better way to do the conversion.
Your solution is incorrect. UTF-8 is a text encoding, and zip files are binary files. It might happen by accident that a zip file is a valid UTF-8 file, but even in this case UTF-8 can use multiple bytes for a single character which you'll then convert to a single byte. Source is only intended to work with text files (as you can see from the presence of encoding parameter, Char type use, etc.). There is nothing in the standard Scala library to work with binary IO.
If you really hate the idea of using Java standard library (you shouldn't; that's what any Scala solution is going to be based on, and it doesn't get less verbose than a single method call), use better-files (not tested, just based on README examples):
import better.files._
val file = File("resultZip.zip")
file.bytes.toArray // if you really need an Array and can't work with Iterator
but for this specific case it isn't a real win, you just need to add an extra dependency.
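For reference, the Java standard library call alluded to above is essentially a one-liner (shown here as Java; the file name is the one from the question):
// Reads the whole zip file into a byte array in one call.
byte[] byteOfArray = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("resultZip.zip"));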
Regarding the follow-up comment, "I mean a folder that contains files and other folders with files in them":
If you have a folder which contains .zip files and possibly some others in nested folders, you can get all of them with
val zipFiles = File(directoryName).glob("**/*.zip")
and then
zipFiles.map(_.bytes.toArray)
will give you a Seq[Array[Byte]] containing all zip files as byte arrays. Modify to taste if you need to use file names and/or paths, etc. in further processing.

File type detection in Java without I/O

There is a built-in method in the Java JDK that detects file types:
Files.probeContentType(Paths.get("/temp/word.doc"));
The javadoc says that a FileTypeDetector may examine the filename, or it may examine a few bytes in the file, which means that it would have to actually try to pull the file from a URL.
This is unacceptable in our app; the content of the file is available only through an InputStream.
I tried to step through the code to see what the JDK is actually doing, but it seems that it goes to FileTypeDetectors.defaultFileTypeDetector.probeContentType(path) which goes to sun.nio.fs.AbstractFileTypeDetector, and I couldn't step into that code because there's no source attachment.
How do I use JDK file type detection and force it to use file content that I supply, rather than having it go out and perform I/O on its own?
The docs for Files.probeContentType() explain how to plug in your own FileTypeDetector implementation, but if you follow the docs you'll find that there is no reliable way to ensure that your implementation is the one that is selected (the idea is that different implementations serve as fallbacks for each other, not alternatives). There is certainly no documented way to prevent the built-in implementation from ever reading the target file.
You can surely find a map of common filename extensions to content types in various places around the web and probably on your own system; mime.types is a common name for such files. If you want to rely only on such a mapping file then you probably need to use your own custom facility, not the Java standard library's.
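To illustrate what such a plug-in looks like (the class name and extension mapping here are mine, and as noted above you cannot guarantee it runs before the built-in detector), it is a subclass of java.nio.file.spi.FileTypeDetector registered through a META-INF/services/java.nio.file.spi.FileTypeDetector file on the classpath:
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;

// A detector that never touches the file contents: it only looks at the name.
public class ExtensionOnlyDetector extends FileTypeDetector {

    @Override
    public String probeContentType(Path path) throws IOException {
        Path fileName = path.getFileName();
        if (fileName == null) {
            return null;
        }
        String name = fileName.toString().toLowerCase();
        if (name.endsWith(".txt")) return "text/plain";
        if (name.endsWith(".csv")) return "text/csv";
        if (name.endsWith(".jpg") || name.endsWith(".jpeg")) return "image/jpeg";
        return null; // unknown: let the next detector have a try
    }
}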
The JDK's Files.probeContentType() simply loads whatever FileTypeDetector is available in your JDK installation and asks it to detect the MIME type. If none can determine the type, it returns null.
Apache has a library called Tika which does exactly what you want: it determines the MIME type of the given content. It can also be plugged into your JDK so that Files.probeContentType() uses Tika. Check this tutorial for quick code - http://wilddiary.com/detect-file-type-from-content/
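A minimal sketch of the Tika facade, assuming the content is only available as an InputStream (the wrapper class is mine):
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;

public class TikaDetect {

    private static final Tika TIKA = new Tika();

    // Detects the MIME type from the stream contents alone; no file on disk needed.
    public static String detect(InputStream in) throws IOException {
        return TIKA.detect(in);
    }
}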
If you are worried about reading the contents of an InputStream, you can wrap it in a PushbackInputStream to "unread" those bytes so the next detector implementation can read them.
Usually a binary file's magic number is 4 bytes, so new PushbackInputStream(in, 4) should be sufficient.
import java.io.PushbackInputStream;

// Wrap the original stream so the magic-number bytes can be "unread".
PushbackInputStream pushbackStream = new PushbackInputStream(in, 4);
byte[] magicNumber = new byte[4];
int read = pushbackStream.read(magicNumber);
// For production code, loop until all 4 bytes are read or the stream ends.

// Figure out the content type based on the magic number, e.g. with your own
// lookup table (detectFromMagic is a hypothetical method of yours).
String contentType = detectFromMagic(magicNumber);

// Push the bytes back so downstream code sees the whole stream again,
// starting with the magic-number bytes.
if (read > 0) {
    pushbackStream.unread(magicNumber, 0, read);
}
// From here on, pass pushbackStream around as a normal InputStream.

Read and process large one line JSON file

I have a large JSON file (200 MB), but it is all on one single line.
I need to do some processing on the data in the file and write the data into a relational database.
What is the best way to do this using Java?
Note: most of the available methods read line by line. We could use something like MappedByteBuffer to read character by character, but that is not an efficient solution.
Non-Java solutions are also welcome.
I recommend the JSON-java library from Douglas Crockford: https://github.com/douglascrockford/JSON-java. Use the following to load a JSON array:
org.json.JSONArray mediaArray = new org.json.JSONArray(filecontent);
Check the following article on reading a file's content:
http://www.javapractices.com/topic/TopicAction.do?Id=42
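Putting the two together, a rough sketch (the file name is a placeholder, and this does load the whole 200 MB into memory, so size the heap accordingly):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.json.JSONArray;
import org.json.JSONObject;

// Read the single-line file into one String, then parse it.
String filecontent = new String(Files.readAllBytes(Paths.get("data.json")), StandardCharsets.UTF_8);
JSONArray mediaArray = new JSONArray(filecontent);

for (int i = 0; i < mediaArray.length(); i++) {
    JSONObject record = mediaArray.getJSONObject(i);
    // ... map the fields and write them to the relational database here
}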

Uncompress a zlib-compressed string in Java

I have a Java module that is receiving a compressed string from a remote Python script. The Python script compresses the string using zlib.compress(). I simply want to uncompress it in Java and display it to the user.
The documentation for Java's built-in java.util.zip.Deflater class describes pretty explicitly how to uncompress something that has been compressed using zlib.compress(). However, this method does not work for me. Depending on which encoding I use, I either get "Incorrect Header Check" errors or the uncompression returns an empty string.
So, how am I supposed to uncompress this? The data are not getting corrupted in transmission, and the compressed string begins with "x\x9c", which is apparently appropriate for zlib-compressed stuff.
I've never dealt with compression at this level before and am getting confused. For extra credit, I'd appreciate an explanation of the difference between compressed/uncompressed and inflated/deflated. According to this they are different, but most of the internet seems to use them interchangeably for zlib. That just makes finding a solution even more difficult, as I can't tell whether I'm actually trying to "uncompress" or "inflate" these data.
The confusion has arisen because some bright spark started describing the zlib protocol as "deflate". It might help you to read the RFCs mentioned in these Java docs.
Also this SO topic is quite relevant.
I suggest that you do
print repr(zlib.compress("The quick brown dog etc etc"))
in Python (A) and compare the result from using the equivalent Java code using Deflater (B). Also ensure that you can Inflate B to recover your test input. Check that you are not suffering from unicode <-> bytes complications in Python or Java or both.
Have you tried doing a Python "deflate" as per the answer by @patthoyts in the SO topic that you quoted?
Note that Python's zlib.compress() produces zlib-wrapped DEFLATE data (RFC 1950), not gzip, so Java's Inflater should handle it with its default constructor. The nowrap parameter is for raw DEFLATE data with no zlib header (it is what GZIPInputStream uses internally), so make sure you are not setting it.
Inflate/deflate refer specifically to the DEFLATE algorithm, whereas compress/uncompress are more general terms.
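For completeness, a minimal sketch of inflating zlib-compressed bytes with the default Inflater (the UTF-8 decoding at the end is an assumption about how the Python side encoded the text):
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class ZlibUncompress {

    public static String inflateToString(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();   // default constructor expects the zlib header
        inflater.setInput(compressed);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer); // throws DataFormatException on bad data
            if (n == 0 && inflater.needsInput()) {
                break;                        // truncated input
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}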

Detect if user changed file extension to upload?

Using a Java servlet, is it possible to detect the true file type of a file, regardless of its extension?
Scenario: you only allow plain text file uploads (.txt and .csv). The user takes the file mypicture.jpg, renames it to mypicture.txt, and proceeds to upload it. Your servlet expects only text files and blows up trying to read the JPEG.
Obviously this is user error, but is there a way to detect that it's not plain text and not proceed?
You can do this using the built-in URLConnection#guessContentTypeFromStream() API. It's however pretty limited in the content types it can detect; you may be better off with a third-party library like jMimeMagic.
See also:
Best way to determine file type in Java
When do browsers send application/octet-stream as Content-Type?
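A quick sketch of the built-in check (uploadStream stands for whatever stream the servlet gives you; the BufferedInputStream wrapping is needed because the method requires mark/reset support):
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.net.URLConnection;

InputStream in = new BufferedInputStream(uploadStream);
String mimeType = URLConnection.guessContentTypeFromStream(in);
// mimeType is e.g. "image/jpeg" for a renamed JPEG, or null if not recognized.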
No. There is no way to know what type of file is being uploaded. You must perform all verification on the server before taking any action with the file.
I think you should consider why your program might blow up when given a JPEG (say) and make it defensive against this. For example, a JPEG file is likely to have apparently very long lines (any LF or CR LF bytes will be somewhat randomly spread). But a so-called text file could equally have long lines that might kill your program.
What exactly do you mean by "plain text file"? Would a file consisting of Chinese text be a plain text file? If you assume English text in an ASCII or ANSI encoding, you would have to read the full file as a binary file and check that, e.g., all byte values are between, say, 32 and 127, plus 13, 10, and maybe 9.
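A crude sketch of that heuristic (the accepted byte values are exactly the ones suggested above; adjust them to match your definition of "plain text"):
// Returns true if every byte is printable ASCII, tab, LF or CR.
public static boolean looksLikeAsciiText(byte[] data) {
    for (byte b : data) {
        int v = b & 0xFF;
        boolean printable = v >= 32 && v <= 127;
        boolean whitespace = v == 9 || v == 10 || v == 13;
        if (!printable && !whitespace) {
            return false;
        }
    }
    return true;
}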
