I have to save a file in any format (XLS, PDF, DOC, JPG, ...) in a database using Java. In my experience I would do this by storing the binary data of the file in a BLOB-type field, but someone told me that an alternative is encoding the binary data as text using Base64 and storing the string in a TEXT-type field. Which one is the best option to perform this task?
Thanks.
Paul Manjarres
BLOB would be better, simply because you can use a byte[] data type, and you don't have to encode/decode to and from Base64. There is no reason to use Base64 for simple storage.
The argument for using a BLOB is that it takes fewer CPU cycles, less disk and network I/O, and less code, and it reduces the likelihood of bugs:
As Will Hartung says, using a BLOB enables you to skip the encode/decode steps, which reduces CPU cycles. Moreover, there are many Java libraries for Base64 encoding and decoding, and there are nuances in implementation (e.g. PEM line wraps). This means that, to be safe, the same library should be used for encoding and decoding, which creates an unnecessary coupling between the application that creates the record and the application that reads it.
The Base64-encoded output will be roughly a third larger than the raw bytes, which means it will take up more disk space (and network I/O).
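For illustration, here is a minimal JDBC sketch of the BLOB approach (the table name and columns are invented for this example):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class BlobStorageExample {
        // Assumes a table like: CREATE TABLE files (name VARCHAR(255), content BLOB)
        public static void storeFile(Connection conn, Path file) throws Exception {
            String sql = "INSERT INTO files (name, content) VALUES (?, ?)";
            try (InputStream in = Files.newInputStream(file);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, file.getFileName().toString());
                // Stream the raw bytes straight into the BLOB column:
                // no Base64 step and no full copy of the file in memory.
                ps.setBinaryStream(2, in, Files.size(file));
                ps.executeUpdate();
            }
        }
    }

Reading it back is symmetric: ResultSet.getBinaryStream() hands you the raw bytes with no decoding step.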
Use a BLOB to put them in the database.
FILE to BLOB = the DB will not query the content and treats it as ... well ... a meaningless binary blob, regardless of its content. The DB knows this field may be 1 KB or 1 GB and allocates resources accordingly.
FILE to TEXT = the DB can query this thing. Strings can be searched, replaced, and modified inside the file. But this time the DBMS will spend more resources to make it work. A field may be holding 100 characters of text or 1 million. Files can have any kind of text encoding, and invalid characters may be lost due to table/DB encoding settings. There is no need for this if the content of the files will not be used in SQL queries.
BASE64 = converts any content to lovely, super-valid text. A workaround that bypasses every compatibility issue. Store it anywhere, print it, telegraph it, write it on paper, convert your favorite selfie to a private key. The output will be meaningless and bigger, but it will be ordinary text.
Related
I want to compress a string (an XML document) in Java and store it in a Cassandra DB as a varchar. I should be able to decompress it while reading from the DB. I looked into GZIP and LZ4, and both return a byte array on compression.
My goal is to obtain a string from the compressed data which can also be used to decompress and get back the original string.
What is the best possible approach?
I don't see any good reasons for you to compress your data: Cassandra can do it for you transparently (it will LZ4 your data by default). So, if your goal is to reduce your data footprint then you have a non-existent problem, and I'd feed the XML document directly to C*.
By the way, all the compression algorithms take an array of bytes and produce an array of bytes. As a solution, you could apply something like Base64 encoding to your compressed byte array. On decompression, reverse the logic: Base64-decode your string and then apply your decompression algorithm.
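If you do go that route, here is a sketch of the round trip (GZIP from the JDK here, but LZ4 would slot in the same way):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class CompressToVarchar {
        // Compress the XML, then Base64-encode so it fits in a varchar column.
        public static String compress(String xml) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buf)) {
                gzip.write(xml.getBytes(StandardCharsets.UTF_8));
            }
            return Base64.getEncoder().encodeToString(buf.toByteArray());
        }

        // Reverse the logic: decode the Base64 string, then decompress.
        public static String decompress(String stored) throws Exception {
            byte[] compressed = Base64.getDecoder().decode(stored);
            try (GZIPInputStream gzip =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

Keep in mind that the Base64 step gives back about a third of the space the compression saved, which is another argument for letting Cassandra compress the raw text transparently.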
Not enough reputation to comment, so posting as an answer. If you want a string back, then significant compression will depend on your data. A very simple solution might be something like Java compressing Strings, but that would only work if your string consists of letters and contains no numbers. You can modify that solution to work for most characters, but if you don't have repeating characters you might actually get a larger string than your original one.
So I am working on a GAE project. I need to look up cities, country names and country codes for sign-ups, LBS, etc.
Now, I figured that putting all the information in the Datastore is rather stupid, as it will be used quite frequently and it's going to eat up my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to put them in the Datastore.
Now that leaves me with a few options:
API - No budget for paid services, and the free ones are not exactly reliable.
Upload a parse-able file - The favorable option, as I like the certainty that the data will always be there.
So I got the files needed from GeoNames (link has source files for all countries in case someone needs it). The file for each country is a regular UTF-8 tab delimited file which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and systematically retrieve data from a static file in a Java servlet container?
The best way being the fastest, and least resource hungry method.
Valid options:
TXT file, tab delimited
Static XML file
Java class with tons of enums
I know that importing the country files as Java enums and looping through their values would be very fast, but do you think this will affect memory beyond reasonable limits? On the other hand, with a file, every time I need to access a record the loop has to go through a few thousand lines until it finds the required one ... reading line by line means no memory issues, but it is incredibly slow ... I have had some experience parsing an Excel file in a Java servlet, and it took something like 20 seconds just to parse 250 records; at scale, the response time WILL time out (no doubt about it). So is XML anything like Excel in that respect?
Thank you very much, guys! Please provide opinions; anything and everything is appreciated!
The easiest and fastest way would be to have the file as a static web resource under the WEB-INF folder, and on application startup have a context listener load the file into memory.
In memory, it should be a Map, keyed by whatever you want to search by. This will give you close to constant access time.
Memory consumption only matters if the data is really big. A hundred thousand records, for example, are not worth optimizing if you access them many times.
The static file should be in plain text or CSV format; these are read and parsed most efficiently. There is no need for XML formatting, as parsing it would be slower.
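As a sketch of that approach, a context listener that parses a tab-delimited file into a Map at startup (the resource name and record layout are invented for the example):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;

    public class CountryDataListener implements ServletContextListener {
        @Override
        public void contextInitialized(ServletContextEvent sce) {
            Map<String, String> countriesByCode = new HashMap<>();
            // Hypothetical resource: a tab-delimited file under WEB-INF.
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    sce.getServletContext().getResourceAsStream("/WEB-INF/countries.txt"),
                    StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split("\t");
                    countriesByCode.put(fields[0], fields[1]); // code -> name
                }
            } catch (Exception e) {
                throw new RuntimeException("Failed to load country data", e);
            }
            // Stash the map in the servlet context for constant-time lookups.
            sce.getServletContext().setAttribute("countriesByCode", countriesByCode);
        }

        @Override
        public void contextDestroyed(ServletContextEvent sce) { }
    }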
If the list is really big, you can break it up into multiple smaller files and only parse those when they are required. A reasonable, easy partitioning would be to break it up by country, but any other partitioning would work (for example, based on the first few characters of the name).
You could also consider building this Map in memory once, then serializing it to a binary file and including that binary file as a static resource. That way you would only have to deserialize the Map, and there would be no need to parse/process a text file and build the objects yourself.
Improvements on the data file
An alternative to having the static resource file as a text/CSV file or a serialized Map data file would be to have it as a binary data file where you create your own custom file format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantage that the file can be much smaller (compared to plain text / CSV / a serialized Map) and loading it will be much faster (because DataInputStream doesn't parse numbers from text, for example; it reads the bytes of a number directly).
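A sketch of such a custom format (the field layout is invented for the example): write a record count first, then the fields of each record in a fixed order.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BinaryCountryFile {
        // Write: record count first, then (code, name) pairs in a fixed order.
        public static void write(String path, Map<String, String> countries) throws Exception {
            try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
                out.writeInt(countries.size());
                for (Map.Entry<String, String> e : countries.entrySet()) {
                    out.writeUTF(e.getKey());   // country code
                    out.writeUTF(e.getValue()); // country name
                }
            }
        }

        // Read: no text parsing at all; the bytes are consumed directly.
        public static Map<String, String> read(String path) throws Exception {
            Map<String, String> countries = new LinkedHashMap<>();
            try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
                int count = in.readInt();
                for (int i = 0; i < count; i++) {
                    countries.put(in.readUTF(), in.readUTF());
                }
            }
            return countries;
        }
    }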
Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a Java HashMap
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
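For what it's worth, a minimal sketch of option (a), with an invented element structure:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class XmlLookup {
        private final Document doc; // parsed once, at start of day
        private final XPath xpath = XPathFactory.newInstance().newXPath();

        public XmlLookup(File xmlFile) throws Exception {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(xmlFile);
        }

        // Assumes records shaped like <country code="..."><name>...</name></country>.
        public String countryName(String code) throws Exception {
            return xpath.evaluate(
                    "/countries/country[@code='" + code + "']/name", doc);
        }
    }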
I am learning Inubit. I want to know how I can store images in a database using the Inubit tool set.
The question is more than a year old. I guess you solved it by now.
For all others coming here, let me sketch out the typical way you'd do that.
0. (optional) Compress data.
Depending on the compression of the image (e.g. if it is GIF, PDF, or uncompressed TIFF rather than JPEG), you might want to compress it via a Compressor module first, to reduce the database space needed and increase overall performance in the next steps. Be sure to compress the binary data, not the Base64-encoded string (see the next step)!
1. Encode binary stream to base64.
Depending on where you get the image data from, chances are that it is already Base64 encoded, e.g. if you used a file connector to retrieve it from disk with the appropriate option checked, or used a web service connector. If you really have a binary data stream, convert it to Base64 using an encoder module (better self-documenting) or a variable assignment using the XPath function isxp:encode (more concise).
2. Save the encoded data via a database connector.
Well, the details of doing this right are pretty much database specific. The cheap trick that should work on any database is storing the Base64 string simply as a string in a TEXT / CLOB column. This wastes about a third more space in the database than the original binary data, since Base64 is poorly packed (more, if the column uses a multi-byte character encoding). Doing it right would mean constructing a forced SQL query in an XSLT that decodes the Base64 string to binary and stores that. Here is some reference to how it can be done in Oracle.
Hope this might be of some help.
Cheers,
Jörn
Jörn Willhöft
Willhöft IT-Beratung GmbH, Berlin, Germany
You do not store the image in the database; you only record the path to it. The image will be stored on the server.
Here is an example of how to store the path to the image: How to insert multiple images path to database
I need to send a JSON packet across the wire with the contents of an arbitrary file. This may be a binary file (like a ZIP file), but most often it will be plain ASCII text.
I'm currently using base64 encoding, which handles all files, but it increases the size of the data significantly - even if the file is ASCII to begin with. Is there a more efficient way I can encode the data, other than manually checking for any non-ASCII characters and then deciding whether or not to base64-encode it?
I'm currently writing this in Python, but will probably need to do the same in Java, C# and C++, so an easily portable solution would be preferable.
Use quoted-printable encoding. Any language should support that.
http://en.wikipedia.org/wiki/Quoted-printable
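In Java, for instance, Apache Commons Codec ships a quoted-printable implementation (an external dependency, so treat this as one possible sketch rather than the only way):

    import org.apache.commons.codec.net.QuotedPrintableCodec;

    public class QuotedPrintableExample {
        public static void main(String[] args) throws Exception {
            QuotedPrintableCodec codec = new QuotedPrintableCodec();

            // Printable ASCII passes through mostly untouched, so text-heavy
            // payloads stay close to their original size; only the awkward
            // bytes become =XX escape triplets.
            String encoded = codec.encode("caf\u00e9 menu, 100% plain otherwise");
            System.out.println(encoded);
            System.out.println(codec.decode(encoded)); // round-trips to the original
        }
    }

The trade-off versus Base64: mostly-ASCII data barely grows, but worst-case binary data (every byte escaped) triples in size, so for ZIP files Base64's fixed 4/3 overhead is actually the more compact choice.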
Currently, I'm saving and loading some data in C/C++ structs to files by using fread()/fwrite(). This works just fine when working within this one C app (I can recompile whenever the structure changes to update the sizeof() arguments to fread()/fwrite()), but how can I load this file in other programs without knowing in advance the sizeof()s of the C struct?
In particular, I have written this other Java app that visualizes the data contained in that C struct binary file, but I'd like a general solution as to how read that binary file. (Instead of me having to manually put in the sizeof()s in the Java app source whenever the C structure changes...)
I'm thinking of serializing to text or XML of some sort, but I'm not sure where to start with that (how to serialize in C, then how to deserialize in Java and possibly other languages in the future), and if that is advisable here where one member of the struct is a float array that can go upwards of ~50 MB in binary format (and I have hundreds of these data files to read and write).
The C structure is simple (no severe nesting or pointer references) and looks like the following:
struct MyStructure {
    char *title;
    int id;
    int param1;
    int param2;
    float *data;
};
The parts that are liable to change the most are the param integers.
What are my options here?
If you have control of both code bases, you should consider using Protocol Buffers.
You could use Java's DataInput/DataOutput format that is well described in the javadoc.
Take a look at JSON: http://www.json.org. If you are going to/from JavaScript, it's a big help. I don't know how good the Java support is, though.
If your structure isn't going to change (much), and your data is in a pretty consistent format, you could just write the values out to a CSV file, or some other plain format.
This can be easily read in Java, and you won't have to worry about serializing to XML. Sometimes going simple is the easiest route.
Take a look at Resin's Hessian/Burlap services. You may not want the whole service, just part of the API and an understanding of the wire protocol.
If:
your data is essentially a big array of floats;
you are able to test the writing/reading procedure in all the likely environments (=combinations of machines/OS/C compiler) that each end will be running on;
performance is important.
then I would probably just keep writing the data from C in the way that you are doing (maybe with a slight amendment -- see below) and turn the problem into how you read that data from Java.
To read the data back in from Java, use a ByteBuffer. Essentially, pull in slabs of bytes from your data, wrap a ByteBuffer around them, and then use the get(), getFloat(), getInt() etc. methods. The NIO package also has "wrapper" buffers, e.g. FloatBuffer, which from tests I've done appear to be about 20% faster for reading large numbers of values of the same type.
Now, one thing you'll have to be careful about is byte ordering. From Java, you need to call order(ByteOrder.LITTLE_ENDIAN) or order(ByteOrder.BIG_ENDIAN) on your buffer before you start reading the data. To decide which to use, I'd recommend that at the very start of the stream, you write some known 16-bit value (e.g. 255 = 0x00ff). Then from Java, pull out these two bytes and check their order (0xff, 0x00 or 0x00, 0xff) to see whether you have little or big endian.
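A sketch of the Java reading side, assuming the C writer prepends the 16-bit marker value 255 described above, followed by the ints and then a count-prefixed float array (that layout is invented for the example):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class StructReader {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(
                    Paths.get("data.bin"), StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
                ch.read(buf);
                buf.flip();

                // Byte-order probe: the writer emitted the 16-bit value 255 first.
                // Little-endian files start ff 00, big-endian files start 00 ff.
                byte b0 = buf.get(), b1 = buf.get();
                buf.order(b0 == (byte) 0xFF && b1 == 0x00
                        ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);

                int id = buf.getInt();
                int param1 = buf.getInt();
                int param2 = buf.getInt();
                int count = buf.getInt();      // number of floats that follow
                float[] data = new float[count];
                buf.asFloatBuffer().get(data); // bulk read, no per-element parsing
                System.out.printf("id=%d params=%d,%d floats=%d%n",
                        id, param1, param2, count);
            }
        }
    }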
One possibility is creating small XML files with title, ID, params, etc, and then a reference (by filename) to where the float data is contained. Assuming there's nothing special about the float data, and that Java and C are using the same floating point format, you can read that file in with readFloat() of a DataInputStream.
I like the CSV and "Protocol Buffers" answers (though, at a glance, the protocol buffer thing might be very similar to YAML for all I know).
If you need tightly packed records for high volume data, you might consider this:
Create a textual file header describing the current file structure: record sizes (types?) and field names / sizes. Read and parse the header, then use low-level binary I/O operations to load up each record's fields, er, object's properties, or whatever we are calling it this year.
This gives you the ability to change the structure a bit and have the file be self-describing, while still allowing you to pack a high volume into a smaller space than XML would allow.
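A sketch of that idea, with an invented header syntax: one ASCII header line naming each field and its type, followed by tightly packed binary records.

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.List;

    public class SelfDescribingReader {
        // Header example (a single '\n'-terminated ASCII line):
        //   id:int,param1:int,value:float
        public static void main(String[] args) throws Exception {
            try (DataInputStream in =
                     new DataInputStream(new FileInputStream("records.bin"))) {
                StringBuilder header = new StringBuilder();
                int c;
                while ((c = in.read()) != -1 && c != '\n') {
                    header.append((char) c);
                }
                String[] fields = header.toString().split(",");

                // Each record is read field by field, driven by the header, so
                // the reader keeps working when the structure changes a bit.
                List<Object[]> records = new ArrayList<>();
                while (in.available() > 0) {
                    Object[] record = new Object[fields.length];
                    for (int i = 0; i < fields.length; i++) {
                        String type = fields[i].split(":")[1];
                        record[i] = type.equals("int") ? in.readInt() : in.readFloat();
                    }
                    records.add(record);
                }
                System.out.println(records.size() + " records loaded");
            }
        }
    }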
TMTOWTDI, I guess.