Is there a better way to shorten (use fewer characters in) a String in Java besides converting the chars to ints and running them through base36?
For example, say if I wanted to shorten a URL.
Short URL services (like 'tinyurl') work by storing a big database table that maps from short URLs to their full form.
When you request a tinyurl, the service allocates a random-looking short url (that is not currently in use) and creates an entry in its table that maps from the short url to your supplied longer one.
When you try to load the short url in a browser, the request first goes to the tinyURL service, which looks up the full URL and then sends an HTTP redirect response to the browser telling it to go to the real URL.
You can implement your own URL shortening service by doing the same thing, though if you are shortening your own URLs you may be able to do the redirection internally in your web server; e.g. using a servlet request filter.
I described the above in the context of shortening URLs in a way that still allows the URLs to be resolved1. But this approach can also be used more generally; i.e. by creating a pair of Map<String,String> objects and populating them with bidirectional mappings between sequentially generated short strings and the original (probably longer) strings. It is possible to prove that this will give a smaller average short-string size than any algorithmic compression or encoding scheme over the same set of long strings.
The downside is the space needed to store the mappings, and the fact that you need the mappings any place (e.g. on any computer) where you need to do the short-to-long or long-to-short conversions.
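Here is a minimal sketch of that idea, assuming sequential IDs rendered in base 36 (the class and method names are illustrative, not a production design):

import java.util.HashMap;
import java.util.Map;

public class ShortStringTable {
    private final Map<String, String> shortToLong = new HashMap<>();
    private final Map<String, String> longToShort = new HashMap<>();
    private long counter = 0;

    // Returns the existing short form, or allocates the next sequential one.
    public synchronized String shorten(String longForm) {
        String existing = longToShort.get(longForm);
        if (existing != null) {
            return existing;
        }
        String shortForm = Long.toString(counter++, 36); // "0", "1", ..., "a", ...
        shortToLong.put(shortForm, longForm);
        longToShort.put(longForm, shortForm);
        return shortForm;
    }

    // Resolves a short form back to the original string, or null if unknown.
    public synchronized String resolve(String shortForm) {
        return shortToLong.get(shortForm);
    }
}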
1 - When you think about it, that is essential. If you shorten a URL string and the result is no longer resolvable, it is not a useful URL for most purposes.
Since URLs are UTF-8, and their characters are therefore base 256, encoding the same characters as integer code points in base 36 can only make them longer. Or are you not asking what it sounds like you are asking?
Further, Java Strings are UTF-16, i.e. base 65536, so encoding their code points in base 36 will make Java strings even longer.
Just as encoding binary data in base 64 makes it longer by a factor of 4/3: every 3 bytes require 4 base-64 characters to encode.
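A quick way to see that ratio, using the JDK's own encoder:

import java.util.Base64;

public class Base64Growth {
    public static void main(String[] args) {
        byte[] raw = "Hello, world!".getBytes(); // 13 bytes of ASCII
        String encoded = Base64.getEncoder().encodeToString(raw);
        // Every 3 bytes become 4 characters (the last group is padded),
        // so this prints "13 bytes -> 20 characters".
        System.out.println(raw.length + " bytes -> " + encoded.length() + " characters");
    }
}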
Put the full URLs in a database and use the id as the redirect URL.
I am sending data via a JSON body in a POST request from a client (Java) to a server (Java), using a Spring RestTemplate and RestController.
The data is present as a POJO on the client and will be parsed into a POJO with the same structure on the server.
On the client I am converting a file to byte[] with Files.readAllBytes and storing it in the content field.
On the server side the whole object including the byte[] will be marshalled to XML using JAXB annotations.
class BinaryObject {
    String fileName;  // original file name
    String mimeCode;  // MIME type of the content
    byte[] content;   // raw file bytes
}
Everything is working fine and running as intended.
I heard it could be beneficial to encode the content field before transmitting the data to the server and decode it there before it is marshalled into XML.
My Question
Is it necessary or recommended to additionally encode / decode the content field with base64?
TL;DR
To the best of my knowledge, you are not going against any good practice with your current implementation. One might question the design (exchanging files in JSON? Storing binary inside XML?), but that is a separate question.
Still, there is room for possible optimization, but the toolset you use (Spring RestTemplate + Spring Controller + JSON serialization (Jackson) + XML via JAXB) somewhat hides the possible optimizations from you.
You have to carefully weigh the pros and cons of working around your comfortable "automat(g)ical" serializations, which work well as of today, to see whether it is worth the trouble to tweak them.
We can nonetheless discuss the theory of what could be done.
A discussion about Base64
Base64 encoding is an efficient way to embed binary data in pure text formats (e.g. MIME structures such as email, some HTTP bodies, JSON, XML, ...), but it has two costs: the first is a non-negligible size increase (~33%), the second is CPU time.
Sometimes (but you'd have to profile to check whether that is your case) this cost is not negligible, especially for large files (due to buffering and char/byte conversions in the frameworks, you could easily end up using e.g. 4x the size of the encoded file in the Java heap).
When handling 10 kB files at 10 requests/second, this is usually NOT an issue.
But 10 MB files at 100 requests/second is another ball park altogether.
So you'd have to check (though I doubt your typical server will reach 100 req/s with 10 MB files, because that would be 1 GB/s of incoming network bandwidth).
What is optimizable in your current process
In your current process, multiple encodings take place: the client has to Base64 encode the bytes read from the file.
When the request hits the server, the server decodes the base64 back to a byte[], and then your XML serialization (JAXB) re-encodes the byte[] to base64.
So in effect, "you" (more exactly, the REST controller side of things) decoded the base64 content for nothing, because the XML side of things could have used it directly.
What could be done
A few things.
Do you need base64 at the calling site?
First, you do not have to encode at the client side. With JSON there is no choice, but the world did not wait for JSON to exchange files (i.e. arbitrary binary content) over HTTP.
If your content is a file name, a MIME type, and a file body, then a standard, direct HTTP call with no JSON at all is perfectly fine.
The MIME type maps to the Content-Type HTTP header, the file name goes in the Content-Disposition HTTP header, and the contents are the raw HTTP body. No base64 needed (but your server side needs to accept raw HTTP content as is). This is as standard as it gets.
This change would allow you to remove the encoding (client side), reduce the network size of the call (~33% less), and remove one decoding at the server side. The server would just have to base64 encode the raw stream once to produce the XML, and you would not even need to buffer the whole file contents for that (you'd have to tweak your JAXB model a bit, but you can JAXB-serialize bytes directly from an InputStream, which means almost no buffering; and since your CPU probably encodes faster than your network serves content, there is no real added latency).
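A hedged sketch of what the client side could look like with the Java 11+ HttpClient; the endpoint URL, file name and MIME type here are assumptions, not your application's actual values:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class RawUploadSketch {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("report.pdf"); // hypothetical file
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/upload"))  // hypothetical endpoint
                .header("Content-Type", "application/pdf")      // maps the MIME type
                .header("Content-Disposition",
                        "attachment; filename=\"report.pdf\"")  // maps the file name
                // Streams the file as the raw body; no base64, no full in-memory buffer.
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}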
If this, for some reason, is not an option (say your client has to send JSON, and therefore base64 content), read on.
Can you avoid decoding at the server side?
Sort of. You can use a server-side bean where the content is actually a String and NOT a byte[]. This is hacky, but your REST controller will no longer deserialize the base64; it will keep it "as is", as a JSON string (which happens to be base64-encoded content, but the controller does not care).
So your server will have saved the CPU cost of one base64 decoding, but in exchange you'll have a base64 String in the Java heap (compared to the raw byte[]: +33% size on Java >= 9 with compact strings, +166% size on Java < 9).
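For illustration, the pass-through bean could look something like this (the class name is hypothetical; the field names mirror the POJO above):

// Same wire format as BinaryObject, but content stays an undecoded base64 String.
class BinaryObjectPassThrough {
    String fileName;
    String mimeCode;
    String content; // base64 text, kept as-is by the REST controller
}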
If you are to profit from this, you also have to tweak your JAXB mapping to treat the base64-encoded String as a byte[], which is not trivial as far as I can tell, unless you modify the JAXB object so that it accepts a String instead of the byte[], which is also kind of hacky (and if your JAXB objects are generated from an XML schema, this might become a real pain to implement).
All in all this is much harder; probably too much trouble if you are not really hitting a performance wall on this particular issue.
A few other things
Are your files pure binary, or are they actually text? If they are text, you may benefit from using a CDATA section on the XML side instead of base64.
Is your XML actually a SOAP call? If so, and if the service supports MTOM, you could avoid base64 completely, but that is an altogether different subject.
I have a Java WebAgent in Lotus Domino which runs through the OpenURL command (https://link.com/db.nsf/agentName?openagent). This agent was created to receive a POST with XML content. Before even parsing or saving the (XML) content, the web agent saves the content into an in-memory document:
For an agent run from a browser with the OpenAgent URL command, the in-memory document is a new document containing an item for each CGI (Common Gateway Interface) variable supported by Domino®. Each item has the name and current value of a supported CGI variable. (No design work on your part is needed; the CGI variables are available automatically.)
https://www.ibm.com/support/knowledgecenter/en/SSVRGU_9.0.1/basic/H_DOCUMENTCONTEXT_PROPERTY_JAVA.html
The content of the POST is saved (by Lotus) into the request_content field. When receiving content with the character é, as in:
<Name xml:lang="en">tést</Name>
the é is changed by Lotus to ?®. This is also what I see when reading out the request_content field in the document properties. Is it possible to save the é as é, and not as ?®, in Lotus?
Solution:
The way I fixed it is via this post:
Link which helped me solve this problem
The same solution, but in Java:
/****** INITIALIZATION ******/
session = getSession();
AgentContext agentContext = session.getAgentContext();

// Writing the item text out as LMBCS (Domino's native charset) reproduces
// the original raw bytes on disk...
Stream stream = session.createStream();
stream.open("C:\\Temp\\test.txt", "LMBCS");
stream.writeText(agentContext.getDocumentContext().getItemValueString("REQUEST_CONTENT"));
stream.close();

// ...and reading those bytes back as UTF-8 yields the correct characters.
stream.open("C:\\Temp\\test.txt", "UTF-8");
String content = stream.readText();
stream.close();
System.out.println("Content: " + content);
I've dealt with this before, but I no longer have access to the code so I'm going to have to work from memory.
This looks like a UTF-8 vs UTF-16 issue, but there are up to five charsets that can come into play: the charset used by the code that does the POST, the charset of the JVM the agent runs in, the charset of the Domino server code, the charset of the NSF (which is always LMBCS), and the charset of the Domino server's host OS.
If I recall correctly, REQUEST_CONTENT is treated as raw data, not character data. To get it right, you have to handle the conversion of REQUEST_CONTENT yourself.
The Notes API calls that you use to save data in the Java agent will automatically convert from Unicode to LMBCS and vice versa, but this only works if Java has interpreted the incoming data stream correctly. I think in most cases, the JVM running under Domino is configured for UTF-16 - though that may not be the case. (I recall some issue with a server in Japan, and one of the charsets that came into play was one of the JIS standard charsets, but I don't recall if that was in the JVM.)
So if I recall correctly, you need to read REQUEST_CONTENT into a byte array with getBytes("UTF-8") and then construct a new String from that byte array using new String(bytes, "UTF-16"). Then pass that string to NotesDocument.ReplaceItemValue(), and the Notes API calls should interpret it correctly.
I may have some details wrong here. It's been a while. Years ago I built a database that shows the LMBCS, UTF-8 and UTF-16 values for all Unicode characters. If you can get down to the byte values, it can be a useful tool for looking at data like this and figuring out what's really going on. It's downloadable from OpenNTF here. In a situation like this, I recall writing code that got the byte array, converted it to hex, and wrote it to a NotesItem so that I could see exactly what was coming in and compare it to the database entries.
And yes, as per the comments, it's much better if you let the XML tools on both sides handle the charset issues and encoding, but it's not always foolproof: you're adding another layer of charsets into the process, and you have to get it right. If the goal is to store data in NotesItems, you still have to make sure that the server-side XML tools decode into the correct charset, which may not be the default.
My heart breaks looking at this. I also just passed through this hell and found the old advice, but... I just could not bring myself to write to disk to solve this trivial matter.
// Read REQUEST_CONTENT as raw bytes and decode them as UTF-8 directly,
// with no round trip through the file system.
Item item = agentContext.getDocumentContext().getFirstItem("REQUEST_CONTENT");
byte[] bytes = item.getValueCustomDataBytes("");
String content = new String(bytes, Charset.forName("UTF-8"));
Edited in response to a comment by the OP: there is an old post on this theme:
http://www-10.lotus.com/ldd/nd85forum.nsf/DateAllFlatWeb/ab8a5283e5a4acd485257baa006bbef2?OpenDocument (the same thread that the OP used for his workaround)
The guy claims that the method fails when he uses a particular HTTP header.
Now, he was working with 8.5 and using LotusScript. In my case I cannot make it fail by sending an additional header (or by varying the string argument).
How I Learned to Stop Worrying and Love the Notes/Domino:
For what it's worth, getValueCustomDataBytes() works only with very short payloads, and it depends on the content! Starting your text with an accented character such as 'é' increases the length it still works with... but whatever I tried, I could not get past 195 characters. Am I surprised? After all these years with Notes, I must admit I still am...
Well, admittedly it should not have worked in the first place as it is documented to be used only with User Defined Data fields.
Finally
Use IBM's icu4j and icu4j-charset packages - drop them in jvm/lib/ext. Then the code becomes:
// Re-encode the item text back to LMBCS bytes, then decode those bytes as UTF-8.
byte[] bytes = item.getText().getBytes(CharsetICU.forNameICU("LMBCS"));
String content = new String(bytes, Charset.forName("UTF-8"));
and yes, you will need a permission in java.policy:
permission java.lang.RuntimePermission "charsetProvider";
Is this any better than passing through the file system? Don't know. But kinda looks cleaner.
Generally, is there a way to get a big JSON string in parts with a single request?
For example, if I have a JSON string consisting of three big objects, each 1 MB in size, can I somehow, within a single request, get the first 1 MB and parse it while the other objects are still downloading, instead of waiting for the full 3 MB string to finish?
If you know how big the parts are, it would be possible to split your request in three using HTTP/1.1 range requests. Assuming your ranges are defined correctly, you should be able to get the JSON objects directly from the server (if the server supports range requests).
Note that this hinges on a) the server's ability to handle range requests, b) the idempotency of your REST operation (it could very well run the call three times; a cache or reverse proxy may help with this), and c) your ability to know the ranges before you call.
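As an illustration, a hedged sketch with the Java 11+ HttpClient; the URL is hypothetical, and the server must answer 206 Partial Content for this to work:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RangeRequestSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest firstPart = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/big.json")) // hypothetical URL
                .header("Range", "bytes=0-1048575")              // the first 1 MB
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(firstPart, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // 206 if ranges are supported
        // Parse this chunk while requesting bytes=1048576-2097151, and so on.
    }
}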
I'm making a small project in Google App Engine but I'm having problems with international chars. My program takes data from the user through the URL "page.html?data1&data2..." and stores it for display later.
But when the user enters international characters like åäö, they get encoded as %F4, %F5 and %F6. I assume this is because only the first 128 characters of the ASCII table are allowed in HTTP requests.
Does anyone have a good solution for this? Is there a simple way to decode the text? And is it better to decode it before I store the data, or should I decode it when displaying it to the user?
URLs can contain anything, but it should be percent-encoded. In Java you can use URLEncoder and URLDecoder to encode and decode URL parts with the desired character encoding.
Keep in mind that these classes are actually meant for HTML form encoding, but they can be applied to the query string (the parameters) of a URL. So do not use them on whole URLs, only on the parameters.
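A minimal sketch of encoding and decoding a single parameter value (Java 10+ for the Charset overload):

import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ParamEncoding {
    public static void main(String[] args) {
        String raw = "åäö";
        String encoded = URLEncoder.encode(raw, StandardCharsets.UTF_8);
        System.out.println(encoded);  // %C3%A5%C3%A4%C3%B6
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        System.out.println(decoded);  // åäö
    }
}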
The URI spec (RFC 3986) restricts the characters that can be used in URIs (see the ABNF) and defines a percent-encoding scheme for transmitting "unsafe" characters. As Bozho says, the query part of the URL is usually encoded as per the HTML spec (application/x-www-form-urlencoded).
The doc for App Engine says:
App Engine uses the Java Servlet standard for web applications.
So, you should probably let the Servlet API decode the parameters for you. See the parameter methods on HttpServletRequest. This sort of encoding should generally be kept to the view layer, so data would be stored unencoded.
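A minimal sketch of what that looks like in a servlet; the parameter name data1 is taken from the question's example URL:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class DataServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // The container percent-decodes the query string for you.
        String value = req.getParameter("data1");
        // Store 'value' unencoded; escape it in the view layer when rendering.
        resp.setContentType("text/plain; charset=UTF-8");
        resp.getWriter().println(value);
    }
}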
If you do things manually, have a look at this blog post on character handling in URIs.
I learned that if cookies are disabled I can pass the JSESSIONID using a URL redirect, but my URL is already very long, and since I use the GET method there is a length constraint. How then should I use my sessions? I want my application to be very security-intensive.
This is one of the questions asked of my friend in a Google interview.
Apart from using one-letter parameter names (e.g. ?a=value1&b=value2&c=value3) or using RESTful-like URLs (i.e. just the path info, no query parameters, e.g. /value1/value2/value3, accessible via HttpServletRequest#getPathInfo() in the servlet) instead of ?name1=value1&name2=value2&name3=value3, you can also consider Gzipping and Base64-encoding the query string so that it becomes shorter, as in the sketch below. Both JavaScript and Java are capable of (de)compressing and (d)e(n)coding it. You can even format the query string as JSON before compressing/encoding; it will be shorter in the case of arrays/collections/maps.
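A hedged sketch of the Gzip + Base64 idea; the single q parameter is just an illustration, and note that gzip adds roughly 20 bytes of header, so this only pays off for long query strings:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public class QueryCompressor {
    // Gzips the query string and makes the result URL-safe with base64.
    static String compress(String query) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(query.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getUrlEncoder().withoutPadding()
                     .encodeToString(buffer.toByteArray());
    }

    public static void main(String[] args) throws Exception {
        String query = "name1=value1&name2=value2&name3=value3";
        // Could then be sent as a single hypothetical ?q=... parameter.
        System.out.println(compress(query));
    }
}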
That said, are you sure that the request URLs are really that unfriendly and long (assuming they're over 255 characters)? Why would you need to pass that much information in? Is it supposed to maintain the client state? If so, you shouldn't use the URL for this, but the HttpSession instance on the server side, which is already associated with the jsessionid cookie. Use HttpSession#setAttribute() to store some information in the session and HttpSession#getAttribute() to retrieve it.
As far as I understand, your main problem with JSESSIONID in the URL is the total length.
Perhaps you should have a closer look at why the URLs are so long in the first place. Since you already have a session, it is not unlikely that you can move some GET parameters into the session. There are also lots of different ways to make shorter URLs for pages (a la mod_rewrite).
With regards to security, the JSESSIONID is just as vulnerable over HTTP GET as over HTTP POST. Moving data into an HTTP POST body is not a security measure at all; without encryption it travels as plain text just like the query string. The best way to gain a bit more security is to encrypt the transport channel through TLS/SSL, in effect enabling HTTPS. This ensures that eavesdroppers (or man-in-the-middle attacks) will not have access to the plain text.
If you want your application to be security-intensive, why are you using GET? Use POST. This will also reduce the URL length.
As such, per the HTTP protocol there is no maximum limit on URL length. Most of the time it's the browser that imposes a maximum length. Try different browsers.
You should put forward the above points to the interviewer. They might be more interested in your ability to assess the system as a whole and identify any fundamental flaws.
If the URL is too long then you have to store that data somewhere else. Most sites would put the session ID in a cookie.