Java security risks of using the URL class to get HTML source

Hello, I have a question about the java.net.URL class.
If I just fetch a random URL and get the HTML as a string, do I put my computer at risk?
Is it possible that a website could exploit the URL class and take over my computer through my application, or at least infect it with some kind of malware?
I hope that my question is clear; if not, please ask me to clarify. English is not my native language.
Thanks for all your help.

If I just fetch a random URL and get the HTML as a string, do I put my computer at risk?
You do not specify what you would consider a risk. But fetching a string consumes network bandwidth, CPU time and (in most cases) storage space for the downloaded text. A malicious HTTP server that served an infinite random string would consume some of your network bandwidth and CPU forever, and if your program stored the string, it would eventually fail with an OutOfMemoryError. If your program was configured to use a large fraction of the machine's RAM, fetching that URL would degrade the performance of every other program on the computer until your program exited.
Something similar, a tarpit, has been done to slow down malicious programs, such as computer worms.

No. A string by itself cannot harm your machine. There is no harm unless you try to execute something extracted from the string.

Getting it in string form in itself won't do any harm, although if you set no timeout and don't do the fetching in a background thread, the remote party can "freeze" your application by intentionally sending data incredibly slowly.
All of this can be avoided by using Executors to fetch the HTML and setting a sensible limit on how long you're willing to wait for the content.
There is also the buffer-size problem that Raedwald mentions; again, this can be avoided by limiting the amount of data you accept.
Remember, though, that while getting the content into a string is safe (with the caveats already mentioned), what you then do with that string is what can make things dangerous.
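A minimal sketch of those safeguards, assuming plain HttpURLConnection (the timeout and size values are arbitrary choices, not from the answers above):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SafeFetch {

    // Hard caps so a hostile server can neither stall the connection forever
    // nor feed us an endless response.
    private static final int CONNECT_TIMEOUT_MS = 5_000;
    private static final int READ_TIMEOUT_MS = 10_000;   // per read(), not total
    private static final int MAX_BYTES = 1_000_000;      // ~1 MB cap on the body

    public static String fetch(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        conn.setReadTimeout(READ_TIMEOUT_MS);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                if (out.size() + n > MAX_BYTES) {
                    throw new IOException("response exceeded " + MAX_BYTES + " bytes");
                }
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8"); // assumes UTF-8; real pages may differ
        } finally {
            conn.disconnect();
        }
    }
}

Note that the read timeout bounds each individual read, not the whole download; a "slow drip" server can still hold you for a long time, which is why running the fetch through an Executor and bounding the total wait with Future.get(timeout, unit) is the more robust option.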


Ways to buffer REST response

There's a REST endpoint which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint's timeout.
That is, processing speed is less than network throughput.
Unfortunately, there's no way to raise processing speed enough, because there's no "enough": incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing it, in order to release the REST endpoint connection before the timeout occurs.
What I've come up with so far is downloading the incoming data to a temporary file and reading (processing) that file simultaneously, using an OutputStream/InputStream pair.
A sort of buffering, using a file.
This brings its own problems:
- what if processing speed becomes faster than downloading speed for some time and I hit EOF?
- the file parser operates on an ObjectInputStream, and it behaves strangely on an empty file or at EOF
- and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Update:
I'd like to point out that the HTTP server is out of my control.
Consider it a vendor data provider. They have many consumers and refuse to alter anything for just one.
It looks like we're the only ones to use all of their data, as our client app's processing speed is far greater than their sample client's performance metrics. Still, our app's performance cannot match the network throughput.
The server does not support HTTP range requests or pagination.
There's no way to divide the data into chunks to load, as there's no filtering attribute that would guarantee every chunk is small enough.
In short: we can download all the data within the given time before the timeout occurs, but we cannot process it.
Having an adapter between inputstream and outpustream, to pefrorm as a blocking queue, will help a ton.
You're using something like new ObjectInputStream(new FileInputStream(..._) and the solution for EOF could be wrapping the FileInputStream first in an WriterAwareStream which would block when hitting EOF as long a the writer is writing.
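A rough sketch of that idea (WriterAwareStream is the hypothetical class named above; this polling implementation is one possible way to build it, with the writer signalling completion through a shared flag):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Wraps a FileInputStream over a file that is still being appended to.
// EOF is only reported once the writer declares itself finished.
public class WriterAwareStream extends FilterInputStream {

    private final AtomicBoolean writerDone;

    public WriterAwareStream(InputStream in, AtomicBoolean writerDone) {
        super(in);
        this.writerDone = writerDone;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        while (true) {
            int n = in.read(b, off, len);
            if (n != -1 || writerDone.get()) {
                return n; // real data, or a genuine EOF
            }
            try {
                Thread.sleep(50); // writer still active: wait for the file to grow
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for writer", e);
            }
        }
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xFF;
    }
}

The downloader thread sets writerDone to true after the last byte is flushed, and the ObjectInputStream on top of this wrapper then sees a normal end of stream instead of a premature one.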
Anyway, if latency doesn't matter much, I would not bother starting to process before the download has finished. Oftentimes there isn't much you can do with an incomplete list of objects.
Maybe a memory-mapped-file-based queue like Chronicle-Queue could help you. It's faster than dealing with files directly and may even be simpler to use.
You could also implement a HugeBufferingInputStream that internally uses a queue, reads from its input stream, and, when it has accumulated a lot of data, spills it out to disk. This can be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, which automatically switches from memory to a file as the content grows, but I'm afraid it's optimized for small sizes (with tens of gigabytes expected, there's no point in trying to use memory).
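For what it's worth, the basic usage is small (the threshold value here is arbitrary):

import com.google.common.io.FileBackedOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class GuavaBufferDemo {
    public static void main(String[] args) throws IOException {
        // Buffers in memory up to the threshold, then spills to a temp file.
        FileBackedOutputStream buffer = new FileBackedOutputStream(1 << 20); // 1 MiB
        buffer.write("downloaded data goes here".getBytes("UTF-8"));
        buffer.close();

        try (InputStream in = buffer.asByteSource().openStream()) {
            System.out.println(in.read()); // consume the buffered data
        }
        buffer.reset(); // deletes the backing file, if one was created
    }
}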
Are there alternative solutions?
If your consumer (the HTTP client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes Range Requests:
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
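In plain Java, a ranged fetch looks roughly like this (the URL is hypothetical, and, per the update above, this only helps if the server actually honors the Range header):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangeFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/big-export"); // placeholder endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Range", "bytes=0-1048575"); // ask for the first 1 MiB
        // 206 Partial Content: the server honored the range.
        // 200 OK: the server ignored it and is sending everything.
        System.out.println("status: " + conn.getResponseCode());
        try (InputStream in = conn.getInputStream()) {
            // process this chunk, then request the next range at your own pace
        }
    }
}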
This is the sort of thing that queueing servers are made for: RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024 kB. Your application would push/publish those records/messages to the topic/queue. They would all share a common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily have been processed yet.
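A sketch of the producer side under those assumptions (broker address, topic name, and chunk size are placeholders):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Properties;

public class ChunkPublisher {

    private static final int CHUNK_SIZE = 1024 * 1024; // ~1 MB per message

    public static void publish(InputStream httpBody, String seriesId) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (Producer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] buf = new byte[CHUNK_SIZE];
            int n;
            // Using the series ID as the key sends every chunk to the same
            // partition, so consumers can rely on offsets for ordering.
            while ((n = httpBody.read(buf)) != -1) {
                producer.send(new ProducerRecord<>("export-chunks", seriesId,
                        Arrays.copyOf(buf, n)));
            }
            // An empty record acts as the "done"/EOF marker described above.
            producer.send(new ProducerRecord<>("export-chunks", seriesId, new byte[0]));
        }
    }
}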
not sure if this would help in your case because you haven't mentioned what structure and format the data are coming to you in; however, i'll assume a beautifully normalised, deeply nested hierarchical xml (i.e. pretty much the worst case for streaming, right?)
i propose a partial solution that could allow you to sidestep the limitation of not being able to control how your client interacts with the http data server:
- deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
- periodically download the output of the webservice using a built-in http querying library, a command-line util such as aria2c, curl, wget et al., an etl tool (or whatever you please), directly into a local device-backed .xml file - this happens as often as it needs to
- point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
- you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (i.e. update this date when the next ingest process completes)
- it would be theoretically possible for you to transform the underlying xml on the way through, to better yield records in a streaming fashion to your webservice client (if you're not already doing this), but this would take effort - i could discuss this more if a sample of the data structure was provided
- all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version of 'new data' is available
^ in trade, you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
- the last (two?) good result(s), compressed (in my experience, gigabytes of xml packs down a helluva lot)
- the next pending/provisional result while you're streaming to disc / doing an integrity check / ingesting the data (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
- if we assume that you're ingesting into a relational db: the current (and maybe previous) tables with the webservice data loaded into your app, plus the next pending table
- switching these around becomes a metadata operation, but now your database must store the webservice data at least x2 (or x3 - whatever fits your limitations)
... yes, you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques, or you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Camel could maybe help you regulate the network load between the REST producer and the consumer.
You could, for instance, introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, and apply some throttling policy before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice);
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, maybe you can use an in-memory data store like Redis.
When you get data from your REST endpoint, you can save it into a Redis list (or any other data structure that is appropriate for you).
Your consumer will then consume the data from that list.
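A minimal sketch with the Jedis client (host, key name, and values are assumptions; in practice the producer and consumer would use separate connections):

import redis.clients.jedis.Jedis;
import java.util.List;

public class RedisBuffer {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Producer side: push each downloaded chunk onto a list.
            jedis.rpush("rest-buffer", "chunk-1", "chunk-2");

            // Consumer side: block until a chunk is available, then process it.
            // A timeout of 0 means "wait indefinitely".
            List<String> item = jedis.blpop(0, "rest-buffer");
            System.out.println("processing " + item.get(1)); // item = [key, value]
        }
    }
}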

Linux: dirty page writeback and concurrent write

Background: in Java, I'm memory-mapping a file (shared).
I'm writing some value at address 0 of that file. I understand the corresponding page in the page cache is flagged as dirty and will be written back later, depending on dirty_ratio and similar settings.
So far so good.
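Concretely, the setup being described is something like this (file name and mapping size are placeholders):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapWrite {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("data.bin"),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putInt(0, 42); // dirties the page containing offset 0
            buf.putInt(0, 43); // a second write to the same page while the kernel
                               // may be writing it back: does this block?
        }
    }
}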
But I'm wondering what happens when I write once more at address 0 while the kernel is writing the dirty page back to the file. Is my process somehow blocked, waiting for the writeback to complete?
It may be. Blocking is only necessary when the device-level I/O requests include a checksum alongside the written data. Otherwise the first write may be torn, but it can then be corrected by the second write.
As always, carefully consider your safety against power failure, kernel crashes, etc.
The waiting is allegedly avoided in btrfs (and also, by happenstance, in the legacy ext3 filesystem, but not in ext4 or ext2).
This looks like a bit of a moving target. The above (as far as I could tell) describes the first optimization of this "stable page write" code, following the complaints when it was first introduced. The commit description mentions several possibilities for future changes.
bdi: allow block devices to say that they require stable page writes
mm: only enforce stable page writes if the backing device requires it
Does my device currently use "stable page writes"?
There is a sysfs attribute you can look at, called stable_pages_required
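For example, reading it from Java (the device name sda is an assumption; substitute your own block device):

import java.nio.file.Files;
import java.nio.file.Path;

public class StablePagesCheck {
    public static void main(String[] args) throws Exception {
        // "1" means the backing device requires stable page writes, i.e. a
        // second write to a page under writeback may block until it completes.
        Path attr = Path.of("/sys/block/sda/bdi/stable_pages_required");
        System.out.println(Files.readString(attr).trim());
    }
}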

Protecting Sections of Source Code [duplicate]

I'm writing a Java class that connects to a server and reads messages in a given queue.
I would like to protect the username and password, which, right now, appear as plain text in the source code.
What I'm wondering is: what is a good way to do this? If I encrypt the username and password in a text file, won't I need to store the key, in plain text, in whatever source code accesses that file? And then anyone else who decides to use my class would be able to gain access to those fields.
There is no prompt where someone can enter the key, either, as this class will be used autonomously by the system.
EDIT: this will become a Java lib file. But those can easily be decompiled, and thus are basically the original class files anyway, right? And the people this is being protected from are fellow developers of other systems who will gain access to this lib file.
My end goal is to have the username and password strings not appear as plain text anywhere, and to make them as difficult as possible to crack.
It is not possible to do this. Even if you encrypt the login/password and store it somewhere (be it your class or an external file), you'd still need to save the encryption key somewhere in plain text. This is only marginally better than saving the username/password in plain text; in fact, I would avoid doing so, as it creates a false sense of security.
So I'd suggest that your class take the username/password as parameters, and that the system using your class take care of protecting the credentials. It could do so by asking an end user to enter the credentials, or by storing them in an external file that is only readable by the operating-system user your process runs as.
Edit: You might also think about using mechanisms such as OAuth, which use tokens instead of passwords. Tokens have a limited lifetime and are tied to a certain purpose, so they are a good alternative to access credentials. Your end users could obtain an access token with, say, their Windows credentials, which is then used inside your class to call the protected service.
This is a classic authentication issue, except that here, Eve can wear Bob's skin like a suit. Is that stretching the metaphor? I'm not sure.
The short answer is that there is no true answer, because what you want basically violates information theory: anything transmittable is copyable, and thus anything accessible can be viewed as no longer unique. Even if you had a magic box, an attacker could just yank out the magic box with some serious JVM hacking.
The long answer is that there are a few solutions that are almost pretty okay, in that they make it really quite darn hard. I suggest you read the linked article, acquaint yourself with the ideas behind SRP and the vulnerabilities the spec entails, and try to figure out how to get the right to use and implement it. The problem is still there, though: you want a system that ensures Bob can never become a flesh-chariot, or fall to the dark side.
Fundamentally, you're running up against the tenth law. I agree with Kork: there's no solution that really does what you want, because you're trying to solve a social problem with a technical feat, one that is quite nearly provably impossible.
There are a few ways of handling this problem. The challenge, as you've noted, is associating an account with this automated process. Here are some of the possibilities, from least to most secure:
- Encrypt the username and password with a calculated key (see the sketch below).
  - The calculated key is based on something both the client and the server can infer (like machine name and IP address).
- Associate an authentication token with the client (OAuth style).
  - The token is negotiated through a one-time user interaction when setting up the client.
  - The negotiated token is used for all future requests.
  - The negotiated token is only valid for that client, on that machine, using that user account (the server uses socket info to determine the match).
- Use multiple forms of authentication:
  - an OAuth-style token, and
  - a calculated token based on time plus a secondary id (requires clients and servers to be synched to the same time server).
It is important to note that your security measures need only be more expensive to defeat than what they protect is worth. In short, if all the potential bad guy can get is your food preference of the day, you need not be as vigilant as when protecting something higher profile, like a bank account. User names and passwords are not the only means of authentication.
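A toy illustration of the first option, the calculated key (hostname as key material and AES in its default mode are my choices for brevity; note this is obfuscation, not strong security, since anyone on the same machine can derive the same key):

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;

public class CalculatedKey {

    // Derive an AES key from something both ends can compute independently.
    static SecretKeySpec deriveKey() throws Exception {
        String material = InetAddress.getLocalHost().getHostName();
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(material.getBytes(StandardCharsets.UTF_8));
        return new SecretKeySpec(Arrays.copyOf(digest, 16), "AES"); // AES-128 key
    }

    public static void main(String[] args) throws Exception {
        Cipher cipher = Cipher.getInstance("AES"); // default mode, for brevity only
        cipher.init(Cipher.ENCRYPT_MODE, deriveKey());
        byte[] enc = cipher.doFinal("user:password".getBytes(StandardCharsets.UTF_8));
        System.out.println(Base64.getEncoder().encodeToString(enc));
    }
}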
It's not clear which code has to know the user name and password. Are these credentials just for the queue being read? If so, only the server code would need to know them. In that case, you could store them in a server file whose permissions allow only the server code to read them. The file permissions would then be enforced by the server's operating system, which presumably is much better at security than most programmers will ever be.
I know this question is long since abandoned, but I want to point out that you can of course do this by requiring typed credentials at runtime and storing only a hash of the password. It needs to be a really good hash: use a standard one, don't make up your own. The whole point of a hash is that even if the hashed result is stored in plain text, no one else will be able to come up with a string that yields that hash, even knowing how the hash is computed.
Of course users can try a brute-force attack, and since they know the result they want, they can run it fast, so you need to use a highly secure password.
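A sketch of that approach using PBKDF2, one such standard hash (the iteration count and key length are typical values, not requirements):

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.security.MessageDigest;

public class PasswordCheck {

    // Verify a password typed at runtime against a stored salt + hash.
    // Only the salt and the hash are ever stored; the password itself is not.
    static boolean verify(char[] typed, byte[] salt, byte[] storedHash) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(typed, salt, 100_000, 256);
        byte[] hash = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
        spec.clearPassword(); // wipe the typed password from the spec
        return MessageDigest.isEqual(hash, storedHash); // constant-time comparison
    }
}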

URL.openStream() is very slow when run on school's unix server

I am using URL.openStream() to download many HTML pages for a crawler that I am writing. The method runs great locally on my Mac; however, on my school's unix server the method is extremely slow, but only when downloading the first page.
Here is the method that downloads the page:
public static String download(URL url) throws IOException {
    long start = System.currentTimeMillis();
    InputStream is = url.openStream();
    System.out.println("\t\tCreated 'is' in "
            + ((System.currentTimeMillis() - start) / (1000.0 * 60)) + "minutes");
    ...
}
And the main method that invokes it:
LinkedList<URL> ll = new LinkedList<URL>();
ll.add(new URL("http://sheldonbrown.org/bicycle.html"));
ll.add(new URL("http://www.trentobike.org/nongeo/index.html"));
ll.add(new URL("http://www.trentobike.org/byauthor/index.html"));
ll.add(new URL("http://www.myra-simon.com/bike/travel/index.html"));
for (URL tmp : ll) {
System.out.println();
System.out.println(tmp);
CrawlerTools.download(tmp);
}
Output locally (Note: all are fast):
http://sheldonbrown.org/bicycle.html
Created 'is' in 0.00475minutes
http://www.trentobike.org/nongeo/index.html
Created 'is' in 0.005083333333333333minutes
http://www.trentobike.org/byauthor/index.html
Created 'is' in 0.0023833333333333332minutes
http://www.myra-simon.com/bike/travel/index.html
Created 'is' in 0.00405minutes
Output on the school machine server (note: all are fast except the first one, which is slow regardless of which site comes first):
http://sheldonbrown.org/bicycle.html
Created 'is' in 3.2330666666666668minutes
http://www.trentobike.org/nongeo/index.html
Created 'is' in 0.016416666666666666minutes
http://www.trentobike.org/byauthor/index.html
Created 'is' in 0.0022166666666666667minutes
http://www.myra-simon.com/bike/travel/index.html
Created 'is' in 0.009533333333333333minutes
I am not sure if this is a Java issue (a problem in my Java code) or a server issue. What are my options?
When run on the server this is the output of the time command:
real 3m11.385s
user 0m0.277s
sys 0m0.113s
I am not sure if this is relevant... What should I do to try to isolate my problem?
You've answered your own question. It's not a Java issue, it has to do with your school's network or server.
I'd recommend that you report your timings in milliseconds and see if they're repeatable. Run that test in a loop - 1,000 or 10,000 times - and keep track of all the values you get. Import them into a spreadsheet and calculate some statistics. Look at the distribution of values. You don't know if the one data point that you have is an outlier or the mean value. I'd recommend that you do this for both networks in exactly the same way.
I'd also recommend using Fiddler or some other tool to watch network traffic as you download. You can get better insight into what's going on and perhaps ferret out the root cause.
But it's not Java. It's your code, your network. If this was a bug in the JDK it would have been fixed a long time ago. Suspect yourself first, last, and always.
UPDATE:
My network admin assured me that this was a bad Java implementation, not a network problem. What do you think?
"Assured" you? What evidence did s/he produce to support this conclusion? What data? What measurements were taken? Sounds like laziness and ignorance to me.
It certainly doesn't explain why all the other requests behave just fine. What changed in Java between the first and subsequent calls? Did the JVM suddenly rewrite itself?
You can accept it if you want, but I'd say shame on your network admin for not being more curious. It would have been more honorable to be honest and say they didn't know, didn't have time, and weren't interested.
By default, Java prefers to use IPv6. My school's firewall drops all IPv6 traffic (with no warning). After 3 minutes and 15 seconds, Java falls back to IPv4. It seems strange to me that the fallback takes so long.
duffymo's answer, essentially "go talk to your network admin", helped me solve the problem, though I think it was caused by a combination of a strange Java default and a strange network configuration.
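The usual workaround, for anyone hitting the same thing, is to tell the JVM to prefer IPv4 at startup (the flag must be set before networking classes load, so pass it on the command line; Crawler is a placeholder for your main class):

java -Djava.net.preferIPv4Stack=true Crawler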

Can someone steal a password from a Java application?

Suppose there is a String variable that holds a plain-text password.
Is there any possibility of reading this password using a memory dump (say, using Cheat Engine)? I am puzzled by this JVM thing. Does the JVM provide some sort of protection against this? If not, what practices should I use to avoid such "stealing"?
A practical threat would be a Trojan that sends segments of a memory dump to an external party.
As already noted, yes, anybody can extract the password, in various ways. Encrypting the password won't really help -- if it's decrypted by the application, then the decrypted form will also be present at some point, plus the decryption key (or code) itself becomes a vulnerability. If it's sent somewhere else in encrypted form, then just knowing the encrypted form is enough to spoof the transaction, so that doesn't help much either.
Basically, as long as the "attacker" is also the "sender", you're eventually going to get cracked -- this is why the music and video industries can't get DRM to work.
I suggest you pick up a copy of Applied Cryptography and read the first section, "Cryptographic Protocols". Even without getting into the mathematics of the actual crypto, this will give you a good overview of all sorts of design patterns in this area.
If you keep the password in plain text in your application then someone can read it by playing with memory dumps regardless of the language or runtime you use.
To reduce the chance of this happening, only keep the password in plain text for as long as you really need to; then dump or encrypt it. One thing to note here is that JPasswordField returns a char[] rather than a String. This is because you have no control over when a String will vanish from memory. While you also have no control over when a char[] will vanish, you can fill it with junk when you are done with the password.
I say reduce because this will not stop someone. As long as the password is in memory it can be recovered, and since the decryption code also has to be part of the deliverable, it too could be cracked, leaving your password wide open.
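A small sketch of that char[] handling (the body of authenticate is left as a placeholder):

import javax.swing.JPasswordField;
import java.util.Arrays;

public class PasswordHandling {
    static void authenticate(JPasswordField field) {
        char[] password = field.getPassword(); // a char[], not a String
        try {
            // ... use the password for authentication here ...
        } finally {
            Arrays.fill(password, '\0'); // overwrite it as soon as you're done;
                                         // a String would linger until GC, or longer
        }
    }
}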
This has nothing to do with Java - the exact same problem (if it really is one) exists for applications written in any language:
If the executable contains a password, no matter how obfuscated or encrypted, everyone who has access to the executable can find out the password.
If an application knows a password or key temporarily (e.g. as part of a network authentication protocol), then anyone who can observe the memory the application is executing in can find out the password.
The latter is usually not considered a problem, since a modern OS does not allow arbitrary applications to observe each other's memory, and privilege escalation attacks typically rely on different vectors of attack.
If the program knows the password, anybody using the program can extract the password.
In theory you could just hook it up to a debugger, set a breakpoint, and read the string contents.
