The FileInputStream.available() javadoc says:
Returns an estimate of the number of
remaining bytes that can be read (or
skipped over) from this input stream
without blocking by the next
invocation of a method for this input
stream. The next invocation might be
the same thread or another thread. A
single read or skip of this many bytes
will not block, but may read or skip
fewer bytes.
In some cases, a non-blocking read (or
skip) may appear to be blocked when it
is merely slow, for example when
reading large files over slow
networks.
I'm not sure if in this check:
if (new FileInputStream(xmlFile).available() == 0)
can I rely on empty files always returning zero?
--
Thanks to #SB, who did not exactly answer the question, but was the first to give the best alternative:
If xmlFile is a java.io.File object,
you can use the length() method to get
its size.
You can rely on new FileInputStream(fileName).available() returning zero if the named file is empty.
You cannot rely on new FileInputStream(fileName).available() == 0 as a definitive test that the file is empty. If fileName is a regular file on a local file system it will probably work. But if fileName is a device file or if it is a file on a remote file system, available() may return zero to report that a read() will have to block for a period. (Or in the case of a remote file system, it may not.)
A more reliable way to test the length of a regular file is to use new File(fileName).length() == 0. However for a device file or pipe, a length() call may return zero, irrespective of the number of bytes that can ultimately be read. And bear in mind that new File(fileName).length() also returns zero if the file does not exist.
EDIT If you want a reliable test to see if a file is empty, you have to make a number of calls:
public static boolean isEmptyFile(String fileName) {
File file = new File(fileName);
if (!file.exists()) {
return false;
} else if (file.length() != 0L) {
return false;
} else if (file.isFile()) {
return true;
} else if (file.isDirectory()) {
return false;
} else {
// It may be impossible to tell that a device file / named pipe is
// "empty" without actually reading it. This is not a failing of
// Java: it is a logical consequence of the way that certain
// devices, etc work.
throw new CannotAnswerException(...);
}
}
But you would be well advised to test this carefully with a variety of "file" types on all platforms that you run your application on. The behavior of some of the file predicates is documented as being platform specific; see the javadoc.
I strongly advise against using available() - it can return 0 because the stream is blocked, even though there are still enough bytes to read. It probably won't occur with Files but the API does not guarantee it won't.
The same approach can be used with read() though:
if (new FileInputStream(xmlFile).read() == -1)
System.out.println("!!File empty!!");
If xmlFile is a java.io.File object, you can use the length() method to get its size.
My logical answer to the question "can I rely that empty files will always return zero?" is "Yes, for empty files, available() will return 0".
But you probably do also want to know "can I rely that only empty files will return zero?", and there the answer is "No, not by specification: available() might return 0 even if the file is not empty".
Additionally, you are opening a stream on a file and do not close it again. This may lead to unexpected and undesired behaviour, e.g. you may not be able to move or delete the file as long as your Java program is running. This is especially annoying if your program runs in an application server which usually runs for a very long time, making the file effectively immutable.
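As a minimal sketch of the read()-based check that also closes the stream, something like the following could be used. The class and method names (EmptyCheck, hasNoBytes) are made up for illustration, and it assumes Java 7+ for try-with-resources. Note that on a pipe or device file the read() call may block until data arrives.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

final class EmptyCheck {
    /**
     * Hypothetical helper: returns true if the first read() reports end of stream.
     * Throws FileNotFoundException (an IOException) if the file cannot be opened,
     * so a missing file is not silently treated as empty.
     */
    static boolean hasNoBytes(String fileName) throws IOException {
        // try-with-resources closes the stream even if read() throws
        try (InputStream in = new FileInputStream(fileName)) {
            return in.read() == -1;
        }
    }
}
```

Because the stream is closed before the method returns, the file can still be moved or deleted afterwards.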
Related
I've read the java docs and a number of related questions but am unsure if the following is guaranteed to work:
I have a DataInputStream on a dedicated thread that continually reads small amounts of data, of known byte-size, from a very active connection. I'd like to alert the user when the stream becomes inactive (i.e. network goes down) so I've implemented the following:
...
streamState = waitOnStreamForState(stream, 4);
int i = stream.readInt();
...
private static int
waitOnStreamForState(DataInputStream stream, int nBytes) throws IOException {
return waitOnStream(stream, nBytes, STREAM_ACTIVITY_THRESHOLD, STREAM_POLL_INTERVAL)
? STREAM_STATE_ACTIVE
: STREAM_STATE_INACTIVE;
}
private static boolean
waitOnStream(DataInputStream stream, int nBytes, long timeout, long pollInterval) throws IOException {
int timeWaitingForAvailable = 0;
while( stream.available() < nBytes ){
if( timeWaitingForAvailable >= timeout && timeout > 0 ){
return false;
}
try{
Thread.sleep(pollInterval);
}catch( InterruptedException e ){
Thread.currentThread().interrupt();
return (stream.available() >= nBytes);
}
timeWaitingForAvailable += pollInterval;
}
return true;
}
The docs for available() explain:
Returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next caller of a method for this input stream. The next caller might be the same thread or another thread. A single read or skip of this many bytes will not block, but may read or skip fewer bytes.
Does this mean it's possible the next read (inside readInt()) might only, for instance, read 2 bytes, and the subsequent read to finish retrieving the Integer could block? I realize readInt() is a method of the stream 'called next' but I presume it has to loop on a read call until it gets 4 bytes and the docs don't mention subsequent calls. In the above example is it possible that the readInt() call could still block even if waitOnStreamForState(stream, 4) returns STREAM_STATE_ACTIVE?
(and yes, I realize my timeout mechanism is not exact)
Does this mean it's possible the next read (inside readInt()) might only, for instance, read 2 bytes, and the subsequent read to finish retrieving the Integer could block?
That's what it says. However at least the next read() won't block.
I realize readInt() is a method of the stream 'called next' but I presume it has to loop on a read call until it gets 4 bytes and the docs don't mention subsequent calls. In the above example is it possible that the readInt() call could still block even if waitOnStreamForState(stream, 4) returns STREAM_STATE_ACTIVE?
That's what it says.
For example, consider SSL. You can tell that there is data available, but you can't tell how much without actually decrypting it, so a JSSE implementation is free to:
always return 0 from available() (this is what it used to do)
always return 1 if the underlying socket's input stream has available() > 0, otherwise zero
return the underlying socket input stream's available() value and rely on this wording to get it out of trouble if the actual plaintext data is less. (However the correct value might still be zero, if the cipher data consisted entirely of handshake messages or alerts.)
However you don't need any of this. All you need is a read timeout, set via Socket.setSoTimeout(), and a catch for SocketTimeoutException. There are few if any correct uses of available(): fewer and fewer as time goes on, it seems to me. You should certainly not waste time calling sleep().
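A minimal sketch of the read-timeout approach, assuming the DataInputStream wraps the socket's own input stream. The class and method names (TimedRead, readIntOrTimeout) and the choice to return null on timeout are illustrative assumptions, not part of any API.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class TimedRead {
    /**
     * Hypothetical helper: returns the next int from the stream, or null if
     * no data arrived within timeoutMillis. The stream must be the one
     * obtained from socket.getInputStream() for the timeout to apply.
     */
    static Integer readIntOrTimeout(Socket socket, DataInputStream in, int timeoutMillis)
            throws IOException {
        socket.setSoTimeout(timeoutMillis); // applies to each blocking read on this socket
        try {
            return in.readInt();
        } catch (SocketTimeoutException e) {
            return null; // treat as "stream inactive"
        }
    }
}
```

One caveat with this sketch: if the timeout fires after readInt() has consumed only part of the four bytes, those bytes are lost, so it fits best when a timeout means the connection is considered dead rather than something to retry.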
I am using Klocwork to review my code.
For the given line of code:
byte[] sigToVerify = new byte[sigFileInputStream.available()];
I am getting the following error report:
SV.DOS.ARRSIZE: Unvalidated user input
sigFileInputStream.available() used for array size - attacker can
specify a large number leading to high resource usage on the server
and a DOS attack
Please help me resolve this issue.
Without more of your code snippet to go on, I would think that Klocwork is reporting a valid issue here. You should review the documentation provided for the SV.DOS.ARRSIZE checker, which explains why this is reported. On the Vulnerability and risk:
The use of data from outside the application must be validated before
use by the application. If this data is used to allocate arrays of
objects in the application, the content of the data must be closely
checked. Attackers can exploit this vulnerability to force the
application to allocate very large numbers of objects, leading to high
resource usage on the application server and the potential for a
denial-of-service (DoS) condition.
On the Mitigation and prevention:
The prevention of DoS attacks from user input can be achieved by
validating any and all input from outside the application (user input,
file input, system parameters, etc.). Validation should include length
and content. ... Data used for allocation should also be checked for
reasonable values, assuming that user input could contain very small
or very large values.
Even the Java API docs for InputStream (the superclass of FileInputStream) warn that using the return value of the available() method this way is a bad idea:
Note that while some implementations of InputStream will return the
total number of bytes in the stream, many will not. It is never
correct to use the return value of this method to allocate a buffer
intended to hold all data in this stream.
An example of how to fix your code to avoid this would be to, as suggested above, validate the value returned by available() before using it to allocate the array:
int buffSize = sigFileInputStream.available();
if (buffSize > 0 && buffSize < 100000000) { // 100MB
byte[] sigToVerify = new byte[buffSize];
// do something with sigToVerify ...
} else {
// error
}
Note that 100000000 or 100MB for sigToVerify may still be way too large for your purposes, or it could be too small. You should determine the most sane value to use here based on what your code is trying to accomplish.
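Since available() is only an estimate, an alternative is to not size the buffer from it at all and instead read until end of stream while enforcing a cap. The following is a rough sketch of that idea, not something Klocwork prescribes; the names BoundedRead and readAtMost and the 8 KB chunk size are arbitrary choices for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class BoundedRead {
    /**
     * Hypothetical helper: reads the whole stream into memory, but fails
     * with an IOException as soon as more than maxBytes have been seen.
     */
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("input exceeds " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

The limit still has to be chosen by you; the sketch only guarantees that the allocation never grows past it.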
I have a large file (English Wikipedia articles only database as XML files). I am reading one character at a time using BufferedReader. The pseudocode is:
file = BufferedReader...
while (file.ready())
character = file.read()
Is this actually valid? Or will ready just return false when it is waiting for the HDD to return data and not actually when the EOF has been reached? I tried to use if (file.read() == -1) but seemed to run into an infinite loop that I literally could not find.
I am just wondering if it is reading the whole file as my statistics say 444,380 Wikipedia pages have been read but I thought there were many more articles.
The Reader.ready() method is not intended to be used to test for end of file. Rather, it is a way to test whether calling read() will block.
The correct way to detect that you have reached EOF is to examine the result of a read call.
For example, if you are reading a character at a time, the read() method returns an int which will either be a valid character or -1 if you've reached the end-of-file. Thus, your code should look like this:
int character;
while ((character = file.read()) != -1) {
...
}
Using ready() is not guaranteed to read the whole input. ready() just tells you whether the underlying stream has some content ready. If the reader is abstracting over a network socket or file, for example, a false result could simply mean that no buffered data is available yet, not that the end has been reached.
Searching for a string in a file and writing the lines containing the matched string to another
file takes 15 - 20 mins for a single zip file of 70MB (compressed state).
Are there any ways to minimise it?
my source code:
getting Zip file entries
zipFile = new ZipFile(source_file_name);
entries = zipFile.entries();
while (entries.hasMoreElements()) {
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if (entry.isDirectory()) {
        continue;
    }
    searchString(Thread.currentThread(), entry.getName(), new BufferedInputStream(zipFile.getInputStream(entry)), Out_File, search_string, stats);
}
zipFile.close();
Searching String
public void searchString(Thread CThread, String Source_File, BufferedInputStream in, File outfile, String search, String stats) throws IOException
{
int count = 0;
int countw = 0;
int countl = 0;
String s;
String[] str;
BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
System.out.println(CThread.currentThread());
while ((s = br2.readLine()) != null)
{
str = s.split(search);
count = str.length - 1;
countw += count; //word count
if (s.contains(search))
{
countl++; //line count
WriteFile(CThread,s, outfile.toString(), search);
}
}
br2.close();
in.close();
}
--------------------------------------------------------------------------------
public void WriteFile(Thread CThread,String line, String out, String search) throws IOException
{
BufferedWriter bufferedWriter = null;
System.out.println("writre thread"+CThread.currentThread());
bufferedWriter = new BufferedWriter(new FileWriter(out, true));
bufferedWriter.write(line);
bufferedWriter.newLine();
bufferedWriter.flush();
}
Please help me. It's really taking 40 mins for 10 files using threads and 15 - 20 mins for a single file of 70MB (compressed). Are there any ways to minimise the time?
You are reopening the file output handle for every single line you write.
This is likely to have a massive performance impact, far outweighing other performance issues. Instead I would recommend creating the BufferedWriter once (e.g. upon the first match) and then keeping it open, writing each matching line and then closing the Writer upon completion.
Also, remove the call to flush(); there is no need to flush each line as the call to Writer.close() will automatically flush any unwritten data to disk.
Finally, as a side note your variable and method naming style does not follow the Java camel case convention; you might want to consider changing it.
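The advice about opening the writer once could be sketched roughly as follows. The class and method names (LineSearch, copyMatchingLines) are invented for this example, and it takes a Writer parameter so that the caller decides when the output file is opened and closed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;

final class LineSearch {
    /**
     * Hypothetical helper: copies every line containing `search` to `out`
     * and returns the number of matching lines. The writer is wrapped once,
     * reused for every match, and flushed once at the end.
     */
    static int copyMatchingLines(BufferedReader in, Writer out, String search)
            throws IOException {
        BufferedWriter writer = new BufferedWriter(out);
        int matchedLines = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains(search)) {
                writer.write(line);
                writer.newLine();
                matchedLines++;
            }
        }
        writer.flush(); // single flush; the caller closes the underlying writer
        return matchedLines;
    }
}
```

A caller would open one FileWriter (in append mode if needed) before looping over the zip entries, pass it to each call, and close it after the last entry.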
I'm not sure if the cost you are seeing is from disk operations or from string manipulations. I'll assume for now that the problem is the strings; you can check that by writing a test driver that runs your code with the same line over and over.
I can tell you that split() is going to be very expensive in your case because you are producing strings you don't need and then recycling them, creating much overhead. You may want to increase the amount of space available to your JVM with -Xmx.
If you merely separate words by the presence of whitespace, then you would do much better by using a regular expression matcher that you create before the loop and apply to each string. The number of matches when applied to a given string will be your word count, and that should not create an array of strings (which is very wasteful and which you don't use). You will see in the JavaDocs that split does work via regular expressions; that is true, but split does the extra step of creating separate strings and that's where your waste might be.
You can also use a regular expression to search for the match instead of contains though that may not be significantly faster.
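As a sketch of counting matches without split(), the pattern can be compiled once outside the loop and applied to each line. The names MatchCount and countMatches are made up here, and using Pattern.quote in the usage below assumes the search string is meant literally rather than as a regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchCount {
    /**
     * Hypothetical helper: counts non-overlapping matches of a precompiled
     * pattern in one line, without allocating an array of substrings.
     */
    static int countMatches(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Usage would be along the lines of Pattern p = Pattern.compile(Pattern.quote(search)) before the loop, then countw += MatchCount.countMatches(p, s) inside it. This can give different numbers from str.length - 1, because split() drops trailing empty strings and so misses a match at the very end of a line.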
You could make things parallel by using multiple threads. However, if split() is the cause of your grief, your problem is the overhead and running out of heap space, so you won't necessarily benefit from it.
More generally, if you need to do this a lot, you may want to write a script in a language more "friendly" to string manipulation. A 10-line script in Python can do this much faster.
wow, what are you doing in this method
WriteFile(CThread,s, outfile.toString(), search);
Every time you get a line containing your text, you are creating a new BufferedWriter(new FileWriter(out, true));
Just create a bufferedWriter in your searchString method and use that to insert lines. No need to open that again and again. It will drastically improve the performance.
One problem here might be that you stop reading when you write. I would probably use one thread for reading and another thread for writing the file. As an extra optimization the thread writing the results could buffer them into memory and write them to the file as a batch, say every ten entries or something.
In the writing thread you should queue the incoming entries before handling them.
Of course, you should maybe first debug where that time is spent, is it the IO or something else.
There are too many potential bottlenecks in this code for anyone to be sure what the critical ones are. Therefore you should profile the application to determine what is causing it to be slow.
Armed with that information, decide whether the problem is in reading the ZIP file, doing the searching or writing the matches to the output file.
(Repeatedly opening and closing the output file is a bad idea, but if you only get a tiny number of search hits it won't make much difference to the overall performance.)
I am writing a utility in Java that reads a stream which may contain both text and binary data. I want to avoid waiting on I/O. To do that I create a thread that keeps reading the data (and waiting for it) and puts it into a buffer, so the clients can check availability and stop waiting whenever they want (by closing the input stream, which generates an IOException and ends the wait). This works very well as far as reading bytes, i.e. binary data, is concerned.
Now I also want to make it easy for the client to read lines out of it with something like '.hasNextLine()' and '.readLine()'. Without using a blocking stream such as a buffered stream, (Q1) how can I check whether a binary buffer (byte[]) contains a valid Unicode line (in the form of the length of the first line)? I looked around the String/Charset API but could not find it (or did I miss it?). (NOTE: If possible I don't want to use a non-built-in library.)
Since I could not find one, I try to create one. Without being so complicated, here is my algorithm.
1). I look from the start of the byte array until I find '\n' or '\r' without '\n'.
2). Then, I cut the byte array from the start to that point and using it to create a string (with CharSet if specified) using 'new String(byte[])' or 'new String(byte[], CharSet)'.
3). If that succeeds without an exception, we have found the first valid line and return it.
4). Otherwise, these bytes may not be a string, so I look further to another '\n' or '\r' w/o '\n'. and this process repeat.
5). If the search reaches the end of the available bytes, I stop and return null (no valid line found).
My question is (Q2): Is the above algorithm adequate?
Just when I was about to implement it, I searched on Google and found that there are many other codes for new line, for example U+2424, U+0085, U+000C, U+2028 and U+2029.
So my last question is (Q3): Do I really need to detect these codes? If I do, will it increase the chance of false alarms?
I am well aware that recognizing something in binary data is not absolute. I am just trying to find the best balance.
To sum up, I have an array of bytes and I want to extract the first valid string line from it, with or without a specific Charset. This must be done in Java and avoid using any non-built-in library.
Thank you all in advance.
I am afraid your problem is not well-defined. You write that you want to extract the "first valid string line" from your data. But whether some byte sequence is a "valid string" depends on the encoding. So you must decide which encoding(s) you want to use in testing.
Sensible choices would be:
the platform default encoding (Java property "file.encoding")
UTF-8 (as it is most common)
a list of encodings you know your clients will use (such as several Russian or Chinese encodings)
What makes sense will depend on the data, there's no general answer.
Once you have your encodings, the problem of line termination should follow, as most encodings have rules on what terminates a line. In ASCII or Latin-1, LF, CR-LF and LF-CR would suffice. In Unicode, you need all the ones you listed above.
But again, there's no general answer, as new line codes are not strictly regulated. Again, it would depend on your data.
First of all let me ask you a question, is the data you are trying to process a legacy data? In other words, are you responsible for the input stream format that you are trying to consume here?
If you are indeed controlling the input format, then you probably want to take a decision Binary vs. Text out of the Q1 algorithm. For me this algorithm has one troubling part.
`4). Otherwise, these bytes may not be a string, so I look further to
another '\n' or '\r' w/o '\n'. and this process repeat.`
Are you discarding the input before the line terminator and taking the bytes that start immediately after it, or re-evaluating the string that now spans two line terminators? If the former, you may have broken the binary data interface; if the latter, you may still not parse the text correctly.
I think having well defined markers for binary data and text data in your stream will simplify your algorithm a lot.
A couple of words on the String constructor. new String(byte[], Charset) will not throw any exception if the byte array is not valid in that particular Charset; instead it substitutes the charset's replacement character for the malformed input, so you get a string full of question marks or similar (probably not what you want). If you want an exception, you should use CharsetDecoder.
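A strict decode with CharsetDecoder might look roughly like this. The class and method names (StrictDecode, decodeOrNull) and the choice to return null on failure are assumptions made for the sketch; StandardCharsets in the usage below needs Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class StrictDecode {
    /**
     * Hypothetical helper: decodes bytes[offset, offset + length) with the
     * given charset, returning null instead of a string with replacement
     * characters when the bytes are not valid in that charset.
     */
    static String decodeOrNull(byte[] bytes, int offset, int length, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes, offset, length)).toString();
        } catch (CharacterCodingException e) {
            return null; // not valid text in this charset
        }
    }
}
```

For example, decodeOrNull(buf, 0, n, StandardCharsets.UTF_8) returns null for a truncated or invalid UTF-8 sequence, where new String(buf, 0, n, StandardCharsets.UTF_8) would silently substitute replacement characters.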
Also note that in Java 6 there are 2 constructors that take charset
String(byte[] bytes, String charsetName) and String(byte[] bytes, Charset charset). I did some simple performance test a while ago, and constructor with String charsetName is magnitudes faster than the one that takes Charset object ( Question to Sun: bug, feature? ).
I would try this:
make the IO reader put strings/lines into a thread safe collection (for example some implementation of BlockingQueue)
the main code has only reference to the synced collection and checks for new data when needed, like queue.peek(). It doesn't need to know about the io thread nor the stream.
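Some Java-like code (imports omitted; myStreamFromSomewhere is a placeholder for your stream):
class IORunner extends Thread {
    private final BufferedReader reader;
    private final BlockingQueue<String> outputQueue;

    IORunner(InputStream in, BlockingQueue<String> outputQueue) throws UnsupportedEncodingException {
        this.reader = new BufferedReader(new InputStreamReader(in, "utf-8"));
        this.outputQueue = outputQueue;
    }

    public void run() {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                outputQueue.put(line); // blocks while a bounded queue is full
            }
        } catch (IOException e) {
            // stream was closed or failed: stop reading
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
class Main {
    public static void main(String args[]) throws Exception {
        ...
        BlockingQueue<String> dataQueue = new LinkedBlockingQueue<String>();
        new IORunner(myStreamFromSomewhere, dataQueue).start();
        while (true) {
            if (!dataQueue.isEmpty()) { // can also use .peek() != null
                System.out.println(dataQueue.take());
            }
            Thread.sleep(1000);
        }
    }
}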
The collection decouples the input(stream) more from the main code. You can also limit the number of lines stored/mem used by creating the queue with a limited capacity (see blockingqueue doc).
The BufferedReader handles the checking of new lines for you :) The InputStreamReader handles the charset (recommend setting one yourself since the default one changes depending on OS etc.).
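The java.text package is designed for this sort of natural language operation. The BreakIterator.getLineInstance() static method returns an iterator that detects line breaks. Note that it works on text that has already been decoded and reports positions where a line may be wrapped, not only hard line terminators. You do need to know the locale and encoding for best results, though.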
Q2: The method you use seems reasonable enough to work.
Q1: Can't think of something better than the algorithm that you are using
Q3: I believe it will be enough to test for \r and \n. The others are too exotic for usual text files.
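Testing only for \r and \n could be sketched as below. It assumes an ASCII-compatible encoding such as UTF-8 or Latin-1, where these two bytes never occur inside a multi-byte character; it would not be correct for UTF-16. The names LineScan and endOfFirstLine are invented for the example.

```java
final class LineScan {
    /**
     * Hypothetical helper: returns the index just past the first line
     * terminator (LF, CR-LF or a lone CR) within buf[0, available), or -1
     * if no complete terminator has arrived yet.
     */
    static int endOfFirstLine(byte[] buf, int available) {
        for (int i = 0; i < available; i++) {
            if (buf[i] == '\n') {
                return i + 1;
            }
            if (buf[i] == '\r') {
                if (i + 1 == available) {
                    return -1; // cannot yet tell a lone CR from CR-LF
                }
                return buf[i + 1] == '\n' ? i + 2 : i + 1;
            }
        }
        return -1;
    }
}
```

Returning -1 when the buffer ends in \r is a deliberate choice for a streaming reader: the next byte may still turn out to be \n, so the caller should wait for more data before cutting the line.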
I just solved this to get a test stub working for Datagram. I did byte[] varName = String.getBytes(); then final int len = varName.length; then sent the int through a DataOutputStream, followed by the byte array. On the receiving end, just do readInt() and then read that many bytes.
Not a lib, not hard to do either. Just read up on readUTF and do what they did for the bytes.
The string should construct from the byte array recovered that way, if not you have other problems. If the string can be reconstructed, it can be buffered ... no?
You may be able to just use readUTF() / writeUTF() in DataInputStream / DataOutputStream - why not?
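For reference, a round trip through writeUTF() and readUTF() might look like this sketch. The class and method names (UtfRoundTrip, encode, decode) are made up, and in-memory streams stand in for the real connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class UtfRoundTrip {
    /** Hypothetical helper: frames a string as a 2-byte length plus modified UTF-8 bytes. */
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        }
        return bytes.toByteArray();
    }

    /** Hypothetical helper: reads back one string written by writeUTF(). */
    static String decode(byte[] framed) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            return in.readUTF();
        }
    }
}
```

Two things to keep in mind: writeUTF() uses Java's modified UTF-8, and it throws UTFDataFormatException when the encoded string is longer than 65535 bytes, so longer payloads need the explicit length-plus-bytes framing instead.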
{ edit: per OP's request }
// Sending end ("sink" is a placeholder for whatever OutputStream you write to)
String data = "fdsfjal;sajssaafe8e88e88aa"; // fingers pounding keyboard
byte[] payload = data.getBytes("UTF-8"); // explicit charset so both ends agree
DataOutputStream dataOutputStream = new DataOutputStream(sink);
dataOutputStream.writeInt(payload.length); // length in bytes, not in chars
dataOutputStream.write(payload);
dataOutputStream.flush();
dataOutputStream.close();
// rcv end ("source" is the matching InputStream)
DataInputStream dataInputStream = new DataInputStream(source);
final int sizeToRead = dataInputStream.readInt();
byte[] datasink = new byte[sizeToRead];
dataInputStream.readFully(datasink); // a plain read() may return fewer bytes
dataInputStream.close();
// constructor
// String(byte[] bytes, int offset, int length, String charsetName)
final String result = new String(datasink, 0, sizeToRead, "UTF-8");
// continue coding here
Do me a favor, keep the heat off of me. This was typed very fast right in the posting tool - the code probably contains substantial errors - it's faster for me just to explain it by writing Java ~ there will be others who can translate it to other languages, which you can too if you want it in another codebase. You will need exception trapping and so on; just do a compile and start fixing errors. When you get a clean compile, start over from the beginning and look for blunders. ( that's what a blunder is called in engineering - a blunder )