I've been told at school that it's a bad practice to modify the index variable of a for loop:
Example:
for (int i = 0; i < limit; i++) {
    if (something) {
        i += 2; // bad
    }
    if (something) {
        limit += 2; // bad
    }
}
The argument was that some compiler optimizations can optimize the loop so that the index and bound are not recalculated on each iteration.
I've made some tests in Java and it seems that by default the index and bound are recalculated each time.
I'm wondering if it's possible to activate this kind of feature in the HotSpot JVM.
For example, to optimize this kind of loop:
for(int i = 0 ; i < foo.getLength() ; i++){ }
without having to write :
int length = foo.getLength();
for(int i = 0 ; i < length ; i++){ }
It's just an example; I'm curious to try it and see the improvements.
EDIT
According to Peter Lawrey's answer, why doesn't the JVM inline the getLength() method in this simple example?
public static void main(String[] args) {
    Too t = new Too();
    for (int j = 0; j < t.getLength(); j++) {
    }
}

class Too {
    int l = 10;

    public Too() {
    }

    public int getLength() {
        System.out.println("test");
        return l;
    }
}
In the output, "test" is printed 10 times.
I think it could be nice to optimize this kind of execution.
EDIT 2:
Seems I misunderstood...
I removed the println and indeed the profiler tells me that the getLength() method is not even called once in this case.
I've made some tests in Java and it seems that by default the index and bound are recalculated each time.
According to the Java Language Specification, this:
for(int i = 0 ; i < foo.getLength() ; i++){ }
means that getLength() is called on each loop iteration. Java compilers are only allowed to move the getLength() call out of the loop if they can effectively prove that it does not alter the observable behavior.
(For instance, if getLength() just returns the value of some variable, then there is a chance that the JIT compiler can inline the call. If after inlining it can deduce that the variable won't change (under certain assumptions) it can apply a hoisting optimization. On the other hand, if getLength() involves getting the length of a concurrent or synchronized collection, the chances are slim to none that the hoisting optimization will be permitted ... because of potential actions of other threads.)
So that's what a compiler is allowed to do.
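To make that concrete, here is a small sketch (the `Foo` class is hypothetical, not from the question): when the bound comes from a pure getter over a field that never changes, hoisting the call by hand is always safe and behaviorally identical to calling it every iteration, which is exactly the situation where a JIT is also allowed to hoist it.

```java
// Hypothetical Foo whose getLength() is a pure getter on a final field.
class Foo {
    private final int length;
    Foo(int length) { this.length = length; }
    int getLength() { return length; } // no side effects: a JIT may inline and hoist this
}

public class HoistDemo {
    // Bound re-evaluated (per the JLS) on every iteration.
    static int loopCallingGetter(Foo foo) {
        int count = 0;
        for (int i = 0; i < foo.getLength(); i++) count++;
        return count;
    }

    // Bound hoisted by hand; identical behavior here because getLength() is pure.
    static int loopWithHoistedBound(Foo foo) {
        int count = 0;
        for (int i = 0, length = foo.getLength(); i < length; i++) count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(loopCallingGetter(new Foo(10)));    // 10
        System.out.println(loopWithHoistedBound(new Foo(10))); // 10
    }
}
```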
I'm wondering if it's possible to activate this kind of feature in the HotSpot JVM?
The simple answer is No.
You seem to be suggesting a compiler switch that tells / allows the compiler to ignore the JLS rules. There is no such switch. Such a switch would be a BAD IDEA. It would be liable to cause correct/valid/working programs to break. Consider this:
class Test {
    int count;

    int test(String[] arg) {
        for (int i = 0; i < getLength(arg); i++) {
            // ...
        }
        return count;
    }

    int getLength(String[] arg) {
        count++;
        return arg.length;
    }
}
If the compiler was permitted to move the getLength(arg) call out of the loop, it would change the number of times that the method was called, and therefore change the value returned by the test method.
Java optimizations that change the behaviour of a properly written Java program are not valid optimizations. (Note that multi-threading tends to muddy the waters. The JLS, and specifically the memory model rules, permit a compiler to perform optimizations that could result in different threads seeing inconsistent versions of the application's state ... if they don't synchronize properly, resulting in behaviour that is incorrect from the developer's perspective. But the real problem is with the application, not the compiler.)
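A classic illustration of the threading caveat (a sketch, not taken from the question): without `volatile`, the JIT is permitted to hoist the read of a stop flag out of a spin loop, so the loop may never observe another thread's write; declaring the flag `volatile` forbids that hoisting and guarantees the loop terminates.

```java
public class StopFlag {
    // volatile forbids the JIT from hoisting the read of 'stop' out of the loop
    private volatile boolean stop = false;

    // Returns true if the worker observed the flag and terminated.
    public static boolean demo() throws InterruptedException {
        StopFlag f = new StopFlag();
        Thread worker = new Thread(() -> {
            while (!f.stop) { } // volatile forces a fresh read each iteration
        });
        worker.start();
        Thread.sleep(50);
        f.stop = true;          // visible to the worker thread per the JMM
        worker.join(5000);
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo() ? "terminated" : "HUNG");
    }
}
```

With the `volatile` removed, the JIT is allowed (though not required) to compile the worker's loop into an infinite spin.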
By the way, a more convincing reason that you shouldn't change the loop variable in the loop body is that it makes your code harder to understand.
It depends on what foo.getLength() does. If it can be inlined, it can be effectively the same thing. If it cannot be inlined, the JVM cannot determine whether the result is the same.
BTW you can write it as a one-liner:
for(int i = 0, length = foo.getLength(); i < length; i++){ }
EDIT: It is worth noting that:
- methods and loops are usually not optimised until they have been called 10,000 times.
- profilers sub-sample invocations to reduce overhead. They might count every 10th, 100th or later invocation, so a trivial example may not show up.
The main reason not to do that is that it makes it much harder to understand and maintain the code.
Whatever the JVM optimizes, it won't compromise the correctness of the program. If it can't do an optimization because the index is modified inside the loop, then it won't do it. I fail to see how a Java test could show whether or not such an optimization exists.
Anyway, Hotspot will optimize a whole lot of things for you. And your second example is a kind of explicit optimization that Hotspot will happily do for you.
Before we go into more reasoning about why the field access supposedly isn't inlined, maybe we should show that yes, if you know what you're looking for (which is really non-trivial in Java), the field access is inlined just fine.
First we need a basic understanding of how the JIT works - and I really can't cover that in one answer. Suffice it to say that the JIT only kicks in after a function has been called often enough (usually >10k times).
So we use the following code for actual testing stuff:
public class Test {
    private int length;

    public Test() {
        length = 10000;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 14000; i++) {
            foo();
        }
    }

    public static void foo() {
        Test bar = new Test();
        int sum = 0;
        for (int i = 0; i < bar.getLength(); i++) {
            sum += i;
        }
        System.out.println(sum);
    }

    public int getLength() {
        System.out.print("_");
        return length;
    }
}
Now we compile this code and run it with java.exe -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*Test.foo Test >Test.txt which results in an unholy long output, but the interesting part is:
0x023de0e7: mov %esi,0x24(%esp)
0x023de0eb: mov %edi,0x28(%esp)
0x023de0ef: mov $0x38fba220,%edx ; {oop(a 'java/lang/Class' = 'java/lang/System')}
0x023de0f4: mov 0x6c(%edx),%ecx ;*getstatic out
; - Test::getLength#0 (line 24)
; - Test::foo#14 (line 17)
0x023de0f7: cmp (%ecx),%eax ;*invokevirtual print
; - Test::getLength#5 (line 24)
; - Test::foo#14 (line 17)
; implicit exception: dispatches to 0x023de29b
0x023de0f9: mov $0x3900e9d0,%edx ;*invokespecial write
; - java.io.PrintStream::print#9
; - Test::getLength#5 (line 24)
; - Test::foo#14 (line 17)
; {oop("_")}
0x023de0fe: nop
0x023de0ff: call 0x0238d1c0 ; OopMap{[32]=Oop off=132}
;*invokespecial write
; - java.io.PrintStream::print#9
; - Test::getLength#5 (line 24)
; - Test::foo#14 (line 17)
; {optimized virtual_call}
0x023de104: mov 0x20(%esp),%eax
0x023de108: mov 0x8(%eax),%ecx ;*getfield length
; - Test::getLength#9 (line 25)
; - Test::foo#14 (line 17)
0x023de10b: mov 0x24(%esp),%esi
0x023de10f: cmp %ecx,%esi
0x023de111: jl 0x023de0d8 ;*if_icmpge
; - Test::foo#17 (line 17)
which is the inner loop we're actually executing. Note that the following 0x023de108: mov 0x8(%eax),%ecx loads the length value into a register - the stuff above it is for the System.out call (I'd have removed it since it makes things more complicated, but since more than one person thought this would hinder inlining I left it in there). Even if you aren't that fluent in x86 assembly you can clearly see: no call instruction anywhere except for the native write call.
Related
There is a method that searches for a substring in a text (it uses the brute-force algorithm; please ignore null pointers):
public static int forceSearch(String text, String pattern) {
    int patternLength = pattern.length();
    int textLength = text.length();
    for (int i = 0, n = textLength - patternLength; i <= n; i++) {
        int j = 0;
        for (; j < patternLength && text.charAt(i + j) == pattern.charAt(j); j++) {
            ;
        }
        if (j == patternLength) {
            return i;
        }
    }
    return -1;
}
Strangely, using the same algorithm, the following code is faster!
public static int forceSearch(String text, String pattern) {
    int patternLength = pattern.length();
    int textLength = text.length();
    char first = pattern.charAt(0);
    for (int i = 0, n = textLength - patternLength; i <= n; i++) {
        if (text.charAt(i) != first) {
            while (++i <= n && text.charAt(i) != first)
                ;
        }
        int j = 0;
        for (; j < patternLength && text.charAt(i + j) == pattern.charAt(j); j++) {
            ;
        }
        if (j == patternLength) {
            return i;
        }
    }
    return -1;
}
I found the second version is obviously faster than the first when I run it on the JVM. However, when I write it in C and run it, the two functions take almost the same time. So I think the reason is that the JVM optimizes this loop code:
if (text.charAt(i) != first) {
    while (++i <= n && text.charAt(i) != first)
        ;
}
Am I right? If so, how should we use the JVM's optimization strategy to optimize our code?
Hope somebody can help, thank you :)
If you really want to get to the bottom of this, you'll probably need to instruct the JVM to print the assembly. In my experience, minor tweaks to loops can cause surprising performance differences. But it's not necessarily due to optimizations of the loop itself.
There are plenty of factors that can affect how your code gets JIT compiled.
For example, tweaking the size of a method can affect your inlining tree, which could mean better or worse performance depending on what your call stack looks like. If a method gets inlined further up the call stack, it could prevent nested call sites from being inlined into the same frame. If those nested call sites are especially 'hot', the added call overhead could be substantial. I'm not saying that's the cause here; I'm merely pointing out that there are many thresholds that govern how the JIT arranges your code, and the reasons for performance differences are not always obvious.
One nice thing about using JMH for benchmarks is that you can reduce the influence of such changes by explicitly setting inlining boundaries. But you can use -XX:CompileCommand to achieve the same effects manually.
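For instance, to keep HotSpot from inlining one particular method while you experiment (the class and method names below are placeholders, not from the question):

```shell
# Forbid inlining of a single method; a leading * wildcards the package/class part
java -XX:CompileCommand=dontinline,*Benchmark.forceSearch Benchmark
```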
There are, of course, other factors like cache friendliness that require more intuitive analysis. Given that your benchmark probably doesn't have a particularly deep call tree, I'm inclined to lean towards cache behavior as a more likely explanation. I would guess that your second version performs better because your comparand (the first chunk of pattern) is usually in your L1 cache, while your first version causes more cache churn. If your inputs are long (and it sounds like they are), then this is a likely explanation. If not, the reasons could be more subtle, e.g., your first version could be 'tricking' the CPU into employing more aggressive cache prefetching, but in a way that actually hurts performance (at least for the inputs you are benchmarking). Regardless, if cache behavior is to explain, then I wonder why you do not see a similar difference in the C versions. What optimization flags are you compiling the C version with?
Dead code elimination might also be a factor. I would have to see what your inputs are, but it's possible that your hand-optimized version causes certain instruction blocks to never be hit during the instrumented interpretation phase, leading the JIT to exclude them from the final assembly.
I reiterate: if you want to get to the bottom of this, you'll want to force the JIT to dump the assembly for each version (and compare to the C versions as well).
This if statement saves a lot of work (especially when the pattern is found near the end of the input string):
if (text.charAt(i) != first) {
    while (++i <= n && text.charAt(i) != first)
        ;
}
In the first version, you have to check j < patternLength for every i before comparing the first character.
In the second version you don't need to.
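To confirm the two variants are behaviorally interchangeable (apart from speed), here is a quick equivalence sketch of both methods from the question. Note the extra bounds guard added to the skipping version: without it, the skip loop can leave `i` just past `n`, and the inner comparison can then read past the end for single-character patterns.

```java
public class SearchCheck {
    // First version from the question: plain brute force.
    static int forceSearchPlain(String text, String pattern) {
        int patternLength = pattern.length();
        int textLength = text.length();
        for (int i = 0, n = textLength - patternLength; i <= n; i++) {
            int j = 0;
            while (j < patternLength && text.charAt(i + j) == pattern.charAt(j)) j++;
            if (j == patternLength) return i;
        }
        return -1;
    }

    // Second version: skip ahead to the next occurrence of the first pattern char.
    static int forceSearchSkip(String text, String pattern) {
        int patternLength = pattern.length();
        int textLength = text.length();
        char first = pattern.charAt(0);
        for (int i = 0, n = textLength - patternLength; i <= n; i++) {
            if (text.charAt(i) != first) {
                while (++i <= n && text.charAt(i) != first)
                    ;
            }
            if (i > n) break; // guard: the skip loop can leave i just past n
            int j = 0;
            while (j < patternLength && text.charAt(i + j) == pattern.charAt(j)) j++;
            if (j == patternLength) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "lorem ipsum dolor sit amet";
        System.out.println(forceSearchPlain(text, "dolor")); // 12
        System.out.println(forceSearchSkip(text, "dolor"));  // 12
        System.out.println(forceSearchSkip(text, "xyz"));    // -1
    }
}
```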
But strangely, I think for small inputs it does not make much difference.
Could you share the length of the items you used to benchmark?
If you search for JVM compiler optimizations on the internet, "loop unwinding" or "loop unrolling" should jump out. Again, benchmarking is tricky. You will find plenty of SO answers about that.
I am currently reviewing a PullRequest that contains this:
- for (int i = 0; i < outgoingMassages.size(); i++) {
+ for (int i = 0, size = outgoingMassages.size(); i < size; i++)
https://github.com/criticalmaps/criticalmaps-android/pull/52
Somehow it feels wrong to me - I would think that the VM does these optimisations - but I cannot really say for sure. I would love to get some input on whether this change can make sense - or a confirmation that this is done on the VM side.
No, it is not certain that the VM will change your code from
- for (int i = 0; i < outgoingMassages.size(); i++) {
to
+ for (int i = 0, size = outgoingMassages.size(); i < size; i++)
In your for loop it is possible that outgoingMassages changes its size, so this optimization can't be applied by the JVM. Also, another thread can change the size of outgoingMassages if it is a shared resource.
The JVM can change the code only if the behaviour doesn't change. For example, it can substitute a series of string concatenations with a sequence of appends to a StringBuilder, it can inline a simple method call, or it can move a constant-valued calculation out of a loop.
The VM will not do this optimization, since it is possible that the size() method does not return the same result on each call; so the method must be called on each iteration.
However, if size() is a simple getter method, the performance impact is very small - probably not measurable. (In a few cases it may enable Java to use parallelization, which could make a difference, but this depends on the content of the loop.)
The bigger difference may be in ensuring that the for loop has a number of iterations that is known in advance. That doesn't seem to make sense to me in this example. But maybe the called method could return changing results which are unwanted?
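The difference becomes observable as soon as size() stops returning the same value. A small sketch (helper names are mine, not from the PR) where the list shrinks during the loop: the re-reading form sees the new size, while the cached form keeps the original bound.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SizeHoist {
    // Re-reads size() every iteration, so it sees the list shrink.
    static int countRereading(List<Integer> list) {
        int iterations = 0;
        for (int i = 0; i < list.size(); i++) {
            iterations++;
            if (i == 0) list.remove(list.size() - 1); // shrink the list once
        }
        return iterations;
    }

    // Caches size up front, so it iterates over the original bound.
    static int countCached(List<Integer> list) {
        int iterations = 0;
        for (int i = 0, size = list.size(); i < size; i++) {
            iterations++;
            if (i == 0) list.remove(list.size() - 1); // same mutation, ignored by the bound
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println(countRereading(new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5)))); // 4
        System.out.println(countCached(new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5))));    // 5
    }
}
```

Since the two loops genuinely behave differently, the JVM cannot rewrite one into the other on its own; only you know which behavior you want.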
If the size() method on your collection is just giving the value of a private field, then the VM will optimise most of this away (but not quite all). It'll do it by inlining the size() method, so that it just becomes access to that field.
The remaining bit that won't get optimised is that size in the new code will get treated as final, and therefore constant, whereas the field picked up from the collection won't be treated as final (perhaps it's modified from another thread). So in the original case the field will get read in every iteration, but in the new case it won't.
It is likely that any decent optimiser - either at the VM or in the compiler - will recognise:
class Messages {
    int size;
    public int size() {
        return size;
    }
}

public void test() {
    Messages outgoingMassages = new Messages();
    for (int i = 0; i < outgoingMassages.size(); i++) {
    }
}
and optimise it to
for (int i = 0; i < outgoingMassages.size; i++) {
Doing the extra - untested - optimisation should therefore be considered evil.
The method invocation will happen on each iteration of the loop and is not free of cost. Since you cannot predict how often this happens, calling it once will always cost less. It is a minor optimization, but you should not rely on the compiler to do it for you.
Further, the member outgoingMassages ..
private ArrayList<OutgoingChatMessage> outgoingMassages ..
... should be an interface:
private List<OutgoingChatMessage> outgoingMassages ..
Then, calling .size() becomes a virtual method call. To find the specific implementation, the method tables of the classes in the hierarchy have to be consulted. This again is not free of cost.
This question is specifically geared towards the Java language, but I would not mind feedback on this as a general concept. I would like to know which operation might be faster, or if there is no difference, between assigning a value to a variable and testing for a value. For this issue we could have a large series of Boolean values that will receive many requests for changes. I would like to know if testing for the need to change a value would be considered a waste when weighed against the speed of simply changing the value during every request.
public static void main(String[] args) {
    Boolean array[] = new Boolean[veryLargeValue];
    for (int i = 0; i < array.length; i++) {
        array[i] = randomTrueFalseAssignment;
    }
    for (int i = 400; i < array.length - 400; i++) {
        testAndChange(array, i);
    }
    for (int i = 400; i < array.length - 400; i++) {
        justChange(array, i);
    }
}
This could be the testAndChange method:
public static void testAndChange(Boolean[] pArray, int ind) {
    if (pArray[ind])
        pArray[ind] = false;
}
This could be the justChange method:
public static void justChange(Boolean[] pArray, int ind) {
    pArray[ind] = false;
}
If we were to end up with the very rare case that every value within the range supplied to the methods were false, would there be a point where one method would eventually become slower than the other? Is there a best practice for issues similar to this?
Edit: I wanted to add this to help clarify the question a bit more. I realize that the data type can be factored into the answer, as larger or more efficient data types can be utilized. I am more focused on the task itself: is the task of a test, "if(aConditionalTest)", slower, faster, or indeterminable (without additional information such as the data type) than the task of an assignment, "x = aValue"?
As @TrippKinetics points out, there is a semantic difference between the two methods. Because you use Boolean instead of boolean, it is possible that one of the values is a null reference. In that case the first method (with the if statement) will throw an exception, while the second simply assigns values to all the elements in the array.
Assuming you use boolean[] instead of Boolean[]: optimization is an undecidable problem. There are very rare cases where adding an if statement results in better performance. For instance, most processors use caches, and the if statement can result in the executed code fitting on exactly two cache pages where, without the if, it would span more, resulting in cache misses. You may think you save an assignment instruction, but at the cost of a fetch instruction and a conditional instruction (which breaks the CPU pipeline). Assigning has more or less the same cost as fetching a value.
In general, however, one can assume that adding an if statement is useless and will nearly always result in slower code. So you can quite safely state that the if statement will almost always slow down your code.
More specifically on your question, there are faster ways to set a range to false. For instance using bitvectors like:
long[] data = new long[(veryLargeValue + 0x3f) >> 0x06]; // a long has 64 bits
// ... assign random values ...
int low = 400 >> 0x06;                                   // word holding bit 400
int high = (veryLargeValue - 400) >> 0x06;               // word holding the upper bound
data[low] &= (1L << (400 & 0x3f)) - 1;                   // clear bits 400..63 within the low word
for (int i = low + 0x01; i < high; i++) {
    data[i] = 0x00L;                                     // clear 64 bits at a time
}
data[high] &= -1L << ((veryLargeValue - 400) & 0x3f);    // clear the lower bits of the high word
The advantage is that a processor can perform operations on 32- or 64-bits at once. Since a boolean is one bit, by storing bits into a long or int, operations are done in parallel.
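If you'd rather not hand-roll the masks, the same word-parallel idea is what java.util.BitSet.clear(from, to) does internally. A sketch (margin of 400 and size 10,000 chosen arbitrarily to mirror the question):

```java
import java.util.BitSet;

public class RangeClear {
    // Set everything to true, then clear [400, veryLargeValue - 400) word-at-a-time.
    static BitSet demo(int veryLargeValue) {
        BitSet data = new BitSet(veryLargeValue);
        data.set(0, veryLargeValue);           // all bits true
        data.clear(400, veryLargeValue - 400); // range clear, 64 bits per step internally
        return data;
    }

    public static void main(String[] args) {
        BitSet data = demo(10_000);
        System.out.println(data.cardinality()); // 800: only the two 400-bit margins stay set
    }
}
```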
Imagine you want to count how many non-ASCII chars a given char[] contains. Imagine, the performance really matters, so we can skip our favorite slogan.
The simplest way is obviously
int simpleCount() {
    int result = 0;
    for (int i = 0; i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}
Then you think that many inputs are pure ASCII and that it could be a good idea to deal with them separately. For simplicity assume you write just this
private int skip(int i) {
    for (; i < string.length; i++) {
        if (string[i] >= 128) break;
    }
    return i;
}
Such a trivial method could be useful for more complicated processing, and here it can do no harm, right? So let's continue with
int smartCount() {
    int result = 0;
    for (int i = skip(0); i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}
It's the same as simpleCount. I'm calling it "smart" because the actual work to be done is more complicated, so skipping over ASCII quickly makes sense. If there's no ASCII prefix, or a very short one, it can cost a few cycles more, but that's all, right?
Maybe you want to rewrite it like this, it's the same, just possibly more reusable, right?
int smarterCount() {
    return finish(skip(0));
}

int finish(int i) {
    int result = 0;
    for (; i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}
And then you run a benchmark on some very long random string and get this:
[benchmark chart not reproduced here]
The parameters determine the ASCII to non-ASCII ratio and the average length of a non-ASCII sequence, but as you can see they don't matter. Trying different seeds and whatnot doesn't matter either. The benchmark uses caliper, so the usual gotchas don't apply. The results are fairly repeatable; the tiny black bars at the end denote the minimum and maximum times.
Does anybody have an idea what's going on here? Can anybody reproduce it?
Got it.
The difference is in the possibility for the optimizer/CPU to predict the number of loops in for. If it is able to predict the number of repeats up front, it can skip the actual check of i < string.length. Therefore the optimizer needs to know up front how often the condition in the for-loop will succeed and therefore it must know the value of string.length and i.
I made a simple test, replacing string.length with a local variable that is set once in the setup method. Result: smarterCount has about the same runtime as simpleCount. Before the change, smarterCount took about 50% longer than simpleCount. smartCount did not change.
It looks like the optimizer loses the information of how many loops it will have to do when a call to another method occurs. That's the reason why finish() immediately ran faster with the constant set, but not smartCount(), as smartCount() has no clue about what i will be after the skip() step. So I did a second test, where I copied the loop from skip() into smartCount().
And voilà, all three methods returned within the same time (800-900 ms).
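For reference, the hoisting described above amounts to this sketch (in the benchmark, `string` is a field, so without the local the bound is a field load on every iteration):

```java
public class HoistedBound {
    // Bound read once into a local, so the trip count is fixed before the loop starts.
    static int countNonAscii(char[] string) {
        int result = 0;
        int length = string.length;
        for (int i = 0; i < length; i++) {
            result += string[i] >= 128 ? 1 : 0;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countNonAscii("caf\u00e9".toCharArray())); // 1
    }
}
```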
My tentative guess would be that this is about branch prediction.
This loop:
for (int i = 0; i < string.length; i++) {
    result += string[i] >= 128 ? 1 : 0;
}
contains exactly one branch, the backward edge of the loop, and it is highly predictable. A modern processor will be able to accurately predict this, and so fill its whole pipeline with instructions. The sequence of loads is also highly predictable, so it will be able to pre-fetch everything the pipelined instructions need. High performance results.
This loop:
for (; i < string.length - 1; i++) {
    if (string[i] >= 128) break;
}
Has a dirty great data-dependent conditional branch sitting in the middle of it. That is much harder for the processor to predict accurately.
Now, that doesn't entirely make sense, because (a) the processor will surely quickly learn that the break branch will usually not be taken, (b) the loads are still predictable, and so just as pre-fetchable, and (c) after that loop exits, the code goes into a loop which is identical to the loop which goes fast. So I wouldn't expect this to make all that much difference.
The Java specification guarantees that primitive variable assignments are always atomic (except for the long and double types).
On the contrary, the fetch-and-add operation corresponding to the famous i++ increment would be non-atomic, because it leads to a read-modify-write operation.
Assuming this code:
public void assign(int b) {
    int a = b;
}
The generated bytecode is:
public void assign(int);
  Code:
     0: iload_1
     1: istore_2
     2: return
Thus, we see the assignment is composed of two steps (loading and storing).
Assuming this code:
public void assign(int b) {
    int i = b++;
}
Bytecode:
public void assign(int);
  Code:
     0: iload_1
     1: iinc 1, 1   // extra step here compared to the previous sample
     4: istore_2
     5: return
Knowing that an X86 processor (at least a modern one) can perform the increment operation atomically, as said:
In computer science, the fetch-and-add CPU instruction is a special
instruction that atomically modifies the contents of a memory
location. It is used to implement mutual exclusion and concurrent
algorithms in multiprocessor systems, a generalization of semaphores.
Thus, the first question: despite the fact that the bytecode requires two steps (loading and storing), does Java rely on the assignment operation always being carried out atomically whatever the processor's architecture, and so guarantee permanent atomicity (for primitive assignments) in its specification?
Second question: is it wrong to assume that with a very modern X86 processor, and without sharing compiled code across different architectures, there's no need at all to synchronize the i++ operation (or use AtomicInteger), considering it already atomic?
Even if i++ translated into an X86 fetch-and-add instruction, that would change nothing, because the memory referred to by the fetch-and-add instruction is the local registers of the CPU and not the general memory of the application. On a modern CPU this property extends to the local memory caches of the CPU, and can even extend to the various caches used by the different cores of a multi-core CPU; but in the case of a multithreaded application, there is absolutely no guarantee that it extends to the copies of memory used by the threads themselves.
In short, in a multithreaded application, if a variable can be modified by different threads running at the same time, then you must use some synchronization mechanism provided by the system, and you cannot rely on the fact that i++ occupies a single line of Java code to make it atomic.
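One such mechanism is java.util.concurrent.atomic: AtomicInteger.incrementAndGet() is implemented with the processor's atomic read-modify-write support (a lock-prefixed add on x86), so no update is lost. A sketch with 10 threads doing 100 increments each:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicDemo {
    public static int demo() throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        Thread[] threads = new Thread[10];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int j = 0; j < 100; j++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        return counter.get(); // always 1000, unlike a plain i++ on a shared int
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```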
Considering the Second question.
You imply that i++ will translate into the X86 fetch-and-add instruction, which is not true. If the code is compiled and optimized by the JVM it may be true (one would have to check the JVM source code to confirm that), but that code can also run in interpreted mode, where the fetch and the add are separated and not synchronized.
Out of curiosity I checked what assembly code is generated for this Java code:
public class Main {
    volatile int a;

    static public final void main(String[] args) throws Exception {
        new Main().run();
    }

    private void run() {
        for (int i = 0; i < 1000000; i++) {
            increase();
        }
    }

    private void increase() {
        a++;
    }
}
I used Java HotSpot(TM) Server VM (17.0-b12-fastdebug) for windows-x86 JRE (1.6.0_20-ea-fastdebug-b02), built on Apr 1 2010 03:25:33 version of JVM (this one I had somewhere on my drive).
This is the crucial output of running it (java -server -XX:+PrintAssembly -cp . Main):
At first it is compiled into this:
00c PUSHL EBP
SUB ESP,8 # Create frame
013 MOV EBX,[ECX + #8] # int ! Field VolatileMain.a
016 MEMBAR-acquire ! (empty encoding)
016 MEMBAR-release ! (empty encoding)
016 INC EBX
017 MOV [ECX + #8],EBX ! Field VolatileMain.a
01a MEMBAR-volatile (unnecessary so empty encoding)
01a LOCK ADDL [ESP + #0], 0 ! membar_volatile
01f ADD ESP,8 # Destroy frame
POPL EBP
TEST PollPage,EAX ! Poll Safepoint
029 RET
Then it is inlined and compiled into this:
0a8 B11: # B11 B12 <- B10 B11 Loop: B11-B11 inner stride: not constant post of N161 Freq: 0.999997
0a8 MOV EBX,[ESI] # int ! Field VolatileMain.a
0aa MEMBAR-acquire ! (empty encoding)
0aa MEMBAR-release ! (empty encoding)
0aa INC EDI
0ab INC EBX
0ac MOV [ESI],EBX ! Field VolatileMain.a
0ae MEMBAR-volatile (unnecessary so empty encoding)
0ae LOCK ADDL [ESP + #0], 0 ! membar_volatile
0b3 CMP EDI,#1000000
0b9 Jl,s B11 # Loop end P=0.500000 C=126282.000000
As you can see it does not use Fetch-And-Add instructions for a++.
Regarding your first question: the read and the write are each atomic, but the read-modify-write operation as a whole is not. I could not find a specific reference on primitives, but JLS §17.7 says something similar regarding references:
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.
So in your case, both the iload and istore are atomic, but the whole (iload, istore) operation is not.
Is it wrong to [consider that] there's no need at all to synchronize the i++ operation?
Regarding your second question: the code below prints 982 on my x86 machine (and not 1,000), which shows that some ++ operations got lost in translation. In other words, you need to properly synchronize a ++ operation even on a processor architecture that supports a fetch-and-add instruction.
public class Test1 {
    private static int i = 0;

    public static void main(String args[]) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(10);
        final CountDownLatch start = new CountDownLatch(1);
        final Set<Integer> set = new ConcurrentSkipListSet<>();
        Runnable r = new Runnable() {
            @Override
            public void run() {
                try {
                    start.await();
                } catch (InterruptedException ignore) {}
                for (int j = 0; j < 100; j++) {
                    set.add(i++);
                }
            }
        };
        for (int j = 0; j < 10; j++) {
            executor.submit(r);
        }
        start.countDown();
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(set.size());
    }
}