I read in "Lucene in Action" that NumericRangeQuery is better than TermRangeQuery for handling date range queries, but I could not find the reason. I want to know the reason behind it.
I used both TermRangeQuery and NumericRangeQuery for handling date range queries, and I found that searching is faster via NumericRangeQuery.
My second point: to query with NumericRangeQuery I have to index using NumericField, which lets me index down to the millisecond. But what if I want to reduce my resolution to the hour or the day?
Why is numeric so much faster than term?
As you have noted, there is a "precision step". This means that numbers are indexed only to certain precisions, which means that there is a (very) limited number of distinct terms. According to the documentation, it is rare to have more than 300 terms in an index. Check out the Wikipedia article on tries if you are interested in the theory.
How can you reduce precision?
The NumericField class takes a precisionStep parameter in its constructor. Note that NumericRangeQuery also takes a precisionStep, and the two must match. The JavaDoc page links to a paper written about the implementation that explains in more detail what the precision step means.
The explanation by @Xodarap about NumericField is correct. Essentially, precision is dropped from the numbers to reduce the actual term space. Also, I suppose, TermRangeQuery uses string comparison whereas NumericRangeQuery works with integers. That should squeeze out some more performance.
You can index at any desired resolution, from milliseconds to days. Date.getTime() gives you milliseconds since the epoch. Divide that number by 1,000 to get a resolution of one second, by 60,000 to get a resolution of one minute, and so on.
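As a sketch (plain Java, no Lucene classes needed; the values are illustrative), truncating a millisecond timestamp to coarser resolutions looks like this:

```java
// Reduce a millisecond timestamp to a coarser resolution via integer
// division; the finer units are simply truncated away.
long ms = 90_061_000L;              // e.g. Date.getTime(): 1d 1h 1m 1s past epoch
long seconds = ms / 1_000L;         // resolution: one second
long minutes = ms / 60_000L;        // resolution: one minute
long hours   = ms / 3_600_000L;     // resolution: one hour
long days    = ms / 86_400_000L;    // resolution: one day
```

Index the truncated value with NumericField, and remember to truncate the query bounds the same way.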
I am writing a process that returns data to subscribers every few seconds. I would like to create a unique id for the subscribers:
producer -> subscriber1
         -> subscriber2
What is the difference between using:
java.util.UUID.randomUUID()
System.nanoTime()
System.currentTimeMillis()
Will the nano time always be unique? What about the random UUID?
UUID
The 128-bit UUID was invented exactly for your purpose: Generating identifiers across one or more machines without coordinating through a central authority.
Ideally you would use the original Version 1 UUID, or its variations in Versions 2, 3, and 5. The original takes the MAC address of the host computer’s network interface and combines it with the current moment plus a small arbitrary number that increments when the host clock has been adjusted. This approach eliminates any practical concern for duplicates.
Java does not bundle an implementation for generating these Versions. I presume the Java designers had privacy and security concerns over divulging place, time, and MAC address.
Java comes with only one implementation of a generator, for Version 4. In this type all but 6 of the 128 bits are randomly generated. If a cryptographically strong random generator is used, this Version is good enough to use in most common situations without concern for collisions.
Understand that 122 bits is a really big range of numbers: 2^122 ≈ 5.3 × 10^36. For a sense of scale, 64 bits yields a range of 18,446,744,073,709,551,616 (about 18 quintillion), and the remaining 58 bits (122 - 64 = 58) yield 288,230,376,151,711,744 (about 288 quadrillion). Multiply those two numbers to get the range of 122 bits: 2^122, which is about 5.3 undecillion.
Nevertheless, if you have access to generating a Version of UUID other than 4, take it. For example in a database system such as Postgres, the database server can generate UUID numbers in the various Versions including Version 1. Or you may find a Java library for generating such UUIDs, though that library may not be platform-independent (it may have native code within).
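For completeness, the JDK's bundled Version 4 generator is a one-liner:

```java
import java.util.UUID;

// Version 4: 122 random bits; the remaining 6 bits are fixed to mark
// the version (4) and the variant (2, the RFC 4122 layout).
UUID id = UUID.randomUUID();
// id.version() == 4 and id.variant() == 2, always, for this generator
```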
System.nanoTime
Be clear that System.nanoTime has nothing to do with the current date and time. To quote the Javadoc:
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
The System.nanoTime feature simply returns a long number, a count of nanoseconds since some origin, but that origin is not specified.
The only promise made in the Java spec is that the origin will not change during the runtime of a JVM. So you know the number is ever increasing during the execution of your app, until it reaches the limit of a long, at which point the counter rolls over. That rollover might take 292 years (2^63 nanoseconds) if the origin were zero, but, again, the origin is not specified.
In my experience with the particular Java implementations I have used, the origin is the moment when the JVM starts up. This means I will most certainly see the same numbers all over again after the next JVM restart.
So using System.nanoTime as an identifier is a poor choice. Whether your app happens to hit coincidentally the exact same nanosecond number as seen in a prior run is pure chance, but a chance you need not take. Use UUID instead.
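The intended use of System.nanoTime is elapsed-time measurement, a minimal sketch (the loop is a stand-in for real work):

```java
// Only the difference between two readings is meaningful; the absolute
// value has no defined relationship to wall-clock time.
long start = System.nanoTime();
double sum = 0;
for (int i = 1; i <= 1_000; i++) {   // stand-in for real work
    sum += Math.sqrt(i);
}
long elapsedNanos = System.nanoTime() - start;
```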
java.util.UUID.randomUUID() is thread-safe.
It is not safe to rely on System.nanoTime() for uniqueness across threads: calls made within the timer's resolution, whether from one thread or several, can return the same value.
The same is true for System.currentTimeMillis(): all calls within the same millisecond return the same value.
Comparing System.currentTimeMillis() and System.nanoTime(), the latter is more expensive, as it takes more CPU cycles, but it is also more accurate. Either way, UUID should serve your purpose.
I think yes, you can use System.nanoTime() as an id. I have tested it and did not encounter duplicates.
P.S. But I strongly suggest you use UUID instead.
If I calculate the difference between two LocalDates in java.time using:
Period p = Period.between(testDate, today);
Then I get an output with the number of years, months, days like:
Days = 9
Months = 6
Years = 18
Does anyone know a clean way to represent that as a decimal value (i.e., the above would be something around 18.5)?
You mentioned in one of your comments that you need quarter-year precision. If you need the current quarter, you can use IsoFields.QUARTER_YEARS:
double yearAndQuarter = testDate.until(today, IsoFields.QUARTER_YEARS) / 4.0;
This way you will actually use the time API, always get the correct result, and @Mike won't have to loathe anything.
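Spelled out with concrete (purely illustrative) dates:

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

LocalDate testDate = LocalDate.of(2000, 1, 1);   // illustrative birth date
LocalDate today = LocalDate.of(2018, 7, 1);      // exactly 74 quarters later
// QUARTER_YEARS counts whole quarters; dividing by 4.0 converts to years.
double yearAndQuarter = testDate.until(today, IsoFields.QUARTER_YEARS) / 4.0;
// yearAndQuarter == 18.5
```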
Please do not do this.
Representing the difference between two dates as a 'number of years' multiplier is problematic because the average length of a year between two dates is dependent on which dates you are comparing. It's easy to get this wrong, and it's much harder to come up with all the test cases necessary to prove you got it right.
Most programmers should never perform date/time calculations manually. You are virtually guaranteed to get it wrong. Seriously, there are so many ways things can go horribly wrong. Only a handful of programmers on the planet fully understand the many subtleties involved. The fact that you are asking this question proves that you are not one of them, and that's okay; neither am I. You, along with the vast majority of us, should rely on a solid date/time API like java.time.
If you really need a single numeric value, then the safest option I can think of is to use the number of days, because the LocalDate API can calculate that number for you:
long differenceInDays = testDate.until(today, ChronoUnit.DAYS);
Note that this difference is only valid for the two dates used to produce it. The round-trip conversion is straightforward:
LocalDate today = testDate.plus(differenceInDays, ChronoUnit.DAYS);
Do not attempt to manually convert a Period with year, month, and day components into a whole number of days. The correct answer depends on the dates involved, which is why we want to let the LocalDate API calculate it for us.
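A round trip with illustrative dates, to make the point concrete:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

LocalDate testDate = LocalDate.of(2000, 3, 5);   // illustrative dates
LocalDate today = LocalDate.of(2018, 9, 14);
long differenceInDays = testDate.until(today, ChronoUnit.DAYS);
// Adding the day count back recovers the original end date exactly.
LocalDate restored = testDate.plus(differenceInDays, ChronoUnit.DAYS);
// restored.equals(today) is true
```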
When precision isn't important
Based on your comments, precision isn't an issue for you, because you only want to display someone's age to the nearest quarter-year or so. You aren't trying to represent an exact difference in time; only an approximate one, with a rather large margin for error. You also don't need to be able to perform any round-trip calculations. This changes things considerably.
An approximation like @VGR's should be more than adequate for these purposes: the 'number of years' should be accurate to within 3 days (< 0.01 years) unless people start living hundreds of thousands of years, in which case you can switch to double ;).
@Oleg's approach also works quite well, and will give you a date difference in whole quarters, which you can divide by 4 to convert to years. This is probably the easiest solution to get right, as you won't need to round or truncate the result. It is, I think, the closest you will get to a direct solution from java.time. The Java Time API (and date/time APIs in general) is designed for correctness: it will give you whole units, but it usually avoids giving you fractional approximations because of the inherent error in floating-point types (there are exceptions, like .NET's System.TimeSpan).
However, if your goal is to present someone's age for human users, and you want greater precision than whole years, I think 18 years, 9 months (or an abbreviated form like 18 yr, 9 mo) is a better choice than 18.75 years.
I would avoid using Period, and instead just calculate the difference in days:
float years = testDate1.until(today, ChronoUnit.DAYS) / 365.2425f;
So, I came across a problem today in my construction of a restricted Boltzmann machine that should be trivial, but seems to be troublingly difficult. Basically I'm initializing 2k values to random doubles between 0 and 1.
What I would like to do is calculate the geometric mean of this data set. The problem I'm running into is that since the data set is so long, multiplying everything together will always result in zero, and doing the proper root at every step will just rail to 1.
I could potentially chunk the list up, but I think that's really gross. Any ideas on how to do this in an elegant way?
In theory I would like to extend my current RBM code to 15k+ entries and run the RBM across multiple threads. Sadly this rules out Apache Commons Math (its geometric mean method is not synchronized) and plain longs.
Wow, using a big decimal type is way overkill!
Just take the logarithm of everything, find the arithmetic mean, and then exponentiate.
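A minimal sketch of that idea (assuming strictly positive inputs, since log(0) is undefined):

```java
// Geometric mean via logarithms: exp(mean(ln x_i)). The sum of logs stays
// small, so nothing underflows no matter how many factors there are.
static double geometricMean(double[] xs) {
    double logSum = 0.0;
    for (double x : xs) {
        logSum += Math.log(x);   // requires x > 0
    }
    return Math.exp(logSum / xs.length);
}
```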
Mehrdad's logarithm solution certainly works. You can do it faster (and possibly more accurately), though:
1. Compute the sum of the exponents of the numbers, say S.
2. Slam all of the exponents to zero so that each number is between 1/2 and 1.
3. Group the numbers into bunches of at most 1000.
4. For each group, compute the product of the numbers. This will not underflow.
5. Add the exponent of each product to S and slam its exponent to zero.
6. You now have about 1/1000 as many numbers. Repeat steps 3 through 5 until only one number remains.
7. Call the one remaining number T. The geometric mean is T^(1/N) * 2^(S/N), where N is the size of the input.
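A sketch of these steps in Java, using Math.getExponent and Math.scalb to split off and reapply powers of two (the method name is mine; inputs are assumed positive, as in the question):

```java
// Geometric mean without logs: strip each number's binary exponent into a
// running sum S, multiply the [0.5, 1) mantissas in bounded batches so the
// partial product cannot underflow, then recombine as T^(1/N) * 2^(S/N).
static double geometricMean(double[] xs) {
    long s = 0;           // step 1: running sum of binary exponents
    double t = 1.0;       // running product of mantissas
    int inBatch = 0;
    for (double x : xs) {
        int e = Math.getExponent(x) + 1;   // x = m * 2^e with m in [0.5, 1)
        s += e;                            // step 1
        t *= Math.scalb(x, -e);            // steps 2-4: multiply the mantissa
        if (++inBatch == 1000) {           // step 5: renormalize the product
            int pe = Math.getExponent(t) + 1;
            s += pe;
            t = Math.scalb(t, -pe);
            inBatch = 0;
        }
    }
    int pe = Math.getExponent(t) + 1;      // final renormalization
    s += pe;
    t = Math.scalb(t, -pe);                // T, now in [0.5, 1)
    int n = xs.length;
    return Math.pow(t, 1.0 / n) * Math.pow(2.0, s / (double) n);
}
```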
It looks like after a sufficient number of multiplications the double precision is not sufficient anymore. Too many leading zeros, if you will.
The wiki page on arbitrary precision arithmetic shows a few ways to deal with the problem. In Java, BigDecimal seems the way to go, though at the expense of speed.
I'm working with money, so I need my results to be accurate, but I only need a precision of two decimal places (cents). Is BigDecimal needed to guarantee that results of multiplication/division are accurate?
BigDecimal is a very appropriate type for decimal fraction arithmetic with a known number of digits after the decimal point. You can use an integer type and keep track of the multiplier yourself, but that involves doing in your code work that could be automated.
As well as managing the digits after the decimal point, BigDecimal will also expand the number of stored digits as needed - many business and government financial calculations involve sums too large to store in cents in an int.
I would consider avoiding it only if you need to store a very large array of amounts of money, and are short of memory.
One common option is to do all your calculations with an integer or long (the cents value) and then simply insert the two decimal places when you need to display it.
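For example (values illustrative):

```java
// Keep money as a long count of cents; arithmetic stays exact.
// Only format the value when it is displayed.
long priceCents = 1999;                       // $19.99
long totalCents = priceCents * 3;             // 5997 cents, exact
String display = String.format("%d.%02d", totalCents / 100, totalCents % 100);
// display is "59.97"
```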
Similarly, there is the Joda-Money library, which will give you a more full-featured API for money calculations.
It depends on your application. One reason to use that level of accuracy is to prevent errors accumulated over many operations from percolating up and causing loss of valuable information. If you're creating a casual application and/or are only using it for, say, data entry, BigDecimal is very likely overkill.
+1 for Patricia's answer, but I very strongly discourage anyone from implementing their own class around an integer datatype of fixed bit length unless they really know what they are doing. BigDecimal handles all the rounding and precision issues, while a long/int has severe problems:
Unknown number of fraction digits: trade exchanges, law, and commerce vary in their number of fractional digits, so you do not know whether your chosen number of digits will have to be changed and adjusted in the future. Worse: there are some things like stock evaluation which need a ridiculous number of fractional digits. A ship with 1000 metric tons of coal causes e.g. 4,12 € costs of ice, leading to 0,000412 €/ton.
Unimplemented operations: people are likely to fall back to floating-point for rounding, division, or other arithmetic operations, hiding the inexactness and leading to all the known problems of floating-point arithmetic.
Overflow/Underflow: after reaching the maximum amount, adding to it flips the sign: Long.MAX_VALUE wraps to Long.MIN_VALUE. This can easily happen with fractions like (a*b*c*d)/(e*f), which may have a perfectly valid result in the range of a long while an intermediate numerator or denominator does not.
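The intermediate-overflow case is easy to demonstrate (values illustrative):

```java
import java.math.BigDecimal;

long a = 4_000_000_000L, b = 3_000_000_000L;
// The final result (just a) fits in a long, but the intermediate a*b
// overflows and silently wraps, so the long arithmetic gives garbage.
long wrong = (a * b) / b;
// BigDecimal grows its internal representation as needed, so nothing wraps.
long right = BigDecimal.valueOf(a)
        .multiply(BigDecimal.valueOf(b))
        .divide(BigDecimal.valueOf(b))
        .longValueExact();
// wrong != a, but right == a
```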
You could write your own Currency class, using a long to hold the amount. The class methods would set and get the amount using a String.
Division will be a concern no matter whether you use a long or a BigDecimal. You have to determine on a case by case basis what you do with fractional cents. Discard them, round them, or save them (somewhere besides your own account).
Is there any library or open-source function that approximates the area under a line described by some of its values taken at irregular intervals?
ActionScript would be preferred, but Java might work fine as well.
You could use the as3mathlib math library. Here's the relevant class:
http://code.google.com/p/as3mathlib/source/browse/trunk/src/com/vizsage/as3mathlib/math/calc/Integral.as
It includes the most common integral approximation methods.
Edit for more explanation (based on comments below):
Use timestamp values for each date; only convert to anything else if you need to display it to the user, and do so at the very end.
Hopefully there's a standard greatest common divisor (GCD) among the differences between each pair of adjacent timestamps. In other words, hopefully each timestamp differs by some whole number of days, in which case the GCD is 1 day. If not, you'll need to calculate the GCD on the fly first.
Then, use the GCD value in combination with the delta between the first and last timestamps to determine n, the number of partitions. Then, in f (your function to be integrated), determine whether the passed x corresponds to a defined timestamp. If so, return the numeric_value associated with that timestamp. If not, interpolate between the numeric_values of the nearest two defined timestamps, and return that.
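A Java sketch of that scheme (the names are mine; it assumes the timestamps are sorted and positive-gapped), using the trapezoid rule on each GCD-sized partition with linear interpolation between the defined points:

```java
// Integrate irregularly sampled (timestamp, value) data: find the GCD of
// the gaps, walk the range in GCD-sized partitions, and apply the
// trapezoid rule, interpolating where a partition edge has no sample.
static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }

static double areaUnder(long[] ts, double[] vs) {
    long step = 0;
    for (int i = 1; i < ts.length; i++) {
        step = gcd(step, ts[i] - ts[i - 1]);      // GCD of adjacent deltas
    }
    long n = (ts[ts.length - 1] - ts[0]) / step;  // number of partitions
    double area = 0.0;
    for (long k = 0; k < n; k++) {
        long x0 = ts[0] + k * step;
        area += step * (f(ts, vs, x0) + f(ts, vs, x0 + step)) / 2.0;
    }
    return area;
}

// f: the defined value at x, or a linear interpolation between the
// nearest two defined timestamps.
static double f(long[] ts, double[] vs, long x) {
    int i = 0;
    while (ts[i + 1] < x) i++;
    double frac = (x - ts[i]) / (double) (ts[i + 1] - ts[i]);
    return vs[i] + frac * (vs[i + 1] - vs[i]);
}
```

Because the interpolation is linear and the trapezoid rule is exact for linear segments, this reproduces the piecewise-linear area exactly.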