This is my O( N log N ) solution for the NoCOUG's "First international SQL challenge" problem.

It is based on the FFT (Fast Fourier Transform) implemented using the Oracle MODEL clause.

**UPDATE 2009-07-31**: this solution has been ranked first and hence has won the contest !

See Iggy Fernandez 's and Chen Shapira 's blog posts

**:)**

"Find the pmf (probability mass function) P(s,N), s being the sum of the face values of N throws of a biased die, given the pmf of one throw, that is, P(s,1)".

In order to solve the problem using only the available information, we must assume that each throw of the die is independent from the other ones (that is, that the die does not "remember" the previous throws). If this holds, s is the sum of N independent random variables with the same pmf P(s,1), hence we can calculate P(s,N) by calculating the convolution N-th power of P(s,1), since the pmf of the sum of independent random variables is the convolution of the pmf of the variables (see here), and in our problem, all throws have the same pmf P(s,1).

Calculating a single convolution has a complexity of O ( N^2 ), and since we need to perform O ( N ) convolutions, the total complexity would be O ( N^3 ); this is significantly much better then the O ( k^N ) complexity of a brute-force calculation over all possible throws combinations, but still not practically feasible at all even for very small N.

But the DFT (Discrete Fourier Transform) of the convolution is the product of the DFT of the inputs (convolution theorem), and hence, for our case:

DFT [ P(s,N) ] = DFT [ P(s,1) ] ^ N

that is, we can calculate P(s,N) by transforming P(s,1), raising the spectrum to the N-th power, and then inverse transforming.

In passing, it is interesting to note that the FT transformation makes the original recursive problem a non-recursive one, hence easily solved using SQL. But this implementation has the only merit of being quite simple and easy to port on different database products with little effort (it would be enough to generate sequential values using a table instead of Mikito Harakiri's connect-by trick), since O ( N^2 ) is still too resource intensive, and hence unusable for non-small N. We need to switch to the FFT.

All the FFT variants are based on a divide-and-conquer approach, in which the FT is reformulated as a function of the FT of a subset of the points; the latter FTs are then recursively reformulated as well, until we get to a subset of exactly one point (the base case of the recursion).

Even if the FFT algorithms make sophisticated reasonings in order to avoid explicit recursion, their "recursiveness" is anyway apparent, since they are always formulated as a succession of steps that need to be feeded by the previous step; in a manner, the "recursiveness" of the original problem, that was eliminated by the FT, comes back in the FFT. This is especially challenging for the SQL language, that has not been designed (or at least is not usually implemented) to support easily multiple iterations that are feeded by the previous one.

It is probably possible to implement the FFT by using the Oracle connect-by or the new 11gR2 feature mentioned by Vadim Tropashko, but that would probably (depending on how much the Sql Optimizer is smart) need auxiliary storage to store the logN intermediate steps; by using the Oracle model clause, it is possible to solve the problem by using only O(N) auxiliary storage, as shown below.

For an introduction of the Model clause, look no further than the excellent tutorials by Rob van Wijk's (part one, part two and part three), that have made me productive in a couple of hours starting from zero.

A quick summary of my FFT implementation below: it uses a model composed by a series of vectors indexed by f (the frequencies). The real[f] and imag[f] are the main ones; they are initialized with the input vector to be transformed, and then are modified by the model rules (a succession of model iterations) until they contain the output values. Each iteration implements a "stage" of the algorithm, each one composed by a "butterfly" computation and a complex multiplication by W (twiddle factors, more formally known as "roots of unity"), stages that mimic the original recursion. See the excellent CMLAB link above for more informations.

Do not be scared by the length of the solution, it is just a succession of simple steps, as it is often the case in numerical programming ...

The script compare_results.sql automatically compares the output with Rob van Wijk's solution for a random die (random P(s,1)), over a small range of N values (to keep the test duration manageable).

The script test_correctness.sql automatically performs automatic tests over these P(s,1) distributions set :

We note that the execution time is indeed asymptotically very well approximated by NlogN.

Note also that the absolute performance, already not bad at all, could be further optimized by making standard optimizations (e.g. avoiding multiplications by 1 or zero, using other more sophisticated variants instead of the Cooley-Tukey) that have not been made for the sake of clarity (and also because they provide only linear improvements).

Also, notice that the FFT uses only O(N) additional memory, hence the available RAM should not be an issue; of course, as N increases the CPU/RAM caches could be used much less efficiently.

And yes, it should be noted that the Central Limit Theorem is going to kick in even for relatively small values of N, hence the problem could be easily solved for "large" N by calculating the asymptotic Gaussian and using it instead of the numeric calculation - but that would be cheating :)

## Attachment

All the scripts and code mentioned in this document are available here.## Statistical analysis

The problem can be formally reformulated as follows:"Find the pmf (probability mass function) P(s,N), s being the sum of the face values of N throws of a biased die, given the pmf of one throw, that is, P(s,1)".

In order to solve the problem using only the available information, we must assume that each throw of the die is independent from the other ones (that is, that the die does not "remember" the previous throws). If this holds, s is the sum of N independent random variables with the same pmf P(s,1), hence we can calculate P(s,N) by calculating the convolution N-th power of P(s,1), since the pmf of the sum of independent random variables is the convolution of the pmf of the variables (see here), and in our problem, all throws have the same pmf P(s,1).

Calculating a single convolution has a complexity of O ( N^2 ), and since we need to perform O ( N ) convolutions, the total complexity would be O ( N^3 ); this is significantly much better then the O ( k^N ) complexity of a brute-force calculation over all possible throws combinations, but still not practically feasible at all even for very small N.

But the DFT (Discrete Fourier Transform) of the convolution is the product of the DFT of the inputs (convolution theorem), and hence, for our case:

DFT [ P(s,N) ] = DFT [ P(s,1) ] ^ N

that is, we can calculate P(s,N) by transforming P(s,1), raising the spectrum to the N-th power, and then inverse transforming.

## Implementing the DFT as a single SQL statement

Calculation by using the DFT definition decreases the complexity to O ( N^2 ), a fact that is well known and that can be tested by using my script adellera_FT.sql. The computation is trivial, basically the FT computation is made by building a cartesian join of the input vector times an equally sized frequency vector, computing the elements and then summing by grouping by the output vector (by the frequencies for the direct DFT, by s for the inverse DFT).In passing, it is interesting to note that the FT transformation makes the original recursive problem a non-recursive one, hence easily solved using SQL. But this implementation has the only merit of being quite simple and easy to port on different database products with little effort (it would be enough to generate sequential values using a table instead of Mikito Harakiri's connect-by trick), since O ( N^2 ) is still too resource intensive, and hence unusable for non-small N. We need to switch to the FFT.

## Implementing the FFT as a single SQL statement

By using the classic FFT (Fast Fourier Transform) algorithm, we can get a much better (and probabily optimal) complexity of O ( N log N ). The challenge is how to implement a FFT in SQL.All the FFT variants are based on a divide-and-conquer approach, in which the FT is reformulated as a function of the FT of a subset of the points; the latter FTs are then recursively reformulated as well, until we get to a subset of exactly one point (the base case of the recursion).

Even if the FFT algorithms make sophisticated reasonings in order to avoid explicit recursion, their "recursiveness" is anyway apparent, since they are always formulated as a succession of steps that need to be feeded by the previous step; in a manner, the "recursiveness" of the original problem, that was eliminated by the FT, comes back in the FFT. This is especially challenging for the SQL language, that has not been designed (or at least is not usually implemented) to support easily multiple iterations that are feeded by the previous one.

It is probably possible to implement the FFT by using the Oracle connect-by or the new 11gR2 feature mentioned by Vadim Tropashko, but that would probably (depending on how much the Sql Optimizer is smart) need auxiliary storage to store the logN intermediate steps; by using the Oracle model clause, it is possible to solve the problem by using only O(N) auxiliary storage, as shown below.

## Implementing the FFT using the Oracle model clause

There are a lot of excellent resources online to get a better understanding of the FFT in general and the classic Cooley-Tukey variant I've used in particular; above all, I love the CMLAB page, that contains those wonderful diagrams that nicely explain visually the data flow. Of course, many pages of wikipedia have been useful as well.For an introduction of the Model clause, look no further than the excellent tutorials by Rob van Wijk's (part one, part two and part three), that have made me productive in a couple of hours starting from zero.

A quick summary of my FFT implementation below: it uses a model composed by a series of vectors indexed by f (the frequencies). The real[f] and imag[f] are the main ones; they are initialized with the input vector to be transformed, and then are modified by the model rules (a succession of model iterations) until they contain the output values. Each iteration implements a "stage" of the algorithm, each one composed by a "butterfly" computation and a complex multiplication by W (twiddle factors, more formally known as "roots of unity"), stages that mimic the original recursion. See the excellent CMLAB link above for more informations.

Do not be scared by the length of the solution, it is just a succession of simple steps, as it is often the case in numerical programming ...

with die_ieee as ( -- the die casted using IEEE 754 arithmetic select cast (face_value as binary_double) as face_value, cast (probability as binary_double) as probability from die ), base_constants as ( -- Calculate max_s_1, the highest s for which P(s,1) is > 0, -- that is, the max face value of the die_ieee select max (face_value) as max_s_1 from die_ieee where probability != 0 ), derived_constants as ( -- Calculate max_s_N, the highest s for which P(s,N) is > 0, which is (:N * max_s_1): -- for example, max_s_1=8, after 3 throws the max s observed will be 8*3=24. -- Hence we need to calculate P(s,N) over the range 0 .. max_s_N, -- that is, we need to calculate the FT over max_s_N * :N + 1 values. -- Since the Cooley-Tukey FFT variant needs to operate on vectors whose size -- is an exact power of two, we calculate and store in log_vector_size the -- logarithm of the smallest such power of two. select max_s_1, max_s_1 * cast (:N as binary_double) as max_s_N, ceil ( log (2D, max_s_1 * cast (:N as binary_double) + 1D) ) as log_vector_size from base_constants ), constants as ( -- add vector_size=2^log_vector_size and the useful PI constant select log_vector_size, power (2D, log_vector_size) as vector_size, 3.14159265358979323846264338327950288419716939937510D as PI from derived_constants ), s_sequence as ( -- build the sequence of all possible s values: 0..vector_size -- this is of course the number of frequencies as well select cast (rownum-1 as binary_double) as s from dual connect by level <= (select vector_size from constants) ), p_s_1 as ( -- set P(s,1) over all possible values of s (zero for non-existent face value) select s, nvl (die_ieee.probability, 0) as probability from s_sequence, die_ieee where s_sequence.s = die_ieee.face_value(+) and die_ieee.probability(+) != 0D ), dft_p_s_1 as ( -- calculate DFT [ P(s,1) ] using Cooley-Tukey decimation-in-frequency -- NB: outputs are scrambled in bit-reversal order -- see http://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/transform/fft.html select f, real, imag from (select s as f, probability as real, 0D as imag from p_s_1) model -- lookup tables for constants (import constants inside the model) reference constants on (select 0D dummy, c.* from constants c) dimension by (dummy) measures (log_vector_size) -- lookup table for butterfly parameters, indexed by iteration_number -- the "width" is how much the butterfly wings are "open" -- the "w_expon" is the component of the exponent of W that varies with -- each iteration reference butterfly on (select cast (rownum-1 as binary_double) as iter, cast (power (2, log_vector_size - rownum) as binary_double) as width, cast (power (2, rownum-1) as binary_double) as w_expon from dual, (select log_vector_size from constants) connect by level <= (select log_vector_size from constants) ) dimension by (iter) measures (width, w_expon) -- lookup table for the W (twiddle factors a.k.a. roots of unity) exponents reference W on (select cast (rownum-1 as binary_double) as r, cast (cos ( -2 * PI * (rownum-1) / vector_size ) as binary_double) as real, cast (sin ( -2 * PI * (rownum-1) / vector_size ) as binary_double) as imag from dual, (select PI, vector_size from constants) connect by level <= (select vector_size from constants) ) dimension by (r) measures (real, imag) main m dimension by (f) -- real(f) and imag(f) are the input and outputs -- aux_real(f) and aux_imag(f) are the O(N) auxiliary storage -- exp(f) is the exponent for the W factors -- butterfly_side(f) is 'up' if the current frequency has to be feeded with the up-pointing -- (sum-only) half of the butterfly, and 'down' if it must use the down-pointing (difference) side measures (real, imag, 0D aux_real, 0D aux_imag, 0D exp, 'xxxx' butterfly_side) rules sequential order iterate (999999) until ( iteration_number+1 = constants.log_vector_size[0] ) ( -- calc butterfly side (up or down) butterfly_side[any] = case when mod (trunc (cv(f) / butterfly.width[iteration_number]), 2D) = 0D then 'up' else 'down' end, -- crosses (or wings) of the butterflies: complex sum or difference (to auxiliary storage) aux_real[any] = case when butterfly_side[cv(f)] = 'up' then + real[cv(f)] + real[cv(f) + butterfly.width[iteration_number]] else - real[cv(f)] + real[cv(f) - butterfly.width[iteration_number]] end, aux_imag[any] = case when butterfly_side[cv(f)] = 'up' then + imag[cv(f)] + imag[cv(f) + butterfly.width[iteration_number]] else - imag[cv(f)] + imag[cv(f) - butterfly.width[iteration_number]] end, -- exponent of twiddle (or roots of unity) factors W exp[any] = case when butterfly_side[cv(f)] = 'up' then 0D else mod (cv(f), butterfly.width[iteration_number]) * butterfly.w_expon[iteration_number] end, -- multiplication by the twiddle (or roots of unity) factors W (from auxiliary storage) real[any] = case when butterfly_side[cv(f)] = 'up' then aux_real[cv(f)] else aux_real[cv(f)] * W.real[ exp[cv(f)] ] - aux_imag[cv(f)] * W.imag[ exp[cv(f)] ] end, imag[any] = case when butterfly_side[cv(f)] = 'up' then aux_imag[cv(f)] else aux_real[cv(f)] * W.imag[ exp[cv(f)] ] + aux_imag[cv(f)] * W.real[ exp[cv(f)] ] end ) ), dft_p_s_1_polar as ( -- transform DFT [ P(s,1) ] in polar form select f, sqrt ( real * real + imag * imag ) as modulus, case when abs(imag) < 1e-6D and abs(real) < 1e-6D then 0D else atan2 ( imag , real ) end as phase from dft_p_s_1 ), dft_p_s_N_polar as ( -- calc the polar form of DFT [ P(s,N) ] select f, power (modulus , cast (:N as binary_double)) as modulus, phase * cast (:N as binary_double) as phase from dft_p_s_1_polar ), dft_p_s_N as ( -- transform the polar form of DFT [ P(s,N) ] in cartesian form select f, modulus * cos (phase) as real, modulus * sin (phase) as imag from dft_p_s_N_polar ), p_s_N as ( -- calculate P(s,N) by using the inverse of Cooley-Tukey decimation-in-TIME -- NB: outputs are needed in bit-reversal order, which is exactly the order -- that the decimation-in-frequency outputs them, hence we save two very expensive bit-reversals -- See the comments above for the meaning of all steps, since -- a) decimation-in-time is the same as decimation-in-frequency, with different -- parameters for the butterflies and different orders for the operations -- (the pictures in the quoted URL to the CMLAB are worth ten thousand words) -- b) the W sign is the opposite since we are calculating the inverse select f as s, -- definition of the inverse need 1/vector_size as a normalization term round(real / (select vector_size from constants),6) as probability -- note: round() returns NUMBER not BINARY_DOUBLE -- imaginary part is of course zero from ( select f, real, imag from (select f, real, imag from dft_p_s_N) model reference constants on (select 0D dummy, c.* from constants c) dimension by (dummy) measures (log_vector_size) reference butterfly on (select cast (rownum-1 as binary_double) as iter, cast (power (2, rownum-1) as binary_double) as width, cast (power (2, log_vector_size - rownum) as binary_double) as w_expon from dual, (select log_vector_size from constants) connect by level <= (select log_vector_size from constants) ) dimension by (iter) measures (width, w_expon) reference W on (select cast (rownum-1 as binary_double) as r, cast (cos ( +2 * PI * (rownum-1) / vector_size ) as binary_double) as real, cast (sin ( +2 * PI * (rownum-1) / vector_size ) as binary_double) as imag from dual, (select PI, vector_size from constants) connect by level <= (select vector_size from constants) ) dimension by (r) measures (real, imag) main m dimension by (f) measures (real, imag, 0D aux_real, 0D aux_imag, 0D exp, 'xxxx' butterfly_side) rules sequential order iterate (999999) until ( iteration_number+1 = constants.log_vector_size[0] ) ( -- calc butterfly side (up or down) butterfly_side[any] = case when mod (trunc (cv(f) / butterfly.width[iteration_number]), 2D) = 0D then 'up' else 'down' end, -- exponent of twiddle (or roots of unity) factors W exp[any] = case when butterfly_side[cv(f)] = 'up' then 0D else mod (cv(f), butterfly.width[iteration_number]) * butterfly.w_expon[iteration_number] end, -- multiplication by the twiddle (or roots of unity) factors W (to auxiliary storage) aux_real[any] = case when butterfly_side[cv(f)] = 'up' then real[cv(f)] else real[cv(f)] * W.real[ exp[cv(f)] ] - imag[cv(f)] * W.imag[ exp[cv(f)] ] end, aux_imag[any] = case when butterfly_side[cv(f)] = 'up' then imag[cv(f)] else real[cv(f)] * W.imag[ exp[cv(f)] ] + imag[cv(f)] * W.real[ exp[cv(f)] ] end, -- crosses (or wings) of the butterflies: complex sum or difference (from auxiliary storage) real[any] = case when butterfly_side[cv(f)] = 'up' then + aux_real[cv(f)] + aux_real[cv(f) + butterfly.width[iteration_number]] else - aux_real[cv(f)] + aux_real[cv(f) - butterfly.width[iteration_number]] end, imag[any] = case when butterfly_side[cv(f)] = 'up' then + aux_imag[cv(f)] + aux_imag[cv(f) + butterfly.width[iteration_number]] else - aux_imag[cv(f)] + aux_imag[cv(f) - butterfly.width[iteration_number]] end ) ) ) select cast (s as number) as s, probability as probability from p_s_N order by s ;

## Functional testing

The output has been compared during development with the results of two Oracle brute-force solution known, namely the Rob van Wijk's and the Laurent Schneider ones, for many different P(s,1).The script compare_results.sql automatically compares the output with Rob van Wijk's solution for a random die (random P(s,1)), over a small range of N values (to keep the test duration manageable).

The script test_correctness.sql automatically performs automatic tests over these P(s,1) distributions set :

- compares the outputs for a classic die (face_value=1..6, not biased) with the exact formula - note that the exact formula, albeit built on the simplified classic case, is far from trivial and especially is derived using a completely different statistical reasoning, hence matching is a strong indicator of functional correctness;
- compares the outputs for a random die (random P(s,1)) with the output of the FT (ensures correctness of the FFT implementation).

## Performance testing

The script check_performance.sql executes and log the SQL statement execution time, and the ratio execution_time/(NlogN); the results on my machine areN MILLIS N*LOG(2,N) RATIO ---------- ---------- ---------- ---------- 512 18470 4608 4.00824653 1024 35320 10240 3.44921875 2048 82920 22528 3.68075284 4096 174230 49152 3.54471842 8192 383670 106496 3.60267052 16384 827230 229376 3.60643659 32768 1836970 491520 3.73732503 MIN(RATIO) AVG(RATIO) MAX(RATIO) ---------- ---------- ---------- 3.44921875 3.66133838 4.00824653

We note that the execution time is indeed asymptotically very well approximated by NlogN.

Note also that the absolute performance, already not bad at all, could be further optimized by making standard optimizations (e.g. avoiding multiplications by 1 or zero, using other more sophisticated variants instead of the Cooley-Tukey) that have not been made for the sake of clarity (and also because they provide only linear improvements).

## Scalability

O ( N log N ) is a very good scalability, very close to the ideal linear scalability, hence we can scale by adding CPU power very easily. Also, since the FFT is inherently recursive, the statement could be easily modified to be run in parallel - e.g. by combining the results of K sub-FFT executed in parallel on K processors.Also, notice that the FFT uses only O(N) additional memory, hence the available RAM should not be an issue; of course, as N increases the CPU/RAM caches could be used much less efficiently.

And yes, it should be noted that the Central Limit Theorem is going to kick in even for relatively small values of N, hence the problem could be easily solved for "large" N by calculating the asymptotic Gaussian and using it instead of the numeric calculation - but that would be cheating :)

## Limitations

If the face values are very sparse (e.g. 1,2,3,4,5,10000000), the FT approach is going to be not optimal, since it needs to work on vectors whose size is in the order of max(face_value). Of course, FFT algorithms specialized on sparse vectors do exist and can be implemented in SQL; but possibly other approaches might be more efficient in this special case.## Contacts

Please send comments and requests to alberto.dellera@gmail.com.