A FLEXIBLE IMPLEMENTATION OF A MATRIX LAURENT SERIES-BASED
16-POINT FAST FOURIER AND HARTLEY TRANSFORMS

R. C. de Oliveira
Computer Engineering Department
Amazon State University
Av. Darcy Vargas, 1200, Parque 10, 69065-020, Manaus-AM, Brazil.
email: rcorrea.oliveira@gmail.com

H. M. de Oliveira, R. Campello de Souza,
E.J.P. Santos
DES - Federal University of Pernambuco
Av. Ac. Hélio Ramos, s/n, CDU, 50740-530,
Recife-PE, Brazil
email: (hmo, ricardo)@ufpe.br, edval@ee.ufpe.br

ABSTRACT
This paper describes a flexible architecture for implementing a new fast computation of the discrete Fourier and Hartley transforms, which is based on a matrix Laurent series. The device calculates the transforms based on a single bit selection operator. The hardware structure and synthesis are presented, which handled a 16-point fast transform in 65 nsec, with a Xilinx SPARTAN 3E device.

1. INTRODUCTION
Fourier transforms play a major role in the fields related to Signal Processing [1, 2]. The successful application of transform techniques is mainly due to the existence of the so-called fast algorithms [3]. This paper proposes a new fast algorithm and its hardware implementation, for computing the discrete Fourier (DFT) [4] and Hartley (DHT) [5] transforms of sequences of particular lengths N, namely those for which \( N = 0 \pmod{4} \). Let \( N \) be the number of time-domain samples of a sequence \( v = \{v_n\} \), \( n = 0,1,2,...,N-1 \). The DFT of \( v \) is given by the sequence \( V = \{V_k\} \), of length \( N \), defined by Equation (1).

\[
V_k = \sum_{n=0}^{N-1} v_n \exp\left(-\frac{j2\pi kn}{N}\right). \tag{1}
\]

The Discrete Hartley Transform (DHT) is defined by Equation (2) [6].

\[
H_k = \sum_{n=0}^{N-1} v_n \left[\cos\left(\frac{2\pi kn}{N}\right) + \sin\left(\frac{2\pi kn}{N}\right)\right]. \tag{2}
\]

From equations (1) and (2), it is apparent that the DFT can be computed from the DHT by Equation (3).

\[
H_k = \Re(V_k) - \Im(V_k). \tag{3}
\]

In 1965, J.W. Cooley and J.W. Tukey introduced a revolutionary idea which later became known as the Fast Fourier Transform (FFT) [4]. The FFT is a milestone in the theory of algorithms [7-9].

2. THE FFT/FHT ALGORITHM
The fast algorithm is written according to the following DFT matrix decomposition: \( \Re(M) \) and \( \Im(M) \) denotes the real and imaginary parts of the matrix \( M \), equations (3) and (4), respectively [20].

\[
\Re(DFT) = \left\{ \Re(M_k) + \sum_{m=1}^{(N/4-1)/2} \Re(M_w + M_{-w}) \cos\frac{2\pi mn}{N} \right\} \nonumber \tag{4}
\]

\[
+ \left\{ \sum_{k=1}^{(N/4-1)/2} \Im(M_k - M_{-k}) \sin\frac{2\pi mn}{N} \right\}
\]

\[
\Im(DFT) = \left\{ \Im(M_k) + \sum_{m=1}^{(N/4-1)/2} \Im(M_w + M_{-w}) \cos\frac{2\pi mn}{N} \right\} \nonumber \tag{5}
\]

\[
- \left\{ \sum_{k=1}^{(N/4-1)/2} \Re(M_k - M_{-k}) \sin\frac{2\pi mn}{N} \right\}, \tag{5}
\]

where the matrices \( M_m \) are given by Equation (5).

\[
M_m := \sum_{l \in C_m} (-1)^{4l-m} \chi_l(M). \tag{6}
\]
The operator $\mathcal{X}_l$ acts over an $N \times N$ matrix, for each $l = 0, 1, 2, \ldots, N-1$, yielding a new $N \times N$ binary matrix whose elements are $\left( \delta_{l,m} \right)$ with $m := m_l$, where $\delta$ is the Kronecker symbol. The matrices $\Re(m)$, $\Re(M_l \pm M_{-l})$, $\Im(M_l)$ and $\Im(M_l \pm M_{-l})$ are then written in standard echelon form. From (4) and (5), the DHT components can be computed by Equation (3).

3. DESIGN METHODOLOGY AND ARCHITECTURE

The design was carried out through the steps: specification, VHDL description, behavioral simulation and synthesis.

3.1 Specification

The project aims at the production of a fast DFT/DHT computer, for real sequences with blocklength 16, every component of which is represented by a 16-bit word. The computations are to be made using fixed-point arithmetic (7 bits). Every component of the output sequence is represented by a 32-bit word (16 bits for its real part and 16 bits for its imaginary part). With a single bit selection, the user can choose which transform is to be computed.

3.2 VHDL Description

The device VHDL description was generated with the aid Matlab™ (simulink), according to Equations (5) and (6).

3.3 Behavioral Simulation

With Xilinx ISE tool, the VHDL description was compiled and simulated to check the output data. Syntax errors were removed at this point. The behavioral simulation was carried out by Xilinx ISE tool, and testbench files were generated. The stimulus (input values) were inserted and results collected according to diagrams from Fig. 3 (DHT) and Fig. 4 (DFT) (Appendix).

3.4 Synthesis

At this step, the VHDL code was analysed and optimized by the synthesis tool, in order to create an efficient implementation of the device. A Register Transfer Level (RTL) scheme was generated. The construction tool generates the file to be burn-in in the chosen device, namely, the Spartan 3E, xc3s500e-5-fg320 device. In the post-synthesis simulation, the finished circuit process occurs at 65 ns, and all obtained results corroborate the previous behavioral simulation. Main characteristics of the hardware: number of slices is 1611 out of 4656 (34%), number of slice flip-flops is 656 out of 9312 (7%), number of 4 input LUTs is 2894 out of 9312 (31%), number of IOs is 56, number of bonded IOBs is 56 out of 232 (24%), number of MULT18X18SIOs is 12 out of 20 (60%), and number of GCLKs is 2 out of 24 (8%).

3.5. Architecture

The device architecture is based on two main blocks, the memory management and the core block, as shown in Fig. 1. The block of memory management stores the components of the input signal and the output transform.

![Fig. 1 - Fast algorithm architecture FFT/FHT. The arithmetic computation block corresponds to the Core Block.](image)

The core block is responsible for calculating the DFT coefficients, according to Fig. 2. From these, the memory management block selects and computes the desired transform.

![Fig. 2 – Core Block architecture. This block is used both in DFT/DHT computation. (See Fig. 1)](image)

After storage in the memory, the device calculates the transforms based on the selected operator (DFT/DHT). Thereby, the transforms are stored in the memory, as follows. For the DFT: the first two bytes hold the real components and the last two hold the imaginary components; For the DHT: only the last 16 bits hold the real components. The core block was derived from a Simulink™ implementation. The arithmetic complexity for this 16-point fast algorithm is 12 fixed-point multiplications and 101 additions.
4. SIMULATION RESULTS

The implementation was simulated using Xilinx ISE tool to verify the accuracy of the device and the results were compared with the ones obtained from Simulink™. The results are shown in Figs. 3, 4. The simulation was made with several input sequences. For the particular sequence \( v = [0 \ 1 \ 2 \ 3 \ 4 \ 5 \ 6 \ 7 \ 0 \ 1 \ 2 \ 3 \ 4 \ 5 \ 6 \ 7] \), the results are shown in Table 1. The output DFT sequences of the simulation carried out by the device was exactly the same as that one obtained by Simulink™ (Column 3, Table 1). Both are slightly different than the true-DFT sequence computed by an internal MatLab™ routine (Column 3, into brackets). The quantization error (0.3%) is due to the use of fixed-point instead of floating-point arithmetic. Nevertheless, this error magnitude is acceptable for many applications.

<table>
<thead>
<tr>
<th>Index (k)</th>
<th>Input Data</th>
<th>DFT</th>
<th>Output Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>56</td>
<td>56</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>2</td>
<td>-8.0000 +19.3137j</td>
<td>27.3137</td>
</tr>
<tr>
<td>3</td>
<td>3</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>4</td>
<td>4</td>
<td>-8.0000 + 8.0000j</td>
<td>16.0000</td>
</tr>
<tr>
<td>5</td>
<td>5</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>6</td>
<td>6</td>
<td>-8.0000 + 3.3137j</td>
<td>11.3137</td>
</tr>
<tr>
<td>7</td>
<td>7</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>8</td>
<td>8</td>
<td>-8.0000</td>
<td>-8.0000</td>
</tr>
<tr>
<td>9</td>
<td>9</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>10</td>
<td>2</td>
<td>-8.0000 - 3.3137j</td>
<td>-4.625</td>
</tr>
<tr>
<td>11</td>
<td>3</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>12</td>
<td>4</td>
<td>-8.0000 - 8.0000j</td>
<td>0</td>
</tr>
<tr>
<td>13</td>
<td>5</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>14</td>
<td>6</td>
<td>-8.0000 -19.3137j</td>
<td>11.3137</td>
</tr>
<tr>
<td>15</td>
<td>7</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

5. CONCLUDING REMARKS

This paper presented the design and implementation on the xc3s500e-5-fg320 device, of an algorithm that is capable of fast computing the DFT/DHT coefficients of a real sequence with blocklength \( N=16 \). The device computes either transform based on a single bit selection. The device computed the DFT/DHT at 65 nsec, which is acceptable for applications such as audio, speech processing, biomedical signal processing and xDSL.

REFERENCES

APPENDIX

Fig. 3 Results for behavioral simulation in operation DHT using Xilinx ISE tool.

Fig. 4 Results for behavioral simulation in operation DFT using Xilinx ISE tool.