
  
Lanczos and conjugate gradient methods

We first describe the Lanczos algorithm. Suppose that we have to solve the system

 
\begin{displaymath}Ax = w
\end{displaymath} (1)

for a column n-vector x, where A is a symmetric $n \times n$ matrix, and w is a given column n-vector. Let
 
\begin{displaymath}w_0 = w,
\end{displaymath} (2)
\begin{displaymath}v_1 = Aw_0,
\end{displaymath} (3)
\begin{displaymath}w_1 = v_1 - \frac{(v_1,v_1)}{(w_0,v_1)}w_0,
\end{displaymath} (4)

and then, for $i \geq 1$, define
 
\begin{displaymath}v_{i+1} = Aw_i,
\end{displaymath} (5)
\begin{displaymath}w_{i+1} = v_{i+1} - \frac{(v_{i+1},v_{i+1})}{(w_i,v_{i+1})}w_i
- \frac{(v_{i+1},v_i)}{(w_{i-1},v_i)}w_{i-1}.
\end{displaymath} (6)

The algorithm stops when it finds a $w_j$ that is conjugate to itself, i.e. such that $(w_j,Aw_j) = 0$. This happens for some $j \leq n$. If $w_j = 0$, then

\begin{displaymath}x = \sum_{i=0}^{j-1} b_iw_i
\end{displaymath} (7)

is a solution to Equation 1, where

 \begin{displaymath}b_i = \frac{(w_i,w)}{(w_i,v_{i+1})}.
\end{displaymath} (8)

(If $w_j \neq 0$, the algorithm fails.)
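
For concreteness, the recurrence might be implemented as in the following sketch, written over a small prime field. The modulus P, the dense storage of A, and the helper names are conveniences of this sketch only and do not reflect the implementations described below.

\begin{verbatim}
#include <stdint.h>

#define N 4                          /* toy system size */
static const uint64_t P = 1000003;   /* small prime standing in for the real modulus */

static uint64_t inv_mod(uint64_t a)  /* a^(P-2) mod P, by Fermat's little theorem */
{
    uint64_t r = 1, e = P - 2;
    while (e) {
        if (e & 1) r = r * a % P;
        a = a * a % P;
        e >>= 1;
    }
    return r;
}

static uint64_t dot(const uint64_t *x, const uint64_t *y)
{
    uint64_t s = 0;
    for (int i = 0; i < N; i++) s = (s + x[i] * y[i]) % P;
    return s;
}

static void matvec(const uint64_t A[N][N], const uint64_t *x, uint64_t *y)
{
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++) y[i] = (y[i] + A[i][j] * x[j]) % P;
    }
}

/* Solve Ax = w for symmetric A over GF(P).  Returns 0 on success, -1 if a
 * non-zero self-conjugate w_j is found (the failure case noted above). */
int lanczos(const uint64_t A[N][N], const uint64_t *w, uint64_t *x)
{
    uint64_t wc[N], wp[N] = {0}, wn[N], v[N], vp[N] = {0};
    for (int i = 0; i < N; i++) { wc[i] = w[i]; x[i] = 0; }

    for (int iter = 0; iter <= N; iter++) {
        matvec(A, wc, v);                      /* v_{i+1} = A w_i               */
        uint64_t wv = dot(wc, v);              /* (w_i, v_{i+1}) = (w_i, A w_i) */
        if (wv == 0) {                         /* w_i is conjugate to itself    */
            for (int i = 0; i < N; i++)
                if (wc[i] != 0) return -1;     /* w_j != 0: the method fails    */
            return 0;                          /* w_j = 0: x solves Ax = w      */
        }
        uint64_t b = dot(wc, w) * inv_mod(wv) % P;          /* b_i, Equation 8  */
        for (int i = 0; i < N; i++) x[i] = (x[i] + b * wc[i]) % P;

        uint64_t c1 = dot(v, v) * inv_mod(wv) % P;           /* Equation 6      */
        uint64_t c2 = iter ? dot(v, vp) * inv_mod(dot(wp, vp)) % P : 0;
        for (int i = 0; i < N; i++) {
            wn[i] = (v[i] + (P - c1) * wc[i] + (P - c2) * wp[i]) % P;
            vp[i] = v[i];
        }
        for (int i = 0; i < N; i++) { wp[i] = wc[i]; wc[i] = wn[i]; }
    }
    return -1;
}
\end{verbatim}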

The Lanczos algorithm was invented to solve systems with real coefficients [11]. To solve systems over finite fields, we simply take Equations 3 to 8 and apply them in the finite field setting. This causes possible problems, since over a finite field one can have a non-zero vector that is conjugate to itself. However, this difficulty, as well as some others that arise, can be overcome in practice.

In addition to dealing with self-conjugacy, we have to cope with the fact that the systems we need to solve are in general not symmetric, but rather are of the form

 
\begin{displaymath}Bx = u,
\end{displaymath} (9)

where B is $m \times n$, $m \geq n$, x is an unknown column n-vector, and u is a given column m-vector. (We assume that B has rank n, or at least that u is in the space generated by the columns of B.) We first embed the field over which we have to solve the equations in a possibly larger field F with |F| considerably larger than n. We then let D be an $m \times m$ diagonal matrix with the diagonal elements selected at random from $F\backslash\{0\}$, and we let
 
\begin{displaymath}A = B^TD^2B, \qquad w = B^TD^2u.
\end{displaymath} (10)

A solution to Equation 9 is then a solution to Equation 1, and we expect that with high probability a solution to Equation 1 will be a solution to Equation 9. The random choice of D ought to ensure that the rank of A will be the same as the rank of B (this is not always true), and that we will not run into a self-conjugate $w_i$ in the Lanczos algorithm. Experience has shown that this is indeed what happens.
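
For illustration, the random diagonal might be chosen as in the following sketch. The word-sized prime p, the use of the C library rand(), and the function name are conveniences of this example; a real implementation would use a better random number generator and the actual field arithmetic.

\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

/* Fill d2[0..m-1] with D_{ii}^2 for a random diagonal D whose entries are
 * drawn from F \ {0}; here F = GF(p) for a prime p < 2^32 (illustration only,
 * so the bias of rand() is ignored). */
void random_diagonal_squared(uint64_t *d2, size_t m, uint64_t p)
{
    for (size_t i = 0; i < m; i++) {
        uint64_t d = 1 + (uint64_t)rand() % (p - 1);   /* element of F \ {0} */
        d2[i] = d * d % p;                             /* store D_{ii}^2     */
    }
}
\end{verbatim}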

The Lanczos algorithm is not restricted to dealing with sparse matrices, but that is where it is most useful. At iteration i, we need to compute the vector $v_{i+1} = Aw_i$ ($v_i = Aw_{i-1}$ is already known from the previous iteration), the vector inner products $(v_{i+1},v_{i+1})$, $(w_i,v_{i+1})$, and $(v_{i+1},v_i)$, and then form $w_{i+1}$ using Equation 6. The matrix A will in general be dense, even when B is sparse. However, we do not need to form A, since to compute $Aw_i$ we use Equation 10 and compute

\begin{displaymath}B^T(D^2(Bw_i)).
\end{displaymath} (11)
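
For concreteness, one evaluation of $Aw_i$ might be organized as follows, using a row-list representation of B (positions of the non-zero entries and their signs, as discussed further below) and the squared diagonal from the previous sketch. The structure layout and the restriction to a word-sized prime p are assumptions of this sketch, not details of the actual implementation.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* One row of B: the positions of its non-zero entries and their signs;
 * entries other than +1 or -1, which are rare, are not handled here. */
struct row {
    size_t    nnz;    /* number of non-zero entries in this row      */
    uint32_t *pos;    /* column indices of those entries             */
    uint8_t  *neg;    /* neg[k] = 1 if the k-th entry is -1, else +1 */
};

/* Compute y = B^T (D^2 (B x)) over GF(p), per Equation 11.
 * B is m x n, d2[i] = D_{ii}^2, tmp is scratch of length m.
 * Assumes p < 2^31 so single-word arithmetic suffices. */
void apply_A(const struct row *B, size_t m, size_t n,
             const uint64_t *d2, uint64_t p,
             const uint64_t *x, uint64_t *tmp, uint64_t *y)
{
    for (size_t i = 0; i < m; i++) {          /* tmp = B x: about b additions  */
        uint64_t s = 0;
        for (size_t k = 0; k < B[i].nnz; k++) {
            uint64_t t = x[B[i].pos[k]];
            s = (s + (B[i].neg[k] ? p - t : t)) % p;
        }
        tmp[i] = s * d2[i] % p;               /* then scale by D^2             */
    }
    for (size_t j = 0; j < n; j++) y[j] = 0;
    for (size_t i = 0; i < m; i++)            /* y = B^T tmp: about b additions */
        for (size_t k = 0; k < B[i].nnz; k++) {
            size_t j = B[i].pos[k];
            uint64_t t = B[i].neg[k] ? p - tmp[i] : tmp[i];
            y[j] = (y[j] + t) % p;
        }
}
\end{verbatim}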

Suppose that B has b non-zero entries. We will further assume, as is the case for the matrices arising in factorization and discrete logarithm computations, that almost all these b entries are $\pm 1$. Let $c_1$ be the cost of an addition or subtraction in F, and $c_2$ the cost of a multiplication. Then computing $Bw_i$ costs approximately

\begin{displaymath}b\ c_1,\end{displaymath}

multiplying that by $D^2$ costs

\begin{displaymath}n\ c_2,\end{displaymath}

and multiplying $D^2Bw_i$ by $B^T$ costs approximately

\begin{displaymath}b\ c_1,\end{displaymath}

for a total cost of about

\begin{displaymath}2\ b\ c_1 + n\ c_2\end{displaymath}

for each evaluation of $Aw_i$. The computation of each vector inner product costs about $n\,c_1 + n\,c_2$, and so the cost of each iteration is about

\begin{displaymath}2\ b\ c_1 + 4\ n\ c_1 + 5\ n\ c_2,
\end{displaymath} (12)

so that the total cost of obtaining a solution is about

 \begin{displaymath}2\ n\ b\ c_1 + 4\ n^2\ c_1 + 5\ n^2\ c_2.
\end{displaymath} (13)

In this rough accounting we do not charge for the cost of accessing elements of arrays or lists, which will often be the dominant factor, especially when dealing with binary data. On the other hand, on some machines one can perform additions and multiplications in parallel. Therefore one should treat the estimate given by Equation 13 as a very rough approximation.
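
For convenience, the estimate of Equation 13 can be packaged as a small helper; the function name and the use of floating point are conveniences of this sketch, and its output is subject to the caveats just mentioned.

\begin{verbatim}
/* Rough total cost (Equation 13) of solving an n x n system whose matrix B
 * has b non-zero entries, given per-operation costs c1 (add/subtract in F)
 * and c2 (multiply in F). */
double lanczos_cost_estimate(double n, double b, double c1, double c2)
{
    return 2.0 * n * b * c1 + 4.0 * n * n * c1 + 5.0 * n * n * c2;
}
\end{verbatim}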

In practice, B will usually be stored with rows represented by lists of positions where that row has a non-zero entry, and (in the case of non-binary problems) by lists of non-zero coefficients. Thus we need about b pointers and (for non-binary data) about b bits to indicate whether each such coefficient is +1 or -1. (Non-zero coefficients that are not $\pm 1$ are relatively rare and can be treated separately.) The pointers normally have to be of at least $\left\lceil \log_2 n \right\rceil$ bits, but one can reduce that by taking advantage of the fact that most indices of positions where the coefficient is non-zero are very small. In any case, storage is not a problem even for very large systems. In the implementations described below, full 32-bit words were used for all pointers and coefficients. The largest of the systems in Table 1 had $b \lesssim 10^7$, so this did not cause any problems on the machine that was used. In fact, our implementation had separate representations for both B and $B^T$, which is wasteful.

In a typical situation, we expect that the Lanczos method will be applied to a system that was processed first by structured Gaussian elimination, and so it will not be too sparse. As a result, the cost of the vector inner products ought to be dominated by the cost of the matrix-vector products, $2\,n\,b\,c_1$. As we will describe later, memory access times will often be the dominant factor in determining the efficiency of the matrix-vector product.

In discrete logarithm applications, the system given by Equation 9 is usually overdetermined, and so the aim is to find the unique x that solves it. In applications to factoring, though, one is looking for linear dependencies in a set of vectors, and it is necessary to find several such dependencies. This can be stated as the problem of solving the system in Equation 9 with B fixed, but for several vectors u, so that we need $x_1,x_2,\ldots{},x_r$ such that

 \begin{displaymath}Bx_j = u_j,\;\;1 \leq j \leq r.
\end{displaymath} (14)

It is possible to solve all these systems at once, without rerunning the Lanczos algorithm r times. Let $z_j = B^TD^2u_j$. Apply the algorithm as presented above with $w = z_1$. This produces the vectors $w_0,w_1,\ldots{},w_{n-1}$, which are conjugate to each other; i.e. $(w_i,Aw_j) = 0$ for $i \neq j$. Now if

\begin{displaymath}x_k = \sum_{j=0}^{n-1} c_{k,j} w_j,
\end{displaymath} (15)

then

\begin{displaymath}(w_i,Ax_k) = \sum_{j=0}^{n-1} c_{k,j}(w_i,Aw_j) = c_{k,i}(w_i,Aw_i).
\end{displaymath} (16)

On the other hand, since $Ax_k = z_k$, this gives

\begin{displaymath}c_{k,i} = \frac{(w_i,z_k)}{(w_i,Aw_i)}.
\end{displaymath} (17)

Since the terms on the right side above can be computed during the $i^{\rm th}$ iteration of the algorithm, the only substantial extra space that is needed is that for storing the partial sums $\tilde{x}_k$, which at the end of the $i^{\rm th}$ iteration equal

\begin{displaymath}\tilde{x}_k = \sum_{j=0}^i c_{k,j}w_j.
\end{displaymath} (18)
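
A sketch of this per-iteration bookkeeping for Equations 17 and 18 might look as follows. The helper declarations mirror those of the Lanczos sketch above (here with explicit lengths and modulus) and, like the word-sized modulus p, are assumptions of this example.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Assumed helpers, provided elsewhere (see the Lanczos sketch above): */
uint64_t dot_n(const uint64_t *x, const uint64_t *y, size_t n, uint64_t p);
uint64_t inv_mod_p(uint64_t a, uint64_t p);

/* Called once in the i-th iteration, after w_i and (w_i, A w_i) are known.
 * z  holds the r right-hand sides z_k = B^T D^2 u_k (row-major, r x n);
 * xt holds the partial sums of Equation 18 and is updated in place. */
void update_partial_solutions(size_t n, size_t r, uint64_t p,
                              const uint64_t *w_i, uint64_t wAw,
                              const uint64_t *z, uint64_t *xt)
{
    uint64_t inv = inv_mod_p(wAw, p);                       /* 1/(w_i, A w_i)      */
    for (size_t k = 0; k < r; k++) {
        uint64_t c = dot_n(w_i, z + k * n, n, p) * inv % p; /* c_{k,i}, Equation 17 */
        for (size_t j = 0; j < n; j++)                      /* x~_k += c_{k,i} w_i  */
            xt[k * n + j] = (xt[k * n + j] + c * w_i[j]) % p;
    }
}
\end{verbatim}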

Although the above analysis was based on the Lanczos algorithm, the derived complexity bounds also serve as rough approximations for the conjugate gradient (CG) method [9]. The CG and Lanczos algorithms are very similar. The iterations for the two algorithms are slightly different, but the operation count is almost exactly the same.

Both the Lanczos and the CG algorithms were implemented, Lanczos for solving equations modulo a large prime, CG for solving equations modulo 2. Both worked effectively, as will be described below. Both usually required n iterations to solve a system of size n. One unexpected problem was encountered, though. In the standard Lanczos and CG algorithms, it is not necessary that the matrix A in Equation 1 be non-singular. As long as w is in the linear span of the columns of A, the algorithms will converge. In the case of equations over finite fields, however, we observed that a system of less than full rank often gave rise to a self-conjugate vector, and thus to an abort of the algorithm. This phenomenon has not been explored carefully. It was not very important in the cases where these algorithms were tried (computing discrete logarithms), since there the systems of linear equations are overdetermined. This issue might be much more important in the case of factoring algorithms, since in that case one needs to find many linear dependencies modulo 2.

The CG implementation uses auxiliary storage to carry out field operations fast. The field F is chosen to be GF($2^r$), with r = 19, 20, or 21. F is defined as polynomials of degree $\leq r-1$ over GF(2) modulo a fixed irreducible polynomial of degree r. Elements $\alpha \in F$ are represented by a full word, $\overline{\alpha}$, with the digits in the binary representation of $\overline{\alpha}$ corresponding to the coefficients of $\alpha$. This means that $\alpha + \beta$ is represented by the exclusive-or of $\overline{\alpha}$ and $\overline{\beta}$, and this operation is fast. Multiplication is carried out with the aid of an auxiliary array. Some primitive element $\gamma \in F$ is chosen, and then an array w(j), $0 \leq j \leq 2^r - 1$, is generated, with $w(\overline{\alpha})$ for $\alpha \neq 0$ equal to the discrete logarithm of $\alpha$ to base $\gamma$. Thus for $\alpha \neq 0$

\begin{displaymath}\alpha = \gamma^{w(\overline{\alpha})},\;\;0 \leq w(\overline{\alpha})
\leq 2^r - 2.
\end{displaymath} (19)

For $\alpha = 0$, $w(\overline{\alpha}) = w(0) = 2^{r+1}-3$. Another auxiliary array t(j), $0 \leq j \leq 2^{r+2}-6$, is formed, with
\begin{displaymath}t(j) = \overline{\gamma^j},\;\; 0 \leq j \leq 2^{r+1}-4,
\end{displaymath} (20)
\begin{displaymath}t(j) = 0,\;\; 2^{r+1} - 3 \leq j \leq 2^{r+2} - 6.
\end{displaymath} (21)

As a result,

\begin{displaymath}\overline{\alpha \beta} = t(w(\overline{\alpha}) + w(\overline{\beta}))
\end{displaymath} (22)

for all $\alpha,\beta \in F$, and thus each product in F can be computed using one integer addition, which on the machine that was used takes about as much time as an exclusive-or. The total cost of a multiplication in F is still higher than that of an addition, though, because of the extra table lookups. Given the large size of the required tables, which will not fit in cache memory, on many machines it is likely to be faster to implement an algorithm that uses more operations but smaller tables. In practice, the time devoted to multiplication of elements of F is so small that it does not matter much which algorithm one uses.
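
The table construction and the resulting one-addition multiplication can be sketched as follows. The specific degree-19 polynomial and the choice of $\gamma = x$ are assumptions of this example (that polynomial is commonly listed as primitive over GF(2), and the assertion below provides a partial check); the actual implementation may have made different choices.

\begin{verbatim}
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define R    19
#define Q    (1u << R)                           /* |F| = 2^r                    */
#define POLY ((1u << 19) | (1u << 5) | (1u << 2) | (1u << 1) | 1u)
             /* x^19 + x^5 + x^2 + x + 1, assumed primitive over GF(2)           */

static uint32_t *logt;   /* the array w(.): discrete logs to base gamma          */
static uint32_t *expt;   /* the array t(.): powers of gamma, padded with zeros   */

void build_tables(void)
{
    logt = calloc(Q, sizeof *logt);
    expt = calloc(4 * (size_t)Q - 5, sizeof *expt);  /* indices 0 .. 2^{r+2} - 6 */
    assert(logt && expt);

    uint32_t a = 1;                                  /* gamma^0, with gamma = x  */
    for (uint32_t j = 0; j < Q - 1; j++) {
        expt[j] = a;                                 /* t(j) = gamma^j           */
        logt[a] = j;                                 /* w(gamma^j) = j           */
        a <<= 1;                                     /* multiply by x ...        */
        if (a & Q) a ^= POLY;                        /* ... and reduce mod POLY  */
    }
    assert(a == 1);                   /* gamma^(2^r - 1) = 1: partial sanity check */

    for (uint32_t j = Q - 1; j <= 2 * Q - 4; j++)    /* Equation 20, upper range */
        expt[j] = expt[j - (Q - 1)];
    for (uint32_t j = 2 * Q - 3; j <= 4 * Q - 6; j++)
        expt[j] = 0;                                 /* Equation 21              */
    logt[0] = 2 * Q - 3;                             /* w(0) = 2^{r+1} - 3       */
}

/* Addition in F is the exclusive-or of the word representations. */
static inline uint32_t gf_add(uint32_t a, uint32_t b) { return a ^ b; }

/* Multiplication in F: one integer addition and two table lookups (Equation 22);
 * the zero padding in expt makes this correct even when a or b is zero. */
static inline uint32_t gf_mul(uint32_t a, uint32_t b)
{
    return expt[logt[a] + logt[b]];
}
\end{verbatim}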

The implementation of the Lanczos algorithm is less efficient than that of CG. It used the very portable, and reasonably fast, multiprecision code that A. Lenstra and M. Manasse distribute with their factoring package [12]. When the non-zero entry in the matrix was $\pm 1$ or $\pm 2$, addition or subtraction routines were invoked; in data set L, approximately 90% of the non-zero entries of the matrix B were $\pm 1$ or $\pm 2$. In the remaining rare cases, multiprecision multiplication was invoked, even when it would have been more efficient to perform several additions.

The implementation of the Lanczos algorithm solved system L modulo a prime of 191 bits in 44 hours. The CG algorithm solved that system modulo 2 in 1.9 hours. Timings of about 100 iterations of the CG algorithm on system E indicate that this system would require about 190 hours to solve.

While quite large systems can be solved with the CG algorithm, the timings reported above are rather slow. For system L, Equation 13 indicates that the total cost ought to be around

\begin{displaymath}7 \times 10^{9}\, c_1 + 1.8 \times 10^8\, c_2.\end{displaymath}

If each of $c_1$ and $c_2$ were around 1 machine instruction, the total run time would be about 6 minutes. What causes the 1.9 hour run time is the fact that on the computer that was used, $c_1$ and $c_2$ are much more expensive. This is due to problems of memory access, with the cache size too small to contain the full arrays. On different machines the performance would be quite different. Even on the SGI computer that was used, it is quite likely that one could obtain substantial speedups by arranging the data flow so that the caches are utilized more efficiently.

