FirstIndexNext

[bug-gsl] Improving gsl_blas_dsyrk routine (2008-04-12)

From: "王一" <godspeed.wang@.........>
Subject: Improving gsl_blas_dsyrk routine
Date: Sat, 12 Apr 2008 16:27:26 +0800
To: "bug-gsl" <bug-gsl@gnu.org>

Dear GSL developers:
I benifits a lot recently by using your powerful routines. Thus I became interested in the detail of your routines.
When I look through the BLAS source code, I found gsl_blas_dsyrk routine can be significantly improved by a change of looping order.
As I'm just a Ph.D. student of Statistical Genetics, I'm not sure my finding is generally true.
I just wish to help the lovely GSL! Please let me know the result if possible.

The version number of GSL: 1.8
The hardware and operating system: T7100 CPU, Windows XP
The compiler used, including version number and compilation options: Dev-C++/MinGW , full optimization
A description of the bug behavior: I use a real dataset with 6238 rows by 617 columns of double precision. The gsl_blas_dsyrk routine has 1/5 speed than my routine in some cases.

Rows Columns Transpose Mine/sec GSL_BLAS/sec
6238 617 No 26.8 25.2
6238 617 Yes 3.4 15.9
617 6238 No 2.4 2.4
617 6238 Yes 54 256.5


A short program which exercises the bug:
My gsl_blas_dsyrk routine analog is listed below. It is a function of my C++ matrix template class. Hope this will not trouble you.

template <class type>
void matrix<type>::square(matrix& M, bool T){
register uint r, c, k;
register type s;
if(!T){
resize(M.rn, M.rn);
for(r=0; r<M.rn; r++) for(c=r; c<M.rn; c++){
s=type(0);
for(k=0; k<M.cn; k++) s+=M(r, k)*M(c, k);
(*this)(c, r)=(*this)(r, c)=s;
}
}
else{
resize(M.cn, M.cn);
// The follow code changes looping order, just like DGEMM does
for(k=0; k<M.rn; k++) for(r=0; r<M.cn; r++){
s=M(k, r);
for(c=r; c<M.cn; c++) (*this)(r, c)+=s*M(k, c);
}
for(r=0; r<M.cn; r++) for(c=r+1; c<M.cn; c++) (*this)(c, r)=(*this)(r, c);
}
}

And your corresponding codes (similar with the Fortan code) are:
if (uplo == CblasUpper && trans == CblasNoTrans) {
for (i = 0; i < N; i++) {
for (j = i; j < N; j++) {
BASE temp = 0.0;
for (k = 0; k < K; k++) {
temp += A[i * lda + k] * A[j * lda + k];
}
C[i * ldc + j] += alpha * temp;
}
}

} else if (uplo == CblasUpper && trans == CblasTrans) {
// The following code may be improved
for (i = 0; i < N; i++) {
for (j = i; j < N; j++) {
BASE temp = 0.0;
for (k = 0; k < K; k++) {
temp += A[k * lda + i] * A[k * lda + j];
}
C[i * ldc + j] += alpha * temp;
}
}





王一
2008-04-12