It's interesting that Data.Array.Repa is actually faster than hmatrix, which is unexpected since hmatrix is implemented on top of LAPACK. Is this because Repa uses unboxed arrays?

```
import Data.Array.Repa
import Data.Array.Repa.Algorithms.Matrix

main = do
    let a = fromListUnboxed (Z :. 1000 :. 1000 :: DIM2) $ replicate (1000*1000) 1.0 :: Array U DIM2 Double
        b = fromListUnboxed (Z :. 1000 :. 1000 :: DIM2) $ replicate (1000*1000) 1.0 :: Array U DIM2 Double
    m <- a `mmultP` b
    print $ m ! (Z :. 900 :. 900)
```

Running time with 1 core: 7.011s

Running time with 2 cores: 3.975s
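Note that Repa only gives a parallel speedup when the program is built with the threaded runtime and run with multiple capabilities, so the timings above presumably come from a build along these lines (the file name and exact flags are my assumptions, not stated in the post):

```shell
# Sketch of how the Repa benchmark is presumably built and run.
# "repa-mult.hs" is an assumed file name; -O2 and -threaded are the
# usual flags for Repa's parallel combinators such as mmultP.
ghc -O2 -threaded repa-mult.hs -o repa-mult

# Run with 1 and then 2 capabilities (+RTS -N selects the core count):
time ./repa-mult +RTS -N1
time ./repa-mult +RTS -N2
```

Without `-threaded` and `+RTS -N`, `mmultP` still runs, but on a single capability, which would mask any core-count difference.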

```
import Numeric.LinearAlgebra
import Numeric.LinearAlgebra.LAPACK

main = do
    let a = (1000><1000) $ replicate (1000*1000) 1.0
        b = (1000><1000) $ replicate (1000*1000) 1.0
    print $ (a `multiplyR` b) @@> (900,900)
```

Running time: 20.714s

Perhaps you are linking against a non-optimized LAPACK library. On my machine, using libatlas-base, the running time is about 0.4s.
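One way to check which BLAS/LAPACK implementation the binary actually resolves to is `ldd` (the binary name below matches the example that follows; the `update-alternatives` step is Debian/Ubuntu-specific):

```shell
# List the BLAS/LAPACK shared libraries the compiled binary links to.
# Seeing the unoptimized reference BLAS here would explain the ~20s timing.
ldd ./matrixproduct | grep -i -E 'blas|lapack|atlas'

# On Debian/Ubuntu, switch the system-wide BLAS/LAPACK alternative,
# e.g. to the ATLAS build provided by libatlas-base:
sudo update-alternatives --config libblas.so.3
sudo update-alternatives --config liblapack.so.3
```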

$ cat matrixproduct.hs

```
import Numeric.LinearAlgebra

main = do
    let a = (1000><1000) $ replicate (1000*1000) (1 :: Double)
        b = konst 1 (1000,1000)
    print $ a @@> (100,100)
    print $ b @@> (100,100)
    print $ (a <> b) @@> (900,900)
```

$ ghc matrixproduct.hs -O

$ time ./matrixproduct

```
1.0
1.0
1000.0
real 0m0.331s
user 0m0.512s
sys 0m0.016s
```
