A Simple Compressive Sensing Algorithm for Parallel Many-Core Architectures

Alexandre Borghi (1)

Jérôme Darbon (2)

Sylvain Peyronnet (1)

Tony F. Chan (2)

Stanley Osher (2)


(1) Laboratoire de Recherche en Informatique (LRI), Université Paris Sud, France

(2) Department of Mathematics, UCLA, Los Angeles, California, USA


Original Version September 2008; Revised August 5, 2009
The original report is UCLA CAM report 08-64

Abstract: In this paper we consider the l1-compressive sensing problem. We propose an algorithm specifically designed to take advantage of shared-memory, vectorized, parallel and many-core microprocessors such as the Cell processor, new generation Graphics Processing Units (GPUs) and standard vectorized multi-core processors (e.g. quad-core CPUs). Besides, it is easy to implement. We also give evidence of the efficiency of our approach and compare the algorithm on the three platforms, thus exhibiting pros and cons for each of them.

Keywords: compressive sensing, parallel computing, proximal point, many-core architectures, multi-core CPU, GPU, Cell

1  Introduction

An algorithm that relies on standard optimization techniques is proposed for solving compressive sensing problems involving l1 minimization. Its main characteristic is that it is designed to run efficiently on parallel many-core architectures and thus benefits greatly from these technologies.

Let us briefly describe the compressive sensing problem. Given a sensing matrix A ∈ ℝ^(m×n) and some observed data f ∈ ℝ^m, one wishes to find an optimal solution u* ∈ ℝ^n of the following constrained minimization problem:







 
min_u ||u||_1     s.t.     Au = f      (1)

where the number of constraints m is typically much lower than the size of the signal n. This problem has received a lot of attention since it has applications, for instance, in signal/image processing, data compression and, more recently, compressive sensing [, , , , ]. The latter corresponds to identifying matrices that allow exact recovery of the original signal, which is assumed to be sparse, from the observations. Such matrices are for instance proposed in [, , , , ]. This approach has led to several applications such as machine learning [, ], dimension reduction [], face recognition [], image processing [] and medical imaging []. There is a very large literature on algorithms to solve this problem, see [, , , , , , , , , ] for instance. We refer the reader to these for more detailed reviews of available algorithms. We note that many of them rely on the iterative thresholding/shrinkage approach originally presented in [] and [], with further developments and improvements such as [, , , , , , ]. This paper also falls into this category. However, contrary to these previous approaches, we focus on the practical considerations of such an approach (as opposed to purely theoretical ones) and our goal is to design an efficient algorithm that benefits from available computing technologies.

More precisely, we consider a standard Moreau-Yosida regularization as originally proposed in the seminal works of [] and []. The latter yields an iterative algorithm, also known as the proximal point iterations [, , , ], that requires the inversion of a non-linear operator. We show that this inversion can be performed efficiently by a shrinkage approach, by defining an appropriate Moreau-Yosida regularization. The convergence of the method is assured by the theoretical properties of the Moreau-Yosida regularization [, ]. This approach allows for an efficient implementation on parallel, shared-memory and vectorized architectures. The latter are now commonly available and lead to tremendous speedups compared to single-core non-vectorized processors. Finally, we note the work of [], which also performs proximal point iterations for solving the compressive sensing problem, but on a dual formulation.

Note that, in general, efficient implementations of a numerical algorithm on these parallel shared-memory platforms are difficult to develop due to the technical specificities of these architectures. However, our algorithm has been designed with these technical issues in mind, so that it allows a short, easy and efficient implementation on these platforms. Experiments on three different parallel many-core architectures, namely vectorized multi-core processors, Graphics Processing Units (GPUs) and the Cell Broadband Engine Architecture microprocessor (Cell), are presented and demonstrate the effectiveness of the approach.

The structure of the paper is as follows: section 2 describes the main concepts of parallel computing and the essential features of many-core architectures that must be taken into account when designing an algorithm intended to yield an easy and fast implementation on them. Our algorithm is described in section 3. Experiments along with comparisons between several different architectures are presented in section 4. Finally, we draw some conclusions in section 5.

2  Parallel Architectures

We recall that our goal is to design an algorithm that benefits from current parallel architectures. In this section, we present general concepts about parallel computing along with their implementation in the current parallel architectures we consider, before presenting the algorithm designed for them in section 3.

2.1  General parallel computing concepts

In order to design an efficient algorithm on current parallel architectures, we need to understand the main concepts they rely on. We now describe these concepts.
Coarse parallelism. Each platform features several computing units, so tasks can be spread over them in order to improve performance. However, developers must take care of the locks that are likely to arise in this context. A lock is used to prevent two or more computing units from updating the same memory location at the same time. It is well known that using locks to force unique access to data severely deteriorates performance. A good way to avoid locks, and thus to enhance performance, is to schedule operations so that they never interfere with each other. This general issue is known as the synchronization problem and is one of the most important problems regarding performance.
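As a small illustration of this point (this is not code from the paper), the following OpenMP sketch accumulates a dot product with a reduction clause instead of a lock, so that threads never contend for the same memory location; it must be compiled with OpenMP enabled (e.g., -fopenmp with GCC).

/* Illustrative OpenMP sketch (not the authors' code): each thread accumulates
   its own partial sum and the partial sums are combined once at the end,
   which avoids protecting 'sum' with a lock or a critical section. */
float dot_product(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}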

Vectorization. Another level of parallelism, finer than the previous one, consists of processing data as vectors. This technique is known as SIMD (Single Instruction, Multiple Data). Some architectures implement it through so-called vector units. Performance is greatly improved by using these vector units, provided that the data are well aligned (data must be accessed at specific addresses in memory). Note that not all algorithms can benefit from this technique, because they are not intrinsically built on vector primitives. Besides, even when they are, the alignment property still needs to be fulfilled. In other words, this requirement must be taken into account during the design phase of the algorithm. Our algorithm takes advantage of this technique.
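For illustration only (again, not taken from the paper), the following C sketch uses SSE intrinsics on 16-byte aligned buffers; it assumes that n is a multiple of 4 and that x and y were allocated with an aligned allocator such as _mm_malloc.

#include <xmmintrin.h>  /* SSE intrinsics */

/* Illustrative sketch (not the authors' code): y <- a*x + y, 4 floats at a
   time. _mm_load_ps/_mm_store_ps require 16-byte aligned addresses, which is
   why the alignment of the arrays must be guaranteed at allocation time. */
void saxpy_sse(float a, const float *x, float *y, int n)
{
    const __m128 va = _mm_set1_ps(a);
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);
        __m128 vy = _mm_load_ps(y + i);
        _mm_store_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
}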

Memory considerations. Memory limitations, namely bandwidth, latency and cache issues, are the main constraints of the considered architectures. Recall that these platforms have two kinds of memory: cache memory, which is typically structured in several levels, and standard RAM. The goal of cache memory is to reduce the time needed to access data in RAM. Cache memory is an order of magnitude faster than RAM but also much smaller in terms of storage capacity. To perform computations on large inputs, parts of the data must be transferred back and forth between cache memory and RAM. These memory transfers can become the bottleneck of the method due to the limited available bandwidth and the latency. To achieve this memory management efficiently, implementations have to take into account the structure of the cache, i.e., the cache hierarchy. Indeed, an algorithm and its implementation should help the processor maintain this structure and facilitate the prediction of data moves. Since several cores might require simultaneous memory accesses, performance can be limited by the available bandwidth. To cope with this problem, data transfers need to be scheduled carefully. If these memory issues are not taken into account, computing units may end up waiting for the data to be processed. This is called starvation, since a processor is not computing. To avoid this behavior, the frequency and the size of data transfers must be tuned so that no computing unit is left without data.
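As a minimal illustration of these cache considerations (again, not the paper's code), the matrix/vector product below visits the input vector in chunks, so that the active chunk stays in cache while it is combined with every row of the matrix; the chunk size jb would have to be tuned to the cache sizes of the target processor.

#include <stddef.h>

/* Illustrative sketch (not the authors' code): y = A*x with A stored row-major
   (m rows, n columns). x is visited in blocks of jb entries, so each block is
   reused across all m rows while it is resident in cache. */
void matvec_blocked(const float *A, const float *x, float *y,
                    int m, int n, int jb)
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0f;
    for (int j0 = 0; j0 < n; j0 += jb) {
        int j1 = (j0 + jb < n) ? j0 + jb : n;
        for (int i = 0; i < m; i++) {
            float acc = y[i];
            for (int j = j0; j < j1; j++)
                acc += A[(size_t)i * n + j] * x[j];
            y[i] = acc;
        }
    }
}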

2.2  Parallel architectures

We now explain how current parallel architectures implement the concepts just described. We consider three different architectures: standard vectorized multi-core CPUs, modern GPUs and the Cell Broadband Engine Architecture (CBEA, or Cell for short). These three platforms share common features. First, they are all based on shared memory, i.e., a central memory accessible from all computing units. Second, they are parallel, i.e., composed of several computing units working together. However, these architectures do not implement parallelism in the same way. This leads to very different performance, both in terms of computational power and memory transfers, which are two key aspects of global performance.

We now describe more precisely each of these architectures in view of the above considerations.
Multi-core CPU. A standard multi-core CPU is a processor made of several cores. Each core is superscalar, out-of-order and includes vector computing units (such as Altivec [] or Intel Streaming SIMD Extensions). From a parallel computing point of view, the main differences between current commercial multi-core CPUs are the core interconnections, the cache hierarchy and how they access RAM.

The Intel Core 2 has its cores clustered by 2. Each core has its own very fast level 1 (L1) cache and each cluster features a shared L2 cache. Clusters are linked together through a Front-Side Bus (FSB) to the motherboard northbridge, which integrates the memory controller used to access RAM. The Core 2 Quad features 2 clusters of 2 cores. Because the 2 clusters are connected neither through an internal crossbar nor through a shared cache, cache coherency is maintained through the motherboard northbridge. This implies slower communications between cores of separate clusters than between cores of the same cluster and, more generally, negative repercussions on performance in a parallel context.

On the contrary, the Intel Core i7 has a unique cluster for all of its cores and an L3 cache shared among them, in addition to their own L2. Thanks to this shared L3 cache, better cache use can be achieved when several threads work on the same data. The FSB is replaced by QuickPath Interconnect (QPI), a point-to-point interconnection, and the memory controller is now on-die. This allows both reduced latency and higher bandwidth. Moreover, the Core i7 has 2 important features from a parallel computing point of view: Turbo Boost (TB) and Hyper-Threading (HT). The first one consists of a dynamic increase of the frequency of active cores when some cores are inactive. When enabled, this feature must be taken into account as it acts as a bias in parallel results. The second one allows the OS to see 2 logical cores for each physical core, which can yield better use of the computational units: as the 2 threads have independent instructions, the processor pipeline can be filled more efficiently and the out-of-order engine can achieve better performance. However, handling 2 threads per core can increase the pressure on the cache (i.e., more cache misses per core) and then on RAM.

Core 2 and Core i7 are shown in figure 1 in their 4-core configurations. Current high-end quad-core processors achieve a theoretical peak performance of 96 GFLOPS in single precision at 3GHz.


Figure 1: Architectural schemes of Intel multi-core CPUs (left: Core 2 Quad; right: Core i7). See text for details.

Graphics Processing Unit (GPU). A modern GPU (such as current NVIDIA GPUs) consists of a large set of so-called stream processors that share a common memory []. These homogeneous processors were originally designed for 3D graphics computations and thus used to have poor scientific computation units in terms of precision.

They now have the capabilities (in terms of precision) to achieve good performance in scientific computing. This use is referred to as General-Purpose computing on GPU (GPGPU), see []. Note that a GPU is hosted by a standard CPU-based computer that embeds some RAM.

NVIDIA GPU architectures are depicted in figure 2. Contrary to CPU cores, GPU cores are scalar. They still use several layers of cache memory, but in a very different way than CPUs. Cores are grouped into clusters with a shared L1 cache. Clusters are subdivided into groups of 8 cores which share a local memory. These 8 cores execute the same instruction on multiple data. Therefore, vectorization is achieved using several scalar cores. The number of cores per cluster varies across models: clusters of 16 cores for the G80 architecture and clusters of 24 cores for the G200, which is an optimized version of the G80.

Thanks to a dedicated bus, GPU cores have access to a very fast embedded memory of hundreds of MB. It allows an efficient use of the highly parallel and bandwidth-demanding units of GPUs. RAM can also be accessed through PCI Express. However, transfers between the GPU embedded memory and RAM are very time consuming due to the limited PCI Express bandwidth. Therefore, they must be avoided as much as possible.

Thanks to their radically different architectures, GPUs have a theoretical peak performance an order of magnitude higher than current high-end CPUs. For instance an NVIDIA 8800 GTS (G80 architecture) with 96 stream processors at about 1.2GHz achieves a computational power of 345 GFLOPS in single precision and an NVIDIA GTX 275 (G200 architecture) with 240 stream processors at about 1.4GHz achieves 1010 GFLOPS, compared to the 96 GFLOPS of Intel quad-core processors at 3GHz.

Since GPUs were originally used for graphics purposes, they could only be accessed through dedicated graphics libraries. In order to implement numerical algorithms, NVIDIA has developed CUDA (Compute Unified Device Architecture), which is composed of a compiler and a set of tools organized as a library to help programmers. CUDA is currently the state of the art for General-Purpose computing on GPU. The last step towards generalized GPGPU adoption is the release of GPU clusters specifically designed for scientific computing.


Figure 2: Architectural schemes of NVIDIA GPUs (left: G80; right: G200). See text for details.

Cell Broadband Engine Architecture (CBEA). The Cell Broadband Engine Architecture is a new architecture developed by Sony, Toshiba and IBM []. CBEA is designed for high computational throughput and bandwidth along with power efficiency. The general architecture is depicted in figure 3.

The Cell processor consists of an in-order 64-bit PowerPC core, also called the Power Processing Element (PPE), that controls several vector computing cores called Synergistic Processing Elements (SPEs). Current implementations feature up to 8 SPEs. An SPE is an in-order core made of a 128-bit SIMD unit, i.e., a vector unit, and 256KB of local memory. These elements are linked together through a high-bandwidth ring bus, denoted EIB for Element Interconnect Bus.

Since only 256KB of local memory are available per SPE, parts of the data unavoidably have to be transferred back and forth through the EIB. However, thanks to the high performance of this bus, some usually memory-bound operations, such as matrix/vector multiplication, do not suffer from this limitation on Cell architectures and can therefore be implemented efficiently in parallel []. Because no scalar unit is present in SPE cores, it is very important to fully use vectorization in order to achieve the best performance.

The well-known Sony Playstation 3 (PS3) features a 3.2GHz Cell with 7 SPEs and 256MB of RAM. However, only 6 SPEs and 192MB of RAM are available for scientific computing. A single SPE at 3.2GHz has a peak performance of 25.6 GFLOPS in single precision. A PS3 Cell can therefore achieve a peak performance of about 150 GFLOPS thanks to its SPEs, i.e., about 150% of the theoretical peak performance of a high-end 3GHz quad-core from Intel. However, this performance increase comes at the cost of ease of programming, as described in [, ]. Many efforts are currently underway to make it easier to exploit the Cell computing power, see [] for instance.


Figure 3: Architectural scheme of a Cell Broadband Engine Architecture. See text for details.

After these technical considerations about the theoretical performance of these architectures, we propose in the next section a simple algorithm that benefits from all these features.

3  A Proximal based Algorithm

In this section, we describe our algorithm, which allows an efficient implementation on any vectorized parallel many-core architecture. Our approach consists of using a proximal operator on a penalized version of the original problem (1), such that the minimization scheme reduces to a series of matrix/vector multiplications followed by separable minimizations (i.e., independent point-wise optimizations). This overall process is embedded into a dyadic continuation scheme that speeds up the approach.

3.1  Moreau-Yosida Regularization

Instead of minimizing the original problem (1), we wish to minimize, for u ∈ ℝ^n, the following energy:

Eµ(u) = ||u||_1 + (µ/2) ||Au − f||_2^2 ,     (2)

where µ is a non-negative parameter whose role is to enforce the constraint Au = f. Note that the latter energy corresponds to a penalization method as described by []. It is worth noting that a solution of (2) is a solution of problem (1) only in the limit µ → ∞. However, for practical applications one only needs to compute a solution for a finite µ, since the measurements are corrupted by some noise, or because only a precise-enough solution is desired. Let us denote by u_µ a minimizer of Eµ.
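As a small sketch (not the authors' code), the energy (2) can be evaluated as follows, assuming the product Au has already been computed by a matrix/vector multiplication or a fast transform:

#include <math.h>

/* Minimal sketch (not the authors' code): evaluate the penalized energy (2),
   E_mu(u) = ||u||_1 + (mu/2) * ||Au - f||_2^2,
   given u (length n), the precomputed product Au (length m) and f (length m). */
float energy_mu(const float *u, const float *Au, const float *f,
                int n, int m, float mu)
{
    float l1 = 0.0f, res = 0.0f;
    for (int i = 0; i < n; i++)
        l1 += fabsf(u[i]);
    for (int i = 0; i < m; i++) {
        float d = Au[i] - f[i];
        res += d * d;
    }
    return l1 + 0.5f * mu * res;
}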

Let us first introduce some notation. Given some non-negative integer N, the usual Euclidean scalar product in ℝ^N is denoted by ⟨·,·⟩ and its associated norm by ||·||_2. Assume B is a symmetric positive definite linear operator. We set ⟨·,·⟩_B = ⟨B·,·⟩ and we write, for any x ∈ ℝ^N, ||x||_B^2 = ⟨x, x⟩_B. Let us introduce the Moreau-Yosida regularization Fµ of Eµ associated with the metric M as follows [, ]: at any point u(k) ∈ ℝ^n,

Fµ(u(k)) = inf_{u ∈ ℝ^n} { Eµ(u) + (1/2) || u − u(k) ||_M^2 } .     (3)

In other words, it corresponds to inf-convolving the function to minimize with the metric M. Note that the functional inside the infimum in (3) is strictly convex, since it is the sum of a convex and a strictly convex function, and thus it has a unique minimizing point. Besides, it is not difficult to show that the infimum in (3) is attained and thus can be replaced by a minimum. Following [], this unique minimizer is also called the proximal point of u(k) with respect to Eµ and M and is denoted by pµ(u(k)).

Given a function E: ℝ^n → ℝ, let us denote by ∂E the sub-differential of E defined by w ∈ ∂E(u) ⇔ E(v) ≥ E(u) + ⟨w, v − u⟩ for every v (see [, ] for instance). This proximal point pµ(u(k)) is characterized by the following Euler-Lagrange equation:

0 ∈ ∂( Eµ + (1/2) || · − u(k) ||_M^2 ) ( pµ(u(k)) ) .

The latter is equivalent to:

0 ∈ ∂ Eµ(pµ(u(k))) + M(pµ(u(k)) − u(k))  .

And thus we have:

M u(k) ∈ ( M + ∂Eµ ) ( pµ(u(k)) ) .

Since pµ(u(k)) is uniquely defined, we get that:

pµ( u(k) ) = ( M + ∂Eµ )^(−1) ( M u(k) ) .     (4)

The idea of the Moreau-Yosida regularization approach to minimize Eµ is to iterate the update formula (4) until convergence. This is also known as the proximal point iterations.

The above approach yields the following generic algorithm for minimizing Eµ (for µ fixed).

Generic proximal point minimization algorithm

  1. Set k=0 and u(0) = 0
  2. Compute u(k+1) = pµ(u(k)) = ( M + ∂Eµ )^(−1) ( M u(k) )
  3. If "Not converged" then set k ← k+1 and go to 2, else return u(k+1)

The proof of convergence of this algorithm is given in Appendix A.

3.2   Proximal operator choices

So far, only the generic version of the approach has been described. Indeed, as we have seen, the Moreau-Yosida regularization with any metric M yields an iterative algorithm that converges toward a solution of our problem. The versatility of the approach gives us some freedom in the choice of the metric M. One may wish to consider metrics that yield a good speed of convergence, i.e., few iterations [, ]. However, recall that our goal is to design an approach that is easy to implement on parallel multi-core architectures; this goal should therefore guide the design of M.

The most tedious part of the optimization problem (4) is that we need to invert (M + ∂Eµ) (applied to M u(k)). The optimality condition for the solution u(k+1) reads as follows:

s(u(k+1)) + µ A^t ( A u(k+1) − f ) + M ( u(k+1) − u(k) ) = 0 ,

where s(u(k+1)) is a sub-gradient of || · ||1 at u(k+1). In a more concise form, from an optimization point of view, we have:

s(u(k+1)) + ( µ A^t A + M ) u(k+1) = µ A^t f + M u(k) .

This operation becomes algorithmically easy when it is separable, i.e., when the optimization can be carried out independently dimension by dimension. For our problem, separability means that (µ A^t A + M) is a diagonal matrix. Besides, this separability property also leads to an implementation that enjoys the properties required in section 2. Indeed, since the variables are decoupled, coarse parallelism is straightforward. The variables are easily arranged in a well-aligned array that allows vectorized computations. This approach is also wise for cache considerations, since the prediction of the next data to process is straightforward.

In order to get a separable optimization problem, M should cancel the off-diagonal elements of µ A^t A. Besides, to ensure the global convergence of the approach, M should be positive semi-definite. Recall that good compressive sensing matrices correspond to submatrices of matrices that satisfy the Restricted Isometry Property (all eigenvalues belong to [1−ν, 1+ν] for some small ν > 0). In this paper, we assume that ν = 0. Examples of such matrices are obtained by considering the Discrete Fourier or Discrete Cosine Transforms, or orthogonalized Gaussian matrices. Thus, the eigenvalues of A^t A are either 0 or 1, and since A^t A is always positive semi-definite, we can define M as follows:

M = (1+ε) µ Id − µ A^t A ,

where ε is a small positive real number chosen to make M positive definite. Using this M, we need to solve the following problem: find u(k+1) that satisfies:

s(u(k+1)) + (1+ε) µ u(k+1) = µ A^t f + M u(k) .
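Note that this choice of M indeed makes the problem separable, since

µ A^t A + M = µ A^t A + (1+ε) µ Id − µ A^t A = (1+ε) µ Id ,

so that the left-hand side above decouples component by component.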

This problem is well known to be solvable by a shrinkage approach, which has been used and described in many papers, such as [, , , , , , , , ] for instance. We have that:

u(k+1) = 1/((1+ε)µ) · { ( µ A^t f + M u(k) ) − sign( µ A^t f + M u(k) )    if | µ A^t f + M u(k) | > 1 ,
                        0                                                  otherwise ,     (5)

where sign(x) = x/|x| if x ≠ 0 and 0 otherwise, and where the update is applied component-wise. As one can see, the update of the solution only involves matrix/vector multiplications and some standard scalar operations that can be implemented using vectorized instructions. Also note that one wishes to set ε as small as possible to obtain faster convergence. Setting ε to 0 empirically leads to convergence, although the proof presented here does not hold in that case.
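Since each component is treated independently, the update (5) maps directly onto the parallel and vectorized primitives of section 2. A minimal C sketch (not the authors' code) is given below, where the array v is assumed to hold the vector µ A^t f + M u(k):

/* Minimal sketch (not the authors' code) of the point-wise shrinkage (5).
   v[i] holds (mu*A^t*f + M*u^(k))_i; the new iterate is written to u_next.
   The loop is embarrassingly parallel and vectorizes easily. */
void shrink_update(const float *v, float *u_next, int n, float mu, float eps)
{
    const float scale = 1.0f / ((1.0f + eps) * mu);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float x = v[i];
        if (x > 1.0f)
            u_next[i] = scale * (x - 1.0f);   /* x - sign(x), x > 1  */
        else if (x < -1.0f)
            u_next[i] = scale * (x + 1.0f);   /* x - sign(x), x < -1 */
        else
            u_next[i] = 0.0f;                 /* |x| <= 1            */
    }
}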

3.3  Our algorithm

We now fully describe our algorithm. So far, we have considered the optimization of Eµ when µ is set to some arbitrary positive value. Recall that one wishes to set µ to a large value in order to enforce the constraint Au = f. However, when µ is large, the procedure generates a series of signals u(k+1) that converges slowly to the solution. One standard way to speed up the process is to use a continuation approach, which consists of solving (approximately or not) for a series of increasing values of µ. This kind of approach has been successfully used for instance in [, , , ].

Recall that we choose the constant null signal as the initial guess. It turns out that if µ is too small, Eµ does not decrease. We thus want a µ large enough so that Eµ decreases. Besides, recall that, for performance considerations, we would like the smallest such µ. This can be achieved thanks to a dichotomic-like approach. We assume that we have a lower bound on µ such that µ > µmin > 0. The following procedure computes, up to a precision ebitonic, the smallest µ such that the update yields an energy decrease:


Bitonic Search for initial µ
Input: f, A, µmin

  1. Set µmax = 0
  2. While µmax = 0
    (a) Compute û using formula (5) for µ = µmin, using the null signal as the previous solution
    (b) If Eµmin(û) < Eµmin(0) then set µmax = µmin and µmin ← µmin/2
    (c) Else set µmin ← 2 µmin
  3. While |µmax − µmin| > ebitonic
    (a) Set µ = (µmin + µmax)/2 and compute û using formula (5) for µ, using the null signal as the previous solution
    (b) If Eµ(û) < Eµ(0) then set µmax = µ
    (c) Else set µmin = µ
  4. Return µmax

This procedure first looks for µmax, an upper bound on µ, during the iterations of 2−(a) to 2−(c). Then a standard bitonic search is performed through iterations 3−(a) to 3−(c) .
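The following C sketch (not the authors' code) mirrors the procedure above. The helpers prox_step and energy are hypothetical: prox_step(µ, u_prev, u_out) is assumed to perform one update with formula (5) starting from u_prev, energy(µ, u) to evaluate Eµ(u) as in (2), and u0 is the null signal.

/* Sketch of the bitonic search for the initial mu (not the authors' code).
   prox_step and energy are hypothetical helpers described above; u_hat is a
   work buffer receiving the trial update. */
float find_initial_mu(float mu_min, float e_bitonic,
                      void (*prox_step)(float, const float *, float *),
                      float (*energy)(float, const float *),
                      const float *u0, float *u_hat)
{
    float mu_max = 0.0f;

    /* Step 2: double mu_min until one update decreases the energy; the first
       value for which it does becomes the upper bound mu_max. */
    while (mu_max == 0.0f) {
        prox_step(mu_min, u0, u_hat);
        if (energy(mu_min, u_hat) < energy(mu_min, u0)) {
            mu_max = mu_min;
            mu_min = 0.5f * mu_min;
        } else {
            mu_min *= 2.0f;
        }
    }

    /* Step 3: bisection on [mu_min, mu_max] up to the precision e_bitonic. */
    while (mu_max - mu_min > e_bitonic) {
        float mu = 0.5f * (mu_min + mu_max);
        prox_step(mu, u0, u_hat);
        if (energy(mu, u_hat) < energy(mu, u0))
            mu_max = mu;
        else
            mu_min = mu;
    }
    return mu_max;
}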

Once a good initial µ is known, we embed the proximal iterations into a continuation process, i.e., µ successively takes the values µ, 2µ, …, 2^lmax µ, where lmax is a strictly positive integer. Note that coefficients other than 2 could be used, but experiments have shown that this value gives good results. We also need to give the stopping criteria of our approach. These criteria are used for any optimization of Eµ when µ is fixed. The first criterion concerns the closeness of the reconstructed solution to the observed values: the process is stopped when the relative residual of the reconstructed solution u(k+1), measured as ||Au(k+1) − f||_2/||f||_2, falls below some prescribed tolerance etol. Since the proximal iterations converge toward the solution but not necessarily in finite time, we also stop the process when the energy decrease between two consecutive solutions is below some prescribed value denoted by econsec, i.e., we stop as soon as (Eµ(u(k)) − Eµ(u(k+1))) < econsec. When one of these stopping criteria is met, the value of µ is updated and we start a new descent minimization of Eµ using the current solution as the initial guess.


Minimization Algorithm
Input: f, A and µmin

  1. Set k = 0 and u(0) = 0
  2. Compute the initial value of µ using the above bitonic approach
  3. for l = 1 to lmax (number of continuation values for µ)
    1. Do
      1. Set k ← k+1
      2. Compute u(k+1) using formula (5)
    2. While ( Eµ(u(k)) − Eµ(u(k+1)) > econsec ) and ( ||Au(k+1) − f||_2/||f||_2 > etol )
    3. µ ← 2µ
  4. Return u(k+1)
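For completeness, the continuation loop itself can be sketched as follows (again, not the authors' code); prox_step and energy are the same hypothetical helpers as before, and residual(u) is assumed to return ||Au − f||_2/||f||_2.

/* Sketch of the minimization algorithm (not the authors' code): proximal
   iterations embedded in a dyadic continuation on mu. u holds the current
   iterate, u_next is a work buffer of the same length n. */
void minimize(float *u, float *u_next, int n, float mu,
              int l_max, float e_consec, float e_tol,
              void (*prox_step)(float, const float *, float *),
              float (*energy)(float, const float *),
              float (*residual)(const float *))
{
    for (int l = 1; l <= l_max; l++) {
        for (;;) {
            prox_step(mu, u, u_next);                      /* formula (5)       */
            float decrease = energy(mu, u) - energy(mu, u_next);
            for (int i = 0; i < n; i++)                    /* accept the iterate */
                u[i] = u_next[i];
            if (decrease <= e_consec || residual(u) <= e_tol)
                break;                                     /* stopping criteria  */
        }
        mu *= 2.0f;                                        /* continuation step  */
    }
}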

4  Experimental results

In this section, we briefly present the general implementation choices we have made, as well as more specific ones that depend on the architectures we have considered. Then we give experimental results that assess the efficiency of our method running on these parallel many-core architectures.

4.1  Implementation details

The most time-consuming operations in our method are those that involve the sampling matrix A. These operations can be performed in two different ways depending on the nature of A. Either A is described explicitly, i.e., all the coefficients of the matrix are stored in memory, and one needs to perform a standard matrix/vector multiplication; or A can be represented implicitly with the help of a transform. A typical example of the latter case is when A is a sub-matrix of the Discrete Fourier Transform (DFT) or of the Discrete Cosine Transform (DCT). Both representations allow operations to be done in parallel, i.e., they are very good candidates for both coarse- and fine-grained parallel implementations.

Storing matrices explicitly allows for flexibility in the design of the matrix. Unfortunately, storing such a matrix is memory consuming and, since a matrix/vector multiplication needs O(n × m) operations where n × m is the size of the matrix, it is also time consuming. On the contrary, the use of the implicit form is memory-efficient. Indeed, it is then no longer necessary to store A (and its transpose A^t). Besides, it generally allows for faster computations thanks to available fast transforms. According to the literature (see [] for instance), the time complexity for computing an FFT is O(n log n). However, the price to pay is that it drastically reduces the set of possible sampling matrices.

In this paper, we consider both kinds of matrices. For the explicit case, we consider orthogonalized Gaussian matrices, whose elements are drawn from i.i.d. normal distributions N(0,1) and whose rows are then orthogonalized. The second class of matrices we have used corresponds to partial DCTs. They are generated by randomly choosing, with uniform sampling, m rows from the full Discrete Cosine Transform matrix. Please note that operations involving orthogonalized Gaussian matrices or partial DCTs comply with the requirements presented in section 2.

Due to the large amount of memory needed, only the CPU platform can run a code that uses large orthogonalized Gaussian matrices. On the contrary, partial DCTs can be used on any of the architectures we are interested in. To allow a fair comparison, only the latter was implemented on all three platforms. The partial DCT has been implemented through a complex-to-complex Fast Fourier Transform (FFT) with an additional O(n) pre- and post-processing step for converting the results to real numbers. Indeed, direct DCT implementations currently suffer from performance issues and cannot be used for our purpose. Using this complex-to-complex FFT means an overhead of 4 times the memory required by directly applying a DCT, but allows good time performance to be maintained.
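To give an idea of how the implicit operator can be applied, the sketch below uses FFTW's real-to-real interface (FFTW_REDFT10, a DCT-II) to compute Au by keeping m randomly chosen coefficients of the full DCT of u. This is an illustration only: as explained above, our implementations actually go through a complex-to-complex FFT, and normalization constants as well as plan reuse are omitted here for brevity.

#include <fftw3.h>   /* link with -lfftw3f */

/* Illustration only: apply the partial DCT operator A (keep m of the n DCT
   coefficients of u) with FFTW's built-in DCT-II. Our actual implementations
   use a complex-to-complex FFT instead (see above); normalization factors are
   omitted and, in practice, the plan would be created once and reused. */
void apply_partial_dct(const float *u, float *Au,
                       const int *rows, int m, int n)
{
    float *in  = (float *) fftwf_malloc(sizeof(float) * n);
    float *out = (float *) fftwf_malloc(sizeof(float) * n);
    fftwf_plan plan = fftwf_plan_r2r_1d(n, in, out, FFTW_REDFT10, FFTW_ESTIMATE);

    for (int i = 0; i < n; i++)
        in[i] = u[i];
    fftwf_execute(plan);                 /* full (unnormalized) DCT of u */
    for (int i = 0; i < m; i++)
        Au[i] = out[rows[i]];            /* keep only the sampled rows  */

    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);
}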

All implementations use single precision floating point arithmetic because of GPU and Cell limitations. It is important to note that these limitations are currently being partly addressed by the developers of these platforms.

We now present architecture-dependent implementation details.
CPU implementation details. The code running on the CPU is parallelized using the OpenMP API (Open Multi-Processing Application Programming Interface); we refer the reader to [] for further details. Our implementation is vectorized using SSE instructions and is also tuned to ensure the most efficient use of the cache. Recall that the thresholding approach (see section 3) has been chosen for its separability property (i.e., each element can be processed independently); therefore parallel and vector computing are profitable for obtaining an efficient implementation. FFT computations are performed with FFTW 3.1 (Fastest Fourier Transform in the West); we refer the reader to [] for further details. Finally, the code is compiled using the Intel C compiler version 10.1.
Cell implementation details. Our Cell implementation uses the Cell SDK 3.0 (Software Development Kit) provided by IBM and the FFTW 3.2 alpha 3 library. To the best of our knowledge, FFTW is the current state-of-the-art FFT implementation in terms of supported input sizes (for instance, FFTC [] is the fastest Fourier transform, but only for up to 16K complex input samples) and it still has good time performance. This alpha version of the FFTW library requires exclusive access to the SPEs it uses (i.e., SPEs cannot run the FFT and other processes together). Therefore, since FFT computations are the most time-consuming tasks of our method, we chose to assign all SPEs to FFT computations. FFT/DCT conversion, thresholding and energy computations are done on the PPE (which is not intended for performance). In order to balance the cost of using the PPE for these tasks, vectorized Altivec [] code has been written to improve the performance of the FFT/DCT conversions, which are the second most costly computations of our method. Nevertheless, our Cell implementation can still be optimized, as a lot of computations could be done on the SPEs instead of on the PPE (SPEs are at least one order of magnitude faster than the PPE). However, this can only be done using a dynamic load balancing process, which would be time consuming and difficult to implement.
GPU implementation details. The GPU implementation is based on CUDA 2.0 (Compute Unified Device Architecture)1 and the CUFFT library (part of the CUDA package). Since the PCI Express bus suffers from severe bandwidth limitations, we designed our implementation so that it uses very few data transfers between central RAM and the GPU embedded memory. Note that this is a very common problem in General-Purpose computing on GPUs. However, when transfers are absolutely needed and as explained in [], their sizes must be carefully chosen. Modern GPUs feature a highly hierarchical set of computational units, memories and caches. The number of threads must be maximized in order to achieve the best performance, but attention must be paid to avoiding bank conflicts, i.e., multiple threads accessing the same shared memory bank at the same time. For example, [] implements a memory manager to meet this goal.

4.2  Experiments

In this section, we first describe the experimental conditions we have considered for our experiments. Then, we present numerical results that show the effectiveness of our approach.

4.2.1  Experimental conditions

In order to prove the effectiveness of our approach and to compare the benefits and disadvantages of the different architectures, we use an experimental platform that consists of the following:

  • Two standard Intel quad-core processors. The first one is an Intel Core 2 Quad Q6600 at 2.4GHz with 8MB of L2 cache and 4GB of RAM. Theoretically, this processor can achieve a peak performance of 38 GFLOPS. For our specific case where only single precision computations are used, this means that we can hope for a peak performance of 76 GFLOPS in single precision. The second processor is an Intel Core i7 920 at 2.66GHz with 1MB of L2 cache, 8MB of L3 cache and 6GB of RAM. This processor can achieve a theoretical peak performance of 43 GFLOPS in double precision and 86 GFLOPS in single precision. The Turbo Boost feature has been systematically disabled to avoid bias in our results.
  • A popular implementation of the Cell Broadband Engine Architecture, the one found in the Sony Playstation 3. The machine consists of a 3.2GHz Cell processor with 6 usable SPEs and 192MB of RAM.
  • Two commercial video cards produced by NVIDIA, namely the NVIDIA GEFORCE 8800 GTS and the NVIDIA GEFORCE GTX 275. The first one has 96 cores and embeds 640MB of memory while the second has 240 cores and 896MB of memory.

We now give the values of the parameters we have used for our experiments. We set m = n/8 (recall that m is assumed to be much smaller than n). The number of non-zero values in the original sparse signal is set to k = m/10. Recall that there are two stopping criteria: the tolerance error etol with associated stopping criterion ||Au(k+1) − f||_2/||f||_2 < etol, and the consecutive variation tolerance econsec associated with (Eµ(u(k)) − Eµ(u(k+1))) < econsec. We set etol = 10^−5 and econsec = 10^−3. Concerning the bitonic search, we set µmin = 2 and ebitonic = 0.5. The number of continuation steps, lmax, performed after the bitonic step is set to lmax = 10.

4.2.2  Parallel speedup and caches issues

We now present preliminary experiments about parallel speedup in order to highlight particular issues arising with matrix/vector multiplication, which is a key operation in our method. Square and rectangular single precision matrices are both used to run our benchmarks on the Intel Core i7 processor (with Turbo Boost disabled). The results are averaged over 10 instances based on orthogonalized Gaussian matrices and randomly generated vectors. Note that these results are about parallel performance rather than absolute performance; therefore better results mean better scaling, but not necessarily better wall clock time.


Figure 4: Parallel speedup with respect to the number of threads for matrix/vector multiplications (Intel Core i7 920).

A super-linear speedup is achieved for matrix sizes up to 512 × 512 and 256 × 1024, i.e., 512KB matrices. The shared L3 cache is used very efficiently and each core can take full advantage of its own very fast 256KB L2 cache.

A cut-off effect appears for 2048 × 2048 and 1024 × 4096 matrices, i.e., when the matrices use 8MB of memory, which is the size of the L3 cache. The data used by the program no longer fit into the L3 cache, which compromises L3 reuse and implies a huge performance penalty.

The use of Hyper-Threading can have positive or negative effects, as explained in section 2. HT allows the computing units inside a core to be used more efficiently. It can yield a super-linear parallel speedup, which can be seen mainly before reaching 8MB matrices. However, handling 2 threads per core can increase the pressure on the cache; this phenomenon can be seen after reaching 8MB matrices. The graphs show a balance between these two effects.

4.2.3  Results

We now present the experimental results of our method. Figure 5 represents the variation of the errors with respect to time. We chose a representative example to illustrate the different steps of the optimization process. We consider two error criteria: etol (defined above) and the relative error of the reconstructed signal u(k+1) to the ground truth, defined as relative = ||u(k+1) − u*||_2/||u*||_2.


Figure 5: Errors with respect to time for a representative example of orthogonalized Gaussian matrix of size 2048 × 16384.

We can see the bitonic search at the very beginning of the process. When an appropriate initial µ is found, the continuation process is launched. Each step of the continuation appears on the curves as a huge decrease of both error criteria. We can also observe that the ground truth error decreases almost linearly until 400 iterations, after which the decrease is far slower. On the contrary, the relative error does not show this first fast decrease. This graph shows that our method could be stopped earlier using tighter parameters. This is a very common characteristic of this kind of optimization method, such as those previously presented in section 1. The downside is that it makes comparisons between methods very difficult because of the multiple biases involved. In fact, each method has its own set of parameters that do not necessarily have equivalents. In particular, our method has the advantage of not relying on a dt parameter representing the optimization step, unlike methods based on linearization.

Our other results are presented as tables in which the final values of etol (tolerance) and of the relative error of the reconstructed signal to the ground truth (relative) are given. The tables also present the number of iterations (#iter) performed to obtain the final solution and the wall clock time, in seconds, taken by the whole process.
Compressive sensing with orthogonalized Gaussian matrices. The first experiment (tables 1 and 2) uses orthogonalized Gaussian matrices and standard Intel quad-core processors. The tables give the time spent on the computation with respect to the number of threads used by the program.


  m      n       tolerance   relative    #iter    1 thread   2 threads   4 threads
  64     512     1.24e-03    1.40e-03    1163.2   0.076      0.044       0.029
  128    1024    4.81e-04    5.22e-04    1155.2   0.238      0.128       0.072
  256    2048    2.99e-04    3.26e-04    1326.4   1.144      0.562       0.293
  512    4096    1.93e-04    2.15e-04    1461.7   5.659      3.584       3.386
  1024   8192    1.29e-04    1.43e-04    1505.0   23.224     15.500      15.352
  2048   16384   8.66e-05    9.62e-05    1611.1   97.123     64.527      64.778
Table 1: Results for orthogonalized Gaussian matrices on an Intel Core 2 Quad Q6600 (times in seconds).


  m      n       tolerance   relative    #iter    1 thread   2 threads   4 threads   8 threads
  64     512     1.24e-03    1.39e-03    1162.0   0.042      0.025       0.017       0.252
  128    1024    4.81e-04    5.22e-04    1155.5   0.115      0.068       0.046       0.350
  256    2048    2.99e-04    3.26e-04    1321.5   0.423      0.227       0.132       0.375
  512    4096    1.93e-04    2.15e-04    1465.8   2.362      1.495       1.130       1.359
  1024   8192    1.29e-04    1.43e-04    1505.6   9.933      6.035       4.574       4.885
  2048   16384   8.66e-05    9.62e-05    1604.9   41.842     25.284      19.576      19.997
Table 2: Results for orthogonalized Gaussian matrices on an Intel Core i7 920 (times in seconds).

Table 1 presents results for an Intel Core 2 Quad and table 2 results for an Intel Core i7. The Core 2 Quad shows an interesting performance scaling until n = 4096, where the situation begins to change. When this value is reached, there is no more benefit in using 4 threads compared to 2. This is due to the Core 2 Quad core interconnection using the FSB and to its cache hierarchy, which has huge repercussions on the available bandwidth, as pointed out by []. Note that the scaling ratio from 1 thread to 2 is only slightly affected, as the 2 threads are executed on 2 cores of the same cluster.

Concerning the same experiment with the Core i7, let us recall that HT enables handling up to 8 threads on a 4-core processor. For our particular experiment, HT implies a loss in performance from 4 to 8 threads due to excessive cache pressure. The phenomenon beginning at n = 4096 with the Core 2 Quad does not happen with the Core i7, thanks to its new core interconnection and its optimized cache hierarchy. These results are consistent with our previous results of section 4.2.2. Above n = 4096 the scaling is not as good as before, but it remains constant and still yields a performance increase.
Compressive sensing with partial DCT. The second experiment (table 3 for an Intel Core 2 Quad and table 4 for an Intel Core i7) uses partial DCTs instead of orthogonalized Gaussian matrices. Since the DCT and inverse DCT are used instead of matrix/vector multiplications, memory can be saved and faster computations can be achieved. This means that n can take higher values in this experiment compared to the previous one.

One can see that the performance scaling of multithreading on the Core 2 Quad is poor when n is small. The benefit of using more cores only shows up when enough computations are done on each core. For n = 2048, the best wall clock time is obtained when using 2 threads. Inside a cluster of 2 cores, the shared L2 cache allows scaling for this problem size. However, the core interconnection and the cache hierarchy penalize the scaling over more than 2 cores. In our case, the benefit of using 4 cores appears when n reaches 4096. Larger n means a better scaling ratio.

For the Core i7, there is no penalty in using 2 threads instead of 1, nor 4 instead of 2, whatever the problem size. The bigger the problem, the better the scaling. Concerning HT, as for orthogonalized Gaussian matrices, it implies a loss in performance when using 8 threads instead of 4.


  m       n        tolerance   relative    #iter    1 thread   2 threads   4 threads
  64      512      1.02e-03    1.11e-03    1029.9   0.034      0.044       0.050
  128     1024     4.74e-04    5.08e-04    1075.1   0.072      0.081       0.089
  256     2048     2.81e-04    3.09e-04    1219.6   0.180      0.169       0.185
  512     4096     1.92e-04    2.15e-04    1415.6   0.501      0.472       0.405
  1024    8192     1.27e-04    1.40e-04    1456.6   1.110      0.884       0.818
  2048    16384    8.68e-05    9.61e-05    1584.7   2.544      1.840       1.797
  4096    32768    6.20e-05    6.91e-05    1780.5   6.366      4.647       4.424
  8192    65536    4.33e-05    4.82e-05    2016.0   16.571     11.543      10.797
  16384   131072   3.01e-05    3.35e-05    2256.5   46.512     32.627      27.568
  32768   262144   2.15e-05    2.40e-05    2672.1   199.254    117.574     94.599
  65536   524288   1.52e-05    1.70e-05    3261.0   508.388    298.467     204.657
  131072  1048576  1.11e-05    1.25e-05    4067.5   1321.440   825.617     567.315
Table 3: Results for partial DCTs on an Intel Core 2 Quad Q6600 (times in seconds).


  m       n        tolerance   relative    #iter    1 thread   2 threads   4 threads   8 threads
  64      512      1.02e-03    1.11e-03    1031.3   0.025      0.023       0.020       0.055
  128     1024     4.74e-04    5.08e-04    1076.8   0.055      0.038       0.034       0.068
  256     2048     2.81e-04    3.09e-04    1223.2   0.120      0.082       0.070       0.126
  512     4096     1.92e-04    2.15e-04    1419.6   0.288      0.197       0.157       0.220
  1024    8192     1.25e-04    1.38e-04    1438.5   0.598      0.413       0.316       0.391
  2048    16384    8.76e-05    9.70e-05    1597.2   1.452      0.954       0.718       0.911
  4096    32768    6.21e-05    6.91e-05    1783.2   3.494      2.112       1.641       1.887
  8192    65536    4.33e-05    4.81e-05    2004.8   8.293      5.280       3.780       4.137
  16384   131072   3.00e-05    3.35e-05    2253.2   20.026     11.961      8.755       11.017
  32768   262144   2.13e-05    2.38e-05    2625.0   87.031     49.508      32.411      34.852
  65536   524288   1.52e-05    1.70e-05    3262.0   225.341    129.467     82.432      91.006
  131072  1048576  1.12e-05    1.26e-05    4074.6   628.962    337.811     222.477     239.236
Table 4: Results for partial DCTs on an Intel Core i7 920 (times in seconds).

The next two experiments (table 5, and tables 6 and 7) use specific implementations for the PS3 Cell and the NVIDIA GPU platforms, respectively. Only partial DCT results are provided for these platforms, as they have strong memory limitations. Note that, as explained in the previous section and contrary to our CPU implementation, these specific implementations are not yet optimal and only reflect the kind of performance that can be achieved with very little development effort.


  m       n        tolerance   relative    #iter    time (s)
  64      512      1.02e-03    1.12e-03    1033.3   0.071
  128     1024     4.74e-04    5.08e-04    1077.7   0.153
  256     2048     2.81e-04    3.09e-04    1231.1   0.359
  512     4096     1.92e-04    2.15e-04    1420.6   0.913
  1024    8192     1.27e-04    1.39e-04    1458.8   1.935
  2048    16384    8.67e-05    9.61e-05    1581.1   4.340
  4096    32768    6.21e-05    6.91e-05    1772.4   10.445
  8192    65536    4.33e-05    4.82e-05    2006.8   29.523
  16384   131072   3.00e-05    3.34e-05    2232.5   60.178
  32768   262144   2.14e-05    2.38e-05    2663.0   149.679
  65536   524288   1.51e-05    1.68e-05    3236.5   367.822
  131072  1048576  1.09e-05    1.23e-05    4066.7   934.656
Table 5: Results for partial DCTs on a PS3 Cell.


  m       n        tolerance   relative    #iter    time (s)
  64      512      1.02e-03    1.12e-03    1014.6   0.309
  128     1024     4.72e-04    5.03e-04    1034.1   0.344
  256     2048     2.83e-04    3.12e-04    1213.1   0.461
  512     4096     1.92e-04    2.14e-04    1411.3   0.623
  1024    8192     1.26e-04    1.39e-04    1437.8   0.978
  2048    16384    8.71e-05    9.64e-05    1573.0   1.985
  4096    32768    6.25e-05    6.97e-05    1760.5   4.532
  8192    65536    4.37e-05    4.88e-05    2017.2   11.036
  16384   131072   3.09e-05    3.46e-05    2259.5   28.337
Table 6: Results for partial DCTs on an NVIDIA GPU 8800 GTS.


  m       n        tolerance   relative    #iter    time (s)
  64      512      1.02e-03    1.12e-03    1015.6   0.177
  128     1024     4.72e-04    5.03e-04    1033.8   0.222
  256     2048     2.83e-04    3.12e-04    1215.6   0.298
  512     4096     1.92e-04    2.14e-04    1408.5   0.404
  1024    8192     1.26e-04    1.39e-04    1428.1   0.500
  2048    16384    8.71e-05    9.65e-05    1563.1   0.957
  4096    32768    6.25e-05    6.97e-05    1757.2   2.011
  8192    65536    4.37e-05    4.88e-05    2001.5   4.586
  16384   131072   3.09e-05    3.46e-05    2251.2   11.274
Table 7: Results for partial DCTs on an NVIDIA GPU GTX 275.

Figure 6 gives a comparison between the three implementations on the five hardware platforms. GPUs are currently limited in the problem sizes they can work on, but our basic GPU implementation is as fast as our optimized multi-core CPU one. More precisely, the NVIDIA 8800 GTS is head to head with the Core 2 Quad whereas the NVIDIA GTX 275 is head to head with the Core i7, the quad-core CPUs always being slightly faster than the GPUs. Our Cell implementation running on the PS3 is slower than our multi-core CPU implementation, but this is due to the fact that a lot of computations are done on the PPE, which is very slow (even if vectorized Altivec code reduces this overhead a little), although they could be done entirely on the SPEs to reach maximal performance. The reader should also note that the Cell is now quite old and is tested here in a 6-SPE configuration. The current GPU and Cell implementations give a lower bound on the performance that can be achieved on these platforms and an idea of what a standard implementation delivers. In particular, a better Cell implementation that fully takes advantage of the characteristics of this processor (i.e., that uses the SPEs not only for the FFT but also for the shrinkage process) is left as future work.


Figure 6: Results for partial DCTs on various platforms.

5  Conclusion

In this paper we presented an algorithm, based on standard optimization techniques, for solving compressive sensing problems involving l1 minimization. This algorithm has been especially designed to take advantage of current parallel many-core architectures and achieves noticeable speedups. Besides, it is easy to implement on these architectures. To validate our approach, we proposed implementations on various current high-end platforms, namely vectorized multi-core CPUs, GPUs and the Cell. Pros and cons of these platforms and implementations have been discussed. In particular, we have seen that multi-core CPUs can offer performance comparable to GPUs when their parallel features are used. The results are promising and let us hope for very fast implementations on new architectures such as the next-generation many-core x86 architecture of the Intel Larrabee [].

A  Proof of Convergence

We briefly present the standard elements of the proof of convergence of the proximal iterations. Recall that the approach is quite standard and that the proof can be adapted from [, ], for instance.

First, let us note that the series {Eµ(p(u(k)))} is clearly non-increasing and bounded from below, and thus converges toward some value referred to as η.

Then, let us recall a standard convex optimality result. Assume that g: ℝ^N → ℝ is convex and differentiable and that h: ℝ^N → ℝ is convex; then ū is a global minimizer of (g+h) if and only if the following holds:

∀ u ∈ ℝ^N    ⟨ ∇g(ū), u − ū ⟩ + h(u) − h(ū) ≥ 0 .     (6)

For our case, let us take g(·) = (1/2) || · − u(k) ||_M^2 and h(·) = Eµ(·), and recall that pµ(u(k)) is the global minimizer of the functional inside the infimum in (3) evaluated at u(k):

∀ u ∈ ℝ^N    ⟨ p(u(k)) − u(k) , u − p(u(k)) ⟩_M + Eµ(u) − Eµ( p(u(k)) ) ≥ 0 .     (7)

Now, we consider this inequality for two points û and ū with associated proximal points p(û) and p(ū), respectively. Some simple calculations lead to:


⟨ p(û) − û , p(ū) − p(û) ⟩_M + ⟨ p(ū) − ū , p(û) − p(ū) ⟩_M ≥ 0 ,

and thus we get:


⟨ ū − û , p(ū) − p(û) ⟩_M ≥ || p(û) − p(ū) ||_M^2 .

The latter is equivalent to:

|| û − ū ||_M^2 − || p(û) − û − p(ū) + ū ||_M^2 ≥ || p(û) − p(ū) ||_M^2 .

Now, let us set ū to a global minimizer of Eµ, i.e., ū = u_µ, and û = u(k), in the previous inequality. Note that p(u_µ) = u_µ, and recall that u(k+1) = p(u(k)). The following inequality holds:

|| u(k) − u_µ ||_M^2 − || u(k+1) − u(k) ||_M^2 ≥ || u(k+1) − u_µ ||_M^2 .     (8)

With this inequality we can conclude using the following points.

The series || u(k) − u_µ ||_M^2 is non-increasing and bounded from below (by 0), and thus converges. We deduce that lim_{k→∞} ( || u(k+1) − u_µ ||_M^2 − || u(k) − u_µ ||_M^2 ) = 0. Using this result and (8), we get that lim_{k→∞} || u(k+1) − u(k) ||_M^2 = 0.

Note that by the convexity of the energy Eµ, we have:

Eµ(u_µ) ≥ Eµ(u(k+1)) + ⟨ ∂Eµ(u(k+1)) , u_µ − u(k+1) ⟩ .     (9)

Since u(k+1) is the global minimizer of the functional defining Fµ(u(k)), we have that:


⟨ ∂Eµ(u(k+1)) , u_µ − u(k+1) ⟩ + ⟨ u(k+1) − u(k) , u_µ − u(k+1) ⟩_M ≥ 0 .     (10)

Recall that, as shown above, lim_{k→+∞} || u(k+1) − u(k) ||_M = 0, and since || u_µ − u(k+1) ||_M is bounded, we have that liminf_{k→+∞} ⟨ ∂Eµ(u(k+1)), u_µ − u(k+1) ⟩ ≥ 0. Injecting this information into (9), we obtain that Eµ(u_µ) ≥ η. Since u_µ is a global minimizer of Eµ, we also have Eµ(u(k)) ≥ Eµ(u_µ) for all k, and thus we conclude that lim_{k→+∞} Eµ(u(k)) = Eµ(u_µ) = η.

Acknowledgments

The research of A. Borghi on this work has been done while being at the mathematics department of UCLA and being supported by the ONR grant N000140710810. The research of J. Darbon and S. Osher was supported by ONR grant N000140710810. The research of T. Chan was supported by DMS-0610079 and ONR N00014-06-1-0345.


1
More information can be found on the website: http://www.nvidia.com/object/cuda_home.html
