Alexandre Borghi (1)
Jérôme Darbon (2)
Sylvain Peyronnet (1)
Tony F. Chan (2)
Stanley Osher (2)
(1) Laboratoire de Recherche en
Informatique (LRI), Université Paris Sud, France
(2) Department of Mathematics, UCLA,
Los Angeles, California, USA
Original Version September 2008; Revised August 5, 2009
The
orginal report is
UCLA Cam report 0864
Abstract: In this paper we consider the l^{1}compressive sensing problem. We propose an algorithm specifically designed to take advantage of shared memory, vectorized, parallel and manycore microprocessors such as the Cell processor, new generation Graphics Processing Units (GPUs) and standard vectorized multicore processors (e.g. quadcore CPUs). Besides its implementation is easy. We also give evidence of the efficiency of our approach and compare the algorithm on the three platforms, thus exhibiting pros and cons for each of them.
Keywords: compressive sensing parallel computing proximal point manycore architectures multicore CPU GPU Cell
An algorithm that relies on standard optimization techniques is proposed for solving compressive sensing problems involving l^{1} minimization. Its main characteristic is that it is designed for running efficiently on parallel manycore architectures and thus takes huge benefit from these technologies.
Let us briefly describe the compressive sensing problem. Given a sensing matrix A ∈ ^{m× n} and some observed data f ∈ ^{m} , one wishes to find an optimal solution u^{*} ∈ ^{n} of the following constraint minimization problem:
 (1) 
where the number of constraints m is typically much lower than the size of the signal n. This problem has received a lot of attention since it has application for instance in signal/image processing, data compression and more recently as compressive sensing [, , , , ]. The latter corresponds to identify matrices that allow for exact recovery of the original signal which is assumed to be sparse, from the observations. Such matrices are for instance proposed in [, , , , ]. This approach has led to several applications such as machine learning [, ], dimension reduction [], face recognition [], image processing [] and medical imaging []. There is a very large literature for algorithms to solve this problem [, , , , , , , , , ] for instance. We refer the reader to these for more detailed reviews of available algorithms. We note that many of them rely on iterative based thresholding/shrinkage approach that has been originally presented in [] and [] and with further developments and improvements such as [, , , , , , ]. This paper also falls into this category. However, contrary to these previous approaches, we focus on the practical considerations of such an approach (as opposed to pure theoretical ones) and our goal is to design an efficient algorithm that takes benefit from available computing technologies.
More precisely, we consider a standard MoreauYosida regularization as originally proposed in the seminal work of [] and []. The latter yields an iterative algorithm that is also known as the proximal point iterations [, , , ] that requires the inversion of a nonlinear operator. We show that the latter can be efficiently performed by a shrinkage approach by defining an appropriate MoreauYosida regularization. The convergence of the method is assured by the theoretical properties of the MoreauYosida regularization [, ]. This approach allows for an efficient implementation on parallel, shared memory and vectorized architectures. The latter are now standardly available and lead to tremendous speedups compared to unary nonvectorized processors. Eventually, we note the work of [] that also performs proximal point iterations for solving the compressive sensing problem but on a dual formulation.
Note that in general, efficient implementations of a numerical algorithm on these parallel shared memory platforms are difficult to develop due to technical specificities of these architectures. However, our algorithm has been designed keeping in mind these technical issues so that it allows a short, easy and efficient implementation on these platforms. Experiments on three different parallel manycore architectures, namely vectorized multicore processors, Graphic Processor Unit processors (GPUs) and the Cell Broadband Engine Architecture microprocessor (Cell), are presented and demonstrate the effectiveness of the approach.
The structure of the paper is as follows: section 2 describes main concepts of parallel computing and essential features of manycore architectures that must be taken into account while designing an algorithm intended to yield an easy and fast implementation to run on them. Our algorithm is described in section 3. Experiments along with comparisons with several different architectures are presented in section 4. Finally we draw some conclusions in section 5.
We recall that our goal is to design an algorithm that benefits from current parallel architectures. In this section, we present general concepts about parallel computing along with their implementations in current parallel architectures we considered, before presenting the algorithm designed for them in section 3.
In order to design an efficient algorithm on current parallel
architectures, we need to understand the main concepts they rely on.
We now describe these concepts.
Coarse parallelism. Each platform features several
computing units, thus tasks can be spread over them in order to
improve performance. However, developers must take care of locks that
are likely to arise in this context. A lock is used to prevent at least
two computing units to update the same space of memory. It is well
known that the use of locks to force unique access to data severely
deteriorate performances. A good way to avoid the use of locks, and
thus to enhance the performance, is to schedule operations so that
they never interfere with each others. This general issue is known as
the synchronisation problem and is one of the most important problem
regarding performance.
Vectorization.
Another level of parallelism, finer than the previous one, consists of
processing the data as a vector. This technique receives the name of
SIMD. Some architectures implement this technique and rely on the
socalled vector units. Performances are greatly
improved through the use of these vector units, provided the fact that
data are well aligned (data must be accessed at specific adresses in
memory). Note that not all algorithms can take benefit from this
technique because they are not intrinsically built on vectorial
primitives. Besides, even though the latter holds, the alignment
property still needs to be fullfilled. In other words, it means that
this requirement must be taken into account during the design phase of
the algorithm. Our algorithm is taking advantage of this technique.
Memory considerations. Memory limitations are the main constraints of considered architectures, namely bandwidth, latency and cache memory issues. Recall that these platforms have two kinds of memory : cache memory, which is typically structured in different levels, and standard RAM. Cache memory has for goal to reduce the time needed to access data in RAM. Cache memory is an order of magnitude faster than RAM but also a lot smaller in terms of capacity storage. To perform computations on large inputs, data parts must be transfered back and forth between cache memory and RAM. This implies memory communications that can be a bottleneck of the method due to limited available bandwidth and latency. To efficiently achieve this memory management, implementations have to take into account the structure of the cache, i.e., the cache hierarchy. Indeed, an algorithm and its implementation should help a processor to maintain this structure and facilitate the prediction of data moves. Since several cores might require simultaneous memory accesses, performances can be limited by the available bandwidth. To cope with this problem, it is needed to satisfactorily schedule data transfers. If these memory issues are not taken into account then computing units may wait for the data to be processed. This is called computation starvation since a processor is not computing. To avoid this behavior, the frequency and the size of data transfers must be tuned to have no computing units without data.
We now explain how current parallel architectures implement concepts just described above. We consider three different architectures: standard vectorized multicore CPU, modern GPU and the Cell Broadband Architecture (CBEA or Cell for short). These three platforms share common features. First, they are all based on a shared memory, i.e., a central memory accessible from all computing units. Second, they are parallel, i.e., composed of several computing units working together. However, these architectures do not implement parallelism in the same way. This leads to very different performances in both terms of computational power and memory transfer, which are two key aspects for global performance.
We now describe more precisely each of these architectures in view
of the above considerations.
Multicore CPU.
A standard multicore CPU is a processor made of several cores.
Each core is superscalar, outoforder and composed of vector
computing units (such as Altivec [] or Intel Streaming
SIMD Extensions). From a parallel computing point of view, main
differences between current commercial multicore CPUs are
core interconnections, cache hierarchy and how they access RAM.
The Intel Core 2 has cores clustered by 2. Each core has its own very fast level 1 cache (L1) and each cluster features a shared L2 cache. Clusters are linked together through a FrontSide Bus (FSB) with motherboard northbridge, which integrates the memory controller used to access RAM. The Core 2 Quad features 2 clusters of 2 cores. Because the 2 clusters are not connected through an internal crossbar nor a shared cache, cache coherency is maintained thanks to the motherboard northbridge. This implies slower communications between cores of separate clusters than cores of a unique cluster and more generally nefast repercussions on performance in a parallel context.
On the contrary, the Intel Core i7 has a unique cluster for all of its cores and a L3 cache shared over them in addition to their own L2. Thanks to this shared L3 cache, better cache use can be achieved when several threads work on the same data. The FSB is replaced by a QuickPath Interconnect (QPI), which is a pointtopoint interconnection, and the memory controller is now ondie. That allows both reduced latency and higher bandwidth. More, the Core i7 has 2 important features from a parallel computing point of view : Turbo Boost (TB) and HyperThreading (HT). The first one consists of a dynamic increase of frequency for active cores when some cores are inactive. When enabled this feature must be taken into accoung as it acts as a bias in parallel results. The second one allows the OS to see 2 logical cores for each physical core, which can yield better uses of computational units : as 2 threads have independant instructions the processor pipeline can be filled more efficently and the outoforder engine can achieve better performance. However, handling 2 threads per core can increase pressure on cache (i.e., more cache misses per core) and then on RAM.
Core 2 and Core i7 are shown in figure 1 in their 4 cores configuration. Current highend quadcore processors achieve a theoretical peak performance of 96 GFLOPS in single precision at 3GHz.
Graphics Processing Unit (GPU). A modern GPU (such as current NVIDIA GPUs) consists of a large set of the socalled stream processors that share a common memory []. These homogeneous processors have been originally designed for 3D graphics computations and thus had poor scientific computation units in terms of precision.
They now have the capabilities (in terms of precision) to achieve good performances for scientific computing. The latter is called GeneralPurpose GPU (GPGPU), see []. Note that a GPU is hosted by a standard CPUbased computer that embeds some RAM.
NVIDIA GPU architectures are depicted in figure 2. Contrary to CPU cores, GPU cores are scalar. They still use cache memory using several layers but in a very different way than CPUs. Cores are grouped into clusters with a shared L1 cache. Clusters are subdivided in groups of 8 cores which shares a local memory. These 8 cores execute the same instruction on multiple data. Therefore, vectorization is achieved using several scalar cores. The number of cores per cluster can variate according to models : clusters of 16 cores for G80 architecture or clusters of 24 cores for G200, which is an optimized version of G80.
Thanks to a dedicated bus, GPU cores have access to a very fast embedded memory of hundreds of MB. It allows an efficient use of the highly parallel and bandwidthdemanding units of GPUs. RAM can also be accessed through PCI Express. However, transfers between GPU embedded memory and RAM are very time consuming due to the PCI Express limited bandwidth. Therefore, they must be avoided as most as possible.
Thanks to their radically different architectures, GPUs have a theoretical peak performance an order of magnitude higher than current highend CPUs. For instance a NVIDIA 8800 GTS (G80 architecture) with 96 stream processors at about 1.2GHz achieves a computational power of 345 GFLOPS in single precision and a NVIDIA GTX 275 (G200 architecture) with 240 stream processors at about 1.4GHz achieves 1010 GFLOPS, compared to the 96 GFLOPS of Intel quadcore processors at 3GHz.
Since GPUs were originally used for graphics purpose, they could only be
accessed through dedicated graphic libraries. In order to implement
numerical algorithms, NVIDIA has developped CUDA (Compute Unified
Device Architecture), which is composed of a compiler and a set of
tools organized as a library to help programmers. CUDA is currently
the stateoftheart for GeneralPurpose GPU. The last step for
generalized GeneralPurpose GPU adoption is the release of GPU
clusters specifically designed for scientific computing.
Cell Broadband Engine Architecture (CBEA). The Cell Broadband Engine Architecture is a new architecture developed by Sony, Toshiba and IBM []. CBEA is designed for high computational throughput and bandwidth along with power efficiency. The general architecture is depicted in figure 3.
The Cell processor consists of an inorder 64bit PowerPC core, also called Power Processing Element (PPE), that controls several vector computing cores called Synergistic Processing Element (SPE). Current implementations feature up to 8 SPEs. A SPE is an inorder core made of a 128bit SIMD unit, i.e., a vector unit, and 256KB of local memory. These elements are linked together through a high bandwidth ring bus, denoted as EIB for Element Interconnect Bus.
Since only 256KB of local memory are available per SPE, it is unavoidable to prevent data parts transfer back and forth through the EIB. However, thanks to the high performance of this bus, some usually memorybound operations, such as matrix/vector multiplication, do not suffer from this limitation on Cell architectures and therefore can be implemented efficiently in parallel []. Because no scalar unit is present in SPE cores, it is very important to fully use vectorization in order to achieve best performance.
The well known Sony Playstation 3 (PS3) features a 3.2GHz Cell with
7 SPEs and 256MB of RAM. However, only 6 SPEs and 192MB of RAM are
available for scientific calculus. A single SPE at 3.2GHz has a
peak performance of 25.6 GFLOPS in single precision. A PS3 Cell can
therefore achieve a peak performance of about 150 GFLOPS thanks to its
SPEs, i.e., about 150% of the theoretical peak performance of a
highend 3GHz quadcore from Intel. However, the performance increases
at the cost of the ease of programming, as depicted by
[, ].
Many efforts are currently underway to make it easier to exploit the Cell
computing power, see [] for instance.
After these technical considerations about theoretical performance of these architectures, we propose in the next section a simple algorithm that takes benefit from all these features.
In this section, we describe our algorithm that allows an efficient implementation on any vectorized parallel manycore architectures. Our approach consists of using a proximal operator on a penalized version of the original problem (1) such that the minimization scheme is reduced to a series of matrix/vector multiplications followed by separable minimizations (i.e., independent pointwise optimizations). This overall process is embedded into a dyadic continuation scheme that speedsup the approach.
Instead of minimizing the original problem (1) we wish to minimize for u∈ ^{n} the following energy:
E_{µ}(u) = u_{1} + 
  Au −f_{2}^{2} , (2) 
where µ is a nonnegative parameter whose role is to enforce the constraint Au=f . Note that the latter energy corresponds to a penalization method as described by []. It is worth to note that a solution of (2) is a solution of problem (1) provided µ=∞ . However, for practical applications one only needs to compute a solution for a finite µ since the measures are corrupted by some noise, or because only a preciseenough solution is wished. Let us denote by u^{⋆} a minimizer of E_{µ}.
Let us first introduce some notations. Given some nonnegative N, the usual Euclidean product in ^{N} is denoted by ⟨ ·, · ⟩ and its associated norm by  · _{2}. Assume B is a symmetric positive definite linear operator. We set ⟨ ·, ·⟩_{B} = ⟨ B ·, ·⟩ and we note for any x∈ , x_{B}^{2}=⟨ x, x ⟩_{B}^{2}. Let us introduce the MoreauYosida regularization F_{µ} of E_{µ} associated with the metric M as the following [, ]: at any point u ^{(k)}∈ ^{n},
F_{µ}(u ^{(k)}) = 
 ⎧ ⎪ ⎨ ⎪ ⎩  E_{µ}(u) + 
  u − u ^{(k)}_{M}^{2}  ⎫ ⎪ ⎬ ⎪ ⎭  . (3) 
In other words, it corresponds to infconvolve the function to minimize with the metric M. Note that F_{µ} is strictly convex, since it is the sum of a convex and a strictly convex function, and thus there is a unique point minimizing F_{µ}. Besides, it is not difficult to show that the infimum in (3) is reached and thus it can be replaced by a minimum. Following [], this unique minimizer is also called the proximal point of u with respect to E_{µ} and M and is denoted by p_{µ}(u ^{(k)}) .
Given a function E: ^{n} → , let us denote by ∂ E the subdifferential of E defined by w ∈ ∂ E(u) ⇔ E(v) ≥ E(u) + ⟨ w, v−u⟩ for every v (see [, ] for instance). This proximal point p_{µ}(u) is characterized by the following EulerLagrange equation:
0 ∈ ∂  ⎛ ⎜ ⎜ ⎝  E_{µ}+ 
  · − u ^{(k)}_{M}^{2}  ⎞ ⎟ ⎟ ⎠  (p_{µ}(u ^{(k)})) · 
The latter is equivalent to:
0 ∈ ∂ E_{µ}(p_{µ}(u^{(k)})) + M(p_{µ}(u^{(k)}) − u^{(k)}) . 
And thus we have:
Mu^{(k)} ∈  ⎛ ⎝  M + ∂ E_{µ}  ⎞ ⎠  (p_{µ}(u^{(k)})) . 
Since p_{µ}(u^{(k)}) is uniquely defined, we get that:
p_{µ}  ⎛ ⎝  u^{(k)}  ⎞ ⎠  =  ⎛ ⎝  M + ∂ E_{µ}  ⎞ ⎠  ^{−1}  ⎛ ⎝  Mu^{(k)}  ⎞ ⎠  . (4) 
The idea of the MoreauYosida regularization approach to minimize E_{µ} is to iterate the update formula (4) until convergence. This is also known as the proximal point iterations.
The above approach yields the following generic algorithm for minimizing
E_{µ} (for µ fixed).
Generic proximal point minimization algorithm
u ^{(k+1)} = p_{µ}  ⎛ ⎝  u^{(k)}  ⎞ ⎠  =  ⎛ ⎝  M + ∂ E_{µ}  ⎞ ⎠  ^{−1}(Mu ^{(k)}) 
The proof of convergence of this algorithm is gieven in Appendix A.
So far only the generic version of the approach has been described. Indeed, as we have seen, the MoreauYosida regularization with any metric M yields an iterative algorithm that converges toward a solution of our problem. The versatility of the approach gives us some freedom on the choice of the metric M. One may wish to consider metrics that yield good speed of convergence, i.e., few iterations [, ]. However, recall that our goal is to design an approach that is easy to implement on parallel multicore architectures. Thus, this goal should lead us to the design of M.
The most tedious part in the optimization problem (4) is that we need to invert (M + ∂ E_{µ}) (for Mu^{(k)}). The optimality condition of the solution u^{(k+1)} writes as follows:
s(u^{(k+1)}) + µ A^{t} A (u^{(k+1)} − f) + M(u^{(k+1)} − u^{(k)})= 0 , 
where s(u^{(k+1)}) is a subgradient of  · _{1} at u^{(k+1)}. In a more concise form, from an optimization point of view, we have:
s(u^{(k+1)}) + (µ A^{t} A + M)u^{(k+1)} = µ A^{t} f + M u^{(k)} . 
This operation becomes algorithmically easy when it is separable, i.e., the optimization can be independently carried out dimension by dimension. For our problem, separability means that (µ A^{t} A + M) is a diagonal matrix. Besides, this separability property also leads to an implementation that enjoys the requirement properties of section 2. Indeed, since variables are decoupled, coarse parallelism is straightforward. The variables are easily arranged in an wellaligned array that allows vectorized computations. This approach is also wise for cache considerations since the prediction for the next data to process is straightforward.
In order to get a separable optimization problem, M should kill the offdiagonal elements of µ A^{t} A. Besides, to ensure the global convergence of the approach, M should be positive semidefinite. Recall that good compressive sensing matrices corresponds to submatrices of matrices that satisfies the Restricted Isometry Property (all eigenvalues belong to [1−ν, 1+ν] for ν>0 and small). In this paper, we assume that ν=0. Examples of such matrices are obtained by considering Discrete Fourier or Discrete Cosine Transform, or orthogonalized Gaussian matrices. Thus, we have that the eigenvalues of A^{t} A are 0 or 1. And since A^{t} A is always a nonnegative definite matrix we can define M as follows:
M = (1+є) µ Id − µ A^{t}A , 
where є is a small positive real number to make M positive definite. Using this M we need to solve the following problem: Find u^{(k+1)} that satisfies:
s(u^{(k+1)}) + ( 1+є ) µ u^{(k+1)} = µ A^{t} f + M u^{(k)} . 
The latter is well known to be solved by using a shrinkage approach. It has been used and described in many papers such as in [, , , , , , , , ] for instance. We have that:
u^{(k+1)} = 

 (5) 
where sign(x) = x/x if x ≠ 0 and 0 otherwise. As one can see, the update of the solution only involves matrix/vector multiplications and some standard scalar operations that can be implemented using vectorized instructions. Also note that one wishes to set є as small as possible to have faster convergence. Setting є to 0 empirically leads to convergence although the proof presented here does not hold for this case.
We now fully describe our algorithm. So far, we have considered the optimization of E_{µ} when µ is set to some arbitrary positive value. Recall that one wishes to set µ to a large value in order to enforce the constraint Au=f. However, when µ is large then the procedure generates a series of signals u^{(k+1)} that converges slowly to the solution. One standard way to speed the process is to use a continuation approach that consists of solving (approximately or not) for a series of increasing µ. This kind of approach has been successfully used for instance in [, , , ].
Recall that we choose the constant null signal as an initial guess. It turns out that if µ is too small E_{µ} does not decrease. We thus wish a µ large enough so that E_{µ} decreases. Besides, recall that for performance considerations, we would like the smallest one that generates a sequence. This can be achieved thanks a dichotomiclike approach. We assume that we have a lower bound on µ such that µ > µ_{min} > 0 . The following procedure computes the smallest µ, up to a precision e_{bitonic}, such that the update gives an energy decrease:
Bitonic Search for initial µ
Input: f, A, µ_{min}
This procedure first looks for µ_{max}, an upper bound on µ, during the iterations of 2−(a) to 2−(c). Then a standard bitonic search is performed through iterations 3−(a) to 3−(c) .
Once a good initial µ is known, we embed the proximal iterations into a continuation process, i.e., µ will successively takes the values µ, 2µ, …, 2^{lmax} µ, where l_{max} is a strictly positive integer. Note that other coefficients than 2 could be used but experiments have shown that this value gives good results. We also need to give the stopping criteria of our approach. These criteria are used for any optimization of E_{µ} when µ is fixed. The first criteria concerns the closeness of the reconstructed solution to the observed values. The process is stopped when the reconstructed solution u^{(k+1)} is below some prescribed tolerance e_{tol} measured as follows: Au^{(k+1)}−f_{2}/f_{2} . Since the proximal iterations converge toward the solution but not necessarily in finite time, we also stop the process when the energy decrease between two consecutive solutions is below some prescribed value denoted by e_{consec}, i.e., we stop as soon as (E_{µ}(u^{(k)}) − E_{µ}(u^{(k+1)})) < e_{consec} . When one of this stopping criteria is met, the value of µ is updated and we start a new descent minimization for E_{µ} using the current solution as initial guess.
Minimization Algorithm
Input: f, A and µ_{min}
In this section, we shortly present general implementation choices we have made and more specific ones that depend on the architectures we have considered. Then we give experimental results that assess the efficiency of our method running on these parallel manycore architectures.
The most time consuming operations in our method are those that involve the sampling matrix A. These operations can be performed in two different ways depending on the nature of A. Either A is described explicitly, i.e., all the coefficients of the matrix are stored in memory, and one needs to perform a standard matrix/vector multiplication; or A can be represented implicitly with the help of a transform. A typical example for the latter case is when A is a submatrix of the Discrete Fourier Fransform (FFT), or Discrete Cosine Transform (DCT). Both representations enable operations to done in parallel, i.e., they are very good candidates for both coarse and finegrained parallelism implementations.
Allowing matrices to be stored explicitly allows for flexibility in the design of the matrix. Unfortunately, storing such a matrix is memory consuming and since a matrix/vector multiplication needs O(n × m) operations where n × m is the size of the matrix, it is also time consuming. On the contrary, the use of the implicit form is memorywise. Indeed, it is then no longer needed to store A (and its transpose A^{T}). Besides, it generally allows for faster computations because of fast available transforms. According to the literature (see [] for instance) the time complexity for computing an FFT is O(n logn). However, the price to pay is that it drastically reduces the set of possible sampling matrices.
In this paper, we consider these two kinds of matrices. For the explicit case, we consider orthogonalized Gaussian matrices where their elements are generated using i.i.d normal distributions N(0,1) and where their rows are orthogonalized. The second class of matrices we have used corresponds the partial DCTs. They are generated by randomly choosing, with uniform sampling, m rows from the full Discrete Cosine matrix. Please note that operations that involve orthogonalized Gaussian matrices or partial DCT comply with the requirements we present in section 2.
Due to the large amount of memory needed, only the CPU platform can run a code that uses large orthogonalized Gaussian matrices. On the contrary, partial DCTs can be used on any of the architectures we are interested in. To allow a fair comparison, only the latter was implemented on all three platforms. This partial DCT have been implemented through a complextocomplex Fast Fourier Transform (FFT) with an additionnal O(n) pre and postprocessing for converting the results to real numbers. Indeed, direct DCT implementations are currently subject to performance issues and cannot be used for our purpose. Using this complextocomplex FFT means an overhead of 4 times the memory required by directly applying DCT but allows to keep good time performances.
All implementations use single precision floating point arithmetic because of GPU and Cell limitations. It is important to note that these limitations are currently partly addressed by developers of these platforms.
We now present architecturedependent implementation details.
CPU implementation details.
The code running on the CPU is parallelized using the OpenMP API (Open
MultiProcessing Application Programming Interface). We refer the
reader to [] for further details. Our implementation is
vectorized using SSE instructions and is also tuned to ensure the most
efficient use of the cache. Recall that the thresholding
approach (see section 3) has been chosen for its
separability property (i.e., each element can be processed
independently), therefore parallel and vector computing are profitable
to obtain an efficient implementation. FFT computations are performed
thanks to the FFTW 3.1 (Fastest Fourier Transform in the West). We
refer the reader to [] for further details. Last, the code is
compiled using the Intel C compiler version 10.1.
Cell implementation details.
Our Cell implementation uses the Cell SDK 3.0 (Software Development
Kit) provided by IBM and the FFTW 3.2 alpha 3 library. To the best of
our knowledge, FFTW is the current stateoftheart FFT implementation
in term of input size (for instance FFTC [] is the fastest
Fourier transform, but only for up to 16K complex input samples) and
still have good time performances. This alpha version of the FFTW
library requires exclusive access to the SPEs it uses (i.e., SPEs
cannot run FFT and others processes together). Therefore, since FFT
computations are the most timeconsuming tasks of our method, we chose
to assign all SPEs to FFT computations. FFT/DCT conversion,
thresholding and energy computations are done on the PPE (which is not
intended for performance). In order to balance the cost of using the
PPE for these tasks, vectorized Altivec [] code has been
written to improve performances of the FFT/DCT conversions, which are
the second most costly computations of our method. Nevertheless, our
Cell implementation can still be optimized as a lot of computations
could be done on SPEs instead of on PPE (SPEs are at least one order
of magnitude faster than the PPE). However, this can only be done using
a dynamic load balancing process, which will be time consuming and
difficult to implement.
GPU implementation details.
The GPU implementation is based on CUDA^{1} (Compute Unified
Device Architecture) 2.0 and the CUFFT library (part of the CUDA
package). Since the PCI Express bus suffers severe bandwidth
limitations, we designed our implementation such that it uses very few
data communications between central RAM and GPU embedded memory. Note that
this is a very common problem in GeneralPurpose computing on GPUs.
However, when absolutely needed and as explained in
[], transfer sizes must be carefully chosen. Modern
GPUs feature highly hierarchical set of computational units, memories
and caches. The number of threads must be maximized in order to
achieve best performance, but attention must be paid for avoiding bank
conflicts, i.e., multiple threads accessing the same shared memory bank
at the same time. For example, []
implements a memory manager to meet this goal.
In this section, we first describe the experimental conditions we have considered for our experiments. Then, we present numerical results that show the effectiveness of our approach.
In order to prove the effectiveness of our approach and to compare the benefits and disadvantages of the different architectures, we use an experimental platform that consists of the following:
We now give the values of the parameters we have used for our experiments. We set m=n/8 (recall that m is assumed to be much smaller than n). The number of nonzero values in the original sparse signal is set to k=m/10. Recall that there are two stopping criteria: the tolerance error e_{tol} with associated stopping criteria Au^{(k+1)}−f_{2}/f_{2} < e_{tol}, and the consecutive variation tolerance e_{consec} associated with (E(u^{(k)})−E(u^{(k+1)})) < e_{consec}. We set e_{tol} = 10^{−5} and e_{consec} = 10^{−3}. Concerning the bitonic search, we set µ_{min} = 2 and e_{bitonic} = 0.5 . The number of continuation steps, l_{max}, performed after the bitonic step is set to l_{max}=10.
We now present preliminary experiments about parallel speedup in order to show particular issues arising with matrix/vector multiplication, which is a key operation in our method. Squared and rectangular single precision matrices are both used to run our benchmarks on the Intel Core i7 processor (with Turbo Boost disabled). The results are the average of 10 instances based on orthogonalized Gaussian matrices and randomly generated vectors. Note that results are not about absolute performance but parallel performance, therefore better results means better scaling but not necessarily better wall clock time.
A super linear speedup is achieved for matrice sizes up to 512 × 512 and 256 × 1024, i.e., 512KB matrices. The shared L3 cache is very efficiently used and each core can take full advantage of its own very fast 256KB L2 cache.
A cutoff effect appears for 2048 × 2048 and 1024 × 4096 matrices, i.e., when the matrices use 8MB of memory, which is the size of the L3 cache. The whole data used by the program do not hold anymore into the L3 memory, which implies that L3 reuse is compromised and then a huge performance penalty.
The use of HyperThreading can yield positive or negative effects, as explained in section 2. HT allows to more efficiently use computing units inside a core. It can yield super linear parallel speedup, which can be seen principally before reaching 8MB matrices. However, handling 2 threads per core can increase pressure on cache. This phenomenon can be seen after reaching 8MB matrices. The graphs show a balance between these two effects.
We now present experimental results of our method. Figure 5 represents the variation of errors with respect to time. We chose a representative example to illustrate the different steps of the optimization process. We consider two error criteria : e_{tol} (defined above) and the relative error of the reconstructed signal u^{(k+1)} to the ground truth defined as relative=u^{(k+1)} − u^{*}_{2}/u^{*}_{2}.
We can see at the very beginning of the process the bitonic search. When an appropriate initial µ is found, the continuation process is launched. Each step of the continuation is shown at the curve level as a huge decrease of both error criteria. We can also observe that the ground truth error decreases almost linearly until 400 iterations, from then the decrease is far slower. On the contrary, the relative error does not show this first fast decrease process. This graph depicts the fact that out method could be stopped earlier using tighter parameters. This is a very common characteristic for this kind of optimization methods, such as those previously presented in section 1. The downside is that it makes comparison between methods very difficult because of the multiple biased involved. In fact, each method has its own set of parameters that do not necessarily have equivalents. In particular, our method has the advantage to not rely on a dt parameter representing the optimization process step, unlike methods based on linearization.
Our other results are presented as tables where the final values of
e_{tol} (tolerance) and the relative error of the reconstructed
signal to the ground truth (relative) are given. The tables also present
the number of iterations (#iter) that have been performed to obtain
the final solution and the wall clock time in seconds taken by the whole
process.
Compressive sensing with orthogonalized Gaussian matrices.
The first experiment (figure 1) uses orthogonalized
Gaussian matrices and standard Intel quadcore processors. The tables
give the time spent for the computation with respect to the number of
threads used by the program.
in out time (s) m n tolerance relative #iter 1 thread 2 threads 4 threads 64 512 1.24e03 1.40e03 1163.2 0.076 0.044 0.029 128 1024 4.81e04 5.22e04 1155.2 0.238 0.128 0.072 256 2048 2.99e04 3.26e04 1326.4 1.144 0.562 0.293 512 4096 1.93e04 2.15e04 1461.7 5.659 3.584 3.386 1024 8192 1.29e04 1.43e04 1505.0 23.224 15.500 15.352 2048 16384 8.66e05 9.62e05 1611.1 97.123 64.527 64.778
in out time (s) m n tolerance relative #iter 1 thread 2 threads 4 threads 8 threads 64 512 1.24e03 1.39e03 1162.0 0.042 0.025 0.017 0.252 128 1024 4.81e04 5.22e04 1155.5 0.115 0.068 0.046 0.350 256 2048 2.99e04 3.26e04 1321.5 0.423 0.227 0.132 0.375 512 4096 1.93e04 2.15e04 1465.8 2.362 1.495 1.130 1.359 1024 8192 1.29e04 1.43e04 1505.6 9.933 6.035 4.574 4.885 2048 16384 8.66e05 9.62e05 1604.9 41.842 25.284 19.576 19.997
Table 1 presents results for an Intel Core 2 Quad and table 2 results for an Intel Core i7. The Core 2 Quad shows an interesting performance scaling until n = 4096, where the situation begins to change. When this value is reached, there is no more benefit to use 4 threads compared to 2. This is due to Core 2 Quad core interconnection using FSB and its cache hierarchy, which has huge repercutions on available bandwidth, as pointed by []. Note that the scaling ratio from 1 thread to 2 is only slightly affected as the 2 threads are executed on 2 cores of the same cluster.
Concerning the same experiment with the Core i7, let us recall that HT
enables to handle up to 8 threads for a 4core processor. For our
particular experiment, the HT implies a loss in performance from 4 to 8
threads due to a too high cache pressure. The phenomenon beginning at
n = 4096 with the Core 2 Quad does not happen with the Core i7 thanks
to its new core interconnection and its optimized cache hierarchy. These
results are coherent with our previous results of section
4.2.2. Above n = 4096 the scaling is not as good as
before but remains constant and still yield a performance increase.
Compressing sensing with partial DCT.
The second experiment (table 3 for an Intel Core 2 Quad
and table 4 for an Intel Core i7) uses partial DCTs
instead of orthogonalized Gaussian matrices. Since DCT and Inverse DCT are
used instead of matrix/vector multiplications, memory can be saved and faster
computations can be achieved. This means that n can take higher
value in this experiment compared to the previous one.
One can see that performance scaling of multithreading for the Core 2 Quad is poor when n is small. The benefit of using more cores only comes up when enough computations are done on each core. For n = 2048, the best wall clock time happens when using 2 threads. Inside a cluster of 2 cores, the shared L2 cache allows to scale for this problem size. However, the core interconnection and the cache hierarchy penalize the scaling over 2 cores. In our case, the benefit of using 4 cores happens when n reaches 4096. Larger n means better scaling ratio.
For the Core i7, there is no penalty using 2 threads instead of 1, nor 4 instead of 2 whatever the problem size. The bigger the problem is, the better the scaling is. Concerning HT, like for orthogonalized Gaussian matrices, it implies a loss in performance when using 8 threads instead of 4.
in out time (s) m n tolerance relative #iter 1 thread 2 threads 4 threads 64 512 1.02e03 1.11e03 1029.9 0.034 0.044 0.050 128 1024 4.74e04 5.08e04 1075.1 0.072 0.081 0.089 256 2048 2.81e04 3.09e04 1219.6 0.180 0.169 0.185 512 4096 1.92e04 2.15e04 1415.6 0.501 0.472 0.405 1024 8192 1.27e04 1.40e04 1456.6 1.110 0.884 0.818 2048 16384 8.68e05 9.61e05 1584.7 2.544 1.840 1.797 4096 32768 6.20e05 6.91e05 1780.5 6.366 4.647 4.424 8192 65536 4.33e05 4.82e05 2016.0 16.571 11.543 10.797 16384 131072 3.01e05 3.35e05 2256.5 46.512 32.627 27.568 32768 262144 2.15e05 2.40e05 2672.1 199.254 117.574 94.599 65536 524288 1.52e05 1.70e05 3261.0 508.388 298.467 204.657 131072 1048576 1.11e05 1.25e05 4067.5 1321.440 825.617 567.315
in out time (s) m n tolerance relative #iter 1 thread 2 threads 4 threads 8 threads 64 512 1.02e03 1.11e03 1031.3 0.025 0.023 0.020 0.055 128 1024 4.74e04 5.08e04 1076.8 0.055 0.038 0.034 0.068 256 2048 2.81e04 3.09e04 1223.2 0.120 0.082 0.070 0.126 512 4096 1.92e04 2.15e04 1419.6 0.288 0.197 0.157 0.220 1024 8192 1.25e04 1.38e04 1438.5 0.598 0.413 0.316 0.391 2048 16384 8.76e05 9.70e05 1597.2 1.452 0.954 0.718 0.911 4096 32768 6.21e05 6.91e05 1783.2 3.494 2.112 1.641 1.887 8192 65536 4.33e05 4.81e05 2004.8 8.293 5.280 3.780 4.137 16384 131072 3.00e05 3.35e05 2253.2 20.026 11.961 8.755 11.017 32768 262144 2.13e05 2.38e05 2625.0 87.031 49.508 32.411 34.852 65536 524288 1.52e05 1.70e05 3262.0 225.341 129.467 82.432 91.006 131072 1048576 1.12e05 1.26e05 4074.6 628.962 337.811 222.477 239.236
The next two experiments (figures 5 and 6/7) are specific implementations respectively for PS3 Cell and NVIDIA GPU platforms. Only partial DCT results are provided for these platforms as they have strong memory limitations. Note that, as explained in previous section and contrary to our CPU implementation, these specific implementations are yet not optimal and only reflect what kind of performances can be achieved with very little development effort.
m n tolerance relative #iter time (s) 64 512 1.02e03 1.12e03 1033.3 0.071 128 1024 4.74e04 5.08e04 1077.7 0.153 256 2048 2.81e04 3.09e04 1231.1 0.359 512 4096 1.92e04 2.15e04 1420.6 0.913 1024 8192 1.27e04 1.39e04 1458.8 1.935 2048 16384 8.67e05 9.61e05 1581.1 4.340 4096 32768 6.21e05 6.91e05 1772.4 10.445 8192 65536 4.33e05 4.82e05 2006.8 29.523 16384 131072 3.00e05 3.34e05 2232.5 60.178 32768 262144 2.14e05 2.38e05 2663.0 149.679 65536 524288 1.51e05 1.68e05 3236.5 367.822 131072 1048576 1.09e05 1.23e05 4066.7 934.656
m n tolerance relative #iter time (s) 64 512 1.02e03 1.12e03 1014.6 0.309 128 1024 4.72e04 5.03e04 1034.1 0.344 256 2048 2.83e04 3.12e04 1213.1 0.461 512 4096 1.92e04 2.14e04 1411.3 0.623 1024 8192 1.26e04 1.39e04 1437.8 0.978 2048 16384 8.71e05 9.64e05 1573.0 1.985 4096 32768 6.25e05 6.97e05 1760.5 4.532 8192 65536 4.37e05 4.88e05 2017.2 11.036 16384 131072 3.09e05 3.46e05 2259.5 28.337
m n tolerance relative #iter time (s) 64 512 1.02e03 1.12e03 1015.6 0.177 128 1024 4.72e04 5.03e04 1033.8 0.222 256 2048 2.83e04 3.12e04 1215.6 0.298 512 4096 1.92e04 2.14e04 1408.5 0.404 1024 8192 1.26e04 1.39e04 1428.1 0.500 2048 16384 8.71e05 9.65e05 1563.1 0.957 4096 32768 6.25e05 6.97e05 1757.2 2.011 8192 65536 4.37e05 4.88e05 2001.5 4.586 16384 131072 3.09e05 3.46e05 2251.2 11.274
Figure 6 gives a comparison between the three
implementations on the five hardwares. GPUs are currently limited by
the dimension size they can work on but our basic GPU implementation
is as fast as our optimized multicore CPU one. More precisely, the
NVIDIA 8800 GTS is head to head with the Core 2 Quad whereas the
NVIDIA GTX 275 is head to head with the Core i7, the quadcore
CPUs being always slightly faster that GPUs. Our Cell implementation
running on the PS3 is slower than our multicore CPU implementation
but this is due to the fact that a lot of computations are done on the
PPE, which is very slow (even if vectorized Altivec code reduces a little
this overhead), although they could be fully done on SPEs to reach
maximal performance. The reader should also note that the Cell is now
quite old and here tested in a 6 SPEs configuration. Current GPU and Cell
implementations give a lower bound on performances that can be
achieved on these platforms and an idea about standard implementation
performances. In particular, a better Cell implementation that fully
takes advantage of the characteristics of this processor (i.e., use also
the SPE not only for the FFT but also for the shrinkage process) is left
as future work.
In this paper we presented a standard algorithm for solving compressive sensing problems involving l^{1} minimization. This algorithm has been especially designed to take benefit of current parallel manycore architectures and achieves noticeable speedups. Besides, it is easy to implement on these architectures. To validate our approach, we proposed implementations on various current highend platforms, such as vectorized multicore CPU, GPU and Cell. Pros and cons of both platforms and implementations have been discussed. In particular, we have seen that multicore CPUs can offer comparable performances with GPUs when their parallel features are used. The results are promising and allow to hope very fast implementations on new architectures such as the next generation manycore x86 architecture of the Intel Larrabee [].
We briefly show the standard elements of the proof of the convergence of the proximal iterations. Recall that the approach is quite standard and some proof can be adapted for [, ] for instance.
First, let us note that the series {E_{µ}(p(u^{(k)}))} is clearly nondecreasing and bounbded by below, and thus converges toward some value referred to as η.
Then, let us recall a standard convex optimality result. Assume that g:^{N}→ is convex and differentiable and h:^{N} → is convex, then u ^{⋆} is a global minimizer of (g+h) if and only if the following holds:
∀ u ∈ ^{N} ⟨ ∇ g(u^{⋆}), u − u^{⋆}⟩ + h(u) − h(u^{⋆}) ≥ 0 . (6) 
For our case, let us have g(·) = 1/2 · − u ^{(k)}_{M}^{2} and h(·) = E_{µ}(·) and recall that p_{µ}(u ^{(k)}) is the global minimum of the infconvolution when it is fed with u ^{(k)}:
∀ u ∈ ^{N}  ⟨ ⟨  p  ⎛ ⎝  u ^{(k)}  ⎞ ⎠  − u ^{(k)} , u − p  ⎛ ⎝  u ^{(k)}  ⎞ ⎠  ⟩ ⟩ 
 + E_{µ}(u) − E_{µ}  ⎛ ⎝  p  ⎛ ⎝  (u ^{(k)}  ⎞ ⎠  ⎞ ⎠  ≥ 0 . (7) 
Now, we consider this inequality for the two points û and ū with associated proximal points p(û) and p(ū), respectively. Some simple calculus lead to:
⟨ ⟨  p(û) − û, p(ū) − p(û)  ⟩ ⟩  _{M} +  ⟨ ⟨  p(ū) − ū, p(û) − p(ū)  ⟩ ⟩  _{M} ≥ 0 , 
and thus we get:
⟨ ⟨  ū − û, p(ū) − p(û)  ⟩ ⟩  _{M} ≥  p(û) − p(ū)_{M}^{2} . 
The latter is equivalent to:
⎪⎪ ⎪⎪  û − ū  ⎪⎪ ⎪⎪  _{M}^{2} −  p(û) − û − p(ū) + ū _{M}^{2} ≥  p(û) − p(ū) _{M}^{2} . 
Now, let us set ū to a global minimizer of R_{µ}, i.e., ū = u^{⋆}, and û = u^{(k)}, in the previous inequality. Note that p(u^{⋆}) = u^{⋆}, and recall that u^{(k+1)} = p(u^{(k)}). The following inequality holds:
⎪⎪ ⎪⎪  u ^{(k)} − u^{⋆}  ⎪⎪ ⎪⎪  _{M}^{2} −  ⎪⎪ ⎪⎪  u^{(k+1)} − u^{(k)}  ⎪⎪ ⎪⎪  _{M}^{2} ≥  ⎪⎪ ⎪⎪  u^{(k+1)} − u^{⋆}  ⎪⎪ ⎪⎪  _{M}^{2} (8) 
With this inequality we can conclude using the following points.
The series  u^{(k)} − u^{⋆}_{M}^{2} is nondecreasing and bounded by below (by E_{µ}(u ^{⋆}))) and thus does converge. Thus, we deduce that lim_{k → ∞} (  u^{(k+1)} − u^{⋆}_{M}^{2} −  u^{(k)} − u^{⋆}_{M}^{2} ) = 0 . Using this result and (8) we get that lim_{k→ ∞}  u^{(k+1)} − u^{(k)}_{M}^{2} = 0 .
Note that by the convexity of the energy E_{µ}, we have:
E_{µ}  ⎛ ⎝  u ^{⋆}  ⎞ ⎠  ≥ E  ⎛ ⎝  u ^{(k+1)}  ⎞ ⎠  +  ⟨ ⟨  ∂ E_{µ}, u^{⋆} − u^{(k+1)}  ⟩ ⟩  (9) 
Since u^{(k+1)} is the global minimizer of F(u^{(k)}), we have that:
⟨ ⟨  ∂ E_{µ}, u^{⋆} − u^{(k+1)}  ⟩ ⟩  +  ⟨ ⟨  u^{(k+1)} − u^{(k)}, u^{⋆}− u^{(k+1)}  ⟩ ⟩  ≥ 0 . (10) 
Recall that, as it has been shown above, lim_{k→ + ∞}  u^{(k+1)} − u^{(k)}^{2} = 0, and since u^{(k+1)} − u^{⋆}^{2} is bounded we have that liminf_{k→ + ∞} ⟨∂ E_{µ}, u^{⋆} − u^{(k+1)}⟩≥ 0 . Injecting this information into (9), we obtain that lim_{k → + ∞} E_{µ}(u^{(k)}) ≤ η, and thus we conclude that lim_{k→ + ∞} E_{µ}(u^{(k)}) = η .
The research of A. Borghi on this work has been done while being at the mathematics department of UCLA and being supported by the ONR grant N000140710810. The research of J. Darbon and S. Osher was supported by ONR grant N000140710810. The research of T. Chan was supported by DMS0610079 and ONR N000140610345.
This document was translated from L^{A}T_{E}X by H^{E}V^{E}A.