Lessons learned on porting CRBLASTER to the 64-core TILE64 processor

Ken Mighell, NOAO

I discuss the lessons learned while porting a NASA-funded parallel-processing cosmic-ray rejection application, called CRBLASTER, to Tilera's 64-core (8x8 mesh processor architecture) TILE64 processor.

CRBLASTER identifies and removes cosmic rays from images taken with space-based CCD (charge-coupled device) cameras using the embarrassingly-parallel Laplacian-edge-detection L.A.Cosmic algorithm. CRBLASTER is written in C using the industry-standard Message Passing Interface (MPI) library.

Processing a single 800 x 800 pixel Hubble Space Telescope WFPC2 camera image takes 6.00 seconds on an Apple Mac Pro computer with two 2.8-GHz quad-core Intel Xeon processors. Processing the same image on the same machine with CRBLASTER running as 8 processes takes 0.915 seconds, which translates to a computational efficiency of 82.0% for 8 processes.
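The efficiency figure can be checked with a few lines of C. This is a minimal sketch, assuming that computational efficiency means the single-process time divided by the product of the process count and the parallel time; that definition reproduces the 82.0% quoted above. The function name parallel_efficiency is hypothetical and is not part of CRBLASTER.

```c
#include <assert.h>

/* Hypothetical helper (not part of CRBLASTER).
 * Assumed definition: efficiency = t_serial / (n_procs * t_parallel),
 * i.e. the speedup divided by the number of processes.
 *   t_serial   - wall-clock time with one process, in seconds
 *   t_parallel - wall-clock time with n_procs processes, in seconds
 */
double parallel_efficiency(double t_serial, double t_parallel, int n_procs)
{
    assert(t_parallel > 0.0 && n_procs > 0);
    return t_serial / (t_parallel * (double)n_procs);
}
```

With the Mac Pro timings, parallel_efficiency(6.00, 0.915, 8) evaluates to 6.00 / 7.32, or roughly 0.82.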

The initial port of CRBLASTER to the 64-core TILE64 platform took only 8 hours. The biggest challenge was the compilation of NASA's CFITSIO C library (130,000+ lines of C code) for the reading and writing of data files in the FITS (Flexible Image Transport System) data format; once I learned how to get CFITSIO to configure and compile *natively* on the TILExpress-20G card, the compilation of CRBLASTER with Tilera's tile-cc cross compiler was a snap.

CRBLASTER running on a single tile of a 700-MHz TILExpress-20G development card takes 91.4 seconds; CRBLASTER running on 8 tiles with 2 x 4 subimages has a computational efficiency of 87.2%. Using a square number of tiles makes efficient use of the TILE64 processor. For example, CRBLASTER running on 49 tiles with 7 x 7 subimages has a computational efficiency of 66.8%, but with 47 tiles (a prime number) the efficiency drops to 49.9%. The lower efficiency with 47 tiles is due to edge effects of the L.A.Cosmic algorithm.
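Why a prime tile count fares worse can be illustrated with a short C sketch. It assumes that a tile count is split into the most nearly square rows x cols grid, which is an illustrative assumption rather than a description of CRBLASTER's actual partitioning code. The helper names squarest_grid and seam_length are hypothetical, and the total seam length is only a rough proxy for the edge effects of the algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pick the most nearly square grid with
 * rows * cols == n_tiles and rows <= cols. */
void squarest_grid(int n_tiles, int *rows, int *cols)
{
    assert(n_tiles > 0 && rows != NULL && cols != NULL);
    int best = 1;
    for (int i = 1; (long)i * i <= n_tiles; ++i) {
        if (n_tiles % i == 0)
            best = i;            /* largest divisor not above sqrt(n_tiles) */
    }
    *rows = best;
    *cols = n_tiles / best;
}

/* Hypothetical proxy for edge effects: total length, in pixels, of the
 * internal seams between subimages of a width x height image. */
long seam_length(int width, int height, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    return (long)(rows - 1) * width + (long)(cols - 1) * height;
}
```

Under these assumptions, 8 tiles give a 2 x 4 grid and 49 tiles give 7 x 7, while 47 tiles can only form a 1 x 47 grid of thin strips. For an 800 x 800 image the internal seams then total 9,600 pixels for 7 x 7 but 36,800 pixels for 1 x 47.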

The L.A.Cosmic algorithm is a non-linear process, principally because cosmic-ray damage on a CCD image is random: some portion of the image will be more heavily damaged than any other, and CRBLASTER must wait until that subimage is cleaned before it can finish.
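The cost of waiting for the slowest subimage can be expressed as a simple bound. The C sketch below is an idealized model, not CRBLASTER code: it assumes the run ends when the slowest tile finishes and ignores communication and file I/O. The function name load_balance_bound and the per-tile timings passed to it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: upper bound on efficiency from load imbalance alone.
 * Returns (mean per-tile time) / (slowest per-tile time), which equals 1.0
 * only when every tile takes the same time to clean its subimage. */
double load_balance_bound(const double *tile_seconds, size_t n_tiles)
{
    assert(tile_seconds != NULL && n_tiles > 0);
    double sum = 0.0;
    double slowest = 0.0;
    for (size_t i = 0; i < n_tiles; ++i) {
        assert(tile_seconds[i] > 0.0);
        sum += tile_seconds[i];
        if (tile_seconds[i] > slowest)
            slowest = tile_seconds[i];
    }
    return sum / ((double)n_tiles * slowest);
}
```

For example, four tiles with made-up cleaning times of 1, 1, 1, and 2 seconds give a bound of 5 / 8 = 0.625, even though three of the four tiles finished early.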

CRBLASTER has been designed to be used by others as a parallel-processing computational framework that enables the easy development of other parallel-processing image-analysis programs based on embarrassingly-parallel algorithms. CRBLASTER running a linear process yields significantly better computational efficiencies with large numbers of tiles.

Document date April 25, 2010.