Tuning DiFX2 for performance
Adam Deller, ASTRON
6th DiFX workshop, CSIRO ATNF, Sydney AUS
Outline
I/O bottlenecks and solutions: communication with the real world (reading raw data, writing visibilities)
Interprocess communication
Keeping out of memory trouble
Minimizing CPU load in various corners of parameter space
For more information and pictures: http://cira.ivec.org/dokuwiki/doku.php/difx/mpifxcorr/
Getting data into DiFX
[Figure: DiFX data-flow diagram. Labels: Master Node; DataStream 1 … N; Core 1 … M; source data; large, segmented ring buffer; processing buffer; visibility buffer. Flows: timerange and destination; baseband data; visibilities.]
Getting data into DiFX
How to test? Use neutered_difx, with a small number of channels
Fundamental limit: native transfer speed (disk read, network pipe). If this is the problem, buy a RAID or get InfiniBand or …
Potential troublemaker: CPU utilisation on the datastream node (competition), which can come from Tsys estimation
Tweaking: datastream databuffer
Datastream databuffer
Key parameters: dataBufferFactor, nDataSegments, subintNS
The only real potential problem I/O-wise: buffer too short (dataBufferFactor)
Getting visibilities out of DiFX
[Figure: the same data-flow diagram as above, with an added "To disk" output]
Getting visibilities out of DiFX
FxManager writes the visibilities to disk
This is very rarely a problem unless you have a dying disk or very large and/or frequent visibility dumps
Testing: neutered_difx + fake data source (ensures good input speeds)
Tweaking: none. If you want to write out visibilities faster, put a fast disk (probably RAID) on the manager node!
Interprocess @ the Datastream
Interprocess @ the Datastream
Generally not a problem
Tweaking: dataBufferFactor; ensure a reasonable size (avoids latency issues)
The default (32) is generally OK, but could usually be bigger without problems (increase nDataSegments also)
Interprocess @ the Core
Tweaking:
• subintNS
• Output visibility size (nChan / nBaselines)
Interprocess @ the Core
In terms of reducing data transmission, increasing subintNS is the only real knob to turn
This is unimportant for continuum, single phase centre correlation; it is only relevant at very high spectral resolution and/or with multiple phase centres
In those cases, bigger is better, but be careful about memory (later)
Interprocess @ the FxManager
The most common trouble point! FxManager must aggregate data from all Core nodes, which can lead to high data rates
Interprocess @ the FxManager
To calculate the rate into FxManager, work out the rate for one Core node and scale by the number of Core nodes
Tweaking: maximise subintNS! Or (although this is usually not possible) reduce visibility size (via nChan or the number of phase centres)
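As a rough illustration of that calculation, the sketch below estimates the data rate into FxManager. It assumes each visibility is a single-precision complex value (8 bytes) and that every Core node returns one result per subint it processes. The function and argument names are hypothetical and are not DiFX identifiers, and the per-subint wall-clock time is something you would have to measure on your own cluster.

```python
def vis_bytes_per_subint(n_baselines, n_chan, n_pol_products=1,
                         n_phase_centres=1, bytes_per_vis=8):
    """Size of the result one Core sends back for one subint (assumed layout)."""
    return n_baselines * n_chan * n_pol_products * n_phase_centres * bytes_per_vis


def fxmanager_rate(n_cores, subint_wall_time_s, **vis_kwargs):
    """Bytes per second arriving at FxManager.

    subint_wall_time_s is the measured wall-clock time one Core node
    takes to process one subint (an assumed, user-supplied input).
    """
    per_core = vis_bytes_per_subint(**vis_kwargs) / subint_wall_time_s
    return n_cores * per_core


# Example: 45 baselines, 1024 channels, 4 polarisation products,
# 10 phase centres, 20 Core nodes each finishing a subint every 0.5 s.
rate = fxmanager_rate(20, 0.5, n_baselines=45, n_chan=1024,
                      n_pol_products=4, n_phase_centres=10)
print(f"{rate / 1e6:.0f} MB/s into FxManager")
```

In this model, doubling subintNS roughly doubles the time a Core spends on each subint while leaving the result size unchanged, so the rate into FxManager roughly halves.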
Memory @ the Datastream
Just don’t make the combination of dataBufferFactor and subintNS too big (can also control via “sendSize”)
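To see how the two parameters combine, here is a minimal sketch that assumes the datastream buffer holds roughly dataBufferFactor subints of baseband data, split evenly into nDataSegments segments. That relationship is an assumption consistent with the statement above, not something taken from the DiFX source, and the function name and the example data rate and segment count are hypothetical.

```python
def datastream_buffer(data_buffer_factor, subint_ns, data_rate_mbps,
                      n_data_segments):
    """Return (buffer_seconds, buffer_bytes, segment_bytes) for one datastream.

    Assumed model: buffer length = dataBufferFactor * subintNS.
    data_rate_mbps is the recorded data rate in megabits per second.
    """
    buffer_seconds = data_buffer_factor * subint_ns * 1e-9
    bytes_per_second = data_rate_mbps * 1e6 / 8
    buffer_bytes = buffer_seconds * bytes_per_second
    return buffer_seconds, buffer_bytes, buffer_bytes / n_data_segments


# Example: dataBufferFactor = 32, 20 ms subints, 2048 Mbps, 8 segments.
secs, total, segment = datastream_buffer(32, 20_000_000, 2048, 8)
print(f"{secs:.2f} s of data, {total / 1e6:.0f} MB, "
      f"{segment / 1e6:.1f} MB per segment")
```

Under this model the memory grows with the product of dataBufferFactor and subintNS, which is why raising both at once is the case to watch.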
Memory @ the Core
Usually the biggest problem, memory-wise
This never used to be a problem, but multi-field centre jobs are hit hard
A bigger subint means more memory (storing datastream baseband)
More threads means more memory, at the pre-average spectral resolution
Buffering more FFTs costs more (multiplied by the number of threads, too!)
Memory @ the Core
Tweaking: subintNS, nThreads (threads file), numBufferedFFTs
And be aware of: nFFTChans (for multiple phase centres / high spectral resolution), number of phase centres
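The scaling behaviour described for the Core can be sketched with a deliberately crude model. The three terms below (baseband storage, per-thread accumulation buffers, buffered FFT results) follow the list of memory drivers above, but the exact formulae are assumptions for illustration only: they ignore polarisation products, multiple subbands and any extra buffering inside mpifxcorr, and none of the names are DiFX identifiers.

```python
def core_memory_estimate(n_datastreams, subint_ns, data_rate_mbps,
                         n_threads, n_baselines, n_fft_chans,
                         n_phase_centres, num_buffered_ffts,
                         bytes_per_complex=8):
    """Very rough per-Core memory terms, in bytes (assumed model)."""
    # One subint of baseband data from every datastream.
    baseband = n_datastreams * subint_ns * 1e-9 * data_rate_mbps * 1e6 / 8
    # Each thread accumulates at the pre-average resolution (nFFTChans),
    # once per baseline and per phase centre.
    thread_vis = (n_threads * n_baselines * n_fft_chans
                  * n_phase_centres * bytes_per_complex)
    # Each thread also holds numBufferedFFTs FFT results per station.
    fft_buffers = (n_threads * num_buffered_ffts * n_datastreams
                   * n_fft_chans * bytes_per_complex)
    return {"baseband": baseband, "thread_vis": thread_vis,
            "fft_buffers": fft_buffers}


# Example: 10 stations, 100 ms subints at 1024 Mbps, 7 threads,
# 45 baselines, 8192 FFT channels, 50 phase centres, 10 buffered FFTs.
est = core_memory_estimate(10, 100_000_000, 1024, 7, 45, 8192, 50, 10)
for name, size in est.items():
    print(f"{name}: {size / 1e6:.0f} MB")
```

Even in this simplified form, the per-thread accumulation term dominates once many phase centres are combined with high spectral resolution, which matches the observation that multi-field centre jobs are hit hardest.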
Memory @ the FxManager
Tweaking: visBufferLength, which multiplies the size of a single visibility (nChan, nBaselines, nPhaseCentres)
Generally not a problem
Note: visBufferLength should not be too short, especially if you have many (esp. heterogeneous) Core nodes, as the subints can arrive out of order
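A minimal sketch of the multiplication described above, assuming 8 bytes per complex visibility and ignoring polarisation products and any per-visibility headers. The function name and the example numbers are hypothetical.

```python
def fxmanager_buffer_bytes(vis_buffer_length, n_chan, n_baselines,
                           n_phase_centres, bytes_per_vis=8):
    """Approximate FxManager visibility buffer size in bytes (assumed model)."""
    single_visibility = n_chan * n_baselines * n_phase_centres * bytes_per_vis
    return vis_buffer_length * single_visibility


# Example: 80 buffer slots, 1024 channels, 45 baselines, 10 phase centres.
print(fxmanager_buffer_bytes(80, 1024, 45, 10) / 1e6, "MB")
```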
CPU @ the Datastream
Loading of Datastream is usually pretty light
But Datastream often runs on old hardware (e.g. Mk5 units) with limited CPU capacity
A couple of options can cause problematically high loads:
Tsys extraction (.v2d: tcalFreq = xx)
Interlaced VDIF formats (used with multi-thread VDIF data, e.g. phased EVLA)
More efficient implementations are coming; for now, buy a faster CPU if needed!
CPU @ the Core
Many considerations here, including parameters usually fixed by the science: number of phase centres, spectral resolution (nChan/nFFTChan)
Plus several on array management: strideLength, numBufferedFFTs, xmacLength
And then a few others as well: nThreads, fringe rotation order
CPU @ the Core
Number of phase centres: for each phase centre, phase rotation and separate accumulation from thread to main buffer
That costs CPU (proportional to number of baselines and number of phase centres), but also ensures that results don’t fit in cache (more later)
CPU @ the Core
Spectral resolution: more channels means a bigger FFT, and that costs CPU
The cost doesn’t typically follow a log N law like it should: bigger gets worse fast beyond ~1024 due to cache performance
Really big (>=8192 channels/subband) gets very expensive
Worst thing: it typically comes in combination with multiple phase centres! (required to avoid bandwidth smearing)
CPU @ the Core
Array management #1: strideLength (auto setting usually best)
[Figure: phase axis from -180° to 180°, spanning one FFT of data]
Compute sin/cos for the first “strideLength” samples, and for every “strideLength”’th sample after that
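One way to read the stride trick is as an application of the angle-addition rule: for a phase that is linear across the FFT, the phasor at sample n = i*stride + j is the product of a "fine" phasor (depends only on j) and a "coarse" phasor (depends only on i). The sketch below illustrates that idea in plain Python; it is an interpretation of the slide, not the mpifxcorr implementation, and the function name is hypothetical.

```python
import cmath


def fringe_phasors_strided(phase0, dphase, n_samples, stride):
    """Return exp(i*(phase0 + dphase*n)) for n = 0 .. n_samples-1.

    Only stride + n_samples/stride exponentials are evaluated; the rest
    are obtained by complex multiplication. n_samples must be a
    multiple of stride.
    """
    if n_samples % stride:
        raise ValueError("n_samples must be a multiple of stride")
    # Phasors for the first `stride` samples.
    fine = [cmath.exp(1j * (phase0 + dphase * j)) for j in range(stride)]
    # Phase advance at every stride-th sample.
    coarse = [cmath.exp(1j * dphase * stride * i)
              for i in range(n_samples // stride)]
    # Sample i*stride + j is coarse[i] * fine[j].
    return [c * f for c in coarse for f in fine]


# Compare with the direct evaluation for a 64-sample FFT and stride 8.
direct = [cmath.exp(1j * (0.3 + 0.01 * n)) for n in range(64)]
fast = fringe_phasors_strided(0.3, 0.01, 64, 8)
print(max(abs(a - b) for a, b in zip(direct, fast)))
```

With this decomposition the number of sin/cos evaluations is stride + n_samples/stride, which is smallest when the stride is close to the square root of the FFT length.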
CPU @ the Core
Array management #2: numBufferedFFTs (auto=10 usually ok)
Mitigates the cache miss problem by a factor of 10
[Figure: Mode 1, Mode 2, Mode 3, … Mode N and a visibility buffer that is too big for cache, although one slot fits in cache]
Precompute numBufferedFFTs FFT results, one station at a time
CPU @ the Core
Array management #3: xmacLength (auto setting of 128 usually fine; further subdivides XMAC step)
CPU @ the Core
nThreads: usually, set nThreads = n(CPU cores) - 1
Occasionally, can be advantageous to use fewer threads (avoiding swap memory / cache contention)
CPU @ the Core
Fringe rotation order: default is 1, and this is almost always fine
2nd order is only ever needed for very high fringe rates with very long FFTs (space VLBI of masers?)
BUT: 0th order could often be used, and almost never is; it can be about 25% faster
[Figure: fringe rotation phase versus time across the 1st and 2nd FFTs]
Here, the fringe rate is too high for 0th order
But at low fringe rate, 0th order approximation can be acceptable
.v2d: fringeRotOrder = [0, 1, 2]
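To judge whether 0th order might be acceptable for a given setup, one can estimate how far the fringe phase drifts across a single FFT. The sketch below assumes real-sampled data at the Nyquist rate, so an FFT producing n_fft_chans channels spans n_fft_chans / bandwidth seconds. The 5 degree threshold is an arbitrary illustrative choice, not a DiFX recommendation, and the function names are hypothetical.

```python
def phase_drift_per_fft_deg(fringe_rate_hz, n_fft_chans, bandwidth_hz):
    """Fringe phase change across one FFT, in degrees (assumed model)."""
    fft_duration_s = n_fft_chans / bandwidth_hz
    return 360.0 * fringe_rate_hz * fft_duration_s


def zeroth_order_ok(fringe_rate_hz, n_fft_chans, bandwidth_hz,
                    max_drift_deg=5.0):
    """True if the drift within one FFT stays under the chosen threshold."""
    drift = phase_drift_per_fft_deg(fringe_rate_hz, n_fft_chans, bandwidth_hz)
    return drift <= max_drift_deg


# Low fringe rate, short FFT: 100 Hz, 128 channels, 16 MHz subband.
print(zeroth_order_ok(100.0, 128, 16e6))
# Very high fringe rate, long FFT: 10 kHz, 8192 channels, 16 MHz subband.
print(zeroth_order_ok(10e3, 8192, 16e6))
```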
CPU @ the FxManager
CPU load at the FxManager is typically light - it only does low-cadence accumulation and scaling of visibilities
Very short subintNS can potentially lead to problems (although network issues are more likely)
Questions?